2014-04-24

Elastic MapReduce, Hive and Input Files

We're using Hive and Amazon's Elastic MapReduce to process sizable data sets. Today, I was wondering why a simple count query on a table with roughly 240 million rows was taking so long. The table's data lives in a single gzipped file in an S3 bucket, and Hive was using only one mapper. That makes sense in hindsight: gzip is not a splittable format, so a single .gz file can't be divided among several mappers. So I thought, hrm, the job isn't distributed at all, so let's try splitting the input file into a bunch of smaller files to see if Hive can put more mappers to work.
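The split itself isn't shown in this post, so here is a minimal sketch of one way to do it locally before uploading the parts to S3. The function name, the part-NNNNN file naming and the round-robin strategy are all assumptions for illustration, not what was actually run.

```python
import gzip
import os


def split_gzip_by_lines(src_path, out_dir, num_parts, compress=True):
    """Split a gzipped text file into num_parts files, round-robin by line.

    Hypothetical helper: returns the list of output paths. Row order is not
    preserved across parts, which should not matter for a Hive table.
    """
    os.makedirs(out_dir, exist_ok=True)
    opener = gzip.open if compress else open
    suffix = ".gz" if compress else ""
    paths = [
        os.path.join(out_dir, f"part-{i:05d}{suffix}") for i in range(num_parts)
    ]
    # All parts stay open at once; with hundreds of parts, check that the
    # process's open-file limit is high enough.
    outputs = [opener(p, "wb") for p in paths]
    try:
        with gzip.open(src_path, "rb") as src:
            # Stream line by line so the whole file never sits in memory.
            for line_no, line in enumerate(src):
                outputs[line_no % num_parts].write(line)
    finally:
        for out in outputs:
            out.close()
    return paths
```

Calling split_gzip_by_lines("mytable.gz", "parts", 240) would produce 240 gzipped parts, and passing compress=False would produce the uncompressed variant. The file and directory names here are placeholders.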

This is the initial slow job, with a single gzipped file for the table in S3:

-- SINGLE .gz FILE AS HIVE TABLE
hive> select count(*) FROM mytable;

Job 0: Map: 1  Reduce: 1   Cumulative CPU: 254.84 sec   HDFS Read: 207 HDFS Write: 10 SUCCESS
Total MapReduce CPU Time Spent: 4 minutes 14 seconds 840 msec
OK
239370915
Time taken: 274.51 seconds, Fetched: 1 row(s)

This is the same job run against 240 non-gzipped files for the table in S3:

-- MULTIPLE FILES, not gzipped
hive> select count(*) FROM mytable_multiple_files_no_gzip;

Job 0: Map: 48  Reduce: 1   Cumulative CPU: 538.05 sec   HDFS Read: 25536 HDFS Write: 10 SUCCESS
Total MapReduce CPU Time Spent: 8 minutes 58 seconds 50 msec
OK
239370915
Time taken: 55.071 seconds, Fetched: 1 row(s)

Not bad, eh?

Then I tried the same split scheme, except each file was gzipped individually (240 gzipped input files):

-- MULTIPLE FILES, gzip
hive> select count(*) FROM mytable_multiple_files_gzip;

Job 0: Map: 240  Reduce: 1   Cumulative CPU: 1552.43 sec   HDFS Read: 52080 HDFS Write: 10 SUCCESS
Total MapReduce CPU Time Spent: 25 minutes 52 seconds 430 msec
OK
239370915
Time taken: 112.735 seconds, Fetched: 1 row(s)

So with gzipped input files, I got one mapper per file (240 mappers for 240 files); with uncompressed input files, I got one mapper per five files (48 mappers for 240 files).
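To put the three runs side by side, here is a small sketch that recomputes the ratios from the job summaries above. The numbers are copied from the Hive output; the dictionary layout and the summarize helper are just illustrative names.

```python
# Figures copied from the three Hive job summaries in this post.
runs = {
    "single .gz": {"files": 1, "mappers": 1, "wall_s": 274.51, "cpu_s": 254.84},
    "240 plain": {"files": 240, "mappers": 48, "wall_s": 55.071, "cpu_s": 538.05},
    "240 .gz": {"files": 240, "mappers": 240, "wall_s": 112.735, "cpu_s": 1552.43},
}


def summarize(runs, baseline="single .gz"):
    """Return files per mapper, wall-clock speedup and CPU cost vs. the baseline."""
    base = runs[baseline]
    rows = {}
    for name, r in runs.items():
        rows[name] = {
            "files_per_mapper": r["files"] / r["mappers"],
            "speedup": base["wall_s"] / r["wall_s"],
            "cpu_ratio": r["cpu_s"] / base["cpu_s"],
        }
    return rows


if __name__ == "__main__":
    for name, row in summarize(runs).items():
        print(
            f"{name:12s} files/mapper={row['files_per_mapper']:.0f} "
            f"speedup={row['speedup']:.2f}x cpu={row['cpu_ratio']:.2f}x"
        )
```

By these figures, the uncompressed split finishes roughly five times faster than the single file while using about twice the CPU time, and the gzipped split finishes about 2.4 times faster while using about six times the CPU time.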

These numbers were obtained on a cluster with 8 i2.2xlarge data nodes and an m3.xlarge name node.

Typically (at least that's what a cursory Google search suggests), people have the opposite problem--too many small-ish files in S3, and too many mappers. Too many mappers can delay your reducers' work. So I'll do some testing on different splitting schemes for the same data set and post an update.

2014-04-04

McCarthy was self-righteous too

Brendan Eich, inventor of JavaScript, just resigned from his brand-new position as CEO of the Mozilla Corporation, after it was discovered he made a $1,000 donation to the anti-gay-marriage campaign in California known as Prop 8.

That discovery caused uproar among the self-righteous bien-pensants who work for Mozilla, and a number of employees posted tweets about how they thought he should resign.

I'm angry about this because it isn't very different from McCarthyism in reverse. A guy was forced out of a job because his political views don't agree with the majority's.

I feel opposing gay marriage is bigoted, wrong, indefensible and on the wrong side of history. I don't know Eich. For all I know he's a raging asshole with ultra-right-wing views. He might even hate kittens and burp at the dinner table. I don't know.

But what I do know is that getting forced out of a job by a self-righteous San Francisco mob of entitled nerds who have probably never even seen a Republican in the flesh is just as indefensible. It's not what America and California are about. And it shows liberals can be assholes, too, when they put their minds to it.

I'd venture to say a very large number of CEOs are raging right-wing Republicans with questionable ethics. If you don't like your CEO's politics, you're free to work somewhere else. Your job isn't in grave danger if you and your CEO don't see eye-to-eye in terms of politics--there are laws on the books protecting you from discrimination. Why should your CEO's job be in jeopardy for that very same reason?

Eich's contributions to Web tech are immense and he may well be as capable as anyone of running Mozilla, a company he's been with for years. Yet he lost his job because of his politics. And that's not right, whether you agree with him or not.