2014-04-24

Elastic MapReduce, Hive and Input Files

We're using Hive and Amazon's Elastic MapReduce to process sizable data sets. Today, I was wondering why a simple count query on a table with roughly 240 million rows was taking so long. The table is stored as a single gzipped file in an S3 bucket, and Hive was only using a single mapper. Gzip isn't a splittable format, so Hadoop can't carve the file into chunks for parallel processing; the whole job lands on one mapper. So I thought, hrm, let's split the input file into a bunch of smaller files and see whether Hive will put more mappers to work.

This is the initial slow job, with a single gzipped file for the table in S3:

-- SINGLE .gz FILE AS HIVE TABLE
hive> select count(*) FROM mytable;

Job 0: Map: 1  Reduce: 1   Cumulative CPU: 254.84 sec   HDFS Read: 207 HDFS Write: 10 SUCCESS
Total MapReduce CPU Time Spent: 4 minutes 14 seconds 840 msec
OK
239370915
Time taken: 274.51 seconds, Fetched: 1 row(s)
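One way to prepare the split input is to cut the raw table file on line boundaries before uploading. This is a sketch, not the exact commands I used: mytable.tsv is a stand-in for the real table file, and the sample data and chunk count are shrunk so the example is self-contained.

```shell
# mytable.tsv stands in for the real (uncompressed) table file.
seq 1 1000 > mytable.tsv                  # sample data for illustration

# GNU split's -n l/N mode makes N chunks without breaking lines;
# the real job used l/240, we use l/8 to keep the example small.
split -n l/8 -d -a 3 mytable.tsv part-

# Sanity check: no rows lost or duplicated by the split.
cat part-* | wc -l                        # 1000
```

Each part-* file would then be copied under the table's S3 prefix (for instance with aws s3 cp --recursive); Hive treats every file under that prefix as part of the table. For the gzipped variant below, each part is compressed individually (gzip part-*) before upload.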

This is the same job run against 240 non-gzipped files for the table in S3:

-- MULTIPLE FILES, not gzipped
hive> select count(*) FROM mytable_multiple_files_no_gzip;

Job 0: Map: 48  Reduce: 1   Cumulative CPU: 538.05 sec   HDFS Read: 25536 HDFS Write: 10 SUCCESS
Total MapReduce CPU Time Spent: 8 minutes 58 seconds 50 msec
OK
239370915
Time taken: 55.071 seconds, Fetched: 1 row(s)

Not bad, eh? Wall-clock time dropped from 274 seconds to 55.

Then I tried the same split scheme, except each file was gzipped individually (240 gzipped input files):

-- MULTIPLE FILES, gzip
hive> select count(*) FROM mytable_multiple_files_gzip;

Job 0: Map: 240  Reduce: 1   Cumulative CPU: 1552.43 sec   HDFS Read: 52080 HDFS Write: 10 SUCCESS
Total MapReduce CPU Time Spent: 25 minutes 52 seconds 430 msec
OK
239370915
Time taken: 112.735 seconds, Fetched: 1 row(s)

So with gzipped input files, Hive assigned one mapper per file (240 mappers for 240 files), since each gzipped file has to be read whole; with uncompressed input files, it combined them into larger splits, roughly five files per mapper (48 mappers for 240 files). Interestingly, even with five times as many mappers, the gzipped variant was about twice as slow (112 seconds vs. 55), presumably because with only 8 data nodes the 240 mappers run in waves and each pays the decompression cost.

These numbers were obtained on a cluster with 8 i2.2xlarge data nodes and an m3.xlarge name node.

Typically (at least that's what a cursory Google search suggests), people have the opposite problem: too many smallish files in S3, and therefore too many mappers. Too many mappers can delay the start of the reduce phase. So I'll test a few different splitting schemes on the same data set and post an update.
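For the too-many-small-files case, Hive can merge small input files into fewer, larger splits at the session level. A hedged sketch: the property names are from stock Hive and Hadoop (CombineHiveInputFormat is the default input format in recent Hive versions), the 256 MB cap is an arbitrary choice, and EMR's defaults may differ:

-- Combine small input files into larger splits.
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- Cap each combined split at 256 MB (value is in bytes).
set mapreduce.input.fileinputformat.split.maxsize=268435456;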
