Enabling LZO compression for Hive to avoid the "cannot find class com.hadoop.mapred.DeprecatedLzoTextInputFormat" error

I just spent a bunch of time reading documentation and Google Groups postings about how to enable LZO compression in Hive, only to find that none of them offered a working solution. In the end I did find something that worked, so hopefully this can help someone else.

Goal: enable LZO-compressed files to be used for Hive tables.

Environment: Hadoop cluster managed with Cloudera Manager version 5.

What's missing from the official instructions and the Google Groups postings about that error is how to tell Hive where to find the Hadoop LZO jar. Classpath settings alone are not sufficient, and you'll get this error when running a Hive query against an LZO table:

cannot find class com.hadoop.mapred.DeprecatedLzoTextInputFormat

To fix this:
  • Go to your Cloudera Manager UI home page
  • Click Hive
  • Click Configuration > View and Edit
  • Under Service-Wide > Advanced, look for Hive Auxiliary JARs Directory
  • Set the value to /opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib
  • Restart the Hive service (and any related services)
Now you can run queries against LZO-compressed files.

As a reminder, to create a table backed by LZO-compressed files in HDFS, do something like this:

CREATE EXTERNAL TABLE `my_lzo_table`(`something` string)
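
For completeness, the full DDL with the LZO input format spelled out would look something like this (the delimiter, the column, and the LOCATION path are just examples; the OUTPUTFORMAT shown is Hive's standard text output format):

```sql
CREATE EXTERNAL TABLE `my_lzo_table`(`something` string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS
  INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/user/hive/warehouse/my_lzo_table';
```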


Cloudera Manager Fails to Restart

I've been experimenting with Cloudera Manager to manage Hadoop clusters on EC2. So far it seems to be working a little better than Ambari, which managed to install its agent software on all my nodes but always failed to start the required services.
Cloudera Manager did fail as well, but that seemed to be due to my security group settings. I changed the configuration and restarted the service on the host, but the restart always failed with this error:

Caused by: java.io.FileNotFoundException: /usr/share/cmf/python/Lib/site$py.class (Permission denied)
        at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.&lt;init&gt;(FileInputStream.java:146)
        at org.hibernate.ejb.packaging.ExplodedJarVisitor.getClassNamesInTree(ExplodedJarVisitor.java:126)
        at org.hibernate.ejb.packaging.ExplodedJarVisitor.getClassNamesInTree(ExplodedJarVisitor.java:134)
        at org.hibernate.ejb.packaging.ExplodedJarVisitor.getClassNamesInTree(ExplodedJarVisitor.java:134)
        at org.hibernate.ejb.packaging.ExplodedJarVisitor.doProcessElements(ExplodedJarVisitor.java:92)
        at org.hibernate.ejb.packaging.AbstractJarVisitor.getMatchingEntries(AbstractJarVisitor.java:149)
        at org.hibernate.ejb.packaging.NativeScanner.getClassesInJar(NativeScanner.java:128)
        ... 31 more

Odd, I thought, since by default the service runs as root and should have free rein. So I poked around in that Python library directory, and lo and behold: the permissions on the .class files were wrong.

I chmod'ed the .class files to 644 (in /usr/share/cmf/python/Lib and /usr/share/cmf/python/Lib/simplejson), and sure enough, everything was working again.
Hopefully this is helpful to somebody.


Installing Scrapy on an Amazon Linux (CentOS-like) AMI

We're experimenting with Scrapy, and I thought I'd share what I found while installing it: the package has several dependencies that many Python installations don't include by default, and they aren't all listed in the documentation.

* First off, you want bzip2:

$ cd /tmp
$ wget http://bzip.org/1.0.6/bzip2-1.0.6.tar.gz
$ tar -xzf bzip2-1.0.6.tar.gz
$ cd bzip2-1.0.6
$ sudo make -f Makefile-libbz2_so
$ sudo make
$ sudo make install PREFIX=/usr/local
$ sudo cp libbz2.so.1.0.6 /usr/local/lib
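
If a later build step complains that it can't find libbz2.so.1.0, note that the upstream bzip2 README also has you create a version symlink for the shared library -- something like:

```shell
# Point the soname the linker looks for at the freshly copied library
$ sudo ln -sf /usr/local/lib/libbz2.so.1.0.6 /usr/local/lib/libbz2.so.1.0
```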

* Then you want the libffi headers:

$ sudo yum install libffi-devel

* Then you want to download the Python source and extract it:

$ wget https://www.python.org/ftp/python/2.7.6/Python-2.7.6.tgz
$ tar -xzf Python-2.7.6.tgz
$ cd Python-2.7.6

* I usually uncomment any lines referencing ssl and zlib in Modules/Setup.dist so those modules get built; this assumes the openssl-devel and zlib-devel packages are installed:

$ vi Modules/Setup.dist
# find and uncomment the lines

* Then build Python:

$ ./configure --prefix=/usr/local
$ sudo make
$ sudo make altinstall

* Install virtualenv if you don't have it (you should)

* Activate your virtualenv with your fresh Python

$ cd
$ virtualenv -p /usr/local/bin/python2.7 myenv
$ source myenv/bin/activate

* Install Scrapy

$ easy_install Scrapy

This was tested on an Amazon AWS image named amzn-ami-pv-2013.09.2.x86_64-ebs (ami-a43909e1), known as Amazon Linux AMI x86_64 PV EBS with the following version strings:

$ cat /proc/version
Linux version 3.4.73-64.112.amzn1.x86_64 (mockbuild@gobi-build-31003) (gcc version 4.6.3 20120306 (Red Hat 4.6.3-2) (GCC) ) #1 SMP Tue Dec 10 01:50:05 UTC 2013

$ cat /etc/*-release
Amazon Linux AMI release 2014.03


Elastic MapReduce, Hive and Input Files

We're using Hive and Amazon's Elastic MapReduce to process sizable data sets. Today, I was wondering why a simple count query on a table with under a billion rows was taking so long. The table is stored as a single gzipped file in an S3 bucket, and Hive was only using a single mapper. So I thought, hrm, the job isn't distributed at all--let's try splitting the input file into a bunch of smaller files and see whether Hive puts more mappers to work.

This is the initial slow job, with a single gzipped file for the table in S3:

hive> select count(*) FROM mytable;

Job 0: Map: 1  Reduce: 1   Cumulative CPU: 254.84 sec   HDFS Read: 207 HDFS Write: 10 SUCCESS
Total MapReduce CPU Time Spent: 4 minutes 14 seconds 840 msec
Time taken: 274.51 seconds, Fetched: 1 row(s)
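
To produce the smaller uncompressed inputs for the next test, something like this works (the filenames and the chunk size are illustrative; pick a line count that yields your target number of files, then upload the pieces back to S3):

```shell
# Decompress and split the single table file into fixed-size line chunks
$ zcat mytable.gz | split -d -a 3 -l 1000000 - mytable-part-
```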

This is the same job run against 240 non-gzipped files for the table in S3:

-- MULTIPLE FILES, not gzipped
hive> select count(*) FROM mytable_multiple_files_no_gzip;

Job 0: Map: 48  Reduce: 1   Cumulative CPU: 538.05 sec   HDFS Read: 25536 HDFS Write: 10 SUCCESS
Total MapReduce CPU Time Spent: 8 minutes 58 seconds 50 msec
Time taken: 55.071 seconds, Fetched: 1 row(s)

Not bad, eh?

Then I tried the same split scheme, except each file was gzipped individually (240 gzipped input files):

hive> select count(*) FROM mytable_multiple_files_gzip;

Job 0: Map: 240  Reduce: 1   Cumulative CPU: 1552.43 sec   HDFS Read: 52080 HDFS Write: 10 SUCCESS
Total MapReduce CPU Time Spent: 25 minutes 52 seconds 430 msec
Time taken: 112.735 seconds, Fetched: 1 row(s)

So with gzipped input files I had a one-to-one mapper-to-file relationship, while with uncompressed input files each mapper handled five files.
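
The counters bear that out: gzip files aren't splittable, so each one needs its own mapper, while the splittable uncompressed files get combined a few per mapper:

```shell
# 240 uncompressed files handled by 48 mappers:
$ echo $((240 / 48))
5
# 240 gzipped files handled by 240 mappers, i.e. one file each:
$ echo $((240 / 240))
1
```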

These numbers were obtained on a cluster with 8 i2.2xlarge data nodes and an m3.xlarge name node.

Typically (at least that's what a cursory Google search suggests), people have the opposite problem--too many small-ish files in S3, and therefore too many mappers, which can delay the reducers' work. So I'll test a few different splitting schemes on the same data set and post an update.


McCarthy was self-righteous too

Brendan Eich, the inventor of JavaScript, just resigned from his brand-new position as CEO of Mozilla after it came out that he had made a $1,000 donation to Proposition 8, California's anti-gay-marriage campaign.

That discovery caused an uproar among the self-righteous bien-pensants who work for Mozilla, and a number of employees posted tweets saying they thought he should resign.

I'm angry about this because it isn't very different from McCarthyism in reverse. A guy was forced out of a job because his political views didn't agree with the majority's.

I feel opposing gay marriage is bigoted, wrong, indefensible and on the wrong side of history. I don't know Eich. For all I know he's a raging asshole with ultra-right-wing views. He might even hate kittens and burp at the dinner table. I don't know.

But what I do know is that getting forced out of a job by a self-righteous San Francisco mob of entitled nerds who have probably never even seen a Republican in the flesh is just as indefensible. It's not what America and California are about. And it shows liberals can be assholes, too, when they put their minds to it.

I'd venture to say a very large number of CEOs are raging right-wing Republicans with questionable ethics. If you don't like your CEO's politics, you're free to work somewhere else. Your job isn't in grave danger if you and your CEO don't see eye-to-eye in terms of politics--there are laws on the books protecting you from discrimination. Why should your CEO's job be in jeopardy for that very same reason?

Eich's contributions to Web tech are immense and he may well be as capable as anyone of running Mozilla, a company he's been with for years. Yet he lost his job because of his politics. And that's not right, whether you agree with him or not.


Notes on Azkaban

I've been evaluating tools to run data processing jobs and narrowed my list down to Luigi and Azkaban.

I nixed Luigi for a number of reasons:
  • you can't execute jobs from the web UI.
  • you can't schedule jobs--you still have to use cron and all the bs that goes with that (manually managing overlap, e.g. what should I do when my first job is still running when the second job is scheduled to start?).
  • the documentation is horrid.
So Azkaban it is. It does have some quirks I'm working through. For example:
  • the executor.host property is not in the default config. The web component wisely defaults to localhost, but it would be handy to have it in the default config, even commented out (like many other properties) so you can run Azkaban in its preferred distributed mode without having to look through Google Groups questions.
  • I still can't figure out how to set up the host configuration for the Hadoop cluster Azkaban is supposed to talk to. Fixed--see below.
But the UI is intuitive, and it handles the overlap issue (for a given job flow) like a champ: if a flow is scheduled or launched while it's still running, you can tell Azkaban to abort the second run, let it run in parallel, or wait until the first run completes before starting the second.

Things to remember:
  • Hadoop and Hive must be installed on the job executor box. Not running, just installed, with the standard HADOOP_HOME and HIVE_HOME env vars set, etc. 
  • Then you have to put your actual cluster's config files in the executor server's Hadoop config directory (typically $HADOOP_HOME/conf). This is because Azkaban looks for your remote Hadoop name node location in its local Hadoop configuration files. 
  • There's a lot of documentation out there based on Hadoop's old (1.x) directory structure. Hadoop 2.x has changed a lot of that and the Jars aren't where you'd expect them. Inspect your classpaths in all the config and properties files used by Azkaban and your Hive jobs. If a Hive job fails, it's a good bet you have a classpath problem, so look at your Azkaban executor server logs (not just the logs in the web interface).
  • The startup and shutdown scripts in bin/ are pretty brittle. Make sure all the directories are set correctly and you handle errors if they're not. Also, make sure to run them as bin/script.sh instead of cd bin; ./script.sh because they rely on relative directories.
  • Remember to open port 12321 in the executor server's firewall so the web server can submit jobs.
  • Remember to open port 9000 on your master Hadoop node so Azkaban can submit jobs to it. 
  • One project == one flow. If you upload more than one flow into a project, only the last one is retained. 
  • You can't run the Hive job examples as-is. They won't work because they're missing a few properties.
Take the basic Hive word count job example that ships with Azkaban: as-is it fails, and the fix is a handful of property changes.
I'm using Azkaban 2.1 and Hadoop 1.2.1. Different versions will have different paths and classpaths, but the azk.hive.action, classpath, and hive.query.file properties are crucial (hive.script doesn't work).
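
Putting those properties together, a minimal hive-type .job file would look something like this (the file names, proxy user, and classpath entries are illustrative for my Azkaban 2.1 / Hadoop 1.2.1 setup -- adjust them for your install):

```properties
# wordcount.job -- illustrative sketch, not a drop-in file
type=hive
user.to.proxy=hadoop
azk.hive.action=execute.query
hive.query.file=res/wordcount.q
# Double-check these against where your Hadoop/Hive jars actually live
classpath=./lib/*,${hadoop.home}/lib/*,${hive.home}/lib/*
```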

Other things that don't work out of the box:

  • When proxying as the Hadoop user of your choice, you need to set the security manager class to its fully qualified name. The sample config does not fully qualify the class name, and so the executor fails to load. To wit:

# hadoop security manager setting common to hadoop jobs
hadoop.security.manager.class=HadoopSecurityManager_H_1_0

should be

# hadoop security manager setting common to hadoop jobs
hadoop.security.manager.class=azkaban.security.HadoopSecurityManager_H_1_0

in the plugins/jobtypes/commonprivate.properties file.