2014-06-19

Enabling LZO compression for Hive to avoid cannot find class com.hadoop.mapred.DeprecatedLzoTextInputFormat error

I just spent a bunch of time reading through documentation and Google Group postings about how to enable LZO compression in Hive, only to find none of them was the right solution. In the end I did find something that worked, so hopefully this can help someone.

Goal: enable LZO-compressed files to be used for Hive tables.

Environment: Hadoop cluster managed with Cloudera Manager version 5.

Prerequisites:
What's missing from the instructions and the Google Group postings about that error is how to tell Hive where to find the Hadoop LZO jar. The instructions about classpath settings above are not sufficient, and you'll have this error when running a Hive query against an LZO table:

cannot find class com.hadoop.mapred.DeprecatedLzoTextInputFormat

To fix this:
  • Go to your Cloudera Manager UI home page
  • Click Hive
  • Click Configuration > View and Edit
  • Under Service-Wide > Advanced, look for Hive Auxiliary JARs Directory
  • Set the value to /opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib
  • Restart the Hive service (and any related services)
Now you can run queries against LZO-compressed files.

As a reminder, to create a table backed by LZO-compressed files in HDFS, do something like this:

CREATE EXTERNAL TABLE `my_lzo_table`(`something` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT
  'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  '/hdfs/path/to/your/lzo/files';

2014-06-11

Cloudera Manager Fails to Restart

I've been experimenting with Cloudera Manager to manage Hadoop clusters on EC2. So far it seems to be working a little better than Ambari, which managed to install its agent software on all my nodes but always failed to start the required services.
Cloudera Manager did fail as well, but that seemed to be due to my security group settings. I changed the configuration and restarted the service on the host, but the restart always failed with this error:

Caused by: java.io.FileNotFoundException: /usr/share/cmf/python/Lib/site$py.class (Permission denied)
        at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.(FileInputStream.java:146)
        at org.hibernate.ejb.packaging.ExplodedJarVisitor.getClassNamesInTree(ExplodedJarVisitor.java:126)
        at org.hibernate.ejb.packaging.ExplodedJarVisitor.getClassNamesInTree(ExplodedJarVisitor.java:134)
        at org.hibernate.ejb.packaging.ExplodedJarVisitor.getClassNamesInTree(ExplodedJarVisitor.java:134)
        at org.hibernate.ejb.packaging.ExplodedJarVisitor.doProcessElements(ExplodedJarVisitor.java:92)
        at org.hibernate.ejb.packaging.AbstractJarVisitor.getMatchingEntries(AbstractJarVisitor.java:149)
        at org.hibernate.ejb.packaging.NativeScanner.getClassesInJar(NativeScanner.java:128)
        ... 31 more

Odd, I thought, since by default the service runs as root and should have free rein. So I poked around in that Python library directory, and lo and behold:

I chmod'ed 644 the .class files (in /usr/share/cmf/python/Lib and /usr/share/cmf/python/Lib/simplejson) and sure enough everything is working again.
Hopefully this is helpful to somebody.