2014-03-12

Notes on Azkaban

I've been evaluating tools to run data processing jobs and narrowed my list down to Luigi and Azkaban.

I nixed Luigi for a number of reasons:
  • you can't execute jobs from the web UI.
  • you can't schedule jobs--you still have to use cron and all the bs that goes with that (manually managing overlap, e.g. what should I do when my first job is still running when the second job is scheduled to start?).
  • the documentation is horrid.
So far so good. Azkaban does have some quirks I'm working through. For example:
  • the executor.host property is not in the default config. The web component wisely defaults to localhost, but it would be handy to have it in the default config, even commented out (like many other properties) so you can run Azkaban in its preferred distributed mode without having to look through Google Groups questions.
  • I still can't figure out how to set up the host configuration for the Hadoop cluster Azkaban is supposed to talk to. Fixed--see below.
But the UI is intuitive and it handles the overlap issue (for a given job flow) like a champ: if a job is schedule or run while it's still running, you can tell Azkaban to abort the second run, let it run in parallel, or wait until the first run completes before the second run starts.

Things to remember:
  • Hadoop and Hive must be installed on the job executor box. Not running, just installed, with the standard HADOOP_HOME and HIVE_HOME env vars set, etc. 
  • Then you have to put your actual cluster's config files in the executor server's Hadoop config directory (typically $HADOOP_HOME/conf). This is because Azkaban looks for your remote Hadoop name node location in its local Hadoop configuration files. 
  • There's a lot of documentation out there based on Hadoop's old (1.x) directory structure. Hadoop 2.x has changed a lot of that and the Jars aren't where you'd expect them. Inspect your classpaths in all the config and properties files used by Azkaban and your Hive jobs. If a Hive job fails, it's a good bet you have a classpath problem, so look at your Azkaban executor server logs (not just the logs in the web interface).
  • The startup and shutdown scripts in bin/ are pretty brittle. Make sure all the directories are set correctly and you handle errors if they're not. Also, make sure to run them as bin/script.sh instead of cd bin; ./script.sh because they rely on relative directories.
  • Remember to open port 12321 in the executor server's firewall so the web server can submit jobs.
  • Remember to open port 9000 on your master Hadoop node so Azkaban can submit jobs to it. 
  • One project == one flow. If you upload more than one flow into a project, only the last one is retained. 
  • You can't run the Hive job examples as-is. They won't work because they're missing a few properties.
This is the basic Hive word count job example:

type=hive
user.to.proxy=azkaban
hive.script=scripts/hive-wc.hql

In order for it to work, it needs to look like this (differences that matter are bolded):

type=hive
user.to.proxy=hadoop
azk.hive.action=execute.query
classpath=./*,./lib/*,${hadoop.home}/*,${hadoop.home}/lib/*,${hive.home}/lib/*
hive.query.file
=scripts/hive-wc.hql

I'm using Azkaban 2.1 and Hadoop 1.2.1. Different versions will have different paths and classpaths, but the azk.hive.action, classpath and hive.query.file are crucial (hive.script doesn't work).

Other things that don't work out of the box:

  • When proxying to use the hadoop user of your choice, you need to set the security manager class to its fully qualified name. The sample config does not fully qualify the class name and so the executor fails to load. To wit:
# hadoop security manager setting common to hadoop jobs
hadoop.security.manager.class=HadoopSecurityManager_H_1_0

should be

# hadoop security manager setting common to hadoop jobs hadoop.security.manager.class=azkaban.security.HadoopSecurityManager_H_1_0

in the plugins/jobtypes/commonprivate.properties file.

1 comment:

  1. One project != one flow

    A project can have more than one flow. Any job that does not have any dependents (i.e., no other jobs depend on it) is considered a flow. Another way to think about this is that each sink in your project job graph is considered a flow.

    ReplyDelete