Django, static assets, versioning, and WhiteNoise

I had an interesting time troubleshooting an issue with Django, WhiteNoise and static asset versioning. This may be obvious to experienced Django users, but not to me; I've maintained Flask and Rails apps before, but Django is a new beast. I'll document it here in case it helps somebody.

My goal was to set up asset versioning in a Django app to serve static files as filename.somehash.js instead of filename.js (same with other file types like css, png, etc). This is standard practice; most modern frameworks have that capability, and different ways to do it.

I had started using WhiteNoise because the internets suggested it was a much, much easier task than other ways to do it. I was hoping to do asset versioning and deploy a Cloudfront CDN at the same time, and WhiteNoise is set up to do just that.

Once everything was set up according to the documentation, I ran python manage.py collectstatic and saw the versioned file names getting generated. Checking the files themselves confirmed that. But when I loaded the app in a browser, only the unversioned file names were being requested.

After much head-scratching, I found this was because the app templates reference the static files with the standard {% load static %} method. The problem went away when I changed that to {% load static from staticfiles %} as suggested in this closed issue on the subject. Note that I didn't try the other option mentioned in that issue, {% load staticfiles %}, but that should also work. 

Once the app restarted, beautiful unique file names were being requested and served. But I was occasionally getting 500 errors. I traced those back to instances where the app and WhiteNoise were being asked to serve files that no longer exist. Those references to deleted js, css, etc. files didn't actually harm the app's functionality, but when WhiteNoise is asked to serve them, it throws an exception and causes the app to 500.

That's not ideal behavior--my take is that 50x errors in a production app should never happen and always be handled gracefully when they do, so a library that causes 500s by actively raising exceptions rather than logging, catching and handling them gracefully isn't ideal. But them's the breaks, and I might yet submit a PR to the owner if I find the time.

In this particular app's case, this behavior was especially non-ideal because some of these files were referenced in commented-out JavaScript, and not actually requested; it looks like WhiteNoise and/or Django greedily consider anything that looks like a static file path to be actually requested, even if it's in code that doesn't execute.

The solution is simple--find all those dangling references and exterminate them! Use those 500s to your advantage by exercising the app and tailing your error logs. It's easy to argue that's something you should do no matter what, so it wasn't hard to convince the code owners it was the right thing to do.


Installing wxPython in a virtualenv on Centos 6.7

I'm looking at wxPython to write a GUI for an app I'm working on, and as it turns out using wxPython with virtual environments isn't completely obvious. Hopefully someone finds this helpful.

My distribution is CentOS 6.7 with a hand-built Python 2.7.6.

Step 1: Install wxPython

This will install wxPython in your system's default Python library directory (not the one you want).

$ sudo yum install wxPython
$ sudo find / -name wx*.py

Step 2: Create and activate your virtualenv

$ cd
$ virtualenv -p /usr/bin/python2.7 venv
$ source venv/bin/activate

Importing wx will fail:

$ python
Python 2.7.6 (default, Dec  2 2013, 21:17:42) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import wx
Traceback (most recent call last):
  File "", line 1, in
ImportError: No module named wx

Step 3: Symlink wxPython into your virtualenv

$ cd ~/venv/lib/python2.7/site-packages
$ ln -s /usr/lib64/python2.6/site-packages/wx-2.8-gtk2-unicode/wx
$ ln -s /usr/lib64/python2.6/site-packages/wx-2.8-gtk2-unicode/wxPython
$ ln -s /usr/lib64/python2.6/site-packages/wxversion.py

Step 4: Start coding!

$ source venv/bin/activate
$ python
Python 2.7.6 (default, Dec  2 2013, 21:17:42) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import wx
>>> app = wx.App()
>>> frame = wx.Frame(None, -1, 'lol')
>>> frame.Show()
>>> app.MainLoop()

Note: I only just started playing with wxPython, so there may be other symlinks required to make it work. Let me know if so, and I'll update the post.


Recruiter isn't even trying anymore

I got this InMain on LinkedIn today.

March 25, 2015, 1:33 PM
Dear Roger,
Trust this finds you in good health!! I wanted to share a Back End Engineer role with you with one of my direct clients , its a long term contract and location is Mountain View, CA. Here is the short story:

The candidate should be able to quickly adapt to different environment/framework as the project needs.
These are minimum required skills.
Open source backend stack
Web backend frameworks in Java OR C#, OR Python OR Ruby, etc.
Database design and maintenance using SQL and/or NoSQL
Let me know if you think that this would be a fit as per your skills and expertise, if no, maybe you can refer someone !!
  • Nothing about the company (industry? size? age? product?)
  • The technology stack is essentially "whatever" (but open source! Including C#!)
  • The data stack is essentially "whatever"
  • The role is web AND database design AND dba
I LOL'ed heartily as I marked the InMail as spam.


Comcast lies

Verbatim transcript of a chat session with Naval. I was having internet connectivity problems. Naval says (s)he resolved them, and then did the standard upsell trick, in which (s)he lied. Then when I noticed the problems weren't resolved, (s)he lied again.


Lie #1: " you will get additional service at a cheaper cost."

The rep correctly showed my current service is 50m down and basic cable TV for $82 / mo. He said Comcast had an offer where I could get "additional service at a cheaper cost." The "cheaper cost" is $101.99 for first 12 months and then $126.99 from 13-24 months.

Lie #2: "Every thing is totally fine"

At the exact time Naval said "Every thing is totally fine," Comcast was having well-documented problems:


Lie #3 (probably): "You will also check this problem on our site also"

The "server upgradation" and service degradation is not listed.



I spoke with another rep, Jose, who was great. Excerpts from our chat, mostly for the lulz:

Jose: May I know the issue that our rep lied to you?
Roger: (summary of the above)
Roger: I asked the price
Roger: "This package will cost you only $101.99 for first 12 months, after that it will cost you $126.99 from 13-24 months."
Roger: $101.99 is not actually cheaper than $91.96.
Jose: yes that is correct. Its simple math.

Then this unintentionally hilarious pearl:

Jose: On internet issue they can be corrected by our Internet dept.
Jose: We are cable troubleshooting dept so all of our tools here are for cable TV.
Jose: On broken promises those usually happens on Sales dept.Jose: Here on cable troubleshhoting we dont lie.


Enabling LZO compression for Hive to avoid cannot find class com.hadoop.mapred.DeprecatedLzoTextInputFormat error

I just spent a bunch of time reading through documentation and Google Group postings about how to enable LZO compression in Hive, only to find none of them was the right solution. In the end I did find something that worked, so hopefully this can help someone.

Goal: enable LZO-compressed files to be used for Hive tables.

Environment: Hadoop cluster managed with Cloudera Manager version 5.

What's missing from the instructions and the Google Group postings about that error is how to tell Hive where to find the Hadoop LZO jar. The instructions about classpath settings above are not sufficient, and you'll have this error when running a Hive query against an LZO table:

cannot find class com.hadoop.mapred.DeprecatedLzoTextInputFormat

To fix this:
  • Go to your Cloudera Manager UI home page
  • Click Hive
  • Click Configuration > View and Edit
  • Under Service-Wide > Advanced, look for Hive Auxiliary JARs Directory
  • Set the value to /opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib
  • Restart the Hive service (and any related services)
Now you can run queries against LZO-compressed files.

As a reminder, to create a table backed by LZO-compressed files in HDFS, do something like this:

CREATE EXTERNAL TABLE `my_lzo_table`(`something` string)


Cloudera Manager Fails to Restart

I've been experimenting with Cloudera Manager to manage Hadoop clusters on EC2. So far it seems to be working a little better than Ambari, which managed to install its agent software on all my nodes but always failed to start the required services.
Cloudera Manager did fail as well, but that seemed to be due to my security group settings. I changed the configuration and restarted the service on the host, but the restart always failed with this error:

Caused by: java.io.FileNotFoundException: /usr/share/cmf/python/Lib/site$py.class (Permission denied)
        at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.(FileInputStream.java:146)
        at org.hibernate.ejb.packaging.ExplodedJarVisitor.getClassNamesInTree(ExplodedJarVisitor.java:126)
        at org.hibernate.ejb.packaging.ExplodedJarVisitor.getClassNamesInTree(ExplodedJarVisitor.java:134)
        at org.hibernate.ejb.packaging.ExplodedJarVisitor.getClassNamesInTree(ExplodedJarVisitor.java:134)
        at org.hibernate.ejb.packaging.ExplodedJarVisitor.doProcessElements(ExplodedJarVisitor.java:92)
        at org.hibernate.ejb.packaging.AbstractJarVisitor.getMatchingEntries(AbstractJarVisitor.java:149)
        at org.hibernate.ejb.packaging.NativeScanner.getClassesInJar(NativeScanner.java:128)
        ... 31 more

Odd, I thought, since by default the service runs as root and should have free rein. So I poked around in that Python library directory, and lo and behold:

I chmod'ed 644 the .class files (in /usr/share/cmf/python/Lib and /usr/share/cmf/python/Lib/simplejson) and sure enough everything is working again.
Hopefully this is helpful to somebody.


Installing Scrapy on an Amazon CentOS AMI

We're experimenting with Scrapy, and I thought I'd share what I found while installing the Scrapy package, as it has multiple dependencies many Python installations don't normally include, and those are not listed in the documentation.

* First off, you want bzip2:

$ cd /tmp
$ wget http://bzip.org/1.0.6/bzip2-1.0.6.tar.gz
$ tar -xzf bzip2-1.0.6.tar.gz
$ cd bzip2-1.0.6
$ sudo make -f Makefile-libbz2_so
$ sudo make
$ sudo make install PREFIX=/usr/local
$ sudo cp libbz2.so.1.0.6 /usr/local/lib

* Then you want the libffi headers

$ sudo yum install libffi-devel

* Then you want to download the Python source and extract it:

$ wget https://www.python.org/ftp/python/2.7.6/Python-2.7.6.tgz
$ tar -xzf Python-2.7.6.tgz
$ cd Python-2.7.6

* I usually uncomment any lines referencing ssl and zlib in Modules/Setup.dist

$ vi Modules/Setup.dist
# find and uncomment the lines

* Then build Python:

$ ./configure --prefix=/usr/local
$ sudo make
$ sudo make altinstall

* Install virtualenv if you don't have it (you should)

* Activate your virtualenv with your fresh Python

$ cd
$ virtualenv -p /usr/local/bin/python2.7 myenv
$ source myenv/bin/activate

* Install Scrapy

$ easy_install Scrapy

This was tested on an Amazon AWS image named amzn-ami-pv-2013.09.2.x86_64-ebs (ami-a43909e1), known as Amazon Linux AMI x86_64 PV EBS with the following version strings:

$ cat /proc/version
Linux version 3.4.73-64.112.amzn1.x86_64 (mockbuild@gobi-build-31003) (gcc version 4.6.3 20120306 (Red Hat 4.6.3-2) (GCC) ) #1 SMP Tue Dec 10 01:50:05 UTC 2013

$ cat /etc/*-release
Amazon Linux AMI release 2014.03


Elastic MapReduce, Hive and Input Files

We're using Hive and Amazon's Elastic MapReduce to process sizable data sets. Today, I was wondering why a simple count query on a table with under a billion rows was taking a long time. The table file is in a single gzipped file in an S3 bucket, and Hive was only using a single mapper. So I thought, hrm, it looks like the job isn't distributed at all, so let's try splitting the input file into a bunch of smaller files to see if Hive will be able to put more mappers to work.

This is the initial slow job, with a single gzipped file for the table in S3:

hive> select count(*) FROM mytable;

Job 0: Map: 1  Reduce: 1   Cumulative CPU: 254.84 sec   HDFS Read: 207 HDFS Write: 10 SUCCESS
Total MapReduce CPU Time Spent: 4 minutes 14 seconds 840 msec
Time taken: 274.51 seconds, Fetched: 1 row(s)

This is the same job run against 240 non-gzipped files for the table in S3:

-- MULTIPLE FILES, not gzipped
hive> select count(*) FROM mytable_multiple_files_no_gzip;

Job 0: Map: 48  Reduce: 1   Cumulative CPU: 538.05 sec   HDFS Read: 25536 HDFS Write: 10 SUCCESS
Total MapReduce CPU Time Spent: 8 minutes 58 seconds 50 msec
Time taken: 55.071 seconds, Fetched: 1 row(s)

Not bad, eh?

Then I tried the same split schema, except each file was gzipped individually (240 gzipped input files):

hive> select count(*) FROM mytable_multiple_files_gzip;

Job 0: Map: 240  Reduce: 1   Cumulative CPU: 1552.43 sec   HDFS Read: 52080 HDFS Write: 10 SUCCESS
Total MapReduce CPU Time Spent: 25 minutes 52 seconds 430 msec
Time taken: 112.735 seconds, Fetched: 1 row(s)

So with gzipped input files, I had a one mapper-one file relationship; with uncompressed input files, I had a one mapper-five files relationship.

These numbers were obtained on a cluster with 8 i2.2xlarge data nodes and an m3.xlarge name node.

Typically (at least that's what a cursory Google search suggests), people have the opposite problem--too many small-ish files in S3, and too many mappers. Too many mappers can delay your reducers' work. So I'll do some testing on different splitting schemas for the same data set and update.


McCarthy was self-righteous too

Brendan Eich, inventor of JavaScript, just resigned from his brand new position as CEO of the Mozilla foundation, after it was discovered he made a $1000 donation to the anti-gay-marriage campaign in California known as Prop 8.

That discovery caused uproar among the self-righteous bien-pensants who work for Mozilla, and a number of employees posted tweets about how they thought he should resign.

I'm angry about this because this isn't very different from McCarthyism in reverse. A guy was forced out of a job because his political views don't agree with the majority.

I feel opposing gay marriage is bigoted, wrong, indefensible and on the wrong side of history. I don't know Eich. For all I know he's a raging asshole with ultra-right-wing views. He might even hate kittens and burp at the dinner table. I don't know.

But what I do know is that getting forced out of a job by a self-righteous San Francisco mob of entitled nerds who have probably never even seen a Republican in the flesh is just as indefensible. It's not what America and California are about. And it shows liberals can be assholes, too, when they put their minds to it.

I'd venture to say a very large number of CEOs are raging right-wing Republicans with questionable ethics. If you don't like your CEO's politics, you're free to work somewhere else. Your job isn't in grave danger if you and your CEO don't see eye-to-eye in terms of politics--there are laws on the books protecting you from discrimination. Why should your CEO's job be in jeopardy for that very same reason?

Eich's contributions to Web tech are immense and he may well be as capable as anyone of running Mozilla, a company he's been with for years. Yet he lost his job because of his politics. And that's not right, whether you agree with him or not.


Notes on Azkaban

I've been evaluating tools to run data processing jobs and narrowed my list down to Luigi and Azkaban.

I nixed Luigi for a number of reasons:
  • you can't execute jobs from the web UI.
  • you can't schedule jobs--you still have to use cron and all the bs that goes with that (manually managing overlap, e.g. what should I do when my first job is still running when the second job is scheduled to start?).
  • the documentation is horrid.
So far so good. Azkaban does have some quirks I'm working through. For example:
  • the executor.host property is not in the default config. The web component wisely defaults to localhost, but it would be handy to have it in the default config, even commented out (like many other properties) so you can run Azkaban in its preferred distributed mode without having to look through Google Groups questions.
  • I still can't figure out how to set up the host configuration for the Hadoop cluster Azkaban is supposed to talk to. Fixed--see below.
But the UI is intuitive and it handles the overlap issue (for a given job flow) like a champ: if a job is schedule or run while it's still running, you can tell Azkaban to abort the second run, let it run in parallel, or wait until the first run completes before the second run starts.

Things to remember:
  • Hadoop and Hive must be installed on the job executor box. Not running, just installed, with the standard HADOOP_HOME and HIVE_HOME env vars set, etc. 
  • Then you have to put your actual cluster's config files in the executor server's Hadoop config directory (typically $HADOOP_HOME/conf). This is because Azkaban looks for your remote Hadoop name node location in its local Hadoop configuration files. 
  • There's a lot of documentation out there based on Hadoop's old (1.x) directory structure. Hadoop 2.x has changed a lot of that and the Jars aren't where you'd expect them. Inspect your classpaths in all the config and properties files used by Azkaban and your Hive jobs. If a Hive job fails, it's a good bet you have a classpath problem, so look at your Azkaban executor server logs (not just the logs in the web interface).
  • The startup and shutdown scripts in bin/ are pretty brittle. Make sure all the directories are set correctly and you handle errors if they're not. Also, make sure to run them as bin/script.sh instead of cd bin; ./script.sh because they rely on relative directories.
  • Remember to open port 12321 in the executor server's firewall so the web server can submit jobs.
  • Remember to open port 9000 on your master Hadoop node so Azkaban can submit jobs to it. 
  • One project == one flow. If you upload more than one flow into a project, only the last one is retained. 
  • You can't run the Hive job examples as-is. They won't work because they're missing a few properties.
This is the basic Hive word count job example:


In order for it to work, it needs to look like this (differences that matter are bolded):


I'm using Azkaban 2.1 and Hadoop 1.2.1. Different versions will have different paths and classpaths, but the azk.hive.action, classpath and hive.query.file are crucial (hive.script doesn't work).

Other things that don't work out of the box:

  • When proxying to use the hadoop user of your choice, you need to set the security manager class to its fully qualified name. The sample config does not fully qualify the class name and so the executor fails to load. To wit:
# hadoop security manager setting common to hadoop jobs

should be

# hadoop security manager setting common to hadoop jobs hadoop.security.manager.class=azkaban.security.HadoopSecurityManager_H_1_0

in the plugins/jobtypes/commonprivate.properties file.