Python import basics in plain English

In my own experience learning Python, and that of others on Python teams I've worked with, a common hurdle is understanding how Python does imports.

The basics are actually very simple, but the documentation tends to be a little neckbeardy and dense, and hard to grok if you're new to the language. So I thought I'd list common, simple practical examples of Python imports in case they help someone.

To import data and functions from somewhere else (another .py file in your project, a standard-library module like os, or a third-party library you may have installed with pip), you have the following options:

import <module>
import <module> as <other_name>
from <module> import a, b, c
from <module> import *

Let's look at what these options mean.

import <module>

import <module> means the program you're in can access everything that is defined in <module> (variables, classes, functions, etc), and you have to prepend "<module>." to those names. For example, if <module> is the os library (it comes with Python), which defines a function called getpid and a variable named name, your program can do this:

>>> import os
>>> os.name
>>> os.getpid()

This works with your own modules too. Say you created a Python file named network_functions.py that contains a constant named BANDWIDTH and a function named connect(url). You can do:

>>> import network_functions
>>> network_functions.BANDWIDTH
>>> network_functions.connect('https://google.com')

If you don't want to type out the whole prefix (which can get unwieldy when your imports are nested, i.e. the modules live in subdirectories, e.g. import lib.network.connection_functions), you have the following options:

import <module> as <other_name>

This lets you use <other_name> instead of the module's full name. 

>>> import lib.network.connection_functions as netfunc
>>> netfunc.BANDWIDTH
>>> netfunc.connect('https://google.com')

from <module> import a, b, c

This lets you import only what you need from <module> into your current program's namespace. Everything you imported can then be called by its bare name in your program:

>>> from lib.network.connection_functions import BANDWIDTH, connect
>>> connect('https://google.com')

from <module> import *

This imports everything from <module> into your current program's namespace, so you can call everything from the module by its bare name. This is strongly discouraged, and I'll explain why.

>>> from lib.network.connection_functions import *
>>> connect('https://google.com')

This is almost never a good idea: you don't always know or control what is defined in an external module, and name collisions (functions or variables with the same name in different modules) mean you may not be using the variable or calling the function you expect! For example:

In file helpers.py:
def connect(url):
    print "Connecting to", url

In file network.py:
def connect(url):
    print "Hacking into", url

>>> from helpers import *
>>> from network import *
>>> connect('https://google.com')
# which one is called?

>>> from network import *
>>> from helpers import *
>>> connect('https://google.com')
# which one is called?

In both cases, the answer is: whichever module was imported last, because the later import silently rebinds the name connect. This example may seem contrived, but it's very common to import a bunch of modules written by different people, and some variable or function names are common or obvious enough that they may appear several times in different modules. Why wouldn't they, after all? Joe doesn't know about Jill's (or your own) module, so they have no reason to coordinate and ensure they're not using the same function names.

If you use from <module> import * with several modules, the odds are very good you'll call a function and actually invoke one that's not the one you expect. And that can be really tricky to debug. 

So what should you do if you do need a bunch of functionality from a module and don't want to import every single function and variable by name with:

from <module> import var1, var2, var3, fun1, fun2, fun4 # etc 

It's simple! Don't use from <module> import *. Instead, use import <module> and presto, your program can use everything from <module>, as long as you prefix the names with <module>.

>>> import os
>>> os.getpid()
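
And going back to the helpers/network collision from earlier, the module prefix makes it unambiguous which connect you're calling:

>>> import helpers
>>> import network
>>> helpers.connect('https://google.com')
Connecting to https://google.com
>>> network.connect('https://google.com')
Hacking into https://google.com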

Handy Tips

Do you ever want to know the variables or functions defined in a module you imported without having to Google them? Just use vars or dir:

>>> import os
>>> dir(os)
['EX_CANTCREAT', 'EX_CONFIG', 'EX_DATAERR', 'EX_IOERR', 'EX_NOHOST', 'EX_NOINPUT', 'EX_NOPERM', 'EX_NOUSER', 'EX_OK', 'EX_OSERR', 'EX_OSFILE', 'EX_PROTOCOL', 'EX_SOFTWARE', 'EX_TEMPFAIL', 'EX_UNAVAILABLE', 'EX_USAGE', 'F_OK', 'NGROUPS_MAX', 'O_APPEND', 'O_ASYNC', 'O_CREAT', 'O_DIRECTORY', 'O_DSYNC', 'O_EXCL', 'O_EXLOCK', 'O_NDELAY', 'O_NOCTTY', 'O_NOFOLLOW', 'O_NONBLOCK', 'O_RDONLY', 'O_RDWR', 'O_SHLOCK', 'O_SYNC', 'O_TRUNC', 'O_WRONLY', 'P_NOWAIT', 'P_NOWAITO', 'P_WAIT', 'R_OK', 'SEEK_CUR', 'SEEK_END', 'SEEK_SET', 'TMP_MAX', 'UserDict', 'WCONTINUED', 'WCOREDUMP', 'WEXITSTATUS', 'WIFCONTINUED', 'WIFEXITED', 'WIFSIGNALED', 'WIFSTOPPED', 'WNOHANG', 'WSTOPSIG', 'WTERMSIG', 'WUNTRACED', 'W_OK', 'X_OK', '_Environ', '__all__', '__builtins__', '__doc__', '__file__', '__name__', '__package__', '_copy_reg', '_execvpe', '_exists', '_exit', '_get_exports_list', '_make_stat_result', '_make_statvfs_result', '_pickle_stat_result', ...] # and a bunch more

>>> vars(os)
{'WTERMSIG': <built-in function WTERMSIG>, 'lseek': <built-in function lseek>, 'EX_IOERR': 74, 'EX_NOHOST': 68, 'seteuid': <built-in function seteuid>, 'pathsep': ':', 'execle': <function execle at 0x...>, '_Environ': <class os._Environ at 0x...>, ...} # and a bunch more
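
If you only care about the public names, a quick list comprehension filters out the underscore-prefixed entries:

>>> [name for name in dir(os) if not name.startswith('_')]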

What's Next?

In a later post, I'll cover other tricky aspects of imports, namely how Python maps import statements to Python code files in directories, and how to debug the ImportError: No module named <module> problems that can occur depending on which directories your files are in.

Hopefully that was helpful!


Django, static assets, versioning, and WhiteNoise

I had an interesting time troubleshooting an issue with Django, WhiteNoise and static asset versioning. This may be obvious to experienced Django users, but not to me; I've maintained Flask and Rails apps before, but Django is a new beast. I'll document it here in case it helps somebody.

My goal was to set up asset versioning in a Django app to serve static files as filename.somehash.js instead of filename.js (same with other file types like css, png, etc). This is standard practice; most modern frameworks have that capability, and different ways to do it.

I had started using WhiteNoise because the internets suggested it was much, much easier than the alternatives. I was hoping to do asset versioning and deploy a CloudFront CDN at the same time, and WhiteNoise is set up to do just that.
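
For context, the whole setup amounts to just a few lines. This is a sketch of the WhiteNoise 2.x-era configuration I was using (these names have changed in newer releases, so check the docs for your version; BASE_DIR is the usual Django settings boilerplate):

# settings.py (sketch)
STATIC_URL = '/static/'
STATIC_ROOT = os.path.join(BASE_DIR, 'staticfiles')
# this storage backend is what writes out the filename.somehash.js copies
STATICFILES_STORAGE = 'whitenoise.django.GzipManifestStaticFilesStorage'

# wsgi.py (sketch)
from django.core.wsgi import get_wsgi_application
from whitenoise.django import DjangoWhiteNoise
application = DjangoWhiteNoise(get_wsgi_application())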

Once everything was set up according to the documentation, I ran python manage.py collectstatic and saw the versioned file names getting generated. Checking the files themselves confirmed that. But when I loaded the app in a browser, only the unversioned file names were being requested.

After much head-scratching, I found this was because the app templates reference the static files with the standard {% load static %} method. The problem went away when I changed that to {% load static from staticfiles %} as suggested in this closed issue on the subject. Note that I didn't try the other option mentioned in that issue, {% load staticfiles %}, but that should also work. 
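
For the record, here's what the fix looks like in a template ('js/app.js' is just a stand-in for whatever files your templates reference):

{% load static from staticfiles %}
<script src="{% static 'js/app.js' %}"></script>

As I understand it, the plain static tag resolves names against STATIC_URL directly, while the staticfiles version goes through the storage backend and so picks up the hashed file names.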

Once the app restarted, beautiful unique file names were being requested and served. But I was occasionally getting 500 errors. I traced those back to instances where the app and WhiteNoise were being asked to serve files that no longer existed. Those references to deleted js, css, etc. files didn't actually harm the app's functionality, but when WhiteNoise is asked to serve them, it throws an exception and causes the app to 500.

That's not ideal behavior--my take is that 50x errors should never happen in a production app and should always be handled gracefully when they do, so a library that causes 500s by actively raising exceptions rather than logging, catching and handling them gracefully isn't ideal. But them's the breaks, and I might yet submit a PR to the maintainer if I find the time.

In this particular app's case, this behavior was especially non-ideal because some of these files were referenced in commented-out JavaScript, and not actually requested; it looks like WhiteNoise and/or Django greedily consider anything that looks like a static file path to be actually requested, even if it's in code that doesn't execute.

The solution is simple--find all those dangling references and exterminate them! Use those 500s to your advantage by exercising the app and tailing your error logs. It's easy to argue that's something you should do no matter what, so it wasn't hard to convince the code owners it was the right thing to do.


Installing wxPython in a virtualenv on Centos 6.7

I'm looking at wxPython to write a GUI for an app I'm working on, and as it turns out using wxPython with virtual environments isn't completely obvious. Hopefully someone finds this helpful.

My distribution is CentOS 6.7 with a hand-built Python 2.7.6.

Step 1: Install wxPython

This installs wxPython in your system's default Python library directory (not the one you want); the find command shows you where everything landed.

$ sudo yum install wxPython
$ sudo find / -name 'wx*.py'

Step 2: Create and activate your virtualenv

$ cd
$ virtualenv -p /usr/bin/python2.7 venv
$ source venv/bin/activate

Importing wx will fail:

$ python
Python 2.7.6 (default, Dec  2 2013, 21:17:42) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import wx
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named wx

Step 3: Symlink wxPython into your virtualenv

$ cd ~/venv/lib/python2.7/site-packages
$ ln -s /usr/lib64/python2.6/site-packages/wx-2.8-gtk2-unicode/wx
$ ln -s /usr/lib64/python2.6/site-packages/wx-2.8-gtk2-unicode/wxPython
$ ln -s /usr/lib64/python2.6/site-packages/wxversion.py
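
A quick way to confirm the links resolve before you start coding (the exact version string will depend on what yum installed):

$ python -c "import wx; print wx.version()"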

Step 4: Start coding!

$ source venv/bin/activate
$ python
Python 2.7.6 (default, Dec  2 2013, 21:17:42) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import wx
>>> app = wx.App()
>>> frame = wx.Frame(None, -1, 'lol')
>>> frame.Show()
>>> app.MainLoop()

Note: I only just started playing with wxPython, so there may be other symlinks required to make it work. Let me know if so, and I'll update the post.


Recruiter isn't even trying anymore

I got this InMail on LinkedIn today.

March 25, 2015, 1:33 PM
Dear Roger,
Trust this finds you in good health!! I wanted to share a Back End Engineer role with you with one of my direct clients , its a long term contract and location is Mountain View, CA. Here is the short story:

The candidate should be able to quickly adapt to different environment/framework as the project needs.
These are minimum required skills.
Open source backend stack
Web backend frameworks in Java OR C#, OR Python OR Ruby, etc.
Database design and maintenance using SQL and/or NoSQL
Let me know if you think that this would be a fit as per your skills and expertise, if no, maybe you can refer someone !!
  • Nothing about the company (industry? size? age? product?)
  • The technology stack is essentially "whatever" (but open source! Including C#!)
  • The data stack is essentially "whatever"
  • The role is web AND database design AND dba
I LOL'ed heartily as I marked the InMail as spam.


Comcast lies

Verbatim transcript of a chat session with Naval. I was having internet connectivity problems. Naval says (s)he resolved them, and then did the standard upsell trick, in which (s)he lied. Then when I noticed the problems weren't resolved, (s)he lied again.


Lie #1: " you will get additional service at a cheaper cost."

The rep correctly showed my current service is 50 Mbps down plus basic cable TV for $82/mo. He said Comcast had an offer where I could get "additional service at a cheaper cost." The "cheaper cost" is $101.99 for the first 12 months, and then $126.99 for months 13-24.

Lie #2: "Every thing is totally fine"

At the exact time Naval said "Every thing is totally fine," Comcast was having well-documented problems.


Lie #3 (probably): "You will also check this problem on our site also"

The "server upgradation" and service degradation is not listed.



I spoke with another rep, Jose, who was great. Excerpts from our chat, mostly for the lulz:

Jose: May I know the issue that our rep lied to you?
Roger: (summary of the above)
Roger: I asked the price
Roger: "This package will cost you only $101.99 for first 12 months, after that it will cost you $126.99 from 13-24 months."
Roger: $101.99 is not actually cheaper than $91.96.
Jose: yes that is correct. Its simple math.

Then this unintentionally hilarious pearl:

Jose: On internet issue they can be corrected by our Internet dept.
Jose: We are cable troubleshooting dept so all of our tools here are for cable TV.
Jose: On broken promises those usually happens on Sales dept.
Jose: Here on cable troubleshhoting we dont lie.


Enabling LZO compression for Hive to avoid cannot find class com.hadoop.mapred.DeprecatedLzoTextInputFormat error

I just spent a bunch of time reading through documentation and Google Group postings about how to enable LZO compression in Hive, only to find none of them was the right solution. In the end I did find something that worked, so hopefully this can help someone.

Goal: enable LZO-compressed files to be used for Hive tables.

Environment: Hadoop cluster managed with Cloudera Manager version 5.

What's missing from the usual instructions and the Google Group postings about that error is how to tell Hive where to find the Hadoop LZO jar. Classpath settings alone are not sufficient, and you'll get this error when running a Hive query against an LZO table:

cannot find class com.hadoop.mapred.DeprecatedLzoTextInputFormat

To fix this:
  • Go to your Cloudera Manager UI home page
  • Click Hive
  • Click Configuration > View and Edit
  • Under Service-Wide > Advanced, look for Hive Auxiliary JARs Directory
  • Set the value to /opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib
  • Restart the Hive service (and any related services)
Now you can run queries against LZO-compressed files.
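
If you're not running Cloudera Manager, I believe the session-level equivalent is adding the jar by hand in the Hive CLI (the exact jar file name under that directory will vary by parcel version, so check what's actually there):

hive> ADD JAR /opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/hadoop-lzo.jar;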

As a reminder, to create a table backed by LZO-compressed files in HDFS, do something like this:

CREATE EXTERNAL TABLE `my_lzo_table`(`something` string)
STORED AS INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/path/to/your/lzo/files';


Cloudera Manager Fails to Restart

I've been experimenting with Cloudera Manager to manage Hadoop clusters on EC2. So far it seems to be working a little better than Ambari, which managed to install its agent software on all my nodes but always failed to start the required services.

Cloudera Manager did fail as well, but that seemed to be due to my security group settings. I changed the configuration and restarted the service on the host, but the restart always failed with this error:

Caused by: java.io.FileNotFoundException: /usr/share/cmf/python/Lib/site$py.class (Permission denied)
        at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.<init>(FileInputStream.java:146)
        at org.hibernate.ejb.packaging.ExplodedJarVisitor.getClassNamesInTree(ExplodedJarVisitor.java:126)
        at org.hibernate.ejb.packaging.ExplodedJarVisitor.getClassNamesInTree(ExplodedJarVisitor.java:134)
        at org.hibernate.ejb.packaging.ExplodedJarVisitor.getClassNamesInTree(ExplodedJarVisitor.java:134)
        at org.hibernate.ejb.packaging.ExplodedJarVisitor.doProcessElements(ExplodedJarVisitor.java:92)
        at org.hibernate.ejb.packaging.AbstractJarVisitor.getMatchingEntries(AbstractJarVisitor.java:149)
        at org.hibernate.ejb.packaging.NativeScanner.getClassesInJar(NativeScanner.java:128)
        ... 31 more

Odd, I thought, since by default the service runs as root and should have free rein. So I poked around in that Python library directory, and lo and behold: the permissions on the .class files were broken.

I chmod'ed the .class files to 644 (in /usr/share/cmf/python/Lib and /usr/share/cmf/python/Lib/simplejson) and sure enough everything was working again.
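
In case it saves someone a minute, this is roughly what I ran (the paths come straight from the stack trace above):

$ sudo chmod 644 /usr/share/cmf/python/Lib/*.class
$ sudo chmod 644 /usr/share/cmf/python/Lib/simplejson/*.class
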
Hopefully this is helpful to somebody.


Installing Scrapy on an Amazon Linux AMI

We're experimenting with Scrapy, and I thought I'd share what I found while installing the Scrapy package: it has several dependencies that many Python installations don't normally include, and they're not listed in the documentation.

* First off, you want bzip2:

$ cd /tmp
$ wget http://bzip.org/1.0.6/bzip2-1.0.6.tar.gz
$ tar -xzf bzip2-1.0.6.tar.gz
$ cd bzip2-1.0.6
$ sudo make -f Makefile-libbz2_so
$ sudo make
$ sudo make install PREFIX=/usr/local
$ sudo cp libbz2.so.1.0.6 /usr/local/lib

* Then you want the libffi headers:

$ sudo yum install libffi-devel

* Then you want to download the Python source and extract it:

$ wget https://www.python.org/ftp/python/2.7.6/Python-2.7.6.tgz
$ tar -xzf Python-2.7.6.tgz
$ cd Python-2.7.6

* I usually uncomment any lines referencing ssl and zlib in Modules/Setup.dist

$ vi Modules/Setup.dist
# find and uncomment the lines

* Then build Python:

$ ./configure --prefix=/usr/local
$ sudo make
$ sudo make altinstall

* Install virtualenv if you don't have it (you should)
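
Something like this should get it if you need it (assuming the system Python has setuptools; if not, get pip first and use that instead):

$ sudo easy_install virtualenv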

* Create and activate a virtualenv with your fresh Python:

$ cd
$ virtualenv -p /usr/local/bin/python2.7 myenv
$ source myenv/bin/activate

* Install Scrapy

$ easy_install Scrapy
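
A quick sanity check that the install worked:

$ scrapy version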

This was tested on an Amazon AWS image named amzn-ami-pv-2013.09.2.x86_64-ebs (ami-a43909e1), known as Amazon Linux AMI x86_64 PV EBS with the following version strings:

$ cat /proc/version
Linux version 3.4.73-64.112.amzn1.x86_64 (mockbuild@gobi-build-31003) (gcc version 4.6.3 20120306 (Red Hat 4.6.3-2) (GCC) ) #1 SMP Tue Dec 10 01:50:05 UTC 2013

$ cat /etc/*-release
Amazon Linux AMI release 2014.03


Elastic MapReduce, Hive and Input Files

We're using Hive and Amazon's Elastic MapReduce to process sizable data sets. Today, I was wondering why a simple count query on a table with under a billion rows was taking a long time. The table file is in a single gzipped file in an S3 bucket, and Hive was only using a single mapper. So I thought, hrm, it looks like the job isn't distributed at all, so let's try splitting the input file into a bunch of smaller files to see if Hive will be able to put more mappers to work.

This is the initial slow job, with a single gzipped file for the table in S3:

hive> select count(*) FROM mytable;

Job 0: Map: 1  Reduce: 1   Cumulative CPU: 254.84 sec   HDFS Read: 207 HDFS Write: 10 SUCCESS
Total MapReduce CPU Time Spent: 4 minutes 14 seconds 840 msec
Time taken: 274.51 seconds, Fetched: 1 row(s)
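
The splitting itself was nothing fancy. Something like this is all it takes--I'm sketching from memory here, and the line count per chunk is a made-up figure; pick whatever yields the file count you want:

$ zcat mytable.gz | split -l 4000000 - mytable_part_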

This is the same job run against 240 non-gzipped files for the table in S3:

-- MULTIPLE FILES, not gzipped
hive> select count(*) FROM mytable_multiple_files_no_gzip;

Job 0: Map: 48  Reduce: 1   Cumulative CPU: 538.05 sec   HDFS Read: 25536 HDFS Write: 10 SUCCESS
Total MapReduce CPU Time Spent: 8 minutes 58 seconds 50 msec
Time taken: 55.071 seconds, Fetched: 1 row(s)

Not bad, eh?

Then I tried the same splitting scheme, except each file was gzipped individually (240 gzipped input files):

hive> select count(*) FROM mytable_multiple_files_gzip;

Job 0: Map: 240  Reduce: 1   Cumulative CPU: 1552.43 sec   HDFS Read: 52080 HDFS Write: 10 SUCCESS
Total MapReduce CPU Time Spent: 25 minutes 52 seconds 430 msec
Time taken: 112.735 seconds, Fetched: 1 row(s)

So with gzipped input files, I had a one mapper-one file relationship; with uncompressed input files, I had a one mapper-five files relationship. That makes sense: gzip isn't a splittable format, so Hadoop has to dedicate a mapper to each gzipped file, whereas uncompressed text files can be combined into larger splits.

These numbers were obtained on a cluster with 8 i2.2xlarge data nodes and an m3.xlarge name node.

Typically (at least that's what a cursory Google search suggests), people have the opposite problem--too many small-ish files in S3, and too many mappers. Too many mappers can delay your reducers' work. So I'll do some testing on different splitting schemes for the same data set and update.


McCarthy was self-righteous too

Brendan Eich, inventor of JavaScript, just resigned from his brand-new position as CEO of Mozilla, after it was discovered he made a $1000 donation to the anti-gay-marriage campaign in California known as Prop 8.

That discovery caused an uproar among the self-righteous bien-pensants who work for Mozilla, and a number of employees posted tweets saying they thought he should resign.

I'm angry about this because it isn't very different from McCarthyism in reverse. A guy was forced out of a job because his political views don't agree with the majority's.

I feel opposing gay marriage is bigoted, wrong, indefensible and on the wrong side of history. I don't know Eich. For all I know he's a raging asshole with ultra-right-wing views. He might even hate kittens and burp at the dinner table. I don't know.

But what I do know is that getting forced out of a job by a self-righteous San Francisco mob of entitled nerds who have probably never even seen a Republican in the flesh is just as indefensible. It's not what America and California are about. And it shows liberals can be assholes, too, when they put their minds to it.

I'd venture to say a very large number of CEOs are raging right-wing Republicans with questionable ethics. If you don't like your CEO's politics, you're free to work somewhere else. Your job isn't in grave danger if you and your CEO don't see eye-to-eye in terms of politics--there are laws on the books protecting you from discrimination. Why should your CEO's job be in jeopardy for that very same reason?

Eich's contributions to Web tech are immense and he may well be as capable as anyone of running Mozilla, a company he's been with for years. Yet he lost his job because of his politics. And that's not right, whether you agree with him or not.