Showing posts with label python. Show all posts
Showing posts with label python. Show all posts

2016-12-01

AMA Part 2: Software Development Resources

In the previous post, I discussed a few ways in which I have found coding bootcamps to be inadequate. In this post, I will present a list of resources I have found very useful as a software engineer. Many are free, and most others can be found in used bookstores (or Amazon) for moderate prices. I'd love to hear other people's favorite resources, so please add yours in the comments or on Twitter.

General Computer Science and Programming

Disclaimer: I didn't study CS formally, so this list is short and almost certainly out of date. Please send me your suggestions!

Language Specific

Official Tutorials / Documentation


Note: I strongly do not recommend a very popular Ruby introduction, Why's (Poignant) Guide to Ruby. A lot of people seem to love it. I found it way too cute, unclear, trying too hard, and just bad.

Books

Note: even if you don't program in C or C++, Kernighan & Richie is a great introduction to computer programming at a low level. The Lippman and Liberty, Halpern books go very well together, and are project-oriented walkthroughs of essential features. All three of these books are very short, but they pack a punch.
Note: get both Nutshell booksand read them in parallel. Mine is a free work-in-progress, aimed at people who have never programmed but want/need to learn Java.
Note: anything by Flanagan, Fowler, Beck, Crockford or Martelli is worth reading.

After You've Coded For A While

Note: I listed Stroustrup here because it's a pretty dense, dry read (do not read it as your first programming book) that requires a good idea of how things work under the hood. It's also not a practical introduction to C++ programming; it's a guide to the C++ language, and from that perspective it's full of vital insights about choices made when C++ was designed, which in turn makes you think about what computer languages can do, and the various ways they do it.

Specific Topics


2016-11-23

Docker lessons learned 1 year in

A little under a year ago, I started doing devops work for a startup (the Company) with very specialized needs. As it operates in a highly regulated sector, the company's access to their infrastructure is extremely restricted, to prevent accidental or malicious disclosure of protected information. Their in-house web apps and off-the-shelf on-prem software are deployed on a compliant PaaS (I'll call them "the Host", even though they offer vastly more than just hosting), which is very similar to Heroku and uses Docker exclusively for all applications deployed on their private EC2 cloud. I knew about Docker but had never used it, and it's been an interesting few months, so I thought I'd write up some observations in case they help someone.

Topsy Turvy

If you're coming to Docker from a traditional ops shop, it's important to keep in mind that many of your old habits and best practices either don't apply or are flipped upside down in a Docker environment. For example, you're probably going to use config management with Chef or Ansible a lot less, and convert your playbooks into Dockerfiles instead. Ansible/Chef/etc is based on the assumption that infrastructure has some level of permanence: you stand up a box, set it up with the right services and configuration, and it will probably be there and configured when you get around to deploying your app to it. By contrast, in the Docker world, things are much more just-in-time: you stand up and configure your container(s) while deploying your app. And when you update your app, you just toss the old containers and build new ones.

Another practice that may feel unnatural is the foregrounding of (the main) processes. On a traditional web server, you'd typically run nginx, some kind of app server, and your actual app, all in the background. Docker, on the other hand, tends to use a one-service-one-container approach, and because a container dies when its main process does, you have to have something running in the foreground (not daemonized) for your container to stay up. Typically that'll be your main process (e.g. nginx), or you'll daemonize your main process and have an infinite tail -f /some/log as your main process.

As a corollary, while traditional server setups often have a bunch of backgrounded services all logging to files, a typical Dockerized service will only have one log you care about (the one for your main process), and because a container is usually an ephemeral being, its local file system is best treated as disposable. That means not logging to files, but to stdout instead. It's great for watching what's happening now, but not super convenient if you're used to hopping on a box and doing quick greps and counts or walking through past logs when troubleshooting something that happened an hour ago. To do that, you have to deploy a log management system as soon as your app goes live, not after you have enough traffic and servers that server-hopping, grep and wc has become impractical. So get your logstash container ready, because you need it now, not tomorrow.

It's a decidedly different mindset that takes some getting used to.

I was already on board with the "everything is disposable" philosophy of modern high-availability systems, so conceptually it wasn't a huge leap, but if you're coming from a traditional shop with bare-metal (or even VM) deployments, it's definitely a mental switch.

Twelve Factor App Conventions

This one is more specific to the Host than to Docker in general, but it's part of an opinionated movement in modern software dev shops that includes Docker (and Rails, and Heroku), so I'll list it here. The Twelve-Factor App manifesto is a practical methodology for building modern apps delivered over the web. There's a lot of good stuff in there, like the emphasis on explicit declarations or the importance of a dev/stage environment matching production closely. But there's also questionable dogma that I find technically offensive. Specifically, factor 3 holds that configuration must be stored in the environment (as opposed to config files or delivered over some service).

I believe this is wrong. The app is software that runs in user space; the environment is a safe, hands-off container for the app. The environment and the app live at different levels of resolution: all the app stuff is inward-looking, only for and about the app; while the environment is outward-looking, configured with and exposing the right data for its guests (the apps and services running in the environment). Storing app-level (userspace) data in the environment is like trusting the bartender in a public bar with your specific drink preferences, and asking her what you like to drink (yes, this is a bad simile).

In addition, the concerns, scope, skills, budget, toolsets, and personalities of the folks involved in app work tend to be different from those of people doing the environment (ops) stuff. And while I'm ecstatic that devs and ops people appear to finally be merging into a "devops" hybrid, there's a host of practical reasons to divide up the work.

In practical terms, storing configuration in the environment also has significant drawbacks given the tools of the trade: people like me use grep dozens of times every day, and grepping through a machine's environment comprehensively (knowing that env variables may have been set as different Unix users) is error-prone and labor-intensive for no discernible benefit. Especially when your app is down and you're debugging things under pressure. It's also very easy to deploy what's supposed to be a self-contained "thing" (your twelve-factor app) and see it fail miserably, because someone forgot to set the environment variables (which highlights the self-contradictory, leaky nature of that config-in-the-environment precept: if your app depends on something external to it (the environment), it's not self-contained).

Another driver for the config-in-the-environment idea is to make sure developers don't store sensitive information like credentials, passwords, etc. in code that winds up in source control (and thus on every dev's computer, and potentially accidentally left in code you helpfully decided to open-source on GitHub). That makes a ton of sense and I'm all for it. But for practical purposes, this still means every dev who wants to do work on their local machine needs a way to get those secrets onto their computer, and there aren't a lot of really easy-to-use, auditable, secure and practical methods to share secrets. In other words, storing configuration in the environment doesn't solve a (very real) problem: it just moves it somewhere else, without providing a practical solution.

You may find this distinction specious, backwards, antiquated, or whatever. That's fine. The environment is the wrong place to store userspace/app-specific information. Don't do it.

That was a long-winded preamble to what I really wanted to discuss, namely the fact that the Host embraces this philosophy, and in quite a few instances it's made me want to punch the wall. In particular, the Host makes you set environment variables using a command-line client that's kind of like running remote ssh commands, meaning that values you set need to be escaped, and they don't always get escaped or unescaped the way you expect when you query them. So if you set an environment variable value to its current value as queried by the command-line client, you'll double-escape the value (e.g. "lol+wat" gets first set as "lol\+wat"; looking it up returns "lol\+wat" (escaped); resetting it turns it into "lol\\\+wat"; i.e. a set-get-set operation isn't idempotent). All this is hard-to-debug, painfully annoying, and completely unnecessary if the model wasn't so stupid about using the environment for configuration.

Dev == Prod?

One of the twelve-factor tenets is that dev/stage should mirror production closely. This is a very laudable goal, as it minimizes the risk of unexpected bugs due to environment differences (aka "but it worked on my machine"). It's especially laudable as a lot of developers (at least in Silicon Valley) have embraced OSX/macOS as their OS of choice, even though nobody deploys web apps to that operating system in production, which means there's always a non-zero risk of stuff that works on dev failing on production because of some incompatibility somewhere. This also means every dev wastes huge amounts of time getting their consumer laptop to masquerade an industrial server, using ports and casks and bottles and build-from-source and other unholy devices, instead of just, you know, doing the tech work on the same operating system you're deploying on, because that would mean touching Linux and ewww that's gross.

Originally, the Company had wrapped its production apps into Docker container using the Host's standard Dockerfiles and Procfiles, but devs were doing work on their bare-metal Macs, which meant finding, installing and configuring a whole bunch of things like Postgres, Redis, nginx, etc. That's annoying, overwhelming for new employees (since the documentation or Ansible playbooks you have to do that work are always behind and out of date about what actually happens on dev machines), and a pain to keep up to date. Individual dev machines drift apart from each other, "it works on my machine (but nor on yours)" becomes a frequent occurrence, and massive amounts of time (and money) are wasted debugging self-inflicted problems that really don't deserve to be debugged when it's so easy to do it right with a Linux VM and Ansible playbooks, but that would mean touching Linux and ewww that's gross.

So I was asked to wrap the dev environment into Dockerfiles, and ideally we'd use the same Dockerfile as production, so that dev could truly mirror prod and we'd make all those pesky bugs go away. Good plan. Unfortunately, though, I didn't find that to be practical in the Company's situation: the devs use a lot of dev-only tools (unit test harnesses, linters, debuggers, tracers, profilers) that we really do not want to have available in production. In addition, starting the various apps and services is also done differently on dev and prod: debug options are turned on, logging levels are more verbose, etc. So we realized and accepted the fact that we just can't use the same Dockerfile on dev and on prod. Instead, I've been building a custom parent image that includes the intersection of all the services and dependencies used in the Company's various apps, and converting each app's Dockerfile to extend that new base image. This significantly reduces the differences and copy-pasta between Dockerfiles, and will give us faster deployments, as the base image's file system layers are shared and therefore more likely to be cached.

Runtime v. Build Time

Back to Docker-specific bits, this one was a doozy. When building the dev Dockerfiles, I had split the setup between system-level configuration (in the Dockerfile) and app-specific setup (e.g. pip installs, node module installation, etc), which lived in a bootstrap script executed as the Dockerfile's CMD. It worked well, but it felt inelegant (two places to look for information about the container), so I was asked to move the bootstrap stuff into the Dockerfile.

The devs' setup requirements are fairly standard: they have their Mac tools set up just right, so they want to be able to use them to edit code, while the code executes in a VM or a Docker container. This means sharing the source code folder between the Mac host and the Docker containers, using the well-supported VOLUME or -v functionality. Because node modules and pip packages are app-specific, they are listed in various bog-standard requirements.txt and package.json files in the code base (and hence in the Mac's file system). As the code base is in a shared folder mounted inside the Docker container, I figured it'd be easy to just put the pip install stuff in the Dockerfile and point it at the mounted directories.

But that failed, every time. A pip install -e /somepath/ that was meant to install a custom library in editable mode (so it's pip-installed the same way as on prod, but devs can live-edit it) failed every time, missing its setup.py file, which is RIGHT THERE IN THE MOUNTED FOLDER YOU STUPID F**KING POS. A pip install -r /path/requirements.txt also failed, even though 1) it worked fine in the bootstrap script, which is also in the same folder/codebase 2) the volumes were specified and mounted correctly (I checked from inside the container).

That's when I realized the difference between build time and runtime in Docker. The stuff in the Dockerfile is read and executed at build time, so your app has what it needs in the container at runtime. During build time, your container isn't really running--a bunch of temporary containers briefly run so various configuration steps can be executed, and they leave file system layers behind as Docker moves through the Dockerfile. The volumes you declare in your Dockerfile and/or docker-compose.yml file are mounted as you'd expect (you can ssh into your container and see the mount points); but they are only bound to the host's shared folders at runtime. This means that commands in your Dockerfile (which are used at build time) cannot view or access files in your shared Mac folder, because those only become available at runtime.

Of course you could just ADD or COPY the files you need from the Mac folder into the mounted directory, and do your pip install in the Dockerfile that way. It works, but it feels kinda dirty. Instead, what we'll do is identify which pip libraries are used by most services, and bake those into our base image. That'll shave a few seconds off the app deployment time.

Editorializing a bit, while I (finally) understand why things behaved the way they did, and it's completely consistent with the way Docker works, I feel it's a design flaw and should not be allowed by the Docker engine. It violates the least-surprise principle in a major way: it does only part of what you think it will do (create folders and mount points). I'd strongly favor some tooling in Docker itself that detects cases like these and issues a WARNING (or barfs altogether if there was a strict mode).

Leaky Abstractions and Missing Features

Docker aims to be a tidy abstraction of a self-contained black box running on top of some machine (VM or bare-metal). It does a reasonable job using its union file system, but the abstraction is leaky: the underlying machine still peeks through, and can bite you in the butt.

I was asked to Dockerize an on-prem application. It's a Java app which is launched with a fairly elaborate startup script that sets various command-line arguments passed to the JVM, like memory and paths. The startup script is generic and meant to just work on most systems, no matter how much RAM they have or where stuff is stored in the file system. In this case, the startup script sets the JVM to use some percentage of the host's RAM, leaving enough for the operating system to run. It does this sensibly, parsing /proc/meminfo and injecting the calculated RAM into a -Xmx argument.

But when Dockerized, the container simply refused to run: the Host had allocated some amount of RAM to it, and the app's launcher was requesting 16 times more, because the /proc/meminfo file was... the host EC2 instance's! Of course, you could say "duh, that's a layered file system, of course that's what it does" and you'd be right. But the point is that a Docker container is not a fully encapsulated thing; it's common enough to query your environment's available RAM, and a clean, encapsulated container system should always give an answer that's reflective of itself, not breaking through to the underlying hardware.

Curious "Features"

Docker's network management is... peculiar. One of its more esoteric features is the order in which ports get EXPOSEd. I was working on a Dockerfile that was extending a popular public image, and I could not make it visible to the outside world, even though my ports were explicitly EXPOSEd and mapped. My parent image was EXPOSing port 443, and I wanted to expose a higher port (4343). For independent reasons, the Host's system only exposes the first port it finds, even if several are EXPOSEd; and because there's no UNEXPOSE functionality, it seemed I'd have to forget about extending the public base image and roll my own so I could control the port.

But the Host's bottomless knowledge of Docker revealed that Docker exposes ports in lexicographic order. Not numeric. That means 3000 comes before 443. So I could still EXPOSE port a high port (3000) as long as lexicographically it appeared before the base image's port 443, and the Host would pick that one for my app.

I still have a bruise on my forehead from the violent D'OHs I gave that day.

On a slightly higher level than this inside-baseball arcana, though, this "feature" also shows how leaky the Docker abstraction is: a child image is not only highly constrained by what the parent image does (you can't close/unexpose/override ports the parent exposes), it (or its auhor) needs to have intimate knowledge of its parent's low-level details. Philosophically, that's somewhat contrary to the Docker ideal of every piece of software being self-contained. Coming at it from the software world, if I saw a piece of object-oriented code with a class hierarchy where a derived class had to know, be mindful of, or override a lot of the parent class's attributes, that'd be a code smell I'd want to get rid of pretty quickly.

Conclusion: Close, But Not Quite There

There is no question Docker is a very impressive and useful piece of software. Coupled with great, state-of-the-art tooling (such as the container tools available from AWS and other places), and some detailed understanding of Docker internals, it's a compelling method for deploying and scaling software quickly and securely.

But in a resource-constrained environment (a small team, or a team with no dedicated ops resource with significant Docker experience), I doubt I'd deploy Docker on a large scale until some of its issues are resolved. Its innate compatibility with ephemeral resources like web app instances also makes it awkward to use with long-running services like databases (also known as persistence layers, so you know they tend to stick around). So you'll likely end up with a mixed infrastructure (Docker for certain things, traditional servers for others; Dockerfiles here, Ansible there; git push deploys here, yum updates there), or experience the ordeal joy of setting up a database in Docker.

Adding to the above, the Docker ecosystem also has a history of shipping code and tools with significant bugs, stability problems, or non-backward-incompatible changes. Docker for Mac shipped out of beta with show-stopping, CPU-melting bugs. The super common use case of running apps in Docker on dev using code in a shared folder on the host computer was only resolved properly a few months ago; prior to that, inotify events when you modified a file in a shared, mounted folder on the host would not propagate into the container, and so apps that relied on detecting file changes for hot reloads (e.g. webpack in dev mode, or Flask) failed to detect the change and kept serving stale code. Before Docker for Mac came out, the "solution" was to rsync your local folder into its alter ego in the container so the container would "see" the inotify events and trigger hot reloads; an ingenious, effective, but brittle and philosophically bankrupt solution that gave me the fantods.

Docker doesn't make the ops problem go away; it just moves it somewhere else. Someone (you) still has to deal with it. The promise is ambitious, and I feel it'll be closer to delivering on that promise in a year or two. We'll just have to deal with questionable tooling, impenetrable documentation, and doubtful stability for a while longer.

2016-09-02

Python import basics in plain English

In my own experience learning Python, and that of others on Python teams I've worked with, a common hurdle is understanding how Python does imports.

The basics are actually very simple, but the documentation tends to be a little neckbeardy and dense, and hard to grok if you're new to the language. So I thought I'd list common, simple practical examples of Python imports in case they help someone.

To import data and functions from somewhere else (another .py file in your project, a standard library like os, or a third-party library you may have installed with pip), you have the following options:

import <module>
import <module> as <other_name>
from <module> import a, b, c
from <module> import *

Let's look at what these options mean.

import <module>

import <module> means the program you're in can access everything that is defined in <module> (variables, classes, functions, etc), and you have to prepend "<module>." to those things. For example, if  is the "os" library (it comes with Python), which defines a function called getpid and a variable named name, your program can do this:

>>> import os
>>> os.name
'posix'
>>> os.getpid()
51678

This works with your own libraries too. Say you created a Python file named network_functions.py which contains a constant named BANDWIDTH and a method named connect(url), you can do:

>>> import network_functions
>>> network_functions.BANDWIDTH
1024
>>> network_functions.connect('https://google.com')
Connecting...

If you don't want to have to type out the whole prefix (which can get unwieldy if your imports are nested (the modules are subdirectories), e.g. import lib.network.connection_functions), you have the following options:

import <module> as <other_name>

This lets you use <other_name> instead of the module's full name. 

>>> import lib.network.connection_functions as netfunc
>>> netfunc.BANDWIDTH
1024
>>> netfunc.connect('https://google.com')
Connecting...

from <module> import a, b, c

This lets you import only what you need from module into your current program's namespace. This means everything you imported from the external module can be called by its bare name in your program:

>>> from lib.network.connection_functions import BANDWIDTH, connect
>>> BANDWIDTH
1024
>>> connect('https://google.com')
Connecting...

from <module> import *

Import everything from the imported module into your current program's namespace, so you can call everything from the module by its bare name. This is strongly not recommended, and I'll explain why.

>>> from lib.network.connection_functions import *
>>> BANDWIDTH
1024
>>> connect('https://google.com')
Connecting...

This is almost never a good idea, because you don't always know or control what is defined in an external module, and there can be name collisions, e.g. functions or variables with the same name, so you may not be using the variable or function you expect! For example:

In file helpers.py:
def connect(url):
    print "Connecting to", url

In file network.py:
def connect(url):
    print "Hacking into", url

>>> from helpers import *
>>> from network import *
>>> connect('https://google.com')
# which one is called?

>>> from network import *
>>> from helpers import *
>>> connect('https://google.com')
# which one is called?

This example may seem contrived, but it's very common to import a bunch of modules written by different people, and some variable or function names are common or obvious enough that they may appear several times in different modules. Why wouldn't they, after all? Joe doesn't know about Jill's (or your own) module, so they have no reason to coordinate and ensure they're not using the same function names. 

If you use from <module> import * with several modules, the odds are very good you'll call a function and actually invoke one that's not the one you expect. And that can be really tricky to debug. 

So what should you do if you do need a bunch of functionality from a module and don't want to import every single function and variable by name with:

from <module> import var1, var2, var3, fun1, fun2, fun4 # etc 

It's simple! Don't use from <module> import *. Instead, use import <module> and presto, your program can use everything from <module>, as long as you prefix <module>. before the names.

>>> import os
>>> os.getpid()
'posix'

Handy Tips

Do you ever want to know the variables or functions defined in a module you imported without having to Google them? Just use vars or dir:

>>> import os
>>> dir(os)
['EX_CANTCREAT', 'EX_CONFIG', 'EX_DATAERR', 'EX_IOERR', 'EX_NOHOST', 'EX_NOINPUT', 'EX_NOPERM', 'EX_NOUSER', 'EX_OK', 'EX_OSERR', 'EX_OSFILE', 'EX_PROTOCOL', 'EX_SOFTWARE', 'EX_TEMPFAIL', 'EX_UNAVAILABLE', 'EX_USAGE', 'F_OK', 'NGROUPS_MAX', 'O_APPEND', 'O_ASYNC', 'O_CREAT', 'O_DIRECTORY', 'O_DSYNC', 'O_EXCL', 'O_EXLOCK', 'O_NDELAY', 'O_NOCTTY', 'O_NOFOLLOW', 'O_NONBLOCK', 'O_RDONLY', 'O_RDWR', 'O_SHLOCK', 'O_SYNC', 'O_TRUNC', 'O_WRONLY', 'P_NOWAIT', 'P_NOWAITO', 'P_WAIT', 'R_OK', 'SEEK_CUR', 'SEEK_END', 'SEEK_SET', 'TMP_MAX', 'UserDict', 'WCONTINUED', 'WCOREDUMP', 'WEXITSTATUS', 'WIFCONTINUED', 'WIFEXITED', 'WIFSIGNALED', 'WIFSTOPPED', 'WNOHANG', 'WSTOPSIG', 'WTERMSIG', 'WUNTRACED', 'W_OK', 'X_OK', '_Environ', '__all__', '__builtins__', '__doc__', '__file__', '__name__', '__package__', '_copy_reg', '_execvpe', '_exists', '_exit', '_get_exports_list', '_make_stat_result', '_make_statvfs_result', '_pickle_stat_result', ] # and a bunch more

>>> vars(os)
{'WTERMSIG': , 'lseek': , 'EX_IOERR': 74, 'EX_NOHOST': 68, 'seteuid': , 'pathsep': ':', 'execle': , '_Environ': , ] # and a bunch more

What's Next?

In a later post, I'll cover other tricky aspects of imports, namely how Python maps import to Python code files in directories, and how to debug ImportError: No module named <module> problems that can occur depending on what directories your files are in.

Hopefully that was helpful!