Monthly Archives: October 2011

I’ve been using django-taggit to provide a tagging model for content items in my app. However, I wanted to arrange the tags into a hierarchy/taxonomy. It’s simple enough to define a custom tag model with a parent pointer and hook it up through a custom through model, which lets you arrange your tags into a tree:

from django.db import models
from taggit.models import TagBase, ItemBase
from taggit.managers import TaggableManager

...

# the custom tag model
class HierarchicalTag (TagBase):
    parent = models.ForeignKey('self', null=True, blank=True)

# the through model
class TaggedContentItem (ItemBase):
    content_object = models.ForeignKey('ContentItem')
    tag = models.ForeignKey('HierarchicalTag', related_name='tags')

# the content item
class ContentItem (models.Model):
    tags = TaggableManager(through=TaggedContentItem, blank=True)

However, suppose you have a tree of tags like this:

Vehicle
    Car
        BMW
            Z4
        Ford
            Fiesta
        Chevrolet
            Volt

and you have content items tagged with leaves (Z4, Fiesta, Volt), but you want to search for all items tagged with anything from the ‘Car’ branch of the tree. Chances are you’ll end up writing a recursive function to gather up all the descendants of ‘Car’, which doesn’t scale because it involves many SQL queries, or using esoteric SQL syntax available only in the big database engines (and certainly not sqlite3).
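To make the scaling problem concrete, here is a minimal sketch of the recursive approach in plain Python. The dictionary is a toy stand-in for the tag table, and children_of() is a hypothetical helper that simulates one SQL query per call; none of this is taggit or Django code.

```python
# Toy stand-in for the tag table: each tag name maps to its parent's name.
PARENT = {
    "Vehicle": None,
    "Car": "Vehicle",
    "BMW": "Car",
    "Z4": "BMW",
    "Ford": "Car",
    "Fiesta": "Ford",
    "Chevrolet": "Car",
    "Volt": "Chevrolet",
}

lookups = 0  # counts the simulated "SELECT ... WHERE parent = ?" queries


def children_of(name):
    """Simulate one database query for the direct children of a tag."""
    global lookups
    lookups += 1
    return [tag for tag, parent in PARENT.items() if parent == name]


def descendants_of(name):
    """Gather all descendants recursively: one lookup per node visited."""
    found = []
    for child in children_of(name):
        found.append(child)
        found.extend(descendants_of(child))
    return found


car_descendants = descendants_of("Car")
```

Each node visited costs one lookup, including the leaves, so the number of queries grows with the size of the branch rather than staying constant.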

At work, where we use a non-relational database engine, we long ago overcame the same issue (efficient manipulation and querying of hierarchical models), so I already had an idea of what I needed to do. But, as is the way with Python and Django, I figured there would probably already be packages that implement efficient hierarchical data — and there are.

The two main contenders seem to be django-mptt and django-treebeard. I tried mptt first, mainly because the consensus seemed to be that it was smaller and easier to use, but also because it purported to allow you to add hierarchical structure to existing models by configuration, which in my case would mean I didn’t have to define a custom tag model and could attach hierarchy directly to taggit’s Tag model.

However, my experience of mptt was poor – the documentation appeared to be out of date with respect to both the version of mptt I got from pip and the latest git trunk. Also, when I tried to use mptt’s admin classes for Django, I got exceptions (I admit I didn’t try very hard to overcome them).

So I gave treebeard a go, and had a much smoother time. Treebeard implements a number of hierarchy techniques with different performance characteristics (e.g. cheap querying but expensive insertion), allowing you to choose which one suits your application’s use of the trees. In my case I went for ‘Materialised Path Trees’ because it’s the relational equivalent of the technique I’m already familiar with. Implementing hierarchical tags was a straightforward case of having my custom tag model extend treebeard’s MP_Node model which, as the name suggests, implements a node in a Materialised Path Tree:

from treebeard.mp_tree import MP_Node

...

class HierarchicalTag (TagBase, MP_Node):
    node_order_by = [ 'name' ]

class TaggedContentItem (ItemBase):
    content_object = models.ForeignKey('ContentItem')
    tag = models.ForeignKey('HierarchicalTag', related_name='tags')

class ContentItem (models.Model):
    tags = TaggableManager(through=TaggedContentItem, blank=True)

(The node_order_by is what treebeard uses to order siblings when a new node is added to the tree.) That was literally all that was needed. Going back to the ‘Car’ example, here is the code to find all ContentItems tagged with any of the descendants of ‘Car’:

# look up the Car term
car = HierarchicalTag.objects.get(name='Car')

# get a queryset of all its descendants: with treebeard this is 1 SQL statement
# use HierarchicalTag.get_tree(car) if you want to include 'Car'

treeqs = car.get_descendants()

# now find the ContentItems using an inner queryset
qs = ContentItem.objects.filter(tags__in=treeqs)
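For the curious, here is a rough sketch in plain Python of why a materialised path makes the descendant lookup a single query. Each node stores the full path from the root as a string of fixed-width segments, so “all descendants” becomes a prefix match. The segment width and the path values below are assumptions made for illustration, and treebeard’s real encoding and SQL may differ in detail.

```python
STEPLEN = 4  # fixed width of each path segment (an assumption for this sketch)

# path -> tag name; a child's path is its parent's path plus one segment.
# Siblings are numbered in name order, mirroring node_order_by = ['name'].
TAGS = {
    "0001": "Vehicle",
    "00010001": "Car",
    "000100010001": "BMW",
    "0001000100010001": "Z4",
    "000100010002": "Chevrolet",
    "0001000100020001": "Volt",
    "000100010003": "Ford",
    "0001000100030001": "Fiesta",
}


def depth(path):
    """Depth of a node is the number of segments in its path."""
    return len(path) // STEPLEN


def descendants(path):
    """A single prefix filter, the analogue of WHERE path LIKE 'prefix%'."""
    return [name for p, name in sorted(TAGS.items())
            if p.startswith(path) and len(p) > len(path)]


car_branch = descendants("00010001")
```

Sorting by path also yields the nodes in depth-first order for free, which is handy for rendering the tree.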

I was trying out Piston to build an API for my Django app. I wrote a simple REST URL to return all of the 399 records in my database, encoded as JSON.

To my surprise, it was taking 11 seconds. I wondered whether I was doing something horrendous in SQL. I normally rely on django-debug-toolbar to tell me these things, but debug-toolbar injects itself into an HTML page. With my API returning JSON, it wasn’t going to work.

Now Piston supports a variety of formats: JSON, XML, YAML, and so on. It does so via what it calls emitters, which are classes that implement the encoding of your return data. Since I needed HTML for debug-toolbar, I hacked up an HTML emitter:

from piston.emitters import Emitter
from django.http import HttpResponse
from django.utils import simplejson
# thanks to marteinn_se for this bit (see comments)
from django.core.serializers.json import DateTimeAwareJSONEncoder

class HTMLEmitter( Emitter ):
    def render( self, request ):
        data = self.construct()
        json_dump = simplejson.dumps(data, cls=DateTimeAwareJSONEncoder, ensure_ascii=False, indent=4)
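        # debug-toolbar needs an HTML page with a closing body tag to inject itself into
        return HttpResponse('<html><body><pre>%s</pre></body></html>' % json_dump)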

Emitter.register('html', HTMLEmitter, 'text/html')
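The same idea can be tried without Piston or Django. The sketch below uses only the standard library (Python 3): it dumps data as indented JSON, with dates rendered as ISO strings, and wraps the result in a minimal HTML page. DateTimeEncoder and render_html are hypothetical names for this illustration, not part of Piston, and the HTML escaping is an extra precaution that the emitter above does not take.

```python
import datetime
import html
import json


class DateTimeEncoder(json.JSONEncoder):
    """Hypothetical encoder: serialise dates and datetimes as ISO 8601 strings."""

    def default(self, obj):
        if isinstance(obj, (datetime.datetime, datetime.date)):
            return obj.isoformat()
        return super().default(obj)


def render_html(data):
    """Pretty-print data as JSON inside a minimal HTML page."""
    dump = json.dumps(data, cls=DateTimeEncoder, ensure_ascii=False, indent=4)
    # Escape the dump so that '<' or '&' in the data cannot break the page.
    return "<html><body><pre>%s</pre></body></html>" % html.escape(dump)


page = render_html({"title": "a < b", "created": datetime.date(2011, 10, 1)})
```

The closing body tag matters here because, as far as I can tell, that is where debug-toolbar looks to insert its own markup.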

Then after following the instructions on how to edit my urls.py to use the emitter, I went to a REST URL in the browser, and there was the debug toolbar, telling me that it spent 300ms in SQL and that I was going to have to look elsewhere for the other 10700ms. (*)

By the way – if you try this and you get an exception about __name__, that’s because of a bug in version 0.8.5 of debug-toolbar (the latest release at the time of writing). It’s been resolved in the debug-toolbar trunk, which you can fetch and install with:

git clone https://github.com/django-debug-toolbar/django-debug-toolbar
cd django-debug-toolbar
python setup.py install

(*) P.S. The slowness is down to fetching data out of the database. Through judicious use of the QuerySet methods values_list() and select_related() I’ve halved the time taken. Still much too slow, but getting there…

I found myself needing to write a Django view to serve a file. The path to the file is held in the database, and is not in the URL space, so using Django’s static file stuff isn’t appropriate.

It’s easy once you twig that an HttpResponse object is a file-like object; just open the source file and copy it to the response object. This was my first attempt:

with open(path, 'rb') as f:
    response = HttpResponse(mimetype=file.mime_type)
    copyfileobj(f, response)
    return response

(copyfileobj is from Python’s shutil module.)

It works, though I was mildly concerned that the copy presumably causes the whole file to be read into memory before being served, rather than being written directly down the socket to the client. There is a more elegant way to do it:

f = open(path, 'rb')
return HttpResponse(content=f, mimetype=file.mime_type)

Since a Python file object is an iterable, and content can be an iterable or a string, this works. What I’m not sure about, though, is when or how the file f gets closed.
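One way to make the closing explicit is to wrap the file in a generator that reads fixed-size chunks inside a with block. This is only a sketch in plain Python 3. file_chunks is a hypothetical helper, and passing it as the response content is an assumption I haven’t verified against Django. The file is closed when the generator is exhausted; if the consumer stops iterating early, the file stays open until the generator is closed or garbage collected.

```python
import os
import tempfile


def file_chunks(path, chunk_size=8192):
    """Yield the file in chunks; the with block closes it once iteration ends."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk


# Usage: write a small temporary file and read it back in 4-byte chunks.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789")
chunks = list(file_chunks(tmp.name, chunk_size=4))
os.remove(tmp.name)
```

Reading fixed-size chunks also avoids iterating a binary file line by line, where a “line” can be arbitrarily long.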

I had the following Python code to invoke ffmpeg and extract a frame from a movie:

from subprocess import Popen, PIPE
...
args = [
    'ffmpeg',
    '-y',
    '-i', filein,
    '-r', '1',
    '-vframes', '1',
    '-f', 'image2',
    '-t', '00:00:01',
    '-ss', '20',
    '-',
]
p = Popen(args, stdout=PIPE, stderr=PIPE)
( out, err ) = p.communicate()

(At the end of this, out contains the bytes of the frame.)

This was working fine from a standalone Python test program, but when invoked as a celery task, ffmpeg would fail to decode the video. In the celery log (running celeryd in verbose mode, python manage.py celeryd -l info) the errors looked like this:

[mpeg4 @ 0x464d90]illegal MB_type
[mpeg4 @ 0x464d90]Error at MB: 1338
[mpeg4 @ 0x464d90]concealing 1344 DC, 1344 AC, 1344 MV errors...

At first I thought perhaps the environment that the task ran in was different, so I tried a few things:

  • hard-wiring the path to ffmpeg in the code (in case it was picking up a different ffmpeg from somewhere)
  • examining the ffmpeg output, which is quite verbose about the details of how it was built, to establish the right ffmpeg was being run
  • capturing the output of the env shell command and comparing it to the working environment, to look for salient differences

None of this led anywhere.

Since the code worked standalone, I figured it was something about how the subprocess was being invoked. Most likely, I thought, seeing as the working version was being run from an interactive shell, it was something to do with file descriptors and standard input/output.

My code wasn’t dealing with stdin explicitly, so I made it do so by opening a pipe and sending an empty string (the argument to communicate()):

p = Popen(args, stdin=PIPE, stdout=PIPE, stderr=PIPE)
( out, err ) = p.communicate('')

Turns out my hunch was right, and this worked just fine. I’m not sure why it’s necessary, and at some point I’ll return and experiment a bit more.
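Here is a small self-contained sketch of the same pattern in Python 3, where communicate() takes bytes rather than a string. It runs a stand-in child process (the Python interpreter itself, not ffmpeg) that reports how much it read from stdin, to show that the child gets its own pipe and sees end-of-file straight away instead of inheriting whatever stdin the parent has. My guess, and it is only a guess, is that ffmpeg reads stdin for its interactive key commands and gets confused by what it inherits from the worker; newer ffmpeg builds have a -nostdin option that may be worth trying too.

```python
import subprocess
import sys


def run_with_empty_stdin(args):
    """Run a command with its own stdin pipe that is closed immediately."""
    p = subprocess.Popen(args, stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # communicate() sends the (empty) input, closes stdin, and waits for exit.
    out, err = p.communicate(b"")
    return p.returncode, out, err


# Stand-in child process (not ffmpeg): reports how many bytes it read from stdin.
child = [sys.executable, "-c",
         "import sys; sys.stdout.write(str(len(sys.stdin.read())))"]
returncode, out, err = run_with_empty_stdin(child)
```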

A quick one, since it took me a bit of googling to find the answer.

It seems that there’s something amiss with the packaging of PIL (the Python Imaging Library) if you use easy_install to install it. The error you’ll get when you try to import PIL is:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named PIL

Using pip to remove and then re-install it worked for me:

pip uninstall PIL
pip install PIL

(BTW if you’re using virtualenv, then you already have pip.)

Before you get too excited though, try the following in the Python interpreter:

import _imaging

If you get the following error, and you’re on a Mac, it’s because of an architecture clash between your libjpeg and the PIL build:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: dlopen(/blah/blah/python2.6/site-packages/PIL/_imaging.so, 2): Symbol not found: _jpeg_resync_to_restart
  Referenced from: /blah/blah/python2.6/site-packages/PIL/_imaging.so
  Expected in: flat namespace
 in /blah/blah/python2.6/site-packages/PIL/_imaging.so

My understanding of what’s going on is that Python is a 64-bit (x86_64) application attempting to dynamically load a 32-bit (i386) shared library, which fails.
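If you want to check the interpreter’s side of that theory, the standard library can tell you whether the running Python is a 32-bit or 64-bit build. This is just a quick sketch in Python 3; it says nothing about the shared library itself, which you would need to inspect separately (for example with the file command).

```python
import platform
import struct

# Pointer size in bits for the running interpreter: 32 or 64.
interpreter_bits = struct.calcsize("P") * 8

# platform.machine() reports the hardware name, e.g. 'x86_64' or 'i386'.
machine = platform.machine()

print("Python is %d-bit on %s" % (interpreter_bits, machine))
```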

Anyway – I had libjpeg installed via the fink package manager. I removed it and installed it with homebrew instead, and the problem went away:

brew install jpeg

If that doesn’t work for you, this thread may have an answer.

Update: I’ve since found out about Pillow, which appears to be a maintained PIL fork. Might be one to check out.
