Pulling data from CouchDB to a Relational Database made easy with _changes

couchdb logo(This is the first post in Dimagi’s new tech blog series, where we shift the focus from the impact of our projects to the technology side of our work.)

Much has been written and debated about the various advantages and disadvantages of different document databases, and by far the most common choice in evaluating a document DB is a choice between CouchDB and MongoDB. This post isn’t going to get into that debate, but at Dimagi we’ve been using CouchDB succcessfully in multiple production projects for over a year now and overall we’re really happy with it. CouchDB has some features that made it an obvious choice for our products, and in particular its crash recovery and built-in replication have been life-savers. But for each time we’ve sung CouchDB’s praises and sworn our alleigance, we’ve cursed it for making certain remedial tasks truly painful.

Perhaps at the top of the list of “things that are annoying in CouchDB” is general reporting. Here we’ll look at how to work around this problem by hooking into CouchDB’s _changes feed and storing your data redundantly in something that’s much easier to work with (a relational database).

First, the problem

Couch’s map/reduce is great for extracting aggregated data from a very specific set of information in your database, but is terrible for things like “show me all data between January and June 2011, broken down by user and location”. One thing CouchDB does to help with this is let you use a complex key, to provide different levels of aggregation of your data. You can get pretty far with complex keys if you know the order of things you want to query.

But what if you want to aggregate on different things? Sometimes by place, then by year, sometimes by year, then by place. Unfortunately, because aggregation only happens in the order defined by the keys, this isn’t really possible natively. Instead, you’re left with a couple unattractive options:

Option 1: Writing a new view that provides access to the data with a different key hierarchy for each possible order:

view_by_place.js

emit([doc.place, doc.year], doc.value);

view_by_year.js

emit([doc.year, doc.place], doc.value);

Now you can choose which view to query based on the ordering, but this adds a lot of overhead both in terms of duplication of code, needless extra views, and remembering which one to query.

Option 2: Using the first element of the key as an index into the rest of the key.

This is a trick where you encode the logic directly into the view using the first element of the key.

view_both.js

emit(["place_first", doc.place, doc.year], doc.value);
emit(["year_first", doc.year, doc.place], doc.value);

Now you can query with something like key=[“place_first”, “boston”] or key=[“year_first”, 2011]. But, again, this is a pretty cluttered approach and as the number of different orders increases can become quite arduous to manage. It also deeply couples your queries to your views, and makes it challenging to add new queries later.

What we really want is a Plain Old Relational Database (PORD) to write arbitrary queries against. It turns out that this is actually quite easy.

_ch _ch _ch _ch _changes!

BowieOne of the the main engines inside CouchDB is the _changes feed. This thing powers all of replication and makes it really easy to build ancillary applications on your database. Essentially it is a line-by-line feed of all activity that has occurred in your database (pruned for redundancy). With this you can easily write a standalone tool that listens for changes to the database and does something with them – in this case storing data in a Plain Old Relational Database.

The pattern is very simple:

1. Connect to _changes feed
2. For every line in the feed:
a. Determine if you want to act on it
b. (If yes) do something about it

The details

We use couchdbkit and django to interact with CouchDB and our Relational Database. Couchdbkit makes interacting with the _changes feed trivial through the Consumer object. First we define a callback function that we want executed for each line (which actually comes in as a dictionary in couchdbkit). Our callback will get the document associated with the line and, do an export to a django model. Some details are emitted for clarity:


def export(line):
  # line is like:
  # {"seq":3,"id":"test3","changes":[{"rev":"1-02c6b758b08360abefc383d74ed5973d"}]}
  doc_id = line.get("id")
  doc = db.get(doc_id) # assumes db is defined
  if doc.exportable:   # assumes doc has some property "exportable"
    # and some other properties to query on
    MyModel.objects.create(id=doc.get_id, place=doc.place, date=doc.date)

 

Once we have this function we initiate a new Consumer object and reference the callback:

c = Consumer(db)
c.wait(cb=export)

That’s it! This process just needs to be started and will happily sit for all of eternity waiting for incoming changes and immediately act on them by creating models in the database.

There are some additional optimizations and details to this (see below), but that’s the entire gist of it.

Pruning the data with filters

If you know you only care about data matching a certain pattern, then you can use CouchDB filters [LINK: http://guide.couchdb.org/draft/notifications.html#filters] to pre-restrict the data so that only documents matching a certain filter show up in it. For example, instead of doing the “doc.exportable” check in python we can only include documents with a property of exportable set to true in the feed using the following:

foofilter.js

function (doc, req) {
  return doc.exportable == true;
}

Anything that doesn’t return “true” will not be included in the feed. This can negate the need for the check in the export function, since we can use the filter to only include relevant documents.

Django integration

Since we’re going to save our objects in the relational database we want to be able to run this in a bootstrapped django environment. The easiest way to do this is with a management command.

mymodel_export.py


from django.core.management.base import LabelCommand
from couchdbkit.client import Database
from couchdbkit.consumer import Consumer
from myapp.models import MyModel

db = Database("couchdbname")

def export(line):
  # line is like:
  # {"seq":3,"id":"test3","changes":[{"rev":"1-02c6b758b08360abefc383d74ed5973d"}]}
  doc_id = line.get("id")
  doc = db.get(doc_id) # assumes db is defined
  # and some other properties to query on
  MyModel.objects.create(id=doc.get_id, place=doc.place, date=doc.date)

class Command(LabelCommand):
  def handle(self, *args, **options):
    c = Consumer(db)
    c.wait(filter="myapp/foofilter", cb=export)

 

There are a few more details to pay attention to – in particular you don’t want to act on documents that have been deleted (which can still show up in the _changes feed) or old revisions of documents, but for the most part the code above is fully functional. Couch to SQL in 20 lines of code!

p.s. Our use case was storing the structured parts of unstructured xform submissions in django so that we can report on them. The full code that provides the skeleton for this example is available on github, and in particular the management command.

Share

Tags

Similar Articles

The World's Most Powerful Mobile Data Collection Platform

Start a FREE 30-day CommCare trial today. No credit card required.

Get Started

Learn More

Get the latest news delivered
straight to your inbox