Modeling a Tree in a Document Database

I’ve been working a lot with non-relational datastores lately (“nosql”). A particularly useful type of these is the document database, exemplified by CouchDB and MongoDB. They are schema-less, simple to access and manipulate (especially Couch), and represent data as JSON objects (or in the case of Mongo, a binary serialization format equivalent to JSON), which makes them really easy to use in a large number of dynamic languages.

Tree structures on top of SQL

Now, back in Relational Land, developers have come up with a number of patterns for representing data structures on top of regular tables in a SQL database. If you’re familiar with Rails, you’ll know some of these as acts_as_tree and acts_as_nested_set. The patterns are at their essence not all that different from building data structures in C with pointers and arrays. Here’s a brief overview of the two patterns for creating a tree:

acts_as_tree

This is the simplest way to represent a tree in a database. Child nodes of the tree contain foreign keys to their parent within the same table. If we have the tree with root node A with child nodes B and C, and D as a child of B, the table might look like this:

id name parent_id
1 A NULL
2 B 1
3 C 1
4 D 2

This is simple to understand and query. To get the root node(s), you query WHERE parent_id IS NULL. To get the immediate children of A, you query WHERE parent_id = 1. However, if you have a large tree, you need to perform about LOG(N) queries to find a given node (where N is the number of nodes in the tree), and you can’t find the node in a single query.

Pros: Easy to understand, fast to insert and move

Cons: Requires multiple queries to get whole subtrees

acts_as_nested_set

The “nested set” pattern is a little more complicated but has some benefits over the standard parent-linking tree. In addition to the parent_id column, the nested set also adds a “left” and “right” column, allowing you to get whole subtrees in a single query. Here’s how our table will look with the extra columns.

id name parent_id left right
1 A NULL 1 8
2 B 1 2 5
3 C 1 6 7
4 D 2 3 4

Direct children queries are the same as the simple tree pattern, but if you want the whole subtree under A, you can grab all of them in a single query with left BETWEEN 1 AND 8. A good visualization of this can be seen in the original documentation for the pattern. This pattern allows fast access of deeply nested trees, and you also get the benefit that children are ordered by the left and right columns. However, when inserting nodes in the tree, large swaths of records need to have their left and right columns updated, making it expensive.

Pros: Lookup up entire subtrees with a single query (fast), intrinsic ordering of children

Cons: Slow to insert and move, due to many modifications of existing records

Trees in a document database

When trying to figure out the best way to implement a tree structure in a document database such as Couch or Mongo, several possibilities present themselves to me.

Direct storage

If your individual nodes aren’t very large and you can afford to load the whole thing in one call, storing it directly in the same document makes sense. Here’s how our above tree might look as a JSON document:

{
  "name": "A",
  "children": [
    {"name": "B", "children": [{"name": "D"}]},
    {"name": "C"}
  ]
}

Pros: Native tree-like data structure, intrinsic ordering of children

Cons: Could get ugly with larger and more complex documents, concurrency is limited

Parent links

Just like acts_as_tree, you could put the ID of the parent document in each child. However, this only becomes useful if you have good indexes in Mongo or a useful view in CouchDB to make lookup efficient. Here’s how the JSON documents might look for our same tree:

{"_id": "A"}
{"_id": "B", "parent_id": "A"}
{"_id": "C", "parent_id": "A"}
{"_id": "D", "parent_id": "B"}

Pros: Simple to understand, easy to find parent

Cons: Needs good view or index to find child documents, no intrinsic ordering of children

Child lists

A variation on our parent links pattern would be to store the ids of the child documents in the parent. This would suffer similar limitations to the parent link pattern. The resulting structure might look like this:

{"_id": "A", "children": ["B", "C"]}
{"_id": "B", "children": ["D"]}
{"_id": "C"}
{"_id": "D"}

Pros: Simple to understand, easy to find children, intrinsic ordering of children

Cons: Needs good view or index to find parent documents

Branches and leaves

The final pattern I’m calling “branches and leaves”. This is sometimes seen in OO data structures where the nodes in the tree contain no data, but point to the data. The primary advantage that I see of this pattern is that you can get the whole tree structure in a single document (as in the first pattern), but the tree need not contain the data. Here’s how our tree might look:

{
  "leaf": "A",
  "children": [
    {"leaf": "B", "children": [{"leaf": "D"}] },
    {"leaf": "C"}
  ]
}
{"_id": "A", ...}
{"_id": "B", ...}
{"_id": "C", ...}
{"_id": "D", ...}

Pros: Two lookups to find any node, native tree data structure, data separate from tree, intrinsic ordering

Cons: Traversing from one node to another requires referring back to the tree data structure (maybe this is not a bad thing — it can be cached), concurrency is limited

Wrap-up

I’ve been mulling over these options for a few days. I’m not sure which one I will pick of the last three (the first one is not an option). One could also combine several of the patterns—for example, the parent link and children list—but the inability to perform transactions across multiple documents could quickly lead to inconsistency.

I’d appreciate your comments.

blog comments powered by Disqus