Modeling a Tree in a Document Database
by Sean Cribbs
I’ve been working a lot with non-relational datastores lately (“nosql”). A particularly useful type of these is the document database, exemplified by CouchDB and MongoDB. They are schema-less, simple to access and manipulate (especially Couch), and represent data as JSON objects (or in the case of Mongo, a binary serialization format equivalent to JSON), which makes them really easy to use in a large number of dynamic languages.
Tree structures on top of SQL
Now, back in Relational Land, developers have come up with a number of patterns for representing data structures on top of regular tables in a SQL database. If you’re familiar with Rails, you’ll know some of these as acts_as_tree
and acts_as_nested_set
. The patterns are at their essence not all that different from building data structures in C with pointers and arrays. Here’s a brief overview of the two patterns for creating a tree:
acts_as_tree
This is the simplest way to represent a tree in a database. Child nodes of the tree contain foreign keys to their parent within the same table. If we have the tree with root node A with child nodes B and C, and D as a child of B, the table might look like this:
id | name | parent_id |
1 | A | NULL |
2 | B | 1 |
3 | C | 1 |
4 | D | 2 |
This is simple to understand and query. To get the root node(s), you query WHERE parent_id IS NULL
. To get the immediate children of A, you query WHERE parent_id = 1
. However, if you have a large tree, you need to perform about LOG(N)
queries to find a given node (where N is the number of nodes in the tree), and you can’t find the node in a single query.
Pros: Easy to understand, fast to insert and move
Cons: Requires multiple queries to get whole subtrees
acts_as_nested_set
The “nested set” pattern is a little more complicated but has some benefits over the standard parent-linking tree. In addition to the parent_id
column, the nested set also adds a “left” and “right” column, allowing you to get whole subtrees in a single query. Here’s how our table will look with the extra columns.
id | name | parent_id | left | right |
1 | A | NULL | 1 | 8 |
2 | B | 1 | 2 | 5 |
3 | C | 1 | 6 | 7 |
4 | D | 2 | 3 | 4 |
Direct children queries are the same as the simple tree pattern, but if you want the whole subtree under A, you can grab all of them in a single query with left BETWEEN 1 AND 8
. A good visualization of this can be seen in the original documentation for the pattern. This pattern allows fast access of deeply nested trees, and you also get the benefit that children are ordered by the left
and right
columns. However, when inserting nodes in the tree, large swaths of records need to have their left
and right
columns updated, making it expensive.
Pros: Lookup up entire subtrees with a single query (fast), intrinsic ordering of children
Cons: Slow to insert and move, due to many modifications of existing records
Trees in a document database
When trying to figure out the best way to implement a tree structure in a document database such as Couch or Mongo, several possibilities present themselves to me.
Direct storage
If your individual nodes aren’t very large and you can afford to load the whole thing in one call, storing it directly in the same document makes sense. Here’s how our above tree might look as a JSON document:
{
"name": "A",
"children": [
{"name": "B", "children": [{"name": "D"}]},
{"name": "C"}
]
}
Pros: Native tree-like data structure, intrinsic ordering of children
Cons: Could get ugly with larger and more complex documents, concurrency is limited
Parent links
Just like acts_as_tree
, you could put the ID of the parent document in each child. However, this only becomes useful if you have good indexes in Mongo or a useful view in CouchDB to make lookup efficient. Here’s how the JSON documents might look for our same tree:
{"_id": "A"}
{"_id": "B", "parent_id": "A"}
{"_id": "C", "parent_id": "A"}
{"_id": "D", "parent_id": "B"}
Pros: Simple to understand, easy to find parent
Cons: Needs good view or index to find child documents, no intrinsic ordering of children
Child lists
A variation on our parent links pattern would be to store the ids of the child documents in the parent. This would suffer similar limitations to the parent link pattern. The resulting structure might look like this:
{"_id": "A", "children": ["B", "C"]}
{"_id": "B", "children": ["D"]}
{"_id": "C"}
{"_id": "D"}
Pros: Simple to understand, easy to find children, intrinsic ordering of children
Cons: Needs good view or index to find parent documents
Branches and leaves
The final pattern I’m calling “branches and leaves”. This is sometimes seen in OO data structures where the nodes in the tree contain no data, but point to the data. The primary advantage that I see of this pattern is that you can get the whole tree structure in a single document (as in the first pattern), but the tree need not contain the data. Here’s how our tree might look:
{
"leaf": "A",
"children": [
{"leaf": "B", "children": [{"leaf": "D"}] },
{"leaf": "C"}
]
}
{"_id": "A", ...}
{"_id": "B", ...}
{"_id": "C", ...}
{"_id": "D", ...}
Pros: Two lookups to find any node, native tree data structure, data separate from tree, intrinsic ordering
Cons: Traversing from one node to another requires referring back to the tree data structure (maybe this is not a bad thing — it can be cached), concurrency is limited
Wrap-up
I’ve been mulling over these options for a few days. I’m not sure which one I will pick of the last three (the first one is not an option). One could also combine several of the patterns—for example, the parent link and children list—but the inability to perform transactions across multiple documents could quickly lead to inconsistency.
I’d appreciate your comments.