Article: Hierarchical data in MySQL: easy and fast »
FERDY CHRISTANT - MAR 27, 2009 (11:44:00 PM)
Storing hierarchical data in a relational database is a classic problem. Relational databases are not really designed for this purpose, because they store data in flat rows. Storage is actually the least of our problems. Imagine a commenting system with an unlimited level of nesting in replies. This is easy to store in a relational database:
|2||No it doesn't!
|3||Oh yes it does!
|4||The economy sucks
For the sake of simplicity, I have omitted other comment fields such as author, date, etc. The table shows that we can easily store unlimited levels of nesting by simply linking to the parent of the current comment. Comments that have no parent are the root comments, the others are leafs. If we would apply a foreign key to the parent_id column and link it to the id column we even get full referential integrity.
So far so good? Yes, but the real problems start when we want to query that table for useful stuff. Imagine the following tasks:
- Get the full comment tree sorted in the correct order. Hint: you cannot use date sorting, since replies can be made anywhere and must be positioned below their parent at all times.
- For each comment in the tree, we want to know how deep it is (how many levels deep), so we can use that for indenting the comment below its parent.
- For each comment we want to know how many replies there are, no matter how deep in the tree.
None of the above tasks can be queried efficiently using the table we started with. The problem is that since we do not know how many levels deep the hierarchy is, we need to make recursive calls, leading to many queries to the database, high memory usage in our scripts and generally a system that does not scale.
As said, this is a classic problem, and luckily there are existing solutions and patterns for this. This article explains four different approaches, I highly recommend it. In summary, these are the four methods:
- Recursive. The approach we already discussed. This approach is inefficient and does not scale.
- Stack. This one is not much different from the recursive approach. The idea is that you build a stack of hierarchical data by looping through the rows, typically inside a stored procedure. Once you have the stack built, it is very easy to use and query, however, it still requires a lot of database connections.
- Modified Preorder Tree Traversal. This surely is the most sophisticated, but also most complex, of the methods to work with hierarchical data. I encourage you to read the full description to really understand it. The principle of this method is that we assign smart numbers (left and right) to each leaf in the tree, these numbers indicate the relative position of the node towards other nodes. With a tree preordered like this, querying the hierarchical data becomes very fast and easy. There is a major downfall though: as soon as you insert or change something in the tree, it has to be rebuild. This will lead to many update queries, and these are expensive. The larger the tree, the larger the time of inserting a new node.
- Flat table model. Surely we know that if we could preprocess things like the level of the comment (how deep is it) and the rank (sort order), querying would become much easier. The author also admits that this does shift the performance problem to the writes: upon inserting a comment, all others ranks have to be recalculated. Anyways, the idea of this model is that you store essential attributes of the tree, instead of calculating them each time we retrieve the tree. Kind of like a tree cache.
Based on the options presented so far, it seems the flat table model looks promising, yet it does have a major limitation: a very expensive write operation. Luckily, one of the commenters of the article posted a way around this. I have taken his idea as a basis, and extended it somewhat.
The Flat Table Model done right
Let's just get to it right away. Here's my(heavily inspired) solution:
|2||No it doesn't!
|3||Oh yes it does!
|4||The economy sucks
Instead of using a rank, we are using a lineage. They have the same purpose (correctly ordering the comments), it's just that a lineage has one major advantage: it never needs to be updated! As you can see, the lineage column displays the hierarchy ids in a flat column (from highest parent to lowest child). This field is calculated upon insertion, no other inserts need to take place. The deep column simply sets the nesting level of the comment, which is handy when we need this value for indentation later on. By storing it, we require no recursive queries at all.
Let's go into a little bit more detail concerning the insertion of a new comment. Here's how it works:
- in our save routine of a new comment, we check whether it came from a parent (is parent_id set?)
- if so, we do a query to retrieve the parent row
- next, we calculate the lineage and deepness of the new comment. For lineage, we use the parent's lineage. For deepness, we use the parent's deepness + 1.
- we now insert our new comment. next, we use the returned comment id of the new comment, add it to the lineage of the newly created comment, and save it again (update).
All in all, we are retrieving a row, inserting one, and updating one. Considering the other highly inefficient methods and knowing that most people will read comments and not write them, this is pretty awesome. We have no recursion, not even a regular loop.
Querying this structure is even better. It is super easy to get all comments in the right order, get the indentation, and even the replies per comment, no matter how many levels of nesting we have. This query kind of combines these three tasks:
SELECT c.id, c.user_id, c.body, c.deep, c.lineage, c.parent_id, (SELECT COUNT(*) FROM comment where comment.lineage LIKE (CONCAT(c.lineage,'%')) AND comment.lineage!=c.lineage) as replies
FROM comment as c
order by c.lineage
Careful readers will see that we are using a subquery to count the replies for each comment. This part does have some performance overhead, but not much. If you do not need to show the number of replies to each comment, leave it our for even faster results. Or, you could take this even further by storing the number of replies in the database. The downfall of that approach is that you will need to recursively update all parents of the current comment row. That's not as bad as it sounds, it is not likely that you will have more than a handful of nesting levels for a single comment, in fact, most will have none, one or two.
What about referential integrity? Since we have flattened the lineage and deep columns, we have lost some of that. No worries though, we are keeping our referential parent_id -> id foreign key. We do have referential integrity at the foundation. Should anything go wrong with the lineage or deep columns, then we can easily use the parent_id -> id relationship to "rebuild" those values.
The advanced flat model is optimized for a comment system, it should not be used everywhere. Particularly it is suitable for wide trees (little nesting), not deep trees. The reason for this limitation is the lineage column, which flattens the hierarchy in a single column. When there are too many levels of nesting (let's say over 20), a different model may be more suitable.
Yet another way
Throughout my online research, I found yet another way to work with hierarchical data. This approach is again a flat model approach, but this time the author reserves one column per nesting level in the table to store the deepness. The problem is that this results in a hardcoded nesting level limit. The other problem is that it is patented. Go figure. That's like patenting breathing.
There are multiple approaches towards storing and retrieving hierarchical data in (my)SQL, but for comment systems and other wide tree hierarchies, I hope you like my tweaked flat model approach. It has no recursion, not even loops, fast retrieves, the least writes, and a solid foundation of referential integrity. Credits go out to the author of the article and the commenter that suggested the lineage method. I only extended those ideas a little bit.
In closing, I like to end with a small tip. In many online examples, I see people looping through a depth counter to append things like spaces for indentation. This is not needed, almost every language has a function for this, here it is in PHP:
...where the first param is the indentation string, and the second param is the number of indentations.