Article: Hierarchical data in MySQL: easy and fast »
FERDY CHRISTANT - MAR 27, 2009 (11:44:00 PM)
Storing hierarchical data in a relational database is a classic problem. Relational databases are not really designed for this purpose, because they store data in flat rows. Storage is actually the least of our problems. Imagine a commenting system with an unlimited level of nesting in replies. This is easy to store in a relational database:
| comment_id |
body |
parent_id |
| 1 | JungleDragon rocks |
|
| 2 | No it doesn't! |
1 |
| 3 | Oh yes it does! |
2 |
| 4 | The economy sucks |
|
| 5 | True | 4 |
For the sake of simplicity, I have omitted other comment fields such as author, date, etc. The table shows that we can easily store unlimited levels of nesting by simply linking to the parent of the current comment. Comments that have no parent are the root comments, the others are leafs. If we would apply a foreign key to the parent_id column and link it to the id column we even get full referential integrity.
So far so good? Yes, but the real problems start when we want to query that table for useful stuff. Imagine the following tasks:
- Get the full comment tree sorted in the correct order. Hint: you cannot use date sorting, since replies can be made anywhere and must be positioned below their parent at all times.
- For each comment in the tree, we want to know how deep it is (how many levels deep), so we can use that for indenting the comment below its parent.
- For each comment we want to know how many replies there are, no matter how deep in the tree.
None of the above tasks can be queried efficiently using the table we started with. The problem is that since we do not know how many levels deep the hierarchy is, we need to make recursive calls, leading to many queries to the database, high memory usage in our scripts and generally a system that does not scale.
As said, this is a classic problem, and luckily there are existing solutions and patterns for this. This article explains four different approaches, I highly recommend it. In summary, these are the four methods:
- Recursive. The approach we already discussed. This approach is inefficient and does not scale.
- Stack. This one is not much different from the recursive approach. The idea is that you build a stack of hierarchical data by looping through the rows, typically inside a stored procedure. Once you have the stack built, it is very easy to use and query, however, it still requires a lot of database connections.
- Modified Preorder Tree Traversal. This surely is the most sophisticated, but also most complex, of the methods to work with hierarchical data. I encourage you to read the full description to really understand it. The principle of this method is that we assign smart numbers (left and right) to each leaf in the tree, these numbers indicate the relative position of the node towards other nodes. With a tree preordered like this, querying the hierarchical data becomes very fast and easy. There is a major downfall though: as soon as you insert or change something in the tree, it has to be rebuild. This will lead to many update queries, and these are expensive. The larger the tree, the larger the time of inserting a new node.
- Flat table model. Surely we know that if we could preprocess things like the level of the comment (how deep is it) and the rank (sort order), querying would become much easier. The author also admits that this does shift the performance problem to the writes: upon inserting a comment, all others ranks have to be recalculated. Anyways, the idea of this model is that you store essential attributes of the tree, instead of calculating them each time we retrieve the tree. Kind of like a tree cache.
Based on the options presented so far, it seems the flat table model looks promising, yet it does have a major limitation: a very expensive write operation. Luckily, one of the commenters of the article posted a way around this. I have taken his idea as a basis, and extended it somewhat.
The Flat Table Model done right
Let's just get to it right away. Here's my(heavily inspired) solution:
| comment_id |
body |
parent_id |
lineage | deep |
| 1 | JungleDragon rocks |
1 |
0 | |
| 2 | No it doesn't! |
1 |
1-2 |
1 |
| 3 | Oh yes it does! |
2 | 1-2-3 |
2 |
| 4 | The economy sucks |
4 |
0 | |
| 5 | True | 4 | 4-5 | 1 |
Instead of using a rank, we are using a lineage. They have the same purpose (correctly ordering the comments), it's just that a lineage has one major advantage: it never needs to be updated! As you can see, the lineage column displays the hierarchy ids in a flat column (from highest parent to lowest child). This field is calculated upon insertion, no other inserts need to take place. The deep column simply sets the nesting level of the comment, which is handy when we need this value for indentation later on. By storing it, we require no recursive queries at all.
Let's go into a little bit more detail concerning the insertion of a new comment. Here's how it works:
- in our save routine of a new comment, we check whether it came from a parent (is parent_id set?)
- if so, we do a query to retrieve the parent row
- next, we calculate the lineage and deepness of the new comment. For lineage, we use the parent's lineage. For deepness, we use the parent's deepness + 1.
- we now insert our new comment. next, we use the returned comment id of the new comment, add it to the lineage of the newly created comment, and save it again (update).
All in all, we are retrieving a row, inserting one, and updating one. Considering the other highly inefficient methods and knowing that most people will read comments and not write them, this is pretty awesome. We have no recursion, not even a regular loop.
Querying this structure is even better. It is super easy to get all comments in the right order, get the indentation, and even the replies per comment, no matter how many levels of nesting we have. This query kind of combines these three tasks:
SELECT c.id, c.user_id, c.body, c.deep, c.lineage, c.parent_id, (SELECT COUNT(*) FROM comment where comment.lineage LIKE (CONCAT(c.lineage,'%')) AND comment.lineage!=c.lineage) as replies
FROM comment as c
order by c.lineage
Careful readers will see that we are using a subquery to count the replies for each comment. This part does have some performance overhead, but not much. If you do not need to show the number of replies to each comment, leave it our for even faster results. Or, you could take this even further by storing the number of replies in the database. The downfall of that approach is that you will need to recursively update all parents of the current comment row. That's not as bad as it sounds, it is not likely that you will have more than a handful of nesting levels for a single comment, in fact, most will have none, one or two.
Referential integrity
What about referential integrity? Since we have flattened the lineage and deep columns, we have lost some of that. No worries though, we are keeping our referential parent_id -> id foreign key. We do have referential integrity at the foundation. Should anything go wrong with the lineage or deep columns, then we can easily use the parent_id -> id relationship to "rebuild" those values.
Usage
The advanced flat model is optimized for a comment system, it should not be used everywhere. Particularly it is suitable for wide trees (little nesting), not deep trees. The reason for this limitation is the lineage column, which flattens the hierarchy in a single column. When there are too many levels of nesting (let's say over 20), a different model may be more suitable.
Yet another way
Throughout my online research, I found yet another way to work with hierarchical data. This approach is again a flat model approach, but this time the author reserves one column per nesting level in the table to store the deepness. The problem is that this results in a hardcoded nesting level limit. The other problem is that it is patented. Go figure. That's like patenting breathing.
Conclusion
There are multiple approaches towards storing and retrieving hierarchical data in (my)SQL, but for comment systems and other wide tree hierarchies, I hope you like my tweaked flat model approach. It has no recursion, not even loops, fast retrieves, the least writes, and a solid foundation of referential integrity. Credits go out to the author of the article and the commenter that suggested the lineage method. I only extended those ideas a little bit.
In closing, I like to end with a small tip. In many online examples, I see people looping through a depth counter to append things like spaces for indentation. This is not needed, almost every language has a function for this, here it is in PHP:
str_repeat(" ",$row['deep']);
...where the first param is the indentation string, and the second param is the number of indentations.
Enjoy!



Comments: 31
Reviews: 10
Average rating:
Highest rating: 5
Lowest rating: 1
COMMENT: ALAN

APR 7, 2009 - 10:20:12 PM
COMMENT: LEWIS
APR 10, 2009 - 11:16:13 AM
I wonder why you think that, maybe if you thought about it a bit more you would change your mind. Or maybe I am just being a bit egotistical. What does everyone else think.
lewis.
Lineage Adena «
COMMENT: STEVE LOUNSBURY

JUL 22, 2009 - 06:48:21 PM
COMMENT: FERDY
JUL 24, 2009 - 04:15:36 PM
In my situation, I'm using it for a comments system, where comments cannot be deleted. However, it is not difficult to delete a node, since we still have referential integrity. You can walk DOWN the three using referential integrity. It would require recursion though, but since deletes are typically infrequent and nestings not so deep, it should perform fine I think. «
COMMENT: ROBERTO F.
AUG 21, 2009 - 18:03:44
*level1
**level1A
**level1B
*level2
**level2a
***level2a1.
OffCourse the LEVEL(N)(A) are the name. But I've getting no sucess just using MySQL without and code. Do ya know how could I do that?
«
COMMENT: SIMON
NOV 27, 2009 - 11:41:29
COMMENT: TOMAS
MAY 2, 2010 - 09:36:48
COMMENT: HANS CASTORP
MAY 2, 2010 - 05:14:11 PM
When using recursive CTEs (or Oracle's CONNECT BY) it is a very elegant, easy to use and efficient solution «
COMMENT: HANS CASTORP
MAY 2, 2010 - 05:15:19 PM
In that case recursion is indeed not a good solution «
COMMENT: DAVID


MAY 16, 2010 - 13:40:36
COMMENT: DAVID

MAY 16, 2010 - 16:29:15
No worries, I solved this now.
I created a new field called 'thread'. A comment and all its replies share the same thread id which I increment when a new comment thread begins. Then all I had to do was use 'ORDER BY thread DESC, lineage'. This now displays a classic bulletin board hierarchy. Comments listed latest first with replies indented below each one.
Thank again for the lineage idea.
David. «
COMMENT: DAVID


MAY 18, 2010 - 11:20:17
Ordering by lineage in this way can sometimes give unexpected results because with alphanumeric ordering 11 comes before 2. It's not disastrous but you can tidy this up by padding the lineage elements with zeros using the php function sprintf. In this way 00011 comes after 00002.
I now have a working example of the lineage solution described on this page:
www.thevac.co.uk/messages.php
Please note this is a live message board. «
COMMENT: MAYLEE
JUL 18, 2010 - 01:33:55 AM
COMMENT: VANI
SEP 28, 2010 - 07:52:30 PM
Do you think it's a good idea to use for folders with permissions?
Basically, when a user logs in the system i need to get a list of folder available for that user and the permissions associated (like readonly folder, writable folder, duplicate folder).
I was also looking at hierarchical tagging system. But has doubts about it whether it's a good idea to include permissions with that.
What are your thoughts about that? «
COMMENT: RAJESH

OCT 6, 2010 - 10:13:19 PM
COMMENT: RAJESH
OCT 6, 2010 - 10:13:54 PM
COMMENT: SUJOY PAL

NOV 12, 2010 - 09:23:59 AM
COMMENT: PETER
DEC 14, 2010 - 19:11:23
COMMENT: JOHAN

FEB 3, 2011 - 03:24:02 PM
It is explained very well how to get a record and all its children. I needed it the other way around: To select a record and all its parents.
It is a little bit tricky but you *can* use lineage for that direction too. I came up with a solution using a subquery. Note that this example uses zero padding (7 chars total). So my table data looks somewhat like this:
id: 0, parent: null, lineage: 0000001
id: 2, parent: 1 , lineage: 0000001-0000002
id: 3, parent: 1 , lineage: 0000001-0000003
id: 4, parent: 3 , lineage: 0000001-0000003-0000004
id: 5, parent: 3 , lineage: 0000001-0000003-0000005
id: 6, parent: 1 , lineage: 0000001-0000006
id: 7, parent: null, lineage: 0000007
id: 8, parent: null, lineage: 0000008
To get the tree of item with id 10 I use this query:
SELECT * FROM MyTree WHERE lineage LIKE (SELECT CONCAT(SUBSTRING(lineage,1,7),'%') AS root FROM MyTree WHERE id = 10) «
COMMENT: PIER LUIGI


JUN 24, 2011 - 10.11.55
1) The deep column is not needed because you can compute the value using the lineage column (count the IDs and subtract 1).
2) I suggest to prepend and postpend the separator into the lineage column like "-1-2-3-11-" because this allow you to easily identify some useful info using the standard SQL LIKE operator:
- with LIKE "-(id)-%" you get the all messages of a main tread identified by (id)
- with LIKE "%-(id)-%" you get the all messages (child and ancestors) of a message identified by (id)
Bye. «
COMMENT: CHARLIE
JUL 8, 2011 - 01:26:51 PM
COMMENT: DON

JUL 18, 2011 - 05:50:27 PM
Many thanks! «
COMMENT: MARCEL

AUG 5, 2011 - 13:23:25
Just one remark: I'm always surprised how the tags are often (ab)used for creating spacing.. since you already know the depth of the node, don't clutter up your html with countless nbsp's, just do it with css! Either inline (margin-left) or classes (depth0, -1, -2, etc).. «
COMMENT: JOHN

DEC 28, 2011 - 07:39:55 PM
I'm using this structure in a table of 'categories', any of which can be nested under another for an infinite number of hierarchical levels. It works great...until the client wants a way to re-order the categories in the list.
How would you do that? The only thing I can think of is to give teh client a 'weight' or 'order by' field that contains an int. Then, generate a sort field in the get query that's something like CONCAT(`lineage`, ',' , `weight`) Problem is I really need to strip the last entry from lineage and replace it with weight/order...that way I'll get a sort on the lineage of all parent IDs + the weight for final children.
My familiarity with SQL string functions is not where it should be though...nor am I really sure this is the best way to do it.
How would you do it? «
COMMENT: THE MIKENESS
MAR 13, 2012 - 16:58:24
(CASE WHEN parent_id IS NULL THEN comment_id ELSE parent_id END) as sortcol
then you'd sort by sortcol above all else, and then simply have to correct in code for the fact that you might have wanted to sort the parents by something other than their id. I think Oracle has some sort of functions that allow you to check values of previous rows and make decisions based on that, so it may be possible in some SQL systems, but I don't so far believe I can come up with a way to do it 100% properly in SQL, however, this solution is most likely far superior in every way to every other solution I've seen with the exception being that some minimal post-processing is necessary. In PHP, you'd just have to stick all rows in the resultset into an array that is indexed by your desired sort order column, and then instead of iterating this new array by order it was inserted into, iterate through it in order of your sort order column (or sort the array). «
COMMENT: VINS
MAY 12, 2012 - 04:22:06 AM
Suppose if I will add one more column to the table called "sort_order" and I want the result in as per the sort_order. for example the resulting hierarchy will be in order by lineage and sort order as well. All the child of the parents will be sorted by the assigned sort order. Also if the sort order will be same then the they will be sort by their name (body).
What will be the mysql query for this. can anybody help me ?
comment_id body parent lineage deep sort_order
1 Jungle 1 0 1
2 doesn't! 1 1-2 1 2
3 it Works 1 1-3 1 2
4 Oh yes 2 1-2-3 2 1
5 economy 4 0 2
6 True 4 4-5 1 1 «
COMMENT: TAMÁS MÁRTON
SEP 30, 2012 - 01:27:40 DU.
COMMENT: PRAMOD
NOV 2, 2012 - 11:33:21 AM
COMMENT: COZ
NOV 3, 2012 - 03:09:54
Replace it with a category_id (the id of the parent article for the comment)
Then that's all you need, select everything with that category_id, order it in reverse by id (which is sequential)
When displaying if it's not a root comment then indent it.
That's all it is, simple as anything. A lot more efficient than your way. :) «
COMMENT: FERDY
NOV 4, 2012 - 12:27:15
COMMENT: USMAN
MAR 1, 2013 - 07:08:01 AM
I am trying to create multi level menu from sometime. I get some idea from here. But I am little bit confused with lineage. «