administration mode
Pssst...Ferdy is the creator of JungleDragon, an awesome wildlife community. Visit JungleDragon

 

Article: Hierarchical data in MySQL: easy and fast »

FERDY CHRISTANT - MAR 27, 2009 (11:44:00 PM)

Storing hierarchical data in a relational database is a classic problem. Relational databases are not really designed for this purpose, because they store data in flat rows. Storage is actually the least of our problems. Imagine a commenting system with an unlimited level of nesting in replies. This is easy to store in a relational database:

comment_id
body
parent_id
1 JungleDragon rocks
 
2 No it doesn't!
1
3 Oh yes it does!
2
4 The economy sucks
 
5 True 4

For the sake of simplicity, I have omitted other comment fields such as author, date, etc. The table shows that we can easily store unlimited levels of nesting by simply linking to the parent of the current comment. Comments that have no parent are the root comments, the others are leafs. If we would apply a foreign key to the parent_id column and link it to the id column we even get full referential integrity. 

So far so good? Yes, but the real problems start when we want to query that table for useful stuff. Imagine the following tasks:

  • Get the full comment tree sorted in the correct order. Hint: you cannot use date sorting, since replies can be made anywhere and must be positioned below their parent at all times.
  • For each comment in the tree, we want to know how deep it is (how many levels deep), so we can use that for indenting the comment below its parent.
  • For each comment we want to know how many replies there are, no matter how deep in the tree.

None of the above tasks can be queried efficiently using the table we started with. The problem is that since we do not know how many levels deep the hierarchy is, we need to make recursive calls, leading to many queries to the database, high memory usage in our scripts and generally a system that does not scale.

As said, this is a classic problem, and luckily there are existing solutions and patterns for this. This article explains four different approaches, I highly recommend it. In summary, these are the four methods:

  • Recursive. The approach we already discussed. This approach is inefficient and does not scale.
  • Stack. This one is not much different from the recursive approach. The idea is that you build a stack of hierarchical data by looping through the rows, typically inside a stored procedure. Once you have the stack built, it is very easy to use and query, however, it still requires a lot of database connections.
  • Modified Preorder Tree Traversal. This surely is the most sophisticated, but also most complex, of the methods to work with hierarchical data. I encourage you to read the full description to really understand it. The principle of this method is that we assign smart numbers (left and right) to each leaf in the tree, these numbers indicate the relative position of the node towards other nodes. With a tree preordered like this, querying the hierarchical data becomes very fast and easy. There is a major downfall though: as soon as you insert or change something in the tree, it has to be rebuild. This will lead to many update queries, and these are expensive. The larger the tree, the larger the time of inserting a new node.
  • Flat table model. Surely we know that if we could preprocess things like the level of the comment (how deep is it) and the rank (sort order), querying would become much easier. The author also admits that this does shift the performance problem to the writes: upon inserting a comment, all others ranks have to be recalculated. Anyways, the idea of this model is that you store essential attributes of the tree, instead of calculating them each time we retrieve the tree. Kind of like a tree cache.

Based on the options presented so far, it seems the flat table model looks promising, yet it does have a major limitation: a very expensive write operation. Luckily, one of the commenters of the article posted a way around this. I have taken his idea as a basis, and extended it somewhat.

The Flat Table Model done right

Let's just get to it right away. Here's my(heavily inspired) solution:

comment_id
body
parent_id
lineage deep
1 JungleDragon rocks
  1
0
2 No it doesn't!
1
1-2
1
3 Oh yes it does!
2 1-2-3
2
4 The economy sucks
  4
0
5 True 4 4-5 1

Instead of using a rank, we are using a lineage. They have the same purpose (correctly ordering the comments), it's just that a lineage has one major advantage: it never needs to be updated! As you can see, the lineage column displays the hierarchy ids in a flat column (from highest parent to lowest child). This field is calculated upon insertion, no other inserts need to take place. The deep column simply sets the nesting level of the comment, which is handy when we need this value for indentation later on. By storing it, we require no recursive queries at all.

Let's go into a little bit more detail concerning the insertion of a new comment. Here's how it works:

  • in our save routine of a new comment, we check whether it came from a parent (is parent_id set?)
  • if so, we do a query to retrieve the parent row
  • next, we calculate the lineage and deepness of the new comment. For lineage, we use the parent's lineage. For deepness, we use the parent's deepness + 1.
  • we now insert our new comment. next, we use the returned comment id of the new comment, add it to the lineage of the newly created comment, and save it again (update).

All in all, we are retrieving a row, inserting one, and updating one. Considering the other highly inefficient methods and knowing that most people will read comments and not write them, this is pretty awesome. We have no recursion, not even a regular loop.

Querying this structure is even better. It is super easy to get all comments in the right order, get the indentation, and even the replies per comment, no matter how many levels of nesting we have. This query kind of combines these three tasks:

SELECT c.id, c.user_id, c.body, c.deep, c.lineage, c.parent_id, (SELECT COUNT(*) FROM comment where comment.lineage LIKE (CONCAT(c.lineage,'%')) AND comment.lineage!=c.lineage) as replies
FROM comment as c
order by c.lineage

Careful readers will see that we are using a subquery to count the replies for each comment. This part does have some performance overhead, but not much. If you do not need to show the number of replies to each comment, leave it our for even faster results. Or, you could take this even further by storing the number of replies in the database. The downfall of that approach is that you will need to recursively update all parents of the current comment row. That's not as bad as it sounds, it is not likely that you will have more than a handful of nesting levels for a single comment, in fact, most will have none, one or two.

Referential integrity

What about referential integrity? Since we have flattened the lineage and deep columns, we have lost some of that. No worries though, we are keeping our referential parent_id -> id foreign key. We do have referential integrity at the foundation. Should anything go wrong with the lineage or deep columns, then we can easily use the parent_id -> id relationship to "rebuild" those values.

Usage

The advanced flat model is optimized for a comment system, it should not be used everywhere. Particularly it is suitable for wide trees (little nesting), not deep trees. The reason for this limitation is the lineage column, which flattens the hierarchy in a single column. When there are too many levels of nesting (let's say over 20), a different model may be more suitable.

Yet another way

Throughout my online research, I found yet another way to work with hierarchical data. This approach is again a flat model approach, but this time the author reserves one column per nesting level in the table to store the deepness. The problem is that this results in a hardcoded nesting level limit. The other problem is that it is patented. Go figure. That's like patenting breathing.

Conclusion

There are multiple approaches towards storing and retrieving hierarchical data in (my)SQL, but for comment systems and other wide tree hierarchies, I hope you like my tweaked flat model approach. It has no recursion, not even loops, fast retrieves, the least writes, and a solid foundation of referential integrity. Credits go out to the author of the article and the commenter that suggested the lineage method. I only extended those ideas a little bit.

In closing, I like to end with a small tip. In many online examples, I see people looping through a depth counter to append things like spaces for indentation. This is not needed, almost every language has a function for this, here it is in PHP:

str_repeat("   ",$row['deep']);

...where the first param is the indentation string, and the second param is the number of indentations.

Enjoy!

Share |

Comments: 39
Reviews: 12
Average rating: rating
Highest rating: 5
Lowest rating: 1

COMMENT: ALAN emailhomepage

APR 7, 2009 - 10:20:12 PM

comment » Thanks for linking to my article. While there is no perfect solution, there are always tradeoffs. With the patented method I examined, you do not need to perform what can potentially be expensive string manipulation functions. To each his own ... :) «

COMMENT: LEWIS

APR 10, 2009 - 11:16:13 AM

comment » thank u

I wonder why you think that, maybe if you thought about it a bit more you would change your mind. Or maybe I am just being a bit egotistical. What does everyone else think.

lewis.

Lineage Adena «

COMMENT: STEVE LOUNSBURY emailhomepage

JUL 22, 2009 - 06:48:21 PM

comment » Good article, definitely an interesting way to go about it. I went through your example and I was wondering how you would go about deleting a node? «

COMMENT: FERDY

JUL 24, 2009 - 04:15:36 PM

comment » Steve,

In my situation, I'm using it for a comments system, where comments cannot be deleted. However, it is not difficult to delete a node, since we still have referential integrity. You can walk DOWN the three using referential integrity. It would require recursion though, but since deletes are typically infrequent and nestings not so deep, it should perform fine I think. «

COMMENT: ROBERTO F. email

AUG 21, 2009 - 18:03:44

comment » Hi, I'm tryin to do that, and I have a table with id/name/parent - but I'd like to return a tree with characteres like

*level1

**level1A

**level1B

*level2

**level2a

***level2a1.

OffCourse the LEVEL(N)(A) are the name. But I've getting no sucess just using MySQL without and code. Do ya know how could I do that? 09 «

COMMENT: SIMON

NOV 27, 2009 - 11:41:29

comment » It's a good idea, thanks for that! But when you use this model for a classic menu structure (or any other object dependency) you still have to rescan the whole table whenever you move a node. I still have not found a good solution for that. «

COMMENT: TOMAS email

MAY 2, 2010 - 09:36:48

comment » Hey. I love your method, but just one question. When I have two comments with same name and same levels, but different parents. When i'm taking one parent with all its childs, in 2nd level it gets not right parent. «

COMMENT: HANS CASTORP email

MAY 2, 2010 - 05:14:11 PM

comment » I disagree that recursion "is inefficient and does not scale".

When using recursive CTEs (or Oracle's CONNECT BY) it is a very elegant, easy to use and efficient solution «

COMMENT: HANS CASTORP email

MAY 2, 2010 - 05:15:19 PM

comment » Sorry, I overlooked that this article is for MySQL only

In that case recursion is indeed not a good solution «

COMMENT: DAVID emailhomepagerating

MAY 16, 2010 - 13:40:36

comment » Thank you. A great solution. And very simple. I hit a snag, though. If you want to display latest first and use 'ORDER BY lineage DESC' it puts the replies above instead of below. Is there a simple way around this? «

COMMENT: DAVID emailhomepage

MAY 16, 2010 - 16:29:15

comment » RE: Latest First.

No worries, I solved this now.

I created a new field called 'thread'. A comment and all its replies share the same thread id which I increment when a new comment thread begins. Then all I had to do was use 'ORDER BY thread DESC, lineage'. This now displays a classic bulletin board hierarchy. Comments listed latest first with replies indented below each one.

Thank again for the lineage idea.

David. «

COMMENT: DAVID emailhomepagerating

MAY 18, 2010 - 11:20:17

comment » Another tip.

Ordering by lineage in this way can sometimes give unexpected results because with alphanumeric ordering 11 comes before 2. It's not disastrous but you can tidy this up by padding the lineage elements with zeros using the php function sprintf. In this way 00011 comes after 00002.

I now have a working example of the lineage solution described on this page:

www.thevac.co.uk/messages.php

Please note this is a live message board. «

COMMENT: MAYLEE

JUL 18, 2010 - 01:33:55 AM

comment » It was a great article, thanks. But is that faster way to show the number of replies to each comment? «

COMMENT: VANI rating

SEP 28, 2010 - 07:52:30 PM

comment » Thanks for the nice article.

Do you think it's a good idea to use for folders with permissions?

Basically, when a user logs in the system i need to get a list of folder available for that user and the permissions associated (like readonly folder, writable folder, duplicate folder).

I was also looking at hierarchical tagging system. But has doubts about it whether it's a good idea to include permissions with that.

What are your thoughts about that? «

COMMENT: RAJESH emailhomepage

OCT 6, 2010 - 10:13:19 PM

comment » This article is about implementing the same using Spring, JPA, Annotations and Aspects. «

COMMENT: RAJESH

OCT 6, 2010 - 10:13:54 PM

comment » http://rajeshkilango.blogspot.com/2010/10/manage-hierarchical-data-using-spring.html This article is about implementing the same using Spring, JPA, Annotations and Aspects. «

COMMENT: SUJOY PAL emailrating

NOV 12, 2010 - 09:23:59 AM

comment » Simply superb!!! though still I fully don't understant the hierarchial model to flat table mapping and retrieval, you descriptions are very clear and understandable. 28 «

COMMENT: PETER email

DEC 14, 2010 - 19:11:23

comment » Could you please bring me a explanation or example on how can I store records in php automatically? the table is populated already, it could be done manually, but how can I save data with php automatically... thanks in advance! «

COMMENT: JOHAN homepagerating

FEB 3, 2011 - 03:24:02 PM

comment » Very good article. I use the approach for a category hierarchy. I use the lineage suggestion with the zero padding.

It is explained very well how to get a record and all its children. I needed it the other way around: To select a record and all its parents.

It is a little bit tricky but you *can* use lineage for that direction too. I came up with a solution using a subquery. Note that this example uses zero padding (7 chars total). So my table data looks somewhat like this:

id: 0, parent: null, lineage: 0000001

id: 2, parent: 1 , lineage: 0000001-0000002

id: 3, parent: 1 , lineage: 0000001-0000003

id: 4, parent: 3 , lineage: 0000001-0000003-0000004

id: 5, parent: 3 , lineage: 0000001-0000003-0000005

id: 6, parent: 1 , lineage: 0000001-0000006

id: 7, parent: null, lineage: 0000007

id: 8, parent: null, lineage: 0000008

To get the tree of item with id 10 I use this query:

SELECT * FROM MyTree WHERE lineage LIKE (SELECT CONCAT(SUBSTRING(lineage,1,7),'%') AS root FROM MyTree WHERE id = 10) «

COMMENT: PIER LUIGI emailhomepagerating

JUN 24, 2011 - 10.11.55

comment » Only some improvements:

1) The deep column is not needed because you can compute the value using the lineage column (count the IDs and subtract 1).

2) I suggest to prepend and postpend the separator into the lineage column like "-1-2-3-11-" because this allow you to easily identify some useful info using the standard SQL LIKE operator:

- with LIKE "-(id)-%" you get the all messages of a main tread identified by (id)

- with LIKE "%-(id)-%" you get the all messages (child and ancestors) of a message identified by (id)

Bye. «

COMMENT: CHARLIE

JUL 8, 2011 - 01:26:51 PM

comment » If your id is an auto increment value, how do you insert the lineage value without having to do two inserts for every row? ie. the second query getting the insert_id «

COMMENT: DON emailhomepage

JUL 18, 2011 - 05:50:27 PM

comment » I just ran across this article and found it really helpful in creating my own hierarchy of financial categories, which shouldn't get any deeper than 3-4 levels. I'm using LO Base & PostgreSQL. PG has ltreelib, but I really don't think I need that for my little app. This is easier to comprehend and implement.

Many thanks! «

COMMENT: MARCEL homepagerating

AUG 5, 2011 - 13:23:25

comment » Great article. Certainly helping me out.

Just one remark: I'm always surprised how the   tags are often (ab)used for creating spacing.. since you already know the depth of the node, don't clutter up your html with countless nbsp's, just do it with css! Either inline (margin-left) or classes (depth0, -1, -2, etc).. «

COMMENT: JOHN emailhomepage

DEC 28, 2011 - 07:39:55 PM

comment » So...what do you do if you want to order the nodes by some other order than their ID...which is effectively how they're ordered if you order by lineage...

I'm using this structure in a table of 'categories', any of which can be nested under another for an infinite number of hierarchical levels. It works great...until the client wants a way to re-order the categories in the list.

How would you do that? The only thing I can think of is to give teh client a 'weight' or 'order by' field that contains an int. Then, generate a sort field in the get query that's something like CONCAT(`lineage`, ',' , `weight`) Problem is I really need to strip the last entry from lineage and replace it with weight/order...that way I'll get a sort on the lineage of all parent IDs + the weight for final children.

My familiarity with SQL string functions is not where it should be though...nor am I really sure this is the best way to do it.

How would you do it? «

COMMENT: THE MIKENESS email

MAR 13, 2012 - 16:58:24

comment » another way you can do it, which will require some simple re-sorting in code for the case of the parent items after receiving the SQL result back, is to add a column as follows, and then sort on it first:

(CASE WHEN parent_id IS NULL THEN comment_id ELSE parent_id END) as sortcol

then you'd sort by sortcol above all else, and then simply have to correct in code for the fact that you might have wanted to sort the parents by something other than their id. I think Oracle has some sort of functions that allow you to check values of previous rows and make decisions based on that, so it may be possible in some SQL systems, but I don't so far believe I can come up with a way to do it 100% properly in SQL, however, this solution is most likely far superior in every way to every other solution I've seen with the exception being that some minimal post-processing is necessary. In PHP, you'd just have to stick all rows in the resultset into an array that is indexed by your desired sort order column, and then instead of iterating this new array by order it was inserted into, iterate through it in order of your sort order column (or sort the array). «

COMMENT: VINS

MAY 12, 2012 - 04:22:06 AM

comment » Hi

Suppose if I will add one more column to the table called "sort_order" and I want the result in as per the sort_order. for example the resulting hierarchy will be in order by lineage and sort order as well. All the child of the parents will be sorted by the assigned sort order. Also if the sort order will be same then the they will be sort by their name (body).

What will be the mysql query for this. can anybody help me ?

comment_id body parent lineage deep sort_order

1 Jungle 1 0 1

2 doesn't! 1 1-2 1 2

3 it Works 1 1-3 1 2

4 Oh yes 2 1-2-3 2 1

5 economy 4 0 2

6 True 4 4-5 1 1 «

COMMENT: TAMÁS MÁRTON rating

SEP 30, 2012 - 01:27:40 DU.

comment » Thanks, it helped a lot! «

COMMENT: PRAMOD rating

NOV 2, 2012 - 11:33:21 AM

comment » good... «

COMMENT: COZ rating

NOV 3, 2012 - 03:09:54

comment » You don't need the lineage field at all.

Replace it with a category_id (the id of the parent article for the comment)

Then that's all you need, select everything with that category_id, order it in reverse by id (which is sequential)

When displaying if it's not a root comment then indent it.

That's all it is, simple as anything. A lot more efficient than your way. :) «

COMMENT: FERDY

NOV 4, 2012 - 12:27:15

comment » @Coz: I'm afraid you're wrong. The hierarchical entries aren't sequential at all. «

COMMENT: USMAN homepage

MAR 1, 2013 - 07:08:01 AM

comment » Good Article.

I am trying to create multi level menu from sometime. I get some idea from here. But I am little bit confused with lineage. «

COMMENT: NEWBIE emailrating

FEB 8, 2014 - 07:56:49 AM

comment » Great article, helped a lot in implementing my category solution.

However @PIER LUIGI's query doesn't work as expected as it returns parents and all siblings and only really works for a 2 level category hierarchy. «

COMMENT: SANTOSH KUMAR emailrating

JUN 10, 2014 - 12:52:46 PM

comment » It is a very good concept. But there is a problem. When we order the query based on lineage column then it gives unorder result.

For example, chapter-1 has lineage 1,chapter-2 has lineage 3, and chapter-3 has lineage 10.....then result order is chapter-1,then chapter-3,then chapter-2.

Please help me that how to get proper order.

thanks. «

COMMENT: FERDY

JUN 10, 2014 - 07:52:22 PM

comment » @Santosh: I'm assuming you're talking about 10 being textually sorted before 1 alphabetically? That issue is easily resolved by applying a fixed number of digits. Thus, "1" would become "01". «

COMMENT: SANTOSH KUMAR email

JUN 11, 2014 - 08:31:25 AM

comment » Could you please elaborate id(provide some code).

Thanks. «

COMMENT: SANTOSH KUMAR email

JUN 11, 2014 - 01:32:41 PM

comment » How to insert a new chapter(chapter-2) that i forgot to insert before capter-3. «

COMMENT: FERDY

JUN 16, 2014 - 07:44:02 PM

comment » @Santosh: I'm talking about the simple solution of adding 000s in front of sort-sensitive values. This should be trivial for you to make in code. «

COMMENT: RASMUS SCHULTZ emailhomepage

JUL 15, 2014 - 07:28:16 PM

comment » Interesting approach, I didn't know about that one!

There is one more approach, in addition to the ones mentioned, in which you use a separate table to create a materialized view of parent/child relationships.

So the tree table itself is modeled using the basic id and parent_id approach. And you maintain a separate index-table from that one, which has id, parent_id and a depth column - you can then join across this table and very easily write queries to a single depth, to a maximum depth, to find all parents, etc.

This approach is somewhat expensive for writing, since you need to create one index record for every parent in a new node's path - but reads are very fast, much faster than testing for substrings. Because your index-table has three integer columns, it's going to have a fixed width and very minimal storage overhead, which makes for very fast reads and writes to this table. A unique index can be set up for the id and parent_id columns to speed up reads (and of course slow down writes) even more.

I have implemented this once many years ago with MySQL for a site that had extremely frequent reads from a very, very large category tree with very few writes, and it was laughably fast. «

COMMENT: ASHOK email

AUG 7, 2014 - 12:31:33 PM

comment » how about copying childs and inserting on other nodes? «

RATE THIS CONTENT (OPTIONAL)
Was this document useful to you?
 
rating Awesome
rating Good
rating Average
rating Poor
rating Useless
CREATE A NEW COMMENT
required field
required field HTML is not allowed. Hyperlinks will automatically be converted.