数据库如何在B-Tree/B+Tree中内部存储数据[英] How Database stores data internally in B-Tree/B+Tree





  1. 身份证
  2. 姓名
  3. 年龄
  4. 重量
  5. 经理

我们查询select * from Table1 where age>50 and weight<100




您选择的示例是单个 Tree 无法完成工作的少数情况之一(两个独立的范围).

然而,我正在开发的电子书的第一章解释了 B-Tree 索引的内部工作原理:http://use-the-index-luke.com/anatomy/



  1. AGE 和 WEIGHT 上的串联索引(按此顺序).
    以防万一,查询将读取所有记录 WHERE AGE > 50,然后按 WEIGHT 过滤.

  2. WEIGHT 和 AGE 上的串联索引(另一个顺序).
    这是不同的方式:读取所有记录 WHERE WEIGHT < 100 然后按 AGE 过滤.

哪个更有效取决于您拥有的数据.如果记录 AGE > 50 比 WEIGHT < 100 少,第一个将更有效,否则第二个.但是,如果您使用不同的值进行查询,则图片可能会发生变化.


具有两个独立范围查询的查询需要两个轴,不像链,但更像棋盘.AGE 一个轴, WEIGHT 另一个轴.如果可以的话,您的查询将只需要扫描棋盘的一个角.

但是,b 树只有一个轴,因此您必须先选择要使用的标准.如果选择AGE,则表示从AGE 50开始,整个链都会被扫描到最后.只有部分存储在链端的记录也符合WEIGHT < 100的条件,其他记录必须读取但将被丢弃.


WHERE age = 50 AND weight < 100
WHERE weight = 100 AND age > 50
WHERE age > 50 AND age < 70;



第三种可能的方法是在两列上有两个独立的索引.这允许您拥有任意数量的轴(只需创建更多索引).但是,这存在一些巨大问题.首先,并非所有数据库产品都支持这一点.只要支持它,它就是一个相当广泛的操作.它通常以扫描每个索引的方式工作,为每个结果构建一个位图索引.然后连接这些位图索引以应用 AND 运算符.这需要大量的数据处理——只有当每个条件对它自己的选择性不是很严格时才值得付出努力,但两者加起来都非常有选择性.


如果您的查询在 OLTP 环境中运行:使用一个连接索引.两个独立的索引只是最后的选择.但是,如果您在 OLAP 环境中工作,您可能仍然需要位图索引.

ps:索引 AGE 是一个 exercise 在我的书中(有解决方案)——特别是因为存储 AGE 是一种不好的做法,你应该存储出生日期.



My question is that How database stores data and how it performs query internally.

Suppose we have following fields in our table:

  1. ID
  2. Name
  3. Age
  4. Weight
  5. Manager

and we query select * from Table1 where age>50 and weight<100

I am just curious that how it perform query internally.

What will the Node of B-Tre/B+Tree contains in this example?


The example you have chosen is one of the few cases where a single Tree can't do the job (two independent ranges).

However, the first chapter of my work-in-progress e-Book explains the inner workings of B-Tree indexes: http://use-the-index-luke.com/anatomy/

EDIT for more details why two indexes might be useful for the above example.

The above query can be supported by three possible index configurations:

  1. concatenated index on AGE and then WEIGHT (in this order).
    In case, the query would read all records WHERE AGE > 50 and then filter by WEIGHT.

  2. concatenated index on WEIGHT and then AGE (the other order).
    That goes the different way: read all records WHERE WEIGHT < 100 and then filter by AGE.

Whatever is more efficient depends on the data you have. If there are less records AGE > 50 than WEIGHT < 100 the first will be more efficient, otherwise the second. However, if you query with different values, the picture might change.

The reason that a concatenated index can't support the query well is that each index order is on one axis only. each index entry is before or after another one, but never next to it. All index entries build one chain.

A query that has two independent range queries would require two axes, not like a chain, but more like a chess board. one axis for AGE the other for WEIGHT. If that would be possible, your query would need to scan only one corner of the chess board.

However, a b-tree has only one axis, hence you must chose which criteria to use first. If you chose AGE it means that starting with AGE 50, the entire chain will be scanned until the end. Only some of the records stored at the side of the chain will also qualify for WEIGHT < 100, the other records must be read but will be discarded.

So, a long story to explain why one tree can not support a query with two independent range clauses. On the other hand, one concatenated index can do the following quite well:

WHERE age = 50 AND weight < 100
WHERE weight = 100 AND age > 50
WHERE age > 50 AND age < 70;

However, the problem arises if there are two inequality operators are used on two different columns.

So, what to do?

The third possible approach is to have two independent indexes on the two columns. That allows to have as many axes as you like (just create more indexes). However, there are a few huge problems with that. First of all, not all DB products support that. Whenever it is supported, it is a rather expansive operation. It works typically that way that each index is scanned, a bitmap index is built for each result. Those bitmap indexes are then joined to apply the AND operator. That's a lot of data munging--it is only worth the effort if each condition is not very selective for it's own, but both together are very selective.

Wan't my advice?

If your query runs in an OLTP environment: use one concatenated index. Two independent indexes are an option of last resort only. However, if you are working in an OLAP environment, you might anyways need bitmap indexes.

ps.: Indexing AGE was an exercise in my book (with solution)--especially because storing AGE is a bad practice, you should store the date of birth instead.