Is it better to create an index before filling a table with data, or after the data is in place?

Problem description

I have a table of about 100M rows that I am going to copy in order to alter it, adding an index. I'm not so concerned with the time it takes to create the new table, but will the created index be more efficient if I alter the table before inserting any data, or if I insert the data first and then add the index?

Recommended answer

Creating the index after the data is inserted is the more efficient way (it is even often recommended to drop the index before a batch import and recreate it after the import).

Synthetic example (PostgreSQL 9.1, slow development machine, one million rows):

CREATE TABLE test1(id serial, x integer);
INSERT INTO test1(id, x) SELECT x.id, x.id*100 FROM generate_series(1,1000000) AS x(id);
-- Time: 7816.561 ms
CREATE INDEX test1_x ON test1 (x);
-- Time: 4183.614 ms

Insert and then create index - about 12 sec

CREATE TABLE test2(id serial, x integer);
CREATE INDEX test2_x ON test2 (x);
-- Time: 2.315 ms
INSERT INTO test2(id, x) SELECT x.id, x.id*100 FROM generate_series(1,1000000) AS x(id);
-- Time: 25399.460 ms

Create index and then insert - about 25.5 sec (more than two times slower)
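
The drop-and-recreate pattern mentioned at the start of this answer could look like the following sketch for PostgreSQL. It reuses test2 and test2_x from the example above; the second batch of one million rows and the final ANALYZE step are assumptions added for illustration, not part of the original measurements.

```sql
-- Drop the index so the bulk insert does not have to maintain it row by row.
DROP INDEX IF EXISTS test2_x;

-- Bulk load (here: a second batch of synthetic rows, as an illustration).
INSERT INTO test2(id, x)
SELECT x.id, x.id*100 FROM generate_series(1000001, 2000000) AS x(id);

-- Rebuild the index once over the complete data set.
CREATE INDEX test2_x ON test2 (x);

-- Refresh planner statistics after the large change.
ANALYZE test2;
```

Keep in mind that queries relying on test2_x will be slower while the index is absent, so this fits maintenance windows or tables that are not yet in use better than busy production tables.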

Other recommended answer

It is probably better to create the index after the rows are added. Not only will it be faster, but the tree balancing will probably be better.

Edit: "balancing" is probably not the best choice of term here. A B-tree is balanced by definition, but that does not mean it has the optimal layout. Child node distribution within parents can be uneven (leading to more cost in future updates), and the tree can end up deeper than necessary if balancing is not performed carefully during updates. If the index is created after the rows are added, it will more likely have a better distribution. In addition, index pages on disk may be less fragmented after the index is built. A bit more information here
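
One rough way to check the layout claim on your own system is to compare the on-disk size of the two indexes from the first answer's example (test1_x was built after the load, test2_x before it). This is only a sketch: it assumes both tables from that example still exist, and the result depends on insertion order and PostgreSQL version, so no expected output is shown.

```sql
-- Compare the on-disk size of the index built after the load
-- with the one that was maintained during the load.
SELECT pg_size_pretty(pg_relation_size('test1_x')) AS built_after_load,
       pg_size_pretty(pg_relation_size('test2_x')) AS built_before_load;

-- If an index has become bloated or fragmented, rebuild it from scratch.
REINDEX INDEX test2_x;
```

REINDEX blocks writes to the table while it runs, so it is best scheduled for a quiet period.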

Other recommended answer

In terms of asymptotic complexity, the order doesn't matter for this problem, because:

  1. If you add the data to the table first and add the index afterwards, index generation takes O(n*log(N)) longer (where n is the number of rows added). Tree generation time is O(N*log(N)); if you split N into old data X and new data n, you get O((X+n)*log(N)), which can be rewritten as O(X*log(N) + n*log(N)). In this form you can see directly how much additional time you will wait.
  2. If you add the index first and insert the data afterwards, each of the n new rows incurs an additional insert cost of O(log(N)) to restructure the tree after the new element is added (the index already exists, so the indexed column of each new row must be inserted and the tree rebalanced; this costs O(log(P)), where P is the index size [the number of elements in the index]). With n new rows, that is n * O(log(N)), so O(n*log(N)) additional time in total.
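
The two cases above can be written out for the simplest situation of an initially empty table (X = 0, so n = N). The constants c_1, c_2, c_3 below are hypothetical per-operation costs introduced only for illustration, and the bulk-build model (sort the keys, then write the tree bottom-up) is an assumption about how a typical B-tree index is built rather than something stated in the answers.

```latex
% Index exists first: each of the n inserts descends a tree of current size k.
T_{\text{incremental}}(n) = \sum_{k=1}^{n} c_1 \log k = c_1 \log(n!) = \Theta(n \log n)

% Index built afterwards: one sort of n keys plus a linear bottom-up build.
T_{\text{bulk}}(n) = c_2\, n \log n + c_3\, n = \Theta(n \log n)
```

Both totals are of order n log n, which is what this answer argues. The asymptotic notation hides the constants, however, and those are what the timings in the first answer measure: about 25.4 s with the index in place during the insert versus about 12 s for insert plus a separate index build on that machine.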