C Library for compressing sequential positive integers


Problem description

I have the very common problem of creating an index for an on-disk array of strings. In short, I need to store the position of each string in the on-disk representation. For example, a very naive solution would be an index array as follows:

uint64 idx[] = { 0, 20, 500, 1024, ..., 103434 };

This says that the first string is at position 0, the second at position 20, the third at position 500, and the nth at position 103434.

The positions are always non-negative 64-bit integers in sequential order. Although the differences could vary arbitrarily, in practice I expect the typical difference to be in the range 2^8 to 2^20. I expect this index to be mmap'ed in memory, and the positions will be accessed randomly (assume a uniform distribution).

I was thinking about writing my own code to do some sort of block delta encoding or another more sophisticated encoding, but there are so many different trade-offs between encoding/decoding speed and space that I would rather start from a working library, and maybe even settle for something without any customizations.

Any hints? A C library would be ideal, but a C++ one would also allow me to run some initial benchmarks.

A few more details if you are still following. This will be used to build a library similar to cdb (http://cr.yp.to/cdb/cdbmake.html) on top of the cmph library (http://cmph.sf.net). In short, it is a large, disk-based, read-only associative map with a small in-memory index.

Since it is a library, I don't have control over the input, but the typical use case I want to optimize has millions of values, an average value size in the few-kilobyte range, and a maximum value size of 2^31.

For the record, if I don't find a library ready to use, I intend to implement delta encoding in blocks of 64 integers, with the initial bytes of each block specifying the absolute offset so far. The blocks themselves would be indexed with a tree, giving me O(log(n/64)) access time. There are way too many other options, and I would prefer not to discuss them. I am really looking for ready-to-use code rather than ideas on how to implement the encoding. I will be glad to share what I did with everyone once I have it working.

I appreciate your help, and let me know if you have any questions.

Recommended answer

I use fastbit (Kesheng Wu, LBL.gov). It seems you need something good, fast, and available now, and fastbit is a highly competent improvement on Oracle's BBC (byte-aligned bitmap code, BerkeleyDB). It's easy to set up and generally very good.

However, given more time, you may want to look at a Gray-code solution; it seems optimal for your purposes.

Daniel Lemire has a number of libraries for C/C++/Java released on code.google.com. I've read over some of his papers and they are quite nice: several advancements on fastbit, and alternative approaches for column re-ordering with permuted Gray codes.

Almost forgot: I also came across Tokyo Cabinet. Though I do not think it is well suited for my current project, I might have considered it more if I had known about it before ;). It has a large degree of interoperability:

Tokyo Cabinet is written in the C language, and provided as API of C, Perl, Ruby, Java, and Lua. Tokyo Cabinet is available on platforms which have API conforming to C99 and POSIX.

As you referred to CDB: the TC benchmark has a TC mode (TC supports several operational constraints for varying performance) in which it surpassed CDB by 10 times for read performance and 2 times for write.

With respect to your delta-encoding requirement, I am quite confident in bsdiff and its ability to out-perform any file.exe content-patching system; it may also have some fundamental interfaces for your general needs.

Google's new binary compression application, Courgette, may be worth checking out in case you missed the press release: 10x smaller diffs than bsdiff in the one test case I have seen published.

Another answer

What exactly are you trying to compress? If you are thinking about the total space of the index, is it really worth the effort to save the space?

If so, one thing you could try is to chop the space in half and store it in two tables. The first stores (upper uint, start index, length, pointer to second table), and the second stores (index, lower uint).

For fast searching, the indices would be implemented using something like a B+ tree.

Another answer

You have two conflicting requirements:

  1. You want to compress very small items (8 bytes each).
  2. You need efficient random access for each item.

The second requirement is very likely to impose a fixed length for each item.