Database suggestions for time series of events

Problem description

For one of my projects, I have to enter a big-ish collection of events into a database for later processing and I am trying to decide which DBMS would be best for my purpose.

I have:

  • About 400,000,000 discrete events at the moment

  • About 600 GB of data that will be stored in the DB

These events come in a variety of formats, but I estimate the count of individual attributes to be about 5,000. Most events only contain values for about 100 attributes each. The attribute values are to be treated as arbitrary strings and, in some cases, integers.

The events will eventually be consolidated into a single time series. While they do have some internal structure, there are no references to other events, which - I believe - means that I don't need an object DB or some ORM system.

My requirements:

  • Open source license - I may have to tweak it a bit.

  • Scalability by being able to expand to multiple servers, although only one system will be used at first.

  • Fast queries - updates are not that critical.

  • Mature drivers/bindings for C/C++, Java and Python. Preferably with a license that plays well with others - I'd rather not commit myself to anything because of a technical decision. I think that most DB drivers do not have a problem here, but it should be mentioned anyway.

  • Availability for Linux.

  • It would be nice, but not necessary, if it were also available for Windows.

My ideal DB for this would allow me to retrieve all the events from a specified time period with a single query.

What I have found/considered so far:

  • PostgreSQL with an increased page size can apparently have up to 6,000 columns per table. If my estimate of the attribute count is not off, it might do.

  • MySQL seems to have a limit of 4,000 columns per table. I could use multiple tables with a bit of SQL-fu, but I'd rather not.

  • MongoDB is what I am currently leaning towards. It would allow me to preserve the internal structure of the events while still being able to query them, and its API seems quite straightforward (see the sketch after this list). I have no idea how well it does performance-wise, though - at least on a single server.

  • OpenTSDB and its metric collection framework sound interesting. I could use a single time series for each attribute (which might help with some of my processing), have the attribute value as a tag, and additionally tag the entries to associate them with a specific event. It probably has a steeper learning curve than the three above, both from an administrator and an application programmer point of view. No idea about its performance.

  • Use HBase directly. This might fit my requirements better than OpenTSDB, although - judging from my past experience with Hadoop - the administration overhead is probably still higher than with the first three options.
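
To make the MongoDB option concrete: below is a minimal sketch, in Python via pymongo, of storing an event as a document and pulling back a time window with a single query. The database and collection names and the "ts"/"attrs" fields are illustrative assumptions, not an established schema.

# Minimal sketch: events as MongoDB documents, queried by time range.
from datetime import datetime
from pymongo import ASCENDING, MongoClient

client = MongoClient()                   # assumes a local mongod instance
events = client["eventstore"]["events"]  # hypothetical names

# An index on the timestamp keeps time-range queries fast.
events.create_index([("ts", ASCENDING)])

# Each event keeps its internal structure; sparse attributes are simply
# absent instead of occupying thousands of mostly-empty columns.
events.insert_one({
    "ts": datetime(2011, 5, 1, 12, 30, 0),
    "attrs": {"source": "sensor-17", "status": "ok", "count": "42"},
})

# "All the events from a specified time period with a single query":
start, end = datetime(2011, 5, 1), datetime(2011, 5, 2)
for doc in events.find({"ts": {"$gte": start, "$lt": end}}).sort("ts", ASCENDING):
    print(doc["ts"], doc["attrs"])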

There are probably other databases that could do it, so feel free to let me know - I would appreciate any suggestion or comment that might help me with this.

PS: I only have minimal experience as a DB administrator, so I apologise for any misconceptions.

Recommended answer

Using tables with thousands of columns is madness, especially when, as you said, most of them would be empty.

You should first look into converting your data-structure from this:

table_1
-------
event_id
attribute_1
attribute_2
[...]
attribute_5000

into something like this:

table_1          event_values             attributes
--------         ------------             ----------
event_id         event_id                 attribute_id
                 attribute_id             attribute_type
                 attribute_value

which can be used with any RDBMS (your only constraints would then be the total database size and performance).
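
For illustration, here is a minimal sketch of that layout using Python's built-in sqlite3 (any RDBMS would do). Only the three tables and their columns come from the schema above; the ts timestamp column, the index, and the sample values are my own additions for the time-range queries you mentioned.

# Sketch of the attribute-value layout; "ts" is an assumed extra column.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE table_1 (
        event_id INTEGER PRIMARY KEY,
        ts       INTEGER NOT NULL          -- epoch seconds (assumed)
    );
    CREATE TABLE attributes (
        attribute_id   INTEGER PRIMARY KEY,
        attribute_type TEXT NOT NULL       -- e.g. 'string' or 'integer'
    );
    CREATE TABLE event_values (
        event_id        INTEGER REFERENCES table_1(event_id),
        attribute_id    INTEGER REFERENCES attributes(attribute_id),
        attribute_value TEXT
    );
    CREATE INDEX idx_table_1_ts ON table_1(ts);
""")

# An event gets rows only for the ~100 attributes it actually has;
# the other ~4,900 have no row at all instead of a NULL column.
con.execute("INSERT INTO table_1 VALUES (1, 1304252000)")
con.execute("INSERT INTO attributes VALUES (42, 'string')")
con.execute("INSERT INTO event_values VALUES (1, 42, 'some value')")

# All attribute values of all events in a time window, in a single query:
rows = con.execute("""
    SELECT t.event_id, t.ts, v.attribute_id, v.attribute_value
    FROM table_1 t
    JOIN event_values v ON v.event_id = t.event_id
    WHERE t.ts >= ? AND t.ts < ?
    ORDER BY t.ts
""", (1304200000, 1304300000)).fetchall()
print(rows)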

Other recommended answer

This answer probably comes very late, but here is what I do.

I use HDF5 as my time series repository. It has a number of effective and fast compression styles which can be mixed and matched. It can be used with a number of different programming languages. It is available on Windows as well as Linux.

I use boost::date_time for the timestamp field. This allows a large variety of datetime based computations.

In the financial realm, I then create specific data structures for each of bars, ticks, trades, quotes, ...

I created a number of custom iterators and used Standard Template Library algorithms to search efficiently for specific values or for ranges of time-based records. The selections can then be loaded into memory.
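
Since the above lives in C++ (HDF5's native API, boost::date_time, STL algorithms), here is a rough Python analogue of the same store-then-binary-search pattern, using h5py and NumPy, purely to illustrate the idea; all file, dataset, and variable names are made up.

# Rough Python analogue: compressed HDF5 storage plus a binary search
# over sorted timestamps to load only a time window into memory.
import h5py
import numpy as np

# Sorted epoch-second timestamps with one value per record.
ts = np.array([1304200000, 1304210000, 1304220000, 1304230000], dtype=np.int64)
values = np.array([1.5, 2.0, 0.5, 3.25], dtype=np.float64)

with h5py.File("series.h5", "w") as f:
    # HDF5 offers several compression styles; gzip is one built-in choice.
    f.create_dataset("ts", data=ts, compression="gzip")
    f.create_dataset("values", data=values, compression="gzip")

with h5py.File("series.h5", "r") as f:
    t = f["ts"][...]
    # Binary search for the half-open window [t0, t1), then load the slice.
    t0, t1 = 1304205000, 1304225000
    lo, hi = np.searchsorted(t, (t0, t1))
    print(t[lo:hi], f["values"][lo:hi])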