20 Billion Rows/Month - Hbase / Hive / Greenplum / What?

Problem Description

I'd like to use your wisdom to pick the right solution for a data-warehouse system. Here are some details to better understand the problem:

Data is organized in a star schema with one BIG fact table and ~15 dimensions.
20B fact rows per month
10 dimensions with hundreds of rows (somewhat hierarchical)
5 dimensions with thousands of rows
2 dimensions with ~200K rows
2 big dimensions with 50M-100M rows

Two typical queries run against this DB

Top members in dimq:

select    top X dimq, count(id) 
from      fact 
where     dim1 = x and dim2 = y and dim3 = z 
group by  dimq 
order by  count(id) desc

Measures against a tuple:

select    count(distinct dis1), count(distinct dis2), count(dim1), count(dim2),...
from      fact 
where     dim1 = x and dim2 = y and dim3 = z 

Questions:

  1. What is the best platform to perform such queries?
  2. What kind of hardware is needed?
  3. Where can it be hosted (EC2?)


    (please ignore importing and loading issues at the moment)

Tnx,
Haggai.

Recommended Answer

I cannot stress this enough: Get something that plays nicely with off-the-shelf reporting tools.

20 billion rows per month puts you in VLDB territory, so you need partitioning. The low-cardinality dimensions also suggest that bitmap indexes would be a performance win.
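
To make that concrete, here is a minimal Oracle-flavoured sketch of what monthly partitioning plus bitmap indexes could look like; the table, column and partition names are illustrative, not taken from the question:

-- fact table range-partitioned by month, one partition per load month
create table fact (
  id         number not null,
  load_date  date   not null,   -- partition key
  dim1       number not null,
  dim2       number not null,
  dim3       number not null,
  dimq       number not null
)
partition by range (load_date)
(
  partition p2009_01 values less than (date '2009-02-01'),
  partition p2009_02 values less than (date '2009-03-01')
  -- ... one partition added per month as data arrives
);

-- local bitmap indexes on the low-cardinality dimension keys
create bitmap index fact_dim1_bix on fact (dim1) local;
create bitmap index fact_dim2_bix on fact (dim2) local;
create bitmap index fact_dim3_bix on fact (dim3) local;

With the fact table partitioned by month, the typical queries only touch the partitions they need, and the equality filters on dim1/dim2/dim3 can be answered by combining the bitmap indexes.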

  • Forget the cloud systems (Hive, Hbase) until they have mature SQL support. For a data warehouse application you want something that works with conventional reporting tools. Otherwise, you will find yourself perpetually bogged down writing and maintaining ad-hoc report programs.

  • The data volumes are manageable with a more conventional DBMS like Oracle - I know of a major European telco that loads 600GB per day into an Oracle database. All other things being equal, that's two orders of magnitude bigger than your data volumes, so shared disk architectures still have headroom for you. A shared-nothing architecture like Netezza or Teradata will probably be faster still but these volumes are not at a level that is beyond a conventional shared-disk system. Bear in mind, though, that these systems are all quite expensive.

  • Also bear in mind that MapReduce is not an efficient query selection algorithm. It is fundamentally a mechanism for distributing brute-force computations. Greenplum does have a MapReduce back-end, but a purpose-built shared nothing engine will be a lot more efficient and get more work done for less hardware.

My take on this is that Teradata or Netezza would probably be the ideal tool for the job, but definitely the most expensive. Oracle, Sybase IQ or even SQL Server would also handle the data volumes involved, but will be slower - they are shared-disk architectures but can still manage this sort of data volume. See this posting for a rundown on VLDB-related features in Oracle and SQL Server, and bear in mind that Oracle has just introduced the Exadata storage platform as well.

My back-of-a-fag-packet capacity plan suggests maybe 3-5 TB or so per month including indexes for Oracle or SQL Server. Probably less on Oracle with bitmap indexes, although an index leaf has a 16-byte ROWID on Oracle vs. a 6-byte page reference on SQL Server.
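
For a rough sense of where that figure comes from (the per-row width is my assumption, not something stated in the question): at roughly 150-250 bytes per fact row including index overhead, 20 billion rows works out to about 20 × 10^9 × 150 B ≈ 3 TB up to 20 × 10^9 × 250 B ≈ 5 TB per month.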

Sybase IQ makes extensive use of bitmap indexes and is optimized for data warehouse queries. Although a shared-disk architecture, it is very efficient for this type of query (IIRC it was the original column-oriented architecture). This would probably be better than Oracle or SQL Server as it is specialized for this type of work.

Greenplum might be a cheaper option but I've never actually used it so I can't comment on how well it works in practice.

If you have 10 dimensions with just a few hundred rows each, consider merging them into a single junk dimension, which will slim down your fact table by merging the ten keys into just one. You can still implement hierarchies on a junk dimension, and this would knock 1/2 or more off the size of your fact table and eliminate a lot of the disk usage by indexes.
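
A minimal sketch of how that might look (the staging table and key names here are hypothetical, and the second query keeps the question's pseudo-SQL style): build one row per distinct combination of the ten small keys, then have the fact table carry that single surrogate key.

-- one row per distinct combination of the ten small dimension keys
create table junk_dim as
select row_number() over (order by dim_a, dim_b) as junk_key,
       dim_a, dim_b                       -- ... and the other eight small keys
from   (select distinct dim_a, dim_b from staging_fact);

-- the fact table then stores junk_key in place of ten separate keys,
-- and filters on the small dimensions go through the junk dimension:
select    top X dimq, count(id)
from      fact f
join      junk_dim j on j.junk_key = f.junk_key
where     j.dim_a = x and j.dim_b = y
group by  dimq
order by  count(id) desc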

I strongly recommend that you go with something that plays nicely with a reasonable cross-section of reporting tools. This means a SQL front end. Commercial systems like Crystal Reports allow reporting and analytics to be done by people with a more readily obtainable set of SQL skills. The open-source world has also generated BIRT, Jasper Reports and Pentaho. Hive or HBase put you in the business of building a custom front-end, which you really don't want unless you're happy to spend the next 5 years writing custom report formatters in Python.

Finally, host it somewhere you can easily get a fast data feed from your production systems. This probably means your own hardware in your own data centre. This system will be I/O bound; it's doing simple processing on large volumes of data. This means you will need machines with fast disk subsystems. Cloud providers tend not to support this type of hardware as it's an order of magnitude more expensive than the type of disposable 1U box traditionally used by these outfits. Fast Disk I/O is not a strength of cloud architectures.

Other Recommended Answer

I have had great success with Vertica. I am currently loading anywhere between 200 million and 1 billion rows a day - averaging about 9 billion rows a month - though I have gone as high as 17 billion in a month. I have close to 21 dimensions and the queries run blazingly fast. We moved on from the older system when we simply didn't have the time windows to do the data load.

We did a very exhaustive trial and study of different solutions - and practically looked at everything on the market. While both Teradata and Netezza would have suited us, they were simply too expensive for us. Vertica beat them both on the price/performance ratio. It is, by the way, a columnar database.

We have about 80 users now - and it is expected to grow to about 900 by the end of next year when we start rolling out completely.

We are extensively using ASP.NET/Dundas/Reporting Services for reports. It also plays nicely with third-party reporting solutions - though we haven't tried them.

By the way, what are you going to use for data load? We are using Informatica and have been very pleased with it. SSIS drove us up the wall.

Other Recommended Answer

Using HBase and the JasperServer HBase reporting plugin, decent reports can be created. Low-latency OLAP can be built in HBase. This will work much the same as SQL. The JasperServer HBase plugin provides an HBase query language, which is an extension of the HBase scan command.