What is the best approach to get from relational OLTP database to OLAP cube?


Problem description

I have a fairly standard OLTP normalised database and I have realised that I need to do some complex queries, averages, standard deviations across different dimensions in the data.

So I have turned to SSAS and the creation of OLAP cubes.

However, to create the cubes I believe my data source structure needs to be in a 'star' or 'snowflake' configuration (which I don't think it is right now).

Is the normal procedure to use SSIS to do some sort of ETL process on my primary OLTP DB into another relational DB that is in the proper 'star' configuration with facts and dimensions, and then use this DB as the datasource for the OLAP cubes?

Thanks

Recommended answer

Yes, that is the basic idea. You take your highly normalized OLTP database and de-normalize it into cubes for the purpose of slicing and dicing the data and then presenting reports on it. The logical design technique is called dimensional modeling. There is a ton of great information about dimensional modeling over at the Kimball Group. Ralph Kimball's books on the subject are also excellent. If you want to learn more about the BI tools themselves, check out the virtual labs on SSIS, Analysis Services and more.
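To make the target of that de-normalization concrete, here is a minimal star-schema sketch in T-SQL. All table and column names (FactSales, DimDate, DimCustomer, and so on) are hypothetical, not taken from the question:

```sql
-- Hypothetical minimal star schema: one fact table joined to
-- de-normalized dimension tables via surrogate keys.
CREATE TABLE DimDate (
    DateKey   INT      NOT NULL PRIMARY KEY,  -- smart key, e.g. 20240131
    FullDate  DATE     NOT NULL,
    [Month]   TINYINT  NOT NULL,
    [Quarter] TINYINT  NOT NULL,
    [Year]    SMALLINT NOT NULL
);

CREATE TABLE DimCustomer (
    CustomerKey  INT IDENTITY(1,1) PRIMARY KEY,  -- surrogate key
    CustomerId   INT           NOT NULL,         -- business key from the OLTP system
    CustomerName NVARCHAR(100) NOT NULL,
    Region       NVARCHAR(50)  NOT NULL          -- flattened, not snowflaked
);

CREATE TABLE FactSales (
    DateKey     INT NOT NULL REFERENCES DimDate (DateKey),
    CustomerKey INT NOT NULL REFERENCES DimCustomer (CustomerKey),
    Quantity    INT           NOT NULL,
    Amount      DECIMAL(12,2) NOT NULL            -- additive measure
);
```

Every measure lives in the fact table and every descriptive attribute in a wide, flattened dimension; that shape is what makes slicing and dicing cheap.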

Other recommended answer

The answer is: yes, but.

A dimension in SSAS has relationships between attributes that can be used as a series of fields to filter or slice by. These relationships can be hierarchical (more than one level deep): one attribute can have a parent and children. You can also establish drill-down paths (called hierarchies in SSAS) that act like attributes but have a guided drilldown.

In order to do this you need to have keys available in the database that live in a strictly hierarchical relationship (i.e. the keys can't have fuzzy relationships where a child can have more than one parent). Note that this isn't the whole story but it's close enough to reality for the moment.
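As a sketch of what "strictly hierarchical" keys look like, consider a hypothetical flattened geography dimension (names invented for illustration). Every city resolves to exactly one state and every state to exactly one country, so the attribute relationships City -> State -> Country can be declared safely:

```sql
-- Hypothetical geography dimension with unambiguous parentage.
-- Each CityKey maps to exactly one StateKey, and each StateKey to
-- exactly one CountryKey, so SSAS attribute relationships
-- (City -> State -> Country) will aggregate correctly.
CREATE TABLE DimGeography (
    CityKey     INT          NOT NULL PRIMARY KEY,
    CityName    NVARCHAR(50) NOT NULL,
    StateKey    INT          NOT NULL,  -- one state per city
    StateName   NVARCHAR(50) NOT NULL,
    CountryKey  INT          NOT NULL,  -- one country per state
    CountryName NVARCHAR(50) NOT NULL
);
```

If a city could belong to two states, the City-to-State relationship would be "fuzzy" in the sense above and the hierarchy would double-count.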

These hierarchies can be constructed from a flat data structure by the system or presented through a snowflake with relationships marked up in the underlying data source view (DSVs are a part of the cube metadata and can be used to massage data in a manner similar to database views).

A snowflake is a 3NF-ish schema (it doesn't strictly have to be 3NF - you can flatten parts of it in practice) that only has 1:M relationships. SSAS can support a few other dimension structures such as parent-child (parent-child relationship with a recursive self-join) and M:M dimensions (M:M relationships - exactly what they sound like). Dimensions of this type are more fiddly but may be useful to you.
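A sketch of both shapes, again with invented names: a snowflaked product dimension built from strict 1:M chains, and a parent-child dimension built from a recursive self-join:

```sql
-- Snowflake: strict 1:M chain Product -> Subcategory -> Category.
CREATE TABLE DimCategory (
    CategoryKey  INT PRIMARY KEY,
    CategoryName NVARCHAR(50) NOT NULL
);

CREATE TABLE DimSubcategory (
    SubcategoryKey  INT PRIMARY KEY,
    SubcategoryName NVARCHAR(50) NOT NULL,
    CategoryKey     INT NOT NULL REFERENCES DimCategory (CategoryKey)
);

CREATE TABLE DimProduct (
    ProductKey     INT PRIMARY KEY,
    ProductName    NVARCHAR(100) NOT NULL,
    SubcategoryKey INT NOT NULL REFERENCES DimSubcategory (SubcategoryKey)
);

-- Parent-child: a recursive self-join, e.g. an organisation chart.
CREATE TABLE DimEmployee (
    EmployeeKey  INT PRIMARY KEY,
    EmployeeName NVARCHAR(100) NOT NULL,
    ManagerKey   INT NULL REFERENCES DimEmployee (EmployeeKey)  -- NULL at the root
);
```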

If you have keys in your source data that can have equivalent data semantics to a snowflake then you may be able to populate your cube through a series of database views on your source system that present the underlying data in a sufficiently snowflake-like format to use for cube dimensions (I have actually done this on a couple of occasions). Schemas that make heavy use of synthetic keys are more likely to work well for this.
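As an illustration of that approach, here is a hypothetical view over invented OLTP tables (Customers, Regions) that exposes them in a dimension-friendly shape without copying any data:

```sql
-- Hypothetical view over the OLTP schema that presents customers in a
-- snowflake-like shape for use as a cube dimension. The OLTP key is
-- reused directly, which is why synthetic-key schemas suit this well.
CREATE VIEW dbo.vDimCustomer
AS
SELECT
    c.CustomerId AS CustomerKey,  -- reuse the OLTP synthetic key
    c.CustomerName,
    c.RegionId   AS RegionKey,
    r.RegionName
FROM dbo.Customers AS c
JOIN dbo.Regions   AS r
    ON r.RegionId = c.RegionId;
```

If you can't create views on the source (see the next paragraph), the same SELECT can be pasted into a DSV named query instead.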

If your vendor or other parties won't let you add views to your source database you may be able to use a data source view instead. DSVs can have virtual tables called 'named queries' that are populated from a database query.

Fact tables join to dimensions. In SSAS2005+ you can join different fact tables at different grains within a dimension. I wouldn't normally have much use for this in a data warehouse, but this feature might be useful if you're trying to use the source data without having to massage it too heavily.
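For example (hypothetical tables): actual sales might be recorded per day and per product, while a forecast is kept per month and per subcategory; SSAS can join each measure group to the shared Date and Product dimensions at its own grain:

```sql
-- Two fact tables at different grains against the same dimensions.
CREATE TABLE FactSalesDaily (
    DateKey    INT NOT NULL,           -- daily grain
    ProductKey INT NOT NULL,           -- leaf-level product grain
    Amount     DECIMAL(12,2) NOT NULL
);

CREATE TABLE FactForecastMonthly (
    MonthKey       INT NOT NULL,       -- monthly grain
    SubcategoryKey INT NOT NULL,       -- subcategory grain
    ForecastAmount DECIMAL(12,2) NOT NULL
);
```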

If this doesn't work then you may well have to write an ETL process to populate a star or snowflake schema.
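A minimal sketch of such an ETL load in T-SQL (an SSIS package would typically orchestrate steps like these; all schema, table, and variable names are invented):

```sql
-- Incremental load window; in practice this would be tracked in a
-- control table rather than hard-coded.
DECLARE @LastLoadDate DATETIME;
SET @LastLoadDate = '20240101';

-- Step 1: load the dimension first, so every incoming fact row can
-- resolve its surrogate key. (A real load would also handle updates
-- according to your slowly-changing-dimension policy.)
INSERT INTO dw.DimCustomer (CustomerId, CustomerName, Region)
SELECT c.CustomerId, c.CustomerName, r.RegionName
FROM oltp.Customers AS c
JOIN oltp.Regions   AS r ON r.RegionId = c.RegionId
WHERE NOT EXISTS (
    SELECT 1 FROM dw.DimCustomer AS d
    WHERE d.CustomerId = c.CustomerId
);

-- Step 2: load the fact table, translating business keys to
-- surrogate keys through the freshly loaded dimension.
INSERT INTO dw.FactSales (DateKey, CustomerKey, Quantity, Amount)
SELECT
    CONVERT(INT, CONVERT(CHAR(8), o.OrderDate, 112)),  -- yyyymmdd date key
    d.CustomerKey,
    o.Quantity,
    o.Amount
FROM oltp.Orders    AS o
JOIN dw.DimCustomer AS d ON d.CustomerId = o.CustomerId
WHERE o.OrderDate >= @LastLoadDate;
```

Loading dimensions before facts is exactly what makes the ordering problem in proviso 3 below matter.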

A few provisos:

  1. Cubes can be made to run in a real-time mode where they just issue a query to the underlying data. This has some risk of creating inefficient queries against your source data, so it is not recommended unless you are really confident that you know what you're doing.

  2. Apropos of (1), you probably won't be able to use the cubes as a data source for screens in your application. If you need to calculate averages for something that the user wants to see on a screen, you will probably have to calculate it in a stored procedure behind the screen (see the sketch after this list).

  3. If you do this, set up a replicated database and populate the cube off that. Refresh this database periodically, so your ETL process can run from an internally consistent data set. If you run from a live database, you risk loading items that depend on records created after the corresponding load step was run.

    You can have the situation where you run a dimension load, and then new data is entered into the system. When the fact table load runs, it now contains data that is dependent on dimension data that hasn't been loaded. This will break the cube and cause the load process to fail. Batch refreshing a replicated database to run the ETL or cube loads off will mitigate this issue.

    If you don't have the option of a replicated database, you can set up more lenient policies for handling missing data.

  4. If your underlying production data has significant data quality issues they will be reflected in the cubes. GIGO.
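As a sketch of the kind of stored procedure item 2 alludes to (table and procedure names invented), the screen's averages and standard deviations are computed relationally rather than pulled from the cube:

```sql
-- Hypothetical stored procedure backing an application screen:
-- the statistics are computed directly from the relational tables.
CREATE PROCEDURE dbo.GetOrderStatsForCustomer
    @CustomerId INT
AS
BEGIN
    SET NOCOUNT ON;

    SELECT
        COUNT(*)        AS OrderCount,
        AVG(o.Amount)   AS AvgAmount,
        STDEV(o.Amount) AS StdDevAmount  -- T-SQL's sample standard deviation
    FROM dbo.Orders AS o
    WHERE o.CustomerId = @CustomerId;
END;
```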