MapReduce alternatives












Are there any alternative paradigms to MapReduce (Google, Hadoop)? Is there any other reasonable way to split and merge big problems?


Definitely. Check out, for example, Bulk Synchronous Parallel (BSP). Map/Reduce is in fact a very restricted way of decomposing problems, but that restriction is what makes it manageable in a framework like Hadoop. The question is whether it is less trouble to press your problem into a Map/Reduce setting, or whether it's easier to create a domain-specific parallelization scheme and take care of all the implementation details yourself. Pig, in fact, is only an abstraction layer on top of Hadoop that automates many standard transformations of problems from not-Map-Reduce-y to Map-Reduce-compatible.
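To give a feel for how BSP differs from Map/Reduce, here is a minimal single-process sketch, not taken from any real BSP framework. It assumes a toy problem (spreading the maximum value through a graph), and the function name `bsp_max` and its arguments are made up for illustration. Each pass of the loop stands for one superstep: vertices read the messages sent in the previous superstep, update local state, and send new messages that only become visible after the barrier.

```python
from collections import defaultdict

def bsp_max(graph, values):
    """Spread the maximum value through a graph, one superstep at a time.

    graph:  dict mapping vertex -> list of neighbour vertices
    values: dict mapping vertex -> initial value
    (Both are hypothetical inputs chosen for this sketch.)
    """
    values = dict(values)  # do not mutate the caller's dict

    # Superstep 0: every vertex sends its own value to its neighbours.
    inbox = defaultdict(list)
    for vertex, neighbours in graph.items():
        for n in neighbours:
            inbox[n].append(values[vertex])

    # Later supersteps: run until no vertex has anything left to send.
    while inbox:
        outbox = defaultdict(list)
        for vertex, messages in inbox.items():
            best = max(messages)
            if best > values[vertex]:
                # Local state changed, so tell the neighbours.
                values[vertex] = best
                for n in graph[vertex]:
                    outbox[n].append(best)
        # Barrier: messages sent now are only read in the next superstep.
        inbox = outbox
    return values

graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
result = bsp_max(graph, {"a": 3, "b": 6, "c": 2})
```

In a real system the vertices would be spread over many machines and the barrier would be an actual synchronization point; the loop here only mimics that structure.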

Edit 26.1.13: Found a nice up-to-date overview here


Phil Colella identified seven numerical methods for scientific computation, based on the patterns of scattering and gathering data between processing nodes, and called them 'dwarfs'. Others have since added to them; a list is available at the Dwarf Mine:

  1. Dense Linear Algebra
  2. Sparse Linear Algebra
  3. Spectral Methods
  4. N-Body Methods
  5. Structured Grids
  6. Unstructured Grids
  7. MapReduce
  8. Combinational Logic
  9. Graph Traversal
  10. Dynamic Programming
  11. Backtrack and Branch-and-Bound
  12. Graphical Models
  13. Finite State Machines


Update (August 2014): Stratosphere is now called Apache Flink (incubating).

Have a look at Stratosphere. It is another Big Data runtime that offers more operators (map, reduce, join, union, cross, iterate, ...). It also lets you define advanced data flow graphs in a single program (with Hadoop MR, you would have to chain jobs).
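As a rough idea of what "more operators in one data flow" buys you, here is a toy sketch in plain Python. It is not the Stratosphere API; `map_op`, `join_op`, `reduce_op` and the sample data are all invented for illustration. The point is only that a join followed by a map and a reduce can be written as one flow instead of several chained jobs.

```python
from collections import defaultdict

def map_op(records, fn):
    """Apply fn to every record."""
    return [fn(r) for r in records]

def join_op(left, right):
    """Equi-join two lists of (key, value) pairs on their key."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(key, (lv, rv)) for key, lv in left for rv in index[key]]

def reduce_op(pairs, fn):
    """Group (key, value) pairs by key and fold each group with fn."""
    groups = {}
    for key, value in pairs:
        groups[key] = fn(groups[key], value) if key in groups else value
    return groups

def revenue_by_country(orders, customers):
    """One flow: join orders with customers, re-key by country, sum up."""
    joined = join_op(orders, customers)
    by_country = map_op(joined, lambda kv: (kv[1][1], kv[1][0]))
    return reduce_op(by_country, lambda a, b: a + b)

# Hypothetical sample data: (customer_id, amount) and (customer_id, country).
orders = [(1, 20.0), (2, 5.0), (1, 7.5)]
customers = [(1, "DE"), (2, "FR")]
totals = revenue_by_country(orders, customers)
```

With plain Map/Reduce, the join would typically be a job of its own whose output is fed into a second job for the aggregation.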

Stratosphere also supports BSP with its graph processing abstraction (called Spargel).

If you like to read scientific papers, have a look at "Nephele/PACTs: A Programming Model and Execution Framework for Web-Scale Analytical Processing"; it explains the theoretical background of the system.

Another system in the field is Spark, which has its own model (RDDs). Since BSP has been mentioned here, also have a look at GraphLab, which offers an alternative to BSP.