Talend performance

Problem description

We have a requirement where we read data from three different files and join these files on different columns within the same job.

Each file is around 25-30 GB, while our system RAM is just 16 GB. The joins are done with tMap, and Talend keeps all the reference (lookup) data in physical memory. In my case I cannot provide that much memory, so the job fails with an out-of-memory error. If I use the join with the temp-disk option in tMap, the job is dead slow.

Please help me with these questions.

  1. How does Talend process data larger than the RAM size?
  2. Is pipeline parallelism in place with Talend? Am I missing anything in the job to accomplish that?
  3. tUniqRow and join operations are done in physical memory, causing the job to run dead slow. A disk option is available to handle this functionality, but it is too slow.
  4. How can performance be improved without pushing the data to a DB (ELT)? Can Talend handle huge data volumes (millions of rows)? We need to handle this kind of data with a smaller amount of RAM.

Thanks

Recommended answer

Talend processes large amounts of data very quickly and efficiently. It all depends on your knowledge of the Talend platform.

Please consider the comments below as answers to your questions.

Q1. How does Talend process data larger than the RAM size?

A. You cannot use your entire RAM for Talend Studio. Only a fraction of the RAM can be used, roughly half of it.

For example, with 8 GB of memory available on a 64-bit system, the optimal settings can be:

-vmargs
-Xms1024m
-Xmx4096m
-XX:MaxPermSize=512m
-Dfile.encoding=UTF-8

Now, in your case, you either have to increase your RAM (to something on the order of 100 GB),

OR simply write the data to the hard disk. For this you have to choose a temp data directory for buffering components such as tMap, tBufferInput, tAggregateRow, etc.
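
As a side note, the idea behind spilling the lookup data to disk can be illustrated with a rough, plain-Java sketch: hash-partition both inputs into files so that only one small slice of the lookup ever sits in memory at a time. This is only a conceptual sketch under assumed conditions (semicolon-delimited lines, a key column index, enough partitions that each one fits in RAM), not Talend's generated code:

import java.io.*;
import java.util.*;

// Conceptual sketch of a disk-backed join: split both inputs into hash
// partitions on disk, then join partition by partition so only a fraction
// of the lookup data is ever held in memory. Not Talend-generated code.
public class DiskBackedJoinSketch {

    static final int PARTITIONS = 32; // assumption: each partition fits comfortably in RAM

    static int partitionOf(String key) {
        return Math.floorMod(key.hashCode(), PARTITIONS);
    }

    // Phase 1: route every line of a semicolon-delimited file to a partition file by key hash.
    static void partitionFile(File input, File workDir, String prefix, int keyColumn) throws IOException {
        Writer[] writers = new Writer[PARTITIONS];
        for (int i = 0; i < PARTITIONS; i++) {
            writers[i] = new BufferedWriter(new FileWriter(new File(workDir, prefix + i + ".csv")));
        }
        try (BufferedReader in = new BufferedReader(new FileReader(input))) {
            String line;
            while ((line = in.readLine()) != null) {
                String key = line.split(";")[keyColumn];
                writers[partitionOf(key)].write(line + "\n");
            }
        }
        for (Writer w : writers) {
            w.close();
        }
    }

    // Phase 2: for one pair of partition files, load only the (small) lookup side
    // into memory and stream the main side against it.
    static void joinPartition(File mainPart, File lookupPart, Writer out,
                              int mainKeyColumn, int lookupKeyColumn) throws IOException {
        Map<String, String> lookup = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader(lookupPart))) {
            String line;
            while ((line = in.readLine()) != null) {
                lookup.put(line.split(";")[lookupKeyColumn], line);
            }
        }
        try (BufferedReader in = new BufferedReader(new FileReader(mainPart))) {
            String line;
            while ((line = in.readLine()) != null) {
                String match = lookup.get(line.split(";")[mainKeyColumn]);
                if (match != null) {
                    out.write(line + ";" + match + "\n");
                }
            }
        }
    }
}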

Q2. Is pipeline parallelism in place with Talend? Am I missing anything in the job to accomplish that?

A. In Talend Studio, parallelization of data flows means partitioning the input data flow of a subjob into parallel processes and executing them simultaneously, so as to gain better performance.

However, this feature is available only if you have subscribed to one of the Talend Platform solutions.

When you have to develop a job that processes very large data with Talend Studio, you can enable or disable parallelization with a single click, and the Studio then automates the implementation across the given job.

[Screenshot: enabling parallelization for the job in Talend Studio]

Parallel execution: the implementation of parallelization involves four key steps, explained as follows:

Partitioning: In this step, the Studio splits the input records into a given number of threads.

Collecting: In this step, the Studio collects the split threads and sends them to a given component for processing.

Departitioning: In this step, the Studio groups the outputs of the parallel executions of the split threads.

Recollecting: In this step, the Studio captures the grouped execution results and outputs them to a given component.
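
To make the four steps above a little more concrete, here is a rough stand-alone Java sketch of the same idea (split the rows, process the slices on a thread pool, regroup the results). It is not the code Talend generates; the row data and the upper-casing step are placeholders:

import java.util.*;
import java.util.concurrent.*;
import java.util.stream.*;

// Conceptual illustration of partition / collect / departition / recollect.
public class ParallelFlowSketch {
    public static void main(String[] args) throws Exception {
        List<String> inputRows = IntStream.range(0, 1000)
                .mapToObj(i -> "row-" + i)
                .collect(Collectors.toList());

        int threads = 4;
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        // Partitioning: split the input rows into one slice per thread.
        List<List<String>> partitions = new ArrayList<>();
        int sliceSize = (inputRows.size() + threads - 1) / threads;
        for (int i = 0; i < inputRows.size(); i += sliceSize) {
            partitions.add(inputRows.subList(i, Math.min(i + sliceSize, inputRows.size())));
        }

        // Collecting: hand each slice to a worker that processes it (here: upper-casing).
        List<Future<List<String>>> futures = new ArrayList<>();
        for (List<String> slice : partitions) {
            futures.add(pool.submit(() ->
                    slice.stream().map(String::toUpperCase).collect(Collectors.toList())));
        }

        // Departitioning: group the outputs of the parallel executions back together.
        List<String> merged = new ArrayList<>();
        for (Future<List<String>> f : futures) {
            merged.addAll(f.get());
        }
        pool.shutdown();

        // Recollecting: pass the grouped result on to the next step (here: just print a count).
        System.out.println("Processed " + merged.size() + " rows on " + threads + " threads");
    }
}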

Q3. tUniqRow and join operations are done in physical memory, causing the job to run dead slow. A disk option is available to handle this functionality, but it is too slow.

Q4. How can performance be improved without pushing the data to a DB (ELT)? Can Talend handle huge data volumes (millions of rows) with a smaller amount of RAM?

A3 & A4. Here I would suggest inserting the data directly into a database using the tOutputBulkExec components, and then applying these operations at the DB level using ELT components.
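
For illustration only, the ELT idea amounts to something like the following: once the files are bulk-loaded into staging tables, a single SQL statement executed inside the database does the join instead of the job's JVM. The JDBC URL, credentials, and the table/column names below are hypothetical placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch of pushing the join down to the database (ELT) after a bulk load.
// URL, credentials, tables and columns are placeholders, not real objects.
public class EltJoinSketch {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/staging", "user", "password");
             Statement st = con.createStatement()) {

            // The heavy three-way join runs inside the database engine,
            // not in the Talend job's memory.
            st.executeUpdate(
                "INSERT INTO joined_result (id, col_a, col_b, col_c) "
                + "SELECT f1.id, f1.col_a, f2.col_b, f3.col_c "
                + "FROM file1_stage f1 "
                + "JOIN file2_stage f2 ON f1.key2 = f2.key2 "
                + "JOIN file3_stage f3 ON f1.key3 = f3.key3");
        }
    }
}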

Other recommended answer

You can also try out some changes in the job definition itself, for example:

-- Use streaming.
-- Use trimming for big string data, so that unnecessary data is not transferred.
-- Use OnSubjobOk instead of OnComponentOk as the connector, so the garbage collector has a chance to free more memory in time.
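
As a small illustration of the trimming point: a tMap output column takes a plain Java expression, so whitespace can be stripped before the row travels further. The column name below is hypothetical and this is a stand-alone demonstration, not Talend-generated code:

// Stand-alone demonstration of the kind of trim expression you might put into a
// tMap output column, e.g. row1.comment == null ? null : row1.comment.trim()
// "comment" is a hypothetical column name.
public class TrimExpressionSketch {
    public static void main(String[] args) {
        String comment = "   lots of padded text   ";
        String trimmed = (comment == null) ? null : comment.trim();
        System.out.println("[" + trimmed + "]"); // prints [lots of padded text]
    }
}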