Problem Description
I am testing my Spark Streaming application, and I have multiple functions in my code: some of them operate on a DStream[RDD[XXX]], and some of them on an RDD[XXX] (after I do DStream.foreachRDD).
I use the Kafka log4j appender to log business cases that occur within my functions, which operate on both the DStream[RDD] and the RDD itself.
However, data gets appended to Kafka only from the functions that operate on the RDD; it doesn't work when I want to append data to Kafka from my functions that operate on the DStream.
Does anyone know the reason for this behaviour?
I am working on a single virtual machine, where I have both Spark and Kafka. I submit applications using spark-submit.
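For context, a simplified sketch of the two kinds of functions (the names are placeholders, not my actual code):

import org.apache.log4j.LogManager
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// Simplified sketch of the two kinds of functions described above.
object Processing {
  val kafkaLogger = LogManager.getLogger("kafkaLogger")

  // Operates on the whole DStream, before foreachRDD.
  def handleStream(stream: DStream[String]): Unit = {
    kafkaLogger.info("business case detected at DStream level")
    stream.foreachRDD(rdd => handleBatch(rdd))
  }

  // Operates on a single RDD, inside foreachRDD.
  def handleBatch(rdd: RDD[String]): Unit = {
    kafkaLogger.info(s"business case detected in a batch of ${rdd.count()} records")
  }
}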
EDITED
Actually, I have figured out part of the problem. Data gets appended to Kafka only from the part of the code that is in my main function. None of the code outside of my main writes data to Kafka.
In main I declared the logger like this:
val kafkaLogger = org.apache.log4j.LogManager.getLogger("kafkaLogger")
While outside of my main, I had to declare it like:
@transient lazy val kafkaLogger = org.apache.log4j.LogManager.getLogger("kafkaLogger")
in order to avoid serialization issues.
The reason might lie in how JVM serialization works, or it might simply be that the workers don't see the log4j configuration file (but my log4j file is in my source code, in the resources folder).
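For clarity, a minimal sketch of how the two declarations are laid out (the class and method names are made up for illustration):

import org.apache.log4j.LogManager

// Helper whose methods may run inside Spark closures shipped to executors.
// @transient keeps the logger out of serialization; lazy recreates it on first use in each JVM.
class EventProcessor extends Serializable {
  @transient lazy val kafkaLogger = LogManager.getLogger("kafkaLogger")

  def process(event: String): String = {
    kafkaLogger.info(s"processing $event")
    event.toUpperCase
  }
}

object Main {
  def main(args: Array[String]): Unit = {
    // Inside main (driver side only), a plain val works.
    val kafkaLogger = LogManager.getLogger("kafkaLogger")
    kafkaLogger.info("application started")
  }
}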
EDITED 2
I have tried many ways to send the log4j file to the executors, but none of them work. I tried:
- sending the log4j file with the --files option of spark-submit
- setting --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/home/vagrant/log4j.properties" in spark-submit
- setting the log4j.properties file in --driver-class-path of spark-submit...
None of these options worked.
Does anyone have a solution? I do not see any errors in my error log.
Thank you
Recommended Answer
I think you are close. First, you want to make sure all the files are exported to the WORKING DIRECTORY (not the CLASSPATH) on all nodes using the --files flag. Then you want to reference those files in the extraClassPath option of the executors and the driver. I have attached the command below; I hope it helps. The key is to understand that once the files are exported, they can all be accessed on each node using just the file name relative to the working directory (and not a URL path).
Note: Putting the log4j file in the resources folder will not work (at least it didn't when I tried).
sudo -u hdfs spark-submit \
  --class "SampleAppMain" \
  --master yarn \
  --deploy-mode cluster \
  --verbose \
  --files file:///path/to/custom-log4j.properties,hdfs:///path/to/jar/kafka-log4j-appender-0.9.0.0.jar \
  --conf "spark.driver.extraClassPath=kafka-log4j-appender-0.9.0.0.jar" \
  --conf "spark.executor.extraClassPath=kafka-log4j-appender-0.9.0.0.jar" \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=custom-log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=custom-log4j.properties" \
  /path/to/your/jar/SampleApp-assembly-1.0.jar
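For reference, the custom-log4j.properties shipped by this command could look roughly like the sketch below; the broker address and topic name are placeholders you would replace with your own values:

# Root logger keeps writing to the console as usual
log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d %p %c: %m%n

# Kafka appender class shipped in kafka-log4j-appender-0.9.0.0.jar
log4j.appender.KAFKA=org.apache.kafka.log4jappender.KafkaLog4jAppender
log4j.appender.KAFKA.brokerList=localhost:9092
log4j.appender.KAFKA.topic=app-business-events
log4j.appender.KAFKA.layout=org.apache.log4j.PatternLayout
log4j.appender.KAFKA.layout.ConversionPattern=%d %p %m%n

# Route only the named logger to Kafka; additivity=false avoids feedback through the root logger
log4j.logger.kafkaLogger=INFO, KAFKA
log4j.additivity.kafkaLogger=false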