Problem Description
We have started using a third party platform (GigaSpaces) that helps us with distributed computing. One of the major problems we are trying to solve now is how to manage our log files in this distributed environment. We have the following setup currently.
Our platform is distributed over 8 machines. On each machine we have 12-15 processes that log to separate log files using java.util.logging. On top of this platform we have our own applications that use log4j and log to separate files. We also redirect stdout to a separate file to catch thread dumps and similar.
This results in about 200 different log files.
As of now we have no tooling to assist in managing these files. In the following cases this causes us serious headaches.
- Troubleshooting when we do not know beforehand in which process the problem occurred. In this case we currently log into each machine using ssh and start using grep.
- Trying to be proactive by regularly checking the logs for anything out of the ordinary. In this case we also currently log in to all machines and look at different logs using less and tail.
- Setting up alerts. We are looking to set up alerts on events over a threshold. This is looking to be a pain with 200 log files to check.
Today we have only about 5 log events per second, but that will increase as we migrate more and more code to the new platform.
I would like to ask the community the following questions.
- How have you handled similar cases with many log files distributed over several machines logged through different frameworks?
- Why did you choose that particular solution?
- How did your solutions work out? What did you find good and what did you find bad?
Many thanks.
Update
We ended up evaluating a trial version of Splunk. We are very happy with how it works and have decided to purchase it. Easy to set up, fast searches and a ton of features for the technically inclined. I can recommend anyone in similar situations to check it out.
Recommended Answer
I would recommend piping all your Java logging to Simple Logging Facade for Java (SLF4J) and then redirecting all logs from SLF4J to LogBack. SLF4J has special support for handling all popular legacy APIs (log4j, commons-logging, java.util.logging, etc.); see here.
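For java.util.logging you install SLF4J's bridge handler once at startup (from the jul-to-slf4j jar); log4j and commons-logging are handled by dropping in the log4j-over-slf4j and jcl-over-slf4j jars in place of the originals. A minimal sketch of the JUL side, assuming jul-to-slf4j and logback are on the classpath:

```java
import org.slf4j.bridge.SLF4JBridgeHandler;

public final class LoggingBootstrap {

    public static void init() {
        // Drop the default java.util.logging handlers and route all
        // JUL records through SLF4J (and from there into LogBack).
        SLF4JBridgeHandler.removeHandlersForRootLogger();
        SLF4JBridgeHandler.install();
    }
}
```

Call this once, as early as possible in each process, before any legacy code starts logging.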
Once you have your logs in LogBack you can use one of its many appenders to aggregate logs over several machines; for details, see the manual section about appenders. Socket, JMS and SMTP seem to be the most obvious candidates.
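As a rough idea of the shipping side, here is a minimal logback.xml sketch using the SocketAppender to forward events to a central collector; the host name is a placeholder and 4560 is just the port conventionally used with logback's SimpleSocketServer:

```xml
<configuration>
  <!-- Ship serialized logging events to one central log server. -->
  <appender name="SOCKET" class="ch.qos.logback.classic.net.SocketAppender">
    <remoteHost>logs.example.internal</remoteHost> <!-- placeholder collector host -->
    <port>4560</port>
    <reconnectionDelay>10000</reconnectionDelay>   <!-- retry if the connection drops -->
  </appender>

  <root level="INFO">
    <appender-ref ref="SOCKET"/>
  </root>
</configuration>
```

On the receiving machine something like SimpleSocketServer (or your own receiver) then writes the incoming events to a single consolidated log.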
LogBack also has built-in support for monitoring for special conditions in log files and for filtering the events sent to a particular appender. So you could set up an SMTP appender to send you an e-mail every time there is an ERROR level event in the logs.
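For the alerting case, a hedged sketch of such an appender (the mail host and addresses are placeholders):

```xml
<appender name="ALERT" class="ch.qos.logback.classic.net.SMTPAppender">
  <!-- Only events at ERROR level or above reach this appender. -->
  <filter class="ch.qos.logback.classic.filter.ThresholdFilter">
    <level>ERROR</level>
  </filter>
  <smtpHost>mail.example.internal</smtpHost>   <!-- placeholder -->
  <to>oncall@example.com</to>                  <!-- placeholder -->
  <from>logging@example.com</from>             <!-- placeholder -->
  <subject>ERROR: %logger{20} - %m</subject>
  <layout class="ch.qos.logback.classic.PatternLayout">
    <pattern>%date %level [%thread] %logger{35} - %msg%n</pattern>
  </layout>
</appender>
```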
Finally, to ease troubleshooting, be sure to add some sort of requestID to all your incoming "requests"; see my answer to this question for details.
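One common way to carry such an ID with SLF4J is the MDC (mapped diagnostic context): set the ID where the request enters the system and reference it in the log pattern with %X{requestId}. A minimal sketch; the wrapper class and the use of a random UUID are only illustrative:

```java
import java.util.UUID;
import org.slf4j.MDC;

public class RequestIdScope {

    // Hypothetical entry point: run the handling of one incoming request
    // with a fresh requestId bound to the current thread.
    public void runWithRequestId(Runnable handler) {
        MDC.put("requestId", UUID.randomUUID().toString());
        try {
            handler.run(); // every log statement on this thread now carries the id
        } finally {
            MDC.remove("requestId"); // don't leak the id into the next request
        }
    }
}
```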
EDIT: you could also implement your own custom LogBack appender and redirect all logs to Scribe.
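A custom appender boils down to extending AppenderBase and implementing append(); the sketch below only hints at the Scribe side, since the real Thrift-generated client call depends on your setup and sendToScribe is a hypothetical placeholder:

```java
import ch.qos.logback.classic.spi.ILoggingEvent;
import ch.qos.logback.core.AppenderBase;

public class ScribeAppender extends AppenderBase<ILoggingEvent> {

    private String category = "default"; // Scribe category, configurable from logback.xml

    @Override
    protected void append(ILoggingEvent event) {
        // Forward one formatted log line to Scribe.
        sendToScribe(category, event.getFormattedMessage());
    }

    // Hypothetical placeholder: a real implementation would hold a
    // Thrift-generated Scribe client and call its Log() method here.
    private void sendToScribe(String category, String message) {
        // open connection / enqueue message / handle failures
    }

    public void setCategory(String category) {
        this.category = category;
    }
}
```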
Other Recommended Answer
An interesting option to explore would be to run a Hadoop Cluster on those nodes and write a custom MapReduce job for searching and aggregating results specific to your applications.
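As a rough sketch of what the map side of such a job could look like, here is a minimal "grep" mapper that emits only the lines containing a search term; the loggrep.term configuration key is a hypothetical name:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Minimal grep-style mapper: emits the log lines that contain a configured term.
public class LogGrepMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    private String term;

    @Override
    protected void setup(Context context) {
        // Hypothetical configuration key carrying the search term; defaults to "ERROR".
        term = context.getConfiguration().get("loggrep.term", "ERROR");
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        if (line.toString().contains(term)) {
            context.write(offset, line); // keep the byte offset as a crude key
        }
    }
}
```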
Other Recommended Answer
I'd suggest taking a look at a log aggregation tool like Splunk or Scribe.
(Also, I think this is more of a ServerFault question, as it has to do with administration of your app and its data, not so much with creating the app.)