在Java中设计高性能的状态机[英] Designing high-performance State Machine in Java

本文是小编为大家收集整理的关于在Java中设计高性能的状态机的处理/解决方法,可以参考本文帮助大家快速定位并解决问题,中文翻译不准确的可切换到English标签页查看源文。

问题描述

我正在着手编写一个 Java 库来实现高性能有限状态机.我知道那里有很多库,但我想从头开始编写自己的库,因为几乎所有的库都构建了优化的自动机,一次只能处理一个.

我想知道 SO 社区中涉足状态机设计的人们认为,在实现此类高性能库时,最重要/最好的设计原则是什么.

注意事项

  1. 生成的自动机通常不会很大.(约 100-500 个州).
  2. 不过,实现应该能够扩展.
  3. 实施应支持快速转换(最小化、确定性等).
  4. 希望实现 DFA、NFA、GNFA、PDA 和可能的 Tree Automata.如果可能的话,希望在单一界面下.
  5. 应该在内存使用和性能之间取得良好的平衡.

我目前关于设计的问题是:

  1. 应该定义State、Symbol 和Transition 的类吗?或者应该使用"隐藏"的内部结构.就我个人而言,我觉得像这样使用类会浪费大量内存,因为相同的信息可以以更简洁的形式存储.但是,这是否可以实现更快的转换?它还有其他优点/缺点吗?

  2. 在内部存储数据的最佳方式是什么?使用像 HashMap 和 HashSet 这样的数据结构可以实现分摊的常数时间查找,但是会涉及到一个开销元素.这是最好的方法吗?将转换信息存储为原始(或非)数组似乎会浪费大量内存.尤其是当库需要一次处理大量自动机时.不同数据结构的优缺点是什么?

感谢任何意见.谢谢!

推荐答案

你希望它有多快?brics.dk/automaton 的代码确实声明了自己的 State 和 Transition 类,但显然,这些可以使用原语重写(哎呀,整个Transition 类的状态显然很容易适应long).

问题是,例如,如果您将 Transition 类移动为简单的原语,那么您将不再被迫使用缓慢的 HashMap<Transition,...> 默认 Java 集合:您可以使用 Trove 的 TLongObjectHashMap 之类的库(或 TLongInt... 或 TLongLong 等)拥有默认的 HashMap 大时代(Trove 库基本上提供了超级高效的地图和集合,既快又小,当您使用原语时:您不会产生无数垃圾,也不会不断不必要地环绕原语,因此减少 GC 等.如果您对性能感兴趣,那么您确实想要检查 Trove... 他们即将发布的 3.0 版本是 20%比 Trove 2.0 更快).

但它真的有用吗?显然这个库已经足够快了.毫无疑问,通过不浪费地创建对象和使用确实表现良好的集合,它可以变得更快,但尚不清楚它是否可取.

除此之外,我很确定上面的库不是线程安全的.State 构造函数通过执行以下操作创建一个唯一 ID:

static int next_id;
.
.
.
id = next_id++;

并且该构造函数是从... 90 个不同的地方调用的!

不在多线程场景中创建唯一 ID 的教科书示例(哎呀,即使使 next_id volatile 也不够,你比如说,这里想要一个 AtomicInteger).我不太了解图书馆,但这个 ID 的东西在我看来非常很可疑.

本文地址:https://www.itbaoku.cn/post/627402.html

问题描述

I am in the process of starting to write a Java library to implement high-performance Finite State Machines. I know there are a lot of libraries out there, but I want to write my own from scratch, as almost all the libraries out there construct automatons optimized for handling only one at a time.

I would like to know what the people in the SO community who have dabbled in state machine design feels are the most important / best design principles when it comes to implementing high-performance libraries like these.

Considerations

  1. The automatons generated are typically not massive. (~ 100-500 states).
  2. The implementation should be able to scale though.
  3. The implementation should enable fast transformations (minimization, determinization etc.).
  4. Looking to implement DFA, NFA, GNFA, PDA and possibly Tree Automata. Hopefully under a single interface if possible.
  5. Should have a good balance between memory use and performance.

Current questions regarding design for me at the moment are:

  1. Should classes for State, Symbol and Transition be defined? Or should a "hidden" internal structure be used. Personally I feel that using classes as such would waste a lot of memory since the same information can be stored in a much more condensed form. But, does this enable faster transformations? Does it hold any other pros / cons?

  2. What would be the best way to store the data internally? Using data structures like HashMap and HashSet enables amortized constant time lookups, but there is an element of overhead involved. Is this the best way? Storing the transition information as a primitive (or not) array seems to waste quite a bit of memory. Especially when the library needs to handle a lot of automatons at a time. What are the pros / cons of the different data structures?

I appreciate any input. Thanks!

推荐答案

Well how fast do you want it to be? The code at brics.dk/automaton does declare its own State and Transition classes altough, obviously, these could be rewritten using primitives (heck, the entire Transition class's state apparently would easily fit on a long).

Thing is, if you move, for example, the Transition class to simply a primitive, then you're not forced to use anymore the slow HashMap<Transition,...> default Java collections: you can use libraries like Trove's TLongObjectHashMap (or TLongInt... or TLongLong, whatever) which owns the default HashMap big times (the Trove libraries basically provides maps and sets that are super efficient, both fast and small, when you work with primitives: you don't generate countless garbage nor constant needless wrapping around primitives, so less GC etc. If you're into performance, then you do want to check Trove... And their 3.0 upcoming release is 20% faster than Trove 2.0).

But is it really useful? Apparently that library is already plenty of fast. There's no doubt it can be made faster by not wastefully creating objects and by using collections that do actually perform well but it's not clear that it would be desirable.

Besides that, I'm pretty sure that the library above is not thread safe. The State constructor creates a unique ID by doing this:

static int next_id;
.
.
.
id = next_id++;

and that constructor is called from... 90 different places!

Textbook example of a way to not create a unique ID in a multi-threaded scenario (heck, even making next_id volatile wouldn't be sufficient, you want, say, an AtomicInteger here). I don't know the library well enough but this ID thinggy looks very fishy to me.