快速检测字符串中n-gram的方法?[英] quicker way to detect n-grams in a string?

问题描述

我在 SO 上找到了这个解决方案来检测字符串中的 n-gram:(这里:从一个句子生成 N-gram)

import java.util.*;

public class Test {

    public static List<String> ngrams(int n, String str) {
        List<String> ngrams = new ArrayList<String>();
        String[] words = str.split(" ");
        for (int i = 0; i < words.length - n + 1; i++)
            ngrams.add(concat(words, i, i+n));
        return ngrams;
    }

    public static String concat(String[] words, int start, int end) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i < end; i++)
            sb.append((i > start ? " " : "") + words[i]);
        return sb.toString();
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 3; n++) {
            for (String ngram : ngrams(n, "This is my car."))
                System.out.println(ngram);
            System.out.println();
        }
    }
}

=> 与毫秒相比,这段代码花费了迄今为止最长的处理时间(我的语料库检测 1-gram、2-gram、3-gram 和 4gram 需要 28 秒:4Mb 的原始文本)用于其他操作(去除停用词等)

有没有人知道 Java 中的解决方案比上面介绍的循环解决方案更快?(我在想多线程,使用集合,或者可能是创造性的方法来拆分字符串......?)谢谢!

推荐答案

你可以试试这样的:

public class NGram {

    private final int n;
    private final String text;

    private final int[] indexes;
    private int index = -1;
    private int found = 0;

    public NGram(String text, int n) {
        this.text = text;
        this.n = n;
        indexes = new int[n];
    }

    private boolean seek() {
        if (index >= text.length()) {
            return false;
        }
        push();
        while(++index < text.length()) {
            if (text.charAt(index) == ' ') {
                found++;
                if (found<n) {
                    push();
                } else {
                    return true;
                }
            }
        }
        return true;
    }

    private void push() {
        for (int i = 0; i < n-1; i++) {
            indexes[i] = indexes[i+1];
        }
        indexes[n-1] = index+1;
    }

    private List<String> list() {
        List<String> ngrams = new ArrayList<String>();
        while (seek()) {
            ngrams.add(get());
        }
        return ngrams;
    }

    private String get() {
        return text.substring(indexes[0], index);
    }
}

对大约 5mb 的文本进行测试,它的执行速度似乎比原始代码快 10 倍.这里的主要区别是regex不用于拆分,并且ngram字符串不是通过连接创建的.

更新:这是我在上面提到的文本 ngram 1-4 上运行时得到的输出.我使用 2GB 内存来确定运行期间对 GC 的影响.跑了多次,看看热点编译器的影响.

Loop 01 Code mine ngram 1 time 071ms ngrams 294121
Loop 01 Code orig ngram 1 time 534ms ngrams 294121
Loop 01 Code mine ngram 2 time 016ms ngrams 294120
Loop 01 Code orig ngram 2 time 360ms ngrams 294120
Loop 01 Code mine ngram 3 time 082ms ngrams 294119
Loop 01 Code orig ngram 3 time 319ms ngrams 294119
Loop 01 Code mine ngram 4 time 014ms ngrams 294118
Loop 01 Code orig ngram 4 time 439ms ngrams 294118

Loop 10 Code mine ngram 1 time 013ms ngrams 294121
Loop 10 Code orig ngram 1 time 268ms ngrams 294121
Loop 10 Code mine ngram 2 time 014ms ngrams 294120
Loop 10 Code orig ngram 2 time 323ms ngrams 294120
Loop 10 Code mine ngram 3 time 013ms ngrams 294119
Loop 10 Code orig ngram 3 time 412ms ngrams 294119
Loop 10 Code mine ngram 4 time 014ms ngrams 294118
Loop 10 Code orig ngram 4 time 423ms ngrams 294118

本文地址:https://www.itbaoku.cn/post/978350.html