在一个大的字符串集合中寻找相似的字符串组[英] Finding groups of similar strings in a large set of strings

本文是小编为大家收集整理的关于在一个大的字符串集合中寻找相似的字符串组的处理/解决方法,可以参考本文帮助大家快速定位并解决问题,中文翻译不准确的可切换到English标签页查看源文。

问题描述

我有一组相当大的字符串(例如100),该字符串具有许多以相似性为特征的亚组.我正在尝试找到/设计一种算法,该算法可以非常有效地发现这些组.

作为一个例子,假设输入列表在下面的左侧,并且输出组在右侧.

Input                           Output
-----------------               -----------------
Jane Doe                        Mr Philip Roberts
Mr Philip Roberts               Phil Roberts     
Foo McBar                       Philip Roberts   
David Jones                     
Phil Roberts                    Foo McBar        
Davey Jones            =>         
John Smith                      David Jones      
Philip Roberts                  Dave Jones       
Dave Jones                      Davey Jones      
Jonny Smith                     
                                Jane Doe         

                                John Smith       
                                Jonny Smith 

有人知道有什么方法可以合理地解决这个问题?

查找类似字符串的标准方法似乎是Levenshtein距离,但是我看不到我在这里如何很好地利用它,而无需将每个字符串与列表中的其他每个字符串进行比较,然后以某种方式决定在确定两个字符串是否在同一组中的差阈值上.

一种替代方案是将字符串降低到整数的算法,其中类似的字符串哈希与数字线上靠近的整数相似.我不知道那是什么算法,但是,即使存在

有人有任何想法/指针吗?


更新: @will A:也许名字不如我第一次想到的榜样.作为起点,我认为我可以假设在我将要使用的数据中,字符串的一个小更改不会使它从一个组跳到另一组.

推荐答案

另一种流行的方法是通过其Jaccard索引将字符串关联.从 http://en.wikipedia.org/wiki/jaccard_index .

这是一篇有关使用jaccard-index(和其他几种方法)来解决类似您的问题的文章:

http://matpalm.com/resemblance/

其他推荐答案

您要解决的问题是典型的 clusterization "> clusterization 问题.<<<<<<<<<<<<<<<<<<<<<<<<

以简单的 k-means algorithm并使用levenshtein距离作为计算的功能元素和簇中心之间的距离.

btw,levenshtein距离计算的算法在apache commons stringutils中实现 - stringutils.getlevenshteindistance

K-均值的主要问题是您应该指定群集数(您的条款中的子组).因此,您将有2个选项:改进具有EURISTIS的K均值或使用另一种群集化算法,该算法不需要指定簇编号(但是该算法可能会显示出更差的性能,并且如果您决定实现它,则可能非常困难你自己).

其他推荐答案

如果我们要谈论的是可用的单词,请比较其 senephone 可能有帮助:

MRFLPRBRTS: Mr Philip Roberts
FLRBRTS: Phil Roberts   
FLPRBRTS: Philip Roberts 
FMKBR: Foo McBar      
TFTJNS: David Jones    
TFJNS: Dave Jones     
TFJNS: Davey Jones    
JNT: Jane Doe       
JNSM0: John Smith     
JNSM0: Jonny Smith

本文地址:https://www.itbaoku.cn/post/627614.html

问题描述

I have a reasonably large set of strings (say 100) which has a number of subgroups characterised by their similarity. I am trying to find/design an algorithm which would find theses groups reasonably efficiently.

As an example let's say the input list is on the left below, and the output groups are on the right.

Input                           Output
-----------------               -----------------
Jane Doe                        Mr Philip Roberts
Mr Philip Roberts               Phil Roberts     
Foo McBar                       Philip Roberts   
David Jones                     
Phil Roberts                    Foo McBar        
Davey Jones            =>         
John Smith                      David Jones      
Philip Roberts                  Dave Jones       
Dave Jones                      Davey Jones      
Jonny Smith                     
                                Jane Doe         

                                John Smith       
                                Jonny Smith 

Does anybody know of any ways to solve this reasonably efficiently?

The standard method for finding similar strings seems to be the Levenshtein distance, but I can't see how I can make good use of that here without having to compare every string to every other string in the list, and then somehow decide on a difference threshold for deciding if the two strings are in the same group or not.

An alternative would be an algorithm that hashes strings down to an integer, where similar strings hash to integers which are close together on the number-line. I have no idea what algorithm that would be though, if one even exists

Does anybody have any thoughts/pointers?


UPDATE: @Will A: Perhaps names weren't as good an example as I first thought. As a starting point I think I can assume that in the data I will be working with, a small change in a string will not make it jump from one group to another.

推荐答案

Another popular method is to associate the strings by their Jaccard index. Start with http://en.wikipedia.org/wiki/Jaccard_index.

Here's a article about using the Jaccard-index (and a couple of other methods) to solve a problem like yours:

http://matpalm.com/resemblance/

其他推荐答案

The problem you're trying to solve is a typical clusterization problem.

Start with simple K-Means algorithm and use Levenshtein distance as a function for calculating distance between elements and clusters centers.

BTW, algorithm for Levenshtein distance calculation is implemented in Apache Commons StringUtils - StringUtils.getLevenshteinDistance

The main problem of K-Means is that you should specify the number of clusters (subgroups in your terms). So, you'll have 2 options: improve K-Means with some euristic or use another clusterization algorithm which doesn't require specifying clusters number (but that algorithm can show worse performance and can be very difficult in implemenation if you decide to implement it yourself).

其他推荐答案

If we're talking about actual pronouncable words, comparing the (start of) their metaphone might be of assistance:

MRFLPRBRTS: Mr Philip Roberts
FLRBRTS: Phil Roberts   
FLPRBRTS: Philip Roberts 
FMKBR: Foo McBar      
TFTJNS: David Jones    
TFJNS: Dave Jones     
TFJNS: Davey Jones    
JNT: Jane Doe       
JNSM0: John Smith     
JNSM0: Jonny Smith