# 在一个大的字符串集合中寻找相似的字符串组[英] Finding groups of similar strings in a large set of strings

### 问题描述

```Input                           Output
-----------------               -----------------
Jane Doe                        Mr Philip Roberts
Mr Philip Roberts               Phil Roberts
Foo McBar                       Philip Roberts
David Jones
Phil Roberts                    Foo McBar
Davey Jones            =>
John Smith                      David Jones
Philip Roberts                  Dave Jones
Dave Jones                      Davey Jones
Jonny Smith
Jane Doe

John Smith
Jonny Smith
```

## 其他推荐答案

btw，levenshtein距离计算的算法在apache commons stringutils中实现 - stringutils.getlevenshteindistance

K-均值的主要问题是您应该指定群集数(您的条款中的子组).因此，您将有2个选项:改进具有EURISTIS的K均值或使用另一种群集化算法，该算法不需要指定簇编号(但是该算法可能会显示出更差的性能，并且如果您决定实现它，则可能非常困难你自己).

## 其他推荐答案

```MRFLPRBRTS: Mr Philip Roberts
FLRBRTS: Phil Roberts
FLPRBRTS: Philip Roberts
FMKBR: Foo McBar
TFTJNS: David Jones
TFJNS: Dave Jones
TFJNS: Davey Jones
JNT: Jane Doe
JNSM0: John Smith
JNSM0: Jonny Smith
```

### 问题描述

I have a reasonably large set of strings (say 100) which has a number of subgroups characterised by their similarity. I am trying to find/design an algorithm which would find theses groups reasonably efficiently.

As an example let's say the input list is on the left below, and the output groups are on the right.

```Input                           Output
-----------------               -----------------
Jane Doe                        Mr Philip Roberts
Mr Philip Roberts               Phil Roberts
Foo McBar                       Philip Roberts
David Jones
Phil Roberts                    Foo McBar
Davey Jones            =>
John Smith                      David Jones
Philip Roberts                  Dave Jones
Dave Jones                      Davey Jones
Jonny Smith
Jane Doe

John Smith
Jonny Smith
```

Does anybody know of any ways to solve this reasonably efficiently?

The standard method for finding similar strings seems to be the Levenshtein distance, but I can't see how I can make good use of that here without having to compare every string to every other string in the list, and then somehow decide on a difference threshold for deciding if the two strings are in the same group or not.

An alternative would be an algorithm that hashes strings down to an integer, where similar strings hash to integers which are close together on the number-line. I have no idea what algorithm that would be though, if one even exists

Does anybody have any thoughts/pointers?

UPDATE: @Will A: Perhaps names weren't as good an example as I first thought. As a starting point I think I can assume that in the data I will be working with, a small change in a string will not make it jump from one group to another.

## 推荐答案

Another popular method is to associate the strings by their Jaccard index. Start with http://en.wikipedia.org/wiki/Jaccard_index.

Here's a article about using the Jaccard-index (and a couple of other methods) to solve a problem like yours:

http://matpalm.com/resemblance/

## 其他推荐答案

The problem you're trying to solve is a typical clusterization problem.

Start with simple K-Means algorithm and use Levenshtein distance as a function for calculating distance between elements and clusters centers.

BTW, algorithm for Levenshtein distance calculation is implemented in Apache Commons StringUtils - StringUtils.getLevenshteinDistance

The main problem of K-Means is that you should specify the number of clusters (subgroups in your terms). So, you'll have 2 options: improve K-Means with some euristic or use another clusterization algorithm which doesn't require specifying clusters number (but that algorithm can show worse performance and can be very difficult in implemenation if you decide to implement it yourself).

## 其他推荐答案

If we're talking about actual pronouncable words, comparing the (start of) their metaphone might be of assistance:

```MRFLPRBRTS: Mr Philip Roberts
FLRBRTS: Phil Roberts
FLPRBRTS: Philip Roberts
FMKBR: Foo McBar
TFTJNS: David Jones
TFJNS: Dave Jones
TFJNS: Davey Jones
JNT: Jane Doe
JNSM0: John Smith
JNSM0: Jonny Smith
```