Efficient comparison of 100.000 vectors



Problem description

I save 100.000 vectors in a database. Each vector has dimension 60 (int vector[60]).

Then I take one and want to present vectors to the user in order of decreasing similarity to the chosen one.

I use the Tanimoto classifier to compare 2 vectors:

T(A, B) = dot(A, B) / (|A|^2 + |B|^2 - dot(A, B))
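In code, that coefficient is a one-liner; here is a minimal numpy sketch (the function and variable names are illustrative, not from the question):

import numpy as np

def tanimoto(a, b):
    # Tanimoto similarity: dot(a, b) / (|a|^2 + |b|^2 - dot(a, b))
    ab = np.dot(a, b)
    return ab / (np.dot(a, a) + np.dot(b, b) - ab)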

Is there any method to avoid going through all entries in the database?

One more thing! I don't need to sort all vectors in the database. I want to get the top 20 most similar vectors. So maybe we can roughly threshold out 60% of the entries and use the rest for sorting. What do you think?

Recommended answer

First, preprocess your vector list so that each vector is normalized to unit magnitude. Notice that your comparison function T() then has magnitude terms that become constant, and the formula can be simplified to finding the largest dot product between your test vector and the values in the database.
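A minimal sketch of that preprocessing step, assuming the whole 100.000 x 60 database fits in memory as a numpy array (the names, the random stand-in data, and the sizes are illustrative):

import numpy as np

rng = np.random.default_rng(0)
database = rng.integers(-10, 10, size=(100_000, 60)).astype(float)  # stand-in for the stored vectors
chosen = database[0]                                                # the query vector

# Normalize every vector to unit magnitude once, up front.
db_unit = database / np.linalg.norm(database, axis=1, keepdims=True)
q_unit = chosen / np.linalg.norm(chosen)

# For unit vectors T(A, B) = dot / (2 - dot), which is monotone in the dot product,
# so ranking by dot product gives the same order as ranking by T().
scores = db_unit @ q_unit
top20 = np.argsort(-scores)[:20]   # indices of the 20 most similar vectors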

Now, think of a new function D = the distance between two points in 60D space. This is the classic L2 distance: take the difference of each component, square each, add up the squares, and take the square root of the sum. D(A, B) = sqrt(dot(A - B, A - B)), where A and B are each 60-dimensional vectors.

This can be expanded, though, to D(A, B) = sqrt(dot(A, A) - 2*dot(A, B) + dot(B, B)). Since A and B are unit magnitude, dot(A, A) = dot(B, B) = 1. The function D is also monotonic, so the sort order won't change if we remove the sqrt() and look at squared distances. That leaves D(A, B)^2 = 2 - 2*dot(A, B), where only the -2*dot(A, B) term varies. Thus, minimizing distance is exactly equivalent to maximizing the dot product.
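A quick numeric check of that step (illustrative only): for unit vectors the squared distance equals 2 - 2*dot(A, B), so the nearest point is exactly the one with the largest dot product.

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=60)
A /= np.linalg.norm(A)
B = rng.normal(size=(1000, 60))
B /= np.linalg.norm(B, axis=1, keepdims=True)

dots = B @ A
sq_dists = np.sum((B - A) ** 2, axis=1)        # squared L2 distances
assert np.allclose(sq_dists, 2 - 2 * dots)     # the expansion above, with unit magnitudes
assert np.argmin(sq_dists) == np.argmax(dots)  # same winner under both orderings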

So the original T() classification metric can be simplified into finding the highest dot product between the normalized vectors. And that comparison has been shown to be equivalent to finding the points closest to the sample point in 60-D space.

So now all you need to do is solve the equivalent problem of "given a normalized point in 60D space, list the 20 points in the database of normalized sample vectors which are nearest to it."

That problem is a well-understood one: it's K nearest neighbors. There are many algorithms for solving it. The most common is the classic KD tree.

But there's a problem. KD trees have O(e^D) behavior: high dimensionality quickly becomes painful, and 60 dimensions is definitely in the extremely painful category. Don't even try it.

There are several alternative general techniques for high-D nearest neighbor search, however. This paper gives a clear method.

But in practice, there's a great solution involving yet another transform. If you have a metric space (which you do, or you wouldn't be using the Tanimoto comparison), you can reduce the dimensionality of the problem by a 60-dimensional rotation. That sounds complex and scary, but it's very common: it's a form of singular value decomposition, or eigenvalue decomposition. In statistics, it's known as Principal Components Analysis.

Basically this uses a simple linear computation to find what directions your database really spans. You can collapse the 60 dimensions down to a lower number, perhaps as low as 3 or 4, and still be able to accurately determine nearest neighbors. There are tons of software libraries for doing this in any language, see here for example.
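For instance, with scikit-learn (one such library; the 5-component choice and the db_unit / q_unit arrays from the earlier sketch are assumptions, not part of the answer):

from sklearn.decomposition import PCA

pca = PCA(n_components=5)                    # try values in the 3-10 range
db_reduced = pca.fit_transform(db_unit)      # shape (100_000, 5)
q_reduced = pca.transform(q_unit.reshape(1, -1))

# Fraction of the variance kept by the 5 directions; what is missing is the
# residual error discussed below.
print(pca.explained_variance_ratio_.sum())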

Finally, you'll do a classic K nearest neighbors search in probably only 3-10 dimensions; you can experiment to find the best behavior. There's a terrific library for doing this called Ranger, but you can use other libraries as well. A great side benefit is that you don't even need to store all 60 components of your sample data any more!

The nagging question is whether your data really can be collapsed to lower dimensions without affecting the accuracy of the results. In practice, the PCA decomposition can tell you the maximum residual error for whatever D limit you choose, so you can be assured it works. Since the comparison points are based on a distance metric, it's very likely they are intensely correlated, unlike say hash table values.

So the summary of the above:

  1. Normalize your vectors, transforming your problem into a K-nearest neighbor problem in 60 dimensions
  2. Use Principal Components Analysis to reduce dimensionality down to a manageable limit of say 5 dimensions
  3. Use a K nearest neighbor algorithm, such as Ranger's KD tree library, to find nearby samples (see the end-to-end sketch below).
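An end-to-end sketch of those three steps, using scikit-learn's KDTree as a stand-in for the Ranger library mentioned above (all names, sizes and the 5-component choice are illustrative):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KDTree

def build_index(database, n_components=5):
    # Steps 1-2: normalize the sample vectors, then project them with PCA.
    db_unit = database / np.linalg.norm(database, axis=1, keepdims=True)
    pca = PCA(n_components=n_components).fit(db_unit)
    return pca, KDTree(pca.transform(db_unit))

def top_k_similar(pca, tree, chosen, k=20):
    # Step 3: the k nearest reduced points are the k most similar vectors.
    q = chosen / np.linalg.norm(chosen)
    _, idx = tree.query(pca.transform(q.reshape(1, -1)), k=k)
    return idx[0]

# Usage, assuming `database` is a (100_000, 60) float array:
# pca, tree = build_index(database)
# top20 = top_k_similar(pca, tree, database[0], k=20)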

Other recommended answer

Update:

After you made it clear that 60 is the dimension of your space, not the length of the vectors, the answer below is not applicable to you, so I'll keep it just for history.


Since your vectors are normalized, you can employ a kd-tree to find the neighbors within an MBH of incremental hypervolume.

No database I'm aware of has native support for kd-trees, so you can try to implement the following solution in MySQL, if you are searching for a limited number of closest entries:

  • Store the projections of each vector onto every possible 2-dimensional coordinate plane (this takes n * (n - 1) / 2 columns; see the sketch after the example table below)
  • Index each of these columns with a SPATIAL index
  • Pick a square MBR of a given area within any projection. The product of these MBR's will give you a hypercube of a limited hypervolume, which will hold all vectors with a distance not greater than a given one.
  • Find all projections within all MBR's using MBRContains

You'll still need to sort within this limited range of values.

For instance, say you have a set of 4-dimensional vectors with a magnitude of 2:

(2, 0, 0, 0)
(1, 1, 1, 1)
(0, 2, 0, 0)
(-2, 0, 0, 0)

You'll have to store them as follows:

p12  p13  p14  p23  p24  p34
---  ---  ---  ---  ---  ---
2,0  2,0  2,0  0,0  0,0  0,0
1,1  1,1  1,1  1,1  1,1  1,1
0,2  0,0  0,0  2,0  2,0  0,0
-2,0 -2,0 -2,0 0,0  0,0  0,0
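The projection columns in the table above can be produced mechanically before insertion; here is a small illustrative sketch (the pNM keys mirror the table's column names, everything else is hypothetical):

from itertools import combinations

def projections(vector):
    # All n * (n - 1) / 2 two-dimensional projections of an n-dimensional vector.
    return {f"p{i + 1}{j + 1}": (vector[i], vector[j])
            for i, j in combinations(range(len(vector)), 2)}

# projections((2, 0, 0, 0)) reproduces the first row of the table:
# {'p12': (2, 0), 'p13': (2, 0), 'p14': (2, 0),
#  'p23': (0, 0), 'p24': (0, 0), 'p34': (0, 0)}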

Say, you want similarity with the first vector (2, 0, 0, 0) greater than 0.

This means having the vectors inside the hypercube: (0, -2, -2, -2):(4, 2, 2, 2).

You issue the following query:

SELECT  *
FROM    vectors
WHERE   MBRContains(LineFromText('LineString(0 -2, 4 2)'), p12)
        AND MBRContains(LineFromText('LineString(0 -2, 4 2)'), p13)
        …

and so on, for all six columns.

Other recommended answer

The following information can be cached:

  • Norm of the chosen vector
  • The dot product A.B, reusing it for both the numerator and the denominator in a given T(A,B) calculation.

If you only need the N closest vectors or if you are doing this same sorting process multiple times, there may be further tricks available. (Observations like T(A,B)=T(B,A), caching the vector norms for all the vectors, and perhaps some sort of thresholding/spatial sort).
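A minimal sketch of that caching idea against a whole database (the names are illustrative): the squared norms are precomputed once, and each dot product A.B is computed a single time and reused in both the numerator and the denominator of T(A, B).

import numpy as np

def tanimoto_scores(database, chosen):
    # T(chosen, B) for every row B of database, with cached norms.
    sq_norms = np.einsum('ij,ij->i', database, database)  # |B|^2 for every row, cached
    q_sq = np.dot(chosen, chosen)                         # |A|^2, computed once
    dots = database @ chosen                              # A.B, reused twice below
    return dots / (q_sq + sq_norms - dots)

# Top-N without fully sorting everything, for the "only need the N closest" case:
# scores = tanimoto_scores(database, chosen)
# top20 = np.argpartition(-scores, 20)[:20]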