How to search for text fragments in a database


Problem Description

Are there any open source or commercial tools available that allow for text fragment indexing of database contents and can be queried from Java?

Background of the question is a large MySQL database table with several hundred thousand records, containing several VARCHAR columns. In these columns people would like to search for fragments of the contents, so a fulltext index (which is based on word boundaries) would not help.

EDIT: [Added to make clear why these first suggestions would not solve the problem:]

This is why MySQL's built in fulltext index will not do the job, and neither will Lucene or Sphinx, all of which were suggested in the answers. I already looked at both those, but as far as I can tell, these are based on indexing words, excluding stop words and doing all sorts of sensible things for a real fulltext search. However this is not suitable, because I might be looking for a search term like "oison" which must match "Roisonic Street" as well as "Poison-Ivy". The key difference here is that the search term is just a fragment of the column content, that need not be delimited by any special characters or white space.

EDIT2: [Added some more background info:] The requested feature that is to be implemented based on this is a very loose search for item descriptions in a merchandise management system. Users often do not know the correct item number, but only part of the name of the item. Unfortunately the quality of these descriptions is rather low; they come from a legacy system and cannot be changed easily. If for example people were searching for a sledge hammer they would enter "sledge". With a word/token based index this would not find matches that are stored as "sledgehammer", but only those listed as "sledge hammer". There are all kinds of weird variances that need to be covered, making a token based approach impractical.

Currently the only thing we can do is a LIKE '%searchterm%' query, effectively disabling any index use and requiring lots of resources and time.

Ideally any such tool would create an index that allowed me to get results for suchlike queries very quickly, so that I could implement a spotlight-like search, only retrieving the "real" data from the MySQL table via the primary key when a user picks a result record.

If possible the index should be updatable (without needing a full rebuild), because data might change and should be available for search immediately by other clients.

I would be glad to get recommendations and/or experience reports.

EDIT3: Commercial solution found that "just works" Even though I got a lot of good answers for this question, I wanted to note here, that in the end we went with a commercial product called "QuickFind", made and sold by a German company named "HMB Datentechnik". Please note that I am not affiliated with them in any way, because it might appear like that when I go on and describe what their product can do. Unfortunately their website looks rather bad and is German only, but the product itself is really great. I currently have a trial version from them - you will have to contact them, no downloads - and I am extremely impressed.

As there is no comprehensive documentation available online, I will try and describe my experiences so far.

What they do is build a custom index file based on database content. They can integrate via ODBC, but from what I am told customers rarely do that. Instead - and this is what we will probably do - you generate a text export (like CSV) from your primary database and feed that to their indexer. This allows you to be completely independent of the actual table structure (or any SQL database at all); in fact we export data joined together from several tables. Indexes can be incrementally updated later on the fly.

Based on that, their server (a mere 250 KB or so, running as a console app or Windows service) listens for queries on a TCP port. The protocol is text based and looks a little "old", but it is simple and works. Basically you just pass on which of the available indexes you want to query and the search terms (fragments), space delimited. There are three output formats available: HTML/JavaScript array, XML, or CSV. Currently I am working on a Java wrapper for the somewhat "dated" wire protocol. But the results are fantastic: I currently have a sample data set of approximately 500,000 records with 8 columns indexed, and my test application triggers a search across all 8 columns for the contents of a JTextField on every keystroke while it is being edited, updating the results display (a JTable) in real time! This happens without going to the MySQL instance the data originally came from. Based on the columns you get back, you can then ask for the "original" record by querying MySQL with the primary key of that row (which needs to be included in the QuickFind index, of course).

The index is about 30-40% the size of the text export version of the data. Indexing was mainly bound by disk I/O speed; my 500,000 records took about a minute or two to be processed.

It is hard to describe this; I found it hard to believe myself when I saw an in-house product demo. They presented a 10 million row address database, searched for fragments of names, addresses and phone numbers, and on hitting the "Search" button the results came back in under a second - all done on a notebook! From what I am told they often integrate with SAP or CRM systems to improve search times when call center agents only know fragments of a caller's name or address.

So anyway, I probably won't get much better in describing this. If you need something like this, you should definitely go check this out. Google Translate does a reasonably good job translating their website from German to English, so this might be a good start.

Recommended Answer

I haven't had this specific requirement myself, but my experience tells me Lucene can do the trick, though perhaps not standalone. I'd definitely use it through Solr as described by Michael Della Bitta in the first answer. The link he gave was spot on - read it for more background.

Briefly, Solr lets you define custom FieldTypes. These consist of an index-time Analyzer and a query-time Analyzer. Analyzers figure out what to do with the text, and each consists of a Tokenizer and zero to many TokenFilters. The Tokenizer splits your text into chunks and then each TokenFilter can add, subtract, or modify tokens.

The field can thus end up indexing something quite different from the original text, including multiple tokens if necessary. So what you want is a multiple-token copy of your original text, which you query by sending Lucene something like "my_ngram_field:sledge". No wildcards involved :-)

Then you follow a model similar to the prefix searching offered up in the solrconfig.xml file:

<fieldType name="prefix_token" class="solr.TextField" positionIncrementGap="1">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory" />
    </analyzer>
</fieldType>

The EdgeNGramFilterFactory is how they implement prefix matching for search box autocomplete. It takes the tokens coming from the previous stages (single whitespace-delimited words transformed into lower case) and fans them out into every substring on the leading edge. sledgehammer = s,sl,sle,sled,sledg,sledge,sledgeh, etc.

You need to follow this pattern, but replace the EdgeNGramFilterFactory with your own which does all NGrams in the field. The default org.apache.solr.analysis.NGramFilterFactory is a good start, but it does letter transpositions for spell checking. You could copy it and strip that out - it's a pretty simple class to implement.
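For orientation, here is a sketch of what such a fieldType could look like if you stayed with the stock NGramFilterFactory rather than writing your own (the gram sizes are illustrative only and should be tuned to your data):

<fieldType name="ngram_text" class="solr.TextField" positionIncrementGap="1">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory" />
    </analyzer>
</fieldType>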

Once you have your own FieldType (call it ngram_text) using your own MyNGramFilterFactory, just create your original field and the ngram field like so:

    <field name="title" type="text" indexed="true" stored="true"/>
    <field name="title_ngrams" type="ngram_text" indexed="true" stored="false"/>

Then tell it to copy the original field into the fancy one:

<copyField source="title" dest="title_ngrams"/>

Alright, now when you search "title_ngrams:sledge" you should get a list of documents that contain this. Then in your field list for the query you just tell it to retrieve the field called title rather than the field title_ngrams.
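If you are doing this from Java, a minimal SolrJ sketch might look like the following (the core name, URL, and field names are assumptions, and this client API is newer than the original answer; adapt it to your setup):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class NGramSearchExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical core named "items" on a local Solr instance
        HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/items").build();

        SolrQuery query = new SolrQuery("title_ngrams:sledge");
        query.setFields("id", "title"); // retrieve the readable field, not the ngram copy
        query.setRows(20);

        QueryResponse response = solr.query(query);
        for (SolrDocument doc : response.getResults()) {
            System.out.println(doc.getFieldValue("id") + " : " + doc.getFieldValue("title"));
        }
        solr.close();
    }
}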

That should be enough of a nudge to allow you to fit things together and tune it to astonishing performance levels rather easily. At an old job we had a database with over ten million products with large HTML descriptions and managed to get Lucene to do both the standard query and the spellcheck in under 200ms on a mid-sized server handling several dozen simultaneous queries. When you have a lot of users, caching kicks in and makes it scream!

Oh, and incremental (though not real-time) indexing is a cinch. It can even do it under high loads since it creates and optimizes the new index in the background and autowarms it before swapping it in. Very slick.

Good luck!

Other Recommended Answer

This may not be what you want to hear, because I presume you are trying to solve this with SQL code, but Lucene would be my first choice. You can also build up fairly clever ranking and boosting techniques with additional tools. Lucene is written in Java so it should give you exactly the interface you need.

If you were a Microsoft shop, the majority of what you're looking for is built into SQL Server, and wildcards can be enabled which will give you the ability to do partial word matches.

In Lucene and Lucene.Net, you can use wildcard matches if you like. However, using a wildcard as the first symbol of a search term is not supported. If you want the ability to use leading wildcards, you'll probably need to implement some sort of trie-based index on your own: for the kind of index most commonly needed in full text search applications, where suffix stemming is generally more valuable, filtering the set of terms down to something reasonable is in many cases an expensive operation.

You can apparently alter the Query Parser instance in Lucene to override this rule by setting setAllowLeadingWildcard to true.
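For illustration, a small sketch of that override (package and class names follow recent Lucene releases; older versions kept QueryParser in a different package):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class LeadingWildcardExample {
    public static Query buildFragmentQuery(String fragment) throws Exception {
        QueryParser parser = new QueryParser("title", new StandardAnalyzer());
        parser.setAllowLeadingWildcard(true); // lift the leading-wildcard restriction
        // A wildcard on both ends matches arbitrary fragments, but it forces a
        // scan over the term dictionary - expensive on large indexes.
        return parser.parse("*" + QueryParser.escape(fragment) + "*");
    }
}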

I'm fairly sure that wildcard-on-both-ends-of-a-word searches are inherently inefficient. Skip lists are sometimes used to improve performance on such searches with plaintext, but I think you're more likely to find an implementation like that in something like grep than a generalized text indexing tool.

There are other solutions for the problem that you describe where one word may occur spelled as two, or vice versa. Fuzzy queries are supported in Lucene, for example. Orthographic and morphological variants can be handled either by providing a filter that offers suggestions based on some sort of Bayesian mechanism, or by indexing tricks, namely, taking a corpus of frequent variants and stuffing the index with those terms. I've even seen knowledge from structured data stuffed into the full text engine (e.g. adding the city name and the word "hotel" to records from the hotel table, to make it more likely that "Paris Hotels" will include a record for the pension-house Caisse des Dépôts). While not exactly a trivial problem, it's manageable without destroying the advantages of word-based searches.
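As a tiny illustration of the fuzzy option, a query like the following (field name hypothetical) matches terms within Lucene's maximum edit distance of the input:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.Query;

public class FuzzyExample {
    // Catches misspellings such as "sladge", but not concatenations like
    // "sledgehammer" - that remains a tokenization problem.
    public static Query fuzzyFor(String field, String fragment) {
        return new FuzzyQuery(new Term(field, fragment), 2); // 2 = Lucene's max edit distance
    }
}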

Other Recommended Answer

If your table is MyISAM, you can use MySQL's full text search capabilities: http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
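For completeness, a minimal sketch of that approach (table and column names are hypothetical, and note the asker has already ruled out word-boundary matching):

ALTER TABLE items ADD FULLTEXT INDEX ft_items (title, description);

SELECT id, title
FROM items
WHERE MATCH(title, description) AGAINST('sledge');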

If not, the "industry standard" is http://www.sphinxsearch.com/

Some ideas on what to do if you are using InnoDB: http://www.mysqlperformanceblog.com/2009/09/10/what-to-do-with-mysql-full-text-search-while-migrating-to-innodb/

Also, a good presentation that introduces Sphinx and explains architecture+usage http://www.scribd.com/doc/2670976/Sphinx-High-Performance-Full-Text-Search-for-MySQL-Presentation

Update
Having read your clarification to the question -- Sphinx can do substring matches. You need to set "enable-star" and create an infix index with the appropriate min_infix_length (1 will give you all possible substrings, but obviously the higher you set it, the smaller your index will be and the faster your searches). See http://sphinxsearch.com/docs/current.html for details.
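A sketch of the relevant sphinx.conf fragment (directive names as in the Sphinx documentation of that era; source and path values are placeholders):

index items
{
    source           = items_src
    path             = /var/data/sphinx/items

    enable_star      = 1    # allow star (wildcard) syntax in queries
    min_infix_length = 2    # index all infixes of length >= 2
    infix_fields     = title, description
}

Queries can then use star syntax, e.g. searching for *ledge* to match both "sledge hammer" and "sledgehammer".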