Find duplicate addresses in a database and stop users from entering them early?




Question

How do I find duplicate addresses in a database, or better yet, stop people from entering them while they fill in the form? I guess the earlier the better.

Is there a good way of abstracting street, postal code, etc. so that typos and simple attempts to get two registrations can be detected? For example:

Quellenstrasse 66/11 
Quellenstr. 66a-11

I'm talking about German addresses... Thanks!

Recommended answer

Johannes:

@PConroy: This was my initial thought also. The interesting part of this is finding good transformation rules for the different parts of the address! Any good suggestions?

When we were working on this type of project before, our approach was to take our existing corpus of addresses (150k or so), then apply the most common transformations for our domain (Ireland, so "Dr"->"Drive", "Rd"->"Road", etc). I'm afraid there was no comprehensive online resource for such things at the time, so we ended up basically coming up with a list ourselves, checking things like the phone book (pressed for space there, addresses are abbreviated in all manner of ways!). As I mentioned earlier, you'd be amazed how many "duplicates" you'll detect with the addition of only a few common rules!
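A rule list like the one described can be sketched in a few lines. This is only an illustration for the German case from the question: the rules in `STREET_RULES` and the function name `normalize_street` are my own assumptions, not a vetted list, and unlike the Irish example they collapse to the short form rather than expanding to the long one.

```python
import re

# Hypothetical rule list: (pattern, replacement) pairs applied in order.
# Illustrative guesses only; a real list should come from your own data.
STREET_RULES = [
    (re.compile(r"ß"), "ss"),                              # Straße -> Strasse
    (re.compile(r"str(?:asse)?\.?(?=\s|$)"), "str"),       # ...strasse / ...str. -> ...str
    (re.compile(r"pl(?:atz)?\.?(?=\s|$)"), "pl"),          # ...platz / ...pl. -> ...pl
]


def normalize_street(street: str) -> str:
    """Lower-case the street name, apply each rule, collapse whitespace."""
    s = street.strip().lower()
    for pattern, replacement in STREET_RULES:
        s = pattern.sub(replacement, s)
    return re.sub(r"\s+", " ", s).strip()


if __name__ == "__main__":
    for raw in ("Quellenstrasse", "Quellenstr.", "Quellenstraße"):
        print(raw, "->", normalize_street(raw))
```

Running the rules over the existing corpus and grouping by the normalized value is a cheap way to see how many near-duplicates each new rule catches.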

I've recently stumbled across a page with a fairly comprehensive list of address abbreviations, although it's American English, so I'm not sure how useful it'd be in Germany! A quick Google search turned up a couple of sites, but they seemed like spammy newsletter sign-up traps. That was me googling in English, though, so you may have more luck searching for "German address abbreviations" in German :)

Other answer

You could use the Google Geocoding API.

It does in fact give results for both of your examples; I just tried it. That way you get structured results that you can save in your database. If the lookup fails, ask the user to write the address in another way.
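Roughly, that could look like the sketch below. It only builds the request URL and reads an already-parsed JSON response, so no network call is made here. The endpoint and the `status`, `results` and `formatted_address` fields are as I understand the Geocoding API's JSON format; check them against the current documentation. The function names and the API key are placeholders.

```python
from typing import Optional
from urllib.parse import urlencode

# Endpoint as I understand it; verify against the current API documentation.
GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(address: str, api_key: str) -> str:
    """Build the request URL for one free-text address (api_key is yours to supply)."""
    return GEOCODE_ENDPOINT + "?" + urlencode({"address": address, "key": api_key})


def canonical_address(response: dict) -> Optional[str]:
    """Return the formatted address of the first result, or None if the lookup failed."""
    if response.get("status") != "OK" or not response.get("results"):
        return None
    return response["results"][0].get("formatted_address")


if __name__ == "__main__":
    print(build_geocode_url("Quellenstr. 66a-11", "YOUR_API_KEY"))
```

If two user inputs come back with the same canonical address, treat them as the same address; if `canonical_address` returns `None`, ask the user to rephrase.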

Other answer

The earlier you can stop people, the easier it'll be in the long run!

Not being too familiar with your db schema or data entry form, I'd suggest a route something like the following:

  • have distinct fields in your db for each address "part", e.g. street, city, postal code, Land (federal state), etc.

  • have your data entry form broken down similarly, e.g. street, city, etc

The reasoning behind the above is that each part will likely have its own particular "rules" for checking slightly changed addresses ("Quellenstrasse"->"Quellenstr.", "66/11"->"66a-11" above), so your validation code can first check whether the values as entered for each field already exist in their respective db fields. If not, you can have a class that applies the transformation rules for each given field (e.g. "strasse" stemmed to "str") and checks again for duplicates.
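As a minimal sketch of that two-step check (exact match first, then per-field rules), assuming the house number lives in its own field: the `Address` type, the field names and the rules are hypothetical. Note that the house-number rule keeps only the digit groups, so it would also treat "66" and "66a" as the same, which you may or may not want.

```python
import re
from typing import Iterable, NamedTuple


class Address(NamedTuple):
    street: str
    house_number: str
    postal_code: str
    city: str


def norm_text(value: str) -> str:
    """Lower-case and collapse runs of whitespace."""
    return " ".join(value.lower().split())


def norm_street(value: str) -> str:
    """Stem a trailing "strasse" / "str." to "str"."""
    s = norm_text(value).replace("ß", "ss")
    return re.sub(r"str(?:asse)?\.?(?=\s|$)", "str", s)


def norm_house_number(value: str) -> str:
    """Keep only the digit groups, so "66/11" and "66a-11" both become "66/11"."""
    return "/".join(re.findall(r"\d+", value))


def normalize(addr: Address) -> Address:
    """Apply the rule for each field and return the normalized address."""
    return Address(
        norm_street(addr.street),
        norm_house_number(addr.house_number),
        norm_text(addr.postal_code),
        norm_text(addr.city),
    )


def is_duplicate(candidate: Address, existing: Iterable[Address]) -> bool:
    """Step 1: exact match as entered. Step 2: match after per-field rules."""
    rows = list(existing)
    if candidate in rows:
        return True
    wanted = normalize(candidate)
    return any(normalize(row) == wanted for row in rows)
```

In a real application the `existing` rows would come from a query narrowed by postal code rather than from the whole table.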

Obviously the above method has its drawbacks:

  • it can be slow, depending on your data set, leaving the user waiting

  • users may try to get around it by putting address "parts" in the wrong fields (appending post code to city, etc).

But from experience we've found that introducing even simple checking like the above will prevent a large percentage of users from entering pre-existing addresses.

Once you have the basic checking in place, you can look at optimising the db accesses required, refining the rules, etc. to meet your particular schema. You might also take a look at MySQL's full-text search (MATCH() ... AGAINST) for finding similar text.
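One possible way to cut down the db accesses, which is my own suggestion rather than what we did back then, is to store the normalized form in its own indexed column so the duplicate check becomes a single lookup. The sketch below uses Python's built-in `sqlite3` purely as a stand-in for your database; the table layout, the `dedupe_key` column and the placeholder `normalize` function are all hypothetical.

```python
import sqlite3


def normalize(value: str) -> str:
    """Placeholder normalizer: lower-case and collapse whitespace only."""
    return " ".join(value.lower().split())


conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE addresses (
        id INTEGER PRIMARY KEY,
        street TEXT NOT NULL,
        postal_code TEXT NOT NULL,
        city TEXT NOT NULL,
        dedupe_key TEXT NOT NULL UNIQUE  -- normalized form, indexed via UNIQUE
    )
    """
)


def insert_address(db: sqlite3.Connection, street: str, postal_code: str, city: str) -> bool:
    """Insert the address; return False if its normalized key already exists."""
    key = "|".join(normalize(part) for part in (street, postal_code, city))
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT INTO addresses (street, postal_code, city, dedupe_key) "
                "VALUES (?, ?, ?, ?)",
                (street, postal_code, city, key),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

The original values stay untouched for display, and the unique constraint also guards against two users submitting the same address at the same moment.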