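Identifying similar string values in two columns

For example, I have the following two columns, Address1 and refAddr, and I want to compare them to find matches. In the sample data from the table, 5235 JFK BLVD and 5235 John F Kennedy are clearly a pair, and 424 N 2ND ST and 424 NORTH SECOND are a pair.

Is there any way in SQL or SSIS to get rid of the non-matching results and keep only the pairs?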
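One option is to geocode the addresses with the Google API and parse the JSON result to get a more standardized form of each address. This may be time-consuming, but you will have more confidence in your data.
The API allows (I believe) 2,500 hits per day, but you can purchase more.
For example, I took 5232 JFK Blvd and added the ZIP code 72116 to narrow the search. Without the ZIP code it returned multiple addresses (NY, NJ, AR, etc.).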
https://maps.googleapis.com/maps/api/geocode/json?address=5232%20JFK%20Blvd&72116sensor=false
The key elements of the response could be:
formatted_address: "5232 J.F.K. Blvd, North Little Rock, AR 72116, USA",
or
long_name: "John F. Kennedy Boulevard",
The response returned:
{
  results: [
    {
      address_components: [
        {
          long_name: "5232",
          short_name: "5232",
          types: ["street_number"]
        },
        {
          long_name: "J.F.K. Boulevard",
          short_name: "J.F.K. Blvd",
          types: ["route"]
        },
        {
          long_name: "North Little Rock",
          short_name: "North Little Rock",
          types: ["locality", "political"]
        },
        {
          long_name: "Hill Township",
          short_name: "Hill Township",
          types: ["administrative_area_level_3", "political"]
        },
        {
          long_name: "Pulaski County",
          short_name: "Pulaski County",
          types: ["administrative_area_level_2", "political"]
        },
        {
          long_name: "Arkansas",
          short_name: "AR",
          types: ["administrative_area_level_1", "political"]
        },
        {
          long_name: "United States",
          short_name: "US",
          types: ["country", "political"]
        },
        {
          long_name: "72116",
          short_name: "72116",
          types: ["postal_code"]
        }
      ],
      formatted_address: "5232 J.F.K. Blvd, North Little Rock, AR 72116, USA",
      geometry: {
        bounds: {
          northeast: { lat: 34.8032656, lng: -92.2538364 },
          southwest: { lat: 34.8032599, lng: -92.2538538 }
        },
        location: { lat: 34.8032599, lng: -92.2538364 },
        location_type: "RANGE_INTERPOLATED",
        viewport: {
          northeast: { lat: 34.8046117302915, lng: -92.2524961197085 },
          southwest: { lat: 34.8019137697085, lng: -92.2551940802915 }
        }
      },
      place_id: "EjI1MjMyIEouRi5LLiBCbHZkLCBOb3J0aCBMaXR0bGUgUm9jaywgQVIgNzIxMTYsIFVTQQ",
      types: ["route", "street_address"]
    },
    {
      address_components: [
        {
          long_name: "5232",
          short_name: "5232",
          types: ["street_number"]
        },
        {
          long_name: "John F. Kennedy Boulevard",
          short_name: "John F. Kennedy Blvd",
          types: ["route"]
        },
        {
          long_name: "West New York",
          short_name: "West New York",
          types: ["locality", "political"]
        },
        {
          long_name: "Hudson County",
          short_name: "Hudson County",
          types: ["administrative_area_level_2", "political"]
        },
        {
          long_name: "New Jersey",
          short_name: "NJ",
          types: ["administrative_area_level_1", "political"]
        },
        {
          long_name: "United States",
          short_name: "US",
          types: ["country", "political"]
        },
        {
          long_name: "07093",
          short_name: "07093",
          types: ["postal_code"]
        }
      ],
      formatted_address: "5232 John F. Kennedy Blvd, West New York, NJ 07093, USA",
      geometry: {
        bounds: {
          northeast: { lat: 40.78574, lng: -74.0231416 },
          southwest: { lat: 40.7857366, lng: -74.0231598 }
        },
        location: { lat: 40.78574, lng: -74.0231416 },
        location_type: "RANGE_INTERPOLATED",
        viewport: {
          northeast: { lat: 40.78708728029149, lng: -74.02180171970849 },
          southwest: { lat: 40.7843893197085, lng: -74.0244996802915 }
        }
      },
      place_id: "Ejc1MjMyIEpvaG4gRi4gS2VubmVkeSBCbHZkLCBXZXN0IE5ldyBZb3JrLCBOSiAwNzA5MywgVVNB",
      types: ["route", "street_address"]
    }
  ],
  status: "OK"
}
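To make the parsing step concrete, here is a minimal Python sketch of building the request URL and pulling the standardized fields out of a response shaped like the one above. The function names `build_geocode_url` and `summarize_results` are hypothetical, the optional `api_key` argument is an assumption about how you authenticate, and `SAMPLE` is a trimmed copy of the first result with its keys quoted so that it parses as standard JSON. No network call is made here.

```python
import json
from urllib.parse import urlencode

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(address, api_key=None):
    """Build a request URL; urlencode escapes spaces and punctuation."""
    params = {"address": address}
    if api_key:  # assumption: the caller supplies their own key
        params["key"] = api_key
    return GEOCODE_URL + "?" + urlencode(params)


def summarize_results(payload):
    """Return (formatted_address, street_number, route) for each result."""
    rows = []
    for result in payload.get("results", []):
        parts = {}
        for comp in result.get("address_components", []):
            for kind in comp.get("types", []):
                # Keep the first component seen for each type.
                parts.setdefault(kind, comp.get("long_name"))
        rows.append((
            result.get("formatted_address"),
            parts.get("street_number"),
            parts.get("route"),
        ))
    return rows


# Trimmed copy of the first result shown above, with quoted keys.
SAMPLE = json.loads("""
{
  "results": [
    {
      "address_components": [
        {"long_name": "5232", "short_name": "5232",
         "types": ["street_number"]},
        {"long_name": "J.F.K. Boulevard", "short_name": "J.F.K. Blvd",
         "types": ["route"]}
      ],
      "formatted_address": "5232 J.F.K. Blvd, North Little Rock, AR 72116, USA"
    }
  ],
  "status": "OK"
}
""")

if __name__ == "__main__":
    print(build_geocode_url("5232 JFK Blvd 72116"))
    for row in summarize_results(SAMPLE):
        print(row)
```

If both Address1 and refAddr are geocoded this way, two rows could be treated as a pair when they come back with the same formatted_address (or the same street_number and route), which sidesteps the JFK versus John F Kennedy spelling problem.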
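Address matching and correction is a specialty that is usually not included in general-purpose database software. –
Buy master data management software to do this. – dfundako
In SSIS, use a Script Component with regular expressions and flag the matching rows in an additional column; you can then filter on that column. –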
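
The flag-and-filter idea from the last comment can be sketched as follows. This is only an illustration of the logic in Python; an actual SSIS Script Component would be written in C# or VB.NET. The `ABBREVIATIONS` table and the prefix-matching rule in `is_pair` are assumptions chosen to cover the two example pairs from the question, not a complete address standard.

```python
import re

# Hypothetical abbreviation table; a real one would be far larger.
ABBREVIATIONS = {
    "N": "NORTH", "S": "SOUTH", "E": "EAST", "W": "WEST",
    "ST": "STREET", "BLVD": "BOULEVARD", "AVE": "AVENUE",
    "2ND": "SECOND",
    "JFK": "JOHN F KENNEDY",
}


def normalize(address):
    """Uppercase, drop punctuation, and expand known abbreviations."""
    cleaned = re.sub(r"[^A-Z0-9\s]", "", address.upper())
    tokens = []
    for word in cleaned.split():
        # An expansion may be several words, so split it again.
        tokens.extend(ABBREVIATIONS.get(word, word).split())
    return tokens


def is_pair(addr_a, addr_b):
    """Flag two addresses as a pair when, after normalization,
    the shorter token list is a prefix of the longer one."""
    shorter, longer = sorted((normalize(addr_a), normalize(addr_b)), key=len)
    if len(shorter) < 2:  # a house number alone is too weak to match on
        return False
    return longer[:len(shorter)] == shorter


if __name__ == "__main__":
    rows = [
        ("5235 JFK BLVD", "5235 John F Kennedy"),
        ("424 N 2ND ST", "424 NORTH SECOND"),
        ("5235 JFK BLVD", "424 NORTH SECOND"),
    ]
    for address1, ref_addr in rows:
        print(address1, "|", ref_addr, "|", is_pair(address1, ref_addr))
```

The boolean returned by `is_pair` plays the role of the additional column: rows where it is false are the non-pairs to filter out.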