Better string comparison method: what is the fastest way to compare two strings? Quality matters too, but less than speed.
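I am looking for the most efficient way to compare two strings. Some of the strings I compare can be more than 5000 characters long. I am comparing a list of about 80 strings against another list of about 200 strings. It takes forever, even when I run it on multiple threads. I use the StringUtils.getLevenshteinDistance(String s, String t) method from Apache Commons. My method is below. Is there a better way?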
private void compareMe() {
    List<String> compareStrings = MainController.getInstance().getCompareStrings();
    for (String compare : compareStrings) {
        int levenshteinDistance = StringUtils.getLevenshteinDistance(me, compare);
        // Keep the candidate with the smallest non-negative distance seen so far.
        if (bestScore > levenshteinDistance
                && levenshteinDistance > -1) {
            bestScore = levenshteinDistance; //global variable
            bestString = compare; //global variable
        }
    }
}
Here are two sample strings that should get a good score:
String 1:
SELECT
CORP_VENDOR_NAME as "Corporate Vendor Name",
CORP_VENDOR_REF_ID as "Reference ID",
MERCHANT_ID as "Merchant ID",
VENDOR_CITY as "City",
VENDOR_STATE as "State",
VENDOR_ZIP as "Zip",
VENDOR_COUNTRY as "Country",
REMIT_VENDOR_NAME as "Remit Name",
REMIT_VENDOR_REF_ID as " Remit Reference ID",
VENDOR_PRI_UNSPSC_CODE as "Primary UNSPSC"
FROM DSS_FIN_USER.ACQ_VENDOR_DIM
WHERE VENDOR_REFERENCE_ID in
(SELECT distinct CORP_VENDOR_REF_ID
FROM DSS_FIN_USER.ACQ_VENDOR_DIM
WHERE CORP_VENDOR_REF_ID = '${request.corp_vendor_id};')
String 2:
SELECT
CORP_VENDOR_NAME as "Corporate Vendor Name",
CORP_VENDOR_REF_ID as "Reference ID",
MERCHANT_ID as "Merchant ID",
VENDOR_CITY as "City",
VENDOR_STATE as "State",
VENDOR_ZIP as "Zip",
VENDOR_COUNTRY as "Country",
REMIT_VENDOR_NAME as "Remit Name",
REMIT_VENDOR_REF_ID as " Remit Reference ID",
VENDOR_PRI_UNSPSC_CODE as "Primary UNSPSC"
FROM DSS_FIN_USER.ACQ_VENDOR_DIM
WHERE VENDOR_REFERENCE_ID in
(SELECT distinct CORP_VENDOR_REF_ID
FROM DSS_FIN_USER.ACQ_VENDOR_DIM
WHERE CORP_VENDOR_REF_ID = 'ACQ-169013')
The only difference you will notice is the '${request.corp_vendor_id};' at the end of the string. That difference gives the pair a score of 26 from the LevenshteinDistance method.
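For strings like the two samples, which share a long identical beginning and a short identical ending, one possible speed-up is to strip the common prefix and suffix before running the distance calculation. Removing them does not change the Levenshtein distance, but it shrinks the quadratic part of the work to the region that actually differs. The following is only a sketch under that assumption: the class TrimmedLevenshtein and its methods are hypothetical names, and plainDistance is a hand-written stand-in for the Apache Commons call, not its actual implementation.

```java
// Hypothetical helper class; the names are illustrative and not part of Apache Commons.
class TrimmedLevenshtein {

    // Levenshtein distance computed after discarding the shared prefix and suffix.
    static int distance(String a, String b) {
        int start = 0;
        int endA = a.length();
        int endB = b.length();
        // Skip the common prefix.
        while (start < endA && start < endB && a.charAt(start) == b.charAt(start)) {
            start++;
        }
        // Skip the common suffix without running back into the prefix.
        while (endA > start && endB > start && a.charAt(endA - 1) == b.charAt(endB - 1)) {
            endA--;
            endB--;
        }
        return plainDistance(a.substring(start, endA), b.substring(start, endB));
    }

    // Standard two-row dynamic-programming Levenshtein distance.
    static int plainDistance(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            // Swap the rows so that prev always holds the last completed row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }
}
```

With the two sample queries, only the part between the quotes ('${request.corp_vendor_id};' versus 'ACQ-169013') would reach the quadratic loop. How much this helps on the real data depends on how much text the strings actually have in common at either end.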
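Define what you mean by "compare". "Compare" usually means ==/!= or >/==/<, but since you are using a distance function, you evidently do not want a binary comparison.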
Without any knowledge of the string contents, no real optimization is possible (other than avoiding comparing both A with B and B with A).
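About all you can do is get the source of your compare method and see if you can "tighten it up" somehow. Any "distance" comparison is going to be expensive. However, if you only need a "go/no-go" result (rather than the actual score in all cases), you could use some pre-screening tests based on the specifics of your data.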
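To make the pre-screening idea concrete, here is a rough sketch, assuming only the single best match is needed. It relies on two facts: the Levenshtein distance is never smaller than the difference in length of the two strings, and the minimum of a row in the dynamic-programming table never decreases from one row to the next. So a candidate can be skipped outright, or its calculation abandoned early, once it cannot beat the best score so far. BestMatchFinder, bestMatch and boundedDistance are hypothetical names, and the distance routine is hand-written for illustration rather than taken from Apache Commons.

```java
import java.util.List;

// Hypothetical sketch; class and method names are illustrative only.
class BestMatchFinder {

    // Returns the candidate closest to target, or null if the list is empty.
    static String bestMatch(String target, List<String> candidates) {
        int bestScore = Integer.MAX_VALUE;
        String bestString = null;
        for (String candidate : candidates) {
            // The distance is at least the length difference, so a candidate
            // whose length differs this much cannot beat the current best.
            if (Math.abs(target.length() - candidate.length()) >= bestScore) {
                continue;
            }
            // Only a distance strictly below bestScore is interesting.
            int d = boundedDistance(target, candidate, bestScore - 1);
            if (d >= 0) {
                bestScore = d;
                bestString = candidate;
            }
        }
        return bestString;
    }

    // Levenshtein distance, or -1 once it is known to exceed limit.
    static int boundedDistance(String s, String t, int limit) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i;
            int rowMin = curr[0];
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
                rowMin = Math.min(rowMin, curr[j]);
            }
            // Row minima never decrease, so the final distance is at least rowMin.
            if (rowMin > limit) {
                return -1;
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        int d = prev[t.length()];
        return d <= limit ? d : -1;
    }
}
```

Ties keep the first candidate found, which matches the strict bestScore > levenshteinDistance test in the question. If I remember correctly, Commons Lang 3 also has a getLevenshteinDistance overload that takes a threshold and returns -1 when it is exceeded, which might serve the same purpose as boundedDistance here.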