Mark dynamic substrings in a list of strings

Assume these two sets of strings:

file=sheet-2016-12-08.xlsx 
file=sheet-2016-11-21.xlsx 
file=sheet-2016-11-12.xlsx 
file=sheet-2016-11-08.xlsx 
file=sheet-2016-10-22.xlsx 
file=sheet-2016-09-29.xlsx 
file=sheet-2016-09-05.xlsx 
file=sheet-2016-09-04.xlsx 

size=1024KB 
size=22KB 
size=980KB 
size=15019KB 
size=202KB

I need to run a function on each of these two sets separately and get the following output:

file=sheet-2016-*.xlsx 

size=*KB

The data set can be any set of strings. It does not have to match the format above. Here is another example:

id.4030.paid 
id.1280.paid 
id.88.paid

for which the expected output is:

id.*.paid

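Basically, I need a function that analyzes a set of strings and replaces the substring that is not common to all of them with an asterisk (*).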

Source

2017-08-25 HyderA

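You can use os.path.commonprefix to compute the common prefix. It is meant for computing the shared directory in a list of file paths, but it compares the strings character by character, so it works in a generic context too.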
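As a small illustration of why os.path.commonprefix works on arbitrary strings, the sketch below shows that it compares character by character rather than path component by path component. The sample paths are made up for the demonstration.

```python
import os

# Hypothetical sample paths, chosen so the shared prefix ends mid-component
paths = ["/usr/lib", "/usr/local/lib"]
path_prefix = os.path.commonprefix(paths)
print(path_prefix)  # /usr/l  (not a valid directory: the comparison is per character)

# The same call on the question's third data set
ids = ["id.4030.paid", "id.1280.paid", "id.88.paid"]
id_prefix = os.path.commonprefix(ids)
print(id_prefix)  # id.

# An empty list yields an empty string
empty_prefix = os.path.commonprefix([])
print(repr(empty_prefix))  # ''
```

The character-wise behavior is a drawback for real paths, but it is exactly what is needed here.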

Then reverse the strings, apply the common prefix again, and reverse the result to get the common suffix (adapted from https://gist.github.com/willwest/ca5d050fdf15232a9e67)

import os

dataset = """id.4030.paid
id.1280.paid
id.88.paid""".splitlines()


# Return the longest common suffix in a list of strings
def longest_common_suffix(list_of_strings):
    # Reverse each string so the shared suffix becomes a shared prefix
    reversed_strings = [s[::-1] for s in list_of_strings]
    return os.path.commonprefix(reversed_strings)[::-1]

common_prefix = os.path.commonprefix(dataset)
common_suffix = longest_common_suffix(dataset)

print("{}*{}".format(common_prefix, common_suffix))

Result:

id.*.paid

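Edit: as wim noted:

When all strings are equal, the common prefix and the common suffix are both the whole string, and the result should be the string itself rather than prefix*suffix: we should check whether all strings are the same.
When the common prefix and the common suffix overlap or share letters, the computation also goes wrong: the common suffix should be computed on the strings minus the common prefix.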
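To make both failure modes concrete, here is a minimal sketch that wraps the first version above in a function and runs it on two small inputs. The name naive_generic_string is made up for this illustration and is not part of the answer.

```python
import os


def naive_generic_string(strings):
    # Hypothetical wrapper around the first version: prefix and suffix are
    # both computed on the full strings, with no special cases.
    prefix = os.path.commonprefix(strings)
    suffix = os.path.commonprefix([s[::-1] for s in strings])[::-1]
    return "{}*{}".format(prefix, suffix)


# All strings equal: prefix and suffix are each the whole string
print(naive_generic_string(["abc", "abc"]))   # abc*abc

# Prefix 'aB' and suffix 'Bc' share the letter 'B' of 'aBc'
print(naive_generic_string(["aBBc", "aBc"]))  # aB*Bc
```

The second pattern needs at least four literal characters, so it cannot match the three-character input 'aBc'.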

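So an all-in-one method is needed: it first tests the list to make sure at least 2 strings differ (condensing the prefix/suffix formula in the process), and it computes the common suffix on slices with the common prefix removed: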

import os


def compute_generic_string(dataset):
    # edge case where all strings are the same
    if len(set(dataset)) == 1:
        return dataset[0]

    commonprefix = os.path.commonprefix(dataset)

    # common suffix of what remains after stripping the common prefix
    tails = [s[len(commonprefix):][::-1] for s in dataset]
    commonsuffix = os.path.commonprefix(tails)[::-1]

    return "{}*{}".format(commonprefix, commonsuffix)

Now let's test:

for dataset in [['id.4030.paid', 'id.1280.paid', 'id.88.paid'], ['aBBc', 'aBc'], []]:
    print(compute_generic_string(dataset))

Result:

id.*.paid 
aB*c 
*

(When the data set is empty, the code returns *; perhaps that should be another edge case.)
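One way to sanity-check the result, sketched here as an assumption rather than as part of the answer, is to confirm with fnmatch.fnmatchcase that the generated pattern matches every input string. The helper matches_all is hypothetical, and compute_generic_string is repeated only so that the snippet runs on its own. The check relies on the inputs containing no other glob metacharacters such as ?, [ or ].

```python
import os
from fnmatch import fnmatchcase


def compute_generic_string(dataset):
    # Same logic as the function above, repeated to keep this sketch standalone
    if len(set(dataset)) == 1:
        return dataset[0]
    prefix = os.path.commonprefix(dataset)
    tails = [s[len(prefix):][::-1] for s in dataset]
    return "{}*{}".format(prefix, os.path.commonprefix(tails)[::-1])


def matches_all(pattern, dataset):
    # Hypothetical helper: True if the glob-style pattern matches every string
    return all(fnmatchcase(s, pattern) for s in dataset)


files = ["file=sheet-2016-12-08.xlsx", "file=sheet-2016-11-21.xlsx",
         "file=sheet-2016-11-12.xlsx", "file=sheet-2016-11-08.xlsx",
         "file=sheet-2016-10-22.xlsx", "file=sheet-2016-09-29.xlsx",
         "file=sheet-2016-09-05.xlsx", "file=sheet-2016-09-04.xlsx"]
sizes = ["size=1024KB", "size=22KB", "size=980KB", "size=15019KB", "size=202KB"]

for dataset in (files, sizes, ["aBBc", "aBc"]):
    pattern = compute_generic_string(dataset)
    print(pattern, matches_all(pattern, dataset))
```

On the two data sets from the question this prints file=sheet-2016-*.xlsx and size=*KB, each followed by True.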

Source

2017-08-25 22:35:44

Dang, 'os.path.commonprefix'! How long has that been around? – wim

Upvote for commonprefix ... I didn't know it existed. – Solaxun

Quite a nice one, plus one. –

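from os.path import commonprefix


def commonsuffix(m):
    # Reverse each string, take the common prefix, and reverse it back
    return commonprefix([s[::-1] for s in m])[::-1]


def inverse_glob(strs):
    start = commonprefix(strs)
    n = len(start)
    # Compute the suffix only on what is left after the common prefix
    ends = [s[n:] for s in strs]
    end = commonsuffix(ends)
    if start and not any(ends):
        # All strings are identical: no wildcard needed
        return start
    else:
        return start + '*' + end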

This problem is more subtle than it looks on the surface.

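As currently specified, the problem is still not well constrained, i.e. there is no unique solution. For the input ['spamAndEggs', 'spamAndHamAndEggs'], both spam*AndEggs and spamAnd*Eggs are valid answers. For the input ['aXXXXz', 'aXXXz'] there are four possible solutions. In the code given above, we prefer the longest possible prefix in order to make the solution unique.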
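The ambiguity can be made visible with a short sketch that lists every single-asterisk pattern keeping as many literal characters as possible. This is an illustration under that assumed definition of "valid", not part of the answer; the name maximal_patterns is made up, and the sketch assumes a non-empty list in which at least two strings differ.

```python
from os.path import commonprefix


def maximal_patterns(strs):
    # Hypothetical helper: all prefix*suffix patterns that keep the largest
    # possible number of literal characters around a single asterisk.
    start = commonprefix(strs)
    end = commonprefix([s[::-1] for s in strs])[::-1]
    shortest = min(len(s) for s in strs)
    # Prefix and suffix together cannot be longer than the shortest string
    total = min(len(start) + len(end), shortest)
    patterns = []
    for i in range(len(start) + 1):
        j = total - i  # suffix length that goes with a prefix of length i
        if 0 <= j <= len(end):
            patterns.append(start[:i] + "*" + end[len(end) - j:])
    return patterns


print(maximal_patterns(["aXXXXz", "aXXXz"]))
# ['a*XXXz', 'aX*XXz', 'aXX*Xz', 'aXXX*z']
print(maximal_patterns(["id.4030.paid", "id.1280.paid", "id.88.paid"]))
# ['id.*.paid']
```

The last entry of the first list, aXXX*z, is the one with the longest prefix, which is the choice that inverse_glob makes.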

Credit to JFF's answer for pointing out the existence of os.path.commonprefix.

Inverse glob - reverse engineer a wildcard string from file names is a related and harder generalization of this problem.

Source

2017-08-25 22:41:38 wim

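Thanks for your comments and your help in making my solution better. Some might object that your solution is a copy of mine, but without your comments I could not have arrived at a working one. –

FWIW my [original solution](https://stackoverflow.com/revisions/45890262/1) had the same bug as yours. I deleted it when I saw your implementation, which was better. – wim

We can say we beat that one together :) –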
