Python3快速的方法来查找如果集合中的任何元素是字符串的子串

如果我有一个collection of strings是否有一个数据结构或函数可以提高检查集合中的任何元素是否为主串中的substrings ？Python3快速的方法来查找如果集合中的任何元素是字符串的子串

现在我正在循环访问我的字符串数组并使用in运算符。有更快的方法吗？

import timing 

## string match in first do_not_scan 
## 0:00:00.029332 

## string not in do_not_scan 
## 0:00:00.035179 
def check_if_substring(): 
    for x in do_not_scan: 
     if x in string: 
      return True 
    return False 

## string match in first do_not_scan 
## 0:00:00.046530 

## string not in do_not_scan 
## 0:00:00.067439 
def index_of(): 
    for x in do_not_scan: 
     try: 
      string.index(x) 
      return True 
     except: 
      return False 

## string match in first do_not_scan 
## 0:00:00.047654 

## string not in do_not_scan 
## 0:00:00.070596 
def find_def(): 
    for x in do_not_scan: 
     if string.find(x) != -1: 
      return True 
    return False 

string = '/usr/documents/apps/components/login' 
do_not_scan = ['node_modules','bower_components'] 

for x in range(100000): 
    find_def() 
    index_of() 
    check_if_substring()

来源

2016-03-04 ClickThisNick

有没有可能在这里粘贴了一些错误。或者'string ='a''只是一个示例。因为'node_modules'永远不会出现在'string'中。这就是说，你可以使用地图。钥匙是“do_not_scan”的项目。然后搜索是O（1） – Cripto

只是一个示例来演示'string'可能不包含'do_not_scan'的任何元素。我以前从未使用过地图，你会怎么做呢？ – ClickThisNick

你想要'grep -l -Ff collections_of_strings main_string'的模拟吗？其中'collections_of_strings'文件包含字符串集合（每行一个），'main_string'文件包含主字符串（按原样）。 – jfs

没有，有没有更快的内置的方式提及。

如果您有大量字符串需要测试，那么使用第三方Aho-Corasick包可能会更好，如J.F. Sebastian's answer所示。

使用内置的方法，在最糟糕的情况是：没有匹配，这意味着你已经在列表中测试每一个项目和每一个项目几乎每一个偏移量。

幸运的是，in运营商（至少在CPython的）非常快，是在我的测试中三个近一个因素更快：

0.3364804992452264 # substring() 
0.867534976452589 # any_substring() 
0.8401796016842127 # find_def() 
0.9342398950830102 # index_of() 
2.7920695478096604 # re implementation

这里是我用于测试的脚本：

from timeit import timeit 
import re 

def substring(): 
    for x in do_not_scan: 
     if x in string: 
      return True 
    return False 

def any_substring(): 
    return any(x in string for x in do_not_scan) 

def find_def(): 
    for x in do_not_scan: 
     if string.find(x) != -1: 
      return True 
    return False 

def index_of(): 
    for x in do_not_scan: 
     try: 
      string.index(x) 
      return True 
     except: 
      return False 

def re_match(): 
    for x in do_not_scan: 
     if re.search(string, x): 
      return True 
    return False 

string = 'a' 
do_not_scan = ['node_modules','bower_components'] 

print(timeit('substring()', setup='from __main__ import substring')) 
print(timeit('any_substring()', setup='from __main__ import any_substring')) 
print(timeit('find_def()', setup='from __main__ import find_def')) 
print(timeit('index_of()', setup='from __main__ import index_of')) 
print(timeit('re_match()', setup='from __main__ import re_match'))

来源

2016-03-04 19:51:56

这是不正确的。你可以比'O（n * m）'做得更好，例如[Aho-Corasick算法]（https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm）是'O（n + m） “及时。 [''grep'可能会将它用于固定字符串]（http://stackoverflow.com/questions/35803016/python3-fast-way-to-find-if-any-elements-in-collections-are-substring-of-字符串＃comment59300275_35803016） – jfs

@JFSebastian：修正，谢谢。 –

def check(): 
    if any(w in string for w in do_not_scan): 
     return True 
    else: 
     return False

或者简单：

def check(): 
    return any(w in string for w in do_not_scan)

由@两位方士

来源

2016-03-04 18:13:38 Karol

第一个字符串do_not_scan = 0：00：00.085493 | 字符串不在do_not_scan = 0：00：00.074540 – ClickThisNick

简单：'返回any（w在do_not_scan中w的字符串）' –

'any'与'find_def'和'index_of'一样慢。 –

我没有一个大的数据集来尝试：

但maybes像这样的东西会起作用吗？

python3

from builtins import any 
import timeit 

do_not_scan = ['node_modules', 'bower_components'] 
string = 'a' 


def check_if_substring(): 
    return any(string in x for x in do_not_scan) 


result = timeit.Timer("check_if_substring()", "from __main__ import check_if_substring") 
count = 10000 
print(result.timeit(count)/count)

或者周围的其他方式：

def check_if_substring(): 
    return any(x in string for x in do_not_scan)

我的结果：6.48119201650843e-07

来源

2016-03-04 21:18:52 Cripto

只是好奇 - 你为什么要重命名，为什么这样？ –

这是一个副本，从过去的旧代码。在这种情况下没有意义。我修复它 – Cripto

'任何'都像'find_def'和'index_of'一样慢。 –

是的，有执行found = any(s in main_string for s in collection_of_strings)例如更快的方法，有Aho-Corasick_algorithm允许改进any()-based O(n*m*k)算法到O(n + m*k)时间操作，其中n是len(main_string),m是len(collections_of_strings)和k代表集合中字符串的各个长度。

#!/usr/bin/env python 
import noaho # $ pip install noaho 

trie = noaho.NoAho() 
for s in collection_of_strings: 
    trie.add(s) 
found = trie.find_short(main_string)[0] is not None

注：没有点来测量微小的字符串，例如string = 'a'如果你有兴趣在大O行为的实时性能。要么使用更具代表性的样本进行基准测试，要么在您的案例中不需要更快（渐近）的算法。

来源

2016-03-05 16:44:46 jfs

你可以提供任何指导哪里的切入点是使用Aho-Corasick算法而不是'in'？ –

只有你的分析器知道。常量因素取决于实现的质量，例如，Python 3.5+中的''str.translate（）'在仅有ASCII的输入上的速度可能比先前Python 3版本中的相同代码快50倍]（http：// stackoverflow。 COM/q /4279分之34287893）。 – jfs

Python3快速的方法来查找如果集合中的任何元素是字符串的子串

回答

相关问题