2013-03-19 34 views

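How to split a string and match its substrings to a list of substrings? - Python

I need to split a string in all possible ways without changing the order of its characters. I understand this task can be seen as tokenization or lemmatization in NLP, but I am attempting it from a pure string-search perspective, which is simpler and more robust. Given: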

dictionary = ['train','station', 'fire', 'a','trainer','in'] 
str1 = "firetrainstation" 

Task 1: How do I generate all possible splits into substrings, so that I get:

all_possible_substrings = [['f','iretrainstation'], 
['fi','retrainstation'], ... 
['firetrainstatio','n'], 
['f','i','retrainstation'], ... , ... 
['fire','train','station'], ... , ... 
['fire','tr','a','instation'], ... , ... 
['fire','tr','a','in','station'], ... , ... 
['f','i','r','e','t','r','a','i','n','s','t','a','t','i','o','n']] 

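Task 2: Then, from all_possible_substrings, how can I check through the candidates and pick as the correct output the split whose substrings are all elements of the dictionary? The desired output is the list of dictionary substrings that matches the most characters from left to right. That is, it should satisfy: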

"".join(desire_substring_list) == str1 and \ 
len([i for i in desire_substring_list if i in dictionary]) == len(desire_substring_list) 
#(let's assume, the above condition can be true for any input string since my english 
#language dictionary is very big and all my strings are human language 
#just written without spaces) 

Desired output:

'fire','train','station' 

What have I done?

For Task 1, I have done this, but I know it won't give me all possible whitespace insertions:

all_possible_substrings.append(" ".join(str1)) 
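
For reference, `" ".join(str1)` produces only the one split in which every character stands alone. A string of length n has n-1 gaps between characters, and each gap is either cut or not, so there are 2**(n-1) splits in total. A minimal sketch of enumerating them follows; the helper name `all_splits` is hypothetical and not from the original post:

```python
from itertools import combinations

def all_splits(s):
    """Yield every way to cut s into contiguous, non-empty pieces (as lists)."""
    n = len(s)
    # Choose any subset of the n-1 interior gaps as cut positions.
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = (0,) + cuts + (n,)
            # Slice between consecutive boundaries to get the pieces.
            yield [s[a:b] for a, b in zip(bounds, bounds[1:])]

splits = list(all_splits("fire"))
print(len(splits))  # 2 ** (len("fire") - 1) == 8
```

For the 16-character `str1` above this yields 32768 candidate lists, which is still small enough to enumerate but grows exponentially with the length of the string.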

I have also done this, but it only handles Task 2:

import re 
seed = ['train','station', 'fire', 'a','trainer','in'] 
str1 = "firetrainstation" 
all_possible_string = [['f','iretrainstation'], 
['fi','retrainstation'], 
['firetrainstatio','n'], 
['f','i','retrainstation'], 
['fire','train','station'], 
['fire','tr','a','instation'], 
['fire','tr','a','in','station'], 
['f','i','r','e','t','r','a','i','n','s','t','a','t','i','o','n']] 
# Match only whole, space-delimited tokens that are dictionary words.
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(s) for s in seed) + r')\b') 
highest_match = "" 
for candidate in all_possible_string: 
    x = pattern.findall(" ".join(candidate)) 
    # Keep the candidate only if its dictionary tokens rebuild the whole string.
    if "".join(x) == str1 and len([w for w in x if w in seed]) == len(x): 
        print(" ".join(x)) 

Note that your dictionary is actually a `list`. – mgilson 2013-03-19 01:24:19


Also, I'm pretty sure you need to explain more. Why is "foo", "bar", "bar", "str" the desired output? – mgilson 2013-03-19 01:25:39


Updated the desired output. – alvas 2013-03-19 01:35:28

Answers


For the first part, you could write a recursive generator like this:

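>>> def all_substr(string):
...     for i in range(len(string)):
...         # On the last index, emit the string unsplit.
...         if i == len(string) - 1:
...             yield string
...         first_part = string[:i + 1]
...         second_part = string[i + 1:]
...         # Pair this prefix with every split of the remainder.
...         for j in all_substr(second_part):
...             yield ','.join([first_part, j])
...
>>> for x in all_substr('apple'):
...     print(x)
...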


a,p,p,l,e 
a,p,p,le 
a,p,pl,e 
a,p,ple 
a,pp,l,e 
a,pp,le 
a,ppl,e 
a,pple 
ap,p,l,e 
ap,p,le 
ap,pl,e 
ap,ple 
app,l,e 
app,le 
appl,e 
apple 
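
For the second part, one possible approach (a sketch, not part of the answer above) is to skip generating every split and recurse only on prefixes that appear in the dictionary. The function name `segment` and the use of a `set` for the word list are assumptions made for illustration:

```python
def segment(string, words):
    """Yield every split of string whose pieces are all members of words."""
    if not string:
        # Nothing left to split: one valid (empty) continuation.
        yield []
        return
    for i in range(1, len(string) + 1):
        prefix = string[:i]
        # Only follow prefixes that are dictionary words.
        if prefix in words:
            for rest in segment(string[i:], words):
                yield [prefix] + rest

dictionary = {'train', 'station', 'fire', 'a', 'trainer', 'in'}
print(list(segment("firetrainstation", dictionary)))
# [['fire', 'train', 'station']]
```

Because only dictionary prefixes are followed, this avoids enumerating all 2**(n-1) splits. Inputs with many ambiguous segmentations may still be slow without memoization.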