Finding tokens connected by hyphens

I am writing code that takes a list of text tokens as input:

tokens = ["Tap-", "Berlin", "Was-ISt", "das", "-ist", "cool", "oh", "Man", "-Hum", "-Zuh-UH-", "glit"]

The code should find all tokens that contain a hyphen or that are connected to each other by hyphens. Basically, the output should be:

[["Tap-", "Berlin"], ["Was-ISt"], ["das", "-ist"], ["Man", "-Hum", "-Zuh-UH-", "glit"]]

I wrote some code, but somehow I am not getting the hyphen-connected tokens back. You can try it here: http://goo.gl/iqov0q

def find_hyphens(self):
    tokens_with_hypens = []

    for i in range(len(self.tokens)):
        hyp_leng = 0

        while self.hypen_between_two_tokens(i + hyp_leng):
            hyp_leng += 1

        if self.has_hypen_in_middle(i) or hyp_leng > 0:
            if hyp_leng == 0:
                tokens_with_hypens.append(self.tokens[i:i + 1])
            else:
                tokens_with_hypens.append(self.tokens[i:i + hyp_leng])
                i += hyp_leng - 1

    return tokens_with_hypens

What should I do? Is there a more performant solution? Thanks

Source

2015-11-29 John Smith

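I found 3 errors in your code:

1) Here you compare the last 2 characters of tok1, rather than the last character of tok1 and the first character of tok2: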

if "-" in joined[len(tok1) - 2: len(tok1)]: 
# instead, do this: 
if "-" in joined[len(tok1) - 1: len(tok1) + 1]:

2) Here you omit the last matching token. Increase the end index of your slice by 1:

tokens_with_hypens.append(self.tokens[i:i + hyp_leng]) 
# instead, do this: 
tokens_with_hypens.append(self.tokens[i:i + 1 + hyp_leng])

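3) You cannot manipulate the index of a for i in range loop in Python. The next iteration retrieves the next index from the range and overwrites your change. Instead, you can use a while loop like this: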

i = 0 
while i < len(self.tokens): 
    [...] 
    i += 1

With these 3 corrections the test passes: http://goo.gl/fd07oL
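For reference, here is a minimal standalone sketch with all three corrections applied. The question does not show its helper methods, so hyphen_between and has_hyphen_in_middle below are hypothetical reconstructions based on the names tok1, tok2 and joined used above, and the sketch assumes no token is an empty string:

```python
def hyphen_between(tokens, i):
    """Hypothetical helper: True if tokens[i] and tokens[i + 1] meet at a hyphen."""
    if i + 1 >= len(tokens):
        return False
    tok1, tok2 = tokens[i], tokens[i + 1]
    joined = tok1 + tok2
    # Fix 1: inspect the last character of tok1 and the first character of tok2.
    return "-" in joined[len(tok1) - 1: len(tok1) + 1]


def has_hyphen_in_middle(token):
    """Hypothetical helper: True if a hyphen occurs strictly inside the token."""
    return "-" in token[1:-1]


def find_hyphens(tokens):
    groups = []
    i = 0
    # Fix 3: a while loop, so that changes to i actually take effect.
    while i < len(tokens):
        hyp_leng = 0
        while hyphen_between(tokens, i + hyp_leng):
            hyp_leng += 1
        if has_hyphen_in_middle(tokens[i]) or hyp_leng > 0:
            # Fix 2: the slice end is exclusive, so add 1 to keep the last token.
            groups.append(tokens[i:i + 1 + hyp_leng])
            i += hyp_leng  # skip the tokens that were just grouped
        i += 1
    return groups


tokens = ["Tap-", "Berlin", "Was-ISt", "das", "-ist", "cool", "oh",
          "Man", "-Hum", "-Zuh-UH-", "glit"]
print(find_hyphens(tokens))
```

Because the corrected slice covers the hyp_leng == 0 case as well, the separate branch for single tokens is no longer needed in this sketch.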

However, I could not resist writing an algorithm from scratch that solves your problem as simply as possible:

def get_hyphen_groups(tokens):
    i_start, i_end = 0, 1
    while i_start < len(tokens):
        while (i_end < len(tokens) and
               (tokens[i_end].startswith("-") ^ tokens[i_end - 1].endswith("-"))):
            i_end += 1
        yield tokens[i_start:i_end]
        i_start, i_end = i_end, i_end + 1


tokens = ["Tap-", "Berlin", "Was-ISt", "das", "-ist", "cool", "oh", "Man", "-Hum", "-Zuh-UH-", "glit"]

for group in get_hyphen_groups(tokens):
    print("".join(group))

To exclude 1-element groups, as in your expected result, wrap the yield in this if:

if i_end - i_start > 1: 
    yield tokens[i_start:i_end]

To include 1-element groups that already contain a hyphen, change that if to this, for example:

This differs from your approach:

if i_end - i_start > 1 or "-" in tokens[i_start]: 
    yield tokens[i_start:i_end]
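Put together, the generator with the second filter can be sketched as follows. This is only a combination of the snippets above, not a separately tested implementation from the answer:

```python
def get_hyphen_groups(tokens):
    """Yield runs of hyphen-connected tokens, plus single tokens containing a hyphen."""
    i_start, i_end = 0, 1
    while i_start < len(tokens):
        # Extend the run while exactly one side of the boundary carries a hyphen.
        while (i_end < len(tokens) and
               (tokens[i_end].startswith("-") ^ tokens[i_end - 1].endswith("-"))):
            i_end += 1
        # Keep multi-token runs and single tokens that contain a hyphen.
        if i_end - i_start > 1 or "-" in tokens[i_start]:
            yield tokens[i_start:i_end]
        i_start, i_end = i_end, i_end + 1


tokens = ["Tap-", "Berlin", "Was-ISt", "das", "-ist", "cool", "oh",
          "Man", "-Hum", "-Zuh-UH-", "glit"]
print(list(get_hyphen_groups(tokens)))
```

Note that the ^ (exclusive or) does not join two tokens when both sides of the boundary have a hyphen, for example "a-" followed by "-b". If such pairs should be grouped as well, replacing ^ with or would be one option.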

Source

2015-11-29 20:45:40 Felk

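One problem is that you try to change the value of i inside the for i in range(len(self.tokens)) loop. This does not work, because on every iteration i is assigned the next value from range, which discards your change.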
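A small illustration of this behaviour (the function names are made up for the example):

```python
def visited_with_for(n, step):
    """Collect the indices a for loop visits while trying to skip ahead."""
    seen = []
    for i in range(n):
        seen.append(i)
        i += step  # overwritten by the next value from range()
    return seen


def visited_with_while(n, step):
    """Collect the indices a while loop visits while skipping ahead."""
    seen = []
    i = 0
    while i < n:
        seen.append(i)
        i += step  # takes effect, because nothing reassigns i
        i += 1
    return seen


print(visited_with_for(6, 2))    # [0, 1, 2, 3, 4, 5]
print(visited_with_while(6, 2))  # [0, 3]
```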

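I changed your algorithm into an iterative one that pops one item at a time from the list and decides what to do with it. It uses a buffer to store the items that belong to one chain until the chain is complete.

The complete code is: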

class Hyper:

    def __init__(self, tokens):
        self.tokens = tokens

    def find_hyphens(self):
        tokens_with_hypens = []

        # work on a copy so that self.tokens is left untouched
        copy = list(self.tokens)

        buffer = []
        while len(copy) > 0:
            item = copy.pop(0)
            if self.has_hyphen_in_middle(item) and item[0] != '-' and item[-1] != '-':
                # words with hyphens that are not part of a bigger chain
                tokens_with_hypens.append([item])
            elif item[-1] == '-' or (len(copy) > 0 and copy[0][0] == '-'):
                # part of a chain - append to the buffer
                buffer.append(item)
            elif len(buffer) > 0:
                # the last word in a chain - the buffer contains the complete chain
                buffer.append(item)
                tokens_with_hypens.append(buffer)
                buffer = []

        return tokens_with_hypens

    @staticmethod
    def has_hyphen_in_middle(token):
        # look at every character except the first and the last one
        return len(token) > 2 and "-" in token[1:-1]


tokens = ["Tap-", "Berlin", "Was-ISt", "das", "-ist", "cool", "oh", "Man", "-Hum", "-Zuh-UH-", "glit"]

hyper = Hyper(tokens)

result = hyper.find_hyphens()

print(result)

print(result == [["Tap-", "Berlin"], ["Was-ISt"], ["das", "-ist"], ["Man", "-Hum", "-Zuh-UH-", "glit"]])
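On the performance question: list.pop(0) has to shift every remaining element, so popping from the front of a long list costs time proportional to its length on every step. A possible variant, sketched here as a plain function with the hypothetical name find_hyphen_chains, uses collections.deque, whose popleft() runs in constant time. It follows the same buffer logic and assumes no token is an empty string:

```python
from collections import deque


def find_hyphen_chains(tokens):
    """Buffer-based grouping, using a deque for cheap removal from the front."""
    queue = deque(tokens)
    result, buffer = [], []
    while queue:
        item = queue.popleft()
        if "-" in item[1:-1] and item[0] != "-" and item[-1] != "-":
            # a hyphenated word that is not part of a bigger chain
            result.append([item])
        elif item[-1] == "-" or (queue and queue[0][0] == "-"):
            # part of a chain that continues with the next token
            buffer.append(item)
        elif buffer:
            # last token of a chain: the buffer now holds the complete chain
            buffer.append(item)
            result.append(buffer)
            buffer = []
    return result


tokens = ["Tap-", "Berlin", "Was-ISt", "das", "-ist", "cool", "oh",
          "Man", "-Hum", "-Zuh-UH-", "glit"]
print(find_hyphen_chains(tokens))
```

For a token list as short as the one in the question the difference is negligible, so this only matters for large inputs.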

Source

2015-11-29 20:45:56 Szymon
