Get character index from word index

Given the index of a word in a text, I need to get the corresponding character index. For example, in the text below:

"The cat called other cats."

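The index of the word "cat" is 1. I need the index of the first character of "cat", i.e. c, which would be 4. I don't know if it's relevant, but I'm using python-nltk to get the words. Right now the only way I can think of to do this is: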

- Get the first character, find the number of words in this piece of text 
- Get the first two characters, find the number of words in this piece of text 
- Get the first three characters, find the number of words in this piece of text 
Repeat until we get to the required word.

However, this would be very inefficient. Any ideas would be appreciated.
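The prefix-growing idea listed above can be written down directly, which also makes its cost visible: every prefix is re-tokenized, so the work grows roughly quadratically with the length of the text. This is only a minimal sketch. It assumes whitespace splitting via `str.split()` as a stand-in for the real tokenizer, and `char_index_by_prefix` is a hypothetical name.

```python
def char_index_by_prefix(text, word_index):
    """Return the character index at which word number `word_index` starts.

    Brute force: grow the prefix one character at a time and re-count the
    words in it. str.split() is an assumed stand-in for the real tokenizer.
    """
    for end in range(1, len(text) + 1):
        # The target word has just started once the prefix holds
        # word_index + 1 words for the first time.
        if len(text[:end].split()) == word_index + 1:
            return end - 1
    return None  # the text has fewer than word_index + 1 words


print(char_index_by_prefix("The cat called other cats.", 1))  # 4
```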

Source

2013-06-24 GDev

Thanks for the ideas. However, I can't simply split the text on whitespace. I'm using TreebankWordTokenizer. – GDev
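Since the tokens come from TreebankWordTokenizer rather than a whitespace split, one tokenizer-agnostic option is to align the token list back onto the original string, searching for each token from where the previous one ended. The sketch below is a minimal illustration, not NLTK's own API: `token_offsets` is a hypothetical helper, the token list is hand-written to resemble what a tokenizer might return, and it assumes every token appears verbatim in the text (a tokenizer that rewrites characters, for example quotation marks, would need extra handling).

```python
def token_offsets(text, tokens):
    """Map each token to the character index where it starts in `text`.

    Assumes the tokens appear in order and verbatim in `text`.
    """
    offsets = []
    cursor = 0  # never search before the end of the previous token
    for token in tokens:
        start = text.find(token, cursor)
        if start == -1:
            raise ValueError(f"token {token!r} not found after index {cursor}")
        offsets.append(start)
        cursor = start + len(token)
    return offsets


text = "The cat called other cats."
# Hand-written token list (an assumption standing in for tokenizer output).
tokens = ["The", "cat", "called", "other", "cats", "."]
offsets = token_offsets(text, tokens)
print(offsets[1])  # 4
```

Because the search resumes after the previous token, a repeated word maps to its own occurrence rather than to the first one in the text.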

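import re
def char_index(sentence, word_index):
    # The capturing group makes re.split keep the whitespace separators
    parts = re.split(r'(\s)', sentence)
    # Words sit at even positions, so the first word_index*2 items are
    # everything that precedes the target word
    return len(''.join(parts[:word_index * 2]))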

>>> s = 'The die has been cast'
>>> char_index(s, 3)  # 'been' has index 3 in the list of words
12
>>> s[12]
'b'

Source

2013-06-24 04:15:05

What happens when the first character of the word in this example (in this case "b") is used earlier in the sentence? – sberry

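As long as 'was' isn't used in –

this sentence, then it's fine. Ah, you are searching for the word itself. So what happens when the target word appears earlier in the sentence? – sberry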

Use enumerate():

>>> def obt(phrase, indx): 
...  word = phrase.split()[indx] 
...  e = list(enumerate(phrase)) 
...  for i, j in e: 
...    if j == word[0] and ''.join(x for y, x in e[i:i+len(word)]) == word: 
...      return i 
... 
>>> obt("The cat called other cats.", 1) 
4

Source

2013-06-24 04:25:45 TerryA

This is wrong: if the first character of the word appears earlier in the sentence, it returns that index. –

@JeremyBentham I noticed. I'm fixing it :) – TerryA
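The problem raised in these comments can be sidestepped by recording where each word starts while scanning, rather than searching for the word's text afterwards. A minimal sketch, assuming words are whitespace-delimited as with `phrase.split()`; `word_start` is a hypothetical name, and `re.finditer` is swapped in for the `enumerate()` loop over characters:

```python
import re


def word_start(phrase, indx):
    """Return the start index of whitespace-delimited word number `indx`."""
    for i, match in enumerate(re.finditer(r"\S+", phrase)):
        if i == indx:
            return match.start()
    return None  # the phrase has fewer than indx + 1 words


# The target word "cat" also occurs earlier, inside "cats".
print(word_start("cats and a cat", 3))  # 11
print(word_start("The cat called other cats.", 1))  # 4
```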

You can use a dict here:

>>> import re 
>>> r = re.compile(r'\w+') 
>>> text = "The cat called other cats." 
>>> dic = { i :(m.start(0), m.group(0)) for i, m in enumerate(r.finditer(text))} 
>>> dic 
{0: (0, 'The'), 1: (4, 'cat'), 2: (8, 'called'), 3: (15, 'other'), 4: (21, 'cats')} 
>>> def char_index(char, word_ind):
...     start, word = dic[word_ind]
...     ind = word.find(char)
...     if ind != -1:
...         return start + ind
...
>>> char_index('c',1) 
4 
>>> char_index('c',2) 
8 
>>> char_index('c',3)  # 'other' contains no 'c': returns None, so nothing is printed
>>> char_index('c',4) 
21

Source

2013-06-24 04:48:11

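I think you've gone a step further than the OP asked. I don't **think** the OP wants to find the index of a specific character given a word index, but rather the index of the first character every time. Still, a general solution like this may be better anyway. +1. – sberry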
