2015-10-15
4

I have a set of keywords, about 10 of them. I want to search a very long document and check whether the set of keywords can be found in it — not merely whether each keyword appears somewhere in the text, but whether all of them (or some subset) occur within a proximity cutoff, e.g. within 3 sentences, or 30 words, or some other proximity measure. How can this be done? I have thought of writing some Python code that finds one keyword and then checks whether the other keywords occur within, say, 3 lines of text, but that would take a lot of computing power and be inefficient. How can I find a set of keywords in a document, with some of them within a given distance cutoff?

+0

Lucene can perform [proximity searches](https://lucene.apache.org/core/4_10_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Proximity_Searches). – approxiblue

+0

But how would I apply what Lucene provides to find 10 words using a proximity criterion? – flow

+0

There is a significant difference between finding all windows that contain the words and merely finding whether such a window exists; they might even require different algorithms for optimal performance. How long is "long" — a few megabytes, or tens of megabytes? – NikoNyrh

Answers

1

A suggestion to solve this would be to create a (hash) map: enter each word as a key, and add the word's location to a list, which serves as that key's value in the map.

For the text "The quick brown fox jumps over the lazy dog" this results in a model like the following (shown in JSON format).

Note: here all words are added to the index as if they were written in lowercase.

{ 
    "document": [ 
     { 
      "key": "the", 
      "value": [ 
       { 
        "location": 1 
       }, 
       { 
        "location": 7 
       } 
      ] 
     }, 
     { 
      "key": "quick", 
      "value": [ 
       { 
        "location": 2 
       } 
      ] 
     }, 
     { 
      "key": "brown", 
      "value": [ 
       { 
        "location": 3 
       } 
      ] 
     }, 
     { 
      "key": "fox", 
      "value": [ 
       { 
        "location": 4 
       } 
      ] 
     }, 
     { 
      "key": "jumps", 
      "value": [ 
       { 
        "location": 5 
       } 
      ] 
     }, 
     { 
      "key": "over", 
      "value": [ 
       { 
        "location": 6 
       } 
      ] 
     }, 
     { 
      "key": "lazy", 
      "value": [ 
       { 
        "location": 8 
       } 
      ] 
     }, 
     { 
      "key": "dog", 
      "value": [ 
       { 
        "location": 9 
       } 
      ] 
     } 
    ] 
} 

Once the index is built, it is easy to see how far apart different words are from one another. As can be seen for the word "the", it occurs at locations 1 and 7.

Also, the number of times a word occurs in the text can easily be obtained from the number of locations stored for that word.

Hint: additionally store location information such as chapter/section/page etc.
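The map described in this answer can be sketched in a few lines of Python (a minimal illustration of the idea; the function name `build_index` is my own, not from the answer):

```python
from collections import defaultdict

def build_index(text):
    # Map each lowercased word to the list of its 1-based locations.
    index = defaultdict(list)
    for pos, word in enumerate(text.lower().split(), start=1):
        index[word].append(pos)
    return index

index = build_index("The quick brown fox jumps over the lazy dog")
print(index["the"])   # -> [1, 7]
print(len(index["the"]))  # occurrence count comes free: 2
```

From such an index, the distance between two words is just the difference of their stored locations.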

3

To determine whether a set of keywords occurs within a given distance inside a larger document, you can use a sliding window whose length equals the given distance and move it across the whole document. As you move the window, keep track of each word that enters and leaves it. If at any point the window contains all the keywords, the condition is satisfied. The time complexity of this approach is O(len(document)) and the memory complexity is O(len(window)).

Below is an example Python implementation of the approach described above:

from collections import defaultdict

def isInProximityWindow(doc, keywords, windowLen):
    words = doc.split()
    wordsLen = len(words)
    if windowLen > wordsLen:
        windowLen = wordsLen

    keywordsLen = len(keywords)
    allKeywordLocs = defaultdict(set)  # keyword -> its locations currently in the window
    numKeywordsInWindow = 0
    locKeyword = {}  # location -> keyword found at that location
    for i in range(wordsLen):
        windowContents = sorted(k for k in allKeywordLocs if allKeywordLocs[k])
        print("On beginning of iteration #%i, window contains '%s'" % (i, ','.join(windowContents)))

        # The word at position i - windowLen slides out of the window.
        oldKeyword = locKeyword.pop(i - windowLen, None)
        if oldKeyword:
            keywordLocs = allKeywordLocs[oldKeyword]
            keywordLocs.remove(i - windowLen)
            if not keywordLocs:
                print("'%s' fell out of window" % oldKeyword)
                numKeywordsInWindow -= 1
        word = words[i]
        print("Next word is '%s'" % word)
        if word in keywords:
            locKeyword[i] = word
            keywordLocs = allKeywordLocs[word]
            if not keywordLocs:
                print("'%s' fell in window" % word)
                numKeywordsInWindow += 1
                if numKeywordsInWindow == keywordsLen:
                    return True
            keywordLocs.add(i)
    return False

Sample output:

>>> isInProximityWindow("the brown cow jumped over the moon and the red fox jumped over the black dog", {"fox", "over", "the"}, 4) 
On beginning of iteration #0, window contains '' 
Next word is 'the' 
'the' fell in window 
On beginning of iteration #1, window contains 'the' 
Next word is 'brown' 
On beginning of iteration #2, window contains 'the' 
Next word is 'cow' 
On beginning of iteration #3, window contains 'the' 
Next word is 'jumped' 
On beginning of iteration #4, window contains 'the' 
'the' fell out of window 
Next word is 'over' 
'over' fell in window 
On beginning of iteration #5, window contains 'over' 
Next word is 'the' 
'the' fell in window 
On beginning of iteration #6, window contains 'over,the' 
Next word is 'moon' 
On beginning of iteration #7, window contains 'over,the' 
Next word is 'and' 
On beginning of iteration #8, window contains 'over,the' 
'over' fell out of window 
Next word is 'the' 
On beginning of iteration #9, window contains 'the' 
Next word is 'red' 
On beginning of iteration #10, window contains 'the' 
Next word is 'fox' 
'fox' fell in window 
On beginning of iteration #11, window contains 'fox,the' 
Next word is 'jumped' 
On beginning of iteration #12, window contains 'fox,the' 
'the' fell out of window 
Next word is 'over' 
'over' fell in window 
On beginning of iteration #13, window contains 'fox,over' 
Next word is 'the' 
'the' fell in window 
True 
1

I ran some simple benchmarks under these conditions:

  • Python 3.4 on Windows
  • 150 distinct random words, 5 to 16 characters long
  • 10 search terms, all of which must be found
  • window length of 75
  • iterating over 50 million words
  • about 514 million characters in total

Word generation:

import numpy as np

# word(i), n_distinct_words and n_words are assumed to be defined elsewhere
def generator(gen_salt):
    words = [word(i) for i in range(n_distinct_words)]
    np.random.seed(123)

    for i in range(int(n_words)):
        yield words[np.random.randint(0, n_distinct_words)]

Search code (where words = generator, search_words = set, window_len = int):

from collections import deque
from time import time

def deque_window(words, search_words, window_len):
    start = time()
    result = []
    pos = 0

    # a deque with maxlen automatically evicts the oldest word on append
    window = deque([], window_len)

    for word in words:
        window.append(word)

        if word in search_words:
            # check whether every search word is somewhere in the window
            all_found = True
            for search_word in search_words:
                if search_word not in window:
                    all_found = False
                    break

            if all_found:
                result.append(pos)

        pos += 1

    return result, time() - start

In fact, it took 31 seconds just to compute the total number of characters, and only 48 seconds to find the indices at which the window contained all the search words. I am not sure whether it is randint or the list lookup that is really slow. I would need a more efficient generator; perhaps I will store the result on disk and try reading it from there (which would be closer to your scenario).

The total number of characters was computed like this:

sum(len(w) for w in words) 
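The `search_word not in window` test above scans the whole deque once per search word. A counting variant can instead track how many distinct search words are currently inside the window, making each step O(1). This is my own variation, not part of the answer; the function name `counted_window` is hypothetical:

```python
from collections import deque, Counter

def counted_window(words, search_words, window_len):
    # counts[w] = occurrences of search word w inside the current window
    counts = Counter()
    window = deque()
    result = []
    for pos, word in enumerate(words):
        window.append(word)
        if len(window) > window_len:
            old = window.popleft()
            if old in search_words:
                counts[old] -= 1
                if counts[old] == 0:
                    del counts[old]  # keep len(counts) == distinct words present
        if word in search_words:
            counts[word] += 1
            # all search words present iff we hold a count for each of them
            if len(counts) == len(search_words):
                result.append(pos)
    return result
```

For example, `counted_window("the brown cow jumped over the moon and the red fox jumped over the black dog".split(), {"fox", "over", "the"}, 4)` returns `[13]`, matching the sliding-window answer above.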
+0

I am really interested — how could I contact you about working on a research-paper proposal? – flow

0

All you need for this is the open-source Apache Solr software.

Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene™.

Follow this link for more information. Trust me, it delivers fast results even for terabytes of data.
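For reference, the standard Lucene query parser (which Solr uses) expresses proximity with the `~` operator on a quoted phrase; the following matches documents where the two terms occur within 30 words of each other:

```
"fox dog"~30
```

Windows over more than two terms are not expressible with this operator alone; at the API level Lucene handles them with `SpanNearQuery` (a list of term clauses plus a slop value), and Solr exposes related functionality through its ComplexPhrase query parser.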
