2015-10-15
4

I have a set of keywords, about 10 of them. I want to search a very long document and check whether the set of keywords can be found in it — not merely whether each keyword appears somewhere in the text, but whether all of them (or some subset) occur within a proximity cutoff, e.g. within 3 sentences, or 30 words, or some other proximity measure. How can this be done? I have thought of writing some Python code that finds one keyword and then checks whether the other keywords occur within, say, 3 lines of text, but that would take a lot of computing power and be inefficient. How can I find a set of keywords in a document, with some of them within a given distance cutoff?

+0

Lucene can perform [proximity searches](https://lucene.apache.org/core/4_10_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Proximity_Searches). – approxiblue

+0

But how would I apply what Lucene provides to find 10 words using a proximity criterion? – flow

+0

There is a significant difference between finding all windows that contain the words and merely finding whether such a window exists; they might even require different algorithms for optimal performance. How long is "long" — a few megabytes, or tens of megabytes? – NikoNyrh

Answers

1

A suggestion to solve this would be to create a (hash) map: enter each word as a key, and add the word's location to a list, which serves as that key's value in the map.

For the text "The quick brown fox jumps over the lazy dog" this results in a model like the following (shown in JSON format).

Note: here all words are added to the index as if they were written in lowercase.

{ 
    "document": [ 
     { 
      "key": "the", 
      "value": [ 
       { 
        "location": 1 
       }, 
       { 
        "location": 7 
       } 
      ] 
     }, 
     { 
      "key": "quick", 
      "value": [ 
       { 
        "location": 2 
       } 
      ] 
     }, 
     { 
      "key": "brown", 
      "value": [ 
       { 
        "location": 3 
       } 
      ] 
     }, 
     { 
      "key": "fox", 
      "value": [ 
       { 
        "location": 4 
       } 
      ] 
     }, 
     { 
      "key": "jumps", 
      "value": [ 
       { 
        "location": 5 
       } 
      ] 
     }, 
     { 
      "key": "over", 
      "value": [ 
       { 
        "location": 6 
       } 
      ] 
     }, 
     { 
      "key": "lazy", 
      "value": [ 
       { 
        "location": 8 
       } 
      ] 
     }, 
     { 
      "key": "dog", 
      "value": [ 
       { 
        "location": 9 
       } 
      ] 
     } 
    ] 
} 

Once the index is built, it is easy to see how far apart different words are from one another. As can be seen for the word "the", it occurs at locations 1 and 7.

Also, the number of times a word occurs in the text can easily be obtained from the number of locations stored for that word.

Hint: additionally store location information such as chapter/section/page etc.
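The map described in this answer can be sketched in a few lines of Python (a minimal illustration of the idea; the function name `build_index` is my own, not from the answer):

```python
from collections import defaultdict

def build_index(text):
    # Map each lowercased word to the list of its 1-based locations.
    index = defaultdict(list)
    for pos, word in enumerate(text.lower().split(), start=1):
        index[word].append(pos)
    return index

index = build_index("The quick brown fox jumps over the lazy dog")
print(index["the"])   # -> [1, 7]
print(len(index["the"]))  # occurrence count comes free: 2
```

From such an index, the distance between two words is just the difference of their stored locations.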

3

To determine whether a set of keywords occurs within a given distance inside a larger document, you can use a sliding window whose length equals the given distance and move it across the whole document. As you move the window, keep track of each word that enters and leaves it. If at any point the window contains all the keywords, the condition is satisfied. The time complexity of this approach is O(len(document)) and the memory complexity is O(len(window)).

Below is an example Python implementation of the approach described above:

from collections import defaultdict

def isInProximityWindow(doc, keywords, windowLen):
    words = doc.split()
    wordsLen = len(words)
    if windowLen > wordsLen:
        windowLen = wordsLen

    keywordsLen = len(keywords)
    allKeywordLocs = defaultdict(set)  # keyword -> its locations currently in the window
    numKeywordsInWindow = 0
    locKeyword = {}  # location -> keyword found at that location
    for i in range(wordsLen):
        windowContents = sorted(k for k in allKeywordLocs if allKeywordLocs[k])
        print("On beginning of iteration #%i, window contains '%s'" % (i, ','.join(windowContents)))

        # The word at position i - windowLen slides out of the window.
        oldKeyword = locKeyword.pop(i - windowLen, None)
        if oldKeyword:
            keywordLocs = allKeywordLocs[oldKeyword]
            keywordLocs.remove(i - windowLen)
            if not keywordLocs:
                print("'%s' fell out of window" % oldKeyword)
                numKeywordsInWindow -= 1
        word = words[i]
        print("Next word is '%s'" % word)
        if word in keywords:
            locKeyword[i] = word
            keywordLocs = allKeywordLocs[word]
            if not keywordLocs:
                print("'%s' fell in window" % word)
                numKeywordsInWindow += 1
                if numKeywordsInWindow == keywordsLen:
                    return True
            keywordLocs.add(i)
    return False

Sample output:

>>> isInProximityWindow("the brown cow jumped over the moon and the red fox jumped over the black dog", {"fox", "over", "the"}, 4) 
On beginning of iteration #0, window contains '' 
Next word is 'the' 
'the' fell in window 
On beginning of iteration #1, window contains 'the' 
Next word is 'brown' 
On beginning of iteration #2, window contains 'the' 
Next word is 'cow' 
On beginning of iteration #3, window contains 'the' 
Next word is 'jumped' 
On beginning of iteration #4, window contains 'the' 
'the' fell out of window 
Next word is 'over' 
'over' fell in window 
On beginning of iteration #5, window contains 'over' 
Next word is 'the' 
'the' fell in window 
On beginning of iteration #6, window contains 'over,the' 
Next word is 'moon' 
On beginning of iteration #7, window contains 'over,the' 
Next word is 'and' 
On beginning of iteration #8, window contains 'over,the' 
'over' fell out of window 
Next word is 'the' 
On beginning of iteration #9, window contains 'the' 
Next word is 'red' 
On beginning of iteration #10, window contains 'the' 
Next word is 'fox' 
'fox' fell in window 
On beginning of iteration #11, window contains 'fox,the' 
Next word is 'jumped' 
On beginning of iteration #12, window contains 'fox,the' 
'the' fell out of window 
Next word is 'over' 
'over' fell in window 
On beginning of iteration #13, window contains 'fox,over' 
Next word is 'the' 
'the' fell in window 
True 
1

I ran some simple benchmarks under these conditions:

  • Python 3.4 on Windows
  • 150 distinct random words, 5 to 16 characters long
  • 10 search terms, all of which must be found
  • window length of 75
  • iterating over 50 million words
  • about 514 million characters in total

Word generation:

import numpy as np

# word(i), n_distinct_words and n_words are assumed to be defined elsewhere
def generator(gen_salt):
    words = [word(i) for i in range(n_distinct_words)]
    np.random.seed(123)

    for i in range(int(n_words)):
        yield words[np.random.randint(0, n_distinct_words)]

Search code (where words = generator, search_words = set, window_len = int):

from collections import deque
from time import time

def deque_window(words, search_words, window_len):
    start = time()
    result = []
    pos = 0

    # a deque with maxlen automatically evicts the oldest word on append
    window = deque([], window_len)

    for word in words:
        window.append(word)

        if word in search_words:
            # check whether every search word is somewhere in the window
            all_found = True
            for search_word in search_words:
                if search_word not in window:
                    all_found = False
                    break

            if all_found:
                result.append(pos)

        pos += 1

    return result, time() - start

In fact, it took 31 seconds just to compute the total number of characters, and only 48 seconds to find the indices at which the window contained all the search words. I am not sure whether it is randint or the list lookup that is really slow. I would need a more efficient generator; perhaps I will store the result on disk and try reading it from there (which would be closer to your scenario).

The total number of characters was computed like this:

sum(len(w) for w in words) 
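The `search_word not in window` test above scans the whole deque once per search word. A counting variant can instead track how many distinct search words are currently inside the window, making each step O(1). This is my own variation, not part of the answer; the function name `counted_window` is hypothetical:

```python
from collections import deque, Counter

def counted_window(words, search_words, window_len):
    # counts[w] = occurrences of search word w inside the current window
    counts = Counter()
    window = deque()
    result = []
    for pos, word in enumerate(words):
        window.append(word)
        if len(window) > window_len:
            old = window.popleft()
            if old in search_words:
                counts[old] -= 1
                if counts[old] == 0:
                    del counts[old]  # keep len(counts) == distinct words present
        if word in search_words:
            counts[word] += 1
            # all search words present iff we hold a count for each of them
            if len(counts) == len(search_words):
                result.append(pos)
    return result
```

For example, `counted_window("the brown cow jumped over the moon and the red fox jumped over the black dog".split(), {"fox", "over", "the"}, 4)` returns `[13]`, matching the sliding-window answer above.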
+0

I am really interested — how could I contact you about working on a research-paper proposal? – flow

0

All you need for this is the open-source Apache Solr software.

Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene™.

Follow this link for more information. Trust me, it delivers fast results even for terabytes of data.
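For reference, the standard Lucene query parser (which Solr uses) expresses proximity with the `~` operator on a quoted phrase; the following matches documents where the two terms occur within 30 words of each other:

```
"fox dog"~30
```

Windows over more than two terms are not expressible with this operator alone; at the API level Lucene handles them with `SpanNearQuery` (a list of term clauses plus a slop value), and Solr exposes related functionality through its ComplexPhrase query parser.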
