Creating a dictionary from a list of lists

I want to build an inverted index, i.e. map each piece of text to the document it came from, and to its position within the list/document.

In my case, I have already parsed the input into a list of lists.

My input looks like this.

 [ 
     ['why', 'was', 'cinderella', 'late', 'for', 'the', 'ball', 'she', 'forgot', 'to', 'swing', 'the', 'bat'], 
     ['why', 'is', 'the', 'little', 'duck', 'always', 'so', 'sad', 'because', 'he', 'always', 'sees', 'a', 'bill', 'in', 'front', 'of', 'his', 'face'], 
     ['what', 'has', 'four', 'legs', 'and', 'goes', 'booo', 'a', 'cow', 'with', 'a', 'cold'], 
     ['what', 'is', 'a', 'caterpillar', 'afraid', 'of', 'a', 'dogerpillar'], 
     ['what', 'did', 'the', 'crop', 'say', 'to', 'the', 'farmer', 'why', 'are', 'you', 'always', 'picking', 'on', 'me'] 
     ]

This is my code

def create_inverted(mylists):
    myDict = {}
    for sublist in mylists:
        for i in range(len(sublist)):
            if sublist[i] in myDict:
                myDict[sublist[i]].append(i)
            else:
                myDict[sublist[i]] = [i]

    return myDict

It does build a dictionary, but when I run a search I don't get the correct results. I am trying to do something like this.

documents = [['owl', 'lion'], ['lion', 'deer'], ['owl', 'leopard']] 

index = {'owl': [0, 2],
         'lion': [0, 1],  # IDs are sorted.
         'deer': [1],
         'leopard': [2]}

def indexed_search(documents, index, query):
    return [documents[doc_id] for doc_id in index[query]]

print(indexed_search(documents, index, 'lion'))

That is, I can enter the search text and get back the IDs of the lists that contain it.

Any ideas?

2014-01-30 user3247054

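Do you need to store information about which document each word came from? You are only storing information about the position within the document. – user2357112

Yes. So that when I search, I get the lists that contain the text. http://stackoverflow.com/questions/17554977/inverted-index-in-python-not-returning-desired-results – user3247054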

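You are mapping each word to the positions where it was found in each document, rather than to the documents it was found in. You should store indexes into the list of documents rather than indexes into the documents themselves, or just map words directly to documents instead of to indexes: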

def create_inverted_index(documents):
    index = {}
    for i, document in enumerate(documents):
        for word in set(document):
            if word in index:
                index[word].append(i)
            else:
                index[word] = [i]
    return index

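Most of this is the same as your code. The main difference is in these two lines:

for i, document in enumerate(documents):
    for word in set(document):

which correspond to the following part of your code:

for sublist in mylists:
    for i in range(len(sublist)):

enumerate iterates over both the indexes and the elements of a sequence. Since enumerate is on the outer loop, i in my code is the index of a document, whereas i in your code is the index of a word within a document.

set(document) creates a set of the words in the document, in which each word appears only once. This ensures that each word is counted only once per document, rather than having 2 appear 10 times in the list for 'Cheetos' if 'Cheetos' appears 10 times in document 2.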
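To see the effect of set(document) end to end, here is a minimal sketch that pairs the function above with the search function from the question. The repeated 'lion' in the second document and the use of index.get for unknown words are assumptions added for illustration, not part of the original code.

```python
def create_inverted_index(documents):
    """Map each word to the list of indexes of the documents containing it."""
    index = {}
    for i, document in enumerate(documents):
        # set(document) visits each distinct word once per document
        for word in set(document):
            if word in index:
                index[word].append(i)
            else:
                index[word] = [i]
    return index


def indexed_search(documents, index, query):
    """Return the documents containing query (assumption: [] if unseen)."""
    return [documents[doc_id] for doc_id in index.get(query, [])]


# 'lion' is deliberately repeated in document 1 to exercise the set() step
documents = [['owl', 'lion'], ['lion', 'deer', 'lion'], ['owl', 'leopard']]
index = create_inverted_index(documents)

print(index['lion'])                              # [0, 1]
print(indexed_search(documents, index, 'lion'))
print(indexed_search(documents, index, 'zebra'))  # []
```

Because the outer loop walks the documents in order, each list of IDs comes out already sorted.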

2014-01-30 01:52:48 user2357112

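btw, I think 'index.setdefault(word, []).append(i)' is better than the if-else – Elisha

@Elisha: If I were writing the program myself, I would use 'defaultdict(list)'. I decided to stay close to what the OP had. – user2357112

@user2357112, 'defaultdict(set)' is probably the best way to avoid duplicates. See my answer –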
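For reference, the two variants mentioned in these comments could look roughly like this. The function names index_with_setdefault and index_with_defaultdict are made up for this sketch; both are meant to behave like the if-else version in the answer.

```python
from collections import defaultdict


def index_with_setdefault(documents):
    """Variant using dict.setdefault in place of the if-else."""
    index = {}
    for i, document in enumerate(documents):
        for word in set(document):
            # setdefault returns the existing list, or stores and returns []
            index.setdefault(word, []).append(i)
    return index


def index_with_defaultdict(documents):
    """Variant using defaultdict(list); missing keys start as []."""
    index = defaultdict(list)
    for i, document in enumerate(documents):
        for word in set(document):
            index[word].append(i)
    return dict(index)  # hand back a plain dict


documents = [['owl', 'lion'], ['lion', 'deer'], ['owl', 'leopard']]
print(index_with_setdefault(documents) == index_with_defaultdict(documents))  # True
```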

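First I would extract all possible words and store them in a set. Then I look up each word in each list and collect the indexes of all the lists the word happens to be in.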

source = [ 
['why', 'was', 'cinderella', 'late', 'for', 'the', 'ball', 'she', 'forgot', 'to', 'swing', 'the', 'bat'], 
['why', 'is', 'the', 'little', 'duck', 'always', 'so', 'sad', 'because', 'he', 'always', 'sees', 'a', 'bill', 'in', 'front', 'of', 'his', 'face'], 
['what', 'has', 'four', 'legs', 'and', 'goes', 'booo', 'a', 'cow', 'with', 'a', 'cold'], 
['what', 'is', 'a', 'caterpillar', 'afraid', 'of', 'a', 'dogerpillar'], 
['what', 'did', 'the', 'crop', 'say', 'to', 'the', 'farmer', 'why', 'are', 'you', 'always', 'picking', 'on', 'me'] 
] 

allWords = set(word for lst in source for word in lst) 

wordDict = { word: [ 
        i for i, lst in enumerate(source) if word in lst 
        ] for word in allWords } 

print wordDict 
Out[30]: 
{'a': [1, 2, 3], 
'afraid': [3], 
'always': [1, 4], 
'and': [2], 
...
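One thing to keep in mind is that `word in lst` scans a list element by element. A possible tweak, sketched here with a shortened input and the made-up names doc_sets, all_words and word_dict, is to convert each document to a set first so the membership test is a hash lookup. It still visits every document for every word, so treat it as a small speed-up rather than a change of approach.

```python
# Shortened sample input, for illustration only
source = [
    ['why', 'was', 'cinderella', 'late', 'for', 'the', 'ball'],
    ['why', 'is', 'the', 'little', 'duck', 'always', 'so', 'sad'],
    ['what', 'has', 'four', 'legs', 'and', 'goes', 'booo'],
]

doc_sets = [set(lst) for lst in source]   # one set of words per document
all_words = set().union(*doc_sets)        # every distinct word

# For each word, collect the indexes of the documents whose set contains it
word_dict = {
    word: [i for i, words in enumerate(doc_sets) if word in words]
    for word in all_words
}

print(word_dict['why'])    # [0, 1]
print(word_dict['what'])   # [2]
```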

2014-01-30 01:54:40 koffein

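Interesting approach, but it does mean you are scanning all of the documents for every word, so it won't scale well. –

Thanks for your answer. – user3247054

@gnibbler Yes, you are right, it won't scale well, but the code reads like the problem you want to solve: I want all the words, and I want to know where I can find them... that is what the two lines say... – koffein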

I would accumulate the indexes into a set to avoid duplicates, and then sort them

>>> documents = [['owl', 'lion'], ['lion', 'deer'], ['owl', 'leopard']] 
>>> from collections import defaultdict 
>>> D = defaultdict(set) 
>>> for i, doc in enumerate(documents):
...     for word in doc:
...         D[word].add(i)
... 
>>> D ## Take a look at the defaultdict 
defaultdict(<class 'set'>, {'owl': {0, 2}, 'leopard': {2}, 'lion': {0, 1}, 'deer': {1}}) 
>>> {k:sorted(v) for k,v in D.items()} 
{'lion': [0, 1], 'owl': [0, 2], 'leopard': [2], 'deer': [1]}
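Wrapped up as a function, the same idea might look like the sketch below (build_index is a name invented here). Returning a plain dict of sorted lists also sidesteps one defaultdict quirk: looking up a missing word with D[word] would silently insert an empty set for it.

```python
from collections import defaultdict


def build_index(documents):
    """Collect document IDs per word in sets, then freeze to sorted lists."""
    postings = defaultdict(set)
    for doc_id, doc in enumerate(documents):
        for word in doc:
            postings[word].add(doc_id)   # sets ignore repeated IDs
    # Plain dict, so lookups of unknown words do not create new entries
    return {word: sorted(ids) for word, ids in postings.items()}


documents = [['owl', 'lion'], ['lion', 'deer'], ['owl', 'leopard']]
index = build_index(documents)

print(index['lion'])            # [0, 1]
print(index.get('zebra', []))   # []
print('zebra' in index)         # False
```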

2014-01-30 01:55:25

Thanks for your answer. – user3247054

This is simple, as long as you don't need efficient code:

documents = [['owl', 'lion'], ['lion', 'deer'], ['owl', 'leopard']] 

def index(docs):
    doc_index = {}
    for doc_id, doc in enumerate(docs, 1):
        for term_pos, term in enumerate(doc, 1):
            doc_index.setdefault(term, {}).setdefault(doc_id, []).append(term_pos)
    return doc_index

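Now you get a two-level dictionary that gives you access to the document IDs for a term, and then to the positions of the term within each of those documents: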

>>> index(documents) 
{'lion': {1: [2], 2: [1]}, 'leopard': {3: [2]}, 'deer': {2: [2]}, 'owl': {1: [1], 3: [1]}}

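This is only a preliminary step for indexing; after that, you need to separate the term dictionary from the document postings and the positional postings. Typically, the dictionary is stored in a tree-like structure (there are Python packages for this), and the document postings and positional postings are represented as arrays of unsigned integers.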
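As a rough illustration of that separation, and not a production layout, the sketch below keeps the terms in a sorted list searched with bisect (a stand-in for a tree-like structure) and stores postings in array('I') arrays of unsigned integers from the standard library. The names build_postings and lookup, and the parallel-list layout, are assumptions made for this example.

```python
from array import array
from bisect import bisect_left


def build_postings(docs):
    """Split a nested term -> doc -> positions index into three structures."""
    nested = {}
    for doc_id, doc in enumerate(docs, 1):
        for term_pos, term in enumerate(doc, 1):
            nested.setdefault(term, {}).setdefault(doc_id, []).append(term_pos)

    terms = sorted(nested)   # term dictionary (sorted list instead of a tree)
    doc_postings = []        # doc_postings[k]: doc IDs for terms[k]
    pos_postings = []        # pos_postings[k][j]: positions in the j-th doc
    for term in terms:
        by_doc = nested[term]
        doc_ids = sorted(by_doc)
        doc_postings.append(array('I', doc_ids))
        pos_postings.append([array('I', by_doc[d]) for d in doc_ids])
    return terms, doc_postings, pos_postings


def lookup(terms, doc_postings, pos_postings, term):
    """Return {doc_id: [positions]} for term, or {} if the term is unknown."""
    k = bisect_left(terms, term)   # binary search in the term dictionary
    if k == len(terms) or terms[k] != term:
        return {}
    return {d: list(p) for d, p in zip(doc_postings[k], pos_postings[k])}


documents = [['owl', 'lion'], ['lion', 'deer'], ['owl', 'leopard']]
terms, doc_postings, pos_postings = build_postings(documents)
print(lookup(terms, doc_postings, pos_postings, 'lion'))   # {1: [2], 2: [1]}
print(lookup(terms, doc_postings, pos_postings, 'zebra'))  # {}
```

Document IDs and positions start at 1 here, to match the index function above.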

2014-01-30 02:01:35 michaelmeyer

Thanks for your answer. – user3247054
