Best way to parse through scraped data

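I've managed to scrape a large amount of data with scrapy, and all of it is currently stored as JSON objects in MongoDB. I'm mainly wondering how to parse and make sense of the data efficiently. I want to extract the data into subsections. For example, suppose I have this stored as data: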

{ 
    "data": "category 1: test test \n category2: test test \n test test \n category 3: test test test \n category 4: this is data in category 4 " 
}

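Basically I want to extract by keyword: everything that comes after a keyword, up to the next keyword. All of the information after category 1 ("test test") should be stored under "category 1". There is no real rhyme or reason to the order of the categories, nor to the amount of text after each category, but all of the categories are there.

I'd like to know whether there are any libraries I could use to write a script to do this, or any tools that could do it for me automatically. Failing that, a pointer to a resource where I can learn how to do this kind of thing would help.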


2016-03-09 Jason

-2

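This sounds like a specific enough task that you will probably need to do another pass of data processing. pymongo is my library of choice for interacting with data in Mongo databases from Python (and it is the one MongoDB itself recommends).

For parsing the string itself, read up on regular expressions, in particular the .findall method: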

>>> import re 
>>> data_string = "category 1: test test \n category2: test test \n test test \n category 3: test test test \n category 4: this is data in category 4 " 
>>> m = re.findall(r'(category\s*\d+): (.*)', data_string) 
>>> m 
[('category 1', 'test test '), ('category2', 'test test '), ('category 3', 'test test test '), ('category 4', 'this is data in category 4 ')]
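
Note that `.` does not match a newline by default, so the pattern above drops the second `test test` line that belongs to category2. A minimal sketch of one way to keep multi-line values, assuming every section header looks like `category <number>:`, is to split on the headers with `re.split` and a capturing group. The helper name `split_sections` is made up for illustration, and any body text that happens to contain something like `category 5:` would be treated as a new header.

```python
import re

def split_sections(text):
    """Map each 'category N' header to all text up to the next header."""
    # A capturing group makes re.split keep the headers in the result:
    # [preamble, header1, body1, header2, body2, ...]
    parts = re.split(r'(category\s*\d+):', text)
    headers = parts[1::2]
    bodies = parts[2::2]
    # Collapse runs of whitespace and newlines so each value is a single line.
    return {h: ' '.join(b.split()) for h, b in zip(headers, bodies)}

data_string = "category 1: test test \n category2: test test \n test test \n category 3: test test test \n category 4: this is data in category 4 "
sections = split_sections(data_string)
print(sections)
```

Here `sections['category2']` holds both `test test` lines joined together, and the trailing "category 4" inside the last value is not mistaken for a header because it is not followed by a colon.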


2016-03-09 22:40:02 user2926055

The OP is asking how to parse his string. The fact that it is stored in MongoDB is beside the point. –

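Thanks for the suggestion! I agree that I will probably need to make multiple passes. I'm not too worried about the database part, since it isn't that relevant to my question of how to actually parse this data. I feel like I could do this in a really dumb way (probably inefficient, and unable to handle all of the data), but I was wondering whether there is a better way to do it. – Jason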

Edited the answer to include a link to `re.findall` – user2926055

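I would create a list of keywords and then start by finding the index of each keyword within the data, if it is present. (I rearranged the order in which the keywords appear in the data to demonstrate a point later.)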

d = {"data": "category 1: test test \n category 3: test test test \n category2: test test \n test test \n category 4: this is data in category 4 "}
keywords = ['category 1', 'category2', 'category 3', 'category 4']
kw_indices = [-1] * len(keywords)
data = d['data']

# Record where each keyword starts in the data; -1 means "not found".
for i, kw in enumerate(keywords):
    if kw in data:
        kw_indices[i] = data.index(kw)

kw_indices_sorted = sorted(kw_indices)

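The start position of each keyword found in the data is given by its value in kw_indices, where -1 means that the keyword was not found in the data.

To find the end index for each keyword, we look up the next larger start index in kw_indices_sorted. That value is where the following keyword begins, so the current keyword's text ends just before it.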

data_by_category = {}
for j, kw in enumerate(keywords):
    if kw_indices[j] > -1:
        # The keyword was found in the data and we know where in the string it starts
        kw_start = kw_indices[j]
        sorted_index = kw_indices_sorted.index(kw_start)
        if sorted_index < len(kw_indices_sorted) - 1:
            # This index is not the last/largest value in the list of sorted indices,
            # so there will be a next value.
            next_kw_start = kw_indices_sorted[sorted_index + 1]
            kw_data = data[kw_start:next_kw_start]
        else:
            kw_data = data[kw_start:]

        # If you don't want the keyword included in the result you can strip it out here
        kw_data = kw_data.replace(kw + ':', '')
        data_by_category[kw] = kw_data
    else:
        # The keyword was not found in the data; enter an empty value for it or handle this
        # however else you want.
        data_by_category[kw] = ''

print(data_by_category)

{'category 1': ' test test \n ', 'category2': ' test test \n test test \n ', 'category 3': ' test test test \n ', 'category 4': ' this is data in category 4 '}
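
The same idea can be written more compactly. The following is only a sketch under the same assumptions as above (a known keyword list, each keyword followed by a colon, and the first occurrence of a keyword being its header). The function name `split_by_keywords` is hypothetical. Unlike the code above, it also trims the surrounding whitespace from each value.

```python
def split_by_keywords(text, keywords):
    """Return {keyword: text between that keyword and the next keyword found}."""
    # Pair each keyword that is present with its first position, ordered by position.
    found = sorted((text.find(kw), kw) for kw in keywords if kw in text)
    result = {kw: '' for kw in keywords}  # keywords that are absent map to ''
    for n, (start, kw) in enumerate(found):
        # A section ends where the next found keyword starts, or at the end of the text.
        end = found[n + 1][0] if n + 1 < len(found) else len(text)
        body = text[start + len(kw):end]
        # Drop the colon after the keyword, then the surrounding whitespace.
        result[kw] = body.lstrip(':').strip()
    return result

data = "category 1: test test \n category 3: test test test \n category2: test test \n test test \n category 4: this is data in category 4 "
keywords = ['category 1', 'category2', 'category 3', 'category 4', 'category 5']
by_category = split_by_keywords(data, keywords)
print(by_category)
```

One caveat with plain substring search: a keyword such as `category 1` would also match the start of `category 10`, so with larger keyword sets a regular expression with explicit boundaries may be the safer choice.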


2016-03-09 23:36:20
