2017-06-29
0

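Parsing a large number of XML files to JSON

I am working on a project that requires me to parse a large number of XML files into JSON. I have written the code, but it is too slow. I have looked at using lxml and BeautifulSoup, but I am not sure how to proceed.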

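I have included my code. It works the way it should, except that it is too slow. It takes about 24 hours to parse 100,000 records from a single file that is under 100 MB.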

# Read the whole file into memory as one string.
with open('productdata_29.xml', 'r') as product_data:
    read_product_data = product_data.read()

# Maps XML tag names to output keys. Its contents were not shown in the
# original post, so an empty placeholder is used here.
mapped_nodes = {}


def record_string_to_dict(record_string):
    '''This function takes a single record in string form, iterates through
    it, and stores it as a dictionary. Only the nodes present in the
    mapped_nodes dict are added to the new dict (single_record_dict). After
    each record, single_record_dict is flushed to final_list and then emptied.'''

    # Iterate through the string to find keys and values to put into
    # single_record_dict.
    while record_string != record_string[::-1]:

        try:
            k = record_string.index('<')

            l = record_string.index('>')
            temp_key = record_string[k + 1:l]
            record_string = record_string[l + 1:]
            m = record_string.index('<')
            temp_value = record_string[:m]

            # Clean the keys and values of unnecessary characters and symbols.
            if '\n' in temp_value:
                temp_value = temp_value[3:]
            if temp_key[-1] == '/':
                temp_key = temp_key[:-1]

            n = record_string.index('\n')
            record_string = record_string[n + 2:]

            # Check the mapped_nodes dict to see if the key from the record is
            # present. If it is, the key is replaced with the mapped key and
            # added to single_record_dict.
            if temp_key in mapped_nodes:
                temp_key = mapped_nodes[temp_key]
                single_record_dict[temp_key] = temp_value

        except Exception:
            break


# Shared state used by record_string_to_dict and the loop below. These initial
# values are assumed; the original post did not show them.
single_record_dict = {}
final_list = []
break_counter = 0
flush_counter = 0
file_counter = 0

while len(read_product_data) > 10:

    # Goes through read_product_data to create blocks, each of which is a
    # single record.
    i = read_product_data.index('<record>')
    j = read_product_data.index('</record>') + 8
    single_record_string = read_product_data[i:j]
    single_record_string = single_record_string[9:-10]

    # Runs the previous function on the single record string found above.
    record_string_to_dict(single_record_string)

    # Flushes single_record_dict to final_list, and empties the dict for the
    # next record.
    final_list.append(single_record_dict)
    single_record_dict = {}

    # Removes the record that was just processed.
    read_product_data = read_product_data[j:]

    # For keeping track/ease of use.
    print('Record ' + str(break_counter) + ' has been appended.')

    # Keeps track of the number of records. Once the set value is reached
    # in the if statement, the list is flushed to a new file.
    break_counter += 1
    flush_counter += 1

    if break_counter == 100 or flush_counter == break_counter:
        with open('record_list_' + str(file_counter) + '.txt', 'w') as record_list:
            record_list.write(str(final_list))

        # file_counter keeps track of how many files have been created, so the
        # next file has a different int at the end.
        file_counter += 1

        # Resets break_counter and the list of pending records.
        break_counter = 0
        final_list = []

    # For testing purposes. Causes execution to stop once the number of files
    # written matches the integer.
    if file_counter == 2:
        break

print('All records have been appended.')

Please include the input XML and the desired output JSON for a [reproducible](https://stackoverflow.com/help/mcve) example. – Parfait

Answers

2

Is there any reason why you are not considering packages such as xml2json and xmltodict? See this post for working examples: How can i convert an xml file into JSON using python?

Reproduced from the post above.

Relevant code:

xml2json

import xml2json 
s = '''<?xml version="1.0"?> 
    <note> 
     <to>Tove</to> 
     <from>Jani</from> 
     <heading>Reminder</heading> 
     <body>Don't forget me this weekend!</body> 
    </note>''' 
print xml2json.xml2json(s) 

xmltodict

import xmltodict, json 
o = xmltodict.parse('<e> <a>text</a> <a>text</a> </e>') 
json.dumps(o) # '{"e": {"a": ["text", "text"]}}' 

See this post if you are working in Python 3: https://pythonadventures.wordpress.com/2014/12/29/xml-to-dict-xml-to-json/

import json 
import xmltodict 

def convert(xml_file, xml_attribs=True):
    with open(xml_file, "rb") as f:  # notice the "rb" mode
        d = xmltodict.parse(f, xml_attribs=xml_attribs)
        return json.dumps(d, indent=4)

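I would definitely try the item_callback parameter here, so that each element is appended to the end of the JSON file as it is parsed. In fact, I am not sure the whole file can be held in memory as a dictionary. See help(xmltodict.parse) for more information. –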
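As a rough illustration of the item_callback idea in the comment above, the sketch below writes each record to a JSON Lines stream as soon as it is parsed, instead of building one large dictionary. It assumes the record elements sit directly under the root element (hence item_depth=2). The function name xml_records_to_json_lines, the sample data, and the one-object-per-line output format are hypothetical choices for illustration, not something taken from the original post.

```python
import io
import json

import xmltodict  # third-party package: pip install xmltodict


def xml_records_to_json_lines(xml_stream, out_stream, record_depth=2):
    """Stream items found at record_depth and write one JSON object per line.

    Hypothetical helper: record_depth=2 assumes records are direct children
    of the root element.
    """
    count = 0

    def handle_record(path, item):
        nonlocal count
        # item is the parsed content of one element at the requested depth.
        out_stream.write(json.dumps(item) + "\n")
        count += 1
        return True  # return True so that xmltodict keeps parsing

    xmltodict.parse(xml_stream, item_depth=record_depth,
                    item_callback=handle_record)
    return count


# Made-up sample data; the real file layout was not shown in the question.
sample = b"""<records>
  <record><name>Widget</name><price>9.99</price></record>
  <record><name>Gadget</name><price>19.50</price></record>
</records>"""

out = io.StringIO()
n = xml_records_to_json_lines(io.BytesIO(sample), out)
print(n, "records written")
print(out.getvalue())
```

For a real file, xml_stream would be a file opened in "rb" mode and out_stream a file opened for writing, so neither the input nor the output has to fit in memory at once.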

0

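You definitely don't want to parse the XML by hand. As well as the libraries others have mentioned, you could use an XSLT 3.0 processor. To go above 100 MB you would benefit from a streaming processor such as Saxon-EE, but up to that kind of size the open-source Saxon-HE should be able to cope. You haven't shown the source XML or the target JSON, so I can't give you specific code - the assumption in XSLT 3.0 is that you probably want a custom transformation rather than an off-the-shelf one, so the general idea is to write template rules that define how the different parts of the input XML should be handled.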
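This is not XSLT, but the same two ideas (let a real parser do the work, and handle one record at a time) can be sketched in Python with the standard library's xml.etree.ElementTree.iterparse. The sketch assumes each record is a record element whose children are flat key/value nodes, as the question's code suggests. The helper name iter_records, the sample data, and the contents of the mapping are hypothetical.

```python
import io
import json
import xml.etree.ElementTree as ET


def iter_records(xml_source, record_tag="record", mapped_nodes=None):
    """Yield one dict per record element found in xml_source.

    mapped_nodes optionally maps child tag names to output keys; children
    whose tag is not in the mapping are skipped.
    """
    for event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != record_tag:
            continue
        record = {}
        for child in elem:
            key = child.tag
            if mapped_nodes is not None:
                if key not in mapped_nodes:
                    continue
                key = mapped_nodes[key]
            record[key] = (child.text or "").strip()
        yield record
        elem.clear()  # drop the finished record's children to free memory


# Made-up sample data and mapping, for illustration only.
sample = b"""<products>
  <record><title>Widget</title><cost>9.99</cost><internal/></record>
  <record><title>Gadget</title><cost>19.50</cost></record>
</products>"""

mapping = {"title": "name", "cost": "price"}
records = list(iter_records(io.BytesIO(sample), mapped_nodes=mapping))
print(json.dumps(records, indent=4))
```

Because the parser scans the input once, the running time grows roughly linearly with file size, whereas the question's approach re-slices the remaining string after every record.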