2013-07-22 149 views
1

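Creating an XML file in Python with a for loop

I have a txt file with more than 100,000 lines, and I want to build an XML tree from it. All of the lines should share the same root element.

Here is the txt file: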

LIBRARY: 
1,1,1,1,the 
1,2,1,1,world 
2,1,1,2,we 
2,5,2,1,have 
7,3,1,1,food 

Desired output:

<LIBRARY> 
    <BOOK ID ="1"> 
     <CHAPTER ID ="1"> 
      <SENT ID ="1"> 
       <WORD ID ="1">the</WORD> 
      </SENT> 
     </CHAPTER> 
    </BOOK> 
    <BOOK ID ="1"> 
     <CHAPTER ID ="2"> 
      <SENT ID ="1"> 
       <WORD ID ="1">world</WORD> 
      </SENT> 
     </CHAPTER> 
    </BOOK> 
    <BOOK ID ="2"> 
     <CHAPTER ID ="1"> 
      <SENT ID ="1"> 
       <WORD ID ="2">we</WORD> 
      </SENT> 
     </CHAPTER> 
    </BOOK> 
    <BOOK ID ="2"> 
     <CHAPTER ID ="5"> 
      <SENT ID ="2"> 
       <WORD ID ="1">have</WORD> 
      </SENT> 
     </CHAPTER> 
    </BOOK> 
    <BOOK ID ="7"> 
     <CHAPTER ID ="3"> 
      <SENT ID ="1"> 
       <WORD ID ="1">food</WORD> 
      </SENT> 
     </CHAPTER> 
    </BOOK> 
</LIBRARY> 

I am using ElementTree to convert the txt file to an XML file. This is the code I run:

def expantree(): 
    lines = txtfile.readlines() 
    for line in lines: 
        split_line = line.split(',') 
        BOOK.set('ID ', split_line[0]) 
        CHAPTER.set('ID ', split_line[1]) 
        SENTENCE.set('ID ', split_line[2]) 
        WORD.set('ID ', split_line[3]) 
        WORD.text = split_line[4] 
        tree = ET.ElementTree(Root) 
        tree.write(xmlfile) 

The code runs, but I don't get the desired output. Instead I get the following:

<LIBRARY> 
    <BOOK ID ="1"> 
     <CHAPTER ID ="1"> 
      <SENT ID ="1"> 
       <WORD ID ="1">the</WORD> 
      </SENT> 
     </CHAPTER> 
    </BOOK> 
</LIBRARY> 
<LIBRARY> 
    <BOOK ID ="1"> 
     <CHAPTER ID ="2"> 
      <SENT ID ="1"> 
       <WORD ID ="1">world</WORD> 
      </SENT> 
     </CHAPTER> 
    </BOOK> 
</LIBRARY> 
<LIBRARY> 
    <BOOK ID ="2"> 
     <CHAPTER ID ="1"> 
      <SENT ID ="1"> 
       <WORD ID ="2">we</WORD> 
      </SENT> 
     </CHAPTER> 
    </BOOK> 
</LIBRARY> 
<LIBRARY> 
    <BOOK ID ="2"> 
     <CHAPTER ID ="5"> 
      <SENT ID ="2"> 
       <WORD ID ="1">have</WORD> 
      </SENT> 
     </CHAPTER> 
    </BOOK> 
</LIBRARY> 
<LIBRARY> 
    <BOOK ID ="7"> 
     <CHAPTER ID ="3"> 
      <SENT ID ="1"> 
       <WORD ID ="1">food</WORD> 
      </SENT> 
     </CHAPTER> 
    </BOOK> 
</LIBRARY> 

How can I unify the root so that I get a single root tag instead of many?

Answers

0

One approach is to build the complete tree and then write it out. I used the following code:

from lxml import etree as ET 

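def create_library(lines): 
    library = ET.Element('LIBRARY') 
    for line in lines: 
        # strip() removes the trailing newline so it does not end up in the WORD text 
        split_line = line.strip().split(',') 
        # Skip the 'LIBRARY:' header and any blank or malformed lines 
        if len(split_line) != 5: 
            continue 
        library.append(create_book(split_line)) 
    return library 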

def create_book(split_line): 
    book = ET.Element('BOOK',ID=split_line[0]) 
    book.append(create_chapter(split_line)) 
    return book 

def create_chapter(split_line): 
    chapter = ET.Element('CHAPTER',ID=split_line[1]) 
    chapter.append(create_sentence(split_line)) 
    return chapter 

def create_sentence(split_line): 
    sentence = ET.Element('SENT',ID=split_line[2]) 
    sentence.append(create_word(split_line)) 
    return sentence 

def create_word(split_line): 
    word = ET.Element('WORD',ID=split_line[3]) 
    word.text = split_line[4] 
    return word 

Your code to create the file would then look like:

def expantree(): 
    lines = txtfile.readlines() 
    library = create_library(lines) 
    ET.ElementTree(library).write(xmlfile) 

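If you don't want to hold the whole tree in memory (you mentioned more than 100,000 lines), you can write the root tags by hand and write one book at a time between them. In that case your code would look like:

def expantree(): 
    # Binary mode, because lxml's ET.tostring() returns bytes by default 
    with open(xmlfile, 'wb') as f: 
        f.write(b'<LIBRARY>') 
        for line in txtfile:  # iterate lazily instead of calling readlines() 
            split_line = line.strip().split(',') 
            if len(split_line) != 5:  # skip the 'LIBRARY:' header and blank lines 
                continue 
            f.write(ET.tostring(create_book(split_line))) 
        f.write(b'</LIBRARY>') 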

I don't have much experience with lxml, so there may be more elegant solutions, but both of these work.
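
As a hedged illustration of the second approach, the sketch below uses only the standard library's xml.etree.ElementTree rather than lxml. That swap is an assumption made here so the example runs without third-party packages, and the names build_book and stream_library are hypothetical. It writes the root tags by hand and serializes one BOOK element per data line, so only one small subtree is held in memory at a time.

```python
import io
import xml.etree.ElementTree as ET

TAGS = ('BOOK', 'CHAPTER', 'SENT', 'WORD')

def build_book(fields):
    """Build one BOOK/CHAPTER/SENT/WORD chain from five fields (hypothetical helper)."""
    book = ET.Element(TAGS[0], ID=fields[0])
    node = book
    for tag, value in zip(TAGS[1:], fields[1:4]):
        node = ET.SubElement(node, tag, ID=value)
    node.text = fields[4]  # the innermost element (WORD) carries the text
    return book

def stream_library(lines, out):
    """Write a single <LIBRARY> root to `out`, one BOOK element per data line."""
    out.write('<LIBRARY>')
    for line in lines:
        fields = line.strip().split(',')
        if len(fields) != 5:  # skips the 'LIBRARY:' header and blank lines
            continue
        out.write(ET.tostring(build_book(fields), encoding='unicode'))
    out.write('</LIBRARY>')

# In-memory stand-ins for the real input and output files
sample = io.StringIO('LIBRARY:\n1,1,1,1,the\n1,2,1,1,world\n')
result = io.StringIO()
stream_library(sample, result)
xml_text = result.getvalue()
print(xml_text)
```

With real files, `lines` can be the open input file itself, which is read line by line, and `out` a file opened in text mode. Note that a word containing a comma would produce more than five fields and be skipped by this sketch.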

+0

Thanks, your answer was very helpful –

+0

Glad I could help. –

1

Another option, perhaps a more concise one, is the following:

from xml.etree import ElementTree as ET 
import io 
import os 

# Setup the test input 
inbuf = io.StringIO(''.join(['LIBRARY:\n', '1,1,1,1,the\n', '1,2,1,1,world\n', 
          '2,1,1,2,we\n', '2,5,2,1,have\n', '7,3,1,1,food\n'])) 

tags = ['BOOK', 'CHAPTER', 'SENT', 'WORD'] 
with inbuf as into, io.StringIO() as xmlfile: 
    root_name = into.readline() 
    root = ET.ElementTree(ET.Element(root_name.rstrip(':\n'))) 
    re = root.getroot() 
    for line in into: 
        values = line.split(',') 
        parent = re 
        for i, v in enumerate(values[:4]): 
            parent = ET.SubElement(parent, tags[i], {'ID': v}) 
            if i == 3: 
                parent.text = values[4].rstrip(':\n') 
    root.write(xmlfile, encoding='unicode', xml_declaration=True) 
    xmlfile.seek(0, os.SEEK_SET) 
    for line in xmlfile: 
        print(line) 

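This code builds an ElementTree from the input data and writes it as XML to a file-like object. It can be used with either the standard Python xml.etree package or lxml. The code was tested with Python 3.3.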

1

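Here is a suggestion that uses lxml (tested with Python 2.7). The code can easily be adapted to ElementTree, but it is harder to get pretty-printed output that way (see https://stackoverflow.com/a/16377996/407651).

The input file is library.txt and the output file is library.xml.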

from lxml import etree 

lines = open("library.txt").readlines() 
library = etree.Element('LIBRARY') # The root element 

# For each line with data in the input file, create a BOOK/CHAPTER/SENT/WORD structure 
for line in lines: 
    values = line.split(',') 
    if len(values) == 5: 
        book = etree.SubElement(library, "BOOK") 
        book.set("ID", values[0]) 
        chapter = etree.SubElement(book, "CHAPTER") 
        chapter.set("ID", values[1]) 
        sent = etree.SubElement(chapter, "SENT") 
        sent.set("ID", values[2]) 
        word = etree.SubElement(sent, "WORD") 
        word.set("ID", values[3]) 
        word.text = values[4].strip() 

etree.ElementTree(library).write("library.xml", pretty_print=True) 
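
For readers adapting this to the standard library: since Python 3.9, xml.etree.ElementTree provides ET.indent(), which makes pretty-printing straightforward there as well. The following is a minimal sketch, assuming Python 3.9 or later; the sample lines are inlined as a stand-in for the contents of library.txt, and the result is serialized to a string instead of being written to library.xml.

```python
import xml.etree.ElementTree as ET

# Stand-in for open("library.txt").readlines()
lines = ['LIBRARY:\n', '1,1,1,1,the\n', '2,5,2,1,have\n']
library = ET.Element('LIBRARY')  # The root element

for line in lines:
    values = line.strip().split(',')
    if len(values) == 5:  # ignores the 'LIBRARY:' header line
        book = ET.SubElement(library, 'BOOK')
        book.set('ID', values[0])
        chapter = ET.SubElement(book, 'CHAPTER')
        chapter.set('ID', values[1])
        sent = ET.SubElement(chapter, 'SENT')
        sent.set('ID', values[2])
        word = ET.SubElement(sent, 'WORD')
        word.set('ID', values[3])
        word.text = values[4]

ET.indent(library, space='    ')  # in-place indentation; requires Python 3.9+
pretty = ET.tostring(library, encoding='unicode')
print(pretty)
```

To write a file instead, ET.ElementTree(library).write("library.xml") can be called after ET.indent().
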
+1

I upvoted, but since SubElement lets you set attributes directly, as in 'book = etree.SubElement(library, 'BOOK', ID=values[0])', the set() calls can be eliminated. – tdelaney
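
The suggestion in this comment can be sketched as follows. To keep the example runnable without third-party packages it uses the standard library's xml.etree.ElementTree, whose SubElement also accepts attributes as keyword arguments; the `lines` list is a stand-in for the contents of library.txt.

```python
import xml.etree.ElementTree as ET

# Stand-in for the lines read from library.txt
lines = ['LIBRARY:\n', '1,1,1,1,the\n', '7,3,1,1,food\n']
library = ET.Element('LIBRARY')

for line in lines:
    values = line.strip().split(',')
    if len(values) == 5:  # ignores the 'LIBRARY:' header line
        # Attributes are passed as keyword arguments, so no set() calls are needed
        book = ET.SubElement(library, 'BOOK', ID=values[0])
        chapter = ET.SubElement(book, 'CHAPTER', ID=values[1])
        sent = ET.SubElement(chapter, 'SENT', ID=values[2])
        word = ET.SubElement(sent, 'WORD', ID=values[3])
        word.text = values[4]

compact = ET.tostring(library, encoding='unicode')
print(compact)
```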