Extracting text from a .doc file in Python

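I am trying to extract text from a .doc file. The text does get extracted, but the problem is that it always comes with output like this:

ࡱ > n ln characters.

Here is my code: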

    doc = open(input_file, 'r')
    read_text_file = doc.readline()
    doc_text = ""
    for line in read_text_file:
        doc_text += str(line)

    return doc_text

Is there a way to strip these characters out, or to re-encode the text as UTF-8?

Source

2014-02-10 Bazinga

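A '.doc' file is probably a proprietary Microsoft Word file. You can't read it the way you would read a plain text file. – 2014-02-10 09:39:56

Can you open them in Word and save them as .txt files? –

@tk, I haven't tried that yet. Is it safe? What if the user doesn't have the Word application? – Bazinga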

A docx file is just a zip file (try running unzip on it!) that contains a set of well-defined XML files and supporting files.

import zipfile
from lxml import etree

def get_word_xml(docx_file_name):
    """Return the raw XML of the main document part of a .docx file."""
    # ZipFile opens the path itself, in binary mode.
    with zipfile.ZipFile(docx_file_name) as zf:
        xml_content = zf.read('word/document.xml')
    return xml_content

# Parse the string containing XML into a usable tree.
def get_xml_tree(xml_string):
    return etree.fromstring(xml_string)

# lxml has functions for traversing the XML tree, but I used iter() instead; it
# traverses every node below a starting node "my_etree", and this generator
# yields every text node together with the text it contains.
def _itertext(my_etree):
    """Go through the XML tree and yield (node, text) for each text node."""
    for node in my_etree.iter(tag=etree.Element):
        if _check_element_is(node, 't'):
            yield (node, node.text)

def _check_element_is(element, type_char):
    word_schema = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    return element.tag == '{%s}%s' % (word_schema, type_char)

word_filename = 'example.docx'  # placeholder path, replace with your own file
xml_from_file = get_word_xml(word_filename)
xml_tree = get_xml_tree(xml_from_file)
for node, txt in _itertext(xml_tree):
    print(txt)

Find more here

Source

2014-02-10 10:16:21 Olu

What I need to read is the .doc format, not .docx. – Bazinga

Then try this link for details: http://www.decalage.info/python/olefileio – Olu
