Importing a complex .docx file as .xml and extracting chapters

--update-- Maybe someone can suggest another way to split a .docx document into chapters, or to import a .docx into R.

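First of all, I would like to thank this awesome forum. I have found solutions here for several problems I ran into, but this time I have not found anything...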

I have a complex .docx file that contains an index (table of contents) and has been saved in .xml format. I tried:

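library(XML)

# Parse the Word XML export; HUGE relaxes the parser's hard-coded size limits
xmlfile <- xmlParse("C:/Users/Documents/stihl.xml", options = HUGE)
topxml <- xmlRoot(xmlfile)

# Collect the text value of every grandchild of the root node
values <- xmlSApply(topxml, function(x) xmlSApply(x, xmlValue))
# The original call passed an undefined object `node`, which raises an error
xml_df <- data.frame(t(values), row.names = NULL)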

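as well as other ways of reading the XML file. My .docx file has an index, and I now want to extract the content that belongs to several index entries. As an example, take this .docx: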

1. Introduction 
    This is an introduction importing XML by R. 
2. UserGuide 
    Userguides are often helpful. 
2.1 Style 
    The style should be always the same. 
2.2 Language 
    I hope my Language is readable, because I'm contacting you from Germany.

As a result, it would be nice to get the content of the separate chapters, for example stored in a vector.

result 
[1]This is an introduction importing XML by R. 
[2]Userguides are often helpful. 
[3]The style should be always the same. 
[4]I hope my Language is readable, because I'm contacting you from Germany.
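One way to get from paragraphs to a per-chapter vector is to mark the heading paragraphs and use a running count of headings as the chapter id. The following is only a minimal sketch with xml2: the inline document is a hypothetical, heavily simplified stand-in for word/document.xml, and the style names "Heading1" and "Heading2" are assumptions (a German template will most likely use different names).

```r
library(xml2)

# Hypothetical, simplified document body; style names are assumptions
doc <- read_xml('
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Introduction</w:t></w:r></w:p>
    <w:p><w:r><w:t>This is an introduction importing XML by R.</w:t></w:r></w:p>
    <w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>UserGuide</w:t></w:r></w:p>
    <w:p><w:r><w:t>Userguides are often helpful.</w:t></w:r></w:p>
    <w:p><w:pPr><w:pStyle w:val="Heading2"/></w:pPr><w:r><w:t>Style</w:t></w:r></w:p>
    <w:p><w:r><w:t>The style should be always the same.</w:t></w:r></w:p>
  </w:body>
</w:document>')

paragraphs <- xml_find_all(doc, "//w:body/w:p")

# Style name of each paragraph (NA when the paragraph has no w:pStyle)
style <- xml_attr(xml_find_first(paragraphs, "./w:pPr/w:pStyle"), "val")

# Paragraph text: join all w:t runs, ignoring field codes such as w:instrText
text <- vapply(
  paragraphs,
  function(p) paste(xml_text(xml_find_all(p, ".//w:t")), collapse = ""),
  character(1)
)

# A running count of headings assigns every paragraph to a chapter
is_heading <- !is.na(style) & grepl("^Heading", style)
chapter_id <- cumsum(is_heading)

# Keep body paragraphs that follow at least one heading
keep <- !is_heading & chapter_id > 0

# One string per chapter, named after its heading
result <- vapply(split(text[keep], chapter_id[keep]), paste, character(1), collapse = " ")
names(result) <- text[is_heading][as.integer(names(result))]
```

Here result is a named character vector with one element per chapter, so the text stays linked to its heading. Paragraphs that appear before the first heading are dropped by the keep filter.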

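Maybe there are other ways to keep the structure, but I mention the XML import because its tree structure seemed to be the simplest approach.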

<?xml version="1.0" encoding="UTF-8" standalone="yes"?> 
<?mso-application progid="Word.Document"?> 
<pkg:package xmlns:pkg="http://schemas.microsoft.com/office/2006/xmlPackage"> 

    <pkg:part 
    pkg:name="/_rels/.rels" 
    pkg:contentType="application/vnd.openxmlformats-package.relationships+xml" 
    pkg:padding="512"> 
    <pkg:xmlData> 
     <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships"> 
      <Relationship 
      Id="rId3" 
      Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/extended-properties" 
      Target="docProps/app.xml"/> 
      <Relationship 
      Id="rId2" 
      Type="http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties" 
      Target="docProps/core.xml"/> 
      <Relationship Id="rId1" 
      Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" 
      Target="word/document.xml"/> 
     </Relationships> 
    </pkg:xmlData> 
    </pkg:part> 

    <pkg:part 
    #several relationships 
    </pkg:part> 

    <pkg:part 
    pkg:name="/word/document.xml" 
    pkg:contentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"> 
    <pkg:xmlData> 

     <w:document mc:Ignorable="w14 w15 wp14" 
    xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" 
    xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" 
    xmlns:o="urn:schemas-microsoft-com:office:office" 
    xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" 
    xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" 
    xmlns:v="urn:schemas-microsoft-com:vml" 
    xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" 
    xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" 
    xmlns:w10="urn:schemas-microsoft-com:office:word" 
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" 
    xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" 
    xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" 
    xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" 
    xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" 
    xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" 
    xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape"> 

     <w:body> 

      <w:p> ... 
      </w:p> 

      <w:p w14:paraId="5BB64FEF" w14:textId="77777777" w:rsidR="005A3789" w:rsidRDefault="005A3789" w:rsidP="005A3789"> 
      <w:pPr> 
      <w:pStyle w:val="Inhaltsverzeichnisberschrift"/> 
      </w:pPr> 
      <w:r> 
      <w:lastRenderedPageBreak/> 
      <w:t>Inhaltsverzeichnis</w:t> 
      </w:r> 
      </w:p>

'Inhaltsverzeichnis' is the default heading of my index (table of contents). The path is package -> 3rd part -> xmlData -> document -> body -> p
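The path described above can be written as an XPath expression. Rather than relying on the position of the part ("3rd part"), the sketch below selects the part by its pkg:name attribute, which should be less fragile if the order of the parts changes. The inline XML is a hypothetical miniature of the package layout, not the real file.

```r
library(xml2)

# Hypothetical miniature of the pkg:package layout from the question
pkg <- read_xml('
<pkg:package xmlns:pkg="http://schemas.microsoft.com/office/2006/xmlPackage">
  <pkg:part pkg:name="/_rels/.rels"><pkg:xmlData/></pkg:part>
  <pkg:part pkg:name="/word/document.xml">
    <pkg:xmlData>
      <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
        <w:body>
          <w:p><w:r><w:t>Inhaltsverzeichnis</w:t></w:r></w:p>
        </w:body>
      </w:document>
    </pkg:xmlData>
  </pkg:part>
</pkg:package>')

# Prefix-to-URL mapping collected from the whole document (pkg and w)
ns <- xml_ns(pkg)

# package -> part named /word/document.xml -> xmlData -> document -> body
body <- xml_find_first(
  pkg,
  "/pkg:package/pkg:part[@pkg:name='/word/document.xml']/pkg:xmlData/w:document/w:body",
  ns
)

# Text of every paragraph that is a direct child of the body
paragraph_text <- xml_text(xml_find_all(body, "./w:p", ns))
```

With the real file, paragraph_text would contain one string per w:p element, including the table-of-contents paragraphs.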

The information is stored here, for example:

<w:p w14:paraId="15ECF978" w14:textId="77777777" w:rsidR="009B5500" w:rsidRDefault="005A3789"> 
<w:pPr> 
<w:pStyle w:val="Verzeichnis1"/> 
<w:rPr> 
<w:rFonts w:eastAsiaTheme="minorEastAsia"/> 
<w:b w:val="0"/> 
<w:noProof/> 
<w:color w:val="auto"/> 
<w:lang w:eastAsia="de-DE"/> 
</w:rPr> 
</w:pPr> 
<w:r> 
<w:rPr> 
<w:b w:val="0"/> 
</w:rPr> 
<w:fldChar w:fldCharType="begin"/> 
</w:r> 
<w:r> 
<w:instrText xml:space="preserve"> TOC \o "1-4" \h \z \u 
</w:instrText> 
</w:r> 
<w:r> 
<w:rPr> 
<w:b w:val="0"/> 
</w:rPr> 
<w:fldChar w:fldCharType="separate"/> 
</w:r> 
<w:hyperlink w:anchor="_Toc474825312" w:history="1"> 
<w:r w:rsidR="009B5500" w:rsidRPr="009D0220"><w:rPr> 
<w:rStyle w:val="Hyperlink"/> 
<w:noProof/> 
</w:rPr> 
        **<w:t>1</w:t>** 
</w:r> 
<w:r w:rsidR="009B5500"><w:rPr><w:rFonts w:eastAsiaTheme="minorEastAsia"/> 
<w:b w:val="0"/> 
<w:noProof/> 
<w:color w:val="auto"/> 
<w:lang w:eastAsia="de-DE"/> 
</w:rPr><w:tab/> 
</w:r> 
<w:r w:rsidR="009B5500" w:rsidRPr="009D0220"> 
<w:rPr> 
<w:rStyle w:val="Hyperlink"/> 
<w:noProof/> 
</w:rPr> 
        **<w:t>Management Summary</w:t>** 
</w:r> 
<w:r w:rsidR="009B5500"> 
<w:rPr> 
<w:noProof/> 
<w:webHidden/> 
</w:rPr> 
<w:tab/> 
</w:r> 
<w:r w:rsidR="009B5500"> 
<w:rPr> 
<w:noProof/> 
<w:webHidden/> 
</w:rPr><w:fldChar w:fldCharType="begin"/> 
</w:r> 
<w:r w:rsidR="009B5500"> 
<w:rPr> 
<w:noProof/> 
<w:webHidden/> 
</w:rPr> 
<w:instrText xml:space="preserve"> PAGEREF _Toc474825312 \h </w:instrText> 
</w:r> 
<w:r w:rsidR="009B5500"> 
<w:rPr> 
<w:noProof/> 
<w:webHidden/> 
</w:rPr> 
</w:r> 
<w:r w:rsidR="009B5500"> 
<w:rPr> 
<w:noProof/> 
<w:webHidden/> 
</w:rPr> 
<w:fldChar w:fldCharType="separate"/> 
</w:r> 
<w:r w:rsidR="009B5500"> 
<w:rPr> 
<w:noProof/> 
<w:webHidden/> 
</w:rPr> 
       **<w:t>6</w:t>** 
</w:r> 
<w:r w:rsidR="009B5500"> 
<w:rPr> 
<w:noProof/> 
<w:webHidden/> 
</w:rPr> 
<w:fldChar w:fldCharType="end"/> 
</w:r> 
</w:hyperlink> 
</w:p>

This is the first entry of the index: 1 Management Summary 6


2017-02-14 wolf_wue

It would help to have the input XML – GGamba

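I hope this is easier to understand now. Below this last p element there are several more p elements that contain the content; this is a small example taken from among them. –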

I think the definition of the 'w' prefix is missing. Can you find something like 'w="http://schemas......"' in the xml? – GGamba
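To check whether a prefix such as w is actually declared, the namespaces of the parsed document can be listed with xml_ns(). The sketch below is only an illustration on a hypothetical two-namespace document; it also shows passing an explicit prefix mapping to xml_find_all(), which can help when the file declares the URL under a different prefix.

```r
library(xml2)

# Hypothetical minimal document that declares the w and w14 prefixes
doc <- read_xml(paste0(
  '<w:document',
  ' xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"',
  ' xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml">',
  '<w:body><w:p><w:r><w:t>Inhaltsverzeichnis</w:t></w:r></w:p></w:body>',
  '</w:document>'
))

# Named character vector: prefix -> namespace URL
ns <- xml_ns(doc)
has_w <- "w" %in% names(ns)

# Explicit mapping, independent of the prefixes used inside the file
word_ns <- c(w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main")
texts <- xml_text(xml_find_all(doc, "//w:t", ns = word_ns))
```

If has_w is FALSE for the real file, an XPath that uses the w: prefix without an explicit mapping will fail with an undefined-prefix error.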

We can use:

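library(xml2) 
library(magrittr) 

# Parse the Word XML export; the pkg and w prefixes are taken from the file
x <- read_xml("path/to/file.xml") 

# Each TOC hyperlink is expected to hold three text runs: number, title, page
titles <- xml_find_all(x, 
       "/pkg:package//pkg:part/pkg:xmlData/w:document/w:body/w:p/w:hyperlink/w:r/w:t") %>% 
     xml_text() %>% 
     matrix(ncol = 3, byrow = TRUE) %>% 
     as.data.frame() 

colnames(titles) <- c('numChapter', 'title', 'numPage')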

This retrieves the text of all nodes that match that XPath.

Based on the example you gave, the XPath matches (I think) the numChapter, its title and its numPage.
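The reshaping into three columns only works if every TOC entry has exactly three text runs. A slightly more defensive variant, sketched here under that caveat, collects the runs per w:hyperlink and keeps only the complete entries. The first entry is taken from the question; the second and third entries (anchors "_TocExample2" and "_TocExample3") are hypothetical and only there to show the filtering.

```r
library(xml2)

# First entry from the question; the other two are hypothetical
toc <- read_xml('
<w:body xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:p><w:hyperlink w:anchor="_Toc474825312">
    <w:r><w:t>1</w:t></w:r><w:r><w:tab/></w:r>
    <w:r><w:t>Management Summary</w:t></w:r>
    <w:r><w:t>6</w:t></w:r>
  </w:hyperlink></w:p>
  <w:p><w:hyperlink w:anchor="_TocExample2">
    <w:r><w:t>2</w:t></w:r><w:r><w:tab/></w:r>
    <w:r><w:t>Example Chapter</w:t></w:r>
    <w:r><w:t>7</w:t></w:r>
  </w:hyperlink></w:p>
  <w:p><w:hyperlink w:anchor="_TocExample3">
    <w:r><w:t>Entry without number</w:t></w:r>
    <w:r><w:t>9</w:t></w:r>
  </w:hyperlink></w:p>
</w:body>')

links <- xml_find_all(toc, "//w:hyperlink")

# Text runs of each hyperlink, kept together per entry
entries <- lapply(links, function(h) xml_text(xml_find_all(h, "./w:r/w:t")))

# Only entries with exactly three runs fit the number/title/page layout
complete <- lengths(entries) == 3

titles <- as.data.frame(do.call(rbind, entries[complete]), stringsAsFactors = FALSE)
colnames(titles) <- c("numChapter", "title", "numPage")

# The anchor is the bookmark name that links the entry to its chapter heading
titles$anchor <- xml_attr(links[complete], "anchor")
```

Keeping the anchor may be useful later, because the corresponding chapter heading in the body is normally marked with a bookmark of the same name. The sketch assumes at least one complete entry exists; with none, the column assignment would fail.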

As said above, this will raise an error if the XML is not well-formed and/or some namespaces are missing.

Hope this helps.


2017-02-14 14:37:05 GGamba

Mhh, unfortunately not really. '…(part_xml), row.names = NULL)' gives the same output. My problem is that the output has no structure, so the text is not linked to or grouped by chapter heading. –

Is the part of the xml you posted what you want "structured"? I think 'Inhaltsverzeichnis' is the first node you are interested in, followed by several similar nodes. – GGamba

Edited, check whether this is what you are interested in – GGamba
