2017-07-31 16 views
0

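Parsing TEI with DOMXPath: getting the text of a child tag inside an evaluate loop

From a string containing a TEI file, I generate an index for navigating to its blocks. I retrieve all the div tags, and I also want to get, if it exists, the content of a tag inside the current div (the <head> tag).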

Sample file:

<div type="lib" n="1"><head>LIBER I</head>... 
<div type="pr">...</div> 
<div type="cap" n="1"><head>CAP EX</head><p><milestone unit="par" n="1" />...<milestone unit="par" n="2" />...</div> 
<div type="cap" n="2"><head>CAP EX</head><milestone unit="par" n="1" />...<milestone unit="par" n="2" />...</div> 
</div> 

I tried this, but it does not work:

//source file: 
    $fulltext = '<div type="lib" n="1"><head>LIBER I</head>...<div type="pr">...</div><div type="cap" n="1"><head>CAP EX</head><p><milestone unit="par" n="1" />...<milestone unit="par" n="2" />...</div><div type="cap" n="2"><head>CAP EX</head><milestone unit="par" n="1" />...<milestone unit="par" n="2" />...</div></div>'; 
    $dom = new DOMDocument(); 
    @$dom->loadHTML($fulltext); 
    $domx = new DOMXPath($dom); 
    $entries = $domx->evaluate("//div"); 
    echo '<ul>'; 
    foreach ($entries as $entry){ 
    $title = ''; 
    $type = $entry->getAttribute('type'); 
    $n = $entry->getAttribute('n'); 
    $head = $domx->evaluate("string(./head[1])",$entry); 
    if($head != '') $title = $head; else $title = $n; 
    echo '<li><a href="#'.$type.'-'.$n.'">'.$title.'</a></li>'; 
    } 
    echo '</ul>'; 

This is the line that does not work:

$head = $domx->evaluate("string(./head[1])",$entry); 

It returns this error:

DOMDocument::loadHTML(): htmlParseStartTag: misplaced <head> tag in Entity, line: 3 

The purpose of this line is to get, inside the loop, the text of the child <head> tag (in this example, "LIBER I").

Answers

0

Solved using XMLReader:

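       $level = 0; 
       $bc = array();       // value of n at each nesting level 
       $last_blocco = '';   // name of the last block seen, read before the first assignment below 
       $indici_bc = array(); 
       $indici_head = array(); 
       $passed_milestone = false; 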
       $xml = new XMLReader(); 
       $xml->open($pathTei); 
       //$xml->xml($testo); 
       while ($xml->read()){ 
        if($xml->nodeType == XMLReader::END_ELEMENT && $xml->name == 'div'){ 
         $level--; 
         $last_blocco = $xml->name; 
         if($passed_milestone){ $level--; $passed_milestone = false; } 
        } 
        if($xml->nodeType == XMLReader::ELEMENT && ($xml->name == 'div' || $xml->name == 'milestone')){ 
         $blocco = $xml->name; 
         $type = $xml->getAttribute('type'); 
         $n = $xml->getAttribute('n'); 
         $unit = $xml->getAttribute('unit') ?? ''; // getAttribute() returns null if absent; isset() cannot wrap a call 

         // here I get the child node 
         $node = new SimpleXMLElement($xml->readOuterXML()); 
         $head = $node->head ? (string)$node->head : ''; 

         $indici_head[] = $head; 
         if($last_blocco != 'milestone') $level++; 
         if($blocco == 'div') $bc[$level] = $n; else $bc[($level+1)] = $n; 
         $bc_str = ''; 
         for($j=1;$j<$level;$j++){ 
          if($bc_str != '') $bc_str.='.'; 
          $bc_str.=$bc[$j]; 
         } 
         if($bc_str != '') $bc_str.='.'; 
         $bc_str.=$n; 

         $last_blocco = $xml->name; 
         if($blocco == 'milestone') $passed_milestone = true; 

         $indici_bc[]=$bc_str; 
        } 
       } 
       $xml->close(); 
0

Using the @ sign on the load call can hide all sorts of problems. If you take it out, you will see that your document produces errors.
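One way to see those messages without relying on @ is to let libxml collect them. The following is only a minimal sketch: the one-line input string is a hypothetical, shortened version of the sample from the question, and the exact messages reported depend on the installed libxml2 version, so no output is shown.

```php
<?php
// Collect parser messages instead of silencing them with @.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
// Hypothetical minimal input: a <head> inside a <div>, as in the question.
$dom->loadHTML('<div type="lib" n="1"><head>LIBER I</head>...</div>');

$messages = [];
foreach (libxml_get_errors() as $error) {
    // Each item is a LibXMLError object (message, line, column, level, ...).
    $messages[] = trim($error->message);
}
libxml_clear_errors(); // empty the buffer so later parses start clean

print_r($messages);
```

Inspecting `$messages` shows what the HTML parser objected to, which is the information the @ operator was discarding.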

However, if you change the line to

$dom->loadXML($fulltext); 

the output gives you what you are after.
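As an illustration of that change, here is a minimal sketch of the question's loop using loadXML. It assumes a well-formed variant of the sample (the unclosed <p> is dropped and the "..." placeholders are kept as plain text), and it collects the titles in a hypothetical `$titles` array instead of echoing the list.

```php
<?php
// Well-formed variant of the question's sample (assumption: no stray <p>).
$fulltext = '<div type="lib" n="1"><head>LIBER I</head>...'
    . '<div type="pr">...</div>'
    . '<div type="cap" n="1"><head>CAP EX</head><milestone unit="par" n="1" />...</div>'
    . '<div type="cap" n="2"><head>CAP EX</head><milestone unit="par" n="1" />...</div>'
    . '</div>';

$dom = new DOMDocument();
$dom->loadXML($fulltext); // XML parser: <head> is an ordinary element here
$domx = new DOMXPath($dom);

$titles = [];
foreach ($domx->evaluate('//div') as $entry) {
    $n = $entry->getAttribute('n');
    // string() yields '' when the current div has no <head> child.
    $head = $domx->evaluate('string(./head[1])', $entry);
    $titles[] = ($head !== '') ? $head : $n;
}

print_r($titles);
```

With this input the first div yields "LIBER I", and the div of type "pr", which has neither a <head> nor an n attribute, yields an empty title. The sketch assumes the elements are in no namespace, as in the sample; if the real file declares the TEI namespace, the unprefixed names in the XPath expressions would not match until a prefix is registered with registerNamespace().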

+0

I added it to hide the display of warnings and errors. Why do you think I cannot get the content inside the head tag? – steplab

+0

If you take the @ off the load call, you get an error about the '<head>' tag. –

+0

It returns this: DOMDocument::loadHTML(): htmlParseStartTag: misplaced <head> tag in Entity, line: 3. Does anyone know why? – steplab