2013-10-14 82 views
0

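HTML DOMDocument: get the string from the paragraph after a tag

I want to parse an HTML document. I need the content of every 'p' that comes after an 'h2'.

The HTML to parse (example):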

<h1>Lorem ipsum</h1> 
<p> 
    Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Donec quam felis, ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa quis enim. Donec pede justo, 
</p> 

<h2>Aenean commodo</h2> 
<p> 
    Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. 
</p> 

<h2>consectetuer adipiscing</h2> 
<p> 
    Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Donec quam felis, ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa quis enim. Donec pede justo, 
</p> 

Here I want the last two 'p' tags (determined dynamically, not hard-coded).


Here is my PHP code:

libxml_use_internal_errors(true); // set before loading so parse warnings are collected, not printed
$dom = new DOMDocument(); 
$dom->loadHTMLFile($html_file); 

$h2_tags = $dom->getElementsByTagName('h2'); 

foreach($h2_tags as $single_tag) { 

    echo $single_tag->textContent;   
    print_r($single_tag); 

} 

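This only gives me the text content of the h2, but I need the 'p' that follows each h2. Is this possible, or do I need to use a different class?

Answers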

2

You can try the following code:

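libxml_use_internal_errors(true); // set before loading so parse warnings are collected, not printed
$dom = new DOMDocument(); 
$dom->loadHTMLFile($html_file); 

$xpath = new DOMXPath($dom); 
// Text nodes of every <p> that has an <h2> somewhere before it in document order
$nodeList = $xpath->evaluate('//p[preceding::h2]/text()'); 

foreach ($nodeList as $textNode) { 
    echo $textNode->textContent . "<br><br>"; 
} 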

Reference output: http://phpfiddle.org/main/code/7i5-3ir
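
The `preceding::h2` test above matches any 'p' that appears anywhere after an 'h2' in the document. If only the paragraph directly following each heading is wanted, a narrower XPath can be evaluated relative to each 'h2'. The following is a minimal sketch under that assumption, not a definitive solution; the inline `$html` string and the `$paragraphs` array are made up for illustration and stand in for the file from the question.

```php
<?php
// Hypothetical sample markup, shortened from the example in the question.
$html = <<<HTML
<h1>Lorem ipsum</h1>
<p>Intro paragraph.</p>
<h2>Aenean commodo</h2>
<p>First section text.</p>
<h2>consectetuer adipiscing</h2>
<p>Second section text.</p>
HTML;

libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

$paragraphs = [];
foreach ($xpath->query('//h2') as $h2) {
    // First element sibling after this <h2>, kept only if it is a <p>.
    $p = $xpath->query('following-sibling::*[1][self::p]', $h2)->item(0);
    if ($p !== null) {
        $paragraphs[trim($h2->textContent)] = trim($p->textContent);
    }
}

print_r($paragraphs);
```

Keying the result by heading text is only a convenience for this sketch; two headings with identical text would overwrite each other, so a plain list may be safer for real documents.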

0
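<?php 

$items = array(); 

$document = new DOMDocument; 
@$document->loadHTMLFile('example.html'); // @ suppresses warnings about malformed HTML

foreach ($document->getElementsByTagName('h2') as $node) { 
    // Walk forward through the siblings that follow this <h2>
    while ($node = $node->nextSibling) { 
        // Skip text nodes (whitespace) until the first element is reached
        if ($node->nodeType == XML_ELEMENT_NODE) { 
            // Keep the element only if it is a <p>
            if ($node->nodeName == 'p') { 
                $items[] = $node->textContent; 
            } 

            break; // stop at the first element either way
        } 
    } 
} 

print_r($items);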
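
The loop above stops at the first element after each 'h2', so it collects at most one paragraph per heading. If a heading can be followed by several paragraphs, the same sibling walk can keep going until it meets an element that is not a 'p'. This is a rough sketch of that variation, assuming PHP 7 or later for the type declarations; the function name `paragraphsAfterHeadings` and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper: for each heading, collect the run of <p> elements
// that directly follows it.
function paragraphsAfterHeadings(DOMDocument $document, string $headingTag = 'h2'): array
{
    $sections = [];
    foreach ($document->getElementsByTagName($headingTag) as $heading) {
        $texts = [];
        for ($node = $heading->nextSibling; $node !== null; $node = $node->nextSibling) {
            if ($node->nodeType !== XML_ELEMENT_NODE) {
                continue; // skip text nodes and comments between elements
            }
            if ($node->nodeName !== 'p') {
                break; // the run of paragraphs ends at the first other element
            }
            $texts[] = trim($node->textContent);
        }
        $sections[] = [
            'heading'    => trim($heading->textContent),
            'paragraphs' => $texts,
        ];
    }
    return $sections;
}

// Made-up markup for illustration only.
$html = '<h2>A</h2><p>one</p><p>two</p><h2>B</h2><p>three</p><div>x</div><p>not collected</p>';

libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$document = new DOMDocument();
$document->loadHTML($html);
libxml_clear_errors();

$sections = paragraphsAfterHeadings($document);
print_r($sections);
```

In this sample the last paragraph is left out on purpose, because a 'div' separates it from the run that follows the second heading.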