php - Simple HTML DOM - elements between other elements

I want to write a PHP script to scrape a website and store some of its elements in a database.

Here is my problem: the web page is written like this:

<h2>The title 1</h2> 
<p class="one_class"> Some text </p> 
<p> Some interesting text </p> 

<h2>The title 2</h2> 
<p class="one_class"> Some text </p> 
<p> Some interesting text </p> 

<p class="one_class"> Some different text </p> 
<p> Some other interesting text </p> 

<h2>The title 3</h2> 
<p class="one_class"> Some text </p> 
<p> Some interesting text </p>

I only want the h2 elements and the p elements with the interesting text, not the ones with class="one_class".

I tried this PHP code:

<?php 
$numberP = 0; 
foreach($html->find('p') as $p) 
{ 
    $pIsOneClass = PIsOneClass($html, $p); 

    if($pIsOneClass == false) 
    { 
     echo $p->outertext; 
       $h2 = $html->find("h2", $numberP); 
       echo $h2->outertext; 
       $numberP++; 
     } 

} 
?>

The function PIsOneClass($html, $p) is:

<?php 
function PIsOneClass($html, $p) 
{ 
foreach($html->find("p.one_class") as $p_one_class) 
    { 
     if($p == $p_one_class) 
     { 
      return true; 
     }   
    } 
    return false; 
} 
?>

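It doesn't work, and I understand why, but I don't know how to fix it.

How can we say "I want every p without a class that sits between two h2 elements"?

Thanks a lot!

Source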

2014-10-19 Maxime Thizeau

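If they are all 'p.one_class', why not find those 'p' tags and remove them before outputting or saving the result? – 2014-10-19 14:07:19

But how can I keep the h2 and p in order? With this script it prints h2 p h2 p h2 p, but I want something like h2 p p h2 p – 2014-10-19 14:29:49

This task is easier with XPath, because you are scraping more than one kind of element and want to keep them in source order. You can use PHP's DOM library, which includes DOMXPath, to find and filter the elements you need: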

$html = '<h2>The title 1</h2> 
<p class="one_class"> Some text </p> 
<p> Some interesting text </p> 

<h2>The title 2</h2> 
<p class="one_class"> Some text </p> 
<p> Some interesting text </p> 

<p class="one_class"> Some different text </p> 
<p> Some other interesting text </p> 

<h2>The title 3</h2> 
<p class="one_class"> Some text </p> 
<p> Some interesting text </p>'; 

# create a new DOM document and load the html 
$dom = new DOMDocument; 
$dom->loadHTML($html); 
# create a new DOMXPath object 
$xp = new DOMXPath($dom); 

# search for all h2 elements and all p elements that do not have the class 'one_class' 
$interest = $xp->query('//h2 | //p[not(@class="one_class")]'); 

# iterate through the search results (a DOMNodeList of h2 and p elements), printing out node 
# names and values 
foreach ($interest as $i) { 
    echo "node " . $i->nodeName . ", value: " . $i->nodeValue . PHP_EOL; 
}

Output:

node h2, value: The title 1 
node p, value: Some interesting text 
node h2, value: The title 2 
node p, value: Some interesting text 
node p, value: Some other interesting text 
node h2, value: The title 3 
node p, value: Some interesting text

As you can see, the elements stay in source order, and it's easy to eliminate the nodes you don't want.
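Since the goal is to store the result in a database, it may help to group each paragraph under the heading that precedes it. The following is only a sketch built on the same query: the `$sections` array and its `title` / `paragraphs` keys are names made up for illustration, and the sample markup is shortened.

```php
<?php
// Sketch: group the "interesting" paragraphs under the h2 that precedes them.
// $sections and its layout are illustrative assumptions, not part of the answer.
$html = '<h2>The title 1</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>
<h2>The title 2</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>
<p class="one_class"> Some different text </p>
<p> Some other interesting text </p>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

$sections = [];   // list of ['title' => string, 'paragraphs' => string[]]
$current = -1;    // index of the section currently being filled

// The union query yields h2 and p nodes in document order.
foreach ($xp->query('//h2 | //p[not(@class="one_class")]') as $node) {
    $text = trim($node->nodeValue);
    if ($node->nodeName === 'h2') {
        // A new heading starts a new section.
        $sections[] = ['title' => $text, 'paragraphs' => []];
        $current = count($sections) - 1;
    } elseif ($current >= 0) {
        // Paragraphs that appear before the first h2 are skipped.
        $sections[$current]['paragraphs'][] = $text;
    }
}

print_r($sections);
```

Each entry of `$sections` could then be written to the database as one heading row plus its paragraph rows, or in whatever shape your schema needs.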

Source

2014-10-19 15:31:13

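Thanks, I didn't know that existed. Is it possible to use Simple HTML DOM alongside it, or is that pointless? – 2014-10-19 17:57:01

You can't perform XPath operations with Simple HTML DOM, but you can output HTML from DOMDocument and then read it with SHD. You should be able to do everything you want with DOM, though; it is a very comprehensive library for handling XML. [Here is the manual](http://php.net/manual/en/book.dom.php). – 2014-10-20 07:34:57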
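A rough sketch of that round trip, combined with the earlier suggestion to remove the unwanted p tags first: delete the `p.one_class` nodes with DOM, then serialize with `saveHTML()`. The sample markup and the `$clean` variable are illustrative assumptions, and the hand-off to Simple HTML DOM is mentioned only in a comment because it requires that library to be loaded.

```php
<?php
// Sketch: strip the unwanted paragraphs, then serialize the cleaned document.
$html = '<div><h2>The title 1</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p></div>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

// Snapshot the matches into an array before removing them from the tree.
$unwantedNodes = iterator_to_array($xp->query('//p[@class="one_class"]'));
foreach ($unwantedNodes as $unwanted) {
    $unwanted->parentNode->removeChild($unwanted);
}

// saveHTML() returns the whole document as an HTML string.
$clean = $dom->saveHTML();
echo $clean;

// Assumption: with Simple HTML DOM included, $clean could then be parsed
// with str_get_html($clean) and queried with find() as usual.
```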

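From the Simple HTML DOM manual:

[attribute=value]

Matches elements that have the specified attribute with a certain value. Or

[!attribute]

Matches elements that don't have the specified attribute.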
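For comparison, here is a sketch of the same idea ("elements without a given attribute") using PHP's built-in DOMXPath, so it runs without the Simple HTML DOM library. Treating `[!class]` as equivalent to XPath `not(@class)` is my reading of the manual entry above, not something the manual states. Note that this excludes every p that has any class at all, which is slightly stricter than filtering out only `one_class`.

```php
<?php
// Sketch: select p elements that carry no class attribute at all.
$html = '<h2>The title 1</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

$noClass = [];
foreach ($xp->query('//p[not(@class)]') as $p) {
    $noClass[] = trim($p->nodeValue);
}

print_r($noClass);
```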

Source

2014-10-19 14:58:19 Billy
