How do I correctly grab certain nodes from an HTML string?
I am trying to grab some nodes from an HTML string that I specify:
$html = <<<'HTML'
<h1>Details außen</h1>
<h1>Schreibmappe DIN A4</h1>
<hr>
<p>Die Außenseite [...]</p>
<p class="own-branding">[...]</p>
<p><img id="tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" class="tinymce-editor-image tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" src="{media path='media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg'}" alt="07-7206-56_geschlossen_VS5458e3fd87895" width="274" height="339" data-src="media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg"></p>
HTML;
I need the first h1 and the last img node in the string.
For this I used DOMDocument, because preg_match_all or something similar could miss things.
Full code:
$html = <<<'HTML'
<h1>Details außen</h1>
<h1>Schreibmappe DIN A4</h1>
<hr>
<p>Die Außenseite [...]</p>
<p class="own-branding">[...]</p>
<p><img id="tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" class="tinymce-editor-image tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" src="{media path='media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg'}" alt="07-7206-56_geschlossen_VS5458e3fd87895" width="274" height="339" data-src="media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg"></p>
HTML;
$dom = new \DOMDocument();
// since the libxml was designed for ISO-8859-1, this is a backwards hack
// @see https://stackoverflow.com/questions/11309194/php-domdocument-failing-to-handle-utf-8-characters/11310258
$dom->loadHTML(
    iconv('UTF-8', 'ISO-8859-1', $html),
    \LIBXML_HTML_NOIMPLIED
);
$h1List = $dom->getElementsByTagName('h1');
$h1 = $h1List->item(0);
$imgList = $dom->getElementsByTagName('img');
$img = $imgList->item($imgList->length - 1);
$data = array(
    'tabTitle' => trim($dom->saveHTML($h1)),
    'tabImg' => trim($dom->saveHTML($img))
);
// remove both wrappers if empty
$imgWrapper = $img->parentNode;
$imgWrapper->removeChild($img);
if (!$imgWrapper->hasChildNodes()) {
    $imgWrapper->parentNode->removeChild($imgWrapper);
}
$h1Wrapper = $h1->parentNode;
$h1Wrapper->removeChild($h1);
if (!$h1Wrapper->hasChildNodes()) {
    $h1Wrapper->parentNode->removeChild($h1Wrapper);
}
$data['content'] = $dom->saveHTML();
var_dump($data);
Expected output:
array(
'tabTitle' => '<h1>Details außen</h1>',
'tabImg' => '<img id="tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" class="tinymce-editor-image tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" src="{media path=\'media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg\'}" alt="07-7206-56_geschlossen_VS5458e3fd87895" width="274" height="339" data-src="media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg">',
'content' => '
<h1>Schreibmappe DIN A4</h1>
<hr>
<p>Die Außenseite [...]</p>
<p class="own-branding">[...]</p>
<p>
'
);
But I get the following output:
array(3) {
'tabTitle' =>
string(501) "<h1>Details außen<h1>Schreibmappe DIN A4</h1>
<hr>
<p>Die Außenseite [...]</p>
<p class="own-branding">[...]</p>
<p><img id="tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" class="tinymce-editor-image tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" src="%7Bmedia%20path='media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg'%7D" alt="07-7206-56_geschlossen_VS5458e3fd87895" width="274" height="339" data-src="media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg"></p>
</h1>"
'tabImg' =>
string(373) "<img id="tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" class="tinymce-editor-image tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" src="%7Bmedia%20path='media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg'%7D" alt="07-7206-56_geschlossen_VS5458e3fd87895" width="274" height="339" data-src="media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg">"
'content' =>
string(108) "<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
"
}
What is wrong here? I am using PHP 5.6. If the problem is related to the PHP version, switching to PHP 7 would be possible.
You shouldn't have multiple h1 elements in your HTML – lloiacono
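I have never heard of that rule, and in my view it makes no sense. Think of a website with an index: the first heading in document order is the main point, and you refer to it directly with h2 and so on. Anyway, I googled the topic. Basically, yes, we shouldn't. But functionally it doesn't break anything. – alpham8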
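One plausible reading of the output in the question, offered as a sketch rather than a confirmed diagnosis: with LIBXML_HTML_NOIMPLIED the parser adds no implied html/body wrapper, so the first top-level element (the first h1) seems to become the root and everything after it ends up nested inside it. That would explain why tabTitle contains the whole document, and why content holds only the DOCTYPE once that h1 is removed. The code below assumes that wrapping the fragment in a single div avoids this. The helper name splitTabHtml() and the wrapper div are illustrative additions, LIBXML_HTML_NODEFDTD is added to suppress the default DOCTYPE, and the iconv() call is carried over from the question unchanged. It does not address the percent-encoded src value, which appears to come from the serializer escaping URI attributes.

```php
<?php
// Hypothetical helper; the name splitTabHtml() is not from the question.
function splitTabHtml($html)
{
    $dom = new DOMDocument();
    // Assumption: one wrapping <div> gives the parser a single root element,
    // so LIBXML_HTML_NOIMPLIED no longer nests everything inside the first <h1>.
    // LIBXML_HTML_NODEFDTD suppresses the default DOCTYPE.
    $dom->loadHTML(
        '<div>' . iconv('UTF-8', 'ISO-8859-1', $html) . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );
    $root = $dom->documentElement; // the wrapper <div>

    $h1 = $dom->getElementsByTagName('h1')->item(0);
    $imgList = $dom->getElementsByTagName('img');
    $img = $imgList->item($imgList->length - 1);
    if ($h1 === null || $img === null) {
        throw new InvalidArgumentException('Expected at least one h1 and one img.');
    }

    $data = array(
        'tabTitle' => trim($dom->saveHTML($h1)),
        'tabImg'   => trim($dom->saveHTML($img)),
    );

    // Detach both nodes; drop a parent that is left empty, but never the wrapper.
    foreach (array($img, $h1) as $node) {
        $parent = $node->parentNode;
        $parent->removeChild($node);
        if (!$parent->isSameNode($root) && !$parent->hasChildNodes()) {
            $parent->parentNode->removeChild($parent);
        }
    }

    // Serialize the wrapper's children so the <div> itself is not emitted.
    $content = '';
    foreach ($root->childNodes as $child) {
        $content .= $dom->saveHTML($child);
    }
    $data['content'] = trim($content);

    return $data;
}
```

Calling var_dump(splitTabHtml($html)) with the $html from the question should then yield the three keys separately. Note that this sketch removes the now-empty p around the image entirely, as the "remove both wrappers if empty" comment intends, so content will not end with a dangling p as in the expected output.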