PHP: get the body from an HTML page

I want to strip the HTML body out of a complete HTML document.

I use the script below.

<?php
    function getbody($filename) {
        $file = file_get_contents($filename);

        $bodystartpattern = ".*<body>";
        $bodyendpattern = "</body>.*";

        $noheader = eregi_replace($bodystartpattern, "", $file);

        $noheader = eregi_replace($bodyendpattern, "", $noheader);

        return $noheader;
    }
    $bodycontent = getbody($_GET['url']);
?>

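In some cases, however, the tag does not appear literally as <body>; it may be something like <body style="margin:0;">. Can anyone suggest a regular expression for $bodystartpattern that handles this case by matching up to the closing ">" of the opening body tag?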

Source

2014-06-25 Guido Lemmens 2

Side note: [`eregi_replace()`](http://www.php.net//manual/en/function.eregi-replace.php) has been deprecated as of PHP 5.3.0. Relying on this function is highly discouraged.

See [this answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#answer-1732454) about using regular expressions to parse HTML...
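
For readers who still want a regex for the case asked about, here is a minimal sketch using the PCRE functions, which replace the ereg family (eregi_replace() was removed in PHP 7.0.0). The helper name getBodyWithRegex and the sample markup are made up for illustration. The pattern assumes no attribute value contains a literal ">" and that the page has a single body element; as the linked answer warns, a regex stays fragile on real-world HTML.

```php
<?php
// Hypothetical helper: returns the markup between <body ...> and </body>,
// or null when no body element is found.
function getBodyWithRegex(string $html): ?string
{
    // <body\b[^>]*> matches the opening tag with or without attributes.
    // The i modifier ignores case; the s modifier lets . match newlines.
    if (preg_match('~<body\b[^>]*>(.*?)</body\s*>~is', $html, $matches) === 1) {
        return $matches[1];
    }
    return null;
}

// Illustrative input only, not taken from a real page.
$sample = '<html><head><title>t</title></head>'
        . '<body style="margin:0;"><p>Hello</p></body></html>';
$result = getBodyWithRegex($sample);
echo $result, PHP_EOL; // prints <p>Hello</p>
```

The lazy `(.*?)` stops at the first `</body>`, whereas the original greedy patterns cut at the last one; on a well-formed page both give the same result.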

@1nflktd I have tried the code below.

<?php
    header('Content-Type:text/html; charset=UTF-8');

    function getbody($filename) {
        $file = file_get_contents($filename);
        $dom = new DOMDocument;
        $dom->loadHTML($file);
        $bodies = $dom->getElementsByTagName('body');
        assert($bodies->length === 1);
        $body = $bodies->item(0);
        for ($i = 0; $i < $body->children->length; $i++) {
            $body->remove($body->children->item($i));
        }
        $stringbody = $dom->saveHTML($body);
        return $stringbody;
    }

    $url = "http://www.barcelona.com/";
    $bodycontent = getbody($url);
?>

<html> 
<head></head> 
<body> 
<?php 
    echo "BODY ripped from: ".$url."<br/>"; 
    echo "<textarea rows='40' cols='200' >".$bodycontent."</textarea>"; 
?> 
</body> 
</html>

Source

2014-06-26 00:06:24

I just tried your code on my machine and it works fine. Are you getting any errors? If you don't have error reporting enabled, please turn it on.

It doesn't work on my machine. You can see the script at http://www.kunstplantenonline.nl/test/test.php and look at the PHP warnings.

Check this: http://stackoverflow.com/questions/9149180/domdocumentloadhtml-error, and check my updated answer.
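
The warnings mentioned in these comments most likely come from libxml complaining about imperfect markup while loadHTML() parses it. Rather than only silencing them, they can be buffered and inspected. The following is a rough sketch; the sample markup is invented, and which messages appear (if any) depends on the installed libxml2 version.

```php
<?php
// Deliberately sloppy markup, standing in for a real downloaded page.
$html = '<html><body style="margin:0;"><p>Unclosed paragraph<nav>menu</nav></body></html>';

// Buffer libxml errors instead of emitting PHP warnings; keep the old setting.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);

// Each buffered entry is a LibXMLError object (level, code, line, message).
foreach (libxml_get_errors() as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

// Empty the buffer and restore the previous error-handling mode.
libxml_clear_errors();
libxml_use_internal_errors($previous);
```

Restoring the previous setting instead of hard-coding false keeps the snippet from changing error handling for the rest of the script.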

Why not use an HTML parser?

function getbody($filename) {
    $file = file_get_contents($filename);

    $dom = new DOMDocument();
    // Buffer libxml parse errors so malformed markup does not emit PHP warnings.
    libxml_use_internal_errors(true);
    $dom->loadHTML($file);
    libxml_clear_errors();
    libxml_use_internal_errors(false);

    $bodies = $dom->getElementsByTagName('body');
    assert($bodies->length === 1);
    $body = $bodies->item(0);

    // Serialize the <body> element, including its own tag and attributes.
    // (The earlier loop over $body->children is dropped: DOMElement has no
    // such property, so it only produced notices and never ran.)
    $stringbody = $dom->saveHTML($body);
    return $stringbody;
}

DOM loadHTML reference
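
Since the question asks for the content without the surrounding body tag, one possible variation is to serialize each child node of the body element and join the results. This is a sketch under assumptions: the function name bodyInnerHtml and the sample page are hypothetical, and the input is an HTML string rather than a URL so it runs without network access.

```php
<?php
// Hypothetical helper: returns the markup inside <body>, without the tag itself.
function bodyInnerHtml(string $html): string
{
    $dom = new DOMDocument();

    // Buffer parse errors during loading, then restore the previous mode.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return ''; // no body element found
    }

    // Serialize every child node (elements, text, comments) in document order.
    $inner = '';
    foreach ($body->childNodes as $child) {
        $inner .= $dom->saveHTML($child);
    }
    return $inner;
}

// Illustrative input: the body tag carries attributes, as in the question.
$page = '<html><head><title>Demo</title></head>'
      . '<body style="margin:0;"><h1>Title</h1><p>Text</p></body></html>';
$inner = bodyInnerHtml($page);
echo $inner, PHP_EOL;
```

The serializer may insert line breaks between block-level elements, so the output is equivalent markup rather than a byte-for-byte copy of the input.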

Source

2014-06-25 18:15:44

I have copied your code, but now it returns nothing... any idea?

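@GuidoLemmens2 Does the page contain any PHP code, more specifically any '$'? That could break things. Do you have error reporting enabled? Do you get any response from it?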

Did you see the code I pasted in my next message?
