PHP - Display the full content of a remote page

I need to fetch a remote page, modify some elements (using the 'PHP Simple HTML DOM Parser' library) and output the modified content.

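The problem appears when the remote page does not use full URLs in its source, so CSS files and images fail to load. Of course, that does not stop me from modifying elements, but the output looks bad.

For example, open https://www.raspberrypi.org/downloads/

However, with the code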

$html = file_get_html('http://www.raspberrypi.org/downloads'); 
echo $html;

it looks bad. I tried a simple hack, but it only helps a little:

$html = file_get_html('http://www.raspberrypi.org/downloads'); 
$html=str_ireplace("</head>", "<base href='http://www.raspberrypi.org'></head>", $html); 
echo $html;
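One possible reason the hack only helps a little: the HTML specification expects a <base> element to come before any other element whose attributes take URLs, so a <base> placed just before </head> can arrive after the stylesheet links it is meant to affect. A minimal sketch of inserting it right after the opening <head> tag instead, using plain string handling; add_base_tag() is a hypothetical helper name, not part of Simple HTML DOM Parser:

```php
<?php
// Hypothetical helper: inserts a <base> tag immediately after the opening <head> tag,
// so that relative URLs later in the document resolve against $baseUrl.
function add_base_tag(string $html, string $baseUrl): string
{
    $base = '<base href="' . htmlspecialchars($baseUrl, ENT_QUOTES) . '">';

    // A callback is used so that characters such as "$" in the URL are never
    // interpreted as regex back-references. The limit of 1 touches only the first <head>.
    return preg_replace_callback(
        '/<head\b[^>]*>/i',
        function (array $match) use ($base) {
            return $match[0] . $base;
        },
        $html,
        1
    );
}
```

Assuming the page has already been fetched into a string, it could be used as echo add_base_tag((string) $html, 'http://www.raspberrypi.org/');. This only changes how the browser resolves relative URLs; the URLs in the markup stay as they were.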

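Is there any way to "instruct" the script to resolve all links in the $html variable against "http://www.raspberrypi.org"? In other words, how can I make raspberrypi.org the "main" source for all fetched images/CSS elements?

I don't know how to explain it better, but I trust you get the idea.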


2017-01-24 Mindaugas Li

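Since only Nikolay Ganovski offered a solution, I wrote code that turns a partial page into a complete one by finding css/img/form tags with incomplete URLs and completing them. In case someone needs it, the code is below: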

// Finalizes a remote page by completing incomplete css/img/form URLs
// (path/file.css becomes http://somedomain.com/path/file.css, etc.)
function finalize_remote_page($content, $root_url)
{
    $root_url = rtrim($root_url, '/'); // avoid a double slash when joining

    $content_object = str_get_html($content);
    if (!is_object($content_object)) {
        return $content; // parsing failed: return the input unchanged
    }

    // selector => attribute that holds the URL
    $targets = array('link[rel=stylesheet]' => 'href', 'img' => 'src', 'form' => 'action');

    foreach ($targets as $selector => $attribute) {
        foreach ($content_object->find($selector) as $entry) {
            $url = $entry->$attribute;
            // skip missing attributes, URLs that already have a scheme (http://, https://, data:)
            // and protocol-relative URLs like //domain.com
            if (!is_string($url) || $url === '' || preg_match('#^(?:[a-z][a-z0-9+.-]*:|//)#i', $url)) {
                continue;
            }
            $entry->$attribute = $root_url . '/' . ltrim($url, '/');
        }
    }

    return $content_object;
}
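The function above joins the root URL and the attribute value by simple concatenation, which treats a path-relative URL such as img/x.png as if it were relative to the site root. A rough sketch of resolving against the page URL instead is shown here; resolve_url() and its parameters are hypothetical names, it uses only PHP built-ins, and it deliberately ignores "../" segments, query-only and fragment-only URLs:

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM Parser): resolves $url against $pageUrl.
function resolve_url(string $url, string $pageUrl): string
{
    // Already absolute (http:, https:, data:, mailto:, ...): leave untouched.
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $url)) {
        return $url;
    }

    $parts  = parse_url($pageUrl);
    $scheme = $parts['scheme'] ?? 'http';
    $origin = $scheme . '://' . ($parts['host'] ?? '')
            . (isset($parts['port']) ? ':' . $parts['port'] : '');

    // Protocol-relative (//cdn.example.com/x.css): only the scheme is missing.
    if (strncmp($url, '//', 2) === 0) {
        return $scheme . ':' . $url;
    }

    // Root-relative (/css/x.css): attach to the origin.
    if (strncmp($url, '/', 1) === 0) {
        return $origin . $url;
    }

    // Path-relative (img/x.png): attach to the directory part of the page path.
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);

    return $origin . $dir . $url;
}
```

With a helper like this, each attribute could be assigned resolve_url($entry->href, $page_url), where $page_url is assumed to be the address the page was fetched from rather than just the domain.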


2017-01-25 12:08:54

I just tried this locally and found (in the source) that the link tags in the HTML look like this:

<link rel='stylesheet' href='/wp-content/themes/mind-control/js/qtip/jquery.qtip.min.css' />

This obviously points to a file that would have to exist in my local directory (like localhost/wp-content/etc ... /). The link tag's href must look like

<link rel='stylesheet' href='https://www.raspberrypi.org/wp-content/themes/mind-control/js/qtip/jquery.qtip.min.css' />

So what you probably want to do is find all link tags and prepend "https://www.raspberrypi.org" to their href attributes.

Edit: Hey, I actually got the styles working, try this code:

$html = file_get_html('http://www.raspberrypi.org/downloads');
// prefix the href of every <link> tag with the remote domain
foreach ($html->find('link') as $element)
{
    $element->href = 'http://www.raspberrypi.org' . $element->href;
}
echo $html; die;


2017-01-24 19:32:07

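Yes, that looks like a workable solution (it will need some extra coding): look for every "invalid" link that does not contain the remote domain, add the domain, and output the content.

Thanks for the effort; for now it seems to be the only solution. Of course, I will need to modify the code slightly (adding the prefix only to links whose URL has no domain, since the code will be used to parse many different pages), but most likely this will be the accepted answer (unless someone offers an easier idea) :)

Yes, you can check whether the href contains the domain. For images, I suggest iterating over the body elements and checking whether the string value of each element contains an image extension (like '.gif', '.png'). Then you can edit the image url or src of the current element, whichever it is.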
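As an alternative to scanning element text for image extensions, the URL-carrying attributes can be read directly per tag. The sketch below uses PHP's built-in DOMDocument rather than the Simple HTML DOM Parser library used in this thread; absolutize_attributes(), its parameters and the tag list are illustrative assumptions, and it expects non-empty HTML input:

```php
<?php
// Sketch using PHP's built-in DOM extension instead of Simple HTML DOM Parser.
// absolutize_attributes() is a hypothetical name chosen for this example.
function absolutize_attributes(string $html, string $rootUrl): string
{
    $rootUrl = rtrim($rootUrl, '/');

    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // real-world HTML is rarely valid; collect parse warnings quietly
    $doc->loadHTML($html);
    libxml_clear_errors();

    // tag name => attribute that carries the URL
    $targets = ['img' => 'src', 'link' => 'href', 'script' => 'src', 'form' => 'action'];

    foreach ($targets as $tag => $attribute) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            $url = $node->getAttribute($attribute);
            // Skip empty values, URLs that already have a scheme, and protocol-relative URLs.
            if ($url === '' || preg_match('#^(?:[a-z][a-z0-9+.-]*:|//)#i', $url)) {
                continue;
            }
            $node->setAttribute($attribute, $rootUrl . '/' . ltrim($url, '/'));
        }
    }

    return $doc->saveHTML();
}
```

Because only URLs without a scheme or domain are prefixed, links that already point at another host are left alone, which matters when the same code has to process many different pages.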
