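How to extract links and titles from an .html page?

I'd like to add a new feature to my website. How can I extract links and titles from an .html page?

I want users to be able to upload their bookmark backup file (from any browser, if possible) so that I can import it into their profile and they don't have to enter every bookmark manually.

The only part I'm missing is extracting the title and URL from the uploaded file. Can anyone give me a clue about where to start or what to read?

I used the search, and "how to extract data from a raw html file" is the most closely related question I found, but it doesn't cover this.

I really don't mind whether it uses jQuery or PHP.

Thank you very much.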


2010-12-12 Toni Michel Caubet

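It might help everyone if you could post an example of each type of bookmark backup file you want to support (one per browser) – scoates 2010-12-12 18:41:58

The Netscape format is the common one: http://msdn.microsoft.com/en-us/library/aa753582(v=vs.85).aspx – Matthew 2010-12-12 18:56:34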
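For readers who have not seen the Netscape bookmark format mentioned in the comment above, here is a minimal sketch of reading one with PHP's built-in DOMDocument. The sample data and the helper name `extractBookmarks` are hypothetical and only illustrate the idea; real exports from browsers may contain folders, extra attributes, and a charset declaration that this sketch does not handle.

```php
<?php
// Hypothetical sample in the Netscape bookmark file format (illustrative data only).
$sample = <<<HTML
<!DOCTYPE NETSCAPE-Bookmark-file-1>
<TITLE>Bookmarks</TITLE>
<H1>Bookmarks</H1>
<DL><p>
    <DT><A HREF="http://example.com/" ADD_DATE="1292180000">Example Site</A>
    <DT><A HREF="http://example.org/docs" ADD_DATE="1292180100">Example Docs</A>
</DL><p>
HTML;

/**
 * Hypothetical helper: returns a list of ['title' => ..., 'url' => ...] pairs,
 * one per <a> element that carries a non-empty href.
 */
function extractBookmarks(string $html): array
{
    // Buffer libxml parse warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $bookmarks = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $url = $a->getAttribute('href');
        if ($url === '') {
            continue; // skip anchors without a target
        }
        // In this format the bookmark title is the anchor text.
        $bookmarks[] = ['title' => trim($a->textContent), 'url' => $url];
    }
    return $bookmarks;
}

$bookmarks = extractBookmarks($sample);
print_r($bookmarks);
```

The HTML parser lowercases tag and attribute names, which is why `getElementsByTagName('a')` and `getAttribute('href')` find the uppercase `<A HREF=...>` entries.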

Thank you everyone, I got it!

Final code: this shows the anchor text and the href of every link in an .html file.

$html = file_get_contents('bookmarks.html');
//Create a new DOM document
$dom = new DOMDocument;

//Parse the HTML. The @ is used to suppress any parsing warnings
//that will be raised if the $html string isn't well-formed HTML.
@$dom->loadHTML($html);

//Get all links. You could also use any other tag name here,
//like 'img' or 'table', to extract other tags.
$links = $dom->getElementsByTagName('a');

//Iterate over the extracted links and display their text and URLs
foreach ($links as $link){
    //Show the anchor text, then the "href" attribute.
    echo $link->nodeValue, ': ';
    echo $link->getAttribute('href'), '<br>';
}

Again, thank you very much.


2010-12-12 20:18:17

This might be sufficient:

$dom = new DOMDocument; 
$dom->loadHTML($html); 
foreach ($dom->getElementsByTagName('a') as $node) 
{ 
    echo $node->nodeValue.': '.$node->getAttribute("href")."\n"; 
}


2010-12-12 18:50:07 Matthew

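Where $html is the path to the file? Thanks for such a quick answer :D – 2010-12-12 18:53:36

@Toni, '$html' is a string containing the HTML. You can use '$dom->loadHTMLFile()' to load directly from a file. (You may want to prefix it with '@' to suppress warnings.) – Matthew 2010-12-12 18:54:34

Wow! Thank you so much! It looks almost done! I can get the links, but I'm having trouble with the name or title (I tried both) – 2010-12-12 19:06:56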
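As a sketch of the suggestion above, the following loads the document directly from a file with `loadHTMLFile()`. Instead of the `@` prefix it uses `libxml_use_internal_errors()`, a swapped-in alternative that buffers parse warnings rather than hiding them. The temporary file and its contents are assumptions made so the example is self-contained; in practice `$path` would point at the uploaded file.

```php
<?php
// Write a tiny illustrative file so the sketch can run on its own.
$path = tempnam(sys_get_temp_dir(), 'bm');
file_put_contents(
    $path,
    '<html><body><a href="http://example.com/">Example</a></body></html>'
);

// Buffer parse warnings instead of emitting them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$loaded = $dom->loadHTMLFile($path);
$warnings = libxml_get_errors(); // inspect or log these if needed
libxml_clear_errors();

$pairs = [];
if ($loaded) {
    foreach ($dom->getElementsByTagName('a') as $node) {
        // nodeValue is the anchor text; href is the link target.
        $pairs[] = $node->nodeValue . ': ' . $node->getAttribute('href');
    }
}
echo implode("\n", $pairs), "\n";

unlink($path); // clean up the temporary file
```

Keeping the warnings in `$warnings` makes it possible to report a badly damaged upload to the user, which `@` would silently discard.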

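Assuming the links are stored in an HTML file, the best solution is probably to use an HTML parser such as PHP Simple HTML DOM Parser (I have never tried it myself). (The other option is a basic string search or a regular expression, but you should almost never use a regexp to parse HTML.)

After reading the HTML file with the parser, use its functions to find the a tags.

From the tutorial: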

// Find all links 
foreach($html->find('a') as $element) 
     echo $element->href . '<br>';
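If installing a third-party parser is not an option, a similar lookup can be sketched with PHP's built-in DOMDocument and DOMXPath. This is an alternative to the library shown above, not its API, and the sample markup is hypothetical. It also reads the optional `title` attribute alongside the anchor text, since either may hold what a user thinks of as the link's title.

```php
<?php
// Hypothetical sample markup, for illustration only.
$html = '<html><body>'
      . '<a href="http://example.com/" title="Example home">Example</a>'
      . '<a name="top">no href here</a>'
      . '<a href="http://example.org/">Other</a>'
      . '</body></html>';

libxml_use_internal_errors(true); // buffer parse warnings
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$found = [];
// Select only <a> elements that actually have an href attribute.
foreach ($xpath->query('//a[@href]') as $a) {
    $found[] = [
        'text'  => trim($a->textContent),
        'href'  => $a->getAttribute('href'),
        'title' => $a->getAttribute('title'), // empty string when absent
    ];
}
print_r($found);
```

The XPath filter `[@href]` skips named anchors, so no extra check for an empty href is needed inside the loop.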


2010-12-12 18:53:17

Here is an example. In your case you can use:

$content = file_get_contents('bookmarks.html');

Run the following:

<?php 

$content = '<html> 

<title>Random Website I am Crawling</title> 

<body> 

Click <a href="http://clicklink.com">here</a> for foobar 

Another site is http://foobar.com 

</body> 

</html>'; 

$regex = "((https?|ftp)\:\/\/)?"; // SCHEME 
$regex .= "([a-z0-9+!*(),;?&=\$_.-]+(\:[a-z0-9+!*(),;?&=\$_.-]+)?@)?"; // User and Pass 
$regex .= "([a-z0-9-.]*)\.([a-z]{2,4})"; // Host or IP 
$regex .= "(\:[0-9]{2,5})?"; // Port 
$regex .= "(\/([a-z0-9+\$_-]\.?)+)*\/?"; // Path 
$regex .= "(\?[a-z+&\$_.-][a-z0-9;:@&%=+\/\$_.-]*)?"; // GET Query 
$regex .= "(#[a-z_.-][a-z0-9+\$_.-]*)?"; // Anchor 


$matches = array(); //create array 
$pattern = "/$regex/"; 

preg_match_all($pattern, $content, $matches); 

print_r(array_values(array_unique($matches[0]))); 
echo "<br><br>"; 
echo implode("<br>", array_values(array_unique($matches[0])));

Output:

Array 
(
    [0] => http://clicklink.com 
    [1] => http://foobar.com 
)

http://clicklink.com

http://foobar.com
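The regex above returns URLs only, without the link titles the question asks for, and hand-written URL patterns are easy to get wrong. One possible alternative, sketched here under the assumption that only links inside a tags matter, is to take href values from the DOM and check them with PHP's built-in `filter_var()` and `FILTER_VALIDATE_URL`. The sample markup is a hypothetical variation of the `$content` string above.

```php
<?php
// Hypothetical markup: one valid link, one "#" placeholder, one broken href.
$content = '<html><title>Random Website I am Crawling</title><body>'
         . 'Click <a href="http://clicklink.com">here</a> for foobar '
         . '<a href="#">top</a> <a href="not a url">broken</a>'
         . '</body></html>';

libxml_use_internal_errors(true); // buffer parse warnings
$dom = new DOMDocument();
$dom->loadHTML($content);
libxml_clear_errors();

$valid = [];
foreach ($dom->getElementsByTagName('a') as $a) {
    $url = $a->getAttribute('href');
    // filter_var returns false for anything that is not a well-formed URL.
    if (filter_var($url, FILTER_VALIDATE_URL) !== false) {
        $valid[] = ['title' => trim($a->textContent), 'url' => $url];
    }
}
print_r($valid);
```

Unlike the regex, this does not pick up bare URLs written in plain text, so it fits bookmark files better than free-form pages.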


2015-03-28 20:59:50

$html = file_get_contents('your file path');

$dom = new DOMDocument;
@$dom->loadHTML($html);

$styles = $dom->getElementsByTagName('link');
$links = $dom->getElementsByTagName('a');
$scripts = $dom->getElementsByTagName('script');

foreach ($styles as $style) {
    if ($style->getAttribute('href') != "#") {
        echo $style->getAttribute('href');
        echo '<br>';
    }
}

foreach ($links as $link) {
    if ($link->getAttribute('href') != "#") {
        echo $link->getAttribute('href');
        echo '<br>';
    }
}

foreach ($scripts as $script) {
    echo $script->getAttribute('src');
    echo '<br>';
}


2016-01-08 08:20:56 Raghavendra

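The formatting is broken and the answer is hard to read. Please edit your answer to make it more readable – michaldo 2016-01-08 08:29:26

Too much code for the given question... – 2016-01-08 08:43:17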
