How can I use easyhtmlparser to get all the links from an HTML file in Python?


I'm trying to use the HTML parser at http://easyhtmlparser.sourceforge.net/. How can I use easyhtmlparser to get all the links from an HTML file in Python?

fd = open('file.html', 'r') 
data = fd.read() 
fd.close() 
html = Html() 
dom = html.feed(data) 
for ind in dom.sail():
    if ind.name == 'a':
        print(ind.attr['ref'])


2013-07-03 tau

Are you married to easyhtmlparser? Beautiful Soup is my hero. – vroomfondel

Well, I don't particularly feel like reading the easyhtmlparser docs to get all the links and images on a page, but if you're willing to use Beautiful Soup:

from bs4 import BeautifulSoup

with open('file.html', 'r') as fd:
    data = fd.read()

soup = BeautifulSoup(data, 'html.parser')  # name the parser explicitly
for link in soup.find_all('a'):
    print(link.get('href'))  # or do whatever with it

This should work, but I haven't tested it. Good luck!

Edit: Now I have. It works.

Edit 2: To find images, search for all the image tags and pull out their src links. I'm sure you can find out how in the Beautiful Soup or easyhtmlparser docs.
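
The image case described above could look roughly like this. It is only a sketch, assuming Beautiful Soup 4 with Python's built-in 'html.parser' backend; the helper name extract_links_and_images and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup


def extract_links_and_images(html_text):
    """Return (hrefs, srcs) collected from <a> and <img> tags."""
    soup = BeautifulSoup(html_text, "html.parser")
    # href=True / src=True skip tags that lack the attribute entirely
    hrefs = [a["href"] for a in soup.find_all("a", href=True)]
    srcs = [img["src"] for img in soup.find_all("img", src=True)]
    return hrefs, srcs


# Hypothetical sample markup: one real link, one anchor without href, one image
sample = '<a href="/page">p</a><a name="x">no link</a><img src="pic.png">'
hrefs, srcs = extract_links_and_images(sample)
print(hrefs, srcs)
```

Filtering on href=True and src=True avoids the None values that link.get('href') returns for anchors that have no href.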

To download an image and put it in a folder:

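import os
from urllib.request import urlretrieve  # Python 2: from urllib import urlretrieve
# IMAGE_URL, path_to_folder and imagename are placeholders to fill in
urlretrieve(IMAGE_URL, os.path.join(path_to_folder, imagename))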

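Or you can just read the response from urllib yourself, since in the end everything is just a string, and reading is more direct than retrieving.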
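
A minimal sketch of that read-then-write approach, assuming Python 3, where the call lives in urllib.request and read() returns bytes rather than a string. save_url is a hypothetical helper name, and the data: URL only stands in for a real image URL so the snippet runs without a network connection.

```python
import os
import tempfile
from urllib.request import urlopen


def save_url(url, dest_path):
    """Read the whole response body into memory, then write it to dest_path."""
    with urlopen(url) as response:
        body = response.read()  # bytes in Python 3
    with open(dest_path, "wb") as out:
        out.write(body)
    return len(body)


# Stand-in for a real image URL: a data: URL whose payload decodes to b"hello"
demo_url = "data:text/plain;base64,aGVsbG8="
demo_path = os.path.join(tempfile.mkdtemp(), "hello.txt")
written = save_url(demo_url, demo_path)
```

Reading everything at once is fine for small images; for large files, copying the response in chunks would keep memory use down.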


2013-07-03 07:57:14 vroomfondel

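OK. I tried Beautiful Soup, but the easyhtmlparser docs seem simpler. What I particularly dislike about Beautiful Soup is that it doesn't seem to have other methods for handling other things. Anyway, it's fine. I'll keep trying here. – tau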

@barroieuoeiru Whatever works for you. It does look like Beautiful Soup has more features, is more reliable, and is better documented. – vroomfondel

I think I know why my code wasn't working. I was using 'ref' instead of 'href'. Apparently I can also use the method dom.find('a') to iterate over all the links. – tau

I would do it like this.

from ehp import * 

with open('file.html', 'r') as fd: 
    data = fd.read() 

html = Html() 
dom = html.feed(data) 

# Walk every node; print link targets and image sources
for ind in dom.sail():
    if ind.name == 'a':
        print(ind.attr['href'])
    elif ind.name == 'img':
        print(ind.attr['src'])


2013-07-03 08:16:58 godknows
