在树中存储站点的链接

我想存储我从一个非二进制树中的站点刮取的链接。这些链接按层次排列（显然）。问题是如何生成树？我的意思是，我要如何通过链接提供的页面工作，以便我知道谁是谁的孩子。在树中存储站点的链接

现在我可以得到第一级和第二级链接，但不知道如何从这里走，除了我必须递归必须建立它，并有一种方法来停止当我到叶子（我有）。

我在想什么了类似的信息（在Python代码）：

def buildTree(root): 
for node in root.children: 
    if <end condition here>: 
     continue 
    else: 
     nodes = getNodes(urllib2.urlopen(node.url).read()) 
     node.addChildren(nodes) 
     buildTree(node)

其中root和节点是用户定义的节点类

来源

2010-09-07 hyperboreean

只是为了确保我理解正确，你基本上要通过整个网站做爬行，并创造一切，从一个起源的链接的家谱父母链接？你在正确的轨道上，但它听起来像有两件事让你感到困惑 - 存储信息的数据结构以及如何编写递归函数。之后你想怎么处理数据？可视化它？序列化它？ – Dave 2010-09-07 12:46:50

你是对的，我想通过整个网站并创建树。递归函数我认为我知道了，但我不确定数据结构，树是否适合这种情况。 – hyperboreean 2010-09-07 12:58:37

显然，在一个网站上的链接是不是树，但是图。您应该有一个由URL标识的Page对象和一个从一个页面指向另一个页面的链接对象（并且页面A可以指向页面B，而页面B指向页面A，使其成为图形，而不是树）。

扫描算法伪代码：

process_page(current_page): 
    for each link on the current_page: 
    if target_page is not already in your graph: 
     create a Page object to represent target_page 
     add it to to_be_scanned set 
    add a link from current_page to target_page 

scan_website(start_page) 
    create Page object for start_page 
    to_be_scanned = set(start_page) 
    while to_be_scanned is not empty: 
     current_page = to_be_scanned.pop() 
     process_page(current_page)

来源

2010-09-07 13:16:56

是的，它完全是一张图而不是一棵树。谢谢！ – hyperboreean 2010-09-08 09:36:42

在树中存储站点的链接

回答

相关问题