Webcrawler link extraction problem

I'm writing a simple web crawler. The problem is with link extraction.

I'm using cpp-netlib and Boost. Here are a few lines of my CLink class.

CLink::CLink(const CLink& father, const std::string& relUrl)
{
    uri = relUrl;
    boost::network::uri::uri instance(relUrl);
    boost::network::uri::uri instanceFather(father.uri);

    if ((valid = boost::network::uri::is_valid(instance)) == 1)
    {
        scheme    = boost::network::uri::scheme(instance);
        user_info = boost::network::uri::user_info(instance);
        host      = boost::network::uri::host(instance);
        port      = boost::network::uri::port(instance);
        path      = boost::network::uri::path(instance);
        query     = boost::network::uri::query(instance);
        fragment  = boost::network::uri::fragment(instance);

        uri = scheme;
        uri += "://";
        uri += host;
        uri += path;
    }
    else
    {
        if ((valid = boost::network::uri::is_valid(instanceFather)) == 1)
        {
            scheme    = boost::network::uri::scheme(instanceFather);
            user_info = boost::network::uri::user_info(instanceFather);
            host      = boost::network::uri::host(instanceFather);
            port      = boost::network::uri::port(instanceFather);
            path      = boost::network::uri::path(instance);
            query     = boost::network::uri::query(instance);
            fragment  = boost::network::uri::fragment(instance);

            uri = scheme;
            uri += "://";
            uri += host;
            uri += path;
        }
    }
}

CLink::CLink(const std::string& _url)
{
    uri = _url;
    boost::network::uri::uri instance(_url);

    if ((valid = boost::network::uri::is_valid(instance)) == 1)
    {
        scheme    = boost::network::uri::scheme(instance);
        user_info = boost::network::uri::user_info(instance);
        host      = boost::network::uri::host(instance);
        port      = boost::network::uri::port(instance);
        path      = boost::network::uri::path(instance);
        query     = boost::network::uri::query(instance);
        fragment  = boost::network::uri::fragment(instance);

        uri = scheme;
        uri += "://";
        uri += host;
        uri += path;
    }
    else
        std::cout << "err " << std::endl;
}

I extract the links from the fetched web page with the htmlcxx library. I take the HTML::Node objects and normalize the links with Boost.Filesystem.

if (url.find("http://") == std::string::npos)
{
    std::string path = link.get_path() + url;
    url = link.get_host() + path;

    boost::filesystem::path result;
    boost::filesystem::path p(url);
    for (boost::filesystem::path::iterator it = p.begin(); it != p.end(); ++it)
    {
        if (*it == "..")
        {
            if (boost::filesystem::is_symlink(result))
                result /= *it;
            else if (result.filename() == "..")
                result /= *it;
            else
                result = result.parent_path();
        }
        else if (*it == ".")
        {
            // Ignore
        }
        else
        {
            // Just cat other path entries
            result /= *it;
        }
    }

    url = "http://" + result.string();
}

return ret;

Now the problem is this.

I try to fetch http://www.wikipedia.de/ and I get URLs like

http://wikimedia.de/wiki/Vereinszeitung ...

and on the page http://wikimedia.de/wiki/Vereinszeitung there are often links like /wiki/vereinsatzung

For those links I get URLs like

http://wikimedia.de/wiki/Vereinszeitung/wiki/Freies_Wissen

Does anyone have an idea?

Source

2011-05-16 Roby

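You need a special case for absolute links (those that start with /).

If the href starts with /, then the resulting link should be (using the terms from the URI template in the RFC):

[scheme]://[authority][what you got in href]

What you are building is:

[scheme]://[authority][path][what you got in href]

So you are duplicating the path information.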

So, if the href (url in your code) starts with /, you should simply change:

std::string path = link.get_path() + url; 
url = link.get_host() + path; // this is incorrect btw, missing the [port]

to

url = link.get_host() + ":" + link.get_port() + url;
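The special case can be sketched as a small helper that works on paths only. The name `resolve_path` is hypothetical and not part of cpp-netlib or htmlcxx, and the sketch assumes the href carries no scheme or host. A link starting with / replaces the parent path entirely; any other link is appended to the directory part of the parent path, so a trailing document name such as index.html is dropped first.

```cpp
#include <string>

// Hypothetical helper, not part of cpp-netlib or htmlcxx.
// base_path: path of the page the link was found on, e.g. "/wiki/Vereinszeitung"
// href:      value of the href attribute (assumed to carry no scheme or host)
std::string resolve_path(const std::string& base_path, const std::string& href)
{
    // Root-relative link: it already is the complete path.
    if (!href.empty() && href[0] == '/')
        return href;

    // Relative link: keep the base path up to and including its last '/',
    // which drops the document name (e.g. "index.html").
    const std::string::size_type slash = base_path.rfind('/');
    const std::string dir =
        (slash == std::string::npos) ? "/" : base_path.substr(0, slash + 1);
    return dir + href;
}
```

With this, /wiki/Freies_Wissen found on /wiki/Vereinszeitung resolves to /wiki/Freies_Wissen instead of being appended to the parent path.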

It would probably be cleaner to do the path normalization on the path only, not on the whole URL (i.e. add host:port after normalizing the path).
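A minimal sketch of that idea, assuming C++11 and plain string handling instead of boost::filesystem. Both `normalize_path` and `build_url` are hypothetical names. As a simplification, empty segments (double slashes) are collapsed, which a strict RFC 3986 implementation would not do.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: lexically collapses "." and ".." in an absolute URL path.
std::string normalize_path(const std::string& path)
{
    std::vector<std::string> kept;
    std::istringstream in(path);
    std::string segment;
    while (std::getline(in, segment, '/')) {
        if (segment.empty() || segment == ".")
            continue;                    // skip empty and "." segments
        if (segment == "..") {
            if (!kept.empty())
                kept.pop_back();         // step up, but never above the root
        } else {
            kept.push_back(segment);
        }
    }

    std::string result;
    for (const std::string& s : kept)
        result += "/" + s;

    if (result.empty())
        return "/";                      // nothing left: the root path
    if (path.back() == '/')
        result += '/';                   // preserve a trailing slash
    return result;
}

// Hypothetical helper: host and port are added only after normalization.
std::string build_url(const std::string& scheme, const std::string& host,
                      const std::string& port, const std::string& path)
{
    std::string url = scheme + "://" + host;
    if (!port.empty())
        url += ":" + port;               // omit the colon when no port is set
    return url + normalize_path(path);
}
```

Because the host never passes through the normalizer, a ".." in the path cannot remove it, and no filesystem call such as is_symlink is involved.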

[I think your code will fail if it encounters https links.]

Source
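The reason is that the check url.find("http://") treats every link without that exact prefix, including https ones, as relative. One possible way around it is to test for any leading scheme, following the scheme grammar in RFC 3986 (a letter, then letters, digits, "+", "-" or ".", terminated by ":"). The helper name `has_scheme` is hypothetical.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: true if url starts with a scheme such as "http:" or "https:".
bool has_scheme(const std::string& url)
{
    if (url.empty() || !std::isalpha(static_cast<unsigned char>(url[0])))
        return false;                    // a scheme must start with a letter
    for (std::string::size_type i = 1; i < url.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(url[i]);
        if (c == ':')
            return true;                 // reached the end of the scheme
        if (!std::isalnum(c) && c != '+' && c != '-' && c != '.')
            return false;                // not a scheme character
    }
    return false;                        // no ':' found
}
```

Links for which this returns true can be used as they are; only the others need to be resolved against the parent link. Note that it also matches schemes a crawler may want to skip, such as mailto:.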

2011-05-16 05:54:18 Mat

Thanks for the answer. But get_path returns the full path, like /rob/index.html, so it quickly becomes /rob/index.html/blaaa ... das suxx :( – Roby 2011-05-16 06:19:53

But it works :) thx – Roby 2011-05-16 06:28:58
