Scraping a website with Html Agility Pack: response of GET not as expected

Using System.Net.HttpWebRequest, I want to mimic in code a user search on the following search engine.

http://www.scirus.com

An example of a search URL is as follows:

http://www.scirus.com/srsapp/search?q=core+facilities&t=all&sort=0&g=s

I have the following code to perform the HTTP GET. Note that I am using HtmlAgilityPack.

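// Requires: using System; using System.IO; using System.Net; using HtmlAgilityPack;
protected override HtmlDocument MakeRequestHtml(string requestUrl)
{
    try
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(requestUrl);
        // Send a browser-like User-Agent header with the request.
        request.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)";

        // Dispose the response and its stream once the document has been loaded.
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        {
            HtmlDocument htmlDoc = new HtmlDocument();
            htmlDoc.Load(stream);
            return htmlDoc;
        }
    }
    catch (Exception e)
    {
        // Print the error, wait for a key press, and signal failure with null.
        Console.WriteLine(e.Message);
        Console.Read();
        return null;
    }
}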

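Here "requestUrl" is the example search URL shown above.

The contents of htmlDoc.DocumentNode.InnerHtml do not contain any search results, and look nothing like the results page you get when you copy and paste the example search URL above into a browser.

I am guessing this is because you must first have a session before the request can be performed. Can anyone advise whether there is a feasible way to replicate the behaviour of a user agent? Or perhaps there is a better way to achieve the goal of "scraping" the search results that I am not aware of? Suggestions please.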
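Whether this site needs a session is only a guess in the question. If it did, one common approach with HttpWebRequest is to share a single CookieContainer across requests, so that cookies set by an earlier response are sent with later ones. A minimal sketch; the SessionRequests class and its Create method are hypothetical names used for illustration:

```csharp
using System;
using System.Net;

public static class SessionRequests
{
    // Hypothetical helper: creates a GET request that uses the shared
    // cookie container, so cookies stored by earlier responses are
    // sent automatically on this request.
    public static HttpWebRequest Create(string url, CookieContainer cookies, string userAgent)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.CookieContainer = cookies;
        request.UserAgent = userAgent;
        return request;
    }
}
```

A caller would create one CookieContainer, use it for a first request to the site's home page, and then reuse the same container for the search request.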

Contents of robots.txt: the listing appears further down, just before the second answer.

Response: htmlDoc.DocumentNode.InnerHtml

Source: 2012-06-03 dior001

OK, I actually tested this with WebClient:

// Requires: using System.IO; using System.Net;
static void Main(string[] args)
{
    using (WebClient client = new WebClient())
    {
        // Present a browser-like User-Agent.
        client.Headers.Set("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0");
        string str = client.DownloadString("http://www.scirus.com/srsapp/search?q=core+facilities&t=all&sort=0&g=s");
        // Save the downloaded page; WriteAllText creates or overwrites the file and closes it.
        File.WriteAllText("test.txt", str);
    }
}

Here is the downloaded file: http://pastebin.com/qswtgC4n

Source: 2012-06-03 06:35:07 Lakis

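Thanks, that works. In fact the original code works too. The problem was caused by an incorrectly formatted requestUrl argument being passed to the MakeRequestHtml method. – dior001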
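Since the root cause turned out to be a malformed requestUrl, one way to guard against it is to build the URL from an escaped query term and to validate it before making the request. A minimal sketch, assuming the parameter values t=all, sort=0 and g=s from the example URL in the question; the SearchUrl class and its methods are hypothetical names:

```csharp
using System;

public static class SearchUrl
{
    // Hypothetical helper: builds a search URL shaped like the example
    // in the question from a free-text query.
    public static string Build(string query)
    {
        // EscapeDataString percent-encodes reserved characters (a literal '+'
        // becomes %2B); spaces become %20, which is then rewritten to '+'
        // to match the form used in the example URL.
        string q = Uri.EscapeDataString(query).Replace("%20", "+");
        return "http://www.scirus.com/srsapp/search?q=" + q + "&t=all&sort=0&g=s";
    }

    // Returns true when the string parses as an absolute http or https URL.
    public static bool IsWellFormed(string url)
    {
        Uri parsed;
        return Uri.TryCreate(url, UriKind.Absolute, out parsed)
            && (parsed.Scheme == Uri.UriSchemeHttp || parsed.Scheme == Uri.UriSchemeHttps);
    }
}
```

SearchUrl.Build("core facilities") yields the example search URL from the question, and IsWellFormed can be called on any requestUrl before it is passed to MakeRequestHtml.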

Contents of robots.txt, as posted in the question:

#/robots.txt file for http://www.scirus.com 

User-agent: NetMechanic 
Disallow: /srsapp/sciruslink 

User-agent: * 
Disallow: /srsapp/sciruslink 
Disallow: /srsapp/search 
Disallow: /srsapp/search_simple 
Disallow: /search_simple 
# for dev and accept server uncomment below line at Build time to disallow robots completely 
##Disallow:/

You may need to set a user agent, for example:

request.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)";

You should also check the site's robots.txt file to make sure you are welcome.
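The robots.txt check can also be done in code. The sketch below is a deliberately simplified reading of the format, not a full parser: it treats each User-agent line as starting a new group, looks only at the "User-agent: *" group, and does plain prefix matching on Disallow lines, ignoring Allow rules and wildcards. RobotsCheck and IsDisallowed are hypothetical names:

```csharp
using System;

public static class RobotsCheck
{
    // Simplified check: returns true if `path` starts with a Disallow
    // prefix listed under "User-agent: *" in the given robots.txt text.
    public static bool IsDisallowed(string robotsTxt, string path)
    {
        bool inWildcardGroup = false;
        foreach (string rawLine in robotsTxt.Split('\n'))
        {
            // Drop comments (everything from '#') and surrounding whitespace.
            string line = rawLine;
            int hash = line.IndexOf('#');
            if (hash >= 0) line = line.Substring(0, hash);
            line = line.Trim();

            int colon = line.IndexOf(':');
            if (colon < 0) continue; // blank line or not a "field: value" line

            string field = line.Substring(0, colon).Trim().ToLowerInvariant();
            string value = line.Substring(colon + 1).Trim();

            if (field == "user-agent")
            {
                inWildcardGroup = (value == "*");
            }
            else if (field == "disallow" && inWildcardGroup && value.Length > 0)
            {
                if (path.StartsWith(value, StringComparison.Ordinal)) return true;
            }
        }
        return false;
    }
}
```

Applied to the robots.txt listed above, this reports /srsapp/search as disallowed for generic crawlers, which is worth weighing before scraping the search results.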

Source: 2012-06-03 02:38:45

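Thanks for your reply. The response I get through code is still completely different HTML from what the browser produces. I have posted an update that includes robots.txt. Do you have any further suggestions? – dior001

I tested it with the user agent changed and it works perfectly. – Lakis

Thanks for testing it. I have updated the post to show the response I get from the request. It is different from what I see when I open this link in a browser: http://www.scirus.com/srsapp/search?q=core+facilities&t=all&sort=0&g=s. Could you open the link above in a browser and confirm whether you get the same HTML as when you run the code? If you do get the same HTML, then I am probably fetching it incorrectly at the moment. – dior001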

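Make sure you are not pinging the server excessively, especially if the code that loads the document worked previously. You may have hit a server rule that sends you to robots.txt or a similar page.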
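One simple way to avoid hitting the server too often is to enforce a minimum interval between requests. A minimal sketch; the Throttle class is a hypothetical name, and any interval chosen is an assumption rather than a limit the site documents:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public sealed class Throttle
{
    private readonly TimeSpan minInterval;
    private readonly Stopwatch sinceLast = new Stopwatch();

    public Throttle(TimeSpan minInterval)
    {
        this.minInterval = minInterval;
    }

    // Blocks until at least minInterval has passed since the previous call.
    // The first call returns immediately. Not thread-safe; intended for a
    // single sequential scraping loop.
    public void Wait()
    {
        if (sinceLast.IsRunning)
        {
            TimeSpan remaining = minInterval - sinceLast.Elapsed;
            if (remaining > TimeSpan.Zero)
            {
                Thread.Sleep(remaining);
            }
        }
        sinceLast.Restart();
    }
}
```

Calling Wait() on a shared Throttle instance immediately before each request keeps consecutive requests at least the chosen interval apart.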

Source: 2014-12-10 02:04:02 alec
