VB.net very slow web scraping

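I have working web scraper code, but my problem is that it scrapes links very slowly: each scrape seems to take a random interval of minutes. What my program does is scrape all the links from the first HTML page, and then scrape one link from each of the pages those links point to. Is there any way to make this faster? I am using a BackgroundWorker, so the BackgroundWorker is not the problem here; the problem is the code itself.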

Here is my working code:

' Assumes "Imports HtmlAgilityPack" at the top of the file.
Dim sList As New List(Of String)
Dim INwebClient As New System.Net.WebClient
Dim INWebSource As String = INwebClient.DownloadString("http://www.yelp.com/search?find_desc=Hotels&find_loc=CA&ns=1&ls=88145bf794a78999#")
Dim INhtmlDoc As New HtmlAgilityPack.HtmlDocument()
INhtmlDoc.LoadHtml(INWebSource)
Dim counter As Integer = 0

' Pass 1: collect every business link ("/biz...") from the search results page.
' Note: SelectNodes returns Nothing when no node matches.
For Each INlink As HtmlNode In INhtmlDoc.DocumentNode.SelectNodes("//a[@href]")
    Dim INholder As String = INlink.Attributes("href").Value
    If INholder.Contains("/biz") Then
        ' Turn the relative link into an absolute URL.
        Dim INoutput As String = INholder.Insert(INholder.IndexOf("/biz"), "http://www.yelp.com")
        sList.Add(INoutput)
    End If
Next

' Pass 2: download each business page and collect its share links.
For Each Uri As String In sList
    Dim webClient As New System.Net.WebClient
    Dim WebSource As String = webClient.DownloadString(Uri)
    Dim htmlDoc As New HtmlAgilityPack.HtmlDocument()
    htmlDoc.LoadHtml(WebSource)

    For Each link As HtmlNode In htmlDoc.DocumentNode.SelectNodes("//a[@href]")
        Dim holder As String = link.Attributes("href").Value
        If holder.Contains("/biz_share") Then
            Dim output As String = holder.Insert(holder.IndexOf("/biz"), "http://www.yelp.com")
            ' Skip review links and links that are already listed.
            If Not output.Contains("reviewid") AndAlso Not ListBox1.Items.Contains(output) Then
                ListBox1.Items.Add(output)
                counter = counter + 1
            End If
        End If
    Next
    Label1.Text = counter.ToString()

Next

Source

2013-08-30 Marc Intes

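You are "over-storing" and "over-analysing" the same things at different times. There are various improvements you can make to your code. Still, some points are not clear: can you explain what "counter" is expected to count? Bear in mind that it sits inside two nested loops, which might not give you exactly what you are after: it counts the number of links meeting certain conditions (containing "biz_share", not containing "reviewid", etc.) but raises it to the power of two; that is, if you have 3 links, it gives 9. Please explain in detail what you want. – varocarbas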

My counter tells me the number of links shown in the ListBox.
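If the counter is only meant to report how many distinct links were found, one option is to collect the links in a HashSet(Of String) and read its Count, instead of checking ListBox1.Items.Contains on every link. The following is a minimal sketch under that assumption; the module name UniqueLinkSketch and the function CollectUniqueLinks are hypothetical, and the filtering rules are copied from the question's code.

```vbnet
Imports System.Collections.Generic

Module UniqueLinkSketch

    ' Hypothetical helper: keeps the "/biz_share" links that are not
    ' review links, made absolute, with duplicates dropped by the set.
    Public Function CollectUniqueLinks(hrefs As IEnumerable(Of String)) As HashSet(Of String)
        Dim unique As New HashSet(Of String)()
        For Each href As String In hrefs
            If href.Contains("/biz_share") AndAlso Not href.Contains("reviewid") Then
                ' Same URL construction as in the question's code.
                unique.Add(href.Insert(href.IndexOf("/biz"), "http://www.yelp.com"))
            End If
        Next
        Return unique
    End Function

End Module
```

With this, the count is simply the Count property of the returned set, and the ListBox can be filled once at the end rather than searched on every iteration.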

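I misunderstood your code at first glance: I thought you were analysing the same links over and over. You don't need to rely on a list (it makes things slower), but other than that, most of the time is spent visiting a new link on every iteration, which clearly cannot be avoided. This approach can be sped up a bit, but it is time-consuming by definition. – varocarbas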
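Since the page downloads dominate the running time, one way to speed things up "a bit" is to overlap them instead of fetching one page at a time. The following is a minimal sketch, not the commenter's method: it assumes .NET 4.5 or later (HttpClient and Async/Await), and the names ConcurrentDownloadSketch, DownloadAllAsync, DownloadOneAsync and the maxParallel parameter are hypothetical.

```vbnet
Imports System.Collections.Generic
Imports System.Linq
Imports System.Net.Http
Imports System.Threading
Imports System.Threading.Tasks

Module ConcurrentDownloadSketch

    ' One shared HttpClient for all requests.
    Private ReadOnly Client As New HttpClient()

    ' Downloads every distinct URL with at most maxParallel requests in flight.
    ' Results are in the order of the distinct URLs. If any download fails,
    ' awaiting the returned task throws.
    Public Async Function DownloadAllAsync(urls As IEnumerable(Of String),
                                           maxParallel As Integer) As Task(Of String())
        Using gate As New SemaphoreSlim(maxParallel)
            ' ToList() starts all tasks now; the semaphore limits how many run at once.
            Dim tasks = urls.Distinct().Select(Function(u) DownloadOneAsync(u, gate)).ToList()
            Return Await Task.WhenAll(tasks)
        End Using
    End Function

    ' Waits for a free slot, downloads one page, then frees the slot.
    Private Async Function DownloadOneAsync(url As String, gate As SemaphoreSlim) As Task(Of String)
        Await gate.WaitAsync()
        Try
            Return Await Client.GetStringAsync(url)
        Finally
            gate.Release()
        End Try
    End Function

End Module
```

From an Async method this could be called as Dim pages As String() = Await DownloadAllAsync(sList, 4), after which each page can be parsed with HtmlAgilityPack as before. The limit of 4 is an arbitrary assumption; sending many parallel requests to one site may get the client throttled or blocked.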

You should use the Yelp API instead of scraping the web pages yourself. I'm pretty sure it will be faster and more reliable.

Source

2013-08-30 14:30:12
