I am trying to retrieve all http and https links from a website, but sometimes I get a null exception

public partial class Form1 : Form 
{ 
    int y = 0; 
    string url = @"http://www.google.co.il"; 
    string urls = @"http://www.bing.com/images/search?q=cat&go=&form=QB&qs=n"; 

    public Form1() 
    { 
     InitializeComponent(); 
     //webCrawler(urls, 3); 
     List<string> a = webCrawler(urls, 1); 
     //GetAllImages(); 
    } 

    private int factorial(int n) 
    { 
     if (n == 0) return 1; 
     else y = n * factorial(n - 1); 
     listBox1.Items.Add(y); 
     return y; 
    } 

    private List<string> getLinks(HtmlAgilityPack.HtmlDocument document) 
    { 
     List<string> mainLinks = new List<string>(); 

     if (document.DocumentNode.SelectNodes("//a[@href]") == null) 
     { } 

     foreach (HtmlNode link in document.DocumentNode.SelectNodes("//a[@href]")) 
     { 
      var href = link.Attributes["href"].Value; 
      mainLinks.Add(href); 
     } 

     return mainLinks; 
    } 

    private List<string> webCrawler(string url, int levels) 
    { 
     HtmlAgilityPack.HtmlDocument doc; 
     HtmlWeb hw = new HtmlWeb(); 

     List<string> webSites;// = new List<string>(); 
     List<string> csFiles = new List<string>(); 

     csFiles.Add("temp string to know that something is happening in level = " + levels.ToString()); 
     csFiles.Add("current site name in this level is : "+url); 
     /* later should be replaced with real cs files .. cs files links..*/ 

     doc = hw.Load(url); 
     webSites = getLinks(doc); 

     if (levels == 0) 
     { 
      return csFiles; 
     } 
     else 
     { 
      int actual_sites = 0; 

      // cap on the number of sites visited per level for now, or it will take forever. 
      for (int i = 0; i < webSites.Count() && i < 100000; i++) 
      { 
       string t = webSites[i]; 
       /* 
         if (!webSites.Contains(t)) 
         { 
          webCrawler(t, levels - 1); 
         } 
       */ 

       if (t.StartsWith("http://") || t.StartsWith("https://")) // replace this with future FilterJunkLinks function 
       { 
        actual_sites++; 
        csFiles.AddRange(webCrawler(t, levels - 1)); 
        richTextBox1.Text += t + Environment.NewLine; 
       } 
      } 

      // report to a message box only at high levels.. 
      if (levels == 1) 
       MessageBox.Show(actual_sites.ToString()); 

      return csFiles; 
     } 
    }
}

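The exception is thrown after a few sites have been passed to the getLinks function.

The exception is in the getLinks function, on this line:

foreach (HtmlNode link in document.DocumentNode.SelectNodes("//a[@href]"))

Object reference not set to an instance of an object.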

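I tried adding an if there to check whether the result is null, and in that case I did return mainLinks; which is a List.

But if I do that, I don't get all of the website's links.

Right now I am using urls in the constructor. If I use url (www.google.co.il) instead, I get the same exception after a few seconds.

I can't figure out why this exception is being thrown. Is there any reason for it?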

System.NullReferenceException was unhandled
  Message=Object reference not set to an instance of an object.
  Source=GatherLinks
  StackTrace:
       at GatherLinks.Form1.getLinks(HtmlDocument document) in D:\C-Sharp\GatherLinks\GatherLinks\GatherLinks\Form1.cs:line 55
       at GatherLinks.Form1.webCrawler(String url, Int32 levels) in D:\C-Sharp\GatherLinks\GatherLinks\GatherLinks\Form1.cs:line 76
       at GatherLinks.Form1.webCrawler(String url, Int32 levels) in D:\C-Sharp\GatherLinks\GatherLinks\GatherLinks\Form1.cs:line 104
       at GatherLinks.Form1..ctor() in D:\C-Sharp\GatherLinks\GatherLinks\GatherLinks\Form1.cs:line 29
       at GatherLinks.Program.Main() in D:\C-Sharp\GatherLinks\GatherLinks\GatherLinks\Program.cs:line 18
       at System.AppDomain._nExecuteAssembly(Assembly assembly, String[] args)
       at System.AppDomain.ExecuteAssembly(String assemblyFile, Evidence assemblySecurity, String[] args)
       at Microsoft.VisualStudio.HostingProcess.HostProc.RunUsersAssembly()
       at System.Threading.ThreadHelper.ThreadStart_Context(Object state)
       at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
       at System.Threading.ThreadHelper.ThreadStart()


2012-05-16 user1398388

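It would help if you could highlight which line of 'getLinks' is line 55. – HackedByChinese

You need to figure out why your object reference is null, and make sure you check that an object is not null before you do anything with it. You have more work to do before we can help you. –

Maybe you found a page that doesn't have any links? –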
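That guess is easy to check in isolation. The following is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; `NoLinksDemo` and `CountLinks` are hypothetical names and the HTML strings are made up for illustration. It treats a null result from `SelectNodes` as "no matching nodes", which is what a page without any `<a href>` elements produces.

```csharp
using System;
using HtmlAgilityPack; // assumption: HtmlAgilityPack NuGet package is installed

static class NoLinksDemo
{
    // Counts <a href="..."> elements; a null result is treated as zero matches.
    public static int CountLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes yields null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@href]");
        return nodes == null ? 0 : nodes.Count;
    }

    static void Main()
    {
        // Illustrative inputs only: one page without links, one with a single link.
        Console.WriteLine(CountLinks("<html><body><p>No links here.</p></body></html>"));
        Console.WriteLine(CountLinks("<html><body><a href=\"http://example.com\">x</a></body></html>"));
    }
}
```

Iterating over that null result with foreach is what raises the NullReferenceException in getLinks.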

The problem appears to be that you are testing for null but then doing nothing about it, here:

  if (document.DocumentNode.SelectNodes("//a[@href]") == null) 
      { 
      }

I suspect you meant to handle the null case but haven't written the code to do so. You probably want something like this:

private List<string> getLinks(HtmlAgilityPack.HtmlDocument document) 
     { 
      List<string> mainLinks = new List<string>(); 
      if (document.DocumentNode.SelectNodes("//a[@href]") != null) 
      { 

       foreach (HtmlNode link in document.DocumentNode.SelectNodes("//a[@href]")) 
       { 
        var href = link.Attributes["href"].Value; 
        mainLinks.Add(href); 
       } 
      } 
      return mainLinks; 
     }

You might want to tidy that up into something more like:

private List<string> getLinks(HtmlAgilityPack.HtmlDocument document) 
     { 
      List<string> mainLinks = new List<string>(); 
      var linkNodes = document.DocumentNode.SelectNodes("//a[@href]"); 
      if (linkNodes != null) 
      { 
       foreach (HtmlNode link in linkNodes) 
       { 
        var href = link.Attributes["href"].Value; 
        mainLinks.Add(href); 
       } 
      } 
      return mainLinks; 
     }
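Separately from the null check, the crawler in the question only follows hrefs that already start with http:// or https://, so relative links such as /images are skipped. A possible shape for the planned FilterJunkLinks step is sketched below, using only the .NET base class library; `LinkFilter` and `ToAbsoluteHttpLinks` are hypothetical names, and the sample hrefs are made up. It resolves each href against the page URL and keeps only http and https results.

```csharp
using System;
using System.Collections.Generic;

static class LinkFilter
{
    // Resolves each href against pageUrl and keeps only http/https links.
    public static List<string> ToAbsoluteHttpLinks(string pageUrl, IEnumerable<string> hrefs)
    {
        var result = new List<string>();
        var baseUri = new Uri(pageUrl);

        foreach (string href in hrefs)
        {
            if (string.IsNullOrWhiteSpace(href)) continue;

            // Handles both absolute hrefs and hrefs relative to the page.
            Uri absolute;
            if (!Uri.TryCreate(baseUri, href.Trim(), out absolute)) continue;

            // Drop mailto:, javascript: and any other non-web schemes.
            if (absolute.Scheme == Uri.UriSchemeHttp || absolute.Scheme == Uri.UriSchemeHttps)
                result.Add(absolute.AbsoluteUri);
        }
        return result;
    }

    static void Main()
    {
        // Illustrative hrefs only.
        var hrefs = new[] { "/images", "https://example.com/a", "mailto:someone@example.com", "" };
        foreach (string link in ToAbsoluteHttpLinks("http://www.bing.com/search", hrefs))
            Console.WriteLine(link);
    }
}
```

The list returned by getLinks could be passed through a helper like this before the recursive call, together with the URL of the page it came from.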


2012-05-16 11:14:33 joocer

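joocer, I tried your code. I don't understand why, when I use this string, string url = @"http://www.bing.com/images/search?q=cat&go=&form=QB&qs=n"; I only get 7 links. When I browse to the site I see many more links, and there are lots of photos, each with its own link. So why only 7 links? Strange. – user1398388

I haven't read the page source in detail, so I'm only guessing, but I think the seven links are the seven at the top of the page (Web Images Videos Shopping News Maps More), and the rest of the page is either created dynamically by JavaScript or lives in an iFrame called historyFrame, which appears to hold the search results. – joocer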
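One way to test the iFrame half of that guess: HtmlAgilityPack only parses the HTML the server returns, so it does not run JavaScript and does not fetch frame contents, but the frame elements themselves are still in the markup. The sketch below assumes the HtmlAgilityPack NuGet package; `FrameProbe` and `GetFrameSources` are hypothetical names, and the sample HTML (including the /results path) is invented for illustration rather than taken from the real Bing page.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: HtmlAgilityPack NuGet package is installed

static class FrameProbe
{
    // Returns the src attribute of every <iframe> found in the given HTML.
    public static List<string> GetFrameSources(string html)
    {
        var sources = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // As with links, a null result means no iframe elements matched.
        HtmlNodeCollection frames = doc.DocumentNode.SelectNodes("//iframe[@src]");
        if (frames == null) return sources;

        foreach (HtmlNode frame in frames)
            sources.Add(frame.GetAttributeValue("src", string.Empty));
        return sources;
    }

    static void Main()
    {
        // Made-up markup standing in for a page that keeps its results in a frame.
        string html = "<html><body><a href=\"/web\">Web</a>" +
                      "<iframe id=\"historyFrame\" src=\"/results\"></iframe></body></html>";
        foreach (string src in GetFrameSources(html))
            Console.WriteLine(src);
    }
}
```

If frame sources do turn up, each one would have to be loaded and parsed as a separate page; links that JavaScript adds after load will not appear in the parsed document either way.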
