How to extract the text of a span tag using C#

I am trying to extract the text from this tag on a website using C#:

<span id="example1">sometext</span>

I have this code to fetch the page data:

using System;
using System.Net;
using HtmlAgilityPack;

namespace GC_data_console
{
    class Program
    {
        public static void Main(string[] args)
        {
            using (var client = new WebClient())
            {
                // Download the HTML
                string html = client.DownloadString("https://www.requestedwebsite.com");

                HtmlDocument doc = new HtmlDocument();
                doc.LoadHtml(html);

                foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//span"))
                {
                    HtmlAttribute href = link.Attributes["id='example1'"];

                    if (href != null)
                    {
                        Console.WriteLine(href.Value.ToString());
                        Console.ReadLine();
                    }
                }
            }
        }
    }
}

But I still do not get the text "sometext".

However, if I use HtmlAttribute href = link.Attributes["id"]; instead, I get all the id names.

What am I doing wrong?

2017-04-09 Shiwers

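Can you share the actual URL you are trying to get content from? Also, you are reading the value of an HtmlAttribute rather than the element. What you need is link.InnerText.

Hi, for example from this page https://www.geocaching.com/geocache/GC257YR_slivercup-studios-east , I want to get the text from the tag: <span id="ctl00_ContentBody_CacheName">SliverCup Studios East</span> – Shiwers

Got it. Did you try the other approach I suggested? Have you also debugged it and checked that you are getting the right element?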

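You first need to understand the difference between an HtmlNode and an HtmlAttribute. Your code is nowhere near solving the problem.

An HtmlNode represents a tag used in HTML, such as span, div, p, a, and so on. An HtmlAttribute represents an attribute of such a node: for example, href is used on a, while style, class, id, and name can be used on almost every HTML tag.

In the following HTML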

<span id="firstName" style="color:#232323">Some Firstname</span>

span is the HtmlNode, while id and style are HtmlAttributes. You can get the value Some Firstname through the HtmlNode.InnerText property.
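
As a minimal sketch of this distinction, assuming the HtmlAgilityPack NuGet package is referenced, the following parses the sample span from a string and reads both its attributes and its inner text. The names NodeVsAttributeDemo and Describe are hypothetical and exist only for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class NodeVsAttributeDemo
{
    // Hypothetical helper: returns "id|style|text" for the first span in the given HTML,
    // or null when the HTML contains no span.
    public static string Describe(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // The span element itself is an HtmlNode.
        HtmlNode span = doc.DocumentNode.SelectSingleNode("//span");
        if (span == null)
        {
            return null;
        }

        // id and style are HtmlAttributes of that node.
        // GetAttributeValue returns the fallback ("") when the attribute is absent.
        string id = span.GetAttributeValue("id", "");
        string style = span.GetAttributeValue("style", "");

        // InnerText is the text between the opening and closing tags.
        return id + "|" + style + "|" + span.InnerText;
    }

    public static void Main()
    {
        string html = "<span id=\"firstName\" style=\"color:#232323\">Some Firstname</span>";
        Console.WriteLine(Describe(html));
    }
}
```

The attribute values describe the tag, whereas InnerText is the content the tag wraps, which is what the question is after.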

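Also, selecting an HtmlNode from an HtmlDocument is not that simple. You need to provide a suitable XPath expression to select the node you want.

Now, in your code, if you want to get the text written in <span id="ctl00_ContentBody_CacheName">SliverCup Studios East</span> (which is part of the HTML of someurl.com), you need to write the following code.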

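// Requires: using System; using System.Linq; using System.Net; using HtmlAgilityPack;
using (var client = new WebClient())
{
    string html = client.DownloadString("https://www.someurl.com");

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    // Select every span element in the document.
    // SelectNodes can return null when nothing matches, so guard before filtering.
    var spans = doc.DocumentNode.SelectNodes("//span");
    if (spans != null)
    {
        // Keep only the spans whose id attribute is "ctl00_ContentBody_CacheName".
        var nodes = spans
            .Where(d => d.Attributes.Contains("id"))
            .Where(d => d.Attributes["id"].Value == "ctl00_ContentBody_CacheName");

        foreach (HtmlNode node in nodes)
        {
            // InnerText is the text between the opening and closing tags.
            Console.WriteLine(node.InnerText);
        }
    }
}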

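The code above uses the XPath //span, which selects every span element anywhere in the document, and then filters the result by id. To target a node more precisely, for example one nested inside a particular part of the hierarchy, you need a more specific XPath expression.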
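
As a sketch of selecting the node directly with XPath instead of filtering in LINQ, the following uses an attribute predicate with SelectSingleNode. It assumes the HtmlAgilityPack NuGet package is referenced, and it parses an inline HTML string rather than downloading a page so that it runs on its own. The names SpanTextById and GetSpanText, and the sample markup around the span, are hypothetical.

```csharp
using System;
using HtmlAgilityPack;

public static class SpanTextById
{
    // Hypothetical helper: returns the inner text of the span with the given id,
    // or null when no such span exists.
    public static string GetSpanText(string html, string id)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // //span[@id='...'] matches a span at any depth whose id attribute equals the value.
        // Assumption: id contains no single quote, since it is spliced into the XPath string.
        HtmlNode node = doc.DocumentNode.SelectSingleNode("//span[@id='" + id + "']");

        // SelectSingleNode returns null when nothing matches.
        return node?.InnerText;
    }

    public static void Main()
    {
        // Illustrative markup: the target span is nested inside other elements.
        string html = "<div><p><span id=\"ctl00_ContentBody_CacheName\">SliverCup Studios East</span></p></div>";
        Console.WriteLine(GetSpanText(html, "ctl00_ContentBody_CacheName"));
    }
}
```

Putting the id test inside the XPath removes the need for System.Linq and the separate null guard on the node collection, at the cost of building the expression from a string.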

This should help you solve your problem.

2017-04-09 15:03:53

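Thanks! This solved my problem, and thanks for the explanation too. It has been a long time since I created anything in HTML. Next I will have to somehow "log in" through WebClient so that I can store data that is only provided to logged-in users, but I will do that in the future. – Shiwers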
