2011-06-14 88 views
3

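HtmlAgilityPack: iterate over all text nodes only

Here is an HTML snippet. All I want is to get the text nodes and iterate over them. Please let me know how. Thanks.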

<div> 
    <div> 
     Select your Age: 
     <select> 
      <option>0 to 10</option> 
      <option>20 and above</option> 
     </select> 
    </div> 
    <div> 
     Help/Hints: 
     <ul> 
      <li>This is required field. 
      <li>Make sure select the right age. 
     </ul> 
     <a href="#">Learn More</a> 
    </div> 
</div> 

Desired result:

  1. Select your Age:
  2. 0 to 10
  3. 20 and above
  4. Help/Hints:
  5. This is required field.
  6. Make sure select the right age.
  7. Learn More

Answers

17

Something like this:

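HtmlDocument doc = new HtmlDocument();
doc.Load(yourHtmlFile);

// Select every text node that is not whitespace-only.
// Note: SelectNodes returns null when nothing matches.
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//text()[normalize-space(.) != '']"))
{
    Console.WriteLine(node.InnerText.Trim());
}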

This will output:

Select your Age: 
0 to 10 
20 and above 
Help/Hints: 
This is required field. 
Make sure select the right age. 
Learn More 
+0

Works great... thanks. – 2011-06-14 17:23:44

1

I tested @Simon Mourier's answer on the Google home page and got a lot of CSS and JavaScript in the output, so I added an extra filter to remove it:

// Requires: using System.Linq; using System.Text;
public string getBodyText(string html)
{
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);

    // Remove script & style nodes
    doc.DocumentNode.Descendants()
        .Where(n => n.Name == "script" || n.Name == "style")
        .ToList()
        .ForEach(n => n.Remove());

    // Simon Mourier's answer
    var nodes = doc.DocumentNode.SelectNodes("//text()[normalize-space(.) != '']");

    // SelectNodes returns null when nothing matches, so check for it
    // explicitly instead of swallowing every exception.
    if (nodes == null)
    {
        return "";
    }

    var sb = new StringBuilder();
    foreach (HtmlAgilityPack.HtmlNode node in nodes)
    {
        sb.Append(node.InnerText.Trim()).Append(' ');
    }

    return sb.ToString();
}