Getting the content between two HTML tags using Html Agility Pack

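We created an absolutely massive help document in Word, which was then used to generate an even more massive and unwieldy HTM document. Using C# and this library, I want to grab and display just one section of this file at any point in my application. The sections are split up like this: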

<!--logical section starts here --> 
<div> 
<h1><span style='mso-spacerun:yes'></span><a name="_Toc325456104">Section A</a></h1> 
</div> 
<div> Lots of unnecessary markup for simple formatting... </div> 
..... 
<!--logical section ends here --> 

<div> 
<h1><span style='mso-spacerun:yes'></span><a name="_Toc325456104">Section B</a></h1> 
</div>

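Logically speaking, there is an H1 with the section name in an a tag. I want to select everything starting from the outer containing div until I encounter another h1, excluding that h1's div.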

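Each section name is in an <a> tag under an h1 that has multiple children (about 6 each)
The logical sections are marked with comments
These comments do not exist in the actual document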

My attempt:

var startNode = helpDocument.DocumentNode.SelectSingleNode("//h1/a[contains(., '"+sectionName+"')]"); 
//go up one level from the a node to the h1 element 
startNode=startNode.ParentNode; 

//get the start index as the index of the div containing the h1 element 
int startNodeIndex = startNode.ParentNode.ChildNodes.IndexOf(startNode); 

//here I am not sure how to get the endNode location. 
var endNode =?; 

int endNodeIndex = endNode.ParentNode.ChildNodes.IndexOf(endNode); 

//select everything from the start index to the end index 
var nodes = startNode.ParentNode.ChildNodes.Where((n, index) => index >= startNodeIndex && index <= endNodeIndex).Select(n => n);

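Since I haven't been able to find any documentation on this, I don't know how to get from my start node to the next h1 element. Any suggestions would be appreciated.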

Source

2012-05-29 Rondel

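I think this will do it, although it assumes that H1 tags only appear in section headers. If that is not the case, you can add a Where on the Descendants call to apply other filters to any H1 nodes it finds. Note that this will include every sibling after the div it finds, up to the next sibling that has a section name.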

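// Requires: using System; using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
private List<HtmlNode> GetSection(HtmlDocument helpDocument, string SectionName)
{
    // Find the first div whose text is the section name.
    // Trim() drops the whitespace that surrounds the h1 inside the div.
    HtmlNode startNode = helpDocument.DocumentNode.Descendants("div")
        .FirstOrDefault(d => d.InnerText.Trim().Equals(SectionName, StringComparison.InvariantCultureIgnoreCase));
    if (startNode == null)
        return null; // section not found

    // Collect the following siblings until one contains an h1 (the next section header).
    List<HtmlNode> section = new List<HtmlNode>();
    HtmlNode sibling = startNode.NextSibling;
    while (sibling != null && !sibling.Descendants("h1").Any())
    {
        section.Add(sibling);
        sibling = sibling.NextSibling;
    }

    return section;
}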
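
For the case where H1 tags also show up outside section headers, here is a minimal sketch of the extra filter mentioned above. It assumes a real section header is an h1 that contains an anchor whose name starts with "_Toc", as in the sample markup from the question; the class and method names (SectionHeaderFilter, IsSectionHeader, CollectUntilNextHeader) are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class SectionHeaderFilter
{
    // Assumption: a section header is an h1 containing <a name="_Toc...">.
    public static bool IsSectionHeader(HtmlNode h1)
    {
        return h1.Descendants("a")
                 .Any(a => a.GetAttributeValue("name", "")
                            .StartsWith("_Toc", StringComparison.Ordinal));
    }

    // Same sibling walk as GetSection, but an h1 only ends the section
    // when it passes the IsSectionHeader filter.
    public static List<HtmlNode> CollectUntilNextHeader(HtmlNode startNode)
    {
        var section = new List<HtmlNode>();
        for (HtmlNode sibling = startNode.NextSibling; sibling != null; sibling = sibling.NextSibling)
        {
            if (sibling.Descendants("h1").Any(IsSectionHeader))
                break; // reached the div that wraps the next section header
            section.Add(sibling);
        }
        return section;
    }
}
```

Like GetSection, this only looks at descendants of each sibling, so it relies on every header h1 being wrapped in a div.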

Source

2012-05-29 23:19:16

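Nice. I had to modify the filter slightly because the section names appear more than once in my document. I ended up using 'HtmlNode startNode = helpDocument.DocumentNode.Descendants("h1").Where(d => d.InnerText.Contains(SectionName)).FirstOrDefault();' and moving up to the parent node from there. The rest worked perfectly. Thanks – Rondel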

Excellent. I'm glad it worked. –
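
Putting the modified filter from the comment together with the sibling walk from the answer gives roughly the following. This is only a sketch: it assumes each h1 sits directly inside its wrapping div, and the names HelpSections and GetSectionByHeading are invented here.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class HelpSections
{
    public static List<HtmlNode> GetSectionByHeading(HtmlDocument helpDocument, string sectionName)
    {
        // Match on the h1 text instead of the div text.
        HtmlNode heading = helpDocument.DocumentNode.Descendants("h1")
            .FirstOrDefault(h => h.InnerText.Contains(sectionName));
        if (heading == null)
            return null; // section not found

        // Assumption: the parent of the h1 is the div that wraps the header.
        HtmlNode startNode = heading.ParentNode;

        // Walk the following siblings until one contains the next h1.
        var section = new List<HtmlNode>();
        HtmlNode sibling = startNode.NextSibling;
        while (sibling != null && !sibling.Descendants("h1").Any())
        {
            section.Add(sibling);
            sibling = sibling.NextSibling;
        }
        return section;
    }
}
```

As in the answer, the header div itself is not part of the result; adding startNode to the list before the loop would include it.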

So, is the result you actually want the div around the h1 tag? If so, this should work.

helpDocument.DocumentNode.SelectSingleNode("//h1/a[contains(@name, '"+sectionName+"')]/ancestor::div");

Depending on your HTML, this also works with SelectNodes. Like this:

helpDocument.DocumentNode.SelectNodes("//h1/a[starts-with(@name,'_Toc')]/ancestor::div");

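Oh, and while testing this I found that the thing that did not work for me was the dot in the contains method. Once I changed it to the name of the attribute, everything worked fine.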
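
For anyone comparing the two forms: in XPath 1.0 the dot in contains(., ...) stands for the string value of the a element (its text, such as "Section A"), while @name is the anchor's attribute value (such as "_Toc325456104"), so the two predicates are matched against different strings. Here is a small sketch for trying both against your own markup; XPathProbe and Matches are hypothetical helper names, and no result is claimed for the dot form, since the test above reports it did not work.

```csharp
using HtmlAgilityPack;

public static class XPathProbe
{
    // Returns true when the XPath expression selects at least one node.
    public static bool Matches(HtmlDocument doc, string xpath)
    {
        return doc.DocumentNode.SelectSingleNode(xpath) != null;
    }

    public static HtmlDocument LoadSample()
    {
        var doc = new HtmlDocument();
        // Same header shape as the sample markup in the question.
        doc.LoadHtml("<div><h1><a name=\"_Toc325456104\">Section A</a></h1></div>");
        return doc;
    }
}
```

Calling XPathProbe.Matches(doc, "//h1/a[contains(@name, '_Toc')]") and XPathProbe.Matches(doc, "//h1/a[contains(., 'Section A')]") on the same document shows which predicate your setup accepts.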

Source

2012-05-29 22:33:56 shriek

Not quite. I do want the div around the 'h1' tag, but I also want all the following divs/spans up to the div around the next 'h1' tag. Thanks though. – Rondel
