HTML to RichTextBox as plain text with hyperlinks

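After reading so much about not using RegExes for stripping HTML, I am wondering how to get some links into my RichTextBox without also getting all the messy HTML that is in the content I download from a newspaper site.

What I have: HTML from a newspaper website.

What I want: the article as plain text in a RichTextBox, but with links (that is, <a href="foo">bar</a> replaced by <Hyperlink NavigateUri="foo">bar</Hyperlink>).

HtmlAgilityPack gives me HtmlNode.InnerText (stripped of all HTML tags) and HtmlNode.InnerHtml (with all tags). I can get the URL and text of the links with articlenode.SelectNodes(".//a"), but how do I know where to insert them in the plain text of HtmlNode.InnerText?

Any hint would be appreciated.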


2013-06-03 Rokus

Here is how you can do it (with a sample console application, but the idea is the same for Silverlight):

Let's suppose you have this HTML:

<html> 
<head></head> 
<body> 
Link 1: <a href="foo1">bar</a> 
Link 2: <a href="foo2">bar2</a> 
</body> 
</html>

Then this code:

HtmlDocument doc = new HtmlDocument();
doc.Load(myFileHtm);

// by default SelectNodes returns null (not an empty collection) when nothing matches
HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a");
if (links != null)
{
    foreach (HtmlNode node in links)
    {
        // replace the <a> element in the DOM at the exact same place
        // by a deep cloned one, with a different name
        HtmlNode newNode = node.ParentNode.ReplaceChild(node.CloneNode("Hyperlink", true), node);

        // copy href into NavigateUri, then drop the original attribute
        newNode.SetAttributeValue("NavigateUri", newNode.GetAttributeValue("href", null));
        newNode.Attributes.Remove("href");
    }
}
doc.Save(Console.Out);

will output this:

<html> 
<head></head> 
<body> 
Link 1: <hyperlink navigateuri="foo1">bar</hyperlink> 
Link 2: <hyperlink navigateuri="foo2">bar2</hyperlink> 
</body> 
</html>
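
The original question asked how to know where each link belongs in the plain text. One possible way, sketched here and not part of the answer above, is to walk the node tree in document order and emit text and link segments as they are encountered, so the positions fall out of the traversal. The Segment and ArticleFlattener names are hypothetical; only the HtmlAgilityPack calls (ChildNodes, NodeType, InnerText, GetAttributeValue, HtmlEntity.DeEntitize) come from the library.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical segment type: Uri is null for plain text, non-null for a link.
public sealed class Segment
{
    public string Text { get; set; }
    public string Uri { get; set; }
}

public static class ArticleFlattener
{
    // Walks the subtree in document order, so each link keeps its place in the text.
    public static List<Segment> Flatten(HtmlNode root)
    {
        var segments = new List<Segment>();
        Walk(root, segments);
        return segments;
    }

    private static void Walk(HtmlNode node, List<Segment> output)
    {
        foreach (HtmlNode child in node.ChildNodes)
        {
            if (child.NodeType == HtmlNodeType.Text)
            {
                // Decode entities such as &amp; into plain characters.
                output.Add(new Segment { Text = HtmlEntity.DeEntitize(child.InnerText) });
            }
            else if (child.NodeType == HtmlNodeType.Element && child.Name == "a")
            {
                // A link becomes one segment carrying its text and its href.
                output.Add(new Segment
                {
                    Text = HtmlEntity.DeEntitize(child.InnerText),
                    Uri = child.GetAttributeValue("href", null)
                });
            }
            else if (child.NodeType == HtmlNodeType.Element
                     && child.Name != "script" && child.Name != "style")
            {
                // Any other element (p, div, ul, li, ...) contributes only its contents.
                Walk(child, output);
            }
            // Comments, scripts and styles are skipped.
        }
    }
}
```

In a Silverlight or WPF RichTextBox, each segment with a null Uri would presumably become a Run and each segment with a Uri a Hyperlink whose NavigateUri is set from it, added to the same Paragraph in order.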


2013-06-03 13:55:22

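Great! That works, thanks. But I still have to strip all the other HTML tags (img, ul, li, p, div ...) from my text. The regex '<[^a].*?>' matches all HTML tags except the links, but I also have to keep '</a>'. I don't know how to get an OR operator in there so that it matches every '<.*>' except '<a ...>' or '</a>'. – Rokus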

The answer to that question, by the way, would be '<(?!a|/a)[^>]+>'. I just figured it out. – Rokus
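
For readers who want to try the pattern from the comment, here is a minimal sketch using System.Text.RegularExpressions. The TagStripper class and StripAllButLinks method are hypothetical names chosen for illustration, and the IgnoreCase option is an assumption added so that upper-case tags are handled too.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Matches any tag whose name does not begin with "a" or "/a".
    private static readonly Regex NonLinkTag =
        new Regex(@"<(?!a|/a)[^>]+>", RegexOptions.IgnoreCase);

    // Removes every matched tag and leaves link tags and text in place.
    public static string StripAllButLinks(string html)
    {
        return NonLinkTag.Replace(html, string.Empty);
    }
}
```

Calling TagStripper.StripAllButLinks on <div><p>Link 1: <a href="foo1">bar</a></p><img src="x.png"></div> leaves Link 1: <a href="foo1">bar</a>. Note that the lookahead also spares any other tag whose name starts with "a" (abbr, article, aside and so on); a stricter variant such as '<(?!/?a\b)[^>]+>' might be preferable if those appear in the input.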
