Using a regular expression to get the text between multiple HTML tags

Using a regular expression, I want to be able to get the text between multiple DIV tags. For example, the following:

<div>first html tag</div> 
<div>another tag</div>

would output:

first html tag 
another tag

The regex pattern I am using matches only my last div tag and misses the first one. Code:

static void Main(string[] args) 
    { 
     string input = "<div>This is a test</div><div class=\"something\">This is ANOTHER test</div>"; 
     string pattern = "(<div.*>)(.*)(<\\/div>)"; 

     MatchCollection matches = Regex.Matches(input, pattern); 
     Console.WriteLine("Matches found: {0}", matches.Count); 

     if (matches.Count > 0) 
      foreach (Match m in matches) 
       Console.WriteLine("Inner DIV: {0}", m.Groups[2]); 

     Console.ReadLine(); 
    }

Output:

Matches found: 1
Inner DIV: This is ANOTHER test


2013-04-14 ben

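Is it imperative that you use regex for this task? HTML is a context-free grammar and cannot be parsed with regular expressions. Usually you can get close, but you would be better off using an HTML parser. See http://stackoverflow.com/a/1732454/2022565 –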

Replace your pattern with a non-greedy match:

static void Main(string[] args) 
{ 
    string input = "<div>This is a test</div><div class=\"something\">This is ANOTHER test</div>"; 
    string pattern = "<div.*?>(.*?)<\\/div>"; 

    MatchCollection matches = Regex.Matches(input, pattern); 
    Console.WriteLine("Matches found: {0}", matches.Count); 

    if (matches.Count > 0) 
        foreach (Match m in matches) 
            Console.WriteLine("Inner DIV: {0}", m.Groups[1]); 

    Console.ReadLine(); 
}
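To make the difference between the greedy pattern from the question and the non-greedy one concrete, here is a minimal sketch. The class `GreedyVsLazy` and its `Captures` helper are hypothetical names introduced only for this illustration; the patterns and the input string are the ones used in this thread.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GreedyVsLazy
{
    // Hypothetical helper: returns the text captured by the given
    // group number for every match of the pattern in the input.
    public static List<string> Captures(string input, string pattern, int group)
    {
        var found = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
            found.Add(m.Groups[group].Value);
        return found;
    }
}
```

Called with the input from the question, `Captures(input, @"(<div.*>)(.*)(</div>)", 2)` yields a single entry, `This is ANOTHER test`, because the greedy `.*` inside the first group runs past the first closing tag and swallows the whole first div. `Captures(input, @"<div.*?>(.*?)</div>", 1)` yields both `This is a test` and `This is ANOTHER test`.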


2013-04-14 23:19:07 coolmine

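It finds two matches, but it displays empty value(s) in my program – ben

The code above should work. Note that it is m.Groups[1] rather than m.Groups[2], because I see no reason to capture the tags themselves. http://www.rubular.com/r/XQrcobmfAK – coolmine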

First, remember that in an HTML file you will have newline characters ("\n"), which you did not include in the string you are using to check your regex.
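A small sketch of why the newline matters: in .NET, `.` does not match "\n" unless `RegexOptions.Singleline` is set. The class `NewlineDemo` and the method `CountMatches` are hypothetical names for this illustration, and the pattern is the `<div[^>]*>(.*?)</div>` form that appears in a later answer in this thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NewlineDemo
{
    // Hypothetical helper: counts how many div elements the pattern
    // finds in the given HTML under the supplied regex options.
    public static int CountMatches(string html, RegexOptions options)
    {
        return Regex.Matches(html, "<div[^>]*>(.*?)</div>", options).Count;
    }
}
```

For the input `"<div>first\nhtml tag</div>\n<div>another tag</div>"`, `CountMatches(html, RegexOptions.None)` returns 1, because the text of the first div spans a line break that `.` cannot cross. With `RegexOptions.Singleline` it returns 2.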

Second, take your regex:

((<div.*>)(.*)(<\\/div>))+ //This Regex will look for any amount of div tags, but it must see at least one div tag. 

((<div.*>)(.*)(<\\/div>))* //This regex will look for any amount of div tags, and it will not complain if there are no results at all.

These are also good places to look for this kind of information:

http://www.regular-expressions.info/reference.html

http://www.regular-expressions.info/refadv.html

Mayman


2013-04-14 23:20:19 Mayman

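The short version is that you cannot do this correctly in all situations. There will always be some valid HTML for which a regular expression fails to extract the information you want.

The reason is that HTML is a context-free grammar, which is more complex than what regular expressions can describe.

Here is an example: what if you have multiple nested divs?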

<div><div>stuff</div><div>stuff2</div></div>

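Depending on how the pattern is written, the regexes listed in the other answers can grab any of the following:

<div><div>stuff</div> 
<div>stuff</div> 
<div>stuff</div><div>stuff2</div> 
<div>stuff</div><div>stuff2</div></div> 
<div>stuff2</div> 
<div>stuff2</div></div>

That is simply what happens when regular expressions try to parse HTML.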

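You cannot write a regex that knows how to interpret all of these cases, because regular expressions are not capable of it. If you are dealing with a very specific, constrained set of HTML it may be possible, but you should keep this limitation in mind.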
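As a rough illustration of the nesting problem, this sketch runs a non-greedy pattern of the kind suggested elsewhere in this thread over the nested example. `NestedDivDemo` and `InnerTexts` are hypothetical names introduced here, not part of any answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NestedDivDemo
{
    // Hypothetical helper: applies a non-greedy div pattern and
    // returns whatever group 1 captured for each match.
    public static List<string> InnerTexts(string html)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(html, "<div[^>]*>(.*?)</div>"))
            result.Add(m.Groups[1].Value);
        return result;
    }
}
```

For `<div><div>stuff</div><div>stuff2</div></div>` this returns two entries, `<div>stuff` and `stuff2`. The outer opening tag gets paired with the first closing tag, so the first capture contains a stray `<div>` and the outer element is never seen as a whole.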


2013-04-14 23:28:30

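Have you looked at the Html Agility Pack (see https://stackoverflow.com/a/857926/618649)?

CsQuery also looks useful (it basically uses CSS-selector-style syntax to get elements). See https://stackoverflow.com/a/11090816/618649.

CsQuery is basically "jQuery for C#", which is pretty much the exact search term I used to find it.

If you could do this in a web browser, you could easily use jQuery, with syntax similar to $("div").each(function(idx){ alert(idx + ": " + $(this).text()); }) (except that you would obviously write the result to a log or the screen, make a web service call with it, or do whatever else you need).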


2013-04-15 01:55:31 Craig

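A downvote without any explanation or comment. Thanks! The fact is that HTML/XML is painful to process with regular expressions. It is not that you cannot do it, and I have on many occasions, but CSS selector syntax is a much cleaner proposition. – Craig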

I think this code should work:

string htmlSource = "<div>first html tag</div><div>another tag</div>"; 
string pattern = @"<div[^>]*?>(.*?)</div>"; 
MatchCollection matches = Regex.Matches(htmlSource, pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline); 
ArrayList l = new ArrayList(); 
foreach (Match match in matches) 
{ 
    l.Add(match.Groups[1].Value); 
}


2014-07-15 03:12:09

Since the others did not mention HTML tags with attributes, here is my solution for handling them:

// <TAG(.*?)>(.*?)</TAG> 
// Example 
var regex = new System.Text.RegularExpressions.Regex("<h1(.*?)>(.*?)</h1>"); 
var m = regex.Match("Hello <h1 style='color: red;'>World</h1> !!"); 
Console.Write(m.Groups[2].Value); // will print -> World
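The same idea can be extended to collect every occurrence of a tag instead of only the first. The sketch below is one possible way to do it; `TagText` and `Extract` are hypothetical names, and the pattern deliberately uses `[^>]*` for the attributes, which assumes that no attribute value contains a `>` character.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagText
{
    // Hypothetical helper: returns the inner text of every <tag ...>...</tag>
    // pair found in html. It does not handle nested tags of the same name.
    public static List<string> Extract(string html, string tag)
    {
        string name = Regex.Escape(tag);
        // \b keeps "h1" from also matching "h10"; [^>]* skips any attributes.
        string pattern = "<" + name + @"\b[^>]*>(.*?)</" + name + ">";
        var options = RegexOptions.IgnoreCase | RegexOptions.Singleline;

        var result = new List<string>();
        foreach (Match m in Regex.Matches(html, pattern, options))
            result.Add(m.Groups[1].Value);
        return result;
    }
}
```

For example, `TagText.Extract("Hello <h1 style='color: red;'>World</h1> and <H1>Again</H1>", "h1")` returns `World` and `Again`.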


2016-10-01 11:58:41
