2015-05-18

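I'm currently writing a script to parse bits of content out of HTML documents. How do I make a regex stop at the first occurrence of a string when that string appears multiple times?

Here is an example of the code I'm parsing: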

<div class="tab-content"> 
<div class="tab-pane fade in active" id="how-to-take"> 
<div class="panel-body"> 
<h3>What is Pantoprazole?</h3> 
Pantoprazole is a generic drug used to treat certain conditions where there is too much acid in the stomach. It is 
used to treat gastric and duodenal ulcers, erosive esophagitis, and gastroesophageal reflux disease (GERD). GERD is 
a condition where the acid in the stomach washes back up into the esophagus. <br/> Pantoprazole is a proton pump 
inhibitor (PPI). It works by decreasing the amount of acid produced by the stomach. 
<h3>How To Take</h3> 
Take the tablets 1 hour before a meal without chewing or breaking them and swallow them whole with some water 
</div> 
</div> 
<div class="tab-pane fade" id="alternative-treatments"> 
<div class="panel-body"> 
<h3>Alternatives</h3> 
Antacids taken as required Antacids are alkali liquids or tablets 
that can neutralise the stomach acid. A dose may give quick relief. 
There are many brands which you can buy. You can also get some on 
prescription. If you have mild or infrequent bouts of dyspepsia you 
may find that antacids used as required are all that you need.<br/> 
</div> 
</div> 
<div class="tab-pane fade" id="side-effects"> 
<div class="panel-body"> 
<p>Most people who take acid reflux medication do not have any side-effects. 
However, side-effects occur in a small number of users. The most 
common side-effects are:</p> 
<ul> 

I'm trying to extract all of the content inside:

<div class="tab-pane fade in active" id="how-to-take"> 
<div class="panel-body"> 

</div> 

I have written the following regex:

<div class="tab-pane fade in active" id="how-to-take">\n<div class="panel-body">\n(.*?[\s\S]+)\n(?:<\/div>) 

and have also tried:

<div class="tab-pane fade in active" id="how-to-take">\n<div class="panel-body">\n(.*?[\s\S]+)\n<\/div> 

But it doesn't stop at the first </div>; it keeps going until the last </div> in the code.

+3

[Don't use regex to parse HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). This is easy to do with `HtmlAgilityPack`. –

+0

This software is internal only, and I just want to get it done quickly :). It won't be used again after I've run it :) – user1838222

+1

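[How to use HTML Agility Pack](http://stackoverflow.com/questions/846994/how-to-use-html-agility-pack). Here is the regex you are looking for, but you really should use a parser: `(?s)<div class="tab-pane fade in active" id="how-to-take">\s*<div class="panel-body">\s*((?:(?!</div>).)*?)\s*</div>` –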

Answers

3

Don't use regex to parse HTML. You can use HtmlAgilityPack.

Then this works as desired:

var doc = new HtmlAgilityPack.HtmlDocument(); 
doc.LoadHtml(File.ReadAllText("Path")); 
var divPanelBody = doc.DocumentNode.SelectSingleNode("//div[@class='panel-body']"); 
string text = divPanelBody.InnerText.Trim(); // null check omitted 

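Result:

What is Pantoprazole?
Pantoprazole is a generic drug used to treat certain conditions where there is too much acid in the stomach. It is
used to treat gastric and duodenal ulcers, erosive esophagitis, and gastroesophageal reflux disease (GERD). GERD is
a condition where the acid in the stomach washes back up into the esophagus. Pantoprazole is a proton pump
inhibitor (PPI). It works by decreasing the amount of acid produced by the stomach.
How To Take
Take the tablets 1 hour before a meal without chewing or breaking them and swallow them whole with some water

Here is another approach using LINQ, which I prefer over XPath syntax: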

var divPanelBody = doc.DocumentNode.Descendants("div") 
    .FirstOrDefault(d => d.GetAttributeValue("class", "") == "panel-body"); 

Note that both approaches are case-sensitive, so they won't find Panel-Body. You can easily make the last approach case-insensitive:

var divPanelBody = doc.DocumentNode.Descendants("div") 
    .FirstOrDefault(d => d.GetAttributeValue("class", "").Equals("panel-body", StringComparison.InvariantCultureIgnoreCase)); 

0

You can do this by using HtmlAgilityPack:

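// Requires: using System.Text; using HtmlAgilityPack;
public string GetInnerHtml(string html)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
    // SelectNodes returns null (not an empty collection) when nothing matches.
    var nodes = doc.DocumentNode.SelectNodes("//div[@class=\"panel-body\"]");
    if (nodes == null)
    {
        return string.Empty;
    }
    StringBuilder sb = new StringBuilder();
    foreach (var n in nodes)
    {
        sb.Append(n.InnerHtml); // concatenate the inner HTML of every panel-body div
    }
    return sb.ToString();
}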
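
For completeness, the likely reason the pattern in the question runs on to the last </div> is that `[\s\S]+` is a greedy quantifier: it consumes as much as it can and then backtracks only as far as the final closing tag. The lazy `.*?` in front of it does not help, because the greedy part still swallows the rest. The sketch below compares the question's capture group with a lazy `[\s\S]*?`, using only `System.Text.RegularExpressions`. The class name `LazyMatchDemo`, the helper `Extract`, and the shortened sample markup are made up for illustration, and `\s*` replaces the literal `\n` separators so that trailing spaces do not break the match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyMatchDemo
{
    // Hypothetical helper: returns capture group 1, or null if the pattern does not match.
    public static string Extract(string html, string pattern)
    {
        Match m = Regex.Match(html, pattern);
        return m.Success ? m.Groups[1].Value : null;
    }

    // Opening tags shared by both patterns.
    public const string Open =
        "<div class=\"tab-pane fade in active\" id=\"how-to-take\">\\s*" +
        "<div class=\"panel-body\">\\s*";

    // Capture group from the question: [\s\S]+ is greedy, so the match
    // extends to the LAST </div> in the input.
    public const string Greedy = Open + "(.*?[\\s\\S]+)\\s*</div>";

    // Lazy variant: [\s\S]*? expands one character at a time, so the match
    // ends at the FIRST </div> that follows the opening tags.
    public const string Lazy = Open + "([\\s\\S]*?)\\s*</div>";

    // Shortened sample markup (an assumption, modelled on the question's HTML).
    public const string Sample =
        "<div class=\"tab-pane fade in active\" id=\"how-to-take\">\n" +
        "<div class=\"panel-body\">\n" +
        "<h3>How To Take</h3>\n" +
        "Take the tablets\n" +
        "</div>\n" +
        "</div>\n" +
        "<div class=\"tab-pane fade\" id=\"side-effects\">\n" +
        "<div class=\"panel-body\">\n" +
        "Side effects\n" +
        "</div>\n" +
        "</div>";

    public static void Main()
    {
        Console.WriteLine("Lazy capture:");
        Console.WriteLine(Extract(Sample, Lazy));
        Console.WriteLine("Greedy capture:");
        Console.WriteLine(Extract(Sample, Greedy));
    }
}
```

With this sample, the lazy capture stops before the first closing tag, while the greedy one also pulls in the side-effects tab. Note that even the lazy pattern would cut the content short if the panel body itself contained a nested div, which is one reason the answers above recommend a parser instead.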