Remove HTML tags from a string, including &nbsp;, in C#

How can I remove all HTML tags, including &nbsp;, using a regex in C#? My string looks like this:

"<div>hello</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div>"


2013-10-22 rampuriyaaa

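Don't use a regex; check out the HTML Agility Pack. http://stackoverflow.com/questions/846994/how-to-use-html-agility-pack – Tim

Thanks Tim, but the application is quite large and already complete, so adding or downloading the HTML Agility Pack won't work. – rampuriyaaa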

172

If you can't use an HTML parser-based solution to filter out the tags, here is a simple regex for it.

string noHTML = Regex.Replace(inputHTML, @"<[^>]+>|&nbsp;", "").Trim();

Ideally, you should make another pass through a regex filter that takes care of multiple spaces, as in

string noHTMLNormalised = Regex.Replace(noHTML, @"\s{2,}", " ");


2013-10-22 17:08:21

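I haven't tested this as much as I need to yet, but it works better than I expected. I'll post the method I wrote below. –

A lazy match (<[^>]+?> as per @David S.) might make this a tiny bit faster, but I just used this solution in a live project and am very happy with it :) +1 –

Regex.Replace(inputHTML, @"<[^>]+>|&nbsp|\n;", "").Trim(); does not remove \n –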
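A note on the comment above: in the pattern as quoted, the semicolon sits after \n rather than after &nbsp, so the last alternative only matches a newline that is immediately followed by ";". Below is a minimal sketch of a variant that does remove line breaks, assuming that dropping them entirely is what is wanted. The class and method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StripWithNewlines
{
    // Removes tags, the literal entity "&nbsp;" and CR/LF line breaks.
    public static string Strip(string inputHtml)
    {
        return Regex.Replace(inputHtml, @"<[^>]+>|&nbsp;|\r?\n", "").Trim();
    }
}
```

Because the line breaks are replaced with nothing, text on adjacent lines runs together. Replacing with a space and collapsing whitespace afterwards avoids that.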

This:

(<.+?>|&nbsp;)

will match any tag or &nbsp;

string regex = @"(<.+?>|&nbsp;)"; 
var x = Regex.Replace(originalString, regex, "").Trim();

so x = hello
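One difference from the [^>] version is worth knowing: in .NET, "." does not match a newline unless RegexOptions.Singleline is set, so a tag whose attributes wrap onto a second line is skipped by <.+?>. The sketch below compares the three variants on a made-up sample string. The class and method names are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotVersusNegatedClass
{
    // A tag whose attributes are split across two lines (hypothetical input).
    public const string Sample = "<div\nclass=\"x\">hello</div>";

    // "." stops at '\n', so the opening tag above is left in place.
    public static string WithDot(string html)
    {
        return Regex.Replace(html, @"(<.+?>|&nbsp;)", "").Trim();
    }

    // With Singleline, "." also matches '\n' and the whole tag is removed.
    public static string WithDotSingleline(string html)
    {
        return Regex.Replace(html, @"(<.+?>|&nbsp;)", "", RegexOptions.Singleline).Trim();
    }

    // A negated character class matches '\n' without any option.
    public static string WithNegatedClass(string html)
    {
        return Regex.Replace(html, @"<[^>]+>|&nbsp;", "").Trim();
    }
}
```

On the sample, WithDot leaves the opening tag behind, while the other two return just the text.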


2013-10-22 17:08:10 Jonesopolis

I've been using this function for a while. It removes pretty much any messy HTML you can throw at it and leaves the text intact.

     private static readonly Regex _tags_ = new Regex(@"<[^>]+?>", RegexOptions.Multiline | RegexOptions.Compiled); 

     // add characters that should not be removed to this regex 
     private static readonly Regex _notOkCharacter_ = new Regex(@"[^\w;&#@.:/\\?=|%!() -]", RegexOptions.Compiled); 

     public static String UnHtml(String html) 
     { 
      html = HttpUtility.UrlDecode(html); 
      html = HttpUtility.HtmlDecode(html); 

      html = RemoveTag(html, "<!--", "-->"); 
      html = RemoveTag(html, "<script", "</script>"); 
      html = RemoveTag(html, "<style", "</style>"); 

      //replace matches of these regexes with space 
      html = _tags_.Replace(html, " "); 
      html = _notOkCharacter_.Replace(html, " "); 
      html = SingleSpacedTrim(html); 

      return html; 
     } 

     private static String RemoveTag(String html, String startTag, String endTag) 
     { 
      Boolean bAgain; 
      do 
      { 
       bAgain = false; 
       Int32 startTagPos = html.IndexOf(startTag, 0, StringComparison.CurrentCultureIgnoreCase); 
       if (startTagPos < 0) 
        continue; 
       Int32 endTagPos = html.IndexOf(endTag, startTagPos + 1, StringComparison.CurrentCultureIgnoreCase); 
       if (endTagPos <= startTagPos) 
        continue; 
       html = html.Remove(startTagPos, endTagPos - startTagPos + endTag.Length); 
       bAgain = true; 
      } while (bAgain); 
      return html; 
     } 

     private static String SingleSpacedTrim(String inString) 
     { 
      StringBuilder sb = new StringBuilder(); 
      Boolean inBlanks = false; 
      foreach (Char c in inString) 
      { 
       switch (c) 
       { 
        case '\r': 
        case '\n': 
        case '\t': 
        case ' ': 
         if (!inBlanks) 
         { 
          inBlanks = true; 
          sb.Append(' '); 
         } 
         continue; 
        default: 
         inBlanks = false; 
         sb.Append(c); 
         break; 
       } 
      } 
      return sb.ToString().Trim(); 
     }


2013-10-22 17:14:30

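Just to confirm: does the SingleSpacedTrim() function do the same thing as string noHTMLNormalised = Regex.Replace(noHTML, @"\s{2,}", " "); from Ravi Thapliyal's answer? – Jimmy

@Jimmy As far as I can see, that regex does not catch single tabs or newlines the way SingleSpacedTrim() does. That may be the desired effect, though; in that case just remove those cases as needed. –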
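To make the difference in the two comments above concrete, here is a small sketch comparing \s{2,} with \s+ on a made-up input. The class and method names are illustrative. Note that \s+ is only roughly equivalent to SingleSpacedTrim(): \s also matches whitespace characters such as the non-breaking space, which SingleSpacedTrim() does not treat as blanks.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceCollapse
{
    // Collapses only runs of two or more whitespace characters;
    // a lone tab or newline is left as it is.
    public static string TwoOrMore(string text)
    {
        return Regex.Replace(text, @"\s{2,}", " ");
    }

    // Collapses every run of whitespace, including a single tab or
    // newline, to one space and trims the ends.
    public static string OneOrMore(string text)
    {
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

For the input "a\tb\n\nc", TwoOrMore keeps the tab between a and b, while OneOrMore turns every gap into a single space.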

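Nice, but it seems to replace single and double quotes with spaces as well, although they are not in the "_notOkCharacter_" list, or am I missing something there? Is that part of the decode/encode methods called at the beginning? What would be necessary to keep these characters intact? – vm370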

var noHtml = Regex.Replace(inputHTML, @"<[^>]*(>|$)|&nbsp;|&zwnj;|&raquo;|&laquo;", string.Empty).Trim();
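The pattern above lists each entity to strip by name. As an alternative sketch, assuming the entities should be turned into their characters rather than dropped, System.Net.WebUtility.HtmlDecode can decode them all once the tags are gone. This behaves differently from the line above: &raquo; and &laquo; become » and « instead of disappearing, and &zwnj; becomes a zero-width character that \s does not match. The class and method names are made up for illustration.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class DecodeEntities
{
    public static string Strip(string inputHtml)
    {
        // Remove tags first so that escaped text such as "&lt;b&gt;"
        // is not mistaken for a tag after decoding.
        string noTags = Regex.Replace(inputHtml, @"<[^>]*(>|$)", " ");

        // Decode named and numeric entities; "&nbsp;" becomes U+00A0.
        string decoded = WebUtility.HtmlDecode(noTags);

        // In .NET, \s also matches U+00A0, so decoded non-breaking
        // spaces are collapsed together with ordinary whitespace.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```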


2014-06-11 06:27:50 MRP

I took @Ravi Thapliyal's code and made a method out of it. It is simple and might not clean everything, but so far it does what I need it to do.

public static string ScrubHtml(string value) { 
    var step1 = Regex.Replace(value, @"<[^>]+>|&nbsp;", "").Trim(); 
    var step2 = Regex.Replace(step1, @"\s{2,}", " "); 
    return step2; 
}


2014-07-31 14:50:46

-1

Sanitizing an HTML document involves a lot of tricky things. This package may help: https://github.com/mganss/HtmlSanitizer


2016-01-04 19:54:16 ehsan88

-1

(<([^>]+)>|&nbsp;)

You can test it here: https://regex101.com/r/kB0rQ4/1


2017-02-10 17:58:20
