2011-06-09 50 views
-5

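I am using C# and want to scrape all of the content of a web page (but not the images, scripts, or files that may be attached to the page). How can I do this with C# and ASP.NET? Read only the HTML content from a website page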

+1

Do you want to read the page's HTML on the server side, or something else? – PSK 2011-06-09 10:50:03

+0

You need to provide more details; your question is unclear. – PSK 2011-06-09 10:54:13

+1

Do you want to extract only the text from the web page? – 2011-06-09 10:57:02

Answers

1

Hi, you can do this with the following code snippet, taken from HERE:

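// Requires: using System; using System.IO; using System.Net; using System.Text;
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.your-url.com");

// Dispose the response and its stream when done so the connection is released.
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
using (Stream resStream = response.GetResponseStream())
// Assumption: the page is UTF-8 encoded. Encoding.ASCII would turn every
// non-ASCII character into '?', and decoding fixed-size chunks one at a time
// can split a multi-byte character across two reads.
using (StreamReader reader = new StreamReader(resStream, Encoding.UTF8))
{
    // Only the HTML document itself is downloaded. Images, scripts and other
    // linked files would need separate requests, which are never made here.
    string html = reader.ReadToEnd();
    Console.WriteLine(html);
}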
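
If, as one of the comments asks, what you are really after is just the text, the downloaded HTML string can be reduced further. The following is only a rough sketch and not part of the original answer: `HtmlTextSketch` and `StripToText` are hypothetical names, and it relies on regular expressions, which break on unusual markup (a `>` inside an attribute value or an HTML comment, for example). A real HTML parser would be more robust.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlTextSketch
{
    // Hypothetical helper: reduces an HTML string to rough plain text.
    public static string StripToText(string html)
    {
        // Drop <script> and <style> elements together with their contents.
        string noCode = Regex.Replace(
            html,
            @"<(script|style)\b[^>]*>.*?</\1\s*>",
            " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Replace every remaining tag with a space so adjacent words stay apart.
        string noTags = Regex.Replace(noCode, @"<[^>]+>", " ");

        // Decode entities such as &amp; and collapse runs of whitespace.
        string decoded = WebUtility.HtmlDecode(noTags);
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        string html = "<html><head><script>var x = 1;</script></head>" +
                      "<body><h1>Title</h1><p>Fish &amp; chips</p></body></html>";

        Console.WriteLine(StripToText(html)); // prints "Title Fish & chips"
    }
}
```

In practice you would pass in the string produced by the download code above instead of the hard-coded sample.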
0

You can also get the HTML in the page's Render method, as follows.

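// Requires: using System.IO; using System.Text; using System.Web.UI;
protected override void Render(System.Web.UI.HtmlTextWriter writer)
{
    StringBuilder sb = new StringBuilder();
    StringWriter sw = new StringWriter(sb);

    // Render into a separate in-memory writer. It needs its own name because
    // a local variable cannot reuse the name of the "writer" parameter.
    HtmlTextWriter bufferWriter = new HtmlTextWriter(sw);
    base.Render(bufferWriter);
    string markupText = sb.ToString();
    // markupText now contains the HTML of the page
    writer.Write(markupText);
}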