2012-09-08 28 views
1

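How do I extract specific content from HTML files saved in TXT format?

I have saved a lot of forum posts as separate txt files on my hard drive. Each file contains information I want to extract, and I have already worked out how to extract some of it. The information I still need appears in the following form:

In the same "HTML block":

1: (X) Messages in This Thread
2: Message is in Reply To: (some HTML code) A HREF="link" (some HTML code)

In task 1 I simply need to extract X.
In task 2 I need to extract which message this one is a reply to.

I have looked at the tm and XML packages, but have not been able to figure out what to use. Any suggestions are appreciated.

This is what one of the txt files looks like: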

`<HTML> 
<HEAD> 
<TITLE>Dear LEGO : 5668 </TITLE> 
<META NAME="ROBOTS" CONTENT="ALL, INDEX, FOLLOW"> 
<META NAME="KEYWORDS" CONTENT="lego, legos, legoland, toy, construction, community, education, technic, mindstorms, toolo, duplo, primo, dacta"> 
<META NAME="DESCRIPTION" CONTENT="Dear LEGO : 5668 - LUGNET: The international fan-created LEGOÆ Users Group Network. A place for LEGOÆ fans of all ages to find information, meet one another, and share ideas. As an independent site by fans, for fans, it is neither sponsored nor endorsed by the LEGO Company."> 
<SCRIPT LANGUAGE="JavaScript" SRC="http://www.lugnet.com/js/common.js"></SCRIPT> 
</HEAD> 

<BODY 
LEFTMARGIN=0 TOPMARGIN=0 MARGINWIDTH=0 MARGINHEIGHT=0 
BGCOLOR="#FFFFFF" TEXT="#000000" xLINK="#0000FF" xVLINK="#501080" xALINK="#B0C8EC"> <TABLE BORDER=0 CELLPADDING=9 CELLSPACING=0 WIDTH="100%" BGCOLOR="#B0C8EC"> 
    <TR ALIGN=CENTER VALIGN=BOTTOM> 

    <TD ALIGN=LEFT><NOBR><A TARGET="_top" HREF="http://www.lugnet.com/"><IMG BORDER=0 WIDTH=28 HEIGHT=44 HSPACE=10 VSPACE=0 SRC="/news/icon-home.gif" ALT="To LUGNET Homepage"></A><A TARGET="_top" HREF="http://news.lugnet.com/"><IMG BORDER=0 WIDTH=27 HEIGHT=44 HSPACE=10 VSPACE=0 SRC="/news/icon-news.gif" ALT="To LUGNET News Homepage"></A><A TARGET="_top" HREF="http://guide.lugnet.com/"><IMG BORDER=0 WIDTH=37 HEIGHT=44 HSPACE=10 VSPACE=0 SRC="/news/icon-guide.gif" ALT="To LUGNET Guide Homepage"></A></NOBR><BR></TD>  <FORM NAME="search" ACTION="http://www.lugnet.com/search.cgi" METHOD=POST 
     onSubmit="return(MetaSearch(document.search))"> <TD> 
     <INPUT TYPE=HIDDEN NAME="category" VALUE="/dear-lego/"> 
     <NOBR><SELECT NAME="scope"> 
      <OPTION VALUE="SetGuide">Set Reference 
      <OPTION VALUE="QuickSet">Set Reference (Popup) 
      <OPTION VALUE="PartsRef">Parts Reference <OPTION VALUE="News">News 
      <OPTION VALUE="NewsRel" SELECTED>News (Dear LEGO)   </SELECT>&nbsp;<A HREF="http://www.lugnet.com/help/search/"><IMG BORDER=0 WIDTH=16 HEIGHT=16 HSPACE=0 VSPACE=0 SRC="http://www.lugnet.com/help/help.gif" ALT="Help on Searching"></A></NOBR><BR> <NOBR><INPUT TYPE=TEXT NAME="query" VALUE="" SIZE=16 MAXLENGTH=200><SMALL>&nbsp;<INPUT TYPE=SUBMIT NAME="SUBMIT" VALUE="Search"></SMALL></NOBR><BR> 
     </TD> 
     </FORM> 

    <TD ALIGN=RIGHT><NOBR><A HREF="/news/post/?lugnet.dear-lego"><IMG BORDER=0 WIDTH=22 HEIGHT=44 HSPACE=10 VSPACE=0 SRC="/news/icon-post.gif" ALT="Post new message to lugnet.dear-lego"></A><A HREF="news://lugnet.com/lugnet.dear-lego"><IMG BORDER=0 WIDTH=30 HEIGHT=44 HSPACE=10 VSPACE=0 SRC="/news/icon-nntp.gif" ALT="Open lugnet.dear-lego in your NNTP Newsreader"></A><A HREF="http://news.lugnet.com/news/traffic/"><IMG BORDER=0 WIDTH=32 HEIGHT=44 HSPACE=10 VSPACE=0 SRC="/news/icon-traffic.gif" ALT="To LUGNET News Traffic Page"></A><IMG BORDER=0 WIDTH=3 HEIGHT=44 HSPACE=6 VSPACE=0 SRC="/news/icon-sep.gif"><A HREF="http://www.lugnet.com/people/members/sign-in/"><IMG BORDER=0 WIDTH=37 HEIGHT=44 HSPACE=0 VSPACE=0 SRC="/news/icon-signin-key.gif" ALT="Sign In (Members)"></A></NOBR><BR></TD> 

    </TR> 
</TABLE> 
<TABLE BORDER=0 CELLPADDING=0 CELLSPACING=0 WIDTH="100%" BGCOLOR="#8899BB"><TR><TD><SPACER TYPE=BLOCK WIDTH=1 HEIGHT=1></TD></TR></TABLE> <TABLE BORDER=0 CELLPADDING=7 CELLSPACING=0 WIDTH="100%" BGCOLOR="#E8F0FF"> <TR ALIGN=CENTER VALIGN=CENTER> 
     <TD COLSPAN=2 ALIGN=CENTER VALIGN=CENTER> 
<script type="text/javascript"><!-- 
google_ad_client = "pub-0089902038208374"; 
//LUGNET 728x15, Erstellt 13.12.07 
google_ad_slot = "6645292597"; 
google_ad_width = 728; 
google_ad_height = 15; 
//--></script> 
<script type="text/javascript" 
src="http://pagead2.googlesyndication.com/pagead/show_ads.js"> 
</script> 
     </TD> 
     </TR> <TR ALIGN=LEFT VALIGN=CENTER> <TD> <BIG><FONT FACE="Geneva,Arial,Helvetica"> 
     &nbsp;<A HREF="/dear-lego/">Dear&nbsp;LEGO</A>&nbsp;<FONT COLOR="#8899BB">/</FONT> 5668 <BR></FONT></BIG> </TD> <TD ALIGN=RIGHT><SMALL><FONT FACE="Geneva,Arial,Helvetica"> 
     <A HREF="/dear-lego/?n=5667">5667</A>&nbsp;&nbsp;|&nbsp;&nbsp;<A HREF="/dear-lego/?n=5669">5669</A> 
     <BR></SMALL></FONT></TD> </TR> 

</TABLE> 
<TABLE BORDER=0 CELLPADDING=0 CELLSPACING=0 WIDTH="100%" BGCOLOR="#8899BB"><TR><TD><SPACER TYPE=BLOCK WIDTH=1 HEIGHT=1></TD></TR></TABLE> <!-- google_ad_section_start --> <CENTER> <TABLE BORDER=0 CELLPADDING=16 CELLSPACING=0 WIDTH="100%"><TR><TD ALIGN=LEFT> <TABLE BORDER=0 CELLPADDING=0 CELLSPACING=0><TR ALIGN=LEFT VALIGN=TOP><TD> <TABLE BORDER=0 CELLPADDING=8 CELLSPACING=0> 

     <TR BGCOLOR="#E0E0E0"><TD ALIGN=LEFT> <TABLE BORDER=0 CELLPADDING=0 CELLSPACING=0 WIDTH="100%"><TR ALIGN=CENTER VALIGN=TOP> <TD ALIGN=LEFT VALIGN=TOP> 

    <TABLE BORDER=0 CELLPADDING=2 CELLSPACING=0> <TR VALIGN=MIDDLE> 

      <TD ALIGN=RIGHT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#7070A0" SIZE="-1">Subject:&nbsp;<BR></FONT></TD> 

      <TD ALIGN=LEFT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#333366" SIZE="-1"><BIG><BIG><B>Online PAB and Design-by-me needs more parts for Lego Train</B></BIG></BIG><BR></FONT></TD> 

      </TR> <TR VALIGN=MIDDLE> 

      <TD ALIGN=RIGHT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#7070A0" SIZE="-1">Author:&nbsp;<BR></FONT></TD> 

      <TD ALIGN=LEFT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#333366" SIZE="-1"><B>Benjamin Medinets</B><BR></FONT></TD> 

      </TR> <TR VALIGN=MIDDLE> 

      <TD ALIGN=RIGHT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#7070A0" SIZE="-1">Newsgroups:&nbsp;<BR></FONT></TD> 

      <TD ALIGN=LEFT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#333366" SIZE="-1"><A HREF="/dear-lego/">lugnet.dear-lego</A>, <A HREF="/trains/">lugnet.trains</A><BR></FONT></TD> 

      </TR> <TR VALIGN=MIDDLE> 

      <TD ALIGN=RIGHT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#7070A0" SIZE="-1">Followup-To:&nbsp;<BR></FONT></TD> 

      <TD ALIGN=LEFT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#333366" SIZE="-1"><A HREF="/trains/">lugnet.trains</A><BR></FONT></TD> 

      </TR> <TR VALIGN=MIDDLE> 

      <TD ALIGN=RIGHT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#7070A0" SIZE="-1">Date:&nbsp;<BR></FONT></TD> 

      <TD ALIGN=LEFT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#333366" SIZE="-1">Thu, 6 Oct 2011 03:44:44 GMT<BR></FONT></TD> 

      </TR> <TR VALIGN=MIDDLE> 

      <TD ALIGN=RIGHT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#7070A0" SIZE="-1">From:&nbsp;<BR></FONT></TD> 

      <TD ALIGN=LEFT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#333366" SIZE="-1"><FONT COLOR="#7070A0">Benjamin Medinets &lt;[email protected]+stopspammers+&gt;</FONT><BR></FONT></TD> 

      </TR> <TR VALIGN=MIDDLE> 

      <TD ALIGN=RIGHT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#7070A0" SIZE="-1">Highlighted:&nbsp;<BR></FONT></TD> 

      <TD ALIGN=LEFT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#333366" SIZE="-1"><FONT COLOR="#D57F7F"><B>!</B></FONT> 

<A HREF="/news/ahh.cgi?lugnet.dear-lego,5668">(details)</A><BR></FONT></TD> 

      </TR> <TR VALIGN=MIDDLE> 

      <TD ALIGN=RIGHT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#7070A0" SIZE="-1">Viewed:&nbsp;<BR></FONT></TD> 

      <TD ALIGN=LEFT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#333366" SIZE="-1">3013 times<BR></FONT></TD> 

      </TR> </TABLE> 

    </TD> <TD WIDTH=20>&nbsp;&nbsp;</TD> 

     <TD ALIGN=CENTER VALIGN=TOP> 

     <FONT FACE="Geneva,Arial,Helvetica" SIZE="-2"><A HREF="/news/raw.cgi?lugnet.dear-lego,5668">View Raw<BR>Message</A><BR><BR></FONT> <A HREF="/news/post/?lugnet.dear-lego,5668"><IMG BORDER=0 WIDTH=30 HEIGHT=44 HSPACE=10 VSPACE=10 SRC="/news/icon-reply.gif" TITLE="Post a public reply to this message"></A><BR> </TD> </TR></TABLE> </TD></TR> 

     <TR BGCOLOR="#F0F0F0"><TD ALIGN=LEFT NOWRAP><TT>I was using Lego Digital Designer and am disappointed the downhill availabilty<BR> 
of certain important parts to build &quot;buyable&quot; models.<BR> 
<BR> 
I would like to see a return of &quot;warehouse&quot; sliding doors to make<BR> 
box cars.<BR> 
<BR> 
Train-style doors would also be nice as well as train windows (both in<BR> 
2x3 and 4x3)... please.<BR> 
<BR> 
I looked at the instructions to build a mail car from the 7722, and<BR> 
found that I really only need 2 red sliding rail doors, the pair of<BR> 
&quot;decorated train doors&quot; and a set of two 2x3 thin yellow train<BR> 
windows.<BR> 
<BR> 
Yes, there was a bit of minor substitution but it is mostly distiguishable<BR> 
as the model.<BR> 
<BR> 
Here is what it looks like:<BR> 
<BR> 
<A HREF="http://www.lugnet.com/jump.cgi?http://www.brickshelf.com/gallery/medib/lego-fun/7722mailvan.jpg">http://www.brickshelf.com/gallery/medib/lego-fun/7722mailvan.jpg</A><BR> 
<BR> 
Yeah I know... where are the f-in doors???<BR> 
<BR> 
<BR> 
Ben<BR> 
</TT> 
</TD></TR> 

     <TR BGCOLOR="#E0E0E0"><TD ALIGN=LEFT></TD></TR> 

    </TABLE> <BR> <BR> <FONT FACE="Verdana,Geneva,Helvetica" SIZE="-1" COLOR="#990000"> 



     <B>1 Message in This Thread:</B><BR> <NOBR><IMG WIDTH=9 HEIGHT=11 VSPACE=2 SRC="/news/here.gif" TITLE="You are here"></NOBR><BR><NOBR></NOBR> 
<DL> 

     <DT>Entire Thread on One Page: 

     <SMALL><FONT COLOR="#000000"> 

     <DD><B>Nested:&nbsp;</B> 

     <A HREF="/dear-lego/?n=5668&t=i&v=a">All</A> | <A HREF="/dear-lego/?n=5668&t=i&v=b">Brief</A> | <A HREF="/dear-lego/?n=5668&t=i&v=c">Compact</A> | <A HREF="/dear-lego/?n=5668&t=i&v=d">Dots</A> 

     <BR><B>Linear:&nbsp;</B> 

     <A HREF="/dear-lego/?n=5668&t=f&v=a">All</A> | <A HREF="/dear-lego/?n=5668&t=f&v=b">Brief</A> | <A HREF="/dear-lego/?n=5668&t=f&v=c">Compact</A> 

     </FONT></SMALL> </DL> 



     </FONT> </TD> 

    <TD WIDTH=20>&nbsp;&nbsp;&nbsp;&nbsp;<BR></TD> 

    <TD><FONT FACE="Verdana,Geneva,Arial,Helvetica" SIZE="-1"> 
<script type="text/javascript"><!-- 
google_ad_client = "pub-0089902038208374"; 
//LUGNET 160x600, Erstellt 14.12.07 
google_ad_slot = "5985678701"; 
google_ad_width = 160; 
google_ad_height = 600; 
//--></script> 
<script type="text/javascript" 
src="http://pagead2.googlesyndication.com/pagead/show_ads.js"> 
</script> 
<BR> 
<style type="text/css"> @import url(http://www.google.com/cse/api/branding.css); 
</style> 
<div class="cse-branding-bottom" style="background-color:#FFFFFF;color:#000000"> 
    <div class="cse-branding-form"> 
    <form action="http://www.google.com/cse" id="cse-search-box"> 
     <div> 
     <input type="hidden" name="cx" value="partner-pub-0089902038208374:9n7bh3k27mb" /> 
     <input type="hidden" name="ie" value="ISO-8859-1" /> 
     <input type="text" name="q" size="31" /> 
     <input type="submit" name="sa" value="Search" /> 
     </div> 
    </form> 
    </div> 
    <div class="cse-branding-logo"> 
    <img src="http://www.google.com/images/poweredby_transparent/poweredby_FFFFFF.gif" alt="Google" /> 
    </div> 
    <div class="cse-branding-text"> 
    Custom Search 
    </div> 
</div> </FONT></TD> 

    </TR></TABLE> <TABLE BORDER=0 CELLPADDING=0 CELLSPACING=0 WIDTH="100%"> 
<TR VALIGN=TOP> </TR></TABLE> </TD></TR></TABLE> 
    </CENTER> 
<!-- google_ad_section_end --> <TABLE BORDER=0 CELLPADDING=0 CELLSPACING=0 BGCOLOR="#8899BB" WIDTH="100%"><TR> 
<TD><SPACER TYPE=BLOCK WIDTH=1 HEIGHT=1></TD></TR></TABLE> 

<TABLE BORDER=0 CELLPADDING=4 CELLSPACING=0 BGCOLOR="#E8F0FF" WIDTH="100%"> 
    <TR VALIGN=TOP> 
    <TD ALIGN=LEFT><FONT FACE="Geneva,Arial,Helvetica" SIZE="-2" COLOR="#000033"> <A HREF="/sitemap.cgi">Newsgroup Tree</A> &nbsp;|&nbsp; <A HREF="http://www.lugnet.com/admin/terms/agreement">Terms of Use</A> &nbsp;|&nbsp; <A HREF="http://www.lugnet.com/admin/feedback/">Feedback</A><BR> 
    </FONT></TD> 
    <TD ALIGN=RIGHT><FONT FACE="Geneva,Arial,Helvetica" SIZE="-2" COLOR="#000033"> &copy;2005 LUGNET. All rights reserved. - hosted by <a href="http://www.steinbruch.info/" target="_blank">steinbruch.info GbR</a><BR> 
    </FONT></TD> 
    </TR> 
</TABLE> 

<script type="text/javascript"> 
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www."); 
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E")); 
</script> 
<script type="text/javascript"> 
var pageTracker = _gat._getTracker("UA-3258989-12"); 
pageTracker._initData(); 
pageTracker._trackPageview(); 
</script> 
</BODY> 
</HTML> ` 
+1

I suggest the XML package. Please provide some sample code. – sgibb

+0

So I was thinking one could build a piece of code that detects "Message is in Reply To:" and then takes the next A HREF="link"...? –

+0

I don't see "Messages in This Thread" in the html you provided. Please edit your question and add the relevant html code – GSee

Answers

0

If this is your string, then you can get the material delimited by the string 'A HREF="' using strsplit:

txt <- '</TABLE> <BR> <BR> <FONT FACE="Verdana,Geneva,Helvetica" SIZE="-1" COLOR="#990000"><B> 

    Message has 2 Replies: </B></FONT><BR> <TABLE BORDER=0 CELLPADDING=0 CELLSPACING=0 WIDTH="100%"> <TR VALIGN=TOP BGCOLOR="#E0E0E0"><TD ALIGN=LEFT><A HREF="/dear-lego/?n=14"><IMG BORDER=5 HEIGHT=3 WIDTH=3 SRC="/news/x.gif"></A></TD><TD><FONT SIZE="-2">&nbsp;&nbsp;</FONT></TD><TD ALIGN=LEFT><FONT FACE="Verdana,Geneva,Helvetica" SIZE="-2"><A HREF="/dear-lego/?n=14">Re: Plate Paks</A><BR></FONT></TD><TD ALIGN=RIGHT><FONT FACE="Verdana,Geneva,Helvetica" SIZE="-2">&nbsp;Tom Stangl<BR></FONT></TD></TR><TR BGCOLOR="#F8F8F8"><TD COLSPAN=4 ALIGN=LEFT VALIGN=TOP><FONT FACE="Verdana,Geneva,Helvetica" SIZE="-2" ' 

This is the second fragment:

> strsplit(txt, split='A HREF="')[[1]][2] 
[1] "/dear-lego/?n=14\"><IMG BORDER=5 HEIGHT=3 WIDTH=3 SRC=\"/news/x.gif\"></A></TD><TD><FONT SIZE=\"-2\">&nbsp;&nbsp;</FONT></TD><TD ALIGN=LEFT><FONT FACE=\"Verdana,Geneva,Helvetica\" SIZE=\"-2\"><" 
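
The fragment above still carries the HTML that follows the link. A minimal sketch of trimming each piece down to the link itself, assuming every HREF value is wrapped in double quotes; the `txt` here is a shortened, hypothetical stand-in for the string above, and `pieces` and `links` are illustrative names:

```r
# Shortened, hypothetical stand-in for the `txt` string above.
txt <- 'Message has 2 Replies: <A HREF="/dear-lego/?n=14"><IMG SRC="/news/x.gif"></A> <A HREF="/dear-lego/?n=14">Re: Plate Paks</A>'

# Split on the opening of each link and drop the text before the first link.
pieces <- strsplit(txt, split = 'A HREF="', fixed = TRUE)[[1]][-1]

# Keep only what precedes the closing double quote of each HREF value.
links <- sub('".*$', "", pieces)

# The same link appears twice (icon and title), so de-duplicate.
unique(links)
```

`fixed = TRUE` makes strsplit treat the delimiter as a literal string rather than a regular expression, which avoids surprises if the delimiter ever contains regex metacharacters.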

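There are probably proper XML and HTML processing approaches, but they usually need all the headers, and you have removed all the headers from your example.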
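
Without a parser, both tasks from the question could be roughly sketched with base R regular expressions. This is only a sketch under assumptions: the marker text "Message is in Reply To:" and the `html` fragment are hypothetical (the posted file does not show that block), and `n_messages` and `reply_to` are illustrative names.

```r
# Hypothetical page fragment; the "Message is in Reply To:" wording is assumed.
html <- paste(
  '<B>3 Messages in This Thread:</B><BR>',
  '<B>Message is in Reply To:</B> <TABLE><TR><TD>',
  '<A HREF="/dear-lego/?n=5667">Re: Plate Paks</A></TD></TR></TABLE>',
  sep = "\n"
)
# For a real file (hypothetical name):
# html <- paste(readLines("post.txt", warn = FALSE), collapse = "\n")

# Task 1: the number X in "X Message(s) in This Thread".
count_pat <- "([0-9]+) Messages? in This Thread"
count_hit <- regmatches(html, regexpr(count_pat, html))
n_messages <- as.integer(sub(count_pat, "\\1", count_hit))

# Task 2: the first HREF value that follows the reply marker.
marker <- "Message is in Reply To:"
pos <- as.integer(regexpr(marker, html, fixed = TRUE))
reply_to <- NA_character_
if (pos > 0) {
  rest <- substring(html, pos)
  link_hit <- regmatches(rest, regexpr('A HREF="[^"]*"', rest))
  if (length(link_hit) == 1) {
    reply_to <- sub('^A HREF="([^"]*)"$', "\\1", link_hit)
  }
}
```

If the marker is absent, `reply_to` stays `NA`, which makes it easy to tell thread starters apart from replies when looping over many files.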

+0

Very good suggestion. Thanks a lot! –

+0

So here is the whole piece of code... –
