2016-12-14

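How to print a line of text from an HTML file

My goal is to use a regex to scan an email for the word "trade" and then print the entire line of text in which it was found.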

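I have already used regexes successfully to capture other data from this HTML document (species, weight, price, and so on), and I can successfully detect the word "trade", but I have failed to print the whole line. I did try BeautifulSoup for this, but found it even harder.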

Here is the document I believe is in HTML format (correct me if I'm wrong and it's not HTML):

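Ideally I want to capture and print the two lines in which the word "trade" is found. Here is the code I am using to try to identify "trade" and print its line: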

with open(file_path, 'r') as f:
    email = f.read()
    pattern = re.search(r'\btrade\b', email).group(0)
    match = re.search(r'\btrade\b', email)
    if match:
        for line in email:
            print("TRADE STUFF:", line)

Note that I have tried various things, such as print("TRADE STUFF:", line.splitlines()) and print("TRADE STUF:", line.stripped_strings), but without success.

Thanks for any help.

HTML code:

<html> 
<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> 
<title>FW: NEFS 5 Available Fish</title> 
<link rel="important stylesheet" href=""> 
<style>div.headerdisplayname {font-weight:bold;}</style></head> 
<body> 
<table border=0 cellspacing=0 cellpadding=0 width="100%" class="header-part1"><tr><td><b>Subject: </b>FW: NEFS 5 Available Fish</td></tr><tr><td><b>From: </b>Claire Fitz-Gerald <[email protected]></td></tr><tr><td><b>Date: </b>9/5/2014 9:52 AM</td></tr></table><br> 
<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><META HTTP-EQUIV="Content-Type" CONTENT="text/html; "><meta name=Generator content="Microsoft Word 12 (filtered medium)"><!--[if !mso]><style>v\:* {behavior:url(#default#VML);} 
o\:* {behavior:url(#default#VML);} 
w\:* {behavior:url(#default#VML);} 
.shape {behavior:url(#default#VML);} 
</style><![endif]--><style><!-- 
/* Font Definitions */ 
@font-face 
    {font-family:Wingdings; 
    panose-1:5 0 0 0 0 0 0 0 0 0;} 
@font-face 
    {font-family:"Cambria Math"; 
    panose-1:2 4 5 3 5 4 6 3 2 4;} 
@font-face 
    {font-family:Calibri; 
    panose-1:2 15 5 2 2 2 4 3 2 4;} 
@font-face 
    {font-family:Tahoma; 
    panose-1:2 11 6 4 3 5 4 4 2 4;} 
@font-face 
    {font-family:"Franklin Gothic Book"; 
    panose-1:2 11 5 3 2 1 2 2 2 4;} 
@font-face 
    {font-family:"Franklin Gothic Demi"; 
    panose-1:2 11 7 3 2 1 2 2 2 4;} 
/* Style Definitions */ 
p.MsoNormal, li.MsoNormal, div.MsoNormal 
    {margin:0in; 
    margin-bottom:.0001pt; 
    font-size:12.0pt; 
    font-family:"Times New Roman","serif";} 
a:link, span.MsoHyperlink 
    {mso-style-priority:99; 
    color:blue; 
    text-decoration:underline;} 
a:visited, span.MsoHyperlinkFollowed 
    {mso-style-priority:99; 
    color:purple; 
    text-decoration:underline;} 
span.EmailStyle18 
    {mso-style-type:personal-reply; 
    font-family:"Calibri","sans-serif"; 
    color:#1F497D;} 
.MsoChpDefault 
    {mso-style-type:export-only; 
    font-size:10.0pt;} 
@page WordSection1 
    {size:8.5in 11.0in; 
    margin:1.0in 1.0in 1.0in 1.0in;} 
div.WordSection1 
    {page:WordSection1;} 
/* List Definitions */ 
@list l0 
    {mso-list-id:1512259006; 
    mso-list-template-ids:-893643712;} 
@list l0:level1 
    {mso-level-number-format:bullet; 
    mso-level-text:\F0B7; 
    mso-level-tab-stop:.5in; 
    mso-level-number-position:left; 
    text-indent:-.25in; 
    mso-ansi-font-size:10.0pt; 
    font-family:Symbol;} 
ol 
    {margin-bottom:0in;} 
ul 
    {margin-bottom:0in;} 
--></style><!--[if gte mso 9]><xml> 
<o:shapedefaults v:ext="edit" spidmax="1026" /> 
</xml><![endif]--><!--[if gte mso 9]><xml> 
<o:shapelayout v:ext="edit"> 
<o:idmap v:ext="edit" data="1" /> 
</o:shapelayout></xml><![endif]--></head><body lang=EN-US link=blue vlink=purple><div class=WordSection1><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>Apologies for the delay in distributing this listing.&nbsp; It got lost in my inbox.<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>Please see the below quota listings.<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>Thanks,<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p>&nbsp;</o:p></span></p><div><p class=MsoNormal><span style='font-family:"Franklin Gothic Book","sans-serif";color:#1F497D'>Claire Fitz-Gerald<o:p></o:p></span></p><p class=MsoNormal><i><span style='font-size:10.0pt;font-family:"Franklin Gothic Book","sans-serif";color:#1F497D'><o:p>&nbsp;</o:p></span></i></p><p class=MsoNormal><b><span style='font-size:11.0pt;font-family:"Franklin Gothic Demi","sans-serif";color:#002776'>Cape Cod Commercial Fishermen's Alliance<o:p></o:p></span></b></p><p class=MsoNormal><b><span style='font-size:11.0pt;font-family:"Franklin Gothic Book","sans-serif";color:#DE3500'>~ Small Boats.&nbsp; Big Ideas. 
~</span></b><b><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#DE3500'><o:p></o:p></span></b></p></div><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p>&nbsp;</o:p></span></p><div><div style='border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in'><p class=MsoNormal><b><span style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'>From:</span></b><span style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'> NEFS V [mailto:[email protected]] <br><b>Sent:</b> Monday, September 01, 2014 8:46 PM<br><b>To:</b> mike walsh - 6; NEFS 11 &amp; 12 - Josh Wiersma; NEFS 13 John Haran; NEFS 2 - Dave Leveille; NEFS 3 - Rob Banks; NEFS 6 &amp; 10 Jim Reardon; NEFS 7 &amp; 8 - Linda MaCann; NEFS 9 - Stephanie Rafael-DeMello; paula lynch - 10; Claire Fitz-Gerald; Sector - MCCS; Sector - NCCS; Sector - Sustainable Harvest; tory bramante- 6<br><b>Subject:</b> NEFS 5 Available Fish<o:p></o:p></span></p></div></div><p class=MsoNormal><o:p>&nbsp;</o:p></p><div><p class=MsoNormal>All,<br>NEFS 5 has the following fish available for lease/trade:<o:p></o:p></p></div><div><ul type=disc><li class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;mso-list:l0 level1 lfo1'><strong><span style='font-size:13.5pt'>GB EAST cod: 954 lbs @ $0.83</span></strong><o:p></o:p></li><li class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;mso-list:l0 level1 lfo1'><strong><span style='font-size:13.5pt'>GB EAST cod: 1,046 lbs to trade for 1,830 lbs GB WEST cod</span></strong><o:p></o:p></li><li class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;mso-list:l0 level1 lfo1'><strong><span style='font-size:13.5pt'>GB blackback: 30,000 lbs @ $0.07</span></strong><o:p></o:p></li><li class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;mso-list:l0 level1 lfo1'><strong><span style='font-size:13.5pt'>GOM blackback: 800 lbs @ 
$0.03</span></strong><o:p></o:p></li><li class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;mso-list:l0 level1 lfo1'><strong><span style='font-size:13.5pt'>white hake: 6,322 lbs @ $0.13</span></strong><o:p></o:p></li><li class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;mso-list:l0 level1 lfo1'><strong><span style='font-size:13.5pt'>pollock: 22,000 lbs @ $0.015</span></strong><o:p></o:p></li><li class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;mso-list:l0 level1 lfo1'><strong><span style='font-size:13.5pt'>redfish: 14,000 lbs @ $0.015</span></strong><o:p></o:p></li><li class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;mso-list:l0 level1 lfo1'><strong><span style='font-size:13.5pt'>GB yt: 1,873 lbs @ $1.13</span></strong><o:p></o:p></li><li class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;mso-list:l0 level1 lfo1'><strong><span style='font-size:13.5pt'>GB yt: 5,127 lbs to trade for 10,254 lbs SNE yt</span></strong><o:p></o:p></li></ul><div><p class=MsoNormal>&nbsp;<o:p></o:p></p></div><div><p class=MsoNormal>-- <o:p></o:p></p></div><div><p class=MsoNormal>&nbsp;<o:p></o:p></p></div></div><div><p class=MsoNormal>Daniel Salerno, NEFS 5<o:p></o:p></p></div><div><p class=MsoNormal>C/O NESTCo.<o:p></o:p></p></div><div><p class=MsoNormal>55 State Street<o:p></o:p></p></div><div><p class=MsoNormal>Narragansett, RI 02882<o:p></o:p></p></div><div><p class=MsoNormal>401-932-0070<o:p></o:p></p></div><div><p class=MsoNormal>401-633-6539 (fax)<o:p></o:p></p></div><div><p class=MsoNormal><a href="mailto:[email protected]" target="_blank">[email protected]</a><o:p></o:p></p></div><div class=MsoNormal align=center style='text-align:center'></body></html> 
</body> 
</html> 

You should share the HTML file as well.


Sorry, I always forget to include it. I'll add it now. – theprowler

Answers


I would do it like this:

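with open(file_path, 'r') as f:
    while True:
        line = f.readline()
        if not line:  # an empty string means end of file
            break
        if "trade" in line.lower():
            # Turn every '>' into '<' so a single split separates tags from text
            tags = line.replace('>', '<').split('<')
            for tag in tags:
                if "trade" in tag.lower():
                    print("TRADE STUFF: ", tag.strip())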

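Works perfectly, thank you! For my own sake, may I ask how the code works? As I understand it, you simply search for the word "trade", and then you grab the tags? Or whatever is inside the tags? – theprowler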


For each line in the file: 1. .replace() changes every '>' to '<', which means the line looks like this " – otocan
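The replace-and-split step described in the comment above can be tried on a single line in isolation. What follows is only a minimal sketch: the sample line is a shortened version of one list item from the email, and the helper name `text_pieces_with` is made up for illustration and does not appear in the answer.

```python
def text_pieces_with(line, word):
    """Return the pieces of `line` that contain `word`, after splitting on tag brackets."""
    # Turning every '>' into '<' leaves a single delimiter, so one split()
    # yields tag contents and visible text as separate pieces.
    pieces = line.replace('>', '<').split('<')
    return [piece.strip() for piece in pieces if word in piece.lower()]


# Shortened sample line (an assumption for illustration), based on the email above.
sample = ("<li class=MsoNormal><strong><span style='font-size:13.5pt'>"
          "GB yt: 5,127 lbs to trade for 10,254 lbs SNE yt</span></strong></li>")

for piece in text_pieces_with(sample, "trade"):
    print("TRADE STUFF:", piece)
```

Because the check is a plain substring test, a tag whose attributes happen to contain "trade" would be printed as well, and so would words such as "trademark".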


Ohhhhh, I see now. Wow, that is really clever. Did you come up with that idea from a lot of experience with HTML? – theprowler


Swap your 'for' loop and your 'if' statement, like this:

for line in email:
    if match:
        print("TRADE STUFF: ", line)

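I think you misunderstood my predicament; sorry, I wasn't very clear in explaining it. I can access the whole line, but because it is HTML, it prints the HTML code, tags and all. I only want to print the text itself. – theprowler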
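For the "text only" requirement described in this comment, one possible alternative (not proposed anywhere in this thread) is to let an HTML parser separate text from markup and then apply the `\btrade\b` regex to the text chunks. The sketch below uses the standard library's `html.parser` rather than BeautifulSoup; the names `TextCollector` and `trade_lines` and the sample markup are assumptions made for illustration.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect visible text chunks, skipping <style> and <script> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Collapse runs of whitespace, including newlines inside a text run.
        text = " ".join(data.split())
        if text and not self._skip_depth:
            self.chunks.append(text)


def trade_lines(html):
    """Return the text chunks that contain 'trade' as a whole word."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return [c for c in parser.chunks if re.search(r"\btrade\b", c, re.IGNORECASE)]


# Simplified sample markup (an assumption), modelled on the list in the email.
sample_html = (
    "<ul><li><strong>GB EAST cod: 954 lbs @ $0.83</strong></li>"
    "<li><strong>GB EAST cod: 1,046 lbs to trade for 1,830 lbs GB WEST cod</strong></li>"
    "<li><strong>GB yt: 5,127 lbs to trade for 10,254 lbs SNE yt</strong></li></ul>"
)

for line in trade_lines(sample_html):
    print("TRADE STUFF:", line)
```

Each chunk is the text between two adjacent tags, so a sentence interrupted by inline markup such as `<br>` arrives as several chunks. On the full email, the word-boundary regex would also match the "lease/trade:" sentence, which may or may not be wanted.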