2014-02-25 124 views

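Parsing specific data from HTML

I have already extracted the second table. From that second table, I need to extract only the rows that have a file name in column[0].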

<TABLE WIDTH="100%" BORDER="1" > 
<TR ><TD BGCOLOR="#FFFF99" ROWSPAN="1" COLSPAN="2" WIDTH="70%">Root</TD></TR> 
<TR ><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="70%">Functions</TD><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="30%"> &#160;&#160;&#160;10.1% (1077/10647)</TD></TR> 
<TR ><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="70%">Functions and exits</TD><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="30%"> &#160;&#160;&#160;&#160;9.5% (2142/22473)</TD></TR> 
<TR ><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="70%">Statement blocks</TD><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="30%"> &#160;&#160;&#160;&#160;9.1% (2191/24167)</TD></TR> 
<TR ><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="70%">Decisions</TD><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="30%"> &#160;&#160;&#160;&#160;8.8% (2648/29930)</TD></TR> 
<TR ><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="70%">Loops</TD><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="30%"> &#160;&#160;&#160;&#160;8.4% (305/3628)</TD></TR> 
<TR ><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="70%">Basic conditions</TD><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="30%"> &#160;&#160;&#160;&#160;8.3% (1759/21254)</TD></TR> 
<TR ><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="70%">Modified conditions</TD><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="30%"> &#160;&#160;&#160;&#160;1.8% (35/1997)</TD></TR> 
<TR ><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="70%">Multiple conditions</TD><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="30%"> &#160;&#160;&#160;&#160;4.4% (137/3082)</TD></TR> 

</TABLE> 
</P> 
<P ALIGN="LEFT"><BR> 
2 - Files list</P> 
<BR> 
Display absolute values only.<BR> 

<TABLE WIDTH="100%" BORDER="1" > 
<TR BGCOLOR="#FFFF99"><TD BGCOLOR="#FFFF99" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><b>Item<IMG SRC="cvi_sort_d.png" ALT="cvi_sort_d.xpm"></b></TD><TD BGCOLOR="#FFFF99" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><b>Functions</b></TD><TD BGCOLOR="#FFFF99" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><b>Functions and exits</b></TD><TD BGCOLOR="#FFFF99" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><b>Statement blocks</b></TD><TD BGCOLOR="#FFFF99" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><b>Decisions</b></TD><TD BGCOLOR="#FFFF99" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><b>Loops</b></TD><TD BGCOLOR="#FFFF99" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><b>Basic conditions</b></TD><TD BGCOLOR="#FFFF99" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><b>Modified conditions</b></TD><TD BGCOLOR="#FFFF99" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><b>Multiple conditions</b></TD></TR> 
<TR ><TD BGCOLOR="#FF9999" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><B><A NAME="175746848"></A><a href="LOADER.H.html">LOADER.H</a></B></TD><TD BGCOLOR="#FFDFDD" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">0/1</P> 
</TD><TD BGCOLOR="#FFDFDD" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">0/2</P> 
</TD><TD BGCOLOR="#FFDFDD" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">0/1</P> 
</TD><TD BGCOLOR="#FFDFDD" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">0/1</P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD></TR> 
<TR ><TD BGCOLOR="#9999FF" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><A NAME="175746912"></A> &#160;&#160;&#160;<a href="LOADER.H.html">LoaderState_struct</a></TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD></TR> 
<TR ><TD BGCOLOR="#9999FF" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><A NAME="175746976"></A> &#160;&#160;&#160;<a href="LOADER.H.html">LoadParameters_struct</a></TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD></TR> 
<TR ><TD BGCOLOR="#9999FF" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><A NAME="175747104"></A> &#160;&#160;&#160;<a href="LOADER.H.html">LoadOffsets_struct</a></TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD></TR> 
<TR ><TD BGCOLOR="#9999FF" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><A NAME="175747168"></A> &#160;&#160;&#160;<a href="LOADER.H.html">LoadAppComponent_struct</a></TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD></TR> 
<TR ><TD BGCOLOR="#FF9999" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><B><A NAME="175746848"></A><a href="CORBA_FIXED.CC.html">CORBA_FIXED.CC</a></B></TD><TD BGCOLOR="#FFDFDD" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">0/1</P> 
</TD><TD BGCOLOR="#FFDFDD" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">0/2</P> 
</TD><TD BGCOLOR="#FFDFDD" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">0/1</P> 
</TD><TD BGCOLOR="#FFDFDD" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">0/1</P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD></TR> 
<TR ><TD BGCOLOR="#9999FF" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><A NAME="175746912"></A> &#160;&#160;&#160;<a href="LOADER.H.html">LoaderState_struct</a></TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD></TR> 
<TR ><TD BGCOLOR="#9999FF" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><A NAME="175746976"></A> &#160;&#160;&#160;<a href="LOADER.H.html">LoadParameters_struct</a></TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD></TR> 
<TR ><TD BGCOLOR="#9999FF" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><A NAME="175747104"></A> &#160;&#160;&#160;<a href="LOADER.H.html">LoadOffsets_struct</a></TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD></TR> 
<TR ><TD BGCOLOR="#9999FF" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><A NAME="175747168"></A> &#160;&#160;&#160;<a href="LOADER.H.html">LoadAppComponent_struct</a></TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD></TR> 
</TABLE> 

For this I wrote the following Python script:

from bs4 import BeautifulSoup
f = open("/home/vignesh/Downloads/html/RateDoc.html","r")
fl = {'LOADER.H','CORBA_FIXED.H'}
soup = BeautifulSoup(f)
t = soup.findAll('table')
for table in t[1:]:
    rows = table.findAll('tr')
    for tr in rows[1:]:
        cols = tr.findAll('td')
        for td in cols:
            text = ''.join((td.find(text=True)).encode('utf-8'))
            print text+"\t",
        print
    print


The above script extracts the data as follows:


LOADER.H 0/1 0/2 0/1 0/1 none none none none  
    none none none none none none none none  
    none none none none none none none none  
        none none none none none none none none  
    none none none none none none none none  
CORBA_FIXED.CC 0/1 0/2 0/1 0/1 none none none none  
    none none none none none none none none  
    none none none none none none none none  
    none none none none none none none none  
    none none none none none none none none 

But the expected result is different: I want to extract only the rows for files with the extensions *.cc and *.h.

Required output:

LOADER.H 0/1 0/2 0/1 0/1 none none none none  
CORBA_FIXED.CC 0/1 0/2 0/1 0/1 none none none none  

Could someone help me modify the above script so that it extracts only the files with the extensions *.cc and *.h?

Answers

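from bs4 import BeautifulSoup

INPUT = "/home/vignesh/Downloads/html/RateDoc.html"
WANTED = {"h", "cc"}  # file extensions to keep, lower-case

def main():
    with open(INPUT, "rb") as inf:
        # Name the parser explicitly so the result does not depend on what is installed.
        soup = BeautifulSoup(inf, "html.parser")

    for row in soup.find_all("tr"):
        first_col = row.find("td")
        if first_col is None:
            continue  # row without any cells
        links = first_col.find_all("a")
        # File and struct rows hold a named anchor plus the actual link.
        if len(links) == 2:
            parts = links[1].text.rsplit(".", 1)
            # Keep the row only if the link text ends in a wanted extension.
            if len(parts) > 1 and parts[-1].lower() in WANTED:
                print("\t".join(cell.text.strip() for cell in row.find_all("td")))

if __name__ == "__main__":
    main()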

This produces:

LOADER.H 0/1 0/2 0/1 0/1 none none none none 
CORBA_FIXED.CC 0/1 0/2 0/1 0/1 none none none none 
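
The extension check in the answer above can also be pulled out into a small helper, which makes it easy to try on its own. This is only a minimal sketch: the names `has_wanted_extension` and `WANTED_EXTENSIONS` are made up for illustration, and it uses `os.path.splitext` from the standard library in place of `rsplit`.

```python
import os.path

# Extensions to keep, lower-case and including the leading dot
# (taken from the question: *.cc and *.h).
WANTED_EXTENSIONS = {".h", ".cc"}

def has_wanted_extension(name, wanted=WANTED_EXTENSIONS):
    """Return True if `name` ends in one of the wanted extensions (case-insensitive)."""
    _, ext = os.path.splitext(name.strip())
    return ext.lower() in wanted

# Link texts as they appear in the first column of the sample table.
for name in ["LOADER.H", "CORBA_FIXED.CC", "LoaderState_struct"]:
    print(name, has_wanted_extension(name))
```

With a helper like this, the loop only needs to pass the link text of the first cell and print the row when the result is true.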

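It looks like it should work if you wrap your output in an if. I am basing this on the fact that the lines you want to skip seem to print a blank first item followed by 8 "none" values.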

if text == '':
    break
else:
    print text + '\t',

This is based only on reading your code, as I cannot test it at the moment.
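
One thing to watch with this idea: in the sample HTML the first cell of a struct row starts with `&#160;` entities, so its first text node is made of non-breaking spaces rather than being an empty string. A minimal sketch of a more tolerant check follows, assuming Python 3 (where `str.strip()` also removes U+00A0) and a hypothetical helper named `is_blank`. The rows are hand-written stand-ins for what the question's loop would see, not parsed output. In the question's Python 2 code the text is UTF-8 encoded first, so the check would probably have to run before the `encode` call.

```python
def is_blank(text):
    """True if `text` is None, empty, or whitespace only (including U+00A0)."""
    return text is None or text.strip() == ""

# Illustrative rows: the first item is the first text node of the first cell.
rows = [
    ["LOADER.H", "0/1", "0/2", "0/1", "0/1", "none ", "none ", "none ", "none "],
    [" \u00a0\u00a0\u00a0", "none ", "none ", "none ", "none ", "none ", "none ", "none ", "none "],
]

for cells in rows:
    if is_blank(cells[0]):
        continue  # skip the whole row instead of printing it
    print("\t".join(cell.strip() for cell in cells))
```

Skipping at row level with `continue` also avoids the empty line that a `break` inside the cell loop would still leave behind.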