2017-01-25

Here's my situation: my code parses data out of HTML tables in emails. The roadblock I'm running into is that some of these tables have blank, empty rows right in the middle, as shown in the picture below. This blank space causes my code to fail (IndexError: list index out of range) when it tries to extract text from the cells. How can I get around this IndexError?

Is it possible to tell Python: "Okay, if you run into these errors from these blank rows, just stop there, take the rows you've pulled text from so far, and execute the rest of the code on those"...?

That sounds like a silly solution, but my project only involves grabbing the data for the most recent date in the table, which is always in the first few rows, and always above these blank, empty rows.

So if it's possible to say "if you hit this error, just ignore it and keep going", then I'd like to learn how to do that. If not, I'll have to figure out another way around this. Any and all help is appreciated.

The table with the gap: enter image description here

My code:

from bs4 import BeautifulSoup, NavigableString, Tag 
import pandas as pd 
import numpy as np 
import os 
import re 
import email 
import cx_Oracle 

dsnStr = cx_Oracle.makedsn("sole.nefsc.noaa.gov", "1526", "sole") 
con = cx_Oracle.connect(user="user", password="password", dsn=dsnStr) 

def celltext(cell):
    '''
    textlist = []
    for br in cell.findAll('br'):
        next = br.nextSibling
        if not (next and isinstance(next, NavigableString)):
            continue
        next2 = next.nextSibling
        if next2 and isinstance(next2, Tag) and next2.name == 'br':
            text = str(next).strip()
            if text:
                textlist.append(next)
    return (textlist)
    '''
    textlist = []
    y = cell.find('span')
    for a in y.childGenerator():
        if isinstance(a, NavigableString):
            textlist.append(str(a))
    return (textlist)

path = 'Z:\\blub_2' 

for filename in os.listdir(path):
    file_path = os.path.join(path, filename)
    if os.path.isfile(file_path):
        html = open(file_path, 'r').read()
        soup = BeautifulSoup(html, 'lxml')  # Parse the HTML as a string
        table = soup.find_all('table')[1]   # Grab the second table

df_Quota = pd.DataFrame() 

for row in table.find_all('tr'):
    columns = row.find_all('td')
    if columns[0].get_text().strip() != 'ID':  # skip header
        Quota = celltext(columns[1])
        Weight = celltext(columns[2])
        price = celltext(columns[3])

        print(Quota)

        Nrows = max([len(Quota), len(Weight), len(price)])  # get the max number of rows

        IDList = [columns[0].get_text()] * Nrows
        DateList = [columns[4].get_text()] * Nrows

        if price[0].strip() == 'Package':
            price = [columns[3].get_text()] * Nrows

        if len(Quota) < len(Weight):  # if Quota has fewer items, extend with NaN
            lstnans = [np.nan] * (len(Weight) - len(Quota))
            Quota.extend(lstnans)

        if len(price) < len(Quota):  # if price column has fewer items than quota column,
            val = [columns[3].get_text()] * (len(Quota) - len(price))  # extend with whatever
            price.extend(val)                                          # is in the price column

        #if len(DateList) > len(Quota):  # if DateList is longer than Quota,
        #    print("it's longer than")
        #    value = [columns[4].get_text()] * (len(DateList) - len(Quota))
        #    DateList = value * Nrows

        if len(Quota) < len(DateList):  # if Quota is shorter than DateList (due to gap),
            stu = [np.nan] * (len(DateList) - len(Quota))  # extend with NaN
            Quota.extend(stu)

        if len(Weight) < len(DateList):
            dru = [np.nan] * (len(DateList) - len(Weight))
            Weight.extend(dru)

        FinalDataframe = pd.DataFrame({
            'ID': IDList,
            'AvailableQuota': Quota,
            'LiveWeightPounds': Weight,
            'price': price,
            'DatePosted': DateList
        })

        df_Quota = df_Quota.append(FinalDataframe, ignore_index=True)
        #df_Quota = df_Quota.loc[df_Quota['DatePosted'] == '5/20']
        df_Q = df_Quota['DatePosted'].iloc[0]
        df_Quota = df_Quota[df_Quota['DatePosted'] == df_Q]
print(df_Quota)

for filename in os.listdir(path):
    file_path = os.path.join(path, filename)
    if os.path.isfile(file_path):
        with open(file_path, 'r') as f:
            pattern = re.compile(r'Sent:.*?\b(\d{4})\b')
            email = f.read()
            dates = pattern.findall(email)
            if dates:
                print("Date:", ''.join(dates))

#cursor = con.cursor() 
#exported_data = [tuple(x) for x in df_Quota.values] 
#sql_query = ("INSERT INTO ROUGHTABLE(species, date_posted, stock_id, pounds, money, sector_name, ask)" "VALUES (:1, :2, :3, :4, :5, 'NEFS 2', '1')") 
#cursor.executemany(sql_query, exported_data) 
#con.commit() 

#cursor.close() 
#con.close() 

Just use 'try' and 'catch' to catch the 'IndexError' and ignore it. –


@Sarathsp ... if there was a 'catch'. – tdelaney


You can use an exception handler, or check the size of things before you try to index into them. That's a lot of code, and there's no hint of where the error actually occurs. If you could boil it down to a simple example, it would help with a solution. – tdelaney
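
(For reference, a minimal sketch of the size-check alternative tdelaney mentions. The threshold of 5 cells is an assumption based on the question's loop indexing columns[0] through columns[4]:)

for row in table.find_all('tr'):
    columns = row.find_all('td')
    if len(columns) < 5:  # assumed threshold: blank/short rows lack the cells indexed below
        continue          # skip them before any columns[i] lookup can raise IndexError
    # ... proceed with the normal row processing from the question ...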

Answers


**continue** is the keyword to use to skip the empty/problem rows. The IndexError is thanks to trying to access columns[0] on an empty columns list. So when the exception shows up, just skip to the next row.

for row in table.find_all('tr'):
    columns = row.find_all('td')
    try:
        if columns[0].get_text().strip() != 'ID':
            ...  # Rest as above in original code.
    except IndexError:
        continue

So that sounds like it should work, but it doesn't ... it produces the same error on the same line ('price = celltext(columns[3])'). – theprowler


That means we have to handle empty rows as well as rows with fewer columns (possibly fewer than 4). Then the obvious solution is try..except: move the whole row-processing body into the try, with **continue** in the 'IndexError' block. –
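
(A minimal sketch of what that suggestion might look like applied to the question's loop; the celltext calls and column names are taken from the question, and everything that indexes into columns now sits inside the try:)

for row in table.find_all('tr'):
    columns = row.find_all('td')
    try:
        if columns[0].get_text().strip() != 'ID':  # skip header
            Quota = celltext(columns[1])
            Weight = celltext(columns[2])
            price = celltext(columns[3])
            # ... rest of the row-processing code from the question ...
    except IndexError:
        # a blank row yields an empty columns list, and a short row has
        # fewer than 4 cells; either way, move on to the next row
        continue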


Use try: ... except: ...

try:
    ...  # extract data from table
except IndexError:
    ...  # execute rest of program
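
(Applied to the question's code, that pattern might look like the sketch below, which matches what the asker originally described: when the first blank row raises IndexError, the loop stops, and df_Quota still holds everything collected from the rows above the gap. The structure is an assumption built from the question's own loop:)

try:
    for row in table.find_all('tr'):
        columns = row.find_all('td')
        if columns[0].get_text().strip() != 'ID':
            # ... build FinalDataframe and append it to df_Quota, as in the question ...
            pass
except IndexError:
    pass  # first blank row reached; df_Quota already has the rows above it

# the rest of the program runs with the data collected so far
print(df_Quota)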