
I want to scrape some stock-related data from the web for my project, and I have run into a couple of problems.
Problem 1:
I tried to grab the table from http://sharesansar.com/c/today-share-price.html.
Grabbing the table works, but the columns come out in the wrong order. For example, the 'Company Name' column ends up holding the 'Open Price' values. How can I fix this?
Problem 2:
I also tried to grab company-specific data from the 'Price History' tab at http://merolagani.com/CompanyDetail.aspx?symbol=ADBL.
This time I got an error while grabbing the table data, as shown in the screenshot below: "Error while grabbing table data from website".

Code:

import logging
import requests
from bs4 import BeautifulSoup
import pandas


module_logger = logging.getLogger('mainApp.dataGrabber')


class DataGrabberTable:
    '''Grabs the table data from a certain url.'''

    def __init__(self, url, csvfilename, columnName=[], tableclass=None):
        module_logger.info("Inside 'DataGrabberTable' constructor.")
        self.pgurl = url
        self.tableclass = tableclass
        self.csvfile = csvfilename
        self.columnName = columnName

        self.tableattrs = {'class': tableclass}  # to be passed in find()

        module_logger.info("Done.")

    def run(self):
        '''Call this to run the datagrabber. Returns 1 if an error occurs.'''

        module_logger.info("Inside 'DataGrabberTable.run()'.")

        try:
            self.rawpgdata = requests.get(self.pgurl, timeout=5).text
        except Exception as e:
            module_logger.warning('Error occurred: {0}'.format(e))
            return 1

        # module_logger.info('Headers from the server:\n {0}'.format(self.rawpgdata.headers))

        soup = BeautifulSoup(self.rawpgdata, 'lxml')

        module_logger.info('Connected and parsed the data.')

        table = soup.find('table', attrs=self.tableattrs)
        rows = table.find_all('tr')[1:]

        # initializing a dict in the format below
        # data = {'col1': [...], 'col2': [...], }
        # col1 and col2 are from the columnName list
        self.data = dict(zip(self.columnName, [list() for i in range(len(self.columnName))]))

        module_logger.info('Inside for loop.')
        for row in rows:
            cols = row.find_all('td')
            index = 0
            for key in self.data:
                if index > len(cols):
                    break
                self.data[key].append(cols[index].get_text())
                index += 1
        module_logger.info('Completed the for loop.')

        self.dataframe = pandas.DataFrame(self.data)  # make a pandas dataframe

        module_logger.info('writing to file {0}'.format(self.csvfile))
        self.dataframe.to_csv(self.csvfile)
        module_logger.info('written to file {0}'.format(self.csvfile))

        module_logger.info("Done.")
        return 0

    def getData(self):
        """Returns the 'data' dictionary."""
        return self.data


# Usage example

def main():
    url = "http://sharesansar.com/c/today-share-price.html"
    classname = "table"
    fname = "data/sharesansardata.csv"
    cols = [str(i) for i in range(18)]  # make a list of columns

    '''cols = [
        'S.No', 'Company Name', 'Symbol', 'Open price', 'Max price',
        'Min price', 'Closing price', 'Volume', 'Previous closing',
        'Turnover', 'Difference',
        'Diff percent', 'Range', 'Range percent', '90 days', '180 days',
        '360 days', '52 weeks high', '52 weeks low']'''

    d = DataGrabberTable(url, fname, cols, classname)
    if d.run() == 1:
        print('Data grabbing failed!')
    else:
        print('Data grabbing done.')


if __name__ == '__main__':
    main()

Any suggestions would help. Thank you!

Answer

Your cols list is missing an element; there are 19 columns, not 18:

>>> len([str(i) for i in range(18)]) 
18 
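
If you would rather verify that count against the live page than count by hand, a quick check along these lines should do (this assumes the header cells are <th> elements in the table's first row):

>>> import requests
>>> from bs4 import BeautifulSoup
>>> page = requests.get('http://sharesansar.com/c/today-share-price.html')
>>> table = BeautifulSoup(page.text, 'lxml').find('table', {'class': 'table'})
>>> len(table.find('tr').find_all('th'))
19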

Aside from that, you also seem to be overcomplicating things. The following should do it:

import requests 
from bs4 import BeautifulSoup 
import pandas as pd 

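# Grab today's price table and rebuild it using its own header row as the column names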
price_response = requests.get('http://sharesansar.com/c/today-share-price.html') 
price_table = BeautifulSoup(price_response.text, 'lxml').find('table', {'class': 'table'}) 
price_rows = [[cell.text for cell in row.find_all(['th', 'td'])] for row in price_table.find_all('tr')] 
price_df = pd.DataFrame(price_rows[1:], columns=price_rows[0]) 

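# Fetch each company's detail table; each <tbody> in it holds one <th>/<td> (label/value) pair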
com_df = None 
for symbol in price_df['Symbol']: 
    comp_response = requests.get('http://merolagani.com/CompanyDetail.aspx?symbol=%s' % symbol) 
    comp_table = BeautifulSoup(comp_response.text, 'lxml').find('table', {'class': 'table'}) 
    com_header, com_value = list(), list() 
    for tbody in comp_table.find_all('tbody'): 
        comp_row = tbody.find('tr')
        com_header.append(comp_row.find('th').text.strip().replace('\n', ' ').replace('\r', ' '))
        com_value.append(comp_row.find('td').text.strip().replace('\n', ' ').replace('\r', ' '))
    df = pd.DataFrame([com_value], columns=com_header) 
    com_df = df if com_df is None else pd.concat([com_df, df]) 

print(price_df) 
print(com_df) 
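
As for why your original code scrambles the columns: run() iterates over a plain dict (for key in self.data), and on CPython versions before 3.7 a dict does not preserve insertion order, so the keys can come back in an order that no longer matches the page's columns. Iterating the column list itself (or using collections.OrderedDict) keeps the pairing stable. A minimal, self-contained sketch of the idea, with shortened column names and made-up cell values for illustration:

from collections import OrderedDict
from bs4 import BeautifulSoup

html = '''<table>
<tr><th>S.No</th><th>Company Name</th><th>Symbol</th></tr>
<tr><td>1</td><td>Some Company</td><td>ABC</td></tr>
</table>'''

column_names = ['S.No', 'Company Name', 'Symbol']  # shortened for illustration
data = OrderedDict((name, []) for name in column_names)
for row in BeautifulSoup(html, 'lxml').find_all('tr')[1:]:
    # zip pairs each cell with its column in the declared order and stops at
    # the shorter sequence, so no manual index bookkeeping is needed
    for name, cell in zip(column_names, row.find_all('td')):
        data[name].append(cell.get_text())

print(data)  # columns keep their declared order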

I'm still getting mismatched columns (Problem 1). – Kishor


@Kishor See my edit. –


It works! Thank you very much. What about Problem 2? Did you find any error there? – Kishor