Looping through a list of values for multiple URL requests in Python

I'm trying to get several years of hourly data from multiple weather stations and put it into a pandas dataframe. I can't use the API because of its request limits, and I don't want to pay thousands of dollars for this data.

I can get the data I need from the script for a single station. When I try to modify it to loop through a list of stations, I either get a 406 error or it only returns data from the first station in the list. How can I loop through all of the stations? And how can I store the station name so it can be added to the dataframe as another column?

Here is what my code looks like right now:

from urllib.request import urlopen

import pandas as pd
from bs4 import BeautifulSoup

stations = ['EGMC', 'KSAT', 'CAHR']

weather_data = []
date = []

for s in stations:
    for y in range(2014, 2015):
        # check if this is a leap year
        if y % 400 == 0:
            leap = True
        elif y % 100 == 0:
            leap = False
        elif y % 4 == 0:
            leap = True
        else:
            leap = False

        for m in range(1, 13):
            for d in range(1, 32):
                # skip day numbers that do not exist in this month
                if m == 2 and leap and d > 29:
                    continue
                elif m == 2 and not leap and d > 28:
                    continue
                elif m in [4, 6, 9, 11] and d > 30:
                    continue

                timestamp = str(y) + str(m) + str(d)
                print('Getting data for ' + s + ' ' + timestamp)

                # build the URL; format the whole string, not just the last
                # piece, and pass the current station s, not the whole list
                url = ('http://www.wunderground.com/history/airport/{0}/{1}/{2}/{3}/'
                       'DailyHistory.html?HideSpecis=1').format(s, y, m, d)
                page = urlopen(url)

                # find the correct piece of data on the page
                soup = BeautifulSoup(page, 'lxml')

                for row in soup.select("table tr.no-metars"):
                    date.append(str(y) + '/' + str(m) + '/' + str(d))
                    cells = [cell.text.strip().encode('ascii', 'ignore').decode('ascii')
                             for cell in row.find_all('td')]
                    weather_data.append(cells)

weather_datadf = pd.DataFrame(weather_data)
datedf = pd.DataFrame(date)
result = pd.concat([datedf, weather_datadf], axis=1)
result
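
For the station-name column asked about in the question, a minimal sketch of the bookkeeping (with toy stand-in rows, since the real values come from the scraping loop): append the current station s to a parallel list every time a row's date is appended, then make that list its own column.

import pandas as pd

# toy stand-ins; in the real loop, append one entry per scraped table row
date = ['2014/1/1', '2014/1/1']
station_col = ['EGMC', 'KSAT']      # inside the row loop: station_col.append(s)
weather_data = [['12:00 AM', '10.0 C'], ['12:00 AM', '25.0 C']]

# the parallel list lines up with date and weather_data row for row
datedf = pd.DataFrame({'date': date, 'station': station_col})
weather_datadf = pd.DataFrame(weather_data)
result = pd.concat([datedf, weather_datadf], axis=1)
print(result)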

Answer


Here is an explanation of your error: https://httpstatuses.com/406

You should add a User-Agent to the request headers. But I think this site has some crawling protection, so you should use something more robust, such as Scrapy, Crawlera, a proxy list, and a user-agent rotator.
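
A minimal sketch of the header fix with the standard-library urllib.request (the user-agent string below is just an example value, not a required one):

from urllib.request import Request, urlopen

# servers that reply 406 Not Acceptable often reject the default Python
# user agent; sending browser-like headers can get past that check
url = ('http://www.wunderground.com/history/airport/EGMC/2014/1/1/'
       'DailyHistory.html?HideSpecis=1')
req = Request(url, headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',  # example value
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
})
page = urlopen(req)
print(page.getcode())

If the site still blocks you after that, rotating user agents and proxies between requests (which is what tools like Scrapy with Crawlera handle for you) is the usual next step.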