2017-08-14

I'm trying to do some data mining. I have all the URLs in an array, but as soon as I try to feed them into the scraper it gives me this error:

$TypeError: list indices must be integers, not Tag -- python 

Here is the full code for my scraper:

s = sched.scheduler(time.time, time.sleep)

def myScraper(sc):
    csv_f = csv.reader(f)
    quote_page = []

    for row in csv_f:
        quote_page.append(url + row[0])

    i = 1
    for var in quote_page:
        num_dat = []
        txt_dat = []
        num_dat2 = []
        txt_dat2 = []

        s.enter(5, 1, myScraper, (sc,))
        sleep(5)

        print(quote_page[i])

        page = urlopen(quote_page[i])

        i = i + 1

        soup = BeautifulSoup(page, 'html.parser')
        data_store = []
        for tr in soup.find_all('tr'):  # find table rows
            tds = tr.find_all('td', attrs={'class': 'fieldData'})  # find all table cells
            for i in tds:  # returns all cells from html rows
                if i != []:  # pops out empty cells from returned data
                    data_store.append(i.text)
                    #print(i.text)
                    #print("\n")
        data_store2 = []
        for tr in soup.find_all('tr'):
            tds2 = tr.find_all('td', attrs={'class': 'improvementsFieldData'})
            for i in tds2:
                if i != []:
                    data_store2.append(i.text)

        for j in data_store:
            if ',' in j and ' ' not in j:
                lft_dec = j[:j.index(',')].replace('$', '')
                rght_dec = j[j.index(','):].replace(',', '')  # drop the decimal
                num_dat.append(float(lft_dec + rght_dec))  # convert to numerical data
            else:
                txt_dat.append(j)

        for j in data_store2:
            if ',' in j and ' ' not in j:
                lft_dec = j[:j.index(',')].replace('$', '')
                rght_dec = j[j.index(','):].replace(',', '').replace('Sq. Ft', '')  # drop the decimal and Sq
                num_dat2.append(float(lft_dec + rght_dec))  # convert to numerical data
            elif ('Sq. Ft' and ',') in j:
                sqft_dat_befcm = j[:j.index(',')].replace(',', '')
                sqft_dat_afcm = j[j.index(','):].replace(' ', '').replace('Sq.Ft', '').replace(',', '')
                num_dat2.append(float(sqft_dat_befcm + sqft_dat_afcm))
            else:
                txt_dat2.append(j)
        print(num_dat)
        print(txt_dat)
        print(num_dat2)
        print(txt_dat2)


s.enter(5, 1, myScraper, (s,))
s.run()
f.close
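As an aside, the comma-splitting logic above can be expressed more compactly. A minimal standalone sketch (not the asker's exact code; the helper name `parse_number` is hypothetical) that strips `$`, commas, and the `Sq. Ft` suffix before converting:

```python
def parse_number(cell_text):
    """Convert a scraped cell like '$1,234' or '1,500 Sq. Ft' to a float.

    Returns None when the cell is not numeric, so the caller can keep
    it as text data instead.
    """
    cleaned = (cell_text.replace('$', '')
                        .replace(',', '')
                        .replace('Sq. Ft', '')
                        .strip())
    try:
        return float(cleaned)
    except ValueError:
        return None

print(parse_number('$1,234'))        # 1234.0
print(parse_number('1,500 Sq. Ft'))  # 1500.0
print(parse_number('Residential'))   # None
```

A single helper like this replaces both `data_store` loops and avoids the fragile `j.index(',')` slicing.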

Basically my goal for this program is, given the URLs, to open and scrape the first one in the array, then wait a set interval of time and repeat until the array is finished.

EDIT*** Sorry, first time posting on here. Below is the full stack trace:

Traceback (most recent call last): 
  File "C:\Users\Ahmad\Desktop\HouseProject\AhmadsScraper.py", line 85, in <module> 
    s.run() 
  File "C:\Users\Ahmad\Anaconda2\lib\sched.py", line 117, in run 
    action(*argument) 
  File "C:\Users\Ahmad\Desktop\HouseProject\AhmadsScraper.py", line 32, in myScraper 
    print(quote_page[i]) 
TypeError: list indices must be integers, not Tag 
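The traceback shows `quote_page[i]` failing because `i` is no longer an integer. A minimal sketch reproducing the same class of error without BeautifulSoup, using a hypothetical `FakeTag` class as a stand-in for a bs4 `Tag`:

```python
class FakeTag:
    """Stand-in for a bs4 Tag object; any non-integer index fails the same way."""

quote_page = ['http://example.com/a', 'http://example.com/b']
i = FakeTag()  # by the time quote_page[i] runs, i holds a Tag, not a counter

try:
    quote_page[i]
except TypeError as e:
    print('TypeError:', e)
```

The exact message varies by Python version, but the cause is identical: a list was indexed with an object instead of an `int`.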

Could you provide the full traceback, so people can understand which line throws the error? –


Yes! Thank you for responding! – Matherz

Answer


The problem is that you are using the same variable `i` both as a counter and as the inner loop variable. How about using `enumerate` instead?

for idx, var in enumerate(quote_page): 
    ... 
    print(quote_page[idx]) 

    page = urlopen(quote_page[idx]) 
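A self-contained sketch of why this fixes it (dummy URLs, no network access):

```python
quote_page = ['http://example.com/page1', 'http://example.com/page2']

for idx, var in enumerate(quote_page):
    # idx is always an int supplied by enumerate, so quote_page[idx]
    # can never raise the "list indices must be integers" TypeError,
    # even if inner loops rebind other names to bs4 Tag objects
    print(idx, quote_page[idx])
```

`enumerate` pairs each element with its position, so no manually maintained counter can be clobbered by an inner loop.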

Dude, thank you so much! Haha, can't believe I missed it, it makes sense now! I just fixed it and it works! Thanks again! – Matherz
