Python, web scraping: nested loop not working properly

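The nested loop over the variable j is not working properly. The debugger skips right over it, even though the variables it needs appear to be initialized correctly beforehand.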

from urllib.request import Request, urlopen 
# Get beautifulsoup4 with: pip install beautifulsoup4 
import bs4 
import pdb 
import sys 
import json 

site = "http://bgp.he.net/report/world" 
hdr = {'User-Agent': 'Mozilla/5.0'} 
req = Request(site,headers=hdr) 
page = urlopen(req) 
soup = bs4.BeautifulSoup(page, 'html.parser') 

for t in soup.find_all('td', class_='centeralign'): 
    s = str(t.string) 
    if s != "None": 
     print (s.strip()) 
     site2 = "http://bgp.he.net/country/" + s.strip() 
     req = Request(site2,headers=hdr) 
     soup2 = bs4.BeautifulSoup(page, 'html.parser') 

    for j in soup2.find_all('td'): 
     s2 = str(j.string) 
     print (j.strip())

Source

2017-07-28 Jeremy Villa

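What output do you expect? – Gahan

You are also parsing the same page over and over again. – Gahan

Possible duplicate of [Extracting information from a table except header of the table using bs4](https://stackoverflow.com/questions/37635847/extracting-information-from-a-table-except-header-of-the-table-using-bs4) – stovfl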

from urllib.request import Request, urlopen 
# Get beautifulsoup4 with: pip install beautifulsoup4 
import bs4 
import pdb 
import sys 
import json 

site = "http://bgp.he.net/report/world" 
hdr = {'User-Agent': 'Mozilla/5.0'} 
req = Request(site,headers=hdr) 
page = urlopen(req) 
soup = bs4.BeautifulSoup(page, 'html.parser') 

for t in soup.find_all('td', class_='centeralign'):
    s = str(t.string)
    if s != "None":
        print(s.strip())
        site2 = "http://bgp.he.net/country/" + s.strip()
        # The question never fetched the country page; it re-parsed `page`.
        req2 = Request(site2, headers=hdr)
        page2 = urlopen(req2)
        soup2 = bs4.BeautifulSoup(page2, 'html.parser')

        # Keep the inner loop inside the `if` so soup2 always exists here.
        for j in soup2.find_all('td'):
            s2 = str(j.text)
            print(s2.strip())  # strip the string s2, not the tag j
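
A likely reason the inner loop looked "skipped" in the debugger: the question passes the already-consumed `page` object to BeautifulSoup a second time. A response stream can only be read once, so the second parse sees empty input and `find_all('td')` has nothing to iterate over. The sketch below illustrates this with the standard library only. `io.BytesIO` is a stand-in for the object returned by `urlopen`, and `TdCounter` / `count_td` are hypothetical helpers that play the role of `soup.find_all('td')`; none of them appear in the original code.

```python
import io
from html.parser import HTMLParser


class TdCounter(HTMLParser):
    """Counts <td> start tags (a stand-in for len(soup.find_all('td')))."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.count += 1


def count_td(stream):
    """Read the whole stream, parse it, and return the number of <td> tags."""
    parser = TdCounter()
    parser.feed(stream.read().decode("utf-8"))
    parser.close()
    return parser.count


# Assumption: BytesIO mimics the one-shot, file-like response from urlopen().
page = io.BytesIO(b"<table><tr><td>US</td><td>BR</td></tr></table>")

first = count_td(page)   # the first read consumes the stream
second = count_td(page)  # the second read gets b"", so nothing is found

print(first, second)
```

Fetching each country URL into its own response object, as the answer does with `page2`, avoids the exhausted stream.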

Source

2017-07-28 13:05:20 Gahan

Thanks, I feel like an idiot –
