
Scraping multiple pages from a website (BeautifulSoup, Requests, Python 3)

I would like to know how I can scrape several different pages of one website with Beautiful Soup / Requests without having to repeat my code over and over again.

Below is my current code, which crawls the tourist attractions of certain cities:

from bs4 import BeautifulSoup
import requests

RegionIDArray = [187147, 187323, 186338]
dict = {187147: 'Paris', 187323: 'Berlin', 186338: 'London'}
already_printed = set()

for reg in RegionIDArray:
    # "oa" appears to be the result offset; the listing shows 30 entries per page.
    for page in range(1, 700, 30):
        r = requests.get("https://www.tripadvisor.de/Attractions-c47-g" + str(reg) + "-oa" + str(page) + ".html")
        soup = BeautifulSoup(r.content, "html.parser")

        g_data = soup.find_all("div", {"class": "element_wrap"})

        for item in g_data:
            header = item.find_all("div", {"class": "property_title"})
            item = header[0].text.strip()
            if item not in already_printed:
                already_printed.add(item)

                print("POI: " + str(item) + " | " + "Location: " + str(dict[reg]) + " | " + "Art: Museum ")

So far everything works. As a next step, I would like to scrape the most popular museums of these cities in addition to the tourist attractions.

To do that, I have to modify the request by changing the -c parameter so that I get all the museums I want:

r = requests.get("https://www.tripadvisor.de/Attractions-c" + str(museumIDArray) + "-g" + str(reg) + "-oa" + str(page) + ".html")
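
Note that calling str() on a Python list yields its repr, brackets, comma and space included, and that text ends up verbatim in the URL. A quick illustration using the IDs from this question:

museumIDArray = [47, 49]
# str() turns the list into the text "[47, 49]", which is spliced into the URL as-is:
url = "https://www.tripadvisor.de/Attractions-c" + str(museumIDArray) + "-g187147-oa1.html"
print(url)  # https://www.tripadvisor.de/Attractions-c[47, 49]-g187147-oa1.html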

My code then looks like this:

from bs4 import BeautifulSoup
import requests

RegionIDArray = [187147, 187323, 186338]
museumIDArray = [47, 49]
dict = {187147: 'Paris', 187323: 'Berlin', 186338: 'London'}
already_printed = set()

for reg in RegionIDArray:
    for page in range(1, 700, 30):
        r = requests.get("https://www.tripadvisor.de/Attractions-c" + str(museumIDArray) + "-g" + str(reg) + "-oa" + str(page) + ".html")
        soup = BeautifulSoup(r.content, "html.parser")

        g_data = soup.find_all("div", {"class": "element_wrap"})

        for item in g_data:
            header = item.find_all("div", {"class": "property_title"})
            item = header[0].text.strip()
            if item not in already_printed:
                already_printed.add(item)

                print("POI: " + str(item) + " | " + "Location: " + str(dict[reg]) + " | " + "Art: Museum ")

That does not seem quite right. The output I get does not include all the museums and tourist attractions for some of the cities.

Can anybody help me with this? Any feedback is appreciated.


Your code will throw an error. Also, doesn't the dict in your code shadow the Python builtin?


@PadraicCunningham What do you mean by "shadowing a Python builtin"? Sorry if I am getting on your nerves, but I am still a beginner.


dict is a Python type/function, and it is best to avoid shadowing it, i.e. using the same names as builtins for your variables. Could you add a link and explain what exactly you are trying to parse?
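
To illustrate that point for other readers (plain Python semantics, values taken from the question):

dict = {187147: 'Paris', 187323: 'Berlin', 186338: 'London'}
# The name dict now refers to this literal, so the builtin is shadowed:
# dict(one=1) would now raise TypeError: 'dict' object is not callable.
print(dict[187147])  # lookups still work, which hides the problem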

Answer


All the names are inside anchor tags within the divs with the property_title class.

for reg in RegionIDArray:
    for page in range(1, 700, 30):
        r = requests.get("https://www.tripadvisor.de/Attractions-c" + str(museumIDArray) + "-g" + str(reg) + "-oa" + str(page) + ".html")
        soup = BeautifulSoup(r.content, "html.parser")

        # Pull the text of every anchor inside a property_title div.
        for item in (a.text for a in soup.select("div.property_title a")):
            if item not in already_printed:
                already_printed.add(item)
                print("POI: " + str(item) + " | " + "Location: " + str(dct[reg]) + " | " + "Art: Museum ")

It is also better to get the links from the pagination div:

from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin


RegionIDArray = [187147, 187323, 186338]
museumIDArray = [47, 49]
dct = {187147: 'Paris', 187323: 'Berlin', 186338: 'London'}
already_printed = set()

def get_names(soup):
    # Print every attraction name we have not seen yet.
    for item in (a.text for a in soup.select("div.property_title a")):
        if item not in already_printed:
            already_printed.add(item)
            print("POI: {} | Location: {} | Art: Museum ".format(item, dct[reg]))

base = "https://www.tripadvisor.de"
for reg in RegionIDArray:
    r = requests.get("https://www.tripadvisor.de/Attractions-c[47,49]-g{}-oa.html".format(reg))
    soup = BeautifulSoup(r.content, "html.parser")

    # Get links to all next pages, skipping the current one.
    all_pages = (urljoin(base, a["href"]) for a in soup.select("div.unified.pagination a.pageNum.taLnk")[1:])
    # Use the helper function to print the names.
    get_names(soup)

    # Visit all remaining pages.
    for url in all_pages:
        soup = BeautifulSoup(requests.get(url).content, "html.parser")
        get_names(soup)
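
If the bracketed category list in that URL does not return everything, a variant worth trying is to request one -c category at a time and factor the fetch into a small helper. This is only a sketch: get_soup is a name of my own choosing, and it assumes each category ID from the question (47 and 49) is valid on its own in the URL:

def get_soup(reg, cat, page=1):
    # One category ID per request instead of a whole Python list.
    url = "https://www.tripadvisor.de/Attractions-c{}-g{}-oa{}.html".format(cat, reg, page)
    return BeautifulSoup(requests.get(url).content, "html.parser")

for reg in RegionIDArray:
    for cat in museumIDArray:
        get_names(get_soup(reg, cat))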

Thank you very much for the feedback. But now I get the following error message: Traceback (most recent call last): File "C:/Users/Raju/Desktop/Scripts/nnnn.py", line 25, in <module> get_names(soup) File "C:/Users/Raju/Desktop/Scripts/nnnn.py", line 15, in get_names print("POI: {} | Location: {} | " + "Art: Museum ".format(item.dict[reg])) AttributeError: 'str' object has no attribute 'dict'. Can you help me? What is wrong?


@SeriousRuffy, are you using dict?


@Padriac It should be dct, just as you wrote it in your code above. I also tried it with "dict". However, I get the same error message.
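
For anyone who runs into the same AttributeError: the traceback shows a period where a comma belongs, so format() receives a lookup of the attribute dict on the string item instead of two separate arguments. With the comma restored, the line inside get_names reads:

print("POI: {} | Location: {} | Art: Museum ".format(item, dct[reg]))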