Scraping a list of URLs

I am using Python 3.5 and trying to scrape a list of URLs (all from the same website). My code is as follows:

import urllib.request 
from bs4 import BeautifulSoup 



url_list = ['URL1',
            'URL2', 'URL3']

def soup():
    for url in url_list:
        sauce = urllib.request.urlopen(url)
        for things in sauce:
            soup_maker = BeautifulSoup(things, 'html.parser')
            return soup_maker

# Scraping
def getPropNames():
    for propName in soup.findAll('div', class_="property-cta"):
        for h1 in propName.findAll('h1'):
            print(h1.text)

def getPrice():
    for price in soup.findAll('p', class_="room-price"):
        print(price.text)

def getRoom():
    for theRoom in soup.findAll('div', class_="featured-item-inner"):
        for h5 in theRoom.findAll('h5'):
            print(h5.text)


for soups in soup():
    getPropNames()
    getPrice()
    getRoom()

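So far, if I print soup, getPropNames, getPrice or getRoom individually, they seem to work. But I cannot get the code to go through each URL and print the results of getPropNames, getPrice and getRoom.

I have only been learning Python for a few months, so any help is greatly appreciated!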

Source

2017-02-17 Maverick

Think about what this code does:

def soup():
    for url in url_list:
        sauce = urllib.request.urlopen(url)
        for things in sauce:
            soup_maker = BeautifulSoup(things, 'html.parser')
            return soup_maker

Let me show you an example:

def soup2():
    for url in url_list:
        print(url)
        for thing in ['a', 'b', 'c']:
            print(url, thing)
            maker = 2 * thing
            return maker

And the output for url_list = ['one', 'two', 'three'] is:

one
one a

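Do you see now what is going on?

Basically, your soup function exits at the first return it reaches. It does not return an iterator or a list, only the first BeautifulSoup object, and you are lucky (or not) that this object is itself iterable :)

So change the code: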

def soup3():
    soups = []
    for url in url_list:
        print(url)
        for thing in ['a', 'b', 'c']:
            print(url, thing)
            maker = 2 * thing
            soups.append(maker)
    return soups

Then the output is:

one
one a
one b
one c
two
two a
two b
two c
three
three a
three b
three c

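But I believe this will not work either :) Ask yourself what sauce actually is after sauce = urllib.request.urlopen(url), and what your code iterates over in for things in sauce, that is, what things is. Iterating over the response object yields the body one line at a time (as bytes), so each things is a single line of HTML rather than the whole page.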
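One way to answer that question without touching the network is to iterate over an in-memory stand-in for the response. This is only a sketch: io.BytesIO plays the role of the object returned by urlopen (both are binary file-like objects that yield lines when iterated), and the HTML is made up for illustration.

```python
import io

# Made-up page body; io.BytesIO stands in for the response from urlopen().
fake_response = io.BytesIO(
    b'<div class="property-cta">\n'
    b'  <h1>Example Hall</h1>\n'
    b'</div>\n'
)

# Iterating over a binary file-like object yields one line (bytes) at a time.
lines = list(fake_response)
for line in lines:
    print(line)

# No single line holds both the <div> and its nested <h1>, so a parser fed
# one line at a time can never see the <h1> as a child of the <div>.
lines_with_both = [l for l in lines if b'property-cta' in l and b'<h1>' in l]
print(lines_with_both)
```

If each line is handed to BeautifulSoup separately, lookups that depend on nesting across lines have nothing to match, which is why reading the whole body first is usually the safer choice.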

Happy coding.

Source

2017-02-17 13:44:33 opalczynski

Thanks Sebastian Opałczyński, I will take that on board, try to get my head around it, and let you know the result! – Maverick

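Each of the get* functions uses the global name soup, which is never properly set anywhere. Even if it were, this would not be a good approach. Make soup a function parameter instead, for example: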

def getRoom(soup):
    for theRoom in soup.findAll('div', class_="featured-item-inner"):
        for h5 in theRoom.findAll('h5'):
            print(h5.text)

for soup in soups():
    getPropNames(soup)
    getPrice(soup)
    getRoom(soup)
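The failure of the global lookup can be reproduced in a few lines. In the question, the name soup refers to the function defined with def soup(), not to a parsed page, so soup.findAll looks for an attribute on a function object. A minimal sketch, where the function body is only a placeholder and get_price is a hypothetical stand-in for the question's getPrice:

```python
def soup():
    # Placeholder body; in the question this builds a BeautifulSoup object.
    return "parsed page"

def get_price():
    # 'soup' resolves to the global function above, not to a parsed page.
    return soup.findAll('p', class_="room-price")

try:
    get_price()
except AttributeError as exc:
    message = str(exc)
    print(message)
```

Passing the parsed page in as an argument removes the dependency on whatever the global name happens to point to.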

Second, you should use yield instead of return in soup() to turn it into a generator. Otherwise, you need to return a list of BeautifulSoup objects.

def soups():
    for url in url_list:
        sauce = urllib.request.urlopen(url)
        for things in sauce:
            soup_maker = BeautifulSoup(things, 'html.parser')
            yield soup_maker
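Note that the generator above still builds one soup per line of the response, because it iterates over sauce. A variant that parses each page as a whole might look like the sketch below. The fetch parameter, the fetch_html helper and the FAKE_PAGES dictionary are hypothetical, added only so the example runs without network access; with real URLs, the default reads the full body via urllib.request.urlopen(url).read().

```python
import urllib.request
from bs4 import BeautifulSoup

def fetch_html(url):
    # Read the whole response body at once instead of iterating line by line.
    with urllib.request.urlopen(url) as response:
        return response.read()

def soups(url_list, fetch=fetch_html):
    # Yield one BeautifulSoup object per URL, parsed from the complete page.
    for url in url_list:
        yield BeautifulSoup(fetch(url), 'html.parser')

# Hypothetical pages standing in for real responses.
FAKE_PAGES = {
    'URL1': b'<div class="property-cta">\n<h1>First Hall</h1>\n</div>',
    'URL2': b'<div class="property-cta">\n<h1>Second Hall</h1>\n</div>',
}

names = []
for soup in soups(['URL1', 'URL2'], fetch=FAKE_PAGES.get):
    for div in soup.find_all('div', class_='property-cta'):
        for h1 in div.find_all('h1'):
            names.append(h1.get_text(strip=True))
print(names)
```

Because each soup now covers a full page, the nested h1 is found even though it sits on a different line from its enclosing div.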

I would also suggest using XPath or CSS selectors to extract the HTML elements: https://stackoverflow.com/a/11466033/2997179.

Source
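As one illustration of the CSS selector route, BeautifulSoup itself offers a select() method, so the nested loops can collapse into a single selector. This sketch assumes that method rather than whatever library the linked answer recommends, and the HTML is invented to match the class names in the question.

```python
from bs4 import BeautifulSoup

# Invented markup that reuses the class names from the question.
SAMPLE_HTML = """
<div class="property-cta"><h1>Example Hall</h1></div>
<p class="room-price">120 per week</p>
<div class="featured-item-inner"><h5>En-suite</h5></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# 'div.property-cta h1' matches every <h1> nested inside a matching <div>.
prop_names = [h1.get_text(strip=True) for h1 in soup.select('div.property-cta h1')]
prices = [p.get_text(strip=True) for p in soup.select('p.room-price')]
rooms = [h5.get_text(strip=True) for h5 in soup.select('div.featured-item-inner h5')]

print(prop_names, prices, rooms)
```

Returning lists instead of printing inside the helper also makes it easier to check which selector comes back empty for a given page.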

2017-02-17 13:47:03

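Thanks Martin Valgur, this is very insightful. I will look into XPath/CSS. When applying your suggestions, I get the following error message: AttributeError: 'function' object has no attribute 'findAll'. Any ideas? – Maverick

Did you add the soup parameter to all of the functions? I would also suggest renaming the soup() function to soups(). –

Thanks, that was my mistake! However, it only seems to work for getPrice. The other two do not return anything. Strange, because when I first wrote these functions I used a single URL and they all worked perfectly. – Maverick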
