<div class="columns small-5 medium-4 cell header">Ref No.</div> 
<div class="columns small-7 medium-8 cell">110B60329</div>               

find_all with Beautiful Soup yields blank returns for the div tags on https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+BOSWALL+PARKWAY%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&DISPLAY_COUNT=10&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY&UARN=110B60329&PPRN=000000000001745&ASSESSOR_IDX=10&DISPLAY_MODE=FULL#results

I want to run a loop and get back "110B60329". I ran Beautiful Soup and did a find_all('div'), then defined two different sets of tags, 'head' and 'data', based on their classes. I then iterated over the 'head' tags, expecting the loop to return the information held in the div tags I defined as 'data'.

Python returns nothing (the cmd prompt just re-prints the file path).

Would anyone know how I can fix this? My full code is..... Thanks

import requests 
from bs4 import BeautifulSoup as soup 
import csv 


url = 'https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+BOSWALL+PARKWAY%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&DISPLAY_COUNT=10&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY&UARN=110B60329&PPRN=000000000001745&ASSESSOR_IDX=10&DISPLAY_MODE=FULL#results' 

baseurl = 'https://www.saa.gov.uk' 

session = requests.session() 

response = session.get(url) 

# content of search page in soup 
html= soup(response.content,"lxml") 
properties_col = html.find_all('div') 



for col in properties_col: 
    ref = 'n/a' 
    des = 'n/a' 

    head = col.find_all("div",{"class": "columns small-5 medium-4 cell header"}) 

    data = col.find_all("div",{"class":"columns small-7 medium-8 cell"}) 

    for i, elem in enumerate(head):
        # for i in range(elems):
        if head [i].text == "Ref No.":
            ref = data[i].text
            print ref

Answers

1

You can do this in two ways.

1) If you are sure the site you are scraping will not change its layout, you can find all the divs of that class and get the content by index.

2) Find all the left-hand divs (the headers) and, if one of them matches what you want, get the text of its next sibling.

Example:

import requests 
from bs4 import BeautifulSoup as soup 

url = 'https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+BOSWALL+PARKWAY%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&DISPLAY_COUNT=10&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY&UARN=110B60329&PPRN=000000000001745&ASSESSOR_IDX=10&DISPLAY_MODE=FULL#results' 

baseurl = 'https://www.saa.gov.uk' 

session = requests.session() 

response = session.get(url) 

# content of search page in soup 
html = soup(response.content,"lxml") 

#Method 1 
LeftBlockData = html.find_all("div", class_="columns small-7 medium-8 cell") 
Reference = LeftBlockData[0].get_text().strip() 
Description = LeftBlockData[2].get_text().strip() 
print(Reference) 
print(Description) 

#Method 2 
for column in html.find_all("div", class_="columns small-5 medium-4 cell header"):
    RightColumn = column.next_sibling.next_sibling.get_text().strip()
    if "Ref No." in column.get_text().strip():
        print(RightColumn)
    if "Description" in column.get_text().strip():
        print(RightColumn)

The print statements will output (in order):

110B60329

STORE

110B60329

STORE

Your problem is that you are trying to match a node's text, which is padded with whitespace, against an unstripped string.

For example, your head [i].text variable contains 'Ref No.' surrounded by whitespace, so comparing it against 'Ref No.' gives the wrong result. Stripping it will fix that.
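
As a minimal sketch of that fix applied to the loop from the question (assuming head and data are the lists already built with find_all above):

for i, elem in enumerate(head):
    # strip the node text before comparing, so the surrounding whitespace no longer matters
    if head[i].get_text().strip() == "Ref No.":
        ref = data[i].get_text().strip()
        print(ref)  # 110B60329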

+0

Thank you very much, that works –

+0

Hello, I followed the same logic and tried to pull the 'Rateable Value' from the same page by adding the lines RightBlockData = html.find_all("div", class_="columns small-12 medium-5") Rateable_Value = RightBlockData[2].get_text().strip() –

+0

But I get an error with RightBlockData = html.find_all("div", class_="columns small-12 medium-5") Rateable_Value = RightBlockData[2].get_text().strip() –

1
import requests 
from bs4 import BeautifulSoup 

r = requests.get("https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+BOSWALL+PARKWAY%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&DISPLAY_COUNT=10&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY&UARN=110B60329&PPRN=000000000001745&ASSESSOR_IDX=10&DISPLAY_MODE=FULL#results") 
soup = BeautifulSoup(r.text, 'lxml') 
for row in soup.find_all(class_='table-row'):
    print(row.get_text(strip=True, separator='|').split('|'))

Output:

['Ref No.', '110B60329'] 
['Office', 'LOTHIAN VJB'] 
['Description', 'STORE'] 
['Property Address', '29 BOSWALL PARKWAY', 'EDINBURGH', 'EH5 2BR'] 
['Proprietor', 'SCOTTISH MIDLAND CO-OP SOCIETY LTD.'] 
['Tenant', 'PROPRIETOR'] 
['Occupier'] 
['Net Annual Value', '£1,750'] 
['Marker'] 
['Rateable Value', '£1,750'] 
['Effective Date', '01-APR-10'] 
['Other Appeal', 'NO'] 
['Reval Appeal', 'NO'] 

get_text() is a very powerful tool: you can strip the whitespace from the text and insert a separator.

You can use this method to get clean data and filter it.
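
For example, a minimal sketch that builds a label-to-values mapping from those split rows (assuming the first item in each row is the label), so fields such as 'Ref No.' or 'Rateable Value' can be looked up directly:

records = {}
for row in soup.find_all(class_='table-row'):
    # first item of each split row is the label, the remaining items are its values
    parts = row.get_text(strip=True, separator='|').split('|')
    if parts:
        records[parts[0]] = parts[1:]

print(records.get('Ref No.'))        # ['110B60329']
print(records.get('Rateable Value')) # ['£1,750']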

+1

Thanks for this approach. –