2017-06-02 161 views

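Extracting text from HTML using Python

I hope someone can help me. I am fairly new to Python, and I want to scrape data from a website that unfortunately requires an account. However, I cannot extract the date (i.e. 2017-06-01) from the following HTML: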

<li class="latest-value-item"> 
    <div class="latest-value-label">Date</div> 
    <div class="latest-value">2017-06-01</div> 
</li> 
<li class="latest-value-item"> 
    <div class="latest-value-label">Index</div> 
    <div class="latest-value">1430</div> 
</li> 

Here is my code:

import urllib3 
import urllib.request 
from bs4 import BeautifulSoup 
import pandas as pd 
import requests 
import csv 
from datetime import datetime 

url = 'https://www.quandl.com/data/LLOYDS/BCI-Baltic-Capesize-Index' 
r = requests.get(url) 
soup = BeautifulSoup(r.text, 'lxml') 

Baltic_Indices = [] 
New_Value = [] 

#new = soup.find_all('div', attrs={'class':'latest-value'}).get_text() 
date = soup.find_all(class_="latest value") 
text1 = date.text 

print(text1) 

Possible duplicate of [Extracting text from HTML file using Python](https://stackoverflow.com/questions/328356/extracting-text-from-html-file-using-python) – Umair

Answers


date = soup.find_all(class_="latest value")

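You used the wrong CSS class name ('latest value' != 'latest-value'). Also note that find_all returns a list of matching elements, so .text has to be read from each element rather than from the result itself: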

print(soup.find_all(attrs={'class': 'latest-value'})) 
# [<div class="latest-value">2017-06-01</div>, <div class="latest-value">1430</div>] 

for element in soup.find_all(attrs={'class': 'latest-value'}): 
    print(element.text) 
# 2017-06-01 
# 1430 

I prefer using the attrs keyword argument, but your approach also works (given the correct CSS class name):

for element in soup.find_all(class_='latest-value'): 
    print(element.text) 
# 2017-06-01 
# 1430
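
If you need the date specifically rather than every value, one option is to pair each label with its value. The following is only a minimal sketch: it parses the HTML fragment from the question as an inline string (the live page requires an account, and its markup may differ), uses the built-in 'html.parser' instead of 'lxml' so no extra parser is needed, and the dictionary name `latest` is purely illustrative.

```python
from bs4 import BeautifulSoup

# HTML fragment copied from the question; the real page may look different.
html = """
<li class="latest-value-item">
    <div class="latest-value-label">Date</div>
    <div class="latest-value">2017-06-01</div>
</li>
<li class="latest-value-item">
    <div class="latest-value-label">Index</div>
    <div class="latest-value">1430</div>
</li>
"""

soup = BeautifulSoup(html, "html.parser")

# Map each label (e.g. "Date") to the value inside the same <li>.
latest = {}
for item in soup.find_all("li", class_="latest-value-item"):
    label = item.find("div", class_="latest-value-label")
    value = item.find("div", class_="latest-value")
    if label is not None and value is not None:
        latest[label.get_text(strip=True)] = value.get_text(strip=True)

print(latest["Date"])
# 2017-06-01
```

Looking values up by label avoids depending on the order in which the items appear on the page.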