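Extracting numbers from a website using BeautifulSoup in Python

I am trying to use urllib to fetch an HTML page and then BeautifulSoup to extract the data. I want to get all the numbers from comments_42.html, print their sum, and then show how many numbers there are. Here is my code; I have been trying to use a regular expression, but it isn't working for me.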

import urllib 
from bs4 import BeautifulSoup 
url = 'http://python-data.dr-chuck.net/comments_42.html' 
html = urllib.urlopen(url).read() 
soup = BeautifulSoup(html,"html.parser") 
tags = soup('span') 
for tag in tags: 
    print tag

2015-12-13 Salosha

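1. You aren't using a regular expression, as far as I can see; 2. What does *"doesn't work"* mean? – jonrsharpe

I mean I got stuck when using regular expressions, probably because of my weak programming skills. – Salosha

So? This isn't a tutorial service. *Give it a try.* – jonrsharpe

Use BeautifulSoup's findAll() method to extract all span tags with the class 'comments', since they contain the information you need. You can then do whatever you need with them.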

soup = BeautifulSoup(html,"html.parser") 
data = soup.findAll("span", { "class":"comments" }) 
numbers = [d.text for d in data]

Here is the output:

[u'100', u'97', u'87', u'86', u'86', u'78', u'75', u'74', u'72', u'72', u'72', u'70', u'70', u'66', u'66', u'65', u'65', u'63', u'61', u'60', u'60', u'59', u'59', u'57', u'56', u'54', u'52', u'52', u'51', u'47', u'47', u'41', u'41', u'41', u'38', u'35', u'32', u'31', u'24', u'19', u'19', u'18', u'17', u'16', u'13', u'8', u'7', u'1', u'1', u'1']
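
The question also asks for the sum and the number of values, which the list above does not yet give. A minimal sketch of that last step, assuming Python 3 and using a shortened sample of the output (the names values, count, and total are illustrative, not from the answer):

```python
# Sample values copied from the start of the output above; the real list is longer.
numbers = ['100', '97', '87']

# The scraped values are text, so convert each one to int before doing arithmetic.
values = [int(n) for n in numbers]

count = len(values)
total = sum(values)

print('count:', count)
print('sum:', total)
```

The same two lines (len and sum) should apply unchanged to the full list returned by findAll once each element's text has been converted.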

2015-12-13 09:14:26 Learner

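Thanks, this works well for me. Is there a way to get rid of the "u'"? Sorry for replying so late; I need a VPN to get through the GFW and reach the site, which is why I couldn't reply sooner. – Salosha

Use numbers = [d.text.encode('utf-8') for d in data] – Learner

@Learner's solution is completely correct! But if you want both the names and the comment counts, you can do this, which returns a list of names and comment counts: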

from BeautifulSoup import BeautifulSoup 
import re 
import urllib 
url = 'http://python-data.dr-chuck.net/comments_42.html' 
html = urllib.urlopen(url).read() 
soup = BeautifulSoup(html) 
all = soup.findAll('span',{'class':'comments'},text=re.compile(r'[0-9]{0,4}')) # text= makes findAll return strings; this regex also matches the empty string, so names are returned as well as numbers
cleaned = filter(lambda x: x!=u'\n',all)[4:] 
In [18]: cleaned 
Out[18]: 
[u'Leven', 
u'100', 
u'Mahdiya', 
u'97', 
u'Ajayraj', 
u'87', 
u'Lillian', 
u'86', 
u'Aon', 
u'86', 
u'Ruaraidh', 
u'78', 
u'Gursees', 
u'75', 
u'Emmanuel', 
u'74', 
u'Christy', 
u'72', 
u'Annoushka', 
u'72', 
u'Inara', 
u'72', 
u'Caite', 
u'70', 
u'Rosangel', 
u'70', 
u'Iana', 
u'66', 
u'Anise', 
u'66', 
u'Jaosha', 
u'65', 
u'Cadyn', 
u'65', 
u'Edward', 
u'63', 
u'Charlotte', 
u'61', 
u'Sammy', 
u'60', 
u'Zarran', 
u'60',.....] #
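
Since the list above alternates between a name and a count, the two can be paired up afterwards. A rough sketch, assuming Python 3 and that the alternation holds for every entry; the shortened sample list and the name pairs are illustrative, not part of the answer:

```python
# Sample values copied from the start of the output above (shortened for illustration).
cleaned = ['Leven', '100', 'Mahdiya', '97', 'Ajayraj', '87']

# Even positions hold names, odd positions hold counts; zip pairs them up.
pairs = [(name, int(count)) for name, count in zip(cleaned[0::2], cleaned[1::2])]

for name, count in pairs:
    print(name, count)

# Sum only the numeric half of each pair.
total = sum(count for _, count in pairs)
print('sum:', total)
```

If a stray non-numeric string ever lands in an odd position, int() will raise a ValueError, which is a useful signal that the alternation assumption broke.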

2015-12-13 09:35:03

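Awesome! You used a regular expression, which is exactly what I wanted, but how can I get rid of the "u" in the list? Sorry for replying so late; there are two internets in the world, China's and everyone else's, and it is hard for me to check the answers over a VPN. – Salosha

@Saikorin: You'll find that it is a unicode string! You can use the **encode()** method to convert it to a normal string. For example, if ustr = u'str' is unicode, then str = ustr.encode() is a normal string. –

I see, but I'm still a bit confused by unicode output in Python, so I'll look into it. Thank you and Learner, that cleared up 100% of my confusion! – Salosha

Don't forget that you have to import the regular expression module if you want to use regular expressions in your code.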
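
On the u prefix discussed in these comments: it is only part of how Python 2 displays a unicode string inside a list, and the data itself is just the text. A small sketch, assuming Python 3 (where every str is Unicode and the prefix no longer appears), with illustrative variable names:

```python
text = '100'                    # in Python 3 every str is already Unicode
encoded = text.encode('utf-8')  # encoding produces a bytes object

print(repr(text))      # the repr of a Python 3 str has no u prefix
print(repr(encoded))   # the repr of bytes carries a b prefix instead
print(int(text) + 1)   # int() accepts the text directly, no encoding needed
```

In Python 2, printing a single element (print numbers[0]) also shows the value without the prefix, so encoding is only needed when a byte string is actually required.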

import re

2015-12-21 02:35:48 cybernerd

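I'm taking the same course on Coursera. Instead of going with the solutions above, would you mind trying this one? It stays within what the course has covered up to this problem. It definitely worked for me.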

import urllib 
import re 
from bs4 import * 

url = 'http://python-data.dr-chuck.net/comments_216543.html' 
html = urllib.urlopen(url).read() 

soup = BeautifulSoup(html,"html.parser") 
sum=0 
# Retrieve all of the span tags
tags = soup('span')
for tag in tags:
    # Look at the parts of a tag
    y=str(tag)
    x= re.findall("[0-9]+",y)
    for i in x:
        i=int(i)
        sum=sum+i
print sum
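
One thing to keep in mind with running the regex over str(tag): it sees the whole tag, attributes included. The sketch below, assuming Python 3, uses two hypothetical tag strings (the class name col2 is made up for illustration) to show when that matters and one possible way around it:

```python
import re

# Hypothetical tag strings, standing in for str(tag) on each span.
plain = '<span class="comments">97</span>'
with_digit_in_attr = '<span class="col2">97</span>'

# On the plain tag only the number in the text is found.
plain_matches = re.findall("[0-9]+", plain)

# Here the digit inside the class attribute is matched as well.
attr_matches = re.findall("[0-9]+", with_digit_in_attr)

# Matching only the text between the tags avoids picking up attribute digits.
inner = re.findall(">([0-9]+)<", with_digit_in_attr)

print(plain_matches)
print(attr_matches)
print(inner)
```

For the page in the question the spans only carry class="comments", so the simpler pattern should behave the same there.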

2016-01-14 11:52:50 Tuhin

The basic way to do it...

# Retrieve all of the span tags
tags = soup('span')
sum = 0
count = 0
for tag in tags:
    # Look at the parts of a tag
    #print tag.contents[0] 
    num = float(tag.contents[0]) 
    #print num 
    sum = sum + num 
    count = count + 1 

print 'count:',count 
print 'sum:',sum

2016-01-20 05:36:55 JPAbucay

I did it this way for the course, and it gave me all the correct answers. Hope it helps ;)

from urllib.request import urlopen 
from bs4 import BeautifulSoup 
import ssl 

# Ignore SSL certificate errors 
ctx = ssl.create_default_context() 
ctx.check_hostname = False 
ctx.verify_mode = ssl.CERT_NONE 

url = input('Enter - ') 
html = urlopen(url, context=ctx).read() 
soup = BeautifulSoup(html,"html.parser") 

# Retrieve all of the span tags
tags = soup('span')
sum = 0
count = 0
for tag in tags:
    # Look at the parts of a tag
    #print tag.contents[0] 
    num = float(tag.contents[0]) 
    #print num 
    sum = sum + num 
    count = count + 1 

print ('count:', count) 
print ('sum:', sum)
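
If BeautifulSoup is not available, the same count and sum can be sketched with only the standard library's html.parser. This is a different technique from the code above, not a drop-in equivalent; the class name SpanNumberParser and the sample markup are assumptions for illustration:

```python
from html.parser import HTMLParser


class SpanNumberParser(HTMLParser):
    """Collect integers found in the text of <span> elements."""

    def __init__(self):
        super().__init__()
        self.in_span = False
        self.numbers = []

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            self.in_span = True

    def handle_endtag(self, tag):
        if tag == 'span':
            self.in_span = False

    def handle_data(self, data):
        # Only keep text that sits inside a span and is purely digits.
        stripped = data.strip()
        if self.in_span and stripped.isdigit():
            self.numbers.append(int(stripped))


# Assumed sample markup; the real page would be fetched with urlopen as above.
sample_html = (
    '<tr><td>Leven</td><td><span class="comments">100</span></td></tr>'
    '<tr><td>Mahdiya</td><td><span class="comments">97</span></td></tr>'
)

parser = SpanNumberParser()
parser.feed(sample_html)

count = len(parser.numbers)
total = sum(parser.numbers)
print('count:', count)
print('sum:', total)
```

Because the values are parsed as int rather than float, the sum prints as a whole number.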

2017-07-27 23:52:43 Anna

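import urllib.request,urllib.parse,urllib.error
import re
from bs4 import BeautifulSoup

url = input('Enter - ')
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html,"html.parser")

tags=soup('span')
sum=0
for tag in tags:
    # re.findall needs a string, so search the tag's text rather than the Tag object
    x=re.findall("[0-9]+",tag.get_text())
    for i in x:
        # Convert each match to int and add the converted value, not the string
        z=int(i)
        sum=sum+z

print(sum)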

2017-09-20 23:18:42

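Welcome to Stack Overflow. Please edit your answer to format the code and add an explanation of what your code does and why the OP should use it, or how it is a better solution than the accepted answer. – Syfer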
