Web scraping with Python and BeautifulSoup

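I want to get a single paragraph of text from a website, but the way I am doing it returns the text of the whole page with all the HTML tags stripped. Is it possible to get back only the text of one particular paragraph?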

Here's my code:

import requests 
from bs4 import BeautifulSoup 

response = requests.get("https://en.wikipedia.org/wiki/Aras_(river)") 
txt = response.content 

soup = BeautifulSoup(txt,'lxml') 
filtered = soup.get_text() 
print(filtered)

Here's part of the text it prints:

>>>>Basin 


    Main source 
    Erzurum Province, Turkey 


    River mouth 
    Kura river 


    Physical characteristics 


    Length 
    1,072 km (666 mi) 


    The Aras or Araxes is a river in and along the countries of Turkey,  
    Armenia, Azerbaijan, and Iran. It drains the south side of the Lesser 
    Caucasus Mountains and then joins the Kura River which drains the north 
    side of those mountains. Its total length is 1,072 kilometres (666 mi). 
    Given its length and a basin that covers an area of 102,000 square 
    kilometres (39,000 sq mi), it is one of the largest rivers of the 
    Caucasus. 



    Contents 


    1 Names 
    2 Description 
    3 Etymology and history 
    4 Iğdır Aras Valley Bird Paradise 
    5 Gallery 
    6 See also 
    7 Footnotes

I only want to get this paragraph:

The Aras or Araxes is a river in and along the countries of Turkey,  
    Armenia, Azerbaijan, and Iran. It drains the south side of the Lesser 
    Caucasus Mountains and then joins the Kura River which drains the north 
    side of those mountains. Its total length is 1,072 kilometres (666 mi). 
    Given its length and a basin that covers an area of 102,000 square 
    kilometres (39,000 sq mi), it is one of the largest rivers of the 
    Caucasus.

Is it possible to filter the output down to just this paragraph?

Asked 2017-01-05 by Boneyflesh

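You should read more of the BeautifulSoup documentation. You can provide class names and XPaths to specify exactly which elements you want to retrieve data from. – JosephGarrone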

Will do, @JosephGarrone – Boneyflesh
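To illustrate the comment above, here is a minimal sketch of selecting an element by class name or by CSS selector with BeautifulSoup. The markup, the `content` id, and the `lead` class are made up for the example, and it uses the built-in `html.parser` rather than `lxml` so it runs without extra dependencies; a real page will need its own selectors.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a real page; the id and class names are made up.
SAMPLE_HTML = """
<div id="content">
  <table class="infobox"><tr><td>Length</td><td>1,072 km</td></tr></table>
  <p class="lead">The Aras is a river.</p>
  <p>It joins the Kura River.</p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Look up by class name (the keyword is class_ because "class" is reserved in Python).
lead_by_class = soup.find("p", class_="lead").get_text()

# The same element via a CSS selector: a <p class="lead"> somewhere inside #content.
lead_by_css = soup.select_one("#content p.lead").get_text()

print(lead_by_class)
```

Both lookups return the same paragraph here; which one reads better is mostly a matter of taste.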

soup = BeautifulSoup(txt,'lxml') 
filtered = soup.p.get_text() # get the first p tag. 
print(filtered)

Output:

The Aras or Araxes is a river in and along the countries of Turkey, Armenia, Azerbaijan, and Iran. It drains the south side of the Lesser Caucasus Mountains and then joins the Kura River which drains the north side of those mountains. Its total length is 1,072 kilometres (666 mi). Given its length and a basin that covers an area of 102,000 square kilometres (39,000 sq mi), it is one of the largest rivers of the Caucasus.
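Note that `soup.p` returns the first `<p>` anywhere in the document, which only works when that happens to be the paragraph you want. A slightly more defensive sketch, assuming the article text lives in a container with a known id, is to search inside that container and skip empty paragraphs. The helper name `first_nonempty_paragraph` and the sample markup are hypothetical, and `html.parser` is used here only to keep the example dependency-free.

```python
from bs4 import BeautifulSoup


def first_nonempty_paragraph(html, container_id):
    """Return the text of the first <p> with visible text inside the element
    whose id is container_id, or None if there is no such paragraph."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(id=container_id)
    if container is None:
        return None
    for p in container.find_all("p"):
        text = p.get_text().strip()
        if text:  # skip paragraphs that contain only whitespace
            return text
    return None


# Made-up markup: one paragraph outside the container and one empty paragraph inside it.
SAMPLE_HTML = """
<p>Site notice outside the article body.</p>
<div id="mw-content-text">
  <p> </p>
  <p>The <b>Aras</b> or Araxes is a river.</p>
  <p>Second paragraph.</p>
</div>
"""

print(first_nonempty_paragraph(SAMPLE_HTML, "mw-content-text"))
```

With the sample above, plain `soup.p.get_text()` would return the site notice instead.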

Answered 2017-01-05 03:33:15

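Use XPath instead! It is easier, more accurate, and designed specifically for this kind of use case. Unfortunately, BeautifulSoup does not support XPath directly. You need to use the lxml package instead: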

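from urllib.request import urlopen  # Python 3 replacement for Python 2's urllib2
from lxml import etree

response = urlopen("https://en.wikipedia.org/wiki/Aras_(river)")
parser = etree.HTMLParser()
tree = etree.parse(response, parser)
# string(...) makes xpath() return the paragraph text instead of a list of elements
print(tree.xpath('string(//*[@id="mw-content-text"]/p[1])'))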

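Explanation of the XPath:

// searches the whole document, starting from the root, for matches at any depth.

* matches any tag.

[@id="mw-content-text"] keeps only elements whose id attribute equals mw-content-text.

p[1] selects the first p element that is a direct child of that container.

The string function returns the string value (the concatenated text) of the selected element. If several elements match, only the first one is used.

By the way, if you use Google Chrome or Firefox, you can test XPath expressions inside DevTools with the $x function: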

$x('string(//*[@id="mw-content-text"]/p[1])')
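To try the pieces of that expression without hitting the network, here is a small sketch that runs them against an inline HTML string with lxml. The sample markup is made up; only the `mw-content-text` id is taken from the answer above.

```python
from lxml import etree

# Made-up markup that mimics the structure the XPath expects.
SAMPLE_HTML = """
<html><body>
  <div id="mw-content-text">
    <p>The <b>Aras</b> is a river.</p>
    <p>It joins the Kura River.</p>
  </div>
</body></html>
"""

tree = etree.fromstring(SAMPLE_HTML, etree.HTMLParser())

# Without string(), xpath() returns a list of the matching element objects.
paragraphs = tree.xpath('//*[@id="mw-content-text"]/p')

# p[1] is the first <p> child (XPath positions start at 1, not 0).
first = tree.xpath('string(//*[@id="mw-content-text"]/p[1])')

# string() over several nodes uses only the first one in document order.
also_first = tree.xpath('string(//*[@id="mw-content-text"]/p)')

print(len(paragraphs))
print(first)
```

Note that `/p` only matches direct children of the container, so if a page wraps its paragraphs in another element the expression has to be adjusted (for example with `//p`).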

Answered 2017-01-05 03:39:04 by bman
