How do I scrape data from a table using a CSS selector and BeautifulSoup?

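On this page there is a series of tables, and I am trying to get one specific value from an unnamed cell in an unnamed table. I used Copy selector from Chrome's Inspect Element to find the CSS selector. When I ask Python to print the result for that CSS selector, I get "'NoneType' object is not callable".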

Specifically, on this page I am trying to display the number "198" from the table under #general-info, article:nth-child(4), table:nth-child(2).

The CSS selector path is:

"html body div#program-details section#general-info article.grid-50 table tbody tr td"

This is what comes up when I use Copy selector:

#general-info > article:nth-child(4) > table:nth-child(2) > tbody > tr > td:nth-child(2)

Most of the code logs in to the site and gets past the EULA. Skip to the bottom of the code for the part I am having trouble with.

import mechanize 
import requests 
import urllib2 
import urllib 
import csv 
from BeautifulSoup import BeautifulSoup 

br = mechanize.Browser() 
br.set_handle_robots(False) 
br.addheaders = [("User-agent","Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.13) Gecko/20101206 Ubuntu/10.10 (maverick) Firefox/3.6.13")] 

sign_in = br.open('https://login.ama-assn.org/account/login') #the login url 

br.select_form(name = "go") #Alternatively you may use this instead of the above line if your form has name attribute available. 

br["username"] = "wasabinoodlz" #the key "username" is the variable that takes the username/email value 
br["password"] = "Bongshop10" #the key "password" is the variable that takes the password value 
logged_in = br.submit() #submitting the login credentials 
logincheck = logged_in.read() #reading the page body that is redirected after successful login 
#print (logincheck) #printing the body of the redirected url after login 


# EULA agreement stuff 
cont = br.open('https://freida.ama-assn.org/Freida/eula.do').read() 
cont1 = br.open('https://freida.ama-assn.org/Freida/eulaSubmit.do').read() 

# Begin request for page data 
req = br.open('https://freida.ama-assn.org/Freida/user/programDetails.do?pgmNumber=1205712369').read() 

#Da Soups! 
soup = BeautifulSoup(req) 
#print soup.prettify() # use this to read html.prettify() 


for score in soup.select('#general-info > article:nth-child(4) > table:nth-child(2) > tbody > tr > td:nth-child(2)'): 
    print score.string


2017-08-08 Justin Khine

Sorry, scratch that. Following that path, I can only find one nested '' element – Mangohero1

There is a table in the first 'article class="grid-50"' that gives the "Total program size". Then there is another table in the second 'article class="grid-50"'. Both are nested in '' tags –

OK, I'll look into it – Mangohero1

You need to initialize BeautifulSoup with the html5lib parser.

soup = BeautifulSoup(req, 'html5lib')
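One likely reason the parser matters (an assumption, since the page source is not shown here): Chrome's Copy selector is generated from the browser's DOM, where table rows are wrapped in an implied tbody even when the raw HTML has none. html5lib builds the tree the way a browser does, while html.parser keeps the markup as written. The sketch below uses made-up markup and only html.parser to show the mismatch. It imports from bs4, which is what the two-argument constructor above and select() assume.

```python
from bs4 import BeautifulSoup  # the bs4 package, not the old BeautifulSoup 3 module

# Hypothetical markup standing in for the real page: the source has no <tbody>.
html = """
<section id="general-info">
  <table>
    <tr><td>Some label</td><td>198</td></tr>
  </table>
</section>
"""

# html.parser keeps the markup as written, so the tree contains no <tbody>.
soup = BeautifulSoup(html, "html.parser")

# Selector shaped like the one copied from the browser (includes tbody).
with_tbody = soup.select("#general-info > table > tbody > tr > td:nth-of-type(2)")

# Same selector without the tbody step.
without_tbody = soup.select("#general-info > table > tr > td:nth-of-type(2)")

print(len(with_tbody))                       # number of matches with tbody in the path
print([td.string for td in without_tbody])   # text of the cells matched without tbody
```

With html5lib installed, passing 'html5lib' instead of 'html.parser' is expected to reverse the two results, because the implied tbody is then present in the tree.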

BeautifulSoup implements only the nth-of-type pseudo-class, so replace each nth-child in the copied selector with nth-of-type.

data = soup.select(
    '#general-info > '
    'article:nth-of-type(4) > '
    'table:nth-of-type(2) > '
    'tbody > '
    'tr > '
    'td:nth-of-type(2)'
)
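If you have several copied selectors to convert, the substitution can be scripted. This is a minimal sketch using only the standard library; the helper name nth_child_to_nth_of_type is made up for illustration. Note that the two pseudo-classes are not equivalent in general: nth-child counts every sibling element, while nth-of-type counts only siblings with the same tag, so the index may need adjusting when other element types precede the target.

```python
import re


def nth_child_to_nth_of_type(selector):
    """Rewrite every :nth-child(...) in a CSS selector as :nth-of-type(...).

    Hypothetical helper: it is a plain text substitution and does not
    check whether the two selectors match the same element on a page.
    """
    # The leading colon keeps :nth-last-child(...) and similar untouched.
    return re.sub(r":nth-child\(", ":nth-of-type(", selector)


# The selector from the question, as produced by Chrome's Copy selector.
copied = ("#general-info > article:nth-child(4) > table:nth-child(2) "
          "> tbody > tr > td:nth-child(2)")

converted = nth_child_to_nth_of_type(copied)
print(converted)
```

The converted string can then be passed to soup.select() in place of the hand-written one above.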


2017-08-09 10:59:19
