2017-10-05 60 views

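Scraping a specific section of Wikipedia with Ruby's Nokogiri

I want to parse only the Filmography section of this page: https://en.wikipedia.org/wiki/Morgan_Freeman.

Here is what I have tried so far: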

require 'nokogiri' 
require 'open-uri' 

actor = "Morgan_Freeman" 
html = Nokogiri::HTML(URI.open("https://en.wikipedia.org/wiki/" + actor)) 


output = File.new(actor + ".txt", 'w+') 


person = html.at_css('#firstHeading').text # gets the name 
bday = html.at_css('.bday').text # birthday 
filmo_list = html.at_css('.div-col') # the div that wraps all the Filmography 
parsed_film = [] # list to add those Films 

filmo_list.at_css('i').each { |l| puts l } 

This is where I am stuck.

I found that filmo_list returns:

<div class="div-col columns column-width" style="-moz-column-width: 20em; -webkit-column-width: 20em; column-width: 20em;"> 
<ul> 
<li> 
<i><a href="/wiki/Brubaker" title="Brubaker">Brubaker</a></i> (1980)</li> 
<li> 
<i><a href="/wiki/Marie_(film)" title="Marie (film)">Marie</a></i> (1985)</li> 
<li> 
<i><a href="/wiki/That_Was_Then..._This_Is_Now" title="That Was Then... This Is Now">That Was Then... This Is Now</a></i> (1985)</li> 
<li> 
<i><a href="/wiki/Street_Smart_(film)" title="Street Smart (film)">Street Smart</a></i> (1987)</li> 
<li> 
<i><a href="/wiki/Glory_(1989_film)" title="Glory (1989 film)">Glory</a></i> (1989)</li> 
<li> 
<i><a href="/wiki/Driving_Miss_Daisy" title="Driving Miss Daisy">Driving Miss Daisy</a></i> (1989)</li> 
<li> 
<i><a href="/wiki/Lean_on_Me_(film)" title="Lean on Me (film)">Lean on Me</a></i> (1989)</li> 
<li> 
<i><a href="/wiki/Johnny_Handsome" title="Johnny Handsome">Johnny Handsome</a></i> (1989)</li> 
<li> 
<i><a href="/wiki/Robin_Hood:_Prince_of_Thieves" title="Robin Hood: Prince of Thieves">Robin Hood: Prince of Thieves</a></i> (1991)</li> 
<li> 
<i><a href="/wiki/Unforgiven_(1992_film)" class="mw-redirect" title="Unforgiven (1992 film)">Unforgiven</a></i> (1992)</li> 
<li> 
<i><a href="/wiki/The_Shawshank_Redemption" title="The Shawshank Redemption">The Shawshank Redemption</a></i> (1994)</li> 
<li> 
<i><a href="/wiki/Outbreak_(film)" title="Outbreak (film)">Outbreak</a></i> (1995)</li> 
<li> 
<i><a href="/wiki/Seven_(1995_film)" title="Seven (1995 film)">Seven</a></i> (1995)</li> 
<li> 
<i><a href="/wiki/Moll_Flanders_(1996_film)" title="Moll Flanders (1996 film)">Moll Flanders</a></i> (1996)</li> 
<li> 
<i><a href="/wiki/Amistad_(1997_film)" class="mw-redirect" title="Amistad (1997 film)">Amistad</a></i> (1997)</li> 
<li> 
<i><a href="/wiki/Kiss_the_Girls_(film)" class="mw-redirect" title="Kiss the Girls (film)">Kiss the Girls</a></i> (1997)</li> 
<li> 
<i><a href="/wiki/Deep_Impact_(film)" title="Deep Impact (film)">Deep Impact</a></i> (1998)</li> 
<li> 
<i><a href="/wiki/Nurse_Betty" title="Nurse Betty">Nurse Betty</a></i> (2000)</li> 
<li> 
<i><a href="/wiki/Along_Came_a_Spider_(film)" title="Along Came a Spider (film)">Along Came a Spider</a></i> (2001)</li> 
<li> 
<i><a href="/wiki/The_Sum_of_All_Fears_(film)" title="The Sum of All Fears (film)">The Sum of All Fears</a></i> (2002)</li> 
<li> 
<i><a href="/wiki/High_Crimes" title="High Crimes">High Crimes</a></i> (2002)</li> 
<li> 
<i><a href="/wiki/Bruce_Almighty" title="Bruce Almighty">Bruce Almighty</a></i> (2003)</li> 
<li> 
<i><a href="/wiki/Million_Dollar_Baby" title="Million Dollar Baby">Million Dollar Baby</a></i> (2004)</li> 
<li> 
<i><a href="/wiki/Unleashed_(film)" title="Unleashed (film)">Unleashed</a></i> (2005)</li> 
<li> 
<i><a href="/wiki/An_Unfinished_Life" title="An Unfinished Life">An Unfinished Life</a></i> (2005)</li> 
<li> 
<i><a href="/wiki/Batman_Begins" title="Batman Begins">Batman Begins</a></i> (2005)</li> 
<li> 
<i><a href="/wiki/Lucky_Number_Slevin" title="Lucky Number Slevin">Lucky Number Slevin</a></i> (2006)</li> 
<li> 
<i><a href="/wiki/10_Items_or_Less_(film)" title="10 Items or Less (film)">10 Items or Less</a></i> (2006)</li> 
<li> 
<i><a href="/wiki/Evan_Almighty" title="Evan Almighty">Evan Almighty</a></i> (2007)</li> 
<li> 
<i><a href="/wiki/Gone,_Baby,_Gone" class="mw-redirect" title="Gone, Baby, Gone">Gone, Baby, Gone</a></i> (2007)</li> 
<li> 
<i><a href="/wiki/The_Bucket_List" title="The Bucket List">The Bucket List</a></i> (2007)</li> 
<li> 
<i><a href="/wiki/Feast_of_Love" title="Feast of Love">Feast of Love</a></i> (2007)</li> 
<li> 
<i><a href="/wiki/Wanted_(2008_film)" title="Wanted (2008 film)">Wanted</a></i> (2008)</li> 
<li> 
<i><a href="/wiki/The_Dark_Knight_(film)" title="The Dark Knight (film)">The Dark Knight</a></i> (2008)</li> 
<li> 
<i><a href="/wiki/Invictus_(film)" title="Invictus (film)">Invictus</a></i> (2009)</li> 
<li> 
<i><a href="/wiki/Red_(2010_film)" title="Red (2010 film)">RED</a></i> (2010)</li> 
<li> 
<i><a href="/wiki/Dolphin_Tale" title="Dolphin Tale">Dolphin Tale</a></i> (2011)</li> 
<li> 
<i><a href="/wiki/The_Dark_Knight_Rises" title="The Dark Knight Rises">The Dark Knight Rises</a></i> (2012)</li> 
<li> 
<i><a href="/wiki/The_Magic_of_Belle_Isle" title="The Magic of Belle Isle">The Magic of Belle Isle</a></i> (2012)</li> 
<li> 
<i><a href="/wiki/Olympus_Has_Fallen" title="Olympus Has Fallen">Olympus Has Fallen</a></i> (2013)</li> 
<li> 
<i><a href="/wiki/Oblivion_(2013_film)" title="Oblivion (2013 film)">Oblivion</a></i> (2013)</li> 
<li> 
<i><a href="/wiki/Now_You_See_Me_(film)" title="Now You See Me (film)">Now You See Me</a></i> (2013)</li> 
<li> 
<i><a href="/wiki/Last_Vegas" title="Last Vegas">Last Vegas</a></i> (2013)</li> 
<li> 
<i><a href="/wiki/The_Lego_Movie" title="The Lego Movie">The Lego Movie</a></i> (2014)</li> 
<li> 
<i><a href="/wiki/Transcendence_(2014_film)" title="Transcendence (2014 film)">Transcendence</a></i> (2014)</li> 
<li> 
<i><a href="/wiki/Lucy_(2014_film)" title="Lucy (2014 film)">Lucy</a></i> (2014)</li> 
<li> 
<i><a href="/wiki/Dolphin_Tale_2" title="Dolphin Tale 2">Dolphin Tale 2</a></i> (2014)</li> 
<li> 
<i><a href="/wiki/Momentum_(2015_film)" title="Momentum (2015 film)">Momentum</a></i> (2015)</li> 
<li> 
<i><a href="/wiki/Ted_2" title="Ted 2">Ted 2</a></i> (2015)</li> 
<li> 
<i><a href="/wiki/London_Has_Fallen" title="London Has Fallen">London Has Fallen</a></i> (2016)</li> 
<li> 
<i><a href="/wiki/Now_You_See_Me_2" title="Now You See Me 2">Now You See Me 2</a></i> (2016)</li> 
<li> 
<i><a href="/wiki/Going_in_Style_(2017_film)" title="Going in Style (2017 film)">Going In Style</a></i> (2017)</li> 
<li> 
<i><a href="/wiki/The_Nutcracker_and_the_Four_Realms" title="The Nutcracker and the Four Realms">The Nutcracker and the Four Realms</a></i> (2018)</li> 
</ul> 
</div> 

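So it is basically a bunch of <li> elements inside one huge <ul>.

I want to parse the "Brubaker (1980)" part of the div and add it to "parsed_film", but I am not sure how to access each item in the "filmo_list" div.

Please help!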

Answers

1

Do it like this:

parsed_film = html.css('.div-col li').map(&:text) 
puts parsed_film 

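What this does: html.css('.div-col li') selects a NodeSet containing every list item. We then iterate over the items and call text on each one to get the text inside the li.

If you do not want the year in the parsed films, go one level deeper to the i element: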

parsed_film = html.css('.div-col li i').map(&:text) 

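To correct your own approach, use css instead of at_css. css returns the set of all elements in the DOM that match the selector, while at_css returns only the first matching element of that set. You need the whole set here.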

filmo_list.css('i').each { |x| puts x.text } 
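
Thank you so much! It works! One thing I would like to ask: is there a way to apply this to other actors' pages as well? Their "Filmography" sections may be formatted differently, but my idea is to find the section that contains the text "Filmography" and then look for the closest div that contains all of the film titles. – Alibaba17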


You could look up `html.at_css('#Filmography').parent` and then move to the next sibling until you reach a `div`. – kiddorails


How do we express the condition "go to the next sibling until you get a div"? – Alibaba17