I want to scrape this website using Scrapy. How can I select and extract the text between two elements? The page structure is as follows:
<div class="list">
<a id="follows" name="follows"></a>
<h4 class="li_group">Follows</h4>
<div class="soda odd"><a href="...">Star Trek</a></div>
<div class="soda even"><a href="...">...</a></div>
<div class="soda odd"><a href="..">Star Trek: The Motion Picture</a></div>
<div class="soda even"><a href="..">Star Trek II: The Wrath of Khan</a></div>
<div class="soda odd"><a href="..">Star Trek III: The Search for Spock</a></div>
<div class="soda even"><a href="..">Star Trek IV: The Voyage Home</a></div>
<a id="followed_by" name="followed_by"></a>
<h4 class="li_group">Followed by</h4>
<div class="soda odd"><a href="..">Star Trek V: The Final Frontier</a></div>
<div class="soda even"><a href="..">Star Trek VI: The Undiscovered Country</a></div>
<div class="soda odd"><a href="..">Star Trek: Deep Space Nine</a></div>
<div class="soda even"><a href="..">Star Trek: Generations</a></div>
<div class="soda odd"><a href="..">Star Trek: Voyager</a></div>
<div class="soda even"><a href="..">First Contact</a></div>
<a id="spin_off" name="spin_off"></a>
<h4 class="li_group">Spin-off</h4>
<div class="soda odd"><a href="..">Star Trek: The Next Generation - The Transinium Challenge</a></div>
<div class="soda even"><a href="..">A Night with Troi</a></div>
<div class="soda odd"><a href="..">Star Trek: Deep Space Nine</a></div>
</div>
I want to select and extract the text between <h4 class="li_group">Follows</h4>
and <h4 class="li_group">Followed by</h4>,
and then the text between <h4 class="li_group">Followed by</h4>
and <h4 class="li_group">Spin-off</h4>.
I tried this code:
def parse(self, response):
    for sel in response.css("div.list"):
        item = ImdbcoItem()
        item['Follows'] = sel.css("a#follows+h4.li_group ~ div a::text").extract()
        item['Followed_by'] = sel.css("a#followed_by+h4.li_group ~ div a::text").extract()
        item['Spin_off'] = sel.css("a#spin_off+h4.li_group ~ div a::text").extract()
        return item
But the first item extracts all of the divs, not only the divs between <h4 class="li_group">Follows</h4>
and <h4 class="li_group">Followed by</h4>.
Any help would be really appreciated!
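The behaviour described above is what the CSS general sibling combinator `~` does: it matches every following sibling `div`, and does not stop at the next `h4`. One way to get only the links between two headings is to walk the elements in document order and file each link under the most recent heading. Below is a minimal, library-independent sketch of that idea using only Python's standard `html.parser`. The class name `GroupedLinkParser` and the shortened `SAMPLE` markup are assumptions made for illustration, not part of the original spider.

```python
from collections import defaultdict
from html.parser import HTMLParser


class GroupedLinkParser(HTMLParser):
    """Hypothetical helper: file each link text under the latest h4.li_group."""

    def __init__(self):
        super().__init__()
        self.groups = defaultdict(list)  # heading text -> list of link texts
        self.current = None              # heading we are currently under
        self._in_heading = False
        self._in_soda = False
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h4" and "li_group" in classes:
            # A new heading starts a new group.
            self._in_heading = True
            self.current = ""
        elif tag == "div" and "soda" in classes:
            self._in_soda = True
        elif tag == "a" and self._in_soda:
            # Only links inside div.soda count; the bare anchors are skipped.
            self._in_link = True
            self.groups[self.current].append("")

    def handle_endtag(self, tag):
        if tag == "h4":
            self._in_heading = False
        elif tag == "a":
            self._in_link = False
        elif tag == "div":
            self._in_soda = False

    def handle_data(self, data):
        if self._in_heading:
            self.current += data
        elif self._in_link:
            self.groups[self.current][-1] += data


# Shortened sample with the same structure as the page in the question.
SAMPLE = """
<div class="list">
  <a id="follows" name="follows"></a>
  <h4 class="li_group">Follows</h4>
  <div class="soda odd"><a href="..">Star Trek</a></div>
  <div class="soda even"><a href="..">Star Trek: The Motion Picture</a></div>
  <a id="followed_by" name="followed_by"></a>
  <h4 class="li_group">Followed by</h4>
  <div class="soda odd"><a href="..">Star Trek V: The Final Frontier</a></div>
  <a id="spin_off" name="spin_off"></a>
  <h4 class="li_group">Spin-off</h4>
  <div class="soda odd"><a href="..">A Night with Troi</a></div>
</div>
"""

parser = GroupedLinkParser()
parser.feed(SAMPLE)
parser.close()
groups = dict(parser.groups)
print(groups)
```

Inside a Scrapy callback, the same restriction is commonly expressed with an XPath that counts preceding headings, for example `//h4[.="Follows"]/following-sibling::div[count(preceding-sibling::h4)=1]/a/text()` for the first group and `count(preceding-sibling::h4)=2` for the second. Treat that expression as a sketch to verify against the real page rather than a guaranteed fit.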
Just in case it helps: doesn't imdb.com have an (un)official API? If I remember correctly, you can get all of this data from it already cleaned up. – Neil