Scraping JavaScript with Python and Selenium WebDriver

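I'm trying to scrape the ads from Ask, which are generated inside an iframe by JS hosted by Google.

When I navigate there manually and view the source, the ads are present (I'm looking for a div with the ID "adBlock", which sits inside an iframe).

But when I try Firefox, Chromedriver, or FirefoxPortable, the source returned to me is missing all of the elements I'm looking for.

I tried scraping with urllib2 and got the same result, even after adding the necessary headers. I was sure that a physical browser instance like the one WebDriver creates would solve the problem.

Here is the code I've been working with, which had to be cobbled together from a few different sources: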

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pprint

# Create a new instance of the Chrome driver
# (raw string so the backslashes in the Windows path are taken literally)
driver = webdriver.Chrome(r'C:\Python27\Chromedriver\chromedriver.exe')
driver.get("http://www.ask.com")

print(driver.title)
inputElement = driver.find_element_by_name("q")

# type in the search
inputElement.send_keys("baseball hats")
# submit the search form
inputElement.submit()

try:
    # wait up to 10 seconds for the results page title to appear
    WebDriverWait(driver, 10).until(EC.title_contains("baseball"))
    print(driver.title)
    output = driver.page_source
    print(output)
finally:
    driver.quit()

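I know I take a roundabout route to view the source; that's not what I'm worried about.

Any ideas why I get one result from this script (ads omitted) and a completely different result from the browser it opens (ads present)? I've tried Scrapy, Selenium, urllib2, and so on, with no joy.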


2014-01-30 Rob M

Selenium only shows the contents of the current frame or iframe. You have to switch into the iframes using something along these lines:

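iframes = driver.find_elements_by_tag_name("iframe")

for iframe in iframes:
    # return to the top-level document, then enter the next iframe
    driver.switch_to_default_content()
    driver.switch_to_frame(iframe)

    # page_source now reflects the iframe's document, not the outer page
    output = driver.page_source
    print(output)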
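
For readers on a newer Selenium release, where the `switch_to_frame` and `find_elements_by_*` helpers are no longer available, the same idea might look roughly like the sketch below. The function name `find_in_frames` and the default ID `"adBlock"` (taken from the question) are illustrative choices, not part of the original answer, and the sketch only inspects top-level iframes, not nested ones.

```python
# A minimal sketch, assuming Selenium 4-style calls (driver.switch_to.frame,
# driver.find_elements(by, value)). find_in_frames is a hypothetical helper.

TAG_NAME = "tag name"  # same string as selenium.webdriver.common.by.By.TAG_NAME
ID = "id"              # same string as selenium.webdriver.common.by.By.ID


def find_in_frames(driver, element_id="adBlock"):
    """Return the page_source of every top-level iframe containing element_id."""
    matches = []

    # Collect the iframe elements from the top-level document.
    driver.switch_to.default_content()
    iframes = driver.find_elements(TAG_NAME, "iframe")

    for iframe in iframes:
        # Go back to the top-level document before entering the next iframe.
        driver.switch_to.default_content()
        driver.switch_to.frame(iframe)

        # Inside the iframe, lookups and page_source refer to its own document.
        if driver.find_elements(ID, element_id):
            matches.append(driver.page_source)

    # Leave the driver pointing at the top-level document again.
    driver.switch_to.default_content()
    return matches
```

With a real browser, this would be called after `driver.get(...)` and the search has loaded, for example `find_in_frames(driver)`. Because the ad iframes are filled in by JavaScript, it may also be necessary to wait for them first, for instance with `WebDriverWait` and `expected_conditions.frame_to_be_available_and_switch_to_it`.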


2014-01-30 02:38:11 Richard

You are a mad scientist. Works like a charm, thanks.

Indeed, you are maaad! Works perfectly! – tmthyjames
