2016-06-09 25 views
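Scraping a JavaScript URL, but Selenium returns an empty string

I want to open, and then scrape data from, the URL contained in a script tag that looks like this: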

<script src="http://includes.mpt-static.com/data/7CE5047496" type="text/javascript" charset="utf-8"></script> 

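I tried using Selenium to retrieve/open the URL, but it just returns a blank string. What confuses me is that when I click the src URL directly from the view-source page, a page opens and shows the table of data I want. However, when I copy and paste the URL into the browser, it returns a blank page. Also, every time I reload the page, a new src URL is generated. Does anyone know why this happens?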

The URL: view-source:http://mypricetrack.com/amazon/B00N2BW2PK

My code:

import time 
from fake_useragent import UserAgent 
import requests
import csv 
from bs4 import BeautifulSoup 
import json 
from selenium import webdriver 

#FAKE-USER_AGENT 
ua = UserAgent(cache = False) 
headers = {'User-Agent': ua.random}


#SENDING REQUEST TO PRICETRACKER WEBSITE 
product = 'B00N2BW2PK' 
page = requests.get('http://www.mypricetrack.com/amazon/'+str(product), headers = headers) 
soup = BeautifulSoup(page.text, 'html.parser')
#print(soup.prettify()) 

#GETTING URL FOR DATA 
data_link = [] 
for tag in soup.findAll('script',{'charset':'utf-8'}): 
    data_link = data_link + [tag['src']] 
string2 = data_link[1] 
print(string2)
#OPENING URL FOR DATA 

driver = webdriver.Firefox() 
driver.get(string2) 
time.sleep(5) 
htmlSource = driver.page_source 
print(htmlSource)

Answer

The JavaScript will not download unless you request it with a proper "Referer" header.

Selenium is a bit overkill here; you can get it using just Python's requests:

import re

import requests
from bs4 import BeautifulSoup

# Emulate a browser with proper headers
session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1664.3 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.8,es;q=0.6'
})
# Go to the product page and parse it
product_page = 'http://mypricetrack.com/amazon/B00N2BW2PK'
res = session.get(product_page)
soup = BeautifulSoup(res.text, 'html.parser')
# Find the script tag that points at the data URL
link = soup.find('script', {'src': re.compile(r'http://includes\.mpt-static\.com/data')})
link_src = link['src']
# Get your JS content, sending the product page as the Referer
js_content = session.get(link_src, headers={'Referer': product_page}).text
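If you want to check the two moving parts without hitting the site, here is a minimal offline sketch using only the standard library. It assumes the page contains a script tag shaped like the one quoted in the question. `SAMPLE_HTML`, `ScriptSrcCollector`, `find_data_src` and `build_data_request` are names made up for this illustration, not part of any library. It pulls the data URL out of the markup and builds a request carrying the Referer header, but never sends it.

```python
from html.parser import HTMLParser
from urllib.request import Request

# Sample markup modeled on the script tag quoted in the question (assumption:
# the real page embeds the data URL the same way).
SAMPLE_HTML = (
    '<html><head>'
    '<script src="http://includes.mpt-static.com/data/7CE5047496" '
    'type="text/javascript" charset="utf-8"></script>'
    '</head><body></body></html>'
)

DATA_PREFIX = "http://includes.mpt-static.com/data/"


class ScriptSrcCollector(HTMLParser):
    """Collect the src attribute of every <script> tag that has one."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def find_data_src(html):
    """Return the first script src that starts with DATA_PREFIX, or None."""
    collector = ScriptSrcCollector()
    collector.feed(html)
    for src in collector.sources:
        if src.startswith(DATA_PREFIX):
            return src
    return None


def build_data_request(data_url, referer):
    """Build (but do not send) a GET request that carries a Referer header."""
    return Request(data_url, headers={"Referer": referer})


product_page = "http://mypricetrack.com/amazon/B00N2BW2PK"
data_url = find_data_src(SAMPLE_HTML)
request = build_data_request(data_url, product_page)

print(data_url)
print(request.get_header("Referer"))
```

Passing `request` to `urllib.request.urlopen` would actually send it; whether the site still answers is not something this sketch checks. Since the question notes that a new src URL is generated on every reload, fetch the product page and the data URL back to back rather than reusing an old link.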