2016-02-12 22 views
0

How do I extract the URLs of the links on a web page? I am trying to download all the PDF files from the link below.

Link

First, I tried to extract all the PDF links (the links outlined in red in this image).

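from bs4 import BeautifulSoup
import urllib2 as ul  # Python 2 only; on Python 3 use: import urllib.request as ul

# Fetch the search results page and parse it
resp = ul.urlopen("https://www.osapublishing.org/search.cfm?q=comsol&meta=1&cj=1&cc=1")
soup = BeautifulSoup(resp, 'lxml')

# Write the href of every anchor tag to url.txt, one per line
with open('url.txt', 'w') as f:
    for link in soup.find_all('a', href=True):
        f.write(str(link['href']) + '\n')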

---------------------------------------------------------------- 

<url.txt> 
http://www.osa.org 
# 
https://www.osapublishing.org 
# 
# 
# 
# 
/about.cfm 

/aop 
/ao 
/as 
/boe 
/col 
/jdt 
/jlt 
/jot 
/jocn 
/josaa 
/josab 
/josk 
/optica 
/ome 
/oe 
/ol 
/prj 
/jon 
/josa 
/on 
/aop 
/ao 
/as 
/boe 
/col 
/jdt 
/jlt 
/jot 
/jocn 
/josaa 
/josab 
/josk 
/optica 
/ome 
/oe 
/ol 
/prj 
/jon 
/josa 
/on 
/conferences.cfm 
/conferences.cfm 
/conferences.cfm?findby=conference 
/conference.cfm?meetingid=5 
/conference.cfm?meetingid=124 
/conference.cfm?meetingid=56 
/conference.cfm?meetingid=144&yr=2015 
/conference.cfm?meetingid=153&yr=2015 
/conference.cfm?meetingid=131&yr=2015 
/conference.cfm?meetingid=174&yr=2015 
/conference.cfm?meetingid=109&yr=2015 
#global-nav 
/books/lasers/lasers.cfm 
/oida/reports.cfm 
http://www.osa-opn.org 
/author/author.cfm 
/submit/review/peer_review.cfm 
/library/ 
/osadigitalarchive.cfm 
/isp.cfm 
http://imagebank.osa.org 
/spotlight 
/china/ 
# 
/user 
# 
# 
# 
https://www.osapublishing.org 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
/
# 
# 
/user 
# 
# 
/about.cfm 
/conferences.cfm 
/conferences.cfm 
/conferences.cfm?findby=conference 
/china/ 
/author/author.cfm 
/submit/review/peer_review.cfm 
/library/ 
/books/lasers/lasers.cfm 
/oida/reports.cfm 
http://www.osa-opn.org 
http://imagebank.osa.org 
/spotlight/ 
/china/ 
/about.cfm 
/benefitslog.cfm 
/contactus.cfm 
# 
/privacy.cfm 
/termsofuse.cfm 
https://account.osa.org/eweb/dynamicpage.aspx?sso=1&site=osac&webcode=loginrequired&url_success=https%3A%2F%2Fwww%2Eosapublishing%2Eorg%2Fsearch%2Ecfm%3Fq%3Dcomsol%26meta%3D1%26cj%3D1%26cc%3D1%26usertoken%3D%7Btoken%7D 
https://account.osa.org/eweb/Dynamicpage.aspx?webcode=forgotpassword*Site=osac 
/privacy.cfm 
http://www.osa.org/en-us/help/ 

However, it looks like the URLs of the links I wanted were not extracted.
How can I do this?

+1

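So your goal is to get the "View: PDF" links, right? The first one I see is: 'PDF'. This could mean a few things: the links are dynamically generated or called via AJAX. When I follow the link, I am taken to a page where I have to log in or purchase, so it does not take you directly to the PDF. How do you get the PDF files manually? – Twisty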

+0

The second one loads a full PDF in the browser, and it looks like it is dynamically generated: https://www.osapublishing.org/view_article.cfm?gotourl=https%3A%2F%2Fwww.osapublishing.org%2FDirectPDFAccess%2F6FA37648-E3C1-262B-6AF76128B6A12104_274099%2Foe-21-22-27371.pdf%3Fda%3D1%26id%3D274099%26seq%3D0%26mobile%3Dno&org= I would add a condition to your script to look for 'PDF'. – Twisty

+0

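Thank you for your answer. Some of them can be downloaded without logging in. I know that the URLs of these links are not in the HTML source. Is there a way to open these links without their URLs? –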

Answers

2

None of the PDF links you want to resolve are in the HTML source returned by "https://www.osapublishing.org/search.cfm?q=comsol&meta=1&cj=1&cc=1".

The PDF links are loaded via AJAX.

I guess you need to open the URL with POST and set the correct parameters/cookies in the request. For example: "CFID=XXXXXXXX; CFTOKEN=XXXXXXXX; BIGipServerPubsWeb_HTTP=xxxxxxxxx.xxxxx.xxxx; _ga=GAx.x.xxxxxxxxxx.xxxxxxxxxx; _gat=1"
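A rough sketch of what such a request could look like in Python 3, using only the standard library. This is a guess, not the site's documented API: that the POST goes to the same search.cfm URL, that the query parameters are sent as form fields, and the helper name build_search_request are all assumptions, and the cookie values are placeholders that would have to come from a real browser session. The request is only built here, not sent.

```python
import urllib.parse
import urllib.request

# Assumption: the AJAX call targets the same search endpoint as the page.
SEARCH_URL = "https://www.osapublishing.org/search.cfm"


def build_search_request(cookies, params):
    """Build (but do not send) a POST request that carries session cookies."""
    # Cookies are sent as a single "name=value; name=value" header.
    cookie_header = "; ".join(
        "{}={}".format(name, value) for name, value in cookies.items()
    )
    # Form-encode the parameters for the POST body.
    body = urllib.parse.urlencode(params).encode("ascii")
    return urllib.request.Request(
        SEARCH_URL,
        data=body,
        headers={"Cookie": cookie_header},
        method="POST",
    )


request = build_search_request(
    cookies={"CFID": "XXXXXXXX", "CFTOKEN": "XXXXXXXX"},  # placeholder values
    params={"q": "comsol", "meta": 1, "cj": 1, "cc": 1},
)
# To actually send it: urllib.request.urlopen(request).read()
```

Copying the cookie values from the browser's developer tools after loading the page once is probably the quickest way to check whether the server accepts the request at all.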

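The response will be in JSON format. Each object will contain 'result[0].data.has-pdf = true', which you can use to test whether a PDF exists. The links look like: 'fn:doc("/oe/21/22/27371/oe-21-22-27371.xml")/article/front/article-meta/abstract/p', so you need to match them to the PDF files.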
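The matching step might be sketched like this. Only the result / data / has-pdf path and the fn:doc(...) format come from what I saw in the response; the field name "locator" that holds the fn:doc string is hypothetical, and deriving the PDF name by swapping .xml for .pdf is an assumption based on the file name oe-21-22-27371.pdf seen in the direct link. The sample JSON is hand-made for illustration.

```python
import json
import posixpath
import re

# Captures the XML document path inside fn:doc("...").
LOCATOR_RE = re.compile(r'fn:doc\("([^"]+)"\)')


def pdf_name_from_locator(locator):
    """Turn fn:doc("/oe/.../oe-21-22-27371.xml")/... into oe-21-22-27371.pdf."""
    match = LOCATOR_RE.search(locator)
    if match is None:
        return None
    stem, _ext = posixpath.splitext(posixpath.basename(match.group(1)))
    return stem + ".pdf"


def pdf_names(response_text, locator_key="locator"):
    """List PDF names for results flagged has-pdf (locator_key is hypothetical)."""
    payload = json.loads(response_text)
    names = []
    for item in payload.get("result", []):
        data = item.get("data", {})
        if data.get("has-pdf") in (True, "true"):
            name = pdf_name_from_locator(data.get(locator_key, ""))
            if name is not None:
                names.append(name)
    return names


# Hand-made sample in the assumed shape, not real server output.
sample = json.dumps({"result": [
    {"data": {
        "has-pdf": True,
        "locator": 'fn:doc("/oe/21/22/27371/oe-21-22-27371.xml")'
                   '/article/front/article-meta/abstract/p',
    }},
    {"data": {"has-pdf": False}},
]})
print(pdf_names(sample))
```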

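But I guess they may have an IP check or some other security measure in place, so you may not be able to get any data via POST from a domain other than the origin. Just a guess ;)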

+0

Some links do not require a login, for example: 'https://www.osapublishing.org/view_article.cfm?gotourl=https%3A%2F%2Fwww.osapublishing.org%2FDirectPDFAccess%2F6FA37648-E3C1-262B-6AF76128B6A12104_274099%2Foe-21-22-27371.pdf%3Fda%3D1%26id%3D274099%26seq%3D0%26mobile%3Dno&org='. Here you can see the direct URL being passed to the CF script. – Twisty
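To illustrate the point about the direct URL: the gotourl query parameter is just the percent-encoded PDF address, so it can be recovered with the standard library. A minimal Python 3 sketch; the function name direct_pdf_url is made up for this example, and whether the decoded URL can be fetched without a session is not something this code checks.

```python
from urllib.parse import parse_qs, urlsplit


def direct_pdf_url(view_article_url):
    """Return the decoded gotourl parameter, or None if it is absent."""
    query = parse_qs(urlsplit(view_article_url).query)
    values = query.get("gotourl")
    return values[0] if values else None


wrapped = (
    "https://www.osapublishing.org/view_article.cfm?gotourl="
    "https%3A%2F%2Fwww.osapublishing.org%2FDirectPDFAccess%2F"
    "6FA37648-E3C1-262B-6AF76128B6A12104_274099%2Foe-21-22-27371.pdf"
    "%3Fda%3D1%26id%3D274099%26seq%3D0%26mobile%3Dno&org="
)
print(direct_pdf_url(wrapped))
```

The long hexadecimal path segment looks like a per-request token, so a URL decoded this way may well expire.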