我试图下载下面的链接中的所有PDF文件。如何提取网页上的链接的URL
首先,我试图提取所有PDF链接(链接包含在红色this image)
from bs4 import BeautifulSoup
import urllib2 as ul
resp = ul.urlopen("https://www.osapublishing.org/search.cfm?q=comsol&meta=1&cj=1&cc=1")
soup = BeautifulSoup(resp, 'lxml')
f = open('url.txt', 'w')
for link in soup.find_all('a', href=True):
f.write(str(link['href']) + '\n')
f.close()
----------------------------------------------------------------
<url.txt>
http://www.osa.org
#
https://www.osapublishing.org
#
#
#
#
/about.cfm
/aop
/ao
/as
/boe
/col
/jdt
/jlt
/jot
/jocn
/josaa
/josab
/josk
/optica
/ome
/oe
/ol
/prj
/jon
/josa
/on
/aop
/ao
/as
/boe
/col
/jdt
/jlt
/jot
/jocn
/josaa
/josab
/josk
/optica
/ome
/oe
/ol
/prj
/jon
/josa
/on
/conferences.cfm
/conferences.cfm
/conferences.cfm?findby=conference
/conference.cfm?meetingid=5
/conference.cfm?meetingid=124
/conference.cfm?meetingid=56
/conference.cfm?meetingid=144&yr=2015
/conference.cfm?meetingid=153&yr=2015
/conference.cfm?meetingid=131&yr=2015
/conference.cfm?meetingid=174&yr=2015
/conference.cfm?meetingid=109&yr=2015
#global-nav
/books/lasers/lasers.cfm
/oida/reports.cfm
http://www.osa-opn.org
/author/author.cfm
/submit/review/peer_review.cfm
/library/
/osadigitalarchive.cfm
/isp.cfm
http://imagebank.osa.org
/spotlight
/china/
#
/user
#
#
#
https://www.osapublishing.org
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
/
#
#
/user
#
#
/about.cfm
/conferences.cfm
/conferences.cfm
/conferences.cfm?findby=conference
/china/
/author/author.cfm
/submit/review/peer_review.cfm
/library/
/books/lasers/lasers.cfm
/oida/reports.cfm
http://www.osa-opn.org
http://imagebank.osa.org
/spotlight/
/china/
/about.cfm
/benefitslog.cfm
/contactus.cfm
#
/privacy.cfm
/termsofuse.cfm
https://account.osa.org/eweb/dynamicpage.aspx?sso=1&site=osac&webcode=loginrequired&url_success=https%3A%2F%2Fwww%2Eosapublishing%2Eorg%2Fsearch%2Ecfm%3Fq%3Dcomsol%26meta%3D1%26cj%3D1%26cc%3D1%26usertoken%3D%7Btoken%7D
https://account.osa.org/eweb/Dynamicpage.aspx?webcode=forgotpassword*Site=osac
/privacy.cfm
http://www.osa.org/en-us/help/
但是,它看起来像我想提取WASN”链接的网址提取。
我该怎么做?
因此,你的目标是查看:PDF链接的权利?我看到的第一个是:'PDF'这可能意味着一些事情,它们是动态生成的或通过AJAX调用的。当我按照链接,我被带到一个页面,我登录或购买。所以它不会直接将您带到PDF中。你如何手动获取PDF文件? – Twisty
第二个加载一个完整的PDF在浏览器,它看起来像是动态生成的:https://www.osapublishing.org/view_article.cfm?gotourl=https%3A%2F%2Fwww.osapublishing.org%2FDirectPDFAccess%2F6FA37648 -E3C1-262B-6AF76128B6A12104_274099%2Foe-21-22-27371.pdf%3Fda%3D1%26id%3D274099%26seq%3D0%26mobile%3Dno与组织=我想补充一个条件来寻找你的脚本 'PDF'。 – Twisty
谢谢你的回答。其中一些可以无需登录即可下载。我知道这些链接的URL不在HTML源代码中。有没有办法打开这些链接,而没有他们的网址? –