I have a little script that I was very pleased with: it reads one or more bibliographic references from the clipboard, gets the academic paper's details from Google Scholar, and then feeds these into SciHub to get the PDF. For some reason it has stopped working and I have spent ages trying to find out why. Sending the form request to SciHub with urllib/urllib2 no longer works.
Testing shows that the Google part of the program (scholarly.py) works fine; it is the SciHub part that is the problem.
Any ideas? Here is an example reference: Appleyard, S.J., Angeloni, J. and Watkins, R. (2006) Arsenic-rich groundwater in an urban area experiencing drought and increasing population density, Perth, Australia. Applied Geochemistry 21(1), 83-97.
'''Program to automatically find and download items from a bibliography or
references list. This program uses the 'scihub' website to obtain the
full-text paper where available; if no entry is found the paper is ignored
and the failed downloads are listed at the end.'''
import scholarly
import win32clipboard
import urllib
import urllib2
import webbrowser
import re
'''Select and then copy the bibliography entries you want to download
papers for; Python reads them from the clipboard.'''
win32clipboard.OpenClipboard()
c = win32clipboard.GetClipboardData()
win32clipboard.EmptyClipboard()
'''Clean up the text: remove line endings, double spaces etc.'''
c = c.replace('\n', ' ')
c = c.replace('\r', ' ')
while c.find('  ') != -1:
    c = c.replace('  ', ' ')
win32clipboard.SetClipboardText(c)
win32clipboard.CloseClipboard()
print "Working..."
'''A bit of regex to extract the title of the paper.
IMPORTANT: the bibliography has to be in author-date format or you will
need to revise this. At the moment it looks for a year in brackets, then
copies all the text until it reaches a full stop, assuming that this is the
paper title. If it is not, it will either fail or will use inappropriate
search terms.'''
paper_info= re.findall(r"(\d{4}[a-z]*)([). ]+)([ \"])+([\w\s_():,-]*)(.)",c)
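# For example, against the sample reference in the question, group 1 should
# capture the year '2006' and group 4 (i[3] below) the assumed title:
# 'Arsenic-rich groundwater in an urban area experiencing drought and
# increasing population density, Perth, Australia'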
print "Analysing titles"
print "The following titles found:"
print "*************************"
list_of_titles= list()
for i in paper_info:
    print '%s...' % (i[3][:50])
    Paper_title = str(i[3])
    list_of_titles.append(Paper_title)
failed=list()
for title in list_of_titles:
    try:
        search_query = scholarly.search_pubs_query(title)
        info = (next(search_query))
        print "Querying Google Scholar"
        print "**********************"
        print "Looking up paper title:"
        print "**********************"
        print title
        print "**********************"
        url = info.bib['url']
        print "Journal URL found "
        print url
        #url=next(search_query)
        print "Sending URL: ", url
        site = 'http://sci-hub.cc/'
        data = urllib.urlencode({'request': url})
        print data
        results = urllib2.urlopen(site, data)  # this is where it fails
        with open("results.html", "w") as f:
            f.write(results.read())
        webbrowser.open_new("results.html")
    except:
        print "**********************"
        print "No valid journal found for:"
        print title
        print "**********************"
        print "Continuing..."
        failed.append(title)
        continue
if len(failed) == 0:
    print 'Complete'
else:
    print '*************************************'
    print 'The following titles did not download: '
    print '*************************************'
    print failed
    print "Please check that these are valid entries"
You have a bare except: block in your code that is eating every exception and replacing it with a useless error message. Try removing it and see what the problem actually is. – Blender
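A minimal sketch of that suggestion, as a drop-in for the try/except inside the for loop above (the exception tuple is an assumption about the failure modes worth catching; anything else will now propagate with a real traceback):

    try:
        search_query = scholarly.search_pubs_query(title)
        info = next(search_query)
        url = info.bib['url']
        site = 'http://sci-hub.cc/'
        data = urllib.urlencode({'request': url})
        results = urllib2.urlopen(site, data)
    except (StopIteration, KeyError, urllib2.HTTPError, urllib2.URLError) as e:
        # Printing the exception exposes the real cause instead of hiding it
        print "Lookup/download failed for:", title
        print e
        failed.append(title)
        continue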
Thanks Blender, I get HTTP Error 403: Forbidden – flashliquid
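One common cause of a 403 here is urllib2's default User-Agent header (Python-urllib/2.x), which some sites reject. A minimal sketch of a header-spoofing fix that keeps urllib2 (the browser-like User-Agent string is just an example):

    import urllib
    import urllib2

    site = 'http://sci-hub.cc/'
    data = urllib.urlencode({'request': url})  # url obtained from scholarly as above
    # Present a browser-like User-Agent instead of the default 'Python-urllib'
    req = urllib2.Request(site, data, headers={'User-Agent': 'Mozilla/5.0'})
    results = urllib2.urlopen(req)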
I think I need to spoof the headers so it doesn't look like a Python script, but I can't get it to work. I am currently rewriting the offending section using Requests instead of urllib and urllib2. It's confusing, because it worked fine for weeks. – flashliquid
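For reference, a sketch of that rewrite using Requests, under the same assumption that a browser-like User-Agent avoids the 403 (site and form field taken from the original script):

    import webbrowser
    import requests

    headers = {'User-Agent': 'Mozilla/5.0'}  # assumed browser-like UA to avoid the 403
    # url obtained from scholarly as above
    resp = requests.post('http://sci-hub.cc/', data={'request': url}, headers=headers)
    resp.raise_for_status()  # raises on HTTP errors such as 403 instead of failing silently
    with open("results.html", "w") as f:
        f.write(resp.content)
    webbrowser.open_new("results.html")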