
I have a little script that I was happily using to read one or more bibliographic references from the clipboard, get the paper's details from Google Scholar, and then feed those into SciHub to get the pdf. For some reason it has stopped working and I have spent ages trying to figure out why: sending the form request to SciHub no longer works (urllib, urllib2, python).

Testing shows that the Google (scholarly.py) part of the program works fine; it is the SciHub part that is the problem.

Any ideas? An example reference: Appleard, S.J., Angeloni, J. and Watkins, R. (2006) Arsenic-rich groundwater in an urban area experiencing drought and increasing population density, Perth, Australia. Applied Geochemistry 21(1), 83-97.

'''Program to automatically find and download items from a bibliography or references list. 
This program uses the 'scihub' website to obtain the full-text paper where 
available; if no entry is found the paper is ignored and the failed downloads 
are listed at the end''' 

import scholarly 
import win32clipboard 
import urllib 
import urllib2 
import webbrowser 
import re 

'''Select and then copy the bibliography entries you want to download the 
papers for, python reads the clipboard''' 
win32clipboard.OpenClipboard() 
c = win32clipboard.GetClipboardData() 
win32clipboard.EmptyClipboard() 
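# the clipboard is emptied here so the cleaned-up text can be written back below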

'''Cleans up the text. removes end lines and double spaces etc.''' 
c = c.replace('\n', ' ') 
c = c.replace('\r', ' ') 
while c.find('  ') != -1: 
    c = c.replace('  ', ' ') 
win32clipboard.SetClipboardText(c) 
win32clipboard.CloseClipboard() 
print "Working..." 

'''bit of regex to extract the title of the paper, 
IMPORTANT: bibliography has to be in 
author date format or you will need to revise this, 
at the moment it looks for year date in brackets, then copies all the text until it 
reaches a full-stop, assuming that this is the paper title. If it is not, it 
will either fail or will be using inappropriate search terms.''' 


paper_info= re.findall(r"(\d{4}[a-z]*)([). ]+)([ \"])+([\w\s_():,-]*)(.)",c) 
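# e.g. for the Appleard reference above, i[3] captures the text between 
# "(2006)" and the next full stop: "Arsenic-rich groundwater in an urban area 
# experiencing drought and increasing population density, Perth, Australia", 
# which then becomes the Scholar search term 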
print "Analysing titles" 
print "The following titles found:" 
print "*************************" 
list_of_titles= list() 
for i in paper_info: 
    print '%s...' % (i[3][:50]) 
    Paper_title=str(i[3]) 
    list_of_titles.append(Paper_title) 

failed=list() 
for title in list_of_titles: 
    try: 
        search_query = scholarly.search_pubs_query(title) 
        info = next(search_query) 

        print "Querying Google Scholar" 
        print "**********************" 
        print "Looking up paper title:" 
        print "******************" 
        print title 
        print "**********************" 

        url = info.bib['url'] 
        print "Journal URL found " 
        print url 
        #url=next(search_query) 
        print "Sending URL: ", url 

        site = 'http://sci-hub.cc/' 
        data = urllib.urlencode({'request': url}) 

        print data 
        results = urllib2.urlopen(site, data) #this is where it fails 

        with open("results.html", "w") as f: 
            f.write(results.read()) 

        webbrowser.open_new("results.html") 

    except: 
        print "**********************" 
        print "No valid journal found for:" 
        print title 
        print "**********************" 
        print "Continuing..." 
        failed.append(title) 

if len(failed)==0: 
    print 'Complete' 

else: 
    print '*************************************' 
    print 'The following titles did not download: ' 
    print '*************************************' 
    print failed 
    print "Please check that these are valid entries" 

You have a bare `except:` block in your code that is eating every exception and replacing it with a useless error message. Try removing it and see what the problem actually is. – Blender


Thanks Blender, I get HTTP Error 403: Forbidden – flashliquid
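
(Side note on that 403: urllib2 identifies itself with a default User-Agent of the form "Python-urllib/x.y", which some servers reject outright. A minimal sketch, not from the original thread, that prints the default header and catches the error explicitly instead of swallowing it in a bare except:)

    import urllib2

    # urllib2's default User-Agent advertises the client as a Python script
    print urllib2.build_opener().addheaders   # [('User-agent', 'Python-urllib/2.7')]

    try:
        # 'request=...' stands in for the urlencoded form data from the script above
        urllib2.urlopen('http://sci-hub.cc/', 'request=http://example.com/paper')
    except urllib2.HTTPError as e:
        print e.code, e.msg                   # e.g. 403 Forbidden when the UA is blocked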


I think I need to spoof the headers so it doesn't look like a Python script. I couldn't get it to work, so I am currently rewriting the problematic part using 'requests' instead of urllib and urllib2. It's confusing because it worked fine for weeks. – flashliquid
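
(For reference, a minimal sketch of what that requests-based rewrite might look like, assuming the same sci-hub.cc endpoint and 'request' form field used in the script above; `url` is a placeholder for the journal URL returned by scholarly:)

    import requests

    url = 'http://example.com/paper'   # placeholder: the journal URL from scholarly
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                             'AppleWebKit/537.11 (KHTML, like Gecko) '
                             'Chrome/23.0.1271.64 Safari/537.11'}

    # POST the journal URL to sci-hub.cc, presenting a browser User-Agent
    resp = requests.post('http://sci-hub.cc/', data={'request': url}, headers=headers)
    resp.raise_for_status()            # raises on 403 instead of failing silently

    with open('results.html', 'w') as f:
        f.write(resp.content)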

Answer


It works now: I added a 'User-Agent' header and reworked the urllib calls, and it is more obvious what the code is doing. A process of trial and error, trying lots of different code fragments picked up from the web. Hoping my boss doesn't ask what I achieved today. Someone should build a forum where people can get answers to coding problems...

    '''Program to automatically find and download items from a bibliography or references list. 
    Here are some journal papers in bibliographic format; just copy the text to the clipboard and run the script. 

    Ghaffour, N., T. M. Missimer and G. L. Amy (2013). "Technical review and evaluation of the economics of water desalination: Current and future challenges for better water supply sustainability." Desalination 309(0): 197-207. 

    Gutiérrez Ortiz, F. J., P. G. Aguilera and P. Ollero (2014). "Biogas desulfurization by adsorption on thermally treated sewage-sludge." Separation and Purification Technology 123(0): 200-213. 

    This program uses the 'scihub' website to obtain the full-text paper where 
    available; if no entry is found the paper is ignored and the failed downloads are listed at the end''' 

    import scholarly 
    import win32clipboard 
    import urllib 
    import urllib2 
    import webbrowser 
    import re 


    '''Select and then copy the bibliography entries you want to download the 
    papers for, python reads the clipboard''' 
    win32clipboard.OpenClipboard() 
    c = win32clipboard.GetClipboardData() 
    win32clipboard.EmptyClipboard() 

    '''Cleans up the text. removes end lines and double spaces etc.''' 
    c = c.replace('\n', ' ') 
    c = c.replace('\r', ' ') 
    while c.find('  ') != -1: 
        c = c.replace('  ', ' ') 
    win32clipboard.SetClipboardText(c) 
    win32clipboard.CloseClipboard() 
    print "Working..." 

    '''bit of regex to extract the title of the paper, 
    IMPORTANT: bibliography has to be in 
    author date format or you will need to revise this, 
    at the moment it looks for date in brackets, then copies all the text until it 
    reaches a full-stop, assuming that this is the paper title. If it is not, it 
    will either fail or will be using inappropriate search terms.''' 

    paper_info= re.findall(r"(\d{4}[a-z]*)([). ]+)([ \"])+([\w\s_():,-]*)(.)",c) 
    print "Analysing titles" 
    print "The following titles found:" 
    print "*************************" 
    list_of_titles= list() 
    for i in paper_info: 
     print '%s...' % (i[3][:50]) 
     Paper_title=str(i[3]) 
     list_of_titles.append(Paper_title) 
    paper_number = 0 
    failed = list() 
    for title in list_of_titles: 
        try: 
            search_query = scholarly.search_pubs_query(title) 

            info = next(search_query) 
            paper_number += 1 
            print "Querying Google Scholar" 
            print "**********************" 
            print "Looking up paper title:" 
            print title 
            print "**********************" 

            url = info.bib['url'] 
            print "Journal URL found " 
            print url 
            #url=next(search_query) 
            print "Sending URL: ", url 

            site = 'http://sci-hub.cc/' 

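            # sci-hub.cc rejects urllib2's default "Python-urllib" User-Agent 
            # with HTTP 403 Forbidden, so present a browser User-Agent instead 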
            r = urllib2.Request(url=site) 
            r.add_header('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11') 
            r.add_data(urllib.urlencode({'request': url})) 
            res = urllib2.urlopen(r) 

            with open("results.html", "w") as f: 
                f.write(res.read()) 

            webbrowser.open_new("results.html") 
            if not paper_number <= len(list_of_titles): 
                print "Next title" 
            else: 
                continue 

        except Exception as e: 
            print repr(e) 
            paper_number += 1 
            print "**********************" 
            print "No valid journal found for:" 
            print title 
            print "**********************" 
            print "Continuing..." 
            failed.append(title) 

    if len(failed) == 0: 
        print 'Complete' 

    else: 
        print '*************************************' 
        print 'The following titles did not download:' 
        print '*************************************' 
        print failed 
        print "Please check that these are valid entries" 