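HTTP 404 NOT FOUND error - Apache Solr

I'm trying to crawl a set of web pages and index them with Apache Solr. To crawl the pages I'm using Python with the help of BeautifulSoup and urllib2. I can successfully retrieve the URLs and the HTML data.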
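Now I'm trying to get Solr to index them through solrpy (http://code.google.com/p/solrpy/). I keep getting an HTTP 404 Not Found error.
I haven't modified the default schema.xml, and I'm using the example server that ships with Apache Solr.
Here is my code: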
import sys
import urllib2
import solr
from bs4 import BeautifulSoup
from lxml import etree
import hashlib

solrUrl = 'http://localhost:8983/solr/'
solrInstance = solr.SolrConnection(solrUrl)

# Fetch the feed and collect every <link> element
conn = urllib2.urlopen('http://seekingalpha.com/market_currents.xml')
root = etree.fromstring(conn.read())
links = root.findall(".//link")
counter = 0
for link in links:
    counter = counter + 1
    url = link.text
    url_md5 = hashlib.md5(url).hexdigest()  # used as the document id
    conn = urllib2.urlopen(url)
    soup = BeautifulSoup(conn.read())
    title_page = soup.html.head.title.string.decode("utf-8")
    print title_page
    try:  # Add to the Solr instance
        solrInstance.add(id=str(url_md5), url_s=url, text=str(title_page), title=str(title_page))
    except Exception as inst:
        print "Error adding URL: " + url
        print "\tWith Message: " + str(inst)
    else:
        print "Added Page \"" + title_page + "\" with URL " + url
try:
    solrInstance.commit()
except:
    print "Could not Commit Changes to Solr Instance - check logs"
else:
    print "Success. " + str(counter) + " documents added to index"
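For reference, Solr's XML update handler accepts messages of the form `<add><doc><field name="...">...</field></doc></add>`, and a client library is expected to post something along those lines on your behalf. Below is a minimal sketch of such a message for one page, written for Python 3 with the standard library only (not the Python 2 used above). `build_add_xml` is a hypothetical helper for illustration, not part of solrpy, and the field names are simply the ones from the code above.

```python
import hashlib
import xml.etree.ElementTree as ET


def build_add_xml(url, title):
    """Return a Solr XML <add> message for one page (hypothetical helper)."""
    # In Python 3, md5 needs bytes, so the URL is encoded first.
    doc_id = hashlib.md5(url.encode("utf-8")).hexdigest()
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    # Field names mirror the ones used in the question's add() call.
    for name, value in (("id", doc_id), ("url_s", url),
                        ("text", title), ("title", title)):
        field = ET.SubElement(doc, "field", name=name)
        field.text = value  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    print(build_add_xml("http://example.com/page", "Example & Co"))
```

Printing the message this way can help confirm that the document itself is well formed, independently of whether the server URL is right.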
And here is the error:
Error adding URL: http://seekingalpha.com/currents/all
	With Message: HTTP code=404, reason=Not Found
How can I fix this? Thanks in advance.
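One way to narrow down a 404 like this is to check which update URL the server actually answers on, separately from the crawling code. The following is a rough sketch in Python 3 using only the standard library. `candidate_update_urls` and `probe` are hypothetical helpers, and the core name `collection1` is only a guess at how the example server might be laid out, not something confirmed above.

```python
import urllib.error
import urllib.request


def candidate_update_urls(base_url, core_names=("collection1",)):
    """List update-handler URLs worth trying for a given base URL."""
    base = base_url.rstrip("/")  # avoid '//' when joining path segments
    urls = [base + "/update"]
    for core in core_names:  # core names are guesses; adjust to your setup
        urls.append(base + "/" + core + "/update")
    return urls


def probe(url, timeout=5):
    """Return the HTTP status for a GET on url, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404 when nothing is mapped at this path
    except (urllib.error.URLError, OSError):
        return None  # server not running, or wrong host/port


if __name__ == "__main__":
    for url in candidate_update_urls("http://localhost:8983/solr/"):
        print(url, "->", probe(url))
```

A status other than 404 for one of the candidates would suggest that the base URL passed to the client needs to point at that path instead. A result of `None` would suggest the server is not reachable at all.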
Any reason you're not using Apache Nutch? It is designed for crawling and supports Solr directly.