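I want to extract the text from PDF pages. I use Selenium with XPath in Python to open each PDF link in turn, but I get empty data. Sometimes I get the content of a single page of the PDF, but not in any particular format. How can I get the text of every page of a PDF (link) using Selenium and Python?
How do I get the text from all pages of a PDF link?
Here is my code: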
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    url = "http://www.incredibleindia.org"
    driver = webdriver.Firefox()
    driver.get(url)

    # wait for the menu to be loaded
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.menu li > a")))

    # "articles" link under the media tab
    article_link = [a.get_attribute('href') for a in driver.find_elements_by_xpath("html/body/div[3]/div/div[1]/div[2]/ul/li[3]/ul/li[6]/a")]

    # visit each article listing page
    for link in article_link:
        print link
        driver.get(link)
        # check that the article container is present on the listing page
        try:
            WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.article-full-div")))
        except TimeoutException:
            print driver.title, "No news links under media tab"

        # article sub links (the PDF links) on the listing page
        article_sub_links = [a.get_attribute('href') for a in driver.find_elements_by_xpath(".//*[@id='article-content']/div/div[2]/ul/li/a")]
        print "article sub links"
        for link in article_sub_links:
            print link
            driver.get(link)
            # wait for the PDF viewer's text layer to appear
            try:
                WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.textLayer")))
            except TimeoutException:
                print driver.title, "No news links under media tab"

            # text of every page container in the PDF viewer
            content = [a.text for a in driver.find_elements_by_xpath(".//*[contains(@id,'pageContainer')]")]
            print content
            for data in content:
                print data
Output:

    http://www.incredibleindia.org/en/media-black-2/articles
    article sub links
    http://www.incredibleindia.org/images/articles/Ajanta.pdf
    [u'', u'', u'']
    http://www.incredibleindia.org/images/articles/Bedhaghat.pdf
    404 - Error: 404 No news links under media tab
    []
    http://www.incredibleindia.org/images/articles/Bellur.pdf
    [u'', u'', u'']
    http://www.incredibleindia.org/images/articles/Bidar.pdf
    [u'', u'', u'']
    http://www.incredibleindia.org/images/articles/Braj.pdf
    [u'', u'', u'', u'']
    http://www.incredibleindia.org/images/articles/Carnival.pdf
    [u'', u'', u'']
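A likely reason for the empty strings is that Firefox's built-in PDF viewer (pdf.js) fills in a page's text layer only once that page has been rendered, which normally happens when it scrolls into view. The following is a minimal sketch under that assumption, not a verified fix for this site: it scrolls each page container into view and polls until its text is non-empty. The function name `collect_pdf_page_text` and the `timeout`/`poll` parameters are made up for illustration, the XPath is the one from the question, and the code is written for Python 3 with the Selenium 4 style `find_elements(by, value)` call.

```python
import time


def collect_pdf_page_text(driver, timeout=10.0, poll=0.5):
    """Return the text of every PDF viewer page container, in page order.

    `driver` is assumed to behave like a Selenium 4 WebDriver that has
    already loaded a PDF link in the browser's built-in viewer.
    """
    # "xpath" is the string value of selenium's By.XPATH constant.
    pages = driver.find_elements("xpath", "//*[contains(@id,'pageContainer')]")
    texts = []
    for page in pages:
        # Scrolling the container into view asks the viewer to render it.
        driver.execute_script("arguments[0].scrollIntoView(true);", page)
        deadline = time.monotonic() + timeout
        text = page.text
        # Poll until the text layer is filled in or the timeout expires.
        while not text.strip() and time.monotonic() < deadline:
            time.sleep(poll)
            text = page.text
        # A page with no selectable text (e.g. a scanned image) stays "".
        texts.append(text)
    return texts
```

It would be called as `content = collect_pdf_page_text(driver)` right after the wait for `div.textLayer`. Text is read page by page rather than at the end, because the viewer may discard pages that were rendered earlier.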
It works for some PDF links, but not for all of them. It prints the PDF link, or the content of some of the PDF links, but not the content of all of them. – Mukesh 2015-03-16 05:47:04
@user3902208 Thanks, could you provide a sample link for which it does not work? – alecxe 2015-03-16 09:55:02
Output: 'http://www.incredibleindia.org/en/media-black-2/articles article sub links http://www.incredibleindia.org/images/articles/Ajanta.pdf http://www.incredibleindia.org/images/articles/Bedhaghat.pdf 404 - Error: 404 empty content http://www.incredibleindia.org/images/articles/Bellur.pdf http://www.incredibleindia.org/images/articles/Gir.pdf http://www.incredibleindia.org/images/articles/Hampi.pdf http://www.incredibleindia.org/images/articles/Orchha.pdf **It does not show content for any link except this one**' – Mukesh 2015-03-16 10:51:28
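For links that still return nothing in the browser, one alternative (a different technique from the Selenium approach in the question) is to download the PDF bytes directly and parse them with a PDF library. The sketch below is Python 3 and assumes the third-party `pypdf` package is installed. The helper names `fetch_pdf_bytes`, `looks_like_pdf` and `extract_page_texts` are hypothetical, and the `%PDF-` check is only a rough guard against HTML error pages such as the 404 above.

```python
import io
import urllib.error
import urllib.request


def fetch_pdf_bytes(url, timeout=30):
    """Download a URL; return None on an HTTP error such as 404."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except urllib.error.HTTPError:
        return None


def looks_like_pdf(data):
    """Rough check: PDF files normally start with the bytes '%PDF-'."""
    return data is not None and data.startswith(b"%PDF-")


def extract_page_texts(data):
    """Return one string per page, using pypdf (assumed to be installed)."""
    from pypdf import PdfReader  # third-party dependency

    reader = PdfReader(io.BytesIO(data))
    # extract_text() can come back empty for image-only pages.
    return [page.extract_text() or "" for page in reader.pages]
```

A loop over `article_sub_links` could then call `fetch_pdf_bytes(link)`, skip anything for which `looks_like_pdf` is false, and print the result of `extract_page_texts` page by page. PDFs that consist of scanned images would still yield empty strings and would need OCR instead.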