I want to extract text from PDF pages. I open the PDF page links one by one using XPath with Selenium IDE and Python, but it gives me empty data, and sometimes it gives me the content of a single PDF page, though not in any particular format. How can I get the text of all pages of a PDF page (link) using Selenium IDE and Python?

How do I get the text from all pages of a PDF link?

Here is my code:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "http://www.incredibleindia.org"
driver = webdriver.Firefox()
driver.get(url)
# wait for the menu to be loaded
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.menu li > a")))

# article link under the media tab
article_link = [a.get_attribute('href') for a in driver.find_elements_by_xpath("html/body/div[3]/div/div[1]/div[2]/ul/li[3]/ul/li[6]/a")]

# all important news links under the media tab
for link in article_link:
    print link
    driver.get(link)
    # check that the article sub-link CSS is available on the article link page
    try:
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.article-full-div")))
    except TimeoutException:
        print driver.title, "No news links under media tab"

    # article sub links under the article tab
    article_sub_links = [a.get_attribute('href') for a in driver.find_elements_by_xpath(".//*[@id='article-content']/div/div[2]/ul/li/a")]

    print "article sub links"
    for link in article_sub_links:
        print link

        driver.get(link)
        try:
            WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.textLayer")))
        except TimeoutException:
            print driver.title, "No news links under media tab"

        content = [a.text for a in driver.find_elements_by_xpath(".//*[contains(@id,'pageContainer')]")]
        print content
        for data in content:
            print data

Output:

http://www.incredibleindia.org/en/media-black-2/articles 
article sub links 
http://www.incredibleindia.org/images/articles/Ajanta.pdf 
[u'', u'', u''] 



http://www.incredibleindia.org/images/articles/Bedhaghat.pdf 
404 - Error: 404 No news links under media tab
[] 
http://www.incredibleindia.org/images/articles/Bellur.pdf 
[u'', u'', u''] 



http://www.incredibleindia.org/images/articles/Bidar.pdf 
[u'', u'', u''] 



http://www.incredibleindia.org/images/articles/Braj.pdf 
[u'', u'', u'', u''] 




http://www.incredibleindia.org/images/articles/Carnival.pdf 
[u'', u'', u'']

Answer


I think you need to go deeper and get into the "textLayer" (the div element with class="textLayer" inside each page container). You also need to use continue in the exception-handling block:

for link in article_sub_links:
    driver.get(link)

    # wait until at least one rendered text layer is actually visible
    try:
        WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.textLayer")))
    except TimeoutException:
        print driver.title, "Empty content"
        continue

    # read the text layer inside every page container
    content = [a.text for a in driver.find_elements_by_css_selector("div[id^=pageContainer] div.textLayer")]
    for data in content:
        print data
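Two details matter here: visibility_of_element_located waits until a text layer has actually been rendered (not merely attached to the DOM), and the continue skips broken links (such as the 404 above) instead of reading stale content from the previous page. The selector div[id^=pageContainer] div.textLayer targets the text layer inside every page container, which is where pdf.js (Firefox's built-in PDF viewer) places the extracted text.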

It is working for some PDF links, but not for all of them. It prints the content of some PDF links, but not the content of every PDF link. – Mukesh 2015-03-16 05:47:04


@user3902208 Thanks, could you provide an example link for which it doesn't work? – alecxe 2015-03-16 09:55:02


Output: 'http://www.incredibleindia.org/en/media-black-2/articles article sub links http://www.incredibleindia.org/images/articles/Ajanta.pdf http://www.incredibleindia.org/images/articles/Bedhaghat.pdf 404 - Error: 404 Empty content http://www.incredibleindia.org/images/articles/Bellur.pdf http://www.incredibleindia.org/images/articles/Gir.pdf http://www.incredibleindia.org/images/articles/Hampi.pdf http://www.incredibleindia.org/images/articles/Orchha.pdf **it does not show the content except for this link**' – Mukesh 2015-03-16 10:51:28
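A likely explanation for the remaining empty results is that pdf.js renders text layers lazily: only pages near the viewport get a populated div.textLayer, so pages that are never scrolled into view read back as empty strings. Below is a minimal, untested sketch of one way to force rendering: scroll each page container into view and give pdf.js a moment to render before reading the layer. The example URL is taken from the thread; the one-second pause is an assumption, not a tuned value.

import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
# example PDF link taken from the question output above
driver.get("http://www.incredibleindia.org/images/articles/Ajanta.pdf")

# wait until pdf.js has created at least one page container
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div[id^=pageContainer]")))

content = []
for page in driver.find_elements_by_css_selector("div[id^=pageContainer]"):
    # bring the page into the viewport so pdf.js renders its text layer
    driver.execute_script("arguments[0].scrollIntoView(true);", page)
    # crude fixed pause (assumption); polling the layer's text would be more robust
    time.sleep(1)
    layer = page.find_element_by_css_selector("div.textLayer")
    content.append(layer.text)

print "\n\n".join(content)

If the fixed pause proves flaky, the same loop can instead use a WebDriverWait that polls each text layer until its text is non-empty.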
