python-pptx: extracting text from slide titles

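I have built a document retrieval engine in Python that returns documents ranked by relevance to a user-submitted query. My document collection includes PowerPoint files. For PPT results, I want to show the user the first few slide titles on the results page to give him/her a clearer picture (somewhat like what we see in Google search results).

So basically, I want to extract the text from the slide titles of a PPT file using Python. I am using the python-pptx package. Currently my implementation looks like this: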

from pptx import Presentation

# characters stripped from both ends of a title
PUNCTUATION = """ !@#$%^&*)(_-+=}{][:;<,>.?"'/<,"""

prs = Presentation(filepath)  # load the presentation; filepath is the path to the file
slide_titles = []  # container for slide titles
for slide in prs.slides:  # iterate over each slide
    title_shape = slide.shapes[0]  # treat the shape at index 0 as the title
    if title_shape.has_text_frame:  # only shapes with a text frame carry text
        title = title_shape.text.strip(PUNCTUATION) + '. '
        # skip titles that are already in the container
        if title not in slide_titles:
            slide_titles.append(title)

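But as you can see, I am assuming that the shape at index zero on each slide is the slide title, which is obviously not always the case. Any ideas on how to accomplish this?

Thanks in advance.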

2017-04-12 Clock Slave

Slide.shapes (a SlideShapes object) has a .title property that returns the title shape when there is one (which is usually the case), or None if the slide has no title.
http://python-pptx.readthedocs.io/en/latest/api/shapes.html#slideshapes-objects

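This is the preferred way to access the title shape.

Note that not all slides have a title shape, so you must test for a None result to avoid an error in that case.

Also note that users sometimes use a different shape for the title, for example by adding a separate new text box. So you are not guaranteed to get the text that "appears" to be the title on the slide. However, you will get the text that matches what PowerPoint considers the title, for example the text it shows as that slide's title in Outline view.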
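
As a rough illustration of the approach described above, here is a minimal sketch that reads `slide.shapes.title` and skips slides where it is None. The helper names `extract_slide_titles` and `titles_from_file`, the `limit` parameter, and the choice to drop empty and duplicate titles are assumptions made for this example and are not part of python-pptx.

```python
def extract_slide_titles(slides, limit=None):
    """Collect title text from slides, in order, skipping empty and duplicate titles.

    `slides` is any iterable of slide objects (for example `prs.slides`).
    `limit` is a hypothetical convenience parameter: stop after this many titles.
    """
    titles = []
    for slide in slides:
        if limit is not None and len(titles) >= limit:
            break
        title_shape = slide.shapes.title  # None when the slide has no title shape
        if title_shape is None or not title_shape.has_text_frame:
            continue
        text = title_shape.text.strip()
        if text and text not in titles:
            titles.append(text)
    return titles


def titles_from_file(filepath, limit=3):
    """Open a presentation file and return up to `limit` slide titles."""
    # Imported here so extract_slide_titles can be used without python-pptx installed.
    from pptx import Presentation

    prs = Presentation(filepath)
    return extract_slide_titles(prs.slides, limit=limit)
```

For a search results page, something like `titles_from_file(path, limit=3)` would return at most the first three distinct titles. Slides whose visible "title" is really an ordinary text box are still missed, as noted above.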

2017-04-12 19:11:08 scanny
