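Updating docx metadata for a large number of files using python-docx

I have about 300 docx files, spread across a folder and its subfolders, whose metadata needs updating. I have a separate csv file of 300+ rows containing the metadata: each row holds the filename, the keywords and the title.

I'd like to loop through the docx files, pulling the content from the csv and inserting the metadata into each docx file. The docx files are stored 2 subfolders down from the root folder.

So far I've sketched out the following. I'm struggling to work out how to loop through the csv file and apply the metadata to each file in turn. I'm sure there's a relatively simple way to do this; setting up the loop and getting the csv content is where I'm lost. I'm a bit of a noob, feeling my way as I go.

Any tips appreciated.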

#running in python 3.5.2 32bit 
import csv 
from docx import Document 
import os 
import sys 

csv_path = ("datasheet_metadata_uplift.csv") 

def update_docx_metadata(document, keywords, title): 
    """ 
    Update the *keywords*, and *title* metadata 
    properties in *document*. 
    """ 
    core_properties = document.core_properties 
    core_properties.keywords = keywords 
    core_properties.title = title 

def read_csv_lines(filename, keywords, title): 
    """ 
    Reads the csv lines, returns *filename*, *keywords*, *title* 
    """ 
    with open(csv_path, 'r') as f: 
     csv_file = csv.reader(f) 
     for row in csv_file: 
      filename = row[0] 
      keywords = row[1] 
      title = row[2] 

def open_docx(filename): 
    """ 
    Search for docx file and open it 
    """ 
    for root, dirs, files in os.walk("."): 
     if filename in files: 
      doc_path = os.path.join(path, filename) 

csv_lines = read_csv_lines(filename, keywords, title) 
for filename, keywords, title in csv_lines: 
    document = Document(doc_path) 
    update_doc_metadata(filename, keywords, title) 
    document.save(doc_path)


2016-11-17 Aidan

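As a next step, Aidan, I'd recommend refactoring your code into cohesive functions. That lets you perform each operation where it's needed with a single function call, so the intent and flow aren't obscured.

You might start with something like this: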

def update_doc_metadata(document, author, keywords, title, subject): 
    """ 
    Update the *author*, *keywords*, *title*, and *subject* metadata 
    properties in *document*. 
    """ 
    core_properties = document.core_properties 
    core_properties.author = author 
    core_properties.keywords = keywords 
    core_properties.title = title 
    core_properties.subject = subject

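A few things to notice:

It is cohesive, meaning it does only one or two things. This makes it more reusable.
It doesn't depend on anything that doesn't come in as a parameter. This makes it easy to test (if you do that) and generally easy to understand, because all the context you need is in those ten lines.
It has a docstring that states plainly what it does. This is a useful discipline, not only because it helps the reader (quite possibly you, weeks or months later) understand the intent, but because it forces you to explain what you're doing. Often you can detect poor factoring because the explanation is difficult or long. (The asterisks around the parameter names render as italics in some documentation packages.)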
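
To illustrate the point about testability, here is a minimal sketch that exercises the function without touching a real docx file. The `fake_document` object and the sample values are hypothetical stand-ins, not part of python-docx; the sketch only assumes the function sets attributes on `document.core_properties`, which is all it does.

```python
from types import SimpleNamespace


def update_doc_metadata(document, author, keywords, title, subject):
    """
    Update the *author*, *keywords*, *title*, and *subject* metadata
    properties in *document*.
    """
    core_properties = document.core_properties
    core_properties.author = author
    core_properties.keywords = keywords
    core_properties.title = title
    core_properties.subject = subject


# Hypothetical stand-in for a python-docx Document: any object exposing
# a core_properties attribute is enough for this function.
fake_document = SimpleNamespace(core_properties=SimpleNamespace())

update_doc_metadata(fake_document, "A. Author", "alpha; beta", "Datasheet", "Uplift")

# Everything the function touched is visible on the stand-in object.
assert fake_document.core_properties.author == "A. Author"
assert fake_document.core_properties.title == "Datasheet"
```

Because the function takes everything it needs as parameters, the same call works unchanged on a real document opened with python-docx.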

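If you keep going like this, locating cohesive bits and "extracting" them into functions, the core logic of the main code becomes clearer.

I think the overall structure is something like this: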

csv_lines = read_csv_lines(csv_path) 
for filename, keywords, title in csv_lines: 
    doc_path, document = open_docx(filename) 
    update_doc_metadata(document, author, keywords, title, subject) 
    document.save(doc_path)
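
The two helpers used in that loop could be sketched roughly as follows. This is one possible shape, not a definitive implementation: it assumes the csv columns are filename, keywords, title (as described in the question), that filenames are unique under the root folder, and it introduces a hypothetical helper `find_docx_path` plus a `root` parameter that are not in the original code.

```python
import csv
import os


def read_csv_lines(csv_path):
    """Return a list of (filename, keywords, title) tuples, one per csv row."""
    with open(csv_path, newline="") as f:
        # Blank rows are skipped; column order is an assumption from the question.
        return [(row[0], row[1], row[2]) for row in csv.reader(f) if row]


def find_docx_path(filename, root="."):
    """Return the path of the first file named *filename* under *root*, or None."""
    for dirpath, _dirnames, filenames in os.walk(root):
        if filename in filenames:
            return os.path.join(dirpath, filename)
    return None


def open_docx(filename, root="."):
    """Locate *filename* under *root* and return (doc_path, document)."""
    # python-docx is imported here so the two helpers above work without it.
    from docx import Document

    doc_path = find_docx_path(filename, root)
    if doc_path is None:
        raise FileNotFoundError(filename)
    return doc_path, Document(doc_path)
```

Returning the path together with the document lets the main loop save back to the same location without searching the folder tree a second time.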


2016-11-17 22:20:10 scanny

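Hi Scanny - thanks! Very helpful answer. I've been refactoring to use functions, as you suggested, but something isn't quite right. I get a "NameError: name 'filename' is not defined" error related to the last part of the code. I've updated the original post with the new code. Any ideas? – Aidan

@Aidan I think you may be confused about how function parameters work in Python. They bring values *into* the function, but generally not *out*. For that you need a return statement. So read_csv_lines should take only csv_path as a parameter and return a sequence (probably a list) of (filename, keywords, title) sequences (probably tuples). I think the return value of read_csv_lines could be just 'return [row for row in csv_file]'. You might want to look up some Python tutorial resources. I like [this one](https://pymotw.com/3/), and the official Python tutorial is pretty good :) – scanny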
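
A small sketch of the difference described in that comment, using made-up rows rather than a real csv file. The function names `broken_reader` and `working_reader` are hypothetical and exist only for this illustration.

```python
def broken_reader(rows, filename):
    # Rebinding a parameter only changes the local name inside the function.
    for row in rows:
        filename = row[0]
    # No return statement, so the caller receives None.


def working_reader(rows):
    # Values leave a function through return.
    return [(row[0], row[1], row[2]) for row in rows]


rows = [["a.docx", "k1", "Title A"], ["b.docx", "k2", "Title B"]]

name = "unset"
result = broken_reader(rows, name)
assert result is None      # nothing came back out
assert name == "unset"     # the caller's variable was not changed

for filename, keywords, title in working_reader(rows):
    print(filename, keywords, title)
```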

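OK, thanks for your help scanny, I realised that when I looked at this today. – Aidan

So I figured this out, and it ended up being pretty simple. I also made it easier on myself by putting the full file path in the csv. Thanks to scanny for the encouragement. Next stop, the documentation and tutorial pages :)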

#runs in python 3.5.2 32-bit 
#docx requires 32 bit operation 
import csv 
from docx import Document 
import os 
import sys 

#path to the csv file - csv file must contain rows as follows:
#full filepath, title, keywords
#ensure there are no commas, other than the csv delimiters

csv_path = "datasheet_metadata_uplift.csv"

#read every csv row into a (filename, title, keywords) tuple, skipping blank rows
#the with block closes the file; newline="" is what the csv module expects
with open(csv_path, newline="") as f:
    rows = [(row[0], row[1], row[2]) for row in csv.reader(f) if row]

#update the docx files, one for each csv file entry
for filename, title, keywords in rows:
    document = Document(filename)
    core_properties = document.core_properties
    core_properties.keywords = keywords
    core_properties.title = title
    core_properties.subject = "YOUR_SUBJECT_HERE"
    core_properties.comments = " "
    #note: company does not appear to be one of python-docx's core properties,
    #so this assignment may have no effect on the saved file
    core_properties.company = "YOUR_COMPANY_HERE"
    document.save(filename)

print("finished!")


2016-11-21 15:43:48 Aidan
