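Is there a way to speed up this Python program? (short)

I'm new to programming, so if the logic in the program below doesn't make sense, that's probably why. Fortunately, the code runs and does everything I need, but it feels like it takes a long time to execute (6 minutes per 10,000 records).

The purpose of the program is to assign new IDs to the records in my database, and to let the user specify the increment and the starting point for those IDs.

To be honest, I'm not entirely sure the execution time is unreasonable, since I don't have much experience to judge it against, but if there is a way to speed it up, I'm all ears.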

# generates study IDs for MS Access dataset 

import pyodbc 
import random 
import time 

startTime = time.time() 

dbFile = 'C:\Backend.accdb' 
conn = pyodbc.connect(r'DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};' 
         + 'DBQ=' + dbFile + '; Provider=MSDASQL;') 
cursor = conn.cursor() 


# shuffle the existing IDs so the assignment of the new IDs is random 
a = [] 
sql = "SELECT ID FROM Clients" 

for row in cursor.execute(sql): 
    a.append(row.ID) 

print "\nIDs appended to list...\n" 

random.shuffle(a) 

print "\nlist shuffled\n" 

# assign new IDs according to the conditions below 
startPt = 900001 
increment = 7 
idList = {} 

for i in a: 
    idList[i] = startPt 
    startPt += increment 

# append new IDs to another table in the database 
for j, k in idList.iteritems(): 
    sql = "INSERT INTO newID values ('%s', '%s')" %(j,k) 
    cursor.execute(sql) 
    conn.commit() 

# close connection 
cursor.close() 
conn.close() 

# calculate, in seconds, the time the program took to execute  
executionTime = str(time.time() - startTime) 

print "completed. the program took %s seconds to execute." %executionTime

Source

2012-03-23 fromabove

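http://codereview.stackexchange.com might be a better place for this. – 2012-03-23 22:30:09

You should be aware that a backslash in a string introduces an 'escape sequence', so while your line dbFile = 'C:\Backend.accdb' happens to work, it would break if the first character after the backslash were r, t, n, or certain other letters. Use double backslashes in single- or double-quoted strings, or use a raw string (r"c:\thing"), or use forward slashes (which work as path separators even on Windows). – 2012-03-23 22:45:02

See http://docs.python.org/library/profile.html, but probably just moving conn.commit() to right before you close the connection will make a huge difference. – agf 2012-03-23 22:58:50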
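To illustrate the backslash comment above, here is a minimal sketch. It uses Python 3 syntax (the question's code is Python 2), and the path C:\temp\new.accdb is a made-up example chosen because \t and \n are escape sequences; it is not the path from the question.

```python
# Hypothetical path, picked so that the escapes actually bite.
plain = 'C:\temp\new.accdb'      # \t becomes a tab, \n becomes a newline
raw = r'C:\temp\new.accdb'       # raw string: backslashes are kept literally
doubled = 'C:\\temp\\new.accdb'  # each backslash escaped by hand
forward = 'C:/temp/new.accdb'    # forward slashes avoid the problem entirely

print(repr(plain))     # shows the tab and newline that crept in
print(raw == doubled)  # the raw and doubled forms are the same string
```

Any of the last three spellings should be safe to pass as the DBQ value.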

# shuffle the existing IDs so the assignment of the new IDs is random 
a = [] 
sql = "SELECT ID FROM Clients" 

for row in cursor.execute(sql): 
    a.append(row.ID)

If you want everything in a list, use cursor.fetchall(), which will build the list for you.

print "\nIDs appended to list...\n" 

random.shuffle(a) 

print "\nlist shuffled\n"

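You should be able to change your query so the database does the shuffling for you, e.g. SELECT ID FROM Clients ORDER BY RAND() or similar. That way you don't have to shuffle the list yourself, and it might be faster.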

for i in a: 
    idList[i] = startPt 
    startPt += increment

Why are you storing the data in a dictionary instead of just working with it directly?
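One way to skip the dictionary, as a rough sketch in Python 3 syntax: build a list of (old ID, new ID) pairs in a single pass. The function name assign_new_ids and the sample IDs are made up for illustration; the start point and increment are the ones from the question.

```python
import random

def assign_new_ids(old_ids, start=900001, increment=7):
    """Pair each old ID, in random order, with an evenly spaced new ID."""
    shuffled = list(old_ids)   # copy so the caller's list is left alone
    random.shuffle(shuffled)
    return [(old_id, start + i * increment)
            for i, old_id in enumerate(shuffled)]

pairs = assign_new_ids([11, 12, 13])  # sample IDs, not real data
print(pairs)
```

A list of tuples like this is also the shape that a parameterized insert expects, so nothing needs to be converted later.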

# append new IDs to another table in the database 
for j, k in idList.iteritems(): 
    sql = "INSERT INTO newID values ('%s', '%s')" %(j,k) 
    cursor.execute(sql)

You should almost always use query parameters rather than string formatting:

cursor.execute("INSERT INTO newID values(?,?)", (j, k))

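That keeps you safe from SQL injection. You can also use the executemany function. It lets you pass a list of parameter sets and runs the same query once for each of them. That is probably the fastest way to process the data.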

conn.commit()

You shouldn't commit after every insert. Normally you wait until you are completely done and commit once.
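Putting the suggestions above together, a sketch of what the whole script might look like. To keep it self-contained it uses Python 3 and an in-memory sqlite3 database as a stand-in for the Access file; the newID column names and the 100 sample rows are assumptions, not taken from the question.

```python
import random
import sqlite3

# Stand-in database: in-memory sqlite3, NOT the Access file from the question.
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute("CREATE TABLE Clients (ID INTEGER)")
cursor.execute("CREATE TABLE newID (oldID INTEGER, studyID INTEGER)")  # assumed columns
cursor.executemany("INSERT INTO Clients VALUES (?)", [(n,) for n in range(1, 101)])

# fetchall() returns a list of rows; take the first column of each row.
old_ids = [row[0] for row in cursor.execute("SELECT ID FROM Clients").fetchall()]
random.shuffle(old_ids)

start_pt, increment = 900001, 7
params = [(old_id, start_pt + i * increment) for i, old_id in enumerate(old_ids)]

# One parameterized statement for all rows, followed by a single commit.
cursor.executemany("INSERT INTO newID VALUES (?, ?)", params)
conn.commit()

inserted = cursor.execute("SELECT COUNT(*) FROM newID").fetchone()[0]
print(inserted)
conn.close()
```

With pyodbc, the cursor.executemany(sql, params) and conn.commit() calls should look the same on the question's own connection; how much time that saves against Access is something you would have to measure.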

Source

2012-03-23 22:44:26

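I'd bet the vast majority of the time is spent on those last two lines. Committing just once would probably speed things up a lot. – agf 2012-03-23 22:57:35

Ah, I didn't notice the conn.commit(); I thought the insert was executed right away. I guess conn.commit() is lazy and optimizes the pending work? I suggested executemany() as a way to commit only once. – 2012-03-23 23:05:02

@robertking, commit forces the changes to disk. Doing that for every insert is expensive. – 2012-03-23 23:16:51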

You are inserting the IDs into the database one at a time. You can insert them all at once with one big query:

"INSERT INTO newID values (123, 123), (456, 456), (789, 789)" (and so on)

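This means you need to build the query string first and then execute it. If the code is still slow after that, you should use a Python profiler to see which part is the bottleneck.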
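For the profiling step, a minimal sketch using the standard library's cProfile and pstats modules (Python 3 syntax). The build_pairs function is only a placeholder workload standing in for the real script's steps.

```python
import cProfile
import io
import pstats

def build_pairs(n, start=900001, increment=7):
    # Placeholder workload; in the real script this would be the DB code.
    return [(i, start + i * increment) for i in range(n)]

profiler = cProfile.Profile()
profiler.enable()
pairs = build_pairs(10000)
profiler.disable()

# Collect the five most expensive calls, sorted by cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats('cumulative').print_stats(5)
print(report.getvalue())
```

For a whole script, running python -m cProfile yourscript.py from the command line gives a similar report without changing the code.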

Source

2012-03-23 22:29:49

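I suggest you print out how long each section of the code takes.

I think the slowest part will be the inserts into newID, especially if the table has a primary key.

I suggest you use an "execute many" call for the inserts, so that they are all done in one go.

In fact, pyodbc appears to have exactly such a function: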
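A small sketch of one way to do that timing, in Python 3 syntax. The timed helper and the section labels are made up for illustration, and the two timed blocks are stand-ins for the real database steps.

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(label):
    """Record how long the enclosed block takes, in seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start

with timed('build list'):
    ids = list(range(10000))

with timed('assign new IDs'):
    pairs = [(old, 900001 + i * 7) for i, old in enumerate(ids)]

for label, seconds in timings.items():
    print('%s: %.6f seconds' % (label, seconds))
```

Wrapping the SELECT, the shuffle, and the INSERT loop in separate timed blocks should show quickly which one dominates.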

executemany

cursor.executemany(sql, seq_of_parameters) --> None 

Executes the same SQL statement for each set of parameters. seq_of_parameters is a sequence of sequences. 

params = [ ('A', 1), ('B', 2) ] 
executemany("insert into t(name, id) values (?, ?)", params) 
This will execute the SQL statement twice, once with ('A', 1) and once with ('B', 2).

See http://code.google.com/p/pyodbc/wiki/Cursor

Source

2012-03-23 22:33:06
