Perl vs. Python log processing performance

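I am working on a web-based log management system that will be built on the Grails framework, and I plan to use a text-processing language such as Python or Perl for it. I wrote a Python script and a Perl script that each load a log file, parse every line, and save the result to a MySQL database (the file contains about 40,000 lines and is about 7 MB). The run took 1 min 2 sec with Perl and only 17 sec with Python. I had assumed Perl would be faster than Python, since Perl is the original text-processing language (my assumption also came from various blogs where I had been reading about Perl's text-processing performance). I also did not expect a difference of roughly 45 seconds between Perl and Python. Why does Perl take longer than Python to process my log file? Is it because I am using the wrong database module, or can my Perl code and regular expression be improved?

Note: I am a Java and Groovy developer and have no experience with Perl (I am using Strawberry Perl v5.16). I also ran this test with Java (1 min 5 sec) and Groovy (1 min 7 sec), but more than 1 minute to process the log file is too long, so both of those languages are out, and now I want to choose between Perl and Python.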

Perl code

use DBI; 
use DBD::mysql; 
# make connection to database 
$connection = DBI->connect("dbi:mysql:logs:localhost:3306","root","") || die  "Cannot connect: $DBI::errstr"; 

# set the value of your SQL query 
$query = "insert into logs (line_number, dated, time_stamp, thread, level, logger, user, message) 
     values (?, ?, ?, ?, ?, ?, ?, ?) "; 

# prepare your statement for connecting to the database 
$statement = $connection->prepare($query); 

$runningTime = time; 

# open text file 
open (LOG,'catalina2.txt') || die "Cannot read logfile!\n";; 

while (<LOG>) { 
    my ($date, $time, $thread, $level, $logger, $user, $message) = /^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2},\d{3}) (\[.*\]) (.*) (\S*) (\(.*\)) - (.*)$/; 

    $statement->execute(1, $date, $time, $thread, $level, $logger, $user, $message); 
} 

# close the open text file 
close(LOG); 

# close database connection 
$connection->disconnect; 

$runningTime = time - $runningTime; 
printf("\n\nTotal running time: %02d:%02d:%02d\n\n", int($runningTime/3600), int(($runningTime % 3600)/60), int($runningTime % 60)); 

# exit the script 
exit;

Python code

import re 
import mysql.connector 
import time 

file = open("D:\catalina2.txt","r") 
rexp = re.compile('^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2},\d{3}) (\[.*\]) (.*) (\S*) (\(.*\)) - (.*)$') 
conn = mysql.connector.connect(user='root',host='localhost',database='logs') 
cursor = conn.cursor() 

tic = time.clock() 

increment = 1 
for text in file.readlines(): 
    match = rexp.match(text) 
    increment += 1 
cursor.execute('insert into logs (line_number,dated, time_stamp, thread,level,logger,user,message) values (%s,%s,%s,%s,%s,%s,%s,%s)', (increment, match.group(1), match.group(2),match.group(3),match.group(4),match.group(5),match.group(6),match.group(7))) 

conn.commit() 
cursor.close() 
conn.close() 

toc = time.clock() 
print "Total time: %s" % (toc - tic)

Source

2012-11-11 Martin M.

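Perl being faster at text processing does not mean it is faster at database queries. – texasbruce

Even so, @Martin M is also using DFA-based regexp compilation in Python, but not leveraging the same in Perl ('re::engine::RE2'). –

In general these things are hard to compare, especially with a single data instance and at the time scales you are looking at. – Bitwise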

This is not a fair comparison:

In Python you call cursor.execute only once:

for text in file.readlines(): 
    match = rexp.match(text) 
    increment += 1 
cursor.execute('insert into logs (line_number,dated, time_stamp, thread,level,logger,user,message) values (%s,%s,%s,%s,%s,%s,%s,%s)', (increment, match.group(1), match.group(2),match.group(3),match.group(4),match.group(5),match.group(6),match.group(7)))

But in Perl you call $statement->execute many times:

while (<LOG>) { 
    my ($date, $time, $thread, $level, $logger, $user, $message) = /^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2},\d{3}) (\[.*\]) (.*) (\S*) (\(.*\)) - (.*)$/; 

    $statement->execute(1, $date, $time, $thread, $level, $logger, $user, $message); 
}
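To make the difference concrete, here is a minimal sketch that mirrors the structure of the two loops. It is not the original benchmark: it uses an in-memory sqlite3 database as a stand-in for MySQL so it runs without a server (hence the ? placeholders instead of %s), the sample log lines are made up, and the helper names are hypothetical. It is written for Python 3.

```python
import re
import sqlite3

# Same pattern as in the question.
LOG_RE = re.compile(
    r'^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2},\d{3}) (\[.*\]) (.*) (\S*) (\(.*\)) - (.*)$'
)

# Made-up sample lines in the format the pattern expects.
SAMPLE_LINES = [
    "2012-11-11 01:25:37,001 [main] INFO com.example.App (alice) - started\n",
    "2012-11-11 01:25:37,002 [main] WARN com.example.App (bob) - slow query\n",
    "2012-11-11 01:25:37,003 [worker-1] ERROR com.example.Db (carol) - timeout\n",
]

INSERT_SQL = "insert into logs values (?, ?, ?, ?, ?, ?, ?, ?)"


def make_db():
    """Create an in-memory database with the same columns as the logs table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table logs (line_number integer, dated text, time_stamp text,"
        " thread text, level text, logger text, user text, message text)"
    )
    return conn


def insert_after_loop(conn, lines):
    """Same shape as the question's Python code: execute runs once, after the loop."""
    match = None
    line_number = 0
    for line_number, text in enumerate(lines, start=1):
        match = LOG_RE.match(text)
    if match:
        conn.execute(INSERT_SQL, (line_number,) + match.groups())
    conn.commit()


def insert_per_line(conn, lines):
    """Same shape as the Perl loop: one execute per parsed line."""
    for line_number, text in enumerate(lines, start=1):
        match = LOG_RE.match(text)
        if match:
            conn.execute(INSERT_SQL, (line_number,) + match.groups())
    conn.commit()


def count_rows(conn):
    return conn.execute("select count(*) from logs").fetchone()[0]
```

With the three sample lines, insert_after_loop leaves a single row in the table (the last line), while insert_per_line leaves three. Counting the rows after each run is a quick way to check that both scripts are doing the same amount of work before comparing their timings.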

By the way, in the Python version, calling cursor.execute once for every line would be slow. You can make it faster by using cursor.executemany:

sql = 'insert into logs (line_number,dated, time_stamp, thread,level,logger,user,message) values (%s,%s,%s,%s,%s,%s,%s,%s)' 
args = [] 
for text in file: 
    match = rexp.match(text) 
    increment += 1 
    args.append([increment] + list(match.groups())) 

cursor.executemany(sql, args)

If the log file has too many lines, you may need to break this up into chunks:

args = [] 
for text in file: 
    match = rexp.match(text) 
    increment += 1 
    args.append([increment] + list(match.groups())) 
    if increment % 1000 == 0: 
        cursor.executemany(sql, args) 
        args = [] 
if args: 
    cursor.executemany(sql, args)
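For anyone who wants to try the chunked version without a MySQL server, here is a self-contained sketch of the same idea. The assumptions: sqlite3 stands in for mysql.connector (so the placeholders are ? rather than %s), the function and helper names are hypothetical, and lines that do not match the pattern (stack traces, for example) are skipped rather than raising an error. Because lines can be skipped, the flush is triggered by the batch length instead of the line counter.

```python
import re
import sqlite3

# Same pattern as in the question.
LOG_RE = re.compile(
    r'^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2},\d{3}) (\[.*\]) (.*) (\S*) (\(.*\)) - (.*)$'
)

INSERT_SQL = "insert into logs values (?, ?, ?, ?, ?, ?, ?, ?)"


def create_logs_table(conn):
    """Create a table with the same columns as the logs table in the question."""
    conn.execute(
        "create table logs (line_number integer, dated text, time_stamp text,"
        " thread text, level text, logger text, user text, message text)"
    )


def load_in_chunks(conn, lines, chunk_size=1000):
    """Parse lines and insert them in batches; return the number of rows inserted."""
    cursor = conn.cursor()
    batch = []
    inserted = 0
    for line_number, text in enumerate(lines, start=1):
        match = LOG_RE.match(text)
        if match is None:
            continue  # not a log record in the expected format
        batch.append((line_number,) + match.groups())
        if len(batch) >= chunk_size:
            cursor.executemany(INSERT_SQL, batch)
            inserted += len(batch)
            batch = []
    if batch:
        # Flush whatever is left after the last full chunk.
        cursor.executemany(INSERT_SQL, batch)
        inserted += len(batch)
    conn.commit()
    cursor.close()
    return inserted


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_logs_table(conn)
    sample = [
        "2012-11-11 01:25:37,001 [main] INFO com.example.App (alice) - started\n",
        "    at com.example.App.run(App.java:42)\n",
        "2012-11-11 01:25:37,002 [main] WARN com.example.App (bob) - slow query\n",
    ]
    print(load_in_chunks(conn, sample, chunk_size=2))
```

The chunk size of 1000 is simply the value used in the snippet above, not a tuned number; the best value depends on row size and on the database driver.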

(Also, don't use file.readlines(), since that creates a list, which may be huge. file is an iterator that yields one line at a time, so for text in file is enough.)
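A small illustration of that difference, assuming Python 3 and using io.StringIO as a stand-in for an open log file so that it runs without a file on disk:

```python
import io

# io.StringIO behaves like an open text file for the purposes of this sketch.
handle = io.StringIO("line one\nline two\nline three\n")

# readlines() reads everything and builds a list holding every line at once.
all_lines = handle.readlines()

# Rewind, then iterate over the file object itself: one line per step,
# so only the current line has to be held by the loop.
handle.seek(0)
longest = 0
for text in handle:
    longest = max(longest, len(text))
```

For a 7 MB file either approach fits in memory comfortably, so this matters mostly once the logs grow much larger.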

Source

2012-11-11 01:25:37 unutbu

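You would think he would have noticed the difference in the number of rows inserted. :-) Is there a typo in his post? –

I thought it was the OP's typo... – texasbruce

But it actually makes sense. Python can't be that much faster than Java and Perl. – texasbruce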
