2016-10-13 159 views

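Compare two files for differences in Python

I want to compare two files (take a line from the first file and look for it in the whole second file) to see the differences between them, and then append the lines of fileA.txt that are missing from fileB.txt to the end of fileB.txt. I am new to Python, so my first attempt was a simple program like this: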

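import difflib

file1 = "fileA.txt"
file2 = "fileB.txt"

# Print a line-by-line delta; each output line starts with "- ", "+ " or "  "
with open(file1) as fa, open(file2) as fb:
    diff = difflib.ndiff(fa.readlines(), fb.readlines())
    print(''.join(diff), end='')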

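But the result is both files combined, with a marker in front of each line. I know I can look for the lines that start with the "- " marker and write them to the end of fileB.txt, but for large files (~100 MB) this method is not efficient. Can someone help me improve the program?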
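For reference, the marker-filtering approach mentioned above could look roughly like the sketch below. The function name `lines_only_in_first` is made up for illustration. Note that `difflib.ndiff` compares the two sequences positionally, so a line that exists in both files but at very different positions may still be reported with a "- " marker.

```python
import difflib

def lines_only_in_first(a_lines, b_lines):
    """Return the lines that ndiff marks with '- ' (present only in a_lines)."""
    # Each ndiff entry starts with a two-character code; strip it from the matches
    return [d[2:] for d in difflib.ndiff(a_lines, b_lines) if d.startswith('- ')]

a = ["one\n", "two\n", "three\n"]
b = ["one\n", "two\n"]
missing = lines_only_in_first(a, b)
print(missing)
```

This still builds the full diff in memory, which is the cost the question is worried about for files around 100 MB.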

The file structure looks like this:

Input:

fileA.txt

Oct 9 13:25:31 user sshd[12844]: Accepted password for root from 213.XXX.XXX.XX7 port 33254 ssh2 
Oct 9 13:25:31 user sshd[12844]: pam_unix(sshd:session): session opened for user root by (uid=0) 
Oct 9 13:35:48 user sshd[12868]: Accepted password for root from 213.XXX.XXX.XX7 port 33574 ssh2 
Oct 9 13:35:48 user sshd[12868]: pam_unix(sshd:session): session opened for user root by (uid=0) 
Oct 9 13:46:58 user sshd[12844]: Received disconnect from 213.XXX.XXX.XX7: 11: disconnected by user 
Oct 9 13:46:58 user sshd[12844]: pam_unix(sshd:session): session closed for user root 
Oct 9 15:47:58 user sshd[12868]: pam_unix(sshd:session): session closed for user root 
Oct 11 22:17:31 user sshd[2655]: Accepted password for root from 17X.XXX.XXX.X19 port 5567 ssh2 
Oct 11 22:17:31 user sshd[2655]: pam_unix(sshd:session): session opened for user root by (uid=0) 

fileB.txt

Oct 9 12:19:16 user sshd[12744]: Accepted password for root from 213.XXX.XXX.XX7 port 60554 ssh2 
Oct 9 12:19:16 user sshd[12744]: pam_unix(sshd:session): session opened for user root by (uid=0) 
Oct 9 13:24:42 user sshd[12744]: Received disconnect from 213.XXX.XXX.XX7: 11: disconnected by user 
Oct 9 13:24:42 user sshd[12744]: pam_unix(sshd:session): session closed for user root 
Oct 9 13:25:31 user sshd[12844]: Accepted password for root from 213.XXX.XXX.XX7 port 33254 ssh2 
Oct 9 13:25:31 user sshd[12844]: pam_unix(sshd:session): session opened for user root by (uid=0) 
Oct 9 13:35:48 user sshd[12868]: Accepted password for root from 213.XXX.XXX.XX7 port 33574 ssh2 
Oct 9 13:35:48 user sshd[12868]: pam_unix(sshd:session): session opened for user root by (uid=0) 

Output:

fileB_after.txt

Oct 9 12:19:16 user sshd[12744]: Accepted password for root from 213.XXX.XXX.XX7 port 60554 ssh2 
Oct 9 12:19:16 user sshd[12744]: pam_unix(sshd:session): session opened for user root by (uid=0) 
Oct 9 13:24:42 user sshd[12744]: Received disconnect from 213.XXX.XXX.XX7: 11: disconnected by user 
Oct 9 13:24:42 user sshd[12744]: pam_unix(sshd:session): session closed for user root 
Oct 9 13:25:31 user sshd[12844]: Accepted password for root from 213.XXX.XXX.XX7 port 33254 ssh2 
Oct 9 13:25:31 user sshd[12844]: pam_unix(sshd:session): session opened for user root by (uid=0) 
Oct 9 13:35:48 user sshd[12868]: Accepted password for root from 213.XXX.XXX.XX7 port 33574 ssh2 
Oct 9 13:35:48 user sshd[12868]: pam_unix(sshd:session): session opened for user root by (uid=0) 
Oct 9 13:46:58 user sshd[12844]: Received disconnect from 213.XXX.XXX.XX7: 11: disconnected by user 
Oct 9 13:46:58 user sshd[12844]: pam_unix(sshd:session): session closed for user root 
Oct 9 15:47:58 user sshd[12868]: pam_unix(sshd:session): session closed for user root 
Oct 11 22:17:31 user sshd[2655]: Accepted password for root from 17X.XXX.XXX.X19 port 5567 ssh2 
Oct 11 22:17:31 user sshd[2655]: pam_unix(sshd:session): session opened for user root by (uid=0) 

So basically you want to merge two text files without keeping duplicates? – MooingRawr

Answers

1

Try this in bash:

cat fileA.txt fileB.txt | sort -M | uniq > new_file.txt 

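sort -M: sorts by an initial string, consisting of any amount of whitespace, followed by a month name abbreviation, which is folded to UPPER case and compared in the order 'JAN' < 'FEB' < ... < 'DEC'. Invalid names compare low to valid names. The 'LC_TIME' locale determines the month spellings.

uniq: filters out duplicate adjacent lines, which is why the input is sorted first.

|: passes the output of one command to another command for further processing.

What this does is take the two files, sort them as described above, keep only the unique lines, and store them in new_file.txt.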
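Since the question is about Python, here is a rough Python sketch of what the pipeline above does. It is an approximation, not a reimplementation of `sort`: the names `month_rank` and `merge_sorted_unique` are invented for this example, English month abbreviations are assumed, and ties within a month are assumed to fall back to plain string comparison of the whole line.

```python
MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

def month_rank(line):
    """Rank a line by its leading month abbreviation; unknown names rank lowest (0)."""
    name = line.lstrip()[:3].upper()
    return MONTHS.index(name) + 1 if name in MONTHS else 0

def merge_sorted_unique(lines_a, lines_b):
    """Merge two lists of lines, drop duplicates, and sort by month, then by raw text."""
    return sorted(set(lines_a) | set(lines_b), key=lambda l: (month_rank(l), l))

merged = merge_sorted_unique(
    ["Oct 9 13:25:31 a", "Sep 30 08:00:00 b"],
    ["Oct 11 22:17:31 c", "Oct 9 13:25:31 a"],
)
print("\n".join(merged))
```

Under these assumptions the "Oct 11" line lands before the "Oct 9" line, because within the same month the comparison is textual rather than by date. Sorting on the day and time fields as well would need extra sort keys.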

Note: this is not a Python solution, but you tagged the question linux, so I thought it might interest you. You can also find more details about the commands used here


I'm not a bash expert. I'd like to know how this works. – galaxyan


I mean the sorted result may not be ordered by timestamp. – galaxyan


@galaxyan, actually sort has a lot of options: http://ss64.com/bash/sort.html – coder

1

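Read both files and parse the timestamp at the start of each line.
Build a set of lines from each file.
Take the union of the two sets and sort it by timestamp.
Join the sorted union into a single string, one entry per line.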

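import datetime

file1 = "fileA.txt"
file2 = "fileB.txt"

def timestamp(line):
    # The first three fields are the timestamp, e.g. "Oct 9 13:25:31"
    return datetime.datetime.strptime(' '.join(line.split()[:3]), '%b %d %H:%M:%S')

# Read each file into a set of lines (newlines stripped) to drop duplicates
with open(file1) as f:
    sa = set(line.rstrip('\n') for line in f)
with open(file2) as f:
    sb = set(line.rstrip('\n') for line in f)

# Sort the union by timestamp; lines with equal timestamps come out in arbitrary order
print('\n'.join(sorted(sa.union(sb), key=timestamp)))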



Oct 9 12:19:16 user sshd[12744]: pam_unix(sshd:session): session opened for user root by (uid=0) 
Oct 9 12:19:16 user sshd[12744]: Accepted password for root from 213.XXX.XXX.XX7 port 60554 ssh2 
Oct 9 13:24:42 user sshd[12744]: pam_unix(sshd:session): session closed for user root 
Oct 9 13:24:42 user sshd[12744]: Received disconnect from 213.XXX.XXX.XX7: 11: disconnected by user 
Oct 9 13:25:31 user sshd[12844]: Accepted password for root from 213.XXX.XXX.XX7 port 33254 ssh2 
Oct 9 13:25:31 user sshd[12844]: pam_unix(sshd:session): session opened for user root by (uid=0) 
Oct 9 13:35:48 user sshd[12868]: Accepted password for root from 213.XXX.XXX.XX7 port 33574 ssh2 
Oct 9 13:35:48 user sshd[12868]: pam_unix(sshd:session): session opened for user root by (uid=0) 
Oct 9 13:46:58 user sshd[12844]: pam_unix(sshd:session): session closed for user root 
Oct 9 13:46:58 user sshd[12844]: Received disconnect from 213.XXX.XXX.XX7: 11: disconnected by user 
Oct 9 15:47:58 user sshd[12868]: pam_unix(sshd:session): session closed for user root 
Oct 11 22:17:31 user sshd[2655]: pam_unix(sshd:session): session opened for user root by (uid=0) 
Oct 11 22:17:31 user sshd[2655]: Accepted password for root from 17X.XXX.XXX.X19 port 5567 ssh2
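
Both answers above rebuild a merged, sorted file. The question itself asks for something slightly different: appending to the end of fileB.txt the lines of fileA.txt that it does not already contain. A minimal sketch of that, assuming lines must match exactly, the lines of fileB.txt fit in memory as a set, and fileB.txt already ends with a newline (the function name `append_missing_lines` is made up for this example; the demo uses temporary files instead of the real logs):

```python
import os
import tempfile

def append_missing_lines(src_path, dst_path):
    """Append to dst_path every line of src_path it lacks; return how many were added."""
    # Load the destination's lines into a set for constant-time membership tests
    with open(dst_path) as dst:
        seen = {line.rstrip('\n') for line in dst}
    added = 0
    # Stream the source line by line so it never has to be held in memory as a whole
    with open(src_path) as src, open(dst_path, 'a') as dst:
        for line in src:
            text = line.rstrip('\n')
            if text not in seen:
                dst.write(text + '\n')
                seen.add(text)
                added += 1
    return added

# Small demonstration with throwaway files
with tempfile.TemporaryDirectory() as tmp:
    a_path = os.path.join(tmp, "fileA.txt")
    b_path = os.path.join(tmp, "fileB.txt")
    with open(a_path, "w") as f:
        f.write("line 1\nline 2\nline 3\n")
    with open(b_path, "w") as f:
        f.write("line 0\nline 1\n")
    added = append_missing_lines(a_path, b_path)
    with open(b_path) as f:
        result = f.read()

print(added)
print(result, end="")
```

This keeps the original order of both files and does a single pass over each, at the cost of holding the destination's distinct lines in memory.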