Python script for CSV file comparison

I am new to Python and am currently working on a data comparison script. I have two CSV files in the following format.

CSV1 (source system)

EMP_ID, EMP_NAME, EMP_LOCATION 
1,Sam J, Houston 
2,Man T, Houston 
3,Sub D, Chicago 
4,Saggie D, New York

CSV2 (migrated/target)

EMP_ID, EMP_NAME, EMP_LOCATION 
1,Sam J, Houston£[email protected]£ 
2,Man T, Houston 
3,Sub D, Chicago 
4,Saggie D, New York^^^

I want to compare the two files and get a result like this:

EMP_ID_S, EMP_ID_T, EMP_ID_STATUS, EMP_NAME_S, EMP_NAME_T, EMP_NAME_STATUS, EMP_LOCATION_S, EMP_LOCATION_T, EMP_LOCATION_STATUS 

1,1,Matched,Sam J, Sam J, Matched, Houston, Houston£[email protected]£, Not Matched 
4,4,Matched,Saggie D, Saggie D, Matched, New York, New York^^^, Not Matched

I have found file comparison scripts, but nothing that produces this kind of output.

Source

2017-06-27 Subodh Deshpande

What have you tried? Read about the 'csv' module in Python. – void

Show us what you have tried so far, and please read this: [How do I ask a good question?](https://stackoverflow.com/help/how-to-ask) – Clonkex
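For reference, the 'csv' module suggested in the comment above could be used roughly as follows. This is only a minimal sketch: the helper names `read_rows` and `compare` are made up for illustration, the sample rows are copied from the question as they appear there, and it assumes EMP_ID is unique in both files.

```python
import csv
import io

# Sample data copied from the question; real code would open the two files.
SOURCE = """EMP_ID, EMP_NAME, EMP_LOCATION
1,Sam J, Houston
2,Man T, Houston
3,Sub D, Chicago
4,Saggie D, New York
"""
TARGET = """EMP_ID, EMP_NAME, EMP_LOCATION
1,Sam J, Houston£[email protected]£
2,Man T, Houston
3,Sub D, Chicago
4,Saggie D, New York^^^
"""


def read_rows(text):
    """Hypothetical helper: parse CSV text into {EMP_ID: {column: value}}."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    header = [name.strip() for name in next(reader)]
    rows = {}
    for record in reader:
        if not record:  # skip blank lines
            continue
        values = [value.strip() for value in record]
        rows[values[0]] = dict(zip(header, values))
    return header, rows


def compare(source_text, target_text):
    """Hypothetical helper: yield source value, target value, status per column."""
    header, source = read_rows(source_text)
    _, target = read_rows(target_text)
    for key, src in source.items():
        tgt = target.get(key)
        if tgt is None:
            continue  # IDs missing from the target are ignored in this sketch
        out = []
        for column in header:
            status = "Matched" if src[column] == tgt[column] else "Not Matched"
            out.extend([src[column], tgt[column], status])
        yield out


for row in compare(SOURCE, TARGET):
    print(",".join(row))
```

Matching on EMP_ID rather than on line position means the two files do not have to be sorted the same way.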

-1

It may sound a bit odd, but you can use pandas.

import pandas as pd

df1 = pd.read_csv('../data/example1.csv')
df2 = pd.read_csv('../data/example2.csv')

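Then join the two dataframes on the ID column. After that, you can create a status column such as EMP_NAME_STATUS with

merged = df1.merge(df2, on="EMP_ID", suffixes=("_S", "_T"))
merged["EMP_NAME_STATUS"] = merged["EMP_NAME_S"] == merged["EMP_NAME_T"]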
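A fuller sketch of this join-and-compare idea might look like the following. It is not the answerer's code: the `load` helper is hypothetical, the data is inlined from the question instead of being read from files, and `skipinitialspace=True` plus the extra stripping are assumptions made to cope with the spaces after the commas in the sample files.

```python
import io

import pandas as pd

# Sample data copied from the question; real code would pass file paths.
source_csv = io.StringIO(
    "EMP_ID, EMP_NAME, EMP_LOCATION\n"
    "1,Sam J, Houston\n"
    "2,Man T, Houston\n"
    "3,Sub D, Chicago\n"
    "4,Saggie D, New York\n"
)
target_csv = io.StringIO(
    "EMP_ID, EMP_NAME, EMP_LOCATION\n"
    "1,Sam J, Houston£[email protected]£\n"
    "2,Man T, Houston\n"
    "3,Sub D, Chicago\n"
    "4,Saggie D, New York^^^\n"
)


def load(buffer):
    """Hypothetical helper: read everything as text and trim stray spaces."""
    df = pd.read_csv(buffer, skipinitialspace=True, dtype=str)
    df.columns = df.columns.str.strip()
    return df.apply(lambda col: col.str.strip())


columns = ["EMP_ID", "EMP_NAME", "EMP_LOCATION"]
df1 = load(source_csv)
df2 = load(target_csv)

# Suffix every column first so that both EMP_ID_S and EMP_ID_T survive the join.
merged = df1.add_suffix("_S").merge(
    df2.add_suffix("_T"), left_on="EMP_ID_S", right_on="EMP_ID_T"
)

# One status column per compared field.
for name in columns:
    same = merged[name + "_S"] == merged[name + "_T"]
    merged[name + "_STATUS"] = same.map({True: "Matched", False: "Not Matched"})

# Arrange the columns as source, target, status for each field.
ordered = [name + suffix for name in columns for suffix in ("_S", "_T", "_STATUS")]
result = merged[ordered]
print(result.to_csv(index=False))
```

Because this is an inner join, IDs that exist in only one file are dropped; passing `how="outer"` to `merge` would keep them, with missing values on the other side.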

Source

2017-06-27 04:42:22 Cenk

Here is a complete solution; I will explain the concepts afterwards.

import pandas as pd

# Column layout of the comparison report
index_list = ['EMP_ID_S', 'EMP_ID_T', 'EMP_ID_STATUS',
              'EMP_NAME_S', 'EMP_NAME_T', 'EMP_NAME_STATUS',
              'EMP_LOCATION_S', 'EMP_LOCATION_T',
              'EMP_LOCATION_STATUS']

common_list = ['EMP_ID', 'EMP_NAME', 'EMP_LOCATION']
# (report column, input column) pairs for the source and target files
update1_list = list(zip(['EMP_ID_S', 'EMP_NAME_S', 'EMP_LOCATION_S'], common_list))
update2_list = list(zip(['EMP_ID_T', 'EMP_NAME_T', 'EMP_LOCATION_T'], common_list))

df1 = pd.read_csv("file1.csv")
df2 = pd.read_csv("file2.csv")

# Empty frame with one row per source record and the report columns
df3 = pd.DataFrame(list(range(len(df1))), columns=['EMP_ID'])
df3 = df3.reindex(columns=index_list)

# Copy the source and target values side by side
for item in update1_list:
    df3[item[0]] = df1[item[1]]
for item in update2_list:
    df3[item[0]] = df2[item[1]]

# Default every status to 'Not Matched' ...
df3[['EMP_ID_STATUS', 'EMP_NAME_STATUS', 'EMP_LOCATION_STATUS']] = 'Not Matched'

# ... then mark the rows whose values agree (rows are compared by position)
df1.loc[(df1['EMP_ID'] == df2['EMP_ID']), 'EMP_ID_STATUS'] = 'Matched'
df1.loc[(df1['EMP_NAME'] == df2['EMP_NAME']), 'EMP_NAME_STATUS'] = 'Matched'
df1.loc[(df1['EMP_LOCATION'] == df2['EMP_LOCATION']), 'EMP_LOCATION_STATUS'] = 'Matched'

# Copy the 'Matched' flags into the report and show it
df3.update(df1)
print(df3)

OUTPUT:

  EMP_ID_S EMP_ID_T EMP_ID_STATUS EMP_NAME_S EMP_NAME_T EMP_NAME_STATUS \
0        1        1       Matched      Sam J      Sam J         Matched
1        2        2       Matched      Man T      Man T         Matched
2        3        3       Matched      Sub D      Sub D         Matched
3        4        4       Matched   Saggie D   Saggie D         Matched

  EMP_LOCATION_S             EMP_LOCATION_T EMP_LOCATION_STATUS
0        Houston Houston£[email protected]£         Not Matched
1        Houston                    Houston             Matched
2        Chicago                    Chicago             Matched
3       New York                New York^^^         Not Matched

So first I create an empty dataframe, df3, whose columns come from index_list, using df3=df3.reindex(columns=index_list).

Then I just fill the columns of df3 from df1 and df2:

for item in update1_list:
    df3[item[0]] = df1[item[1]]
for item in update2_list:
    df3[item[0]] = df2[item[1]]

Note: update1_list is [('EMP_ID_S', 'EMP_ID'), ('EMP_NAME_S', 'EMP_NAME'), ('EMP_LOCATION_S', 'EMP_LOCATION')], the output of list(zip(['EMP_ID_S','EMP_NAME_S','EMP_LOCATION_S'], common_list)).

So for each item, df3[item[0]], e.g. df3['EMP_ID_S'], is set to df1[item[1]], e.g. df1['EMP_ID']. That way every value gets updated.
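To make the pairing concrete, here is a small stand-alone sketch of what zip produces. The loop variable names `target_column` and `source_column` and the `generated` list are only illustrative and do not appear in the answer's code.

```python
common_list = ['EMP_ID', 'EMP_NAME', 'EMP_LOCATION']

# Pair each report column with the input column it is filled from.
update1_list = list(zip(['EMP_ID_S', 'EMP_NAME_S', 'EMP_LOCATION_S'], common_list))

for target_column, source_column in update1_list:
    # item[0] is the report column, item[1] the input column
    print(target_column, "<-", source_column)

# The same pairs could also be derived from common_list alone.
generated = [(name + '_S', name) for name in common_list]
```

Deriving the pairs from common_list keeps the two lists from drifting apart if a column is added later.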

This is the part you are after:

df1.loc[(df1['EMP_ID'] == df2['EMP_ID']),'EMP_ID_STATUS'] = 'Matched'

df1.loc sets EMP_ID_STATUS to 'Matched' only in the rows where the condition df1['EMP_ID'] == df2['EMP_ID'] holds. The same goes for the other columns, and you have what you wanted.
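The sample result in the question lists only employees 1 and 4, i.e. the rows with at least one mismatch. If that is what is wanted, one possible extra step is to filter on the status columns. The sketch below rebuilds a cut-down df3 by hand so that it runs on its own; the names `status_columns`, `has_mismatch` and `mismatches` are assumptions for illustration.

```python
import pandas as pd

# Cut-down stand-in for df3, using the statuses from the output shown above.
df3 = pd.DataFrame({
    "EMP_ID_S": [1, 2, 3, 4],
    "EMP_ID_STATUS": ["Matched", "Matched", "Matched", "Matched"],
    "EMP_NAME_STATUS": ["Matched", "Matched", "Matched", "Matched"],
    "EMP_LOCATION_STATUS": ["Not Matched", "Matched", "Matched", "Not Matched"],
})

# Pick every *_STATUS column, then keep rows where any of them is 'Not Matched'.
status_columns = [name for name in df3.columns if name.endswith("_STATUS")]
has_mismatch = df3[status_columns].eq("Not Matched").any(axis=1)
mismatches = df3[has_mismatch]

print(mismatches)
```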

Source

2017-06-27 07:38:07 void

Also, if this helped, mark it as the accepted answer. You will find a *tick* below the downvote button. Click it. – void
