2017-08-30 52 views
2

Find duplicate rows, keep the max date

I have a CSV file like this:

Date of event  Name  Date of birth 
06.01.1986   John Smit 23.08.1996 
18.12.1996   Barbara D 01.08.1965 
12.12.2001   Barbara D 01.08.1965 
17.10.1994   John Snow 20.07.1965 

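I need to find the rows that are unique by "Name" and "Date of birth" (possibly together with some other columns), keeping the row with the MAX "Date of event" for each.

So the resulting CSV file should look like this: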

Date of event  Name  Date of birth 
06.01.1986   John Smit 23.08.1996 
12.12.2001   Barbara D 01.08.1965 
17.10.1994   John Snow 20.07.1965 

How can I do this? I don't have any ideas.

+0

'Find unique rows' or 'find a duplicated row'?

+0

Find the unique rows. I also need to combine this result with the source columns and write it to a CSV.

+0

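What does "combine with the source" mean? The only source is the source itself; if you combine the result with the non-unique rows, the result is polluted.

Answers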

0

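Formatting

Because the column names contain spaces, it is best to separate the columns with commas.

Algorithm

You can do this with the pandas library: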

import tempfile 
import pandas 

# create a temporary csv file with your data (comma delimited) 
temp_file_name = None 
with tempfile.NamedTemporaryFile('w', delete=False) as f: 
    f.write("""Date of event,Name,Date of birth
06.01.1986,John Smit,23.08.1996
18.12.1996,Barbara D,01.08.1965
12.12.2001,Barbara D,01.08.1965
17.10.1994,John Snow,20.07.1965""")
    temp_file_name = f.name 

# read the csv data using the pandas library, specify columns with dates 
data_frame = pandas.read_csv(
    temp_file_name, 
    parse_dates=[0,2], 
    dayfirst=True, 
    delimiter=',' 
) 

# use groupby and max to do the magic 
unique_rows = data_frame.groupby(['Name','Date of birth']).max() 

# write the results 
result_csv_file_name = None 
with tempfile.NamedTemporaryFile('w', delete=False) as f: 
    result_csv_file_name = f.name 
    unique_rows.to_csv(f) 

# read and show the results 
with open(result_csv_file_name, 'r') as f: 
    print(f.read()) 

This results in:

Name,Date of birth,Date of event 
Barbara D,1965-08-01,2001-12-12 
John Smit,1996-08-23,1986-01-06 
John Snow,1965-07-20,1994-10-17 
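
If the file has further columns that must travel with the winning row, `groupby(...).max()` is risky because it takes the maximum of every column independently. One possible alternative, sketched here under assumptions, is to ask `idxmax` for the index of the latest event in each group and then select those whole rows. The `City` column and its values are invented purely for illustration and are not part of the original data.

```python
import io
import pandas as pd

# Hypothetical sample: the question's data plus an invented "City" column
csv_text = """Date of event,Name,Date of birth,City
06.01.1986,John Smit,23.08.1996,Oslo
18.12.1996,Barbara D,01.08.1965,Riga
12.12.2001,Barbara D,01.08.1965,Bonn
17.10.1994,John Snow,20.07.1965,Kyiv
"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=[0, 2], dayfirst=True)

# For each (Name, Date of birth) group, get the index label of the row
# holding the latest "Date of event"
latest_idx = df.groupby(["Name", "Date of birth"])["Date of event"].idxmax()

# Select the whole rows so every source column survives, in source order
latest_rows = df.loc[latest_idx].sort_index()

# Serialize back to CSV text, keeping the original dd.mm.yyyy date format
csv_out = latest_rows.to_csv(index=False, date_format="%d.%m.%Y")
print(csv_out)
```

Passing a file path as the first argument to `to_csv` would write the result to disk instead of returning a string.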
+0

But what should I do if I want to write this result out? I need the CSV grouped by max date, together with all the columns of the source CSV.

+0

@AlexandrLebedev I updated my answer so that it also writes the CSV. You should really just use Google to find the documentation. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html

0
import pandas as pd

# read the CSV with pandas; the dates are day-first (dd.mm.yyyy)
df = pd.read_csv('pathToCsv.csv', header=0, parse_dates=[0, 2], dayfirst=True)

# rename the columns to programming-friendly names, i.e. no whitespace
df.columns = ['dateOfEvent', 'name', 'DOB']  # and probably some other columns ..

# keep only the max dateOfEvent per (name, DOB) group
yourwish = df.groupby(['name', 'DOB'])['dateOfEvent'].max()
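
The groupby above returns only the maximum date per group, as a Series indexed by the group keys. If whole rows are wanted instead, one option (a sketch, not the answerer's method) is to sort by the event date and drop duplicates on the key columns, keeping the last row of each group. The inline sample stands in for the file and reuses the renamed columns from the snippet above.

```python
import io
import pandas as pd

# Inline stand-in for the CSV file, using the question's sample data
csv_text = """Date of event,Name,Date of birth
06.01.1986,John Smit,23.08.1996
18.12.1996,Barbara D,01.08.1965
12.12.2001,Barbara D,01.08.1965
17.10.1994,John Snow,20.07.1965
"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=[0, 2], dayfirst=True)
df.columns = ["dateOfEvent", "name", "DOB"]

# Sort ascending by event date, then keep the last (latest) row of each
# (name, DOB) pair; sort_index restores the original row order
deduped = (
    df.sort_values("dateOfEvent")
      .drop_duplicates(subset=["name", "DOB"], keep="last")
      .sort_index()
)
print(deduped)
```

Because `drop_duplicates` keeps complete rows, any additional columns in the frame are carried along unchanged, and `deduped.to_csv(...)` with `index=False` writes them out without the row index.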
+0

Thank you very much, this helped me find the rows, but I also need the result together with the columns of the source CSV.

+0

What does that mean? – yukclam9