Find duplicate rows, max date

I have a CSV file like this:

Date of event  Name  Date of birth 
06.01.1986   John Smit 23.08.1996 
18.12.1996   Barbara D 01.08.1965 
12.12.2001   Barbara D 01.08.1965 
17.10.1994   John Snow 20.07.1965

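I have to find the rows that are unique by "Name" and "Date of birth" (possibly together with some other columns), keeping the one with the MAX date of event.

So the resulting csv file should look like this: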

Date of event  Name  Date of birth 
06.01.1986   John Smit 23.08.1996 
12.12.2001   Barbara D 01.08.1965 
17.10.1994   John Snow 20.07.1965

How can I do this? I don't have any ideas..

Source

2017-08-30 Alexandr Lebedev

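'Find unique rows' or 'find a duplicate row'? –

Find the unique rows. I also need to combine this solution with the source columns... and write it to csv –

What does combining with the source mean? The unique rows come from the source; if they are combined with the non-unique ones, the result is polluted. –

Formatting

Since the column names contain spaces, it is better to separate the columns with commas.

Algorithm

You can do this with the pandas library: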
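As a rough illustration of that conversion, here is a minimal sketch that rewrites the whitespace-aligned rows from the question as comma-delimited text. It assumes every data row is an event date, a free-text name, and a birth date, with both dates in dd.mm.yyyy form; the `ROW` pattern and the `to_comma_csv` helper are hypothetical names, not part of any library.

```python
import csv
import io
import re

# Assumed layout: event date, free-text name, birth date (dates as dd.mm.yyyy).
ROW = re.compile(r"^\s*(\d{2}\.\d{2}\.\d{4})\s+(.+?)\s+(\d{2}\.\d{2}\.\d{4})\s*$")


def to_comma_csv(lines):
    """Convert whitespace-aligned rows to comma-delimited CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Date of event", "Name", "Date of birth"])
    for line in lines:
        match = ROW.match(line)
        if match:  # the original header and blank lines do not match
            writer.writerow(match.groups())
    return out.getvalue()


raw = """Date of event  Name  Date of birth
06.01.1986   John Smit 23.08.1996
18.12.1996   Barbara D 01.08.1965
"""
converted = to_comma_csv(raw.splitlines())
print(converted)
```

Anchoring on the two dates keeps names that contain spaces in one field; rows that do not fit the assumed layout are silently skipped, which may or may not be what you want.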

import tempfile 
import pandas 

# create a temporary csv file with your data (comma delimited) 
temp_file_name = None 
with tempfile.NamedTemporaryFile('w', delete=False) as f: 
    f.write("""Date of event,Name,Date of birth
06.01.1986,John Smit,23.08.1996
18.12.1996,Barbara D,01.08.1965
12.12.2001,Barbara D,01.08.1965
17.10.1994,John Snow,20.07.1965""") 
    temp_file_name = f.name 

# read the csv data using the pandas library, specify columns with dates 
data_frame = pandas.read_csv(
    temp_file_name, 
    parse_dates=[0,2], 
    dayfirst=True, 
    delimiter=',' 
) 

# use groupby and max to do the magic 
unique_rows = data_frame.groupby(['Name','Date of birth']).max() 

# write the results 
result_csv_file_name = None 
with tempfile.NamedTemporaryFile('w', delete=False) as f: 
    result_csv_file_name = f.name 
    unique_rows.to_csv(f) 

# read and show the results 
with open(result_csv_file_name, 'r') as f: 
    print(f.read())

This results in:

Name,Date of birth,Date of event 
Barbara D,1965-08-01,2001-12-12 
John Smit,1996-08-23,1986-01-06 
John Snow,1965-07-20,1994-10-17
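One caveat when the file has more columns: `groupby(...).max()` takes the maximum of each remaining column independently, so the values in one output row may come from different source rows. A possible way to keep whole source rows is to look up the row label of the latest event per group with `idxmax` and select those rows. This is only a sketch; the `City` column and its values are made up to stand in for "some other columns".

```python
import io

import pandas as pd

# "City" is a hypothetical extra column standing in for other source columns.
csv_text = """Date of event,Name,Date of birth,City
06.01.1986,John Smit,23.08.1996,Oslo
18.12.1996,Barbara D,01.08.1965,Riga
12.12.2001,Barbara D,01.08.1965,Kyiv
17.10.1994,John Snow,20.07.1965,Bonn
"""
df = pd.read_csv(io.StringIO(csv_text))
# Parse explicitly so the comparison uses real dates, not strings.
df["Date of event"] = pd.to_datetime(df["Date of event"], format="%d.%m.%Y")

# Row label of the latest event within each (Name, Date of birth) group.
latest_idx = df.groupby(["Name", "Date of birth"])["Date of event"].idxmax()

# Select whole rows, in source order, so every column stays with its own date.
latest_rows = df.loc[latest_idx.sort_values()].reset_index(drop=True)
print(latest_rows.to_csv(index=False, date_format="%d.%m.%Y"))
```

Here the Barbara D row keeps the `City` value that belongs to her 2001 event, whereas a column-wise max would pair that date with the alphabetically largest city.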

Source

2017-08-30 04:31:07

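But what should I do if I want to write this result out? I need the csv grouped by max date, with all the columns of the source csv. –

@AlexandrLebedev I updated my answer so that it also writes out the csv. You should really just use Google to find some documentation. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html –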

import pandas as pd

# read the csv in with the pandas module; the dates are day-first (dd.mm.yyyy)
df = pd.read_csv('pathToCsv.csv', header=0, parse_dates=[0, 2], dayfirst=True)

# rename the columns to be more programming friendly, i.e. no whitespace
df.columns = ['dateOfEvent', 'name', 'DOB']  # and probably some other columns ..

# take the max (Date of event) per group (name, Date of birth)
yourwish = df.groupby(['name', 'DOB'])['dateOfEvent'].max()
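The grouped max above yields only the latest date per (name, DOB) pair. If the remaining source columns should come along, one alternative (a sketch, not necessarily what the answerer had in mind) is to sort by the event date and drop duplicates, keeping the last row of each pair. The `note` column is a hypothetical stand-in for other source columns.

```python
import pandas as pd

df = pd.DataFrame({
    "dateOfEvent": pd.to_datetime(
        ["06.01.1986", "18.12.1996", "12.12.2001", "17.10.1994"],
        format="%d.%m.%Y",
    ),
    "name": ["John Smit", "Barbara D", "Barbara D", "John Snow"],
    "DOB": ["23.08.1996", "01.08.1965", "01.08.1965", "20.07.1965"],
    "note": ["a", "b", "c", "d"],  # hypothetical extra source column
})

# Sort by event date, then keep the last (latest) row of each (name, DOB) pair.
latest = (
    df.sort_values("dateOfEvent")
      .drop_duplicates(subset=["name", "DOB"], keep="last")
      .sort_index()  # restore the original row order
)

# to_csv returns the text when no path is given; pass a file name to write it.
csv_out = latest.to_csv(index=False, date_format="%d.%m.%Y")
print(csv_out)
```

If two rows in a group share the same latest date, this keeps only one of them.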

Source

2017-08-30 03:25:56 yukclam9

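Thank you very much, it helped me find these rows, but I also need the result to include the columns of the source csv –

What does that mean? – yukclam9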
