You can hash the values of the "key" columns and keep a set of the hash codes you have already encountered:
import hashlib
import pandas as pd

hash_set = set()  # digests of the key columns for every row seen so far

def is_duplicate(row):
    m = hashlib.md5()
    for c in ["column1", "column2", "column3"]:
        # md5 needs bytes; the separator keeps ("ab", "c") distinct from ("a", "bc")
        m.update(str(row[c]).encode("utf-8") + b"\x1f")
    hash_code = m.digest()
    if hash_code in hash_set:
        return 1
    hash_set.add(hash_code)
    return 0

for df_path in [df1_path, df2_path, df3_path]:  # iterate over the dataframes one by one
    df = pd.read_csv(df_path)  # load the dataframe
    df["duplicate"] = df.apply(is_duplicate, axis=1)
    unique_df = df[df["duplicate"] == 0]  # a "globally" unique dataframe
    unique_df = unique_df.drop(columns="duplicate")  # you don't need this column anymore
    # YOUR CODE...
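If the row-by-row `apply` turns out to be slow on large files, one possible variation is to let pandas compute the row hashes in bulk with `pd.util.hash_pandas_object`. This is only a sketch of the same idea, not a drop-in replacement: `split_new_rows`, `KEY_COLUMNS`, and the inline CSV strings are made up for illustration. The hashes are 64-bit, so a collision is unlikely but not impossible, and they depend on dtype, so the key columns should be read with the same dtypes in every file.

```python
import io

import numpy as np
import pandas as pd

KEY_COLUMNS = ["column1", "column2", "column3"]  # assumed key columns


def split_new_rows(df, seen_hashes, key_columns=KEY_COLUMNS):
    """Return (rows of df not seen before, updated array of seen hashes)."""
    # One uint64 hash per row, computed from the key columns only.
    row_hashes = pd.util.hash_pandas_object(df[key_columns], index=False)
    # Keep a row if its hash is absent from earlier frames and it is the
    # first occurrence of that hash within this frame.
    is_new = ~row_hashes.isin(seen_hashes) & ~row_hashes.duplicated()
    updated = np.concatenate([seen_hashes, row_hashes[is_new].to_numpy()])
    return df[is_new.to_numpy()], updated


# Made-up sample data standing in for the CSV files on disk.
csv_a = "column1,column2,column3,value\na,b,c,1\nd,e,f,2\n"
csv_b = "column1,column2,column3,value\na,b,c,3\ng,h,i,4\ng,h,i,5\n"

seen = np.array([], dtype="uint64")  # hashes collected across all frames
unique_frames = []
for source in (csv_a, csv_b):
    df = pd.read_csv(io.StringIO(source))
    unique_df, seen = split_new_rows(df, seen)
    unique_frames.append(unique_df)
    print(unique_df)
```

Only the array of hashes is carried from one file to the next, so memory use grows with the number of distinct keys rather than with the total number of rows.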
How about hashing the values of the rows and looking for duplicate hash values? – AndreyF