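Removing duplicate rows from a CSV file
I have downloaded a CSV file from Hotmail, but it has a lot of duplicates in it. These duplicates are complete copies, and I don't know why my phone created them.
I want to get rid of the duplicates.
Approach
Write a Python script to remove the duplicates.
Technical specification
Windows XP SP 3, Python 2.7, CSV file with 400 contacts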
UPDATE 2016
If you are happy to use the helpful more_itertools external library:
from more_itertools import unique_everseen
with open('1.csv','r') as f, open('2.csv','w') as out_file:
    out_file.writelines(unique_everseen(f))
A more efficient version of @IcyFlame's solution:
with open('1.csv','r') as in_file, open('2.csv','w') as out_file:
    seen = set() # set for fast O(1) amortized lookup
    for line in in_file:
        if line in seen: continue # skip duplicate
        seen.add(line)
        out_file.write(line)
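Comparing raw lines treats two rows as different when they differ only in their line ending (for example, a last line with no trailing newline) or in quoting. A minimal sketch that compares parsed rows instead, assuming Python 3 and the standard csv module; the function name dedupe_csv_rows is hypothetical and not part of the answer above:

```python
import csv

def dedupe_csv_rows(in_path, out_path):
    """Copy in_path to out_path, keeping only the first occurrence of each row."""
    seen = set()
    # newline='' lets the csv module handle line endings itself (Python 3)
    with open(in_path, 'r', newline='') as in_file, \
         open(out_path, 'w', newline='') as out_file:
        reader = csv.reader(in_file)
        writer = csv.writer(out_file)
        for row in reader:
            key = tuple(row)  # lists are unhashable, tuples can be stored in a set
            if key not in seen:
                seen.add(key)
                writer.writerow(row)
```

Calling dedupe_csv_rows('1.csv', '2.csv') would play the same role as the loop above. Note that the rows are re-serialized by csv.writer, so quoting and line endings in the output may differ from the input.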
To edit the file in place, you could use this:
import fileinput
seen = set() # set for fast O(1) amortized lookup
for line in fileinput.FileInput('1.csv', inplace=1):
    if line in seen: continue # skip duplicate
    seen.add(line)
    print line, # standard output is now redirected to the file
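You can use the following script:
Prerequisites:
1.csv is the file that contains the duplicates.
2.csv is the output file, which will be free of duplicates once this script has run.
Code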
inFile = open('1.csv','r')
outFile = open('2.csv','w')
listLines = []
for line in inFile:
    if line in listLines:
        continue
    else:
        outFile.write(line)
        listLines.append(line)
outFile.close()
inFile.close()
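Algorithm explanation
Here is what I am doing: the script reads 1.csv line by line, skips any line already recorded in listLines, and otherwise writes the line to 2.csv and appends it to listLines.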
A more efficient version of @jamylak's solution (one fewer instruction):
with open('1.csv','r') as in_file, open('2.csv','w') as out_file:
    seen = set() # set for fast O(1) amortized lookup
    for line in in_file:
        if line not in seen:
            seen.add(line)
            out_file.write(line)
To edit the same file in place, you can use this:
import fileinput
seen = set() # set for fast O(1) amortized lookup
for line in fileinput.FileInput('1.csv', inplace=1):
    if line not in seen:
        seen.add(line)
        print line, # standard output is now redirected to the file
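The in-place snippet above uses the Python 2 print statement, matching the Python 2.7 setup in the question. For readers on Python 3, a rough equivalent might look like the sketch below; the wrapper function dedupe_in_place is a hypothetical name added here for illustration:

```python
import fileinput

def dedupe_in_place(path):
    """Rewrite the file at path, keeping only the first occurrence of each line."""
    seen = set()  # set for fast O(1) amortized lookup
    # inplace=True redirects standard output into the file being read
    with fileinput.input(path, inplace=True) as f:
        for line in f:
            if line not in seen:
                seen.add(line)
                # line already ends with a newline, so suppress print's own
                print(line, end='')
```

Because the original file is overwritten, it may be worth running this on a copy first, or passing a backup extension such as backup='.bak' to fileinput.input.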
Deduplication can be done efficiently using Pandas:
import pandas as pd
file_name = "my_file_with_dupes.csv"
file_name_output = "my_file_without_dupes.csv"
df = pd.read_csv(file_name, sep=",")  # use sep="\t" for a tab-separated file
# Notes:
# - the `subset=None` means that every column is used
# to determine if two rows are different; to change that specify
# the columns as an array
# - the `inplace=True` means that the data structure is changed and
# the duplicate rows are gone
df.drop_duplicates(subset=None, inplace=True)
# Write the results to a different file
df.to_csv(file_name_output, index=False)  # index=False keeps the row index out of the file
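The note about `subset` can be illustrated with a small in-memory example. This is only a sketch: the column names Name, Email and Phone and the sample rows are made up for illustration, not taken from the Hotmail export.

```python
import pandas as pd

# Hypothetical contact data: the first two rows share Name and Email
# but differ in Phone, so they are not complete copies of each other.
df = pd.DataFrame({
    "Name": ["Ann", "Ann", "Bob"],
    "Email": ["ann@example.com", "ann@example.com", "bob@example.com"],
    "Phone": ["111", "222", "333"],
})

# Only Name and Email decide whether two rows count as duplicates;
# keep="first" retains the first row of each duplicate group.
deduped = df.drop_duplicates(subset=["Name", "Email"], keep="first")
```

With `subset=None`, all three rows would be kept, since no two rows match in every column.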
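I get 'UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 28: invalid start byte' when trying to open my file – ykombinator
@ykombinator, you can pass the `encoding` parameter to the `read_csv` function - see https://docs.python.org/3/library/codecs.html#standard-encodings –
Thank you for this in 2016 – Anekdotin
@Eddwinn You're welcome – jamylak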
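Regarding the UnicodeDecodeError comment above: byte 0x96 is not a valid UTF-8 start byte, which suggests the file uses a legacy single-byte encoding. A minimal sketch of passing `encoding` to `read_csv`, assuming the file is Windows-1252 (cp1252), where 0x96 is an en dash; that encoding is a guess and should be replaced with the file's real one, and the function name dedupe_with_encoding is hypothetical:

```python
import pandas as pd

def dedupe_with_encoding(in_path, out_path, encoding="cp1252"):
    """Read a CSV with an explicit encoding, drop duplicate rows, write it back."""
    # encoding="cp1252" is an assumption; use whatever the file was saved with
    df = pd.read_csv(in_path, encoding=encoding)
    df = df.drop_duplicates()
    # Write with the same encoding and without the row index column
    df.to_csv(out_path, index=False, encoding=encoding)
    return df
```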