Removing duplicate rows from a CSV file using a Python script

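I have downloaded a CSV file from Hotmail, but it has a lot of duplicates in it. These duplicates are complete copies, and I don't know why my phone created them.

I want to get rid of the duplicates.

Approach

Write a Python script to remove the duplicates.

Technical specification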

 

Windows XP SP 3 
Python 2.7 
CSV file with 400 contacts

Source

2013-04-01 IcyFlame

UPDATE 2016

If you are happy to use the helpful more_itertools external library:

from more_itertools import unique_everseen 
with open('1.csv','r') as f, open('2.csv','w') as out_file: 
    out_file.writelines(unique_everseen(f))

A more efficient version of @IcyFlame's solution:

with open('1.csv','r') as in_file, open('2.csv','w') as out_file:
    seen = set() # set for fast O(1) amortized lookup
    for line in in_file:
        if line in seen: continue # skip duplicate

        seen.add(line)
        out_file.write(line)

To edit the same file in place, you could use this:

import fileinput 
seen = set() # set for fast O(1) amortized lookup 
for line in fileinput.FileInput('1.csv', inplace=1): 
    if line in seen: continue # skip duplicate 

    seen.add(line) 
    print line, # standard output is now redirected to the file
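
The snippet above uses the Python 2 `print` statement. For readers on Python 3, a rough equivalent might look like the sketch below. The helper name `dedupe_in_place`, the `.bak` backup extension, and the demo file contents are illustrative choices, not part of the original answer.

```python
import fileinput
import os
import tempfile


def dedupe_in_place(path):
    """Rewrite `path` so that each distinct line appears only once."""
    seen = set()
    # With inplace=True, fileinput redirects standard output into the file.
    # backup=".bak" keeps the original next to it as <path>.bak.
    with fileinput.input(files=[path], inplace=True, backup=".bak") as f:
        for line in f:
            if line not in seen:
                seen.add(line)
                print(line, end="")  # the line already carries its own newline


# Demonstration on a throwaway file (hypothetical contents).
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "1.csv")
with open(demo_path, "w") as fh:
    fh.write("name,phone\nAnn,111\nBob,222\nAnn,111\n")

dedupe_in_place(demo_path)

with open(demo_path) as fh:
    result = fh.read()
```

Keeping a backup is optional, but it seems a sensible precaution when the input file is being overwritten.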

Source

2013-04-01 10:20:59 jamylak

Thank you, in 2016 – Anekdotin

@Eddwinn You're welcome – jamylak

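You can use the following script:

Prerequisites:

1.csv is the file that contains the duplicates.
2.csv is the output file, which will be free of duplicates once this script has run.

Code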

 


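# 1.csv is read; the de-duplicated lines are written to 2.csv
inFile = open('1.csv','r')
outFile = open('2.csv','w')

listLines = []  # lines that have already been written

for line in inFile:
    if line in listLines:
        continue  # duplicate: skip it
    else:
        outFile.write(line)
        listLines.append(line)

outFile.close()
inFile.close()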

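Explanation of the algorithm

Here is what I am doing:

Open the file in read mode. This is the file that has the duplicates.
Then, in a loop that runs until the end of the file, check whether each line has already been encountered.
If it has, do not write it to the output file.
If it has not, write it to the output file and add it to the list of lines already encountered.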
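
The steps above can be sketched as a small function that works on any iterable of lines. One detail here is an addition that is not in the original script, and it is an assumption about what should count as a duplicate: the comparison ignores the trailing line ending, so a final line without a newline still matches an earlier identical line. The name `remove_duplicate_lines` and the sample data are illustrative, and the sketch targets Python 3.

```python
def remove_duplicate_lines(lines):
    """Return the lines with later repeats dropped, keeping first-seen order."""
    seen_lines = []  # records already encountered (a list, as in the steps)
    kept = []
    for line in lines:
        # Assumption: ignore the line ending when comparing records.
        record = line.rstrip("\r\n")
        if record in seen_lines:
            continue  # already encountered: do not write it
        seen_lines.append(record)
        kept.append(record + "\n")  # emit with a uniform line ending
    return kept


sample = ["Ann,111\n", "Bob,222\n", "Ann,111"]  # last line has no newline
print(remove_duplicate_lines(sample))
```

A list keeps the sketch close to the description. With many more rows than the 400 contacts mentioned in the question, a set would make each membership check faster.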

Source

2013-04-01 10:16:20 IcyFlame

A more efficient version of @jamylak's solution (one fewer instruction):

with open('1.csv','r') as in_file, open('2.csv','w') as out_file:
    seen = set() # set for fast O(1) amortized lookup
    for line in in_file:
        if line not in seen:
            seen.add(line)
            out_file.write(line)

To edit the same file, you can use this:

import fileinput
seen = set() # set for fast O(1) amortized lookup
for line in fileinput.FileInput('1.csv', inplace=1):
    if line not in seen:
        seen.add(line)
        print line, # standard output is now redirected to the file

Source

2016-08-04 18:17:30

You can do the deduplication efficiently with Pandas:

import pandas as pd 
file_name = "my_file_with_dupes.csv" 
file_name_output = "my_file_without_dupes.csv" 

df = pd.read_csv(file_name, sep=",")  # use sep="\t" instead for tab-separated files

# Notes: 
# - the `subset=None` means that every column is used 
# to determine if two rows are different; to change that specify 
# the columns as an array 
# - the `inplace=True` means that the data structure is changed and 
# the duplicate rows are gone 
df.drop_duplicates(subset=None, inplace=True) 

# Write the results to a different file 
# index=False avoids writing the row index as an extra column
df.to_csv(file_name_output, index=False)
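
For comparison, the `subset` idea (deciding duplicates by only some of the columns) can also be sketched with the standard library `csv` module, for cases where pandas is not installed. This is not the pandas implementation, only a rough analogue. The function name `drop_duplicate_rows`, the column names, and the sample rows are invented for illustration.

```python
import csv
import io


def drop_duplicate_rows(rows, subset=None):
    """Keep the first row for each distinct key.

    rows   -- iterable of dicts, e.g. from csv.DictReader
    subset -- column names used for comparison; None means all columns
    """
    seen = set()
    unique = []
    for row in rows:
        columns = subset if subset is not None else list(row)
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical data: "Ann" appears three times, twice with the same phone.
raw = "name,phone\nAnn,111\nAnn,999\nBob,222\nAnn,111\n"
rows = list(csv.DictReader(io.StringIO(raw)))

by_all_columns = drop_duplicate_rows(rows)
by_name_only = drop_duplicate_rows(rows, subset=["name"])
print(len(by_all_columns), len(by_name_only))
```

Comparing on all columns removes only exact copies, which is what the question asks for. Comparing on a subset is more aggressive and would drop the second, different phone number for "Ann".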

Source

2017-02-02 19:27:14

I get "UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 28: invalid start byte" when trying to open my file – ykombinator

@ykombinator, you can pass the "encoding" argument to the "read_csv" function - see https://docs.python.org/3/library/codecs.html#standard-encodings –
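
To illustrate the error in the comment above: byte 0x96 is not a valid start byte in UTF-8, but it is an en dash in Windows-1252 (`cp1252`), which is a common encoding for files exported on Windows. The sketch below reproduces the failure and the fix on a made-up byte string. That the commenter's file was actually `cp1252` is an assumption; if it is, passing `encoding="cp1252"` to `read_csv` would be the corresponding change.

```python
# Hypothetical bytes; 0x96 is an en dash in cp1252.
raw = b"Ann \x96 mobile,111\n"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False  # same class of error as in the comment above

# Assumption: the data was saved as Windows-1252.
text = raw.decode("cp1252")
print(utf8_ok, repr(text))
```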
