2013-04-01 45 views
13

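Removing duplicate rows from a CSV file using a Python script

Goal

I have downloaded a CSV file from Hotmail, but it has a lot of duplicates in it. These duplicates are complete copies, and I don't know why my phone created them.

I want to get rid of the duplicates.

Approach

Write a Python script to remove the duplicates.

Technical specification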

 

Windows XP SP 3 
Python 2.7 
CSV file with 400 contacts 

Answers

35

UPDATE 2016

If you are happy to use the helpful external library more_itertools:

from more_itertools import unique_everseen 
with open('1.csv','r') as f, open('2.csv','w') as out_file: 
    out_file.writelines(unique_everseen(f)) 

A more efficient version of @IcyFlame's solution:

with open('1.csv','r') as in_file, open('2.csv','w') as out_file:
    seen = set() # set for fast O(1) amortized lookup
    for line in in_file:
        if line in seen: continue # skip duplicate

        seen.add(line)
        out_file.write(line)

To edit the file in place, you could use this:

import fileinput 
seen = set() # set for fast O(1) amortized lookup 
for line in fileinput.FileInput('1.csv', inplace=1): 
    if line in seen: continue # skip duplicate 

    seen.add(line) 
    print line, # standard output is now redirected to the file 
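
The snippet above uses the Python 2 print statement. For readers on Python 3, the same in-place idea might look roughly like the sketch below. The function name dedupe_in_place is hypothetical and only there for illustration; fileinput.input with inplace=True is the standard-library call that redirects standard output into the file.

```python
import fileinput


def dedupe_in_place(path):
    """Remove duplicate lines from `path`, keeping the first occurrence.

    Hypothetical helper name; assumes Python 3.
    """
    seen = set()  # lines already written back to the file
    with fileinput.input(files=[path], inplace=True) as lines:
        for line in lines:
            if line in seen:
                continue  # skip duplicate
            seen.add(line)
            # While inplace=True is active, print() writes into the file.
            print(line, end="")
```

Calling dedupe_in_place('1.csv') would rewrite 1.csv itself, so it is worth keeping a copy of the original until the result has been checked.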
+1

Thanks for this in 2016 – Anekdotin

+0

@Eddwinn You're welcome – jamylak

5

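You can use the following script:

Prerequisites:

  1. 1.csv is the file that contains the duplicates.
  2. 2.csv is the output file, which will be free of duplicates once this script has run.

Code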

 


inFile = open('1.csv','r')
outFile = open('2.csv','w')
listLines = [] # lines already written to the output file

for line in inFile:
    if line in listLines:
        continue # skip duplicate
    else:
        outFile.write(line)
        listLines.append(line)

outFile.close()
inFile.close()

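Algorithm explanation

Here, what I am doing is:

  1. Open the file in read mode. This is the file that has the duplicates.
  2. Then, in a loop that runs until the end of the file, check whether the line has already been encountered.
  3. If it has been encountered, do not write it to the output file.
  4. If not, write it to the output file and add it to the list of records that have already been encountered.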
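
The steps above can be sketched as a small generator that works on any iterable of lines, not only on an open file. The name dedupe_lines and the sample rows are made up for illustration. This sketch keeps the seen records in a set rather than a list, since a set lookup is O(1) amortized while a list lookup scans every stored line.

```python
def dedupe_lines(lines):
    """Yield each line the first time it appears, preserving order.

    Hypothetical helper; `lines` can be an open file or a list of strings.
    """
    seen = set()  # records encountered so far
    for line in lines:
        if line not in seen:  # steps 2 and 3: skip lines seen before
            seen.add(line)    # step 4: remember the line...
            yield line        # ...and pass it on to be written


# Made-up sample data standing in for the contents of 1.csv
sample = [
    "name,phone\n",
    "Ann,555-0100\n",
    "Bob,555-0101\n",
    "Ann,555-0100\n",
]
result = list(dedupe_lines(sample))
# result keeps the header, the first Ann row, and the Bob row
```

With real files, the same generator could be passed straight to out_file.writelines(dedupe_lines(in_file)).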
1

A more efficient version of @jamylak's solution (one fewer instruction):

with open('1.csv','r') as in_file, open('2.csv','w') as out_file:
    seen = set() # set for fast O(1) amortized lookup
    for line in in_file:
        if line not in seen:
            seen.add(line)
            out_file.write(line)

To edit the same file, you could use this:

import fileinput 
seen = set() # set for fast O(1) amortized lookup 
for line in fileinput.FileInput('1.csv', inplace=1): 
    if line not in seen:
        seen.add(line)
        print line, # standard output is now redirected to the file
5

You can achieve deduplication efficiently using Pandas:

import pandas as pd 
file_name = "my_file_with_dupes.csv" 
file_name_output = "my_file_without_dupes.csv" 

# Use sep="\t" instead if the file is tab-separated
df = pd.read_csv(file_name, sep=",")

# Notes: 
# - the `subset=None` means that every column is used 
# to determine if two rows are different; to change that specify 
# the columns as an array 
# - the `inplace=True` means that the data structure is changed and 
# the duplicate rows are gone 
df.drop_duplicates(subset=None, inplace=True) 

# Write the results to a different file 
# (index=False keeps pandas from adding its row index as an extra column)
df.to_csv(file_name_output, index=False)
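
As a rough sketch of how the subset note might be applied, the helper below wraps the same pandas calls in a function. The name dedupe_csv and its parameters are hypothetical. It assumes a comma-separated file, reads every column as text so values such as phone numbers are not reinterpreted as numbers, and takes an encoding argument for files that are not UTF-8.

```python
import pandas as pd


def dedupe_csv(src, dst, subset=None, encoding="utf-8"):
    """Write `src` to `dst` without duplicate rows; return rows dropped.

    Hypothetical helper. `subset` is an optional list of column names
    used to decide whether two rows count as duplicates.
    """
    # dtype=str keeps values like "0123" as text instead of numbers.
    df = pd.read_csv(src, dtype=str, encoding=encoding)
    # keep="first" retains the first occurrence of each duplicate group.
    deduped = df.drop_duplicates(subset=subset, keep="first")
    deduped.to_csv(dst, index=False, encoding=encoding)
    return len(df) - len(deduped)
```

For example, dedupe_csv("my_file_with_dupes.csv", "my_file_without_dupes.csv", subset=["email"]) would treat rows with the same value in an assumed email column as duplicates, even if other columns differ.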
+0

I get 'UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 28: invalid start byte' when trying to open my file – ykombinator

+1

@ykombinator, you can pass the "encoding" argument to the "read_csv" function - see https://docs.python.org/3/library/codecs.html#standard-encodings –