2012-10-12 154 views

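Delete combinations of duplicate words in a text file with Python

With the help of eumiro's answer to Delete duplicate rows in textfile - except it contains a "{" or "}", I was able to remove the duplicate lines from a large text file. That was a big step: the file shrank from 60MB to 3MB.

But now I want to remove repeated words like these: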

@INBOOK{Miller1992, 
    author = {Miller, Rowland S. und Mark R. Leary and Miller, Rowland S. und Mark 
    R. Leary and Miller, Rowland S. und Mark R. Leary and Miller, Rowland 
    S. und Mark R. Leary and Miller, Rowland S. und Mark R. Leary and 
    Miller, Rowland S. und Mark R. Leary and Miller, Rowland S. und Mark 
    Miller, Rowland S. und Mark R. Leary}, 
    year = {1992}, 
    editor = {Teun A. van Dijk and Teun A. van Dijk and Teun A. van Dijk and Teun 
    A. van Dijk and Teun A. van Dijk and Teun A. van Dijk and Teun A. 
    van Dijk and Teun A. van Dijk and Teun A. van Dijk and Teun A. van 
    Dijk and Teun A. van Dijk and Teun A. van Dijk and Teun A. van Dijk 
    and Teun A. van Dijk and Teun A. van Dijk and Teun A. van Dijk and 
    Teun A. van Dijk and Teun A. van Dijk and Teun A. van Dijk and Teun 
    and Teun A. van Dijk and Teun A. van Dijk and Teun A. van Dijk}, 
    title = {Handbook of discourse analysis (Bd. 3/4)}, 

The result should look like this:

@INBOOK{Miller1992, 
    author = {Miller, Rowland S. und Mark R. Leary}, 
    year = {1992}, 
    editor = {Teun A. van Dijk}, 
    title = {Handbook of discourse analysis (Bd. 3/4)}, 

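The text file has 70,000 lines, and the same author names can appear in several entries. So only the repetitions inside one pair of curly braces (which may span multiple lines) should be removed: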

author = {Miller, Rowland S. und Mark R. Leary and Miller, Rowland S. und Mark 
    R. Leary and Miller, Rowland S. und Mark R. Leary and Miller, Rowland 
    S. und Mark R. Leary and Miller, Rowland S. und Mark R. Leary and 
    Miller, Rowland S. und Mark R. Leary and Miller, Rowland S. und Mark 
    Miller, Rowland S. und Mark R. Leary}, 

I tried to modify my Python script, which removes duplicate lines, so that it removes the repeated words inside the curly braces, but I am stuck:

words_seen = set()  # holds words already seen
outfile = open("literatur_clean.txt", "w")
for line in open("literatur_dupl.txt", "r"):
    if '{' in line or '}' in line:
        # some code to check whether the words are duplicate
        pass
outfile.close()

Answers


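Judging from your current data set, this does not look like a problem of repeated words. Rather, an author or editor is sometimes repeated n times.

You could try splitting the string on " and ". Then you can check whether the resulting items are all the same (for example, by putting all the strings into a set or using them as dictionary keys). If the length of the set equals 1, you have removed all duplicates. If not, " and " may also be part of an author's or editor's name, and you have to merge those pieces again.

If that does not work (for example, because the data set is not as tidy as it appears), you can find the repetitions by looking for sub-matches after the start of the string: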
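The splitting idea could be sketched roughly as follows. This is only a minimal illustration, not a tested solution for the full file: the names `FIELD`, `unique_names` and `dedupe_fields` are made up for this example, and it assumes that only the `author` and `editor` fields are affected and that their values contain no nested braces.

```python
import re

# Hypothetical pattern: an author/editor field whose value has no nested braces.
FIELD = re.compile(r"(\b(?:author|editor)\s*=\s*\{)([^{}]*)(\})")

def unique_names(value):
    """Split a field value on ' and ' and drop repeated items, keeping order."""
    # Collapse line breaks and indentation so wrapped names compare equal.
    flat = " ".join(value.split())
    # dict.fromkeys keeps the first occurrence of each item, in order.
    return list(dict.fromkeys(flat.split(" and ")))

def dedupe_fields(text):
    """Rewrite every matched field with its de-duplicated value."""
    def repl(match):
        names = unique_names(match.group(2))
        return match.group(1) + " and ".join(names) + match.group(3)
    return FIELD.sub(repl, text)

sample = (
    "@INBOOK{Miller1992, \n"
    "    author = {Miller, Rowland S. und Mark R. Leary and Miller, Rowland S. und Mark \n"
    "    R. Leary and Miller, Rowland S. und Mark R. Leary}, \n"
    "    year = {1992}, \n"
)
print(dedupe_fields(sample))
```

If `unique_names` returns more than one item, either the entry really has several authors or the repetitions are not exact copies (for instance a truncated name), so such entries are worth checking by hand.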

Miller, Rowland S. und Mark R. Leary and Miller, Rowland S. und Mark R. Leary 
^          ^
1          2 

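Advance the second pointer through the text string. For each position, find the longest sub-match with the beginning of the string. Save these sub-matches.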
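One possible reading of this pointer approach, sketched under some assumptions: the helper names `prefix_match_lengths` and `collapse_repetition` are invented here, the field value has already been extracted from the braces, and the repeated copies are joined by " and ". The simple quadratic scan is meant for short field values, not for the whole file at once.

```python
def prefix_match_lengths(s):
    """For each position i > 0, the length of the longest match
    between s[i:] and the beginning of s (position 0 is left at 0)."""
    n = len(s)
    lengths = [0] * n
    for i in range(1, n):
        k = 0
        while i + k < n and s[k] == s[i + k]:
            k += 1
        lengths[i] = k
    return lengths

def collapse_repetition(s, sep=" and "):
    """If s is one unit repeated with sep in between, return the unit;
    otherwise return the whitespace-normalised string unchanged."""
    s = " ".join(s.split())  # undo line wrapping first
    lengths = prefix_match_lengths(s)
    n = len(s)
    for p in range(1, n):
        # The rest of the string from p onwards repeats the start,
        # and the part before p ends with the separator.
        if lengths[p] == n - p and s[:p].endswith(sep):
            return s[:p - len(sep)]
    return s

print(collapse_repetition(
    "Miller, Rowland S. und Mark R. Leary and Miller, Rowland S. und Mark R. Leary"
))
```

A value that is not an exact repetition is returned as it is, so genuine multi-author fields are left alone.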


Thanks for your answer. The first method does not seem to fit very well, but I will try the second approach. – StandardNerd