Long-time lurker, but first-time poster on StackOverflow. I've hit a wall with a data analysis project I'm working on: preventing duplicate rows when performing a merge.
Essentially, if I have, e.g., CSV 'A':
id | item_num
A123 | 1
A123 | 2
B456 | 1
and I have, e.g., CSV 'B':
id | description
A123 | Mary had a...
A123 | ...little lamb.
B456 | ...Its fleece...
If I perform a merge using Pandas, it ends up like this:
id | item_num | description
A123 | 1 | Mary had a...
A123 | 2 | Mary had a...
A123 | 1 | ...little lamb.
A123 | 2 | ...little lamb.
B456 | 1 | ...Its fleece...
How can I make it end up like this instead:
id | item_num | description
A123 | 1 | Mary had a...
A123 | 2 | ...little lamb.
B456 | 1 | Its fleece...
Here's my code:
import pandas as pd
# Import CSVs
first = pd.read_csv("../PATH_TO_CSV/A.csv")
print("Imported first CSV: " + str(first.shape))
second = pd.read_csv("../PATH_TO_CSV/B.csv")
print("Imported second CSV: " + str(second.shape))
# Create a resultant, but empty, DF, and then append the merge.
result = pd.DataFrame()
result = result.append(pd.merge(first, second), ignore_index = True)
print("Merged CSVs... resulting DataFrame is: " + str(result.shape))
# Let's do a "dedupe" to deal with an issue in how Pandas handles datetime merges.
# I read that if datetime is involved, duplicate entries can be created.
result = result.drop_duplicates()
print("Deduping... resulting DataFrame is: " + str(result.shape))
# Save to another CSV
result.to_csv("EXPORT.csv", index=False)
print("Saved to file.")
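For context on what goes wrong above: merging on `id` alone is a many-to-many join, so every `A123` row in A pairs with every `A123` row in B, and `drop_duplicates()` can't undo that. A minimal sketch with in-memory stand-ins for the two CSVs, plus one possible fix; the derived `groupby().cumcount()` key is my assumption and relies on B's rows already being in order within each id:

```python
import pandas as pd

# In-memory stand-ins for A.csv and B.csv from the question.
first = pd.DataFrame({
    "id": ["A123", "A123", "B456"],
    "item_num": [1, 2, 1],
})
second = pd.DataFrame({
    "id": ["A123", "A123", "B456"],
    "description": ["Mary had a...", "...little lamb.", "...Its fleece..."],
})

# Merging on 'id' alone is many-to-many: both A123 rows in `first`
# pair with both A123 rows in `second` (2 x 2 = 4), plus one B456 row.
blown_up = first.merge(second, on="id")
print(len(blown_up))  # 5 rows instead of the desired 3

# Possible fix: give `second` its own per-id row counter so the two
# frames share an unambiguous key, then merge on both columns.
second = second.assign(item_num=second.groupby("id").cumcount() + 1)
result = first.merge(second, on=["id", "item_num"])
print(result)
```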
I'd really appreciate any help - I'm stuck! I'm working with 20,000+ rows.
Thanks.
EDIT: My post was flagged as a potential duplicate. It isn't, since I don't necessarily want to add a column - I just want to keep the description from being multiplied by the number of item_nums attributed to a particular id.
UPDATE,6/21:
How could I do the merge if the two DFs looked like this?
id | item_num | other_col
A123 | 1 | lorem ipsum
A123 | 2 | dolor sit
A123 | 3 | amet, consectetur
B456 | 1 | lorem ipsum
And I have, e.g., CSV 'B':
id | item_num | description
A123 | 1 | Mary had a...
A123 | 2 | ...little lamb.
B456 | 1 | ...Its fleece...
So that I end up with:
id | item_num | other_col | description
A123 | 1 | lorem ipsum | Mary had a...
A123 | 2 | dolor sit | ...little lamb.
B456 | 1 | lorem ipsum | ...Its fleece...
Meaning, the third row (the one with "amet, consectetur" in other_col) gets ignored.
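Since both frames in this update already share `id` and `item_num`, an inner merge on both key columns would seem to drop the unmatched (A123, 3) row on its own. A sketch under that assumption, again with in-memory stand-ins for the two frames:

```python
import pandas as pd

# In-memory stand-ins for the two updated DataFrames.
first = pd.DataFrame({
    "id": ["A123", "A123", "A123", "B456"],
    "item_num": [1, 2, 3, 1],
    "other_col": ["lorem ipsum", "dolor sit", "amet, consectetur", "lorem ipsum"],
})
second = pd.DataFrame({
    "id": ["A123", "A123", "B456"],
    "item_num": [1, 2, 1],
    "description": ["Mary had a...", "...little lamb.", "...Its fleece..."],
})

# An inner merge keeps only id/item_num pairs present in BOTH frames,
# so the (A123, 3, "amet, consectetur") row is dropped automatically.
result = first.merge(second, on=["id", "item_num"], how="inner")
print(result)
```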
Possible duplicate of [Adding new column to existing DataFrame in Python pandas](http://stackoverflow.com/questions/12555323) – TemporalWolf
It looks like you want ['concat' or 'append'](http://pandas.pydata.org/pandas-docs/stable/merging.html), rather than 'merge'. – TemporalWolf