2017-01-17 34 views
0

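Can I group files by date and ID and diff them?

I have a bunch of files in a directory, 698 to be exact. Each file name contains a date, a unique ID, and a name. These are the imports I am using, followed by a few example file names: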

import pandas as pd
from pandas import Series, DataFrame
from collections import defaultdict
import numpy as np
import csv
import os
import re

20151231_7801_Test_Maps.txt 
20151231_7801_Test_Items.txt 
20151231_7802_Test_Maps.txt 
20151231_7802_Test_Items.txt 

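I want to group them by date and ID, open each file in a group (Maps and Items), and run a difference analysis on certain IDs in the files. How would I do that?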

So far I have this as my code, but I don't know how to loop through and open each file in each group:

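groups = defaultdict(list)
# Raw string so the backslash in the Windows path is taken literally
for filename in os.listdir(r'F:\Desktop'):
    date = filename[:8]
    identifier = filename[10:14]
    basename, extension = os.path.splitext(filename)
    # Key each group by the (date, identifier) pair
    groups[date, identifier].append(filename)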

My output prints some of the groups correctly, but not all of them. For example:

('20151231', '7801') ['20151231_7801_Test_Maps.txt', '20151231_7801_Test_Items.txt']

Some groups print only one file, even though there are two files for that date and identifier.
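
One possible cause, offered as a guess because the real file names are not shown in full: fixed-position slices such as `filename[10:14]` only work when every name has exactly the same layout. For the sample names above, the ID actually sits at `filename[9:13]`. Below is a minimal sketch that splits on the underscore instead, assuming every name follows `<date>_<id>_<rest>`; `group_by_date_and_id` is a hypothetical helper name, not something from the original code.

```python
from collections import defaultdict


def group_by_date_and_id(filenames):
    """Group file names by their first two underscore-separated fields."""
    groups = defaultdict(list)
    for name in filenames:
        parts = name.split('_')
        if len(parts) < 3:
            # Name does not match the assumed <date>_<id>_<rest> layout.
            continue
        date, identifier = parts[0], parts[1]
        groups[date, identifier].append(name)
    return dict(groups)


sample = [
    '20151231_7801_Test_Maps.txt',
    '20151231_7801_Test_Items.txt',
    '20151231_7802_Test_Maps.txt',
    '20151231_7802_Test_Items.txt',
]
for key, names in sorted(group_by_date_and_id(sample).items()):
    print(key, names)
```

In the real script, `os.listdir(...)` would supply the list of names in place of `sample`.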

That is not my main concern, though. Once the files are split into groups, I want to load each file in a group into a DataFrame, like this:

for key in groups:
    # file1 and file2 are placeholders for the two files in the group;
    # I don't know how to get them from the group yet
    maps = pd.read_csv(file1, sep='\t', usecols=['ID'], skipfooter=0, engine='python')
    items = pd.read_csv(file2, sep='\t', usecols=['ID'], skipfooter=0, engine='python')

    # check the IDs between the two files and look for differences
    set(maps.ID).difference(items.ID)
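
Note that `set.difference` is one-directional: as written, it only returns IDs that are in the Maps file and missing from the Items file. A small sketch with made-up IDs (illustrative values, not taken from the real files) shows how it compares with `symmetric_difference`:

```python
maps_ids = {'A1', 'B2', 'C3'}   # made-up IDs
items_ids = {'B2', 'C3', 'D4'}  # made-up IDs

only_in_maps = maps_ids.difference(items_ids)
only_in_items = items_ids.difference(maps_ids)
in_exactly_one = maps_ids.symmetric_difference(items_ids)

print(sorted(only_in_maps))    # ['A1']
print(sorted(only_in_items))   # ['D4']
print(sorted(in_exactly_one))  # ['A1', 'D4']
```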

Can someone please help with grouping the files by date and ID, and then looping over each group to open the files? Thanks!

Answers

0

Building on Shijo's answer, I found a decent way to do this.

import os
from collections import defaultdict

import pandas as pd

# pathloc must be set to the directory that contains the files
groups = defaultdict(list)
output = []

for filename in os.listdir(pathloc):
    date = filename[:8]
    identifier = filename[14:18]
    groups[date, identifier].append(filename)

for key, fnames in groups.items():
    # os.listdir gives no ordering guarantee, so sort for a stable order;
    # for names like the examples, the Items file sorts before the Maps file
    fnames = sorted(fnames)
    print(fnames)
    maps = pd.read_csv(os.path.join(pathloc, fnames[1]), sep='\t', usecols=['ID'], skipfooter=0, engine='python')
    items = pd.read_csv(os.path.join(pathloc, fnames[0]), sep='\t', usecols=['ID'], skipfooter=0, engine='python')

    # IDs that appear in only one of the two files
    diffs = set(maps.ID).symmetric_difference(items.ID)

    filedicts = {}
    filedicts['FileIDKey'] = list(key)
    filedicts['Missing_IDs'] = list(diffs)
    filedicts['FileNames'] = fnames

    output.append(filedicts)

This then lets me load that master list of dictionaries into a DataFrame:

new = pd.DataFrame(output) 
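
The loop above picks files by position (`fnames[0]`, `fnames[1]`), which assumes every group holds exactly two files and that their order is predictable; a group with only one file would raise an `IndexError`. Here is a hedged sketch that picks the files by suffix instead, assuming the names end in `_Maps.txt` or `_Items.txt` as in the question's examples. `split_pair` is a hypothetical helper, not part of the original answer.

```python
def split_pair(fnames):
    """Return (maps_file, items_file), or None if the group is incomplete."""
    maps_file = next((f for f in fnames if f.endswith('_Maps.txt')), None)
    items_file = next((f for f in fnames if f.endswith('_Items.txt')), None)
    if maps_file is None or items_file is None:
        return None
    return maps_file, items_file


print(split_pair(['20151231_7801_Test_Items.txt', '20151231_7801_Test_Maps.txt']))
print(split_pair(['20151231_7802_Test_Maps.txt']))
```

Inside the loop, a `None` result could be used to skip or log an incomplete group before calling `pd.read_csv`.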
1

I took some help from https://stackoverflow.com/a/20228113/6626530 and came up with this:

import os
from collections import defaultdict

import pandas as pd

difference = pd.DataFrame(columns=('Filename1', 'Filename2', 'DiffID1', 'DiffID2'))

# Raw string so the backslashes are not treated as escape sequences
pathloc = r'C:\Users\shmathew\Desktop\Sample\abc'
groups = defaultdict(list)
for filename in os.listdir(pathloc):
    date = filename[:8]
    identifier = filename[10:14]
    groups[date, identifier].append(filename)

for key, filenames in groups.items():
    # Index 1 is read as the Maps file and index 0 as the Items file
    maps = pd.read_csv(os.path.join(pathloc, filenames[1]), sep='\t', usecols=['ID'], skipfooter=0, engine='python')
    items = pd.read_csv(os.path.join(pathloc, filenames[0]), sep='\t', usecols=['ID'], skipfooter=0, engine='python')

    # Stack both ID columns, then keep the rows whose ID occurs exactly once
    df = pd.concat([maps, items])
    df = df.reset_index(drop=True)
    df_gpby = df.groupby(list(df.columns))
    idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]

    ids = df.reindex(idx)
    row = list(filenames)
    row.extend(list(ids['ID']))

    print(row)
    # Caution: DataFrame.append returns a new object and does not modify
    # `difference` in place, so `difference` stays empty (see the output).
    # The method was also removed in pandas 2.0.
    difference.append(row)
print(difference)

Output

['20151231_7802_Test_Items.txt', '20151231_7802_Test_Maps.txt', '00432931830TRNY1 ', '00432xx0TRNY1 '] 
['20151231_7801_Test_Items.txt', '20151231_7801_Test_Maps.txt'] 
Empty DataFrame 
Columns: [Filename1, Filename2, DiffID1, DiffID2] 
Index: [] 
+0

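Thanks! This works well. I was wondering whether there is a way to put this into a DataFrame with a column named Difference, with the file name/ID next to each record? (That would make it easier to filter for reporting purposes.) – staten12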

+0

Updated the code, but I couldn't get them into a DataFrame – Shijo
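
One way to get the layout asked for in the comments, sketched under assumptions: build one record per differing ID in plain Python and hand the list to `pd.DataFrame`. The helper name `difference_records` and the column names are illustrative, and the IDs are passed in as plain lists (for example `list(maps.ID)`).

```python
def difference_records(key, maps_name, items_name, maps_ids, items_ids):
    """Return one dict per ID that appears in only one of the two files."""
    date, identifier = key
    maps_set, items_set = set(maps_ids), set(items_ids)
    records = []
    for id_value in sorted(maps_set.symmetric_difference(items_set)):
        # Record which file actually contains the ID.
        found_in = maps_name if id_value in maps_set else items_name
        records.append({
            'Date': date,
            'Identifier': identifier,
            'Difference': id_value,
            'FoundIn': found_in,
        })
    return records


rows = difference_records(
    ('20151231', '7802'),
    '20151231_7802_Test_Maps.txt',
    '20151231_7802_Test_Items.txt',
    ['X1', 'X2', 'X3'],  # made-up IDs
    ['X2', 'X3', 'X4'],  # made-up IDs
)
for row in rows:
    print(row)
```

Extending a single list with the records from every group and then calling `pd.DataFrame(all_rows)` should give a frame with a `Difference` column that can be filtered by file name or ID.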
