2017-06-04 19 views
1

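Python - filter a column from a .dat file and return the corresponding values from another column

I'm new to Python and have been working with a (150-line) data file I created containing student ID numbers, grades, ages, class_code, area_code, and so on. What I want to do with the data is not only filter by a single column (by grade, age, etc.), but also build a list of the values from a different column (the student ID) for the rows that match. I've managed to work out how to isolate the column I need to search for a particular value, but I can't figure out how to create the list of values I need returned.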

Here is a sample of the data (six rows):

1/A/15/13/43214 
2/I/15/21/58322 
3/C/17/89/68470 
4/I/18/6/57362 
5/I/14/4/00000 
6/A/16/23/34567 

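I need a list of the first column (student ID) based on filtering the second column (grade)... (and eventually the third column, the fourth column, and so on, but if I can see how it's done for just the second, I think I can figure out the others). Also note: I'm not using a header row in the .dat file.

I've figured out how to isolate/view the second column.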

import numpy 

data = numpy.genfromtxt('/testdata.dat', delimiter='/', dtype='unicode') 

grades = data[:,1] 
print (grades) 

This prints:

['A' 'I' 'C' 'I' 'I' 'A'] 

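But now, how can I pull just the values in the first column that correspond to the A's, the C's, and the I's into separate lists?

So I'd like to see lists of the column 1 values, as integers separated by commas, for the A's, the C's, and the I's: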

list from A = [1, 6] 
list from C = [3] 
list from I = [2, 4, 5] 

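Again, if I can see how this is done for just the second column and just one value (say A), I think I can figure out how to do it for the B's, C's, D's, etc., and for the other columns. I just need to see one example of how the syntax is applied, and then I can do the same for the rest.

Also, I've been using numpy, but I've read about pandas and csv as well, and I think those libraries might also work. But like I said, I've been using numpy for the .dat file. I don't know whether one of the other libraries would be easier to use.

Answers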

1

A pandas solution:

import pandas as pd 

df = pd.read_csv('data.txt', header=None, sep='/') 
dfs = {k:v for k,v in df.groupby(1)} 

This gives us a dictionary of DataFrames, keyed by grade:

In [59]: dfs.keys() 
Out[59]: dict_keys(['I', 'C', 'A']) 

In [60]: dfs['I'] 
Out[60]: 
   0  1   2   3      4
1  2  I  15  21  58322
3  4  I  18   6  57362
4  5  I  14   4      0

In [61]: dfs['C'] 
Out[61]: 
   0  1   2   3      4
2  3  C  17  89  68470

In [62]: dfs['A'] 
Out[62]: 
   0  1   2   3      4
0  1  A  15  13  43214
5  6  A  16  23  34567

If you want the first column's values as a separate list for each group:

In [67]: dfs['I'].iloc[:, 0].tolist() 
Out[67]: [2, 4, 5] 

In [68]: dfs['C'].iloc[:, 0].tolist() 
Out[68]: [3] 

In [69]: dfs['A'].iloc[:, 0].tolist() 
Out[69]: [1, 6] 
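
If the per-grade DataFrames are not needed, the same ID lists can probably be collected in one step. The following is only a sketch, assuming the six sample rows from the question; the data is inlined through `io.StringIO` so it runs without a file, and the name `ids_by_grade` is made up for illustration.

```python
import io

import pandas as pd

# Sample rows from the question, inlined so the sketch runs without a file.
sample = """1/A/15/13/43214
2/I/15/21/58322
3/C/17/89/68470
4/I/18/6/57362
5/I/14/4/00000
6/A/16/23/34567
"""

df = pd.read_csv(io.StringIO(sample), header=None, sep="/")

# Group by column 1 (grade) and collect column 0 (student ID)
# into one list per grade, then turn the result into a plain dict.
ids_by_grade = df.groupby(1)[0].apply(list).to_dict()

print(ids_by_grade["A"])  # IDs of the A-grade students
print(ids_by_grade["I"])  # IDs of the I-grade students
```

Grouping by another column only means changing the `1` in `groupby(1)` to that column's position.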
1

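You can loop over the grades and build a boolean mask that selects the rows matching a particular grade. This could probably use some refinement.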

import numpy as np 

grades = np.genfromtxt('data.txt', delimiter='/', skip_header=0, dtype='unicode') 


res = {} 
for grade in set(grades[:, 1].tolist()): 
    res[grade] = grades[grades[:, 1]==grade][:,0].tolist() 

print(res)
+0

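So I've been playing with the different solutions posted so far. I like your solution. It shows res as a set of lists. I've tried looking it up, and I'm still searching, but is there a way to separate the lists from one another? So that I could basically have a list of the 'A' grades from res, a list of the 'C' grades from res, and so on? All I've found is how to add a list to a set, remove a list from a list, or take subsets of lists. But I can't seem to find anything about a collection of multiple lists. – chitown88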
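
On the question in the comment above: `res` is a dict that maps each grade to a list, so each list can be pulled out by its key. A minimal sketch follows, with a hand-written stand-in for `res` (an assumption, so the snippet runs without numpy or the data file); the names `a_ids`, `c_ids`, `b_ids` and `a_ids_int` are made up for illustration.

```python
# Hypothetical stand-in for the `res` dict built by the loop above.
# The IDs are strings because the file was read with dtype='unicode'.
res = {"A": ["1", "6"], "C": ["3"], "I": ["2", "4", "5"]}

# Index the dict by grade to get one list at a time.
a_ids = res["A"]
c_ids = res["C"]

# .get() avoids a KeyError for a grade that never occurs in the data.
b_ids = res.get("B", [])

# Convert to integers if numeric IDs are wanted.
a_ids_int = [int(student_id) for student_id in a_ids]

# Or walk over every grade and its list in turn.
for grade, ids in sorted(res.items()):
    print("list from", grade, "=", ids)
```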

1

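You actually don't need any additional modules for such a simple task. A pure-Python solution would read the file line by line and "parse" each line with str.split(), which gives you your lists, and then you can filter on pretty much any parameter. Something like: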

students = {} # store for our students by grade 
with open("testdata.dat", "r") as f: # open the file
    for line in f: # read the file line by line
        row = line.strip().split("/") # split the line into individual columns
        # you can now directly filter your row, or you can store the row in a list for later
        # let's split them by grade:
        grade = row[1] # second column of our row is the grade
        # create/append the sublist in our `students` dict keyed by the grade
        students[grade] = students.get(grade, []) + [row]
# now your students dict contains all students split by grade, e.g.: 
a_students = students["A"] 
# [['1', 'A', '15', '13', '43214'], ['6', 'A', '16', '23', '34567']] 

# if you want only to collect the A-grade student IDs, you can get a list of them as: 
student_ids = [entry[0] for entry in students["A"]] 
# ['1', '6'] 
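
A variation on the same idea, offered only as a sketch: `collections.defaultdict` can do the grouping, and converting the ID with `int()` yields the integer lists asked for in the question. The sample rows are inlined through `io.StringIO` as an assumption so the snippet runs without `testdata.dat`; `ids_by_grade` is an illustrative name.

```python
import io
from collections import defaultdict

# Sample rows from the question; with a real file, use
# `open("testdata.dat")` in place of this StringIO object.
sample = io.StringIO(
    "1/A/15/13/43214\n"
    "2/I/15/21/58322\n"
    "3/C/17/89/68470\n"
    "4/I/18/6/57362\n"
    "5/I/14/4/00000\n"
    "6/A/16/23/34567\n"
)

ids_by_grade = defaultdict(list)  # a missing grade starts as an empty list
for line in sample:
    row = line.strip().split("/")
    if len(row) < 2:  # skip blank or malformed lines
        continue
    # row[1] is the grade, row[0] is the student ID
    ids_by_grade[row[1]].append(int(row[0]))

print(ids_by_grade["A"])
print(ids_by_grade["I"])
```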

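But let's take a few steps back. If you want a more general solution, you should just store your rows in a list and then create a function that filters them by the parameters you pass in, like so: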

# define a filter function 
# filters should contain a list of filters whereas a filter would be defined as: 
# [position, [values]] 
# and you can define as many as you want 
def filter_sublists(source, filters=None):
    result = [] # store for our result
    filters = filters or [] # in case no filters were passed
    for element in source: # go through every element of our source data
        try:
            if all(element[f[0]] in f[1] for f in filters): # check if all our filters match
                result.append(element) # add the element
        except IndexError: # invalid filter position or data position, ignore
            pass
    return result # return the result

# now we can use it to filter our data, first lets load our data: 

with open("testdata.dat", "r") as f: # open the file 
    students = [line.strip().split("/") for line in f] # store all our students as a list 

# now we have all the data in the `students` list and we can filter it by any element 
a_students = filter_sublists(students, [[1, ["A"]]]) 
# [['1', 'A', '15', '13', '43214'], ['6', 'A', '16', '23', '34567']] 

# or again, if you just need the IDs: 
a_student_ids = [entry[0] for entry in filter_sublists(students, [[1, ["A"]]])] 
# ['1', '6'] 

# but you can filter by any parameter, for example: 
age_15_students = filter_sublists(students, [[2, ["15"]]]) 
# [['1', 'A', '15', '13', '43214'], ['2', 'I', '15', '21', '58322']] 

# or you can get all I-grade students aged 14 or 15: 
i_students = filter_sublists(students, [[1, ["I"]], [2, ["14", "15"]]]) 
# [['2', 'I', '15', '21', '58322'], ['5', 'I', '14', '4', '00000']] 
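
The filters above match on exact string values. For numeric comparisons such as "aged 16 or older", one possible extension is to pass a test function per column instead of a list of allowed values. The sketch below is not part of the answer above: `filter_rows` and its `predicates` argument are hypothetical names, the rows are read with the standard `csv` module (mentioned in the question) using `/` as the delimiter, and the sample data is inlined as an assumption.

```python
import csv
import io

# Sample rows from the question; with a real file, pass the open file
# object to csv.reader in place of this StringIO object.
sample = io.StringIO(
    "1/A/15/13/43214\n"
    "2/I/15/21/58322\n"
    "3/C/17/89/68470\n"
    "4/I/18/6/57362\n"
    "5/I/14/4/00000\n"
    "6/A/16/23/34567\n"
)
rows = list(csv.reader(sample, delimiter="/"))


def filter_rows(rows, predicates):
    """Keep the rows for which every column test returns True.

    `predicates` maps a column position to a function that takes the
    (string) value in that column and returns True or False.
    """
    return [
        row
        for row in rows
        if all(
            pos < len(row) and test(row[pos])  # too-short rows never match
            for pos, test in predicates.items()
        )
    ]


# IDs of all students aged 16 or older
older_ids = [int(r[0]) for r in filter_rows(rows, {2: lambda age: int(age) >= 16})]

# IDs of I-grade students younger than 18
young_i_ids = [
    int(r[0])
    for r in filter_rows(rows, {1: lambda g: g == "I", 2: lambda age: int(age) < 18})
]

print(older_ids)
print(young_i_ids)
```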