2016-08-13 66 views
1

Pandas: extracting lists from a DataFrame

Suppose I have a DataFrame like the one below:

 FDT_DATE FFLT_LATITUDE FFLT_LONGITUDE FINT_STAT FSTR_ID 
51307 1417390467000 31.2899  121.4845 0 112609 
51308 1417390428000 31.2910  121.4859 0 112609 
51309 1417390608000 31.2944  121.4857 1 112609 
51310 1417390548000 31.2940  121.4850 1 112609 
51313 1417390668000 31.2954  121.4886 1 112609 
51314 1417390717000 31.2965  121.4937 1 112609 
53593 1417390758000 31.2946  121.4940 0 112609 
63586 1417390798000 31.2932  121.4960 1 112609 
63587 1417390818000 31.2940  121.4966 1 112609 
63588 1417390827000 31.2946  121.4974 1 112609 
63589 1417390907000 31.2952  121.4986 0 112609 

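I want to extract the position records as a list of polylines. That is, for records with the same FSTR_ID, each consecutive run of records whose FINT_STAT equals 1 should be collected into one polyline: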

FSTR_ID FDT_DATE POLYLINE 
0 112609 1417390608000 [[31.2944,121.4857],[31.2940,121.4850],[31.2954,121.4886],[31.2965,121.4937]] 
1 112609 1417390798000 [[31.2932,121.4960],[31.2940,121.4966],[31.2946, 121.4974]] 

How can I do this?

The original dataset can be generated with this code:

import pandas as pd 
df = pd.DataFrame({"FDT_DATE":{"0":1417390467000,"1":1417390428000,"2":1417390608000,"3":1417390548000,"4":1417390668000,"5":1417390717000,"6":1417390758000,"7":1417390798000,"8":1417390818000,"9":1417390827000,"10":1417390907000},"FFLT_LATITUDE":{"0":31.2899,"1":31.291,"2":31.2944,"3":31.294,"4":31.2954,"5":31.2965,"6":31.2946,"7":31.2932,"8":31.294,"9":31.2946,"10":31.2952},"FFLT_LONGITUDE":{"0":121.4845,"1":121.4859,"2":121.4857,"3":121.485,"4":121.4886,"5":121.4937,"6":121.494,"7":121.496,"8":121.4966,"9":121.4974,"10":121.4986},"FINT_STAT":{"0":0,"1":0,"2":1,"3":1,"4":1,"5":1,"6":0,"7":1,"8":1,"9":1,"10":0},"FSTR_ID":{"0":112609,"1":112609,"2":112609,"3":112609,"4":112609,"5":112609,"6":112609,"7":112609,"8":112609,"9":112609,"10":112609}}) 
df = df.sort_values('FDT_DATE')  # DataFrame.sort() was removed from pandas; sort_values() replaces it

Answers

1

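You can insert a list into a pandas.DataFrame() cell with the .set_value() method (removed in pandas 1.0; the .at accessor is its replacement). The column dtype should be object.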

import numpy as np
import pandas as pd

df = pd.DataFrame({"FDT_DATE":[1417390467000, 1417390428000, 1417390608000, 1417390548000, 
    1417390668000, 1417390717000, 1417390758000, 1417390798000, 1417390818000, 
    1417390827000, 1417390907000], "FFLT_LATITUDE":[31.2899, 31.291, 31.2944, 31.294, 
    31.2954, 31.2965, 31.2946, 31.2932, 31.294, 31.2946, 31.2952], 
    "FFLT_LONGITUDE":[121.4845, 121.4859, 121.4857, 121.485, 121.4886, 121.4937, 
    121.494, 121.496, 121.4966, 121.4974, 121.4986], 
    "FINT_STAT":[0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0], 
    "FSTR_ID":[112609, 112609, 112609, 112609, 112609, 112609, 112609, 112609, 
    112609, 112609, 112609]}) 

# Sort by date, then expose the new 0..n-1 position as an 'index' column
df = df.sort_values('FDT_DATE').reset_index(drop=True).reset_index()

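def func(x):
    # 'a' holds the label of the current run; 'b' counts the FINT_STAT == 0 rows seen so far
    global a
    global b
    if (x['index'] - x['FINT_STAT']) != x['index']:
        # FINT_STAT is non-zero: the row belongs to the current run
        return a
    else:
        # FINT_STAT is 0: prepare a new label and implicitly return None,
        # so this row gets no label in 't1'
        b += 1
        a = b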

# Create 't1' column for filter "1" groups in 'FINT_STAT' column 
a = 0 
b = 0 
df['t1'] = df[['index', 'FINT_STAT']].apply(lambda x: func(x), axis=1) 

# Initialize result dataframe 
df_res = df.drop_duplicates(subset=['t1'])[['FSTR_ID', 'FDT_DATE', 't1']].copy()\ 
    .reset_index(drop=True) 
df_res = df_res.dropna().reset_index(drop=True) 

# First create 'POLYLINE' column then convert it into 'object' 
df_res['POLYLINE'] = np.nan 
df_res['POLYLINE'] = df_res['POLYLINE'].astype(object) 

# Insert each list into a single cell with '.at'
# (the original 'pd.DataFrame.set_value()' was removed in pandas 1.0)
for i in df['t1'].dropna().unique():
    row = df_res.index[df_res['t1'] == i][0]
    df_res.at[row, 'POLYLINE'] = df.loc[df['t1'] == i,
                                        ['FFLT_LATITUDE', 'FFLT_LONGITUDE']].values.tolist()

df_res = df_res.drop(['t1'], axis=1) 

Result (note that the result you posted is not sorted by 'FDT_DATE'):

FSTR_ID  FDT_DATE                   POLYLINE 
0 112609 1417390548000 [[31.294, 121.485], [31.2944, 121.4857], [31.2954, 121.4886], [31.2965, 121.4937]] 
1 112609 1417390798000      [[31.2932, 121.496], [31.294, 121.4966], [31.2946, 121.4974]] 
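
The cell-assignment step in the answer above can be tried in isolation. The following is only a minimal sketch, assuming a current pandas version where .at is used in place of .set_value(); the toy frame named demo is hypothetical and not part of the original data.

```python
import pandas as pd

# Hypothetical two-row frame, used only to illustrate the assignment.
demo = pd.DataFrame({"FSTR_ID": [112609, 112609]})

# Create the target column with object dtype so that one cell can hold a list.
demo["POLYLINE"] = pd.Series([None, None], dtype=object)

# .at addresses exactly one cell by row label and column name.
demo.at[0, "POLYLINE"] = [[31.2944, 121.4857], [31.2940, 121.4850]]
demo.at[1, "POLYLINE"] = [[31.2932, 121.4960]]

print(demo)
```

If the column is not of object dtype, pandas may try to treat the list as several values rather than as one cell value, which is why the dtype is set up first.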
2
import pandas as pd 
import numpy as np 

# Initializing the data 
df = pd.DataFrame({'FDT_DATE': {0: 1417390467000, 1: 1417390428000, 2: 1417390608000, 3: 1417390548000, 
           4: 1417390668000, 5: 1417390717000, 6: 1417390758000, 7: 1417390798000, 
           8: 1417390818000, 9: 1417390827000, 10: 1417390907000}, 
        'FFLT_LATITUDE': {0: 31.2899, 1: 31.291, 2: 31.2944, 3: 31.294, 4: 31.2954, 
            5: 31.2965, 6: 31.2946, 7: 31.2932, 8: 31.294, 9: 31.2946, 
            10: 31.2952}, 
        'FFLT_LONGITUDE': {0: 121.4845, 1: 121.4859, 2: 121.4857, 3: 121.485, 4: 121.4886, 
             5: 121.4937, 6: 121.494, 7: 121.496, 8: 121.4966, 9: 121.4974, 
             10: 121.4986}, 
        'FINT_STAT': {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 1, 6: 0, 7: 1, 8: 1, 9: 1, 
           10: 0}, 
        'FSTR_ID': {0: 112609, 1: 112609, 2: 112609, 3: 112609, 4: 112609, 5: 112609, 
           6: 112609, 7: 112609, 8: 112609, 9: 112609, 10: 112609}}) 

# Transforming sequences of records with FINT_STAT == 1 to unique GROUP_ID values 
df['GROUP_ID'] = df['FINT_STAT'].apply(np.logical_not).cumsum() 
# Marking groups with FINT_STAT == 0 for removing 
df['GROUP_ID'] *= df['FINT_STAT'] 
# Removing marked groups 
df['GROUP_ID'] = df['GROUP_ID'].replace(0, np.nan)

# Grouping by columns GROUP_ID and FSTR_ID 
gb = df.groupby(['GROUP_ID', 'FSTR_ID']) 

result = pd.DataFrame() 
# Appending columns with values of minimal FDT_DATE for every group 
result['MIN_FDT_DATE'] = gb['FDT_DATE'].min() 
# Aggregating results by applying the lambda 
# which return list of pairs of FFLT_LATITUDE and FFLT_LONGITUDE 
result['COORDINATES'] = gb.apply(lambda group: [(row['FFLT_LATITUDE'], row['FFLT_LONGITUDE']) 
           for _, row in group.iterrows()]) 


# Widening line and max column width for printing
# ('display.width' is the current name of the old 'display.line_width' option)
pd.set_option('display.width', 300) 
pd.set_option('display.max_colwidth', 200) 
# Looking at result 
print(result) 

Output:

    MIN_FDT_DATE                   COORDINATES 
GROUP_ID FSTR_ID                         
2.0  112609 1417390548000 [(31.2944, 121.4857), (31.294, 121.485), (31.2954, 121.4886), (31.2965, 121.4937)] 
3.0  112609 1417390798000      [(31.2932, 121.496), (31.294, 121.4966), (31.2946, 121.4974)] 
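
For comparison, the same run-labelling idea can be arranged so that the result has the FSTR_ID, FDT_DATE and POLYLINE columns from the question. This is only a sketch under some assumptions: the data is sorted by FDT_DATE first, FDT_DATE in the result is the date of the first record of each run, and RUN_ID is a helper column name introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "FDT_DATE": [1417390467000, 1417390428000, 1417390608000, 1417390548000,
                 1417390668000, 1417390717000, 1417390758000, 1417390798000,
                 1417390818000, 1417390827000, 1417390907000],
    "FFLT_LATITUDE": [31.2899, 31.291, 31.2944, 31.294, 31.2954, 31.2965,
                      31.2946, 31.2932, 31.294, 31.2946, 31.2952],
    "FFLT_LONGITUDE": [121.4845, 121.4859, 121.4857, 121.485, 121.4886,
                       121.4937, 121.494, 121.496, 121.4966, 121.4974, 121.4986],
    "FINT_STAT": [0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0],
    "FSTR_ID": [112609] * 11,
})

# Sort so that each run is in chronological order.
df = df.sort_values(["FSTR_ID", "FDT_DATE"]).reset_index(drop=True)

# Every FINT_STAT == 0 row bumps the counter, so all rows of one
# consecutive FINT_STAT == 1 run share the same RUN_ID (hypothetical name).
df["RUN_ID"] = df["FINT_STAT"].eq(0).cumsum()

# Keep only the rows that belong to a run.
active = df[df["FINT_STAT"] == 1]

rows = []
for (fstr_id, _), grp in active.groupby(["FSTR_ID", "RUN_ID"]):
    rows.append({
        "FSTR_ID": fstr_id,
        # First (earliest) date of the run, since the frame is sorted.
        "FDT_DATE": grp["FDT_DATE"].iloc[0],
        # One [latitude, longitude] pair per record.
        "POLYLINE": grp[["FFLT_LATITUDE", "FFLT_LONGITUDE"]].values.tolist(),
    })

result = pd.DataFrame(rows, columns=["FSTR_ID", "FDT_DATE", "POLYLINE"])
print(result)
```

Grouping on FSTR_ID together with RUN_ID means that a change of FSTR_ID also starts a new polyline, even when FINT_STAT stays at 1 across the change.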
+0

Hi cridnirk, I think your approach is much clearer, but is there a way to keep the original FDT_DATE in the result, as @ragesz did? I tried to keep FDT_DATE in the groupby object, but failed. – jjdblast

+0

@jjdblast: The answer has been updated. – cridnirk