python
  • pandas
  • 2017-01-27

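    pandas: group by date, then filter dates and strings into a new DataFrame

    I'm struggling here. I want to take the data below, group it by date, then check the rows within each group to determine whether the group has any location data associated with it and, if so, extract it.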

    A sample of my data:

    id,dates,text,place 
    1,2017-01-26 01:06:47,text,"Place(country_code='US', full_name='Manhattan, NY', place_type='city', name='Manhattan', contained_within=[], _api=<tweepy.api.API object at 0x10336f320>, attributes={}, country='United States', bounding_box=BoundingBox(type='Polygon', coordinates=[[[-74, 40], [-73, 40], [-73, 40], [-74, 40]]], _api=<tweepy.api.API object at 0x10336f320>))" 
    2,2017-01-26 01:05:51,text,"Place(country_code='US', full_name='Manhattan, NY', place_type='city', name='Manhattan', contained_within=[], _api=<tweepy.api.API object at 0x10336f320>, attributes={}, country='United States', bounding_box=BoundingBox(type='Polygon', coordinates=[[[-74, 40], [-73, 40], [-73, 40], [-74, 40]]], _api=<tweepy.api.API object at 0x10336f320>))" 
    4,2017-01-23 01:38:29,text, 
    5,2017-01-23 01:36:53,text, 
    

    I started by loading the CSV and grouping by date:

    import pandas as pd 
    import matplotlib.pyplot as plt 
    import datetime 
    
    fig = plt.figure(figsize=(5,5)) 
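    df1 = pd.read_csv('data.csv')
    df = df1[['dates','place']].copy()  # copy to avoid SettingWithCopyWarning
    # the timestamps include a time of day, so the format must cover it
    df['dates'] = pd.to_datetime(df['dates'], format='%Y-%m-%d %H:%M:%S')
    df.index = df['dates']
    
    # pd.groupby() is gone from current pandas; call the DataFrame method instead
    grp = df.groupby([df.index.year, df.index.month, df.index.day])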
    for date,group in grp: 
        print(date) 
        print(group) 
    

    This prints output that looks like this:

    (2017, 1, 26) 
                dates \ 
    dates 
    2017-01-26 01:06:47 2017-01-26 01:06:47 
    2017-01-26 01:05:51 2017-01-26 01:05:51 
    
                       place 
    dates 
    2017-01-26 01:06:47 Place(country_code='US', full_name='Manhattan,... 
    2017-01-26 01:05:51            NaN 
    

    This is where I run into trouble with the filtering/conditional logic. My goal is a DataFrame that I can save to a CSV that looks like this:

    date, item_count, has_location, location 
    2017-01-26, 2, yes, Manhattan 
    2017-01-23, 2, no, na 
    

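    What is the best way to proceed? Thanks.

    I'm not sure, but the output seems to differ from the input: there is a problematic row with `id = 3`. I tried to leave it out in my solution, please check it. – jezrael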

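    Answer

    I think you can use:

    First `extract` the `name` from the `place` column, then `groupby` on `dt.date` (if the dtype of `dates` is already datetime, `to_datetime` can be omitted) and aggregate some column such as `id` with `size` and `place` with `first`. Finally, `insert` a new column created by `numpy.where`: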

    print (df) 
        id    dates text \ 
    0 1 2017-01-26 01:06:47 text 
    1 2 2017-01-26 01:05:51 text 
    2 4 2017-01-23 01:38:29 text 
    3 5 2017-01-23 01:36:53 text 
    
                   place 
    0 Place(country_code='US', full_name='Manhattan,... 
    1 Place(country_code='US', full_name='Manhattan,... 
    2            NaN 
    3            NaN 
    
    # non-greedy capture of the short name; expand=False returns a Series, not a DataFrame
    df.place = df.place.str.extract(r", name='(.*?)', contained_within", expand=False)
    print (df) 
        id    dates text  place 
    0 1 2017-01-26 01:06:47 text Manhattan 
    1 2 2017-01-26 01:05:51 text Manhattan 
    2 4 2017-01-23 01:38:29 text  NaN 
    3 5 2017-01-23 01:36:53 text  NaN 
    
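    import numpy as np  # needed for np.where below
    # one row per calendar date: row count and first non-null place
    df1 = df.groupby(pd.to_datetime(df.dates).dt.date).agg({'id':'size', 'place':'first'})
    df1.columns = ['item_count','location']
    # 'no' when the date has no place at all, otherwise 'yes'
    df1.insert(1, 'has_location', np.where(df1.location.isnull(), 'no', 'yes'))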
    print (df1) 
          item_count has_location location 
    dates           
    2017-01-23   2   no  NaN 
    2017-01-26   2   yes Manhattan 
    
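    To handle `id = 3`, use `agg({'size', 'first', lambda x: x.isnull().any()})` to determine `has_location`. – Boud

    @Boud - Thanks. It says `3` rather than `2` (maybe a typo), and I'm not sure I understand it correctly. – jezrael

    @Boud - In that case the output is `location = Manhattan`, so it may be a typo... – jezrael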
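
    The case raised in the comments, a date on which only some rows carry a place, can be handled by computing `has_location` from the group itself. A small sketch, assuming the place names have already been extracted and using made-up rows (including an `id = 3` without a place); it uses `notna().any()` because the flag should be true when at least one place is present:

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration: 2017-01-26 mixes missing and present places.
df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'dates': pd.to_datetime([
        '2017-01-26 01:06:47', '2017-01-26 01:05:51',
        '2017-01-26 01:04:10', '2017-01-23 01:38:29',
    ]),
    'place': [np.nan, 'Manhattan', np.nan, np.nan],
})

summary = df.groupby(df['dates'].dt.date).agg(
    item_count=('id', 'size'),
    # True if at least one row in the group has a place.
    has_location=('place', lambda s: s.notna().any()),
    # 'first' skips missing values, so a leading NaN does not hide a later place.
    location=('place', 'first'),
)
summary['has_location'] = np.where(
    summary['has_location'].astype(bool), 'yes', 'no'
)
print(summary)
```

    Because `first` already ignores missing values, deriving `has_location` from the aggregated `location` gives the same answer here; the explicit flag mainly makes the intent visible.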
