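pandas groupBy date, then filter dates and strings into a new dataframe

I'm struggling here. I want to take the data below, group it by date, and then check the rows within each group to determine whether the group has any location data associated with it and, if so, unpack it.

My sample data: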

id,dates,text,place 
1,2017-01-26 01:06:47,text,"Place(country_code='US', full_name='Manhattan, NY', place_type='city', name='Manhattan', contained_within=[], _api=<tweepy.api.API object at 0x10336f320>, attributes={}, country='United States', bounding_box=BoundingBox(type='Polygon', coordinates=[[[-74, 40], [-73, 40], [-73, 40], [-74, 40]]], _api=<tweepy.api.API object at 0x10336f320>))" 
2,2017-01-26 01:05:51,text,"Place(country_code='US', full_name='Manhattan, NY', place_type='city', name='Manhattan', contained_within=[], _api=<tweepy.api.API object at 0x10336f320>, attributes={}, country='United States', bounding_box=BoundingBox(type='Polygon', coordinates=[[[-74, 40], [-73, 40], [-73, 40], [-74, 40]]], _api=<tweepy.api.API object at 0x10336f320>))" 
4,2017-01-23 01:38:29,text, 
5,2017-01-23 01:36:53,text,

I started by loading the CSV and grouping by date:

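import pandas as pd
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(5,5))
df1 = pd.read_csv('data.csv')
# .copy() avoids SettingWithCopyWarning when the column is overwritten below
df = df1[['dates','place']].copy()
# the timestamps carry a time part, so the format has to cover it
df['dates'] = pd.to_datetime(df['dates'], format='%Y-%m-%d %H:%M:%S')
df.index = df['dates']

# call groupby on the DataFrame; the top-level pd.groupby() is not available in current pandas
grp = df.groupby([df.index.year, df.index.month, df.index.day])
for date, group in grp:
    print(date)
    print(group)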

This prints each group, which looks like this:

(2017, 1, 26) 
            dates \ 
dates 
2017-01-26 01:06:47 2017-01-26 01:06:47 
2017-01-26 01:05:51 2017-01-26 01:05:51 

                   place 
dates 
2017-01-26 01:06:47 Place(country_code='US', full_name='Manhattan,... 
2017-01-26 01:05:51            NaN

This is where I run into trouble with the filtering/conditional logic. My goal is a dataframe that I can save to a CSV and that looks like this:

date, item_count, has_location, location 
2017-01-26, 2, yes, Manhattan 
2017-01-23, 2, no, na

What is the best way to proceed? Thanks.

Source

2017-01-27 sn4ke

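I'm not sure, but the output seems to differ from the input: the problematic row is 'ID = 3'. I tried to leave it out of my solution, please check it. – jezrael

I think you can use:

First extract the name from the place column, then groupby by dt.date (if the dates column already has datetime dtype, to_datetime can be omitted) and aggregate with size on some column such as id and first on the place column. Finally, insert a new column created with numpy.where: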

print (df) 
    id    dates text \ 
0 1 2017-01-26 01:06:47 text 
1 2 2017-01-26 01:05:51 text 
2 4 2017-01-23 01:38:29 text 
3 5 2017-01-23 01:36:53 text 

               place 
0 Place(country_code='US', full_name='Manhattan,... 
1 Place(country_code='US', full_name='Manhattan,... 
2            NaN 
3            NaN 

# expand=False returns a Series, which can be assigned straight back to the column
df.place = df.place.str.extract(", name='(.*)', contained_within", expand=False)
print (df) 
    id    dates text  place 
0 1 2017-01-26 01:06:47 text Manhattan 
1 2 2017-01-26 01:05:51 text Manhattan 
2 4 2017-01-23 01:38:29 text  NaN 
3 5 2017-01-23 01:36:53 text  NaN 

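import numpy as np

# one row per calendar date: row count from id, first non-null place name
df1 = df.groupby(pd.to_datetime(df.dates).dt.date).agg({'id':'size', 'place':'first'})
df1.columns = ['item_count','location']
# flag the dates that ended up with a location
df1.insert(1, 'has_location', np.where(df1.location.isnull(), 'no', 'yes'))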
print (df1) 
      item_count has_location location 
dates           
2017-01-23   2   no  NaN 
2017-01-26   2   yes Manhattan
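
For reference, the steps above can be put together into one self-contained sketch that also produces the CSV text the question asks for. This is only an illustration under stated assumptions: the inline data is a hypothetical, shortened stand-in for data.csv (the place strings are trimmed to the parts the regex needs), and the named aggregation, the non-greedy pattern and the na_rep value are my choices, not part of the original answer.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical, shortened stand-in for data.csv (assumption: the real file
# has the same four columns; place strings are trimmed for readability).
raw = io.StringIO(
    "id,dates,text,place\n"
    "1,2017-01-26 01:06:47,text,\"Place(country_code='US', name='Manhattan', contained_within=[])\"\n"
    "2,2017-01-26 01:05:51,text,\"Place(country_code='US', name='Manhattan', contained_within=[])\"\n"
    "4,2017-01-23 01:38:29,text,\n"
    "5,2017-01-23 01:36:53,text,\n"
)
df = pd.read_csv(raw, parse_dates=["dates"])

# Pull the place name out of the repr-style string; rows without a place stay missing.
df["place"] = df["place"].str.extract(r", name='(.*?)', contained_within", expand=False)

# One row per calendar date: number of rows and first non-missing place name.
summary = df.groupby(df["dates"].dt.date).agg(
    item_count=("id", "size"),
    location=("place", "first"),
)
summary.insert(1, "has_location", np.where(summary["location"].isna(), "no", "yes"))
summary.index.name = "date"

# Newest date first; missing locations are written as "na" as in the question.
csv_text = summary.sort_index(ascending=False).to_csv(na_rep="na")
print(csv_text)
```

Passing a file path to to_csv instead of leaving it empty would write the same content to disk.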

Source

2017-01-27 19:42:13 jezrael

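To handle id = 3, use this 'agg({'size', 'first', lambda x: x.isnull().any()})' to determine has_location – Boud

@Boud - Thanks, it is '3' rather than '2' (possibly a typo); I'm not sure I understand it correctly. – jezrael

@Boud - Then the output is 'location = Manhattan', possibly a typo... – jezrael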
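
One possible reading of Boud's suggestion is to derive has_location from an explicit null check over each group instead of from the aggregated location. The sketch below assumes the place column has already been reduced to the extracted name; the data is hypothetical, and the row with id 3 is invented here to show a date that mixes rows with and without a place, since the original row is not shown in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data: place already holds the extracted name.
# The id 3 row is invented to illustrate a date with mixed rows.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "dates": pd.to_datetime([
        "2017-01-26 01:06:47", "2017-01-26 01:05:51", "2017-01-26 00:59:10",
        "2017-01-23 01:38:29", "2017-01-23 01:36:53",
    ]),
    "place": [np.nan, "Manhattan", np.nan, np.nan, np.nan],
})

day = df["dates"].dt.date

# Row count and first non-missing place name per date.
summary = df.groupby(day).agg(
    item_count=("id", "size"),
    location=("place", "first"),
)

# True for a date if at least one of its rows has a place.
has_loc = df["place"].notna().groupby(day).any()
summary.insert(1, "has_location", has_loc.map({True: "yes", False: "no"}))
print(summary)
```

Because first skips missing values by default, this agrees with the numpy.where approach in the answer for this data; the explicit check only states the intent more directly.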
