2017-02-24

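I have a chat dataset, and I want to group it into conversations and count the number of messages sent in each one.

Here is my data. It is the chat log of "ID", whose name is Jimmy.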

Sender  Receiver Text 
ID   person1 HI 
person1  ID   Hello~ 
ID   person1 My name is Jimmy 
person1  ID   Nice to meet you! 
ID   person1 Nice to meet you, too 
ID   person2 Hi 
person1  ID   Hi there 
ID   person2 My name is Jimmy 
person1  ID   My name is Abi 
ID   person2 Nice to meet you 
...   ....  ..... 

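"ID" can chat with several people.
I want to count the number of messages in each conversation.
In this case, both conversations have 5 messages.

I have already written some code, but it seems very inefficient because my data is large.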

# chat_df is the DataFrame of chat data
df = []
total_message = []
receiver_id = chat_df["Receiver"].unique()
for x in receiver_id:
    # rows where x is either the sender or the receiver
    conversation = chat_df[(chat_df["Receiver"] == x) | (chat_df["Sender"] == x)]
    total_message.append(len(conversation))
    df.append(conversation)

Is there a more efficient way to get the chat data for each pair of people?

Answers

1

I think you need stack and value_counts:

df1 = chat_df[['Sender', 'Receiver']].stack().value_counts().reset_index()
df1.columns = ['People', 'Counts']
print(df1)
    People  Counts
0       ID      10
1  person1       7
2  person2       3
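
The count above is per person, so "ID" appears in every row. If a count per conversation is wanted instead, one possible approach is to build an order-independent key for each sender/receiver pair and group on it. The following is only a sketch under assumptions: the `Conversation` column and the `" & "` separator are hypothetical names introduced here, and the sample rows are copied from the question.

```python
import pandas as pd

# Sample rows copied from the question.
chat_df = pd.DataFrame(
    [
        ("ID", "person1", "HI"),
        ("person1", "ID", "Hello~"),
        ("ID", "person1", "My name is Jimmy"),
        ("person1", "ID", "Nice to meet you!"),
        ("ID", "person1", "Nice to meet you, too"),
        ("ID", "person2", "Hi"),
        ("person1", "ID", "Hi there"),
        ("ID", "person2", "My name is Jimmy"),
        ("person1", "ID", "My name is Abi"),
        ("ID", "person2", "Nice to meet you"),
    ],
    columns=["Sender", "Receiver", "Text"],
)

# Hypothetical helper column: sort each (Sender, Receiver) pair so that
# A -> B and B -> A end up with the same conversation key.
chat_df["Conversation"] = [
    " & ".join(sorted(pair))
    for pair in zip(chat_df["Sender"], chat_df["Receiver"])
]

# Number of messages in each conversation.
message_counts = chat_df.groupby("Conversation").size()

# The rows of each conversation, keyed by the conversation label.
conversations = {key: group for key, group in chat_df.groupby("Conversation")}

print(message_counts)
```

This groups the whole frame once rather than applying one boolean filter per receiver, which is where the loop in the question spends most of its time.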

Edit:

# count the words in each message
chat_df['Len'] = chat_df.Text.str.split().str.len()
# reshape so that every message yields one row for the sender and one for the receiver
chat_df = chat_df.set_index('Len')[['Sender', 'Receiver']].stack().reset_index(name='People')
print(chat_df)
    Len   level_1   People
0     1    Sender       ID
1     1  Receiver  person1
2     1    Sender  person1
3     1  Receiver       ID
4     4    Sender       ID
5     4  Receiver  person1
6     4    Sender  person1
7     4  Receiver       ID
8     5    Sender       ID
9     5  Receiver  person1
10    1    Sender       ID
11    1  Receiver  person2
12    2    Sender  person1
13    2  Receiver       ID
14    4    Sender       ID
15    4  Receiver  person2
16    4    Sender  person1
17    4  Receiver       ID
18    4    Sender       ID
19    4  Receiver  person2

# group by People, then aggregate the message count (size) and the total words (sum)
chat_df1 = chat_df.groupby('People')['Len'].agg(['size', 'sum'])
chat_df1.columns = ['Count', 'Len_words']
chat_df1 = chat_df1.reset_index()
# keep only people with more than 5 messages
chat_df1 = chat_df1[chat_df1['Count'] > 5]
print(chat_df1)
    People  Count  Len_words
0       ID     10         30
1  person1      7         21
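
The reshape above counts every message twice, once for its sender and once for its receiver. If the totals are wanted per conversation partner, a variant that starts from the original, un-reshaped frame might look like the sketch below. It assumes that every row involves the log owner "ID"; the `Partner` column and the `stats` and `busy` names are hypothetical and are not part of the answer.

```python
import pandas as pd

# Sample rows copied from the question.
chat_df = pd.DataFrame(
    [
        ("ID", "person1", "HI"),
        ("person1", "ID", "Hello~"),
        ("ID", "person1", "My name is Jimmy"),
        ("person1", "ID", "Nice to meet you!"),
        ("ID", "person1", "Nice to meet you, too"),
        ("ID", "person2", "Hi"),
        ("person1", "ID", "Hi there"),
        ("ID", "person2", "My name is Jimmy"),
        ("person1", "ID", "My name is Abi"),
        ("ID", "person2", "Nice to meet you"),
    ],
    columns=["Sender", "Receiver", "Text"],
)

# Words per message, as in the answer.
chat_df["Len"] = chat_df["Text"].str.split().str.len()

# Assumption: "ID" takes part in every row, so the partner is whichever
# side of the message is not "ID".
chat_df["Partner"] = chat_df["Receiver"].where(
    chat_df["Sender"] == "ID", chat_df["Sender"]
)

# Message count and total words per conversation partner.
stats = chat_df.groupby("Partner").agg(
    Count=("Len", "size"),
    Len_words=("Len", "sum"),
)

# Boolean indexing: keep only conversations with more than 5 messages.
busy = stats[stats["Count"] > 5]
print(busy)
```

The threshold of 5 follows the filter used in the answer; any other cutoff works the same way.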
+0

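Thanks! This is exactly what I need! One more question: if I also want to count the amount of text in each message, but only for the people with a higher count (more than 5), how would you suggest doing it? Thank you very much! – jimmy15923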

+0

Thank you. I have been thinking about your second question, and I don't think there is a better solution than ['boolean indexing'](http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing). – jezrael

+0

What do you mean by the amount of text? The number of messages? Or the length of each message? – jezrael