You can use a double groupby. A faster solution with size and nlargest:
df3 = (df.groupby(['Names', 'Hobby'])
         .size()
         .groupby(level=0)
         .nlargest(1)
         .reset_index(level=0, drop=True)
         .reset_index(name='Count'))
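To see what each step of the chain produces, here is a minimal sketch on a tiny made-up frame (the names and hobbies below are illustrative, not from the benchmark data):

```python
import pandas as pd

# Hypothetical mini-frame, just to illustrate the chain
df = pd.DataFrame({'Names': ['Andrew', 'Andrew', 'Andrew', 'Bob', 'Bob'],
                   'Hobby': ['Photo', 'Photo', 'Games', 'Travel', 'Travel']})

df3 = (df.groupby(['Names', 'Hobby'])
         .size()                           # count each (name, hobby) pair
         .groupby(level=0)
         .nlargest(1)                      # keep the largest count per name
         .reset_index(level=0, drop=True)  # drop the duplicated Names level
         .reset_index(name='Count'))       # back to a flat DataFrame
print(df3)
# one row per name: Andrew/Photo/2, Bob/Travel/2
```

The extra `reset_index(level=0, drop=True)` is needed because `nlargest` on a grouped Series prepends the group key again, duplicating the `Names` level.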
Another solution is to use Counter:
from collections import Counter
df1 = df.groupby('Names')['Hobby'].apply(lambda x: Counter(x).most_common(1)[0][0])
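The Counter version returns a plain Series mapping each name to its single most common hobby (no counts). A minimal sketch on the same kind of made-up data:

```python
from collections import Counter
import pandas as pd

# Hypothetical mini-frame for illustration
df = pd.DataFrame({'Names': ['Andrew', 'Andrew', 'Andrew', 'Bob', 'Bob'],
                   'Hobby': ['Photo', 'Photo', 'Games', 'Travel', 'Travel']})

# most_common(1) -> [(hobby, count)]; [0][0] extracts just the hobby
df1 = df.groupby('Names')['Hobby'].apply(lambda x: Counter(x).most_common(1)[0][0])
print(df1)
# Andrew -> 'Photo', Bob -> 'Travel'
```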
Timings:
In [52]: %timeit df.groupby(['Names', 'Hobby']).size().groupby(level=0).nlargest(1).reset_index(level=0, drop=True).reset_index(name='Count')
1 loop, best of 3: 191 ms per loop
In [53]: %timeit df.groupby('Names')['Hobby'].apply(lambda x: Counter(x).most_common(1)[0][0])
1 loop, best of 3: 242 ms per loop
In [54]: %timeit df.groupby('Names')['Hobby'].agg(lambda x: pd.value_counts(x).index[0])
1 loop, best of 3: 345 ms per loop
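The third timed variant calls the top-level `pd.value_counts`, which newer pandas versions deprecate; the `Series.value_counts` method is the equivalent. A sketch of that variant, again on hypothetical data:

```python
import pandas as pd

# Hypothetical mini-frame for illustration
df = pd.DataFrame({'Names': ['Andrew', 'Andrew', 'Andrew', 'Bob', 'Bob'],
                   'Hobby': ['Photo', 'Photo', 'Games', 'Travel', 'Travel']})

# value_counts sorts counts descending, so index[0] is the most frequent hobby
out = df.groupby('Names')['Hobby'].agg(lambda x: x.value_counts().index[0])
```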
Setup code for testing:
import numpy as np
import pandas as pd

#[1000000 rows x 2 columns]
np.random.seed(123)
N = 1000000
L1 = ['Andrew', 'Kevin', 'Joe', 'John', 'Bob', 'Peter']
L2 = ['Football', 'Photo', 'Games', 'Travel']
df = pd.DataFrame({'Names': np.random.choice(L1, N),
                   'Hobby': np.random.choice(L2, N)})
print(df)