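Group a row into multiple groups with pandas
I have a set of sentences, and I want to group them so that all rows in a group share a particular word. However, a sentence can belong to many groups, because it contains many words.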
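So in the example below, there should be groups like these:
- A 'temperature' group, which includes all rows (0, 1, 2, 3 and 4)
- A 'freezes' group, which includes rows 2 and 4
- A 'metal' group, which contains only row 0
- A 'the' group, which includes rows 0, 1, 2 and 3
- A group for every other word in the data set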
import pandas as pd
# An example data set
df = pd.DataFrame({"sentences": [
"two long pieces of metal fixed together, each of which bends a different amount when they are both heated to the same temperature",
"the temperature at which a liquid boils",
"a system for measuring temperature that is part of the metric system, in which water freezes at 0 degrees and boils at 100 degrees",
"a unit for measuring temperature. Measurements are often expressed as a number followed by the symbol °",
"a system for measuring temperature in which water freezes at 32º and boils at 212º"
]})
# Create a new series which is a list of words in each "sentences" column
df['words'] = df['sentences'].apply(lambda sentence: sentence.split(" "))
# Try to group by this new column
df.groupby('words').count()
# TypeError: unhashable type: 'list'
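But my code raises the error shown in the last comment above.
Since my task is a bit complicated, I understand it may take more than a single call to groupby(). Can someone help me group the sentences by word with pandas?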
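One possible approach, offered as a sketch rather than a definitive answer: turn each sentence into a list of words, use Series.explode (available from pandas 0.25) so that every (row, word) pair gets its own row, and then group by word. The regex tokenisation, the lowercasing, and the names `tokens`, `pairs` and `groups` are assumptions of this sketch and are not part of the code above.

```python
import pandas as pd

df = pd.DataFrame({"sentences": [
    "two long pieces of metal fixed together, each of which bends a different amount when they are both heated to the same temperature",
    "the temperature at which a liquid boils",
    "a system for measuring temperature that is part of the metric system, in which water freezes at 0 degrees and boils at 100 degrees",
    "a unit for measuring temperature. Measurements are often expressed as a number followed by the symbol °",
    "a system for measuring temperature in which water freezes at 32º and boils at 212º",
]})

# Assumption: a "word" is a lowercase run of word characters, so that
# "temperature." and "temperature" count as the same word.
tokens = df["sentences"].str.lower().str.findall(r"\w+")

# One row per (row number, word) pair; duplicates within a sentence are dropped.
pairs = tokens.explode().reset_index()
pairs.columns = ["row", "word"]
pairs = pairs.drop_duplicates()

# Map each word to the list of row numbers whose sentence contains it.
groups = pairs.groupby("word")["row"].apply(list)

for word in ["temperature", "freezes", "metal", "the"]:
    print(word, groups[word])
```

With this layout a sentence appears once per distinct word it contains, which is what lets a single row belong to several groups.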
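Edit: After fixing the error by returning tuple(sentence.split()) (thanks, ethan-furman), I tried printing the result, but it doesn't seem to do anything. I think it is probably just putting each row into its own group: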
print(df.groupby('words').count())
# sentences 5
# dtype: int64
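To illustrate why the tuple version appears to do nothing: groupby compares whole key values, so two rows land in the same group only when their tuples are identical. Here is a small sketch with three made-up sentences (hypothetical data, not the data set above; `demo`, `n_distinct` and `n_groups` are names chosen for illustration):

```python
import pandas as pd

# Hypothetical mini data set: "water" is shared by the first two sentences.
demo = pd.DataFrame({"sentences": ["water boils", "water freezes", "metal bends"]})

# Tuples are hashable, so groupby no longer raises TypeError...
demo["words"] = demo["sentences"].apply(lambda s: tuple(s.split()))

# ...but each whole tuple is a single key, so a shared word does not merge rows.
n_distinct = demo["words"].nunique()
n_groups = demo.groupby("words").ngroups
print(n_distinct, n_groups)
```

Every sentence produces a different tuple, so the number of groups equals the number of rows even though some sentences share a word.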
That fixed the error, but I still can't get the right result (see edit) – Miguel