2015-12-09

I have a set of sentences that I want to group so that all the rows in a group share a particular word. However, a single sentence can belong to many groups, because it contains many words.

So in the example below, there should be groups like this:

  • A 'temperature' group, which includes all the rows (0, 1, 2, 3 and 4)
  • A 'freezes' group, which includes rows 2 and 4
  • A 'the' group, which includes rows 0, 1, 2 and 3
  • A 'metal' group, which only includes row 0
  • A group for every other word in the dataset
import pandas as pd 

# An example data set 
df = pd.DataFrame({"sentences": [ 
    "two long pieces of metal fixed together, each of which bends a different amount when they are both heated to the same temperature", 
    "the temperature at which a liquid boils", 
    "a system for measuring temperature that is part of the metric system, in which water freezes at 0 degrees and boils at 100 degrees", 
    "a unit for measuring temperature. Measurements are often expressed as a number followed by the symbol °", 
    "a system for measuring temperature in which water freezes at 32º and boils at 212º" 
]}) 

# Create a new series which is a list of words in each "sentences" column 
df['words'] = df['sentences'].apply(lambda sentence: sentence.split(" ")) 

# Try to group by this new column 
df.groupby('words').count() 

# TypeError: unhashable type: 'list' 

But my code raises the error shown in the last comment above. Since my task is a bit more complicated, I know it probably takes more than just calling groupby(). Can someone help me do this word grouping with pandas?

Edit: after fixing the error by returning tuple(sentence.split()) (thanks to Ethan Furman), I tried printing the result, but it doesn't seem to do anything useful. I think it has probably just put each row into its own group:

print(df.groupby('words').count()) 

# sentences 5 
# dtype: int64 
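For reference, a minimal sketch (on a hypothetical two-row frame) of why grouping by the full tuple of words puts each distinct sentence in its own group:

```python
import pandas as pd

# Two sentences that share a word but are not identical
df = pd.DataFrame({"sentences": ["cold water", "cold air"]})
df["words"] = df["sentences"].apply(lambda s: tuple(s.split()))

# Each row has a distinct tuple key, so groupby yields one group per row
grouped = df.groupby("words").count()
print(len(grouped))  # 2 groups, not one group per shared word
```

Grouping by the whole tuple can only ever find exact duplicates; it cannot place one row into several groups, which is why a different structure is needed.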

Answers


My current solution uses pandas' MultiIndex functionality. I'm sure it could be improved with some more efficient use of numpy, but I believe it will perform significantly better than the other pure-Python answer:

import pandas as pd 
import numpy as np 

# An example data set 
df = pd.DataFrame({"sentences": [ 
    "two long pieces of metal fixed together, each of which bends a different amount when they are both heated to the same temperature", 
    "the temperature at which a liquid boils", 
    "a system for measuring temperature that is part of the metric system, in which water freezes at 0 degrees and boils at 100 degrees", 
    "a unit for measuring temperature. Measurements are often expressed as a number followed by the symbol °", 
    "a system for measuring temperature in which water freezes at 32º and boils at 212º" 
]}) 

# Create a new series which is a list of words in each "sentences" column 
df['words'] = df['sentences'].apply(lambda sentence: sentence.split(" ")) 

# This is all the words in the dataset. Each word will be its own index (level of the MultiIndex) 
names = np.unique(df['words'].sum()) 

# Create a tuple for each row of data
# Each tuple contains True if the row has that word in it, and False if it does not
values = df['words'].map(
    lambda words: tuple(name in words for name in names)
)

# Make a MultiIndex
index = pd.MultiIndex.from_tuples(values, names=names) 

# Add the MultiIndex without creating a new data frame 
df.set_index(index, inplace=True) 

# Find all the rows that have the word 'temperature' 
xs = df.xs(True, level='temperature') 

print(xs.to_string(index=False)) 
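The xs lookup generalizes to any word in the index. A self-contained sketch on a smaller hypothetical two-sentence frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"sentences": ["water freezes", "water boils"]})
df['words'] = df['sentences'].apply(lambda s: s.split())

# One index level per unique word, built as in the answer above
names = np.unique(df['words'].sum())
values = df['words'].map(lambda words: tuple(name in words for name in names))
index = pd.MultiIndex.from_tuples(list(values), names=list(names))
df = df.set_index(index)

# Any word-level can be cross-sectioned the same way
print(df.xs(True, level='freezes')['sentences'].tolist())  # ['water freezes']
print(df.xs(True, level='water')['sentences'].tolist())    # ['water freezes', 'water boils']
```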

To fix your TypeError you can change your lambda to

lambda sentence: tuple(sentence.split()) 

which will return a tuple instead of a list (and tuples, unlike lists, are hashable).
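A quick illustration of the difference:

```python
# A tuple of words can be hashed, so pandas can use it as a group key
print(hash(("liquid", "boils")) is not None)  # True

# The equivalent list cannot
try:
    hash(["liquid", "boils"])
except TypeError as err:
    print(err)  # unhashable type: 'list'
```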


This fixes the error, but I still can't get the right result (see my edit) – Miguel


You can use a set so that every word is unique. First, we need to collect all the words across all the sentences. To do that, we initialize words as an empty set and then use a list comprehension to add each lowercased word from each sentence (after splitting the sentence).

Next, we use a dictionary comprehension to build a dictionary keyed by each word. Each value is a dataframe containing every sentence that contains that word. These are obtained by grouping on df.sentences.str.contains(word, case=False) and then taking the group for which this condition is True.

words = set() 
_ = [words.add(word.lower()) for sentence in df.sentences for word in sentence.split()] 

word_dict = {word: df.groupby(df.sentences.str.contains(word, case=False)).get_group(True) 
      for word in words} 

>>> word_dict['temperature'] 
              sentences 
0 two long pieces of metal fixed together, each ... 
1   the temperature at which a liquid boils 
2 a system for measuring temperature that is par... 
3 a unit for measuring temperature. Measurements... 
4 a system for measuring temperature in which wa... 

>>> word_dict['freezes'] 
              sentences 
2 a system for measuring temperature that is par... 
4 a system for measuring temperature in which wa... 

>>> words 
{'0', 
'100', 
'212\xc2\xba', 
'32\xc2\xba', 
'a', 
'amount', 
'and', 
'are', 
'as', 
'at', 
'bends', 
... 

To get a dictionary of the index values for each word:

>>> {word: word_dict[word].index.tolist() for word in word_dict} 
{'0': [2], 
'100': [2], 
'212\xc2\xba': [4], 
'32\xc2\xba': [4], 
'a': [0, 1, 2, 3, 4], 
'amount': [0], 
'and': [2, 4], 
'are': [0, 3], 
'as': [2, 3, 4], 
'at': [0, 1, 2, 3, 4], 
'bends': [0], 
'boils': [1, 2, 4], 
'both': [0], 
'by': [3], 
'degrees': [2], 
'different': [0], 
'each': [0], 
'expressed': [3], 
'fixed': [0], 
'followed': [3], 
'for': [2, 3, 4], 
'freezes': [2, 4], 
... 

Or a boolean indicator matrix:

>>> [df.sentences.str.contains(word, case=False).tolist() for word in word_dict] 
[[False, False, True, False, True], 
[False, False, False, True, False], 
[True, False, False, False, False], 
[False, False, True, False, False], 
...
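The same indicator columns can also be collected into a labelled DataFrame, which keeps each word attached to its column (a sketch on a hypothetical two-sentence frame; word_matrix is a name introduced here):

```python
import pandas as pd

df = pd.DataFrame({"sentences": ["water freezes", "water boils"]})
words = {w.lower() for s in df.sentences for w in s.split()}

# One boolean column per word: True where the sentence contains it
word_matrix = pd.DataFrame(
    {word: df.sentences.str.contains(word, case=False) for word in words}
)
print(word_matrix['freezes'].tolist())  # [True, False]
```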