How to replace Python Pandas table text values with unique IDs?

I am using pandas to read a file in this format:

fp = pandas.read_table("Measurements.txt") 
fp.head() 

"Aaron", 3, 5, 7 
"Aaron", 3, 6, 9 
"Aaron", 3, 6, 10 
"Brave", 4, 6, 0 
"Brave", 3, 6, 1

I want to replace each name with a unique ID, so that the output looks like this:

"1", 3, 5, 7 
"1", 3, 6, 9 
"1", 3, 6, 10 
"2", 4, 6, 0 
"2", 3, 6, 1

How can I do this?

Thanks!

Source

2016-08-14 Kingua

I would make use of the categorical dtype:

In [97]: x['ID'] = x.name.astype('category').cat.rename_categories(range(1, x.name.nunique()+1)) 

In [98]: x 
Out[98]: 
    name  v1  v2  v3 ID
0  Aaron   3   5   7  1
1  Aaron   3   6   9  1
2  Aaron   3   6  10  1
3  Brave   4   6   0  2
4  Brave   3   6   1  2
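For anyone who wants to run this end to end, here is a minimal sketch. The DataFrame `x` is rebuilt by hand from the sample rows, and the column names `name`, `v1`, `v2`, `v3` are assumed from the output above rather than taken from the original file.

```python
import pandas as pd

# Hypothetical reconstruction of the sample table; column names are assumed.
x = pd.DataFrame({
    "name": ["Aaron", "Aaron", "Aaron", "Brave", "Brave"],
    "v1": [3, 3, 3, 4, 3],
    "v2": [5, 6, 6, 6, 6],
    "v3": [7, 9, 10, 0, 1],
})

# Convert the text column to a categorical; its categories are the unique names.
names = x["name"].astype("category")

# Rename the categories to 1..n, one integer per distinct name.
n_names = len(names.cat.categories)
x["ID"] = names.cat.rename_categories(list(range(1, n_names + 1)))

print(x)
```

The new `ID` column stays categorical, so it can be converted with `astype(int)` if a plain integer column is needed.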

If you need string IDs instead of numeric ones, you can use:

x.name.astype('category').cat.rename_categories([str(i) for i in range(1, x.name.nunique()+1)])

Or, as @MedAli has already mentioned in his answer, use the factorize() method. Demo:

In [141]: x['cat'] = pd.Categorical((pd.factorize(x.name)[0] + 1).astype(str)) 

In [142]: x 
Out[142]: 
    name  v1  v2  v3 ID cat
0  Aaron   3   5   7  1   1
1  Aaron   3   6   9  1   1
2  Aaron   3   6  10  1   1
3  Brave   4   6   0  2   2
4  Brave   3   6   1  2   2

In [143]: x.dtypes 
Out[143]: 
name      object
v1         int64
v2         int64
v3         int64
ID      category
cat     category
dtype: object

In [144]: x['cat'].cat.categories 
Out[144]: Index(['1', '2'], dtype='object')

Or have the categories as integers:

In [154]: x['cat'] = pd.Categorical((pd.factorize(x.name)[0] + 1)) 

In [155]: x 
Out[155]: 
    name  v1  v2  v3 ID cat
0  Aaron   3   5   7  1   1
1  Aaron   3   6   9  1   1
2  Aaron   3   6  10  1   1
3  Brave   4   6   0  2   2
4  Brave   3   6   1  2   2

In [156]: x['cat'].cat.categories 
Out[156]: Int64Index([1, 2], dtype='int64')

Explanation:

In [99]: x.name.astype('category') 
Out[99]: 
0    Aaron
1    Aaron
2    Aaron
3    Brave
4    Brave
Name: name, dtype: category
Categories (2, object): [Aaron, Brave]

In [100]: x.name.astype('category').cat.categories 
Out[100]: Index(['Aaron', 'Brave'], dtype='object') 

In [101]: x.name.astype('category').cat.rename_categories([1,2]) 
Out[101]: 
0    1
1    1
2    1
3    2
4    2
dtype: category
Categories (2, int64): [1, 2]

Explanation for the factorize() method:

In [157]: (pd.factorize(x.name)[0] + 1) 
Out[157]: array([1, 1, 1, 2, 2]) 

In [158]: pd.Categorical((pd.factorize(x.name)[0] + 1)) 
Out[158]: 
[1, 1, 1, 2, 2] 
Categories (2, int64): [1, 2]
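One difference between the two approaches may matter when the names are not already sorted: factorize() numbers names in order of first appearance, while the categorical dtype numbers them in sorted order. The sketch below uses made-up names, and the `sort=True` argument and `.cat.codes` accessor are additions for illustration that do not appear in the answer above.

```python
import pandas as pd

# Made-up names, deliberately not in alphabetical order.
s = pd.Series(["Brave", "Aaron", "Brave", "Carol"])

# factorize() assigns codes in order of first appearance, starting at 0.
codes_appearance, uniques = pd.factorize(s)
ids_appearance = codes_appearance + 1

# With sort=True the codes follow the sorted unique values instead.
codes_sorted, uniques_sorted = pd.factorize(s, sort=True)
ids_sorted = codes_sorted + 1

# The categorical dtype also sorts its categories, so its codes match sort=True.
ids_category = s.astype("category").cat.codes + 1

print(ids_appearance)
print(ids_sorted)
print(ids_category.tolist())
```

On the sample data in the question both orderings give the same IDs, because "Aaron" appears first and also sorts first.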

Source

2016-08-14 19:43:37 MaxU

Thanks, I will try your suggestion now! I have hacked together a function that does the job for now, but your code looks like a more elegant solution. – Kingua

You can do this with a simple dictionary mapping. For example, say your data looks like this:

col1, col2, col3, col4 
"Aaron", 3, 5, 7 
"Aaron", 3, 6, 9 
"Aaron", 3, 6, 10 
"Brave", 4, 6, 0 
"Brave", 3, 6, 1

then simply

myDict = {"Aaron":"1", "Brave":"2"} 
fp["col1"] = fp["col1"].map(myDict)

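If you don't want to build a dictionary, use pandas.factorize, which will take care of encoding the column for you, starting from 0. You can find an example of how to use it here.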
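As a rough sketch of the factorize route, assuming the column is called `col1` as in the example above: pandas.factorize returns both the integer codes and the unique values, so the lookup table that was typed by hand in `myDict` can be derived instead. The name `id_to_name` is only illustrative.

```python
import pandas as pd

# Hypothetical frame using the column names from the example above.
fp = pd.DataFrame({
    "col1": ["Aaron", "Aaron", "Aaron", "Brave", "Brave"],
    "col2": [3, 3, 3, 4, 3],
})

# codes: one integer per row, starting at 0; uniques: the distinct names.
codes, uniques = pd.factorize(fp["col1"])

# Keep a reverse lookup so IDs can be translated back to names later.
id_to_name = dict(enumerate(uniques, start=1))

# Shift by one so the IDs start at 1 rather than 0.
fp["col1"] = codes + 1

print(fp)
print(id_to_name)
```

This scales to any number of names, since nothing has to be listed manually.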

Source

2016-08-14 19:38:33 MedAli

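Thanks! The number of names is very large, so I am trying factorize now. I have hacked together a function that does the job, but pandas.factorize looks like a more elegant solution. – Kingua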

Why not hash the names?

import hashlib
df["col0"] = df["col0"].apply(lambda x: hashlib.sha256(x.encode("utf-8")).hexdigest())

Using a hash this way, you don't need to care which names occur, i.e. you don't need to know them up front to build a dictionary mapping.
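To show what such IDs look like, here is a small self-contained sketch. The helper `sha256_hex` and the column `col0_hash` are hypothetical names chosen for the example.

```python
import hashlib

import pandas as pd

# Hypothetical single-column frame with a repeated name.
df = pd.DataFrame({"col0": ["Aaron", "Aaron", "Brave"]})


def sha256_hex(name):
    """Return the SHA-256 digest of a name as a hex string."""
    return hashlib.sha256(name.encode("utf-8")).hexdigest()


# Equal names always map to the same digest, with no lookup table needed.
df["col0_hash"] = df["col0"].apply(sha256_hex)

print(df)
```

The resulting IDs are 64-character hex strings that stay the same across runs and files, but they are not small integers, which matters if a downstream library expects simple numeric values.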

Source

2016-08-14 19:40:31 ChE

Thanks, I considered that, but the library I am using requires simple numeric values. – Kingua

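It looks like this question, Replace all occurrences of a string in a pandas dataframe, might have your answer. According to the documentation, pandas.read_table creates a DataFrame, and a DataFrame has a replace function.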

fp.replace({'Aaron': '1'}, regex=True)

Although you may not need the regex=True part, since it is a complete, direct replacement.
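A short sketch of why the regex=True part can matter, using made-up names and the Series version of replace: with regex=True the keys are treated as patterns and can match inside longer names, while the default only replaces values that match a key exactly.

```python
import pandas as pd

# Made-up names; "Aaronson" contains "Aaron" as a substring.
names = pd.Series(["Aaron", "Aaronson", "Brave"], dtype=object)
mapping = {"Aaron": "1", "Brave": "2"}

# Default behaviour: only values equal to a key are replaced.
exact = names.replace(mapping)

# regex=True: keys are regular expressions, so substrings are replaced too.
as_regex = names.replace(mapping, regex=True)

print(exact.tolist())
print(as_regex.tolist())
```

For replacing whole names with IDs, the default exact match is probably the safer choice.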

Source

2016-08-14 19:55:22 Alan

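Thanks, I have been doing it this way when there are fewer names, but in this case the number of names is greater than 1000, so I can't pick them one by one. I have hacked together a function that does the job, but pandas.factorize looks like a more elegant solution. – Kingua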
