Why can't I split a column into two columns in pandas?

I have a DataFrame column, df['anno'], that looks like this:

df.anno 

0   type I secretion outer membrane protein, TolC... 
1   conserved hypothetical protein [Shigella boyd... 
2    Transposase [Congregibacter litoralis KT71] 
3   Chain A, The Crystal Structure Of Chlorite Di... 
4   chlorite dismutase, partial [uncultured bacte... 
5   carbamoyl-phosphate synthase, small subunit [... 
6   anthranilate synthase component 1 [endosymbio... 
7   chlorite dismutase, partial [bacterium enrich... 
8   peptidase dimerization domain protein [Myroid... 
9   MULTISPECIES: MFS transporter [Enterobacteria... 
10  CAAX amino terminal protease family protein [... 
11  Fe-S oxidoreductase [Desulfovibrio africanus ... 
12  phosphoenolpyruvate synthase/pyruvate phospha...

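Each row has two parts: 1. the protein name; 2. the microbial species, enclosed in '[......]'.

I want to extract the protein name and discard the microbial species, so I decided to first split the column into two columns at the '['.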

df2 = pd.DataFrame(df.anno.str.split("[", 1).tolist(), columns = ['protein','species'])

It returns an error:

TypeError: object of type 'NoneType' has no len()

I also tried:

df[['protein','species']] = df['anno'].str.split('[', expand=True)

It also returned an error:

ValueError: Columns must be same length as key

How can I do this? Is there another way to extract the protein name? Thanks!

Source

2017-09-17 stevex

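I think the problem is that some values contain more than one [, so I added n=1 to split so that it splits only at the first [. To remove the trailing ], use rstrip: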

df[['protein','species']] = df['anno'].str.rstrip(']').str.split('[', expand=True, n=1)
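
A quick way to check this diagnosis, as a minimal sketch on made-up values (the two annotation strings are assumptions, not the asker's data): without a limit, a value with two [ splits into three pieces, so expand=True produces three columns, which cannot be assigned to two column names.

```python
import pandas as pd

# Hypothetical sample; the second value contains two '[' characters.
df = pd.DataFrame({'anno': ['Transposase [KT71]', 'protein [q][sd]']})

# No limit: one column per piece, so the widest row decides the column count.
unlimited = df['anno'].str.split('[', expand=True)

# n=1: split at the first '[' only, so there are at most two columns.
limited = df['anno'].str.rstrip(']').str.split('[', expand=True, n=1)

print(unlimited.shape[1])  # column count without a limit
print(limited.shape[1])    # column count with n=1
```

Comparing the two shape values on the real column should show whether extra [ characters are the cause there as well.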

To split at the last [ instead, use rsplit:

df[['protein','species']] = df['anno'].str.rstrip(']').str.rsplit('[', expand=True, n=1)
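
To see how the two variants differ, here is a small sketch on a single made-up value with two bracketed parts (the string is an assumption for illustration):

```python
import pandas as pd

# Hypothetical value with two bracketed parts; the trailing ']' is stripped first.
s = pd.Series(['protein [q][sd]']).str.rstrip(']')

first = s.str.split('[', expand=True, n=1)   # cut at the first '['
last = s.str.rsplit('[', expand=True, n=1)   # cut at the last '['

print(first.iloc[0].tolist())
print(last.iloc[0].tolist())
```

With split, everything after the first [ ends up in the second column; with rsplit, only the last bracketed part does.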

Another solution uses extract to capture the contents of the last []:

df[['protein','species']] = df['anno'].str.extract(r'(.*)\[(.*)\]', expand=True)
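
A minimal sketch of how this pattern behaves, on made-up values (both strings are assumptions): the first group is greedy, so the match cuts at the last [, and a value with no [...] at all does not match, which leaves NaN in both columns.

```python
import pandas as pd

# Hypothetical values: one with two bracketed parts, one with none.
s = pd.Series(['protein [q][sd]', 'protein'])

# Same pattern as above; the greedy first group runs up to the last '['.
out = s.str.extract(r'(.*)\[(.*)\]', expand=True)
print(out)
```

Because unmatched rows come back empty in both columns, extract with this pattern loses the protein name wherever the species is missing.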

Sample:

df[['protein','species']] = df['anno'].str.rstrip(']').str.split('[', expand=True, n=1)
df['species'] = df['species'].str.replace('][', ',', regex=False)
df['protein'] = df['protein'].str.strip()
print (df)
                 anno      protein species
0     protein [q][sd]      protein    q,sd
1             protein      protein    None
2  Transposase [KT71]  Transposase    KT71
3                None         None    None

Source

2017-09-17 18:01:26 jezrael

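Thank you very much. It works! One question: row [3] has the following value: 'Chain A, The Crystal Structure Of Chlorite Dismutase: A Detoxifying Enzyme Producing Molecular Oxygen'. It has no species name. If I run this command, I get NaN for both 'protein' and 'species'. What should I do if I want to keep the protein name? – stevex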

The solution with 'str.split' should work. – jezrael

The problem is that some values in the column have no '[species name..]'. If I use str.extract, it returns NaN. If I use str.split, it returns an error. – stevex
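
One possible way to keep the protein name when the species is missing, offered as a sketch rather than as the answerer's method: make the bracketed part optional in the extract pattern. The pattern, the group names, and the sample rows below are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows: with a species, without one, and a missing value.
df = pd.DataFrame({'anno': ['Transposase [KT71]',
                            'Chain A, The Crystal Structure',
                            None]})

# The species group is optional, so rows without '[...]' still match
# and keep their protein name; only truly missing values stay NaN.
pattern = r'^(?P<protein>[^\[]*?)\s*(?:\[(?P<species>.*)\])?$'
parts = df['anno'].str.extract(pattern, expand=True)

df['protein'] = parts['protein']
df['species'] = parts['species']
print(df)
```

The lazy first group stops before the first [, and the \s* that follows drops the space in front of the bracket, so no separate strip step is needed in this sketch.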
