Reading from a file in Python

I have a text file whose contents and number of lines can change each time. Every line has this form:

string (can contain one word, two or even more)^string of one word 
Example:



level country^layla 
hello sandra^organization 
hello people^layla 
hello samar^organization

I want to use pandas to create a DataFrame like this:

item0 (country, people) 
item1 (sandra , samar)

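That is, every time a value such as layla appears after the "^", I want to collect the lines that belong to it and show, as the second column, only the rightmost word of each of those lines (here: country, people). layla is then labelled item0 and used as the index of the DataFrame. I can't work out the logic for finding every repeated value after the "^" and returning the list of rightmost words that belong to it. My attempt so far, which doesn't really do this, is: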

def text_file(file):
    # Read the file and split every line on spaces
    lines = []
    with open(file) as f:
        for l in f:
            l_dict = l.split(" ")
            lines.append(l_dict)
    return lines

def items(file_of_text): 

    list_of_items= text_file(file_of_text) 
    for a in list_of_items: 
     for b in a: 
      if a[-1]== 



def main(): 

    file_of_text = "text.txt" 

if __name__ == "__main__": 
    main()


2016-10-30 Lelo

Now that you have added new columns to the text file, what is your desired output? – Abdou

Start with pandas read_csv(), specifying "^" as the delimiter and using arbitrary column names:

import pandas as pd

# A single-character delimiter is taken literally, so "^" needs no escaping
df = pd.read_csv('data.csv', delimiter='^', names=['A', 'B'])
print(df)
               A             B
0  level country         layla
1   hello sandra  organization
2   hello people         layla
3    hello samar  organization

Then we split to get the value we want. The expand argument is new in pandas 0.16, I believe.

df['A'] = df['A'].str.split(' ', expand=True)[1]
print(df)
         A             B
0  country         layla
1   sandra  organization
2   people         layla
3    samar  organization
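To make the role of expand clearer, here is a small sketch on a hypothetical two-row Series (the sample values are taken from the question, the variable names are only for illustration). Without expand, each element becomes a list of words; with expand=True, each word lands in its own numbered column, which is why [1] picks the second word.

```python
import pandas as pd

# Hypothetical sample, mirroring column A before the split
s = pd.Series(["level country", "hello sandra"])

# Without expand: a Series whose elements are lists of words
as_lists = s.str.split(" ")

# With expand=True: a DataFrame with one column per word (0, 1, ...)
as_columns = s.str.split(" ", expand=True)

# Column 1 holds the second word of every row
second_word = as_columns[1]
print(second_word.tolist())
```

Note that selecting column 1 only gives the rightmost word when every row has exactly two words.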

Then we group by column B and apply the tuple function. Note: we reset the index so that we can use it later.

g = df.groupby('B')['A'].apply(tuple).reset_index()
print(g)
              B                  A
0         layla  (country, people)
1  organization    (sandra, samar)

Create the label from the string 'item' and the index:

g['item'] = 'item' + g.index.astype(str)
print(g[['item', 'A']])
    item                  A
0  item0  (country, people)
1  item1    (sandra, samar)


2016-10-30 01:59:21

TypeError: split() got an unexpected keyword argument 'expand'. Is there any way to avoid using expand? – Lelo

Yes, you can do something like `df['A'] = df['A'].map(lambda x: x.split()[1])` –

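You are assuming that A in the first df always has two words? That can change: it can have one, two, three or even more, which is why I get errors. Is there any way to avoid this? – Lelo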
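One possible way to handle a varying number of words, sketched here under the assumption that the wanted value is always the last whitespace-separated word of column A: split without expand and take element -1 of each resulting list instead of a fixed position. The sample DataFrame below is hypothetical.

```python
import pandas as pd

# Hypothetical data: column A holds one, two or four words
df = pd.DataFrame({
    "A": ["country", "hello sandra", "a few more people"],
    "B": ["layla", "organization", "layla"],
})

# Split on whitespace, then take the last word of every row
df["A"] = df["A"].str.split().str[-1]
print(df["A"].tolist())  # ['country', 'sandra', 'people']
```

Because the index -1 counts from the end, the same line works whether a row has one word or several.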

Let's assume your file is called file_of_text.txt and contains the following:

level country^layla 
hello sandra^organization 
hello people^layla 
hello samar^organization

You can get the data from the file into a DataFrame similar to the desired output with the following lines of code:

import re
import pandas as pd

def main(myfile):
    # Open the file and read the lines
    with open(myfile, 'r') as f:
        text = f.readlines()

    # Split each line on runs of whitespace and "^"
    text = [re.split(r"[\^\s]+", x.strip()) for x in text]

    # Put it in a DataFrame (assumes exactly three fields per line)
    data = pd.DataFrame(text, columns=['A', 'B', 'C'])

    # Create an output DataFrame with rows "item0" and "item1"
    final_data = pd.DataFrame(['item0', 'item1'], columns=['D'])

    # Create your desired column
    final_data['E'] = data.groupby('C')['B'].apply(lambda x: tuple(x.values)).values

    print(final_data)

if __name__ == "__main__":
    myfile = "file_of_text.txt"
    main(myfile)

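The idea is to read the lines from the text file, then split each line using the split method from the re module. The result is then passed to the DataFrame constructor to produce a DataFrame called data, which is used to create the desired DataFrame final_data. The result should look like this: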

# data

       A        B             C
0  level  country         layla
1  hello   sandra  organization
2  hello   people         layla
3  hello    samar  organization


# final_data

       D                  E
0  item0  (country, people)
1  item1    (sandra, samar)

Please take a look at the script and ask if you have any questions.

I hope this helps.


2016-10-30 01:08:28 Abdou

What if the length of each line in the file changes every time? – Lelo

Please provide data that matches the case where the lines in the file change each time. – Abdou

This also assumes that there are only 2 words before the "^", whereas that can vary. How can it be adjusted? – Lelo
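A minimal sketch of one way to drop the two-word assumption, not taken from either answer: split each line once at the last "^" with str.rpartition, keep only the last word of the left part, then group by the right part. The function name build_items, the column names key and words, and the sample lines are all hypothetical; the sketch also assumes that lines without a "^" can simply be skipped.

```python
import pandas as pd

def build_items(lines):
    """Group the last word before "^" by the value after "^"."""
    rows = []
    for line in lines:
        # Split at the last "^": left part is the phrase, right part the key
        left, sep, key = line.strip().rpartition("^")
        words = left.split()
        if not sep or not words:
            continue  # assumption: skip blank or malformed lines
        rows.append((key.strip(), words[-1]))

    df = pd.DataFrame(rows, columns=["key", "word"])
    # sort=False keeps the keys in order of first appearance
    grouped = df.groupby("key", sort=False)["word"].apply(tuple)
    result = grouped.reset_index(drop=True).to_frame("words")
    # Label the rows item0, item1, ... and use the labels as the index
    result.index = ["item" + str(i) for i in range(len(result))]
    return result

# Hypothetical lines with one, two and three words before "^"
sample = [
    "level country^layla",
    "hello sandra^organization",
    "people^layla",
    "hello dear samar^organization",
]
result = build_items(sample)
print(result)
```

To run it on a file, the lines can be passed straight from an open file object, for example `with open("file_of_text.txt") as f: build_items(f)`.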
