2017-08-27 76 views
0

I wrote some code to help me extract rows from a CSV file based on specific keywords.

import re

# all your keywords
keywords = {"metal", "energy", "team", "sheet", "solar", "financial",
            "transportation", "electrical", "scientists",
            "electronic", "workers"}

# currently matches only a single keyword
keyre = re.compile("energy", re.IGNORECASE)
with open("2006-data-8-8-2016.csv") as infile:
    with open("new_data.csv", "w") as outfile:
        outfile.write(infile.readline())  # Save the header
        for line in infile:
            if len(keyre.findall(line)) > 0:
                outfile.write(line)

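I need it to look for every keyword in two main columns of the data, "position" and "Job description", and then write the whole rows containing those words into a new file. Any ideas on the simplest way to do this?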

+0

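I need it to go through all the keywords. For example, it should find the rows that include the word "metal" under "position" and "Job description", extract the whole rows and write them to the file, then look for the second word and do the same, and so on until the last word –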

Answers

0

Try this: loop over the DataFrame and write the new DataFrame back to a CSV file.

import pandas as pd

# all your keywords
keywords = {"metal", "energy", "team", "sheet", "solar", "financial",
            "transportation", "electrical", "scientists",
            "electronic", "workers"}

df = pd.read_csv("2006-data-8-8-2016.csv", sep=",")

listMatchPosition = []
listMatchDescription = []

# keep a row when any keyword occurs in either column
# (case-sensitive substring match; assumes both cells are strings)
for i in range(len(df.index)):
    if any(x in df['position'][i] or x in df['Job description'][i] for x in keywords):
        listMatchPosition.append(df['position'][i])
        listMatchDescription.append(df['Job description'][i])

output = pd.DataFrame({'position': listMatchPosition, 'Job description': listMatchDescription})
output.to_csv("new_data.csv", index=False)

Edit: if you have many columns to include, the modified code below will do the job.

df = pd.read_csv("2006-data-8-8-2016.csv", sep=",")

# empty DataFrame with the same columns as the input
output = pd.DataFrame(columns=df.columns)

for i in range(len(df.index)):
    if any(x in df['position'][i] or x in df['Job description'][i] for x in keywords):
        # copy every column of the matching row
        output.loc[len(output)] = [df[j][i] for j in df.columns]

output.to_csv("new_data.csv", index=False)
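
As a possible alternative to the row-by-row loop above, the same filtering can be sketched with a boolean mask, which keeps every column automatically. This is only a sketch under assumptions: the column names `position` and `Job description` come from the question, while the helper name `filter_rows` and the sample rows are made up for illustration. It matches case-insensitively, which differs from the loop above.

```python
import pandas as pd

KEYWORDS = ["metal", "energy", "team", "sheet", "solar", "financial",
            "transportation", "electrical", "scientists",
            "electronic", "workers"]


def filter_rows(df, columns, keywords):
    """Return the rows of df where any listed column contains any keyword.

    Hypothetical helper: plain, case-insensitive substring matching.
    """
    mask = pd.Series(False, index=df.index)
    for col in columns:
        # Missing cells become empty strings so they never match.
        text = df[col].fillna("").astype(str).str.lower()
        for word in keywords:
            mask |= text.str.contains(word.lower(), regex=False)
    return df[mask]


# Made-up sample data standing in for the real CSV file.
sample = pd.DataFrame({
    "position": ["Solar Engineer", "Chef", "Clerk"],
    "Job description": ["Designs panels", "Cooks meals", "Supports the finance team"],
    "salary": [100, 50, 40],
})
result = filter_rows(sample, ["position", "Job description"], KEYWORDS)
print(result)
```

With the real file, `sample` would be replaced by the result of `pd.read_csv(...)`, and `result.to_csv("new_data.csv", index=False)` would write the matching rows.
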
+0

Note that this works even if "Job description" is not just a single word, which I think is the case, unlike the DataFrame.isin method –

+0

However, the CSV file also contains other columns that I need to extract and put into the new file. Any idea how? @Vincent K –

+0

Do you mean that columns like "salary" and "location" need to be extracted as well? If so, and it is only a few more columns, just add more listMatchxxx lists –

0

You can do this with pandas as follows, if you are looking for rows where the column value is exactly one of the keywords in the list:

import pandas as pd

keywords = ["metal", "energy", "team", "sheet", "solar", "financial",
            "transportation", "electrical", "scientists",
            "electronic", "workers"]

# read the csv data into a dataframe
# change "," to the data separator in your csv file
df = pd.read_csv("2006-data-8-8-2016.csv", sep=",")
# filter the data: keep only the rows whose position or Job description
# value is exactly equal to one of the keywords
df = df[df["position"].isin(keywords) | df["Job description"].isin(keywords)]
# write the data back to a csv file
df.to_csv("new_data.csv", sep=",", index=False)
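
To make the difference between exact matching and substring matching concrete, here is a small sketch on made-up values (the three sample strings are not from the asker's file). `isin` compares whole cell values, while `str.contains` looks inside each cell.

```python
import pandas as pd

# Made-up cell values for illustration.
positions = pd.Series(["energy", "Energy analyst", "financial engineering"])
keywords = ["energy", "financial"]

# isin: True only when the whole cell equals one of the keywords.
exact = positions.isin(keywords)

# str.contains: True when a keyword occurs anywhere inside the cell.
partial = positions.str.contains("|".join(keywords), case=False, na=False)

print(exact.tolist())    # [True, False, False]
print(partial.tolist())  # [True, True, True]
```
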

If you are looking for substrings within the rows (for example, finding financial in financial engineering), then you can do the following:

import pandas as pd

keywords = ["metal", "energy", "team", "sheet", "solar", "financial",
            "transportation", "electrical", "scientists",
            "electronic", "workers"]
# regular expression that matches any one of the keywords
searched_keywords = '|'.join(keywords)

# read the csv data into a dataframe
# change "," to the data separator in your csv file
df = pd.read_csv("2006-data-8-8-2016.csv", sep=",")
# filter the data: keep only the rows that contain one of the keywords
# in the position or the Job description columns
# case=False ignores upper/lower case; na=False treats empty cells as non-matching
df = df[df["position"].str.contains(searched_keywords, case=False, na=False)
        | df["Job description"].str.contains(searched_keywords, case=False, na=False)]
# write the data back to a csv file
df.to_csv("new_data.csv", sep=",", index=False)
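
One caveat with joining keywords using `|` is that the result is a regular expression, so a keyword such as team also matches inside Steam, and keywords containing special characters would be misread. The sketch below shows one way to tighten this, assuming whole-word matches are what is wanted; the keyword "c++" and the sample rows are made up for illustration.

```python
import re

import pandas as pd

keywords = ["metal", "team", "c++"]  # "c++" is a made-up keyword

# Escape each keyword so it is matched literally, and require that the
# match is not directly attached to another word character.
pattern = r"(?<!\w)(?:" + "|".join(re.escape(k) for k in keywords) + r")(?!\w)"

# Made-up sample data; None stands in for an empty cell.
df = pd.DataFrame({
    "position": ["Steam fitter", "Team lead", "C++ developer", None],
    "Job description": ["Fits pipes", "Leads people", "Writes code", "Sheet metal work"],
})

mask = (df["position"].str.contains(pattern, case=False, na=False)
        | df["Job description"].str.contains(pattern, case=False, na=False))
matched = df[mask]
print(matched)
```

Here "Steam fitter" is not matched, because team only appears inside a longer word, while the row with an empty position is still kept through its Job description.
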
+0

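This is simple and looks good, and I understand the code. But it doesn't save any data, only the header :( even though I'm sure many of the keywords are present in the file, specifically in the position and Job description columns. @MedAli –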

+0

@Eng.Reem Could you share a sample of your data? – MedAli

+0

This doesn't work because the "Job description" column is not just a single word –
