2017-04-19

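I have two types of files, Excel and CSV, from which I read data. They always contain two permanent columns, Question and Answer, and may or may not contain two optional columns, Word and Replacement. How can I read the data from an Excel or CSV file depending on which columns are available?

I have written separate functions for reading CSV and Excel files; the right one is called based on the file extension.

Is there a way to read the optional columns (Word and Replacement) when they exist and skip them when they do not? Please see the function definitions below:

1) Function to read the CSV file: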

import string
import pandas

def read_csv_file(path):
    # Because names= is passed, pandas does not read a header row from the file.
    colnames = ['Question', 'Answer']
    data = pandas.read_csv(path, names=colnames)
    quesData = data.Question.tolist()
    ansData = data.Answer.tolist()

    # Strip punctuation from every question.
    qWithoutPunctuation = [''.join(c for c in s if c not in string.punctuation) for s in quesData]

    # Drop any non-ASCII characters.
    asciiIgnoreQues = [x.encode('ascii', 'ignore') for x in qWithoutPunctuation]

    return asciiIgnoreQues, ansData, quesData

2) Function to read the Excel data:

import string
from xlrd import open_workbook  # assuming open_workbook comes from xlrd

def read_excel_file(path):
    book = open_workbook(path)
    sheet = book.sheet_by_index(0)
    quesData = []
    ansData = []

    # Start at row 1, so row 0 is treated as a header and skipped.
    for row in range(1, sheet.nrows):
        quesData.append(sheet.cell(row, 0).value)
        ansData.append(sheet.cell(row, 1).value)

    # Strip punctuation from every question.
    qWithoutPunctuation = [''.join(c for c in s if c not in string.punctuation) for s in quesData]

    # Drop any non-ASCII characters.
    asciiIgnoreQues = [x.encode('ascii', 'ignore') for x in qWithoutPunctuation]

    return asciiIgnoreQues, ansData, quesData

Have you considered 'pandas.read_csv' and 'pandas.read_excel'? They automatically read whichever columns are present. – tmrlvi

@tmrlvi, I do use pandas.read_csv in my CSV function, but the column headers have to be supplied in colnames. What if the Word and Replacement columns are not there? –

You don't have to supply them. If you don't, 'pandas' infers the names from the header row. Or does your data not contain headers? – tmrlvi

Answer

I'm not entirely sure what you are trying to achieve, but here is one way to read and transform the data:

import string
import pandas as pd

def read_file(path, typ):
    if typ == "excel":
        # Read the first sheet (0 is the default); older pandas versions spelled this sheetname
        df = pd.read_excel(path, sheet_name=0)
    else:  # Assuming "csv". You can make it explicit
        df = pd.read_csv(path)

    # Strip punctuation from each question, then drop non-ASCII characters
    qWithoutPunctuation = df["Question"].apply(lambda s: ''.join(c for c in s if c not in string.punctuation))
    df["asciiIgnoreQues"] = qWithoutPunctuation.apply(lambda x: x.encode('ascii', 'ignore'))

    return df

# Call it like this:
read_file("file1.csv", "csv")
read_file("file2.xls", "excel")
read_file("file2.xlsx", "excel")

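If the data does not include Word and Replacement, read_file returns a DataFrame with the columns ["Question", "Answer", "asciiIgnoreQues"]. If the data does include them, the columns are ["Question", "Word", "Replacement", "Answer", "asciiIgnoreQues"].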
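To act on the optional columns only when they exist, one option is to check the DataFrame's column names after loading. The sketch below is illustrative rather than part of the answer above: `OPTIONAL_COLUMNS` and `split_optional` are hypothetical names, and the two in-memory CSV strings are made-up sample data standing in for real files with a header row.

```python
import io

import pandas as pd

# Hypothetical constant: the columns that may or may not be in the file.
OPTIONAL_COLUMNS = ["Word", "Replacement"]


def split_optional(df):
    """Return (present, missing) lists of the optional column names."""
    present = [c for c in OPTIONAL_COLUMNS if c in df.columns]
    missing = [c for c in OPTIONAL_COLUMNS if c not in df.columns]
    return present, missing


# Made-up sample data; read_csv infers the column names from the header row.
with_optional = pd.read_csv(
    io.StringIO("Question,Word,Replacement,Answer\nHi?,Hi,Hello,Yes\n")
)
without_optional = pd.read_csv(io.StringIO("Question,Answer\nHi?,Yes\n"))

print(split_optional(with_optional))
print(split_optional(without_optional))
```

The same check should work on a DataFrame loaded with read_excel, since it only looks at df.columns.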

Note that I used apply, which lets you run a function element-wise over an entire Series.
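A minimal sketch of that element-wise behaviour, using made-up sample questions. The helper name `strip_punctuation` is hypothetical, and the final `.decode("ascii")` is an addition that the answer's code does not have: it assumes you want text back rather than bytes on Python 3.

```python
import string

import pandas as pd


def strip_punctuation(text):
    """Remove every character listed in string.punctuation."""
    return "".join(c for c in text if c not in string.punctuation)


# Made-up sample data; \u00e9 is a single non-ASCII character (e with acute).
questions = pd.Series(["What's up?", "Caf\u00e9, open?"])

# apply calls the function once per element and returns a new Series.
cleaned = questions.apply(strip_punctuation)

# Drop non-ASCII characters, then decode so the values stay str, not bytes.
ascii_only = cleaned.apply(lambda s: s.encode("ascii", "ignore").decode("ascii"))

print(ascii_only.tolist())
```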