2017-09-11 33 views

Dataset: matching words

> df 
Id  Clean_Data 
1918916 Luxury Apartments consisting 11 towers Well equipped gymnasium Swimming Pool Toddler Pool Health Club Steam Room Sauna Jacuzzi Pool Table Chess Billiards room Carom Table Tennis indoor games 
1495638 near medavakkam junction calm area near global hospital 
1050651 No Pre Emi No Booking Amount No Floor Rise Charges No Processing Fee HLPROJECT HIGHLIGHTS 

The code below successfully returns the words and n-grams in the dataset that match the list of values in Category.py:

df['one_word_tokenized_text'] =df["Clean_Data"].str.split() 
df['bigram'] = df['Clean_Data'].apply(lambda row: list(ngrams(word_tokenize(row), 2))) 
df['trigram'] = df['Clean_Data'].apply(lambda row: list(ngrams(word_tokenize(row), 3))) 
df['four_words'] = df['Clean_Data'].apply(lambda row: list(ngrams(word_tokenize(row), 4))) 
token=pd.Series(df["one_word_tokenized_text"]) 
Lid=pd.Series(df["Id"]) 
matches= token.apply(lambda x: pd.Series(x).str.extractall("|".join(["({})".format(cat) for cat in Categories.HealthCare]))) 
match_list= [[m for m in match.values.ravel() if isinstance(m, str)] for match in matches] 
match_df = pd.DataFrame({"ID":Lid,"jc1":match_list}) 


def match_word(feature, row):
    # Collect every bigram, trigram and four-word phrase in the row
    # that appears verbatim (case-sensitively) in the feature list.
    categories = []

    for bigram in row.bigram:
        joined = ' '.join(bigram)
        if joined in feature:
            categories.append(joined)
    for trigram in row.trigram:
        joined = ' '.join(trigram)
        if joined in feature:
            categories.append(joined)
    for fourwords in row.four_words:
        joined = ' '.join(fourwords)
        if joined in feature:
            categories.append(joined)
    return categories

match_df['Health1'] = df.apply(partial(match_word, HealthCare), axis=1) 
match_df['HealthCare'] = match_df[match_df.columns[[1,2]]].apply(lambda x: ','.join(x.dropna().astype(str)),axis=1) 
Code:

Category.py

category = [('steam room', 'IN', 'HealthCare'),
            ('sauna', 'IN', 'HealthCare'),
            ('Jacuzzi', 'IN', 'HealthCare'),
            ('Aerobics', 'IN', 'HealthCare'),
            ('yoga room', 'IN', 'HealthCare')]
HealthCare = [e1 for (e1, rel, e2) in category if e2 == 'HealthCare']

Output:

ID HealthCare 
1918916 Jacuzzi 
1495638 
1050651 Aerobics, Jacuzzi, yoga room 

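Here, the code identifies and returns a value only if the feature in the category list is written in exactly the same letter case as in the dataset; otherwise it returns nothing. I want my code to be case-insensitive, so that it also picks up "Steam Room" and "Sauna" under the HealthCare category. I tried using the `.lower()` function, but I am not sure how to apply it.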

Answer


Edit 2: only Category.py is updated

Category.py

category = [('steam room','IN','HealthCare'), 
     ('sauna','IN','HealthCare'), 
     ('jacuzzi','IN','HealthCare'), 
     ('aerobics','IN','HealthCare'), 
     ('Yoga room','IN','HealthCare'), 
     ('booking','IN','HealthCare'),   
     ] 
category1 = [value[0].capitalize() for index, value in enumerate(category)] 
category2 = [value[0].lower() for index, value in enumerate(category)] 

test = [] 
test2 =[] 

for index, value in enumerate(category1): 
    test.append((value, category[index][1],category[index][2])) 

for index, value in enumerate(category2): 
    test2.append((value, category[index][1],category[index][2])) 

category = category + test + test2 


HealthCare = [e1 for (e1, rel, e2) in category if e2=='HealthCare'] 
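Note that `str.capitalize()` upper-cases only the first character of the whole string, so the generated variants include "Steam room" but not "Steam Room" or "SAUNA". As a minimal sketch of an alternative (not part of the original answer), both sides can be compared in lower case, so no variants need to be generated. The names `healthcare_keys` and `is_healthcare` are hypothetical and introduced only for illustration.

```python
# Hypothetical helper, not from the original answer: compare on a
# lower-cased key instead of enumerating case variants.
category = [('steam room', 'IN', 'HealthCare'),
            ('sauna', 'IN', 'HealthCare'),
            ('Jacuzzi', 'IN', 'HealthCare'),
            ('Aerobics', 'IN', 'HealthCare'),
            ('yoga room', 'IN', 'HealthCare')]

# Lower-cased feature names that belong to the HealthCare category.
healthcare_keys = set(e1.lower() for (e1, rel, e2) in category
                      if e2 == 'HealthCare')


def is_healthcare(phrase):
    """Return True if phrase names a HealthCare feature, ignoring case."""
    return phrase.lower() in healthcare_keys


print('steam room'.capitalize())   # the variant capitalize() produces
print(is_healthcare('Steam Room'))
print(is_healthcare('SAUNA'))
```

The category list itself stays as written; only the comparison is normalised.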

Your dataset, unchanged

import pandas as pd 
from nltk import ngrams, word_tokenize 
import Categories 
from Categories import * 
from functools import partial 


data = {'Clean_Data':['Luxury Apartments consisting 11 towers Well equipped gymnasium Swimming Pool Toddler Pool Health Club Steam Room Sauna Jacuzzi Pool Table Chess Billiards room Carom Table Tennis indoor games', 
        'near medavakkam junction calm area near global hospital', 
        'No Pre Emi No Booking Amount No Floor Rise Charges No Processing Fee HLPROJECT HIGHLIGHTS '], 
'Id' : [1918916, 1495638,1050651]} 

df = pd.DataFrame(data) 


df['one_word_tokenized_text'] =df["Clean_Data"].str.split() 
df['bigram'] = df['Clean_Data'].apply(lambda row: list(ngrams(word_tokenize(row), 2))) 
df['trigram'] = df['Clean_Data'].apply(lambda row: list(ngrams(word_tokenize(row), 3)))
df['four_words'] = df['Clean_Data'].apply(lambda row: list(ngrams(word_tokenize(row), 4))) 
token=pd.Series(df["one_word_tokenized_text"]) 
Lid=pd.Series(df["Id"]) 
matches= token.apply(lambda x: pd.Series(x).str.extractall("|".join(["({})".format(cat) for cat in Categories.HealthCare]))) 
match_list= [[m for m in match.values.ravel() if isinstance(m, str)] for match in matches] 
match_df = pd.DataFrame({"ID":Lid,"jc1":match_list}) 


def match_word(feature, row):
    # Collect every bigram, trigram and four-word phrase in the row
    # that appears verbatim in the feature list.
    categories = []

    for bigram in row.bigram:
        joined = ' '.join(bigram)
        if joined in feature:
            categories.append(joined)
    for trigram in row.trigram:
        joined = ' '.join(trigram)
        if joined in feature:
            categories.append(joined)
    for fourwords in row.four_words:
        joined = ' '.join(fourwords)
        if joined in feature:
            categories.append(joined)
    return categories

match_df['Health1'] = df.apply(partial(match_word, HealthCare), axis=1) 
match_df['HealthCare'] = match_df[match_df.columns[[1,2]]].apply(lambda x: ','.join(x.dropna().astype(str)),axis=1)

Output

print(match_df)

+---------+------------------+---------+-------------------------------------+
| ID      | jc1              | Health1 | HealthCare                          |
+---------+------------------+---------+-------------------------------------+
| 1918916 | [sauna, jacuzzi] |         | ['sauna', 'jacuzzi'],['steam room'] |
+---------+------------------+---------+-------------------------------------+
| 1495638 |                  |         |                                     |
+---------+------------------+---------+-------------------------------------+
| 1050651 | [Booking]        |         | ['Booking'],[]                      |
+---------+------------------+---------+-------------------------------------+
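Another option, sketched here under the assumption that features are plain words or phrases separated by single spaces, is a compiled regular expression with `re.IGNORECASE`. The match is returned in the casing used by the text, and the dataset itself is not modified. The names `pattern` and `find_features` are hypothetical.

```python
import re

HealthCare = ['steam room', 'sauna', 'Jacuzzi', 'Aerobics', 'yoga room']

# Longest features first, so multi-word features are tried before shorter
# ones; re.escape guards against regex metacharacters in feature names.
alternatives = '|'.join(re.escape(f)
                        for f in sorted(HealthCare, key=len, reverse=True))

# \b keeps the match on whole words; IGNORECASE makes it case-insensitive.
pattern = re.compile(r'\b(?:' + alternatives + r')\b', re.IGNORECASE)


def find_features(text):
    """Return every feature found in text, in the casing used by the text."""
    return pattern.findall(text)


print(find_features('Health Club Steam Room Sauna Jacuzzi Pool Table'))
# ['Steam Room', 'Sauna', 'Jacuzzi']
```

Because the pattern has no capturing group, `findall` returns the matched substrings themselves, in the order they appear in the text.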

No, I should not modify the values in my dataset. I just want to match those words against the category values, regardless of case. –


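OK, you are already adding columns to your dataset. I have just edited my answer based on how I see it. You can either a) create lower-case or upper-case variants of the 3 columns you are creating, or b) try to reproduce (in Python code) every possible case format in your Category.py. The latter seems like overkill. – Pelican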


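Sorry if my question is confusing. I understand your point, but my concern is that the case of my final output values should not differ from what I received in the dataset. If "Sauna" and "Steam Room" have initial caps, the output must keep them. I mean that if my dataset contains similar words in the future, my code must be case-insensitive in order to detect them. :) –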
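The requirement in this last comment (detect features regardless of case, but report them exactly as they appear in the dataset) can be sketched without changing either the data or the category list: lower-case only the comparison key and append the original n-gram. This is a minimal sketch, not the thread's accepted method. It uses plain whitespace splitting rather than NLTK's `word_tokenize`, and `ngrams_of` and `match_features` are hypothetical names.

```python
def ngrams_of(tokens, n):
    """Return all runs of n consecutive tokens."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]


def match_features(text, features, max_n=4):
    """Find features in text case-insensitively, keeping the text's casing."""
    wanted = set(f.lower() for f in features)
    tokens = text.split()
    found = []
    for n in range(1, max_n + 1):
        for gram in ngrams_of(tokens, n):
            joined = ' '.join(gram)
            # Compare in lower case, but keep the original spelling.
            if joined.lower() in wanted:
                found.append(joined)
    return found


HealthCare = ['steam room', 'sauna', 'Jacuzzi', 'Aerobics', 'yoga room']
sample = ('Luxury Apartments consisting 11 towers Well equipped gymnasium '
          'Swimming Pool Toddler Pool Health Club Steam Room Sauna Jacuzzi '
          'Pool Table Chess Billiards room Carom Table Tennis indoor games')

print(match_features(sample, HealthCare))
# ['Sauna', 'Jacuzzi', 'Steam Room']
```

With a DataFrame like the one in the question, the same function could be applied per row, for example `df['Clean_Data'].apply(lambda t: match_features(t, HealthCare))`.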