Dataset: matching words
> df
Id Clean_Data
1918916 Luxury Apartments consisting 11 towers Well equipped gymnasium Swimming Pool Toddler Pool Health Club Steam Room Sauna Jacuzzi Pool Table Chess Billiards room Carom Table Tennis indoor games
1495638 near medavakkam junction calm area near global hospital
1050651 No Pre Emi No Booking Amount No Floor Rise Charges No Processing Fee HLPROJECT HIGHLIGHTS
The code below successfully returns the words and n-grams that match the list of values defined in Category.py:
import pandas as pd
from functools import partial
from nltk import word_tokenize
from nltk.util import ngrams
from Category import HealthCare  # list built in Category.py, shown below

# Tokenize each description into single words and 2-, 3- and 4-word n-grams
df['one_word_tokenized_text'] = df["Clean_Data"].str.split()
df['bigram'] = df['Clean_Data'].apply(lambda row: list(ngrams(word_tokenize(row), 2)))
df['trigram'] = df['Clean_Data'].apply(lambda row: list(ngrams(word_tokenize(row), 3)))
df['four_words'] = df['Clean_Data'].apply(lambda row: list(ngrams(word_tokenize(row), 4)))
token = pd.Series(df["one_word_tokenized_text"])
Lid = pd.Series(df["Id"])
# Single-word matches: one capture group per category value
matches = token.apply(lambda x: pd.Series(x).str.extractall("|".join(["({})".format(cat) for cat in HealthCare])))
match_list = [[m for m in match.values.ravel() if isinstance(m, str)] for match in matches]
match_df = pd.DataFrame({"ID": Lid, "jc1": match_list})
def match_word(feature, row):
    # Collect every bigram, trigram and four-word phrase of the row found in `feature`
    categories = []
    for bigram in row.bigram:
        joined = ' '.join(bigram)
        if joined in feature:
            categories.append(joined)
    for trigram in row.trigram:
        joined = ' '.join(trigram)
        if joined in feature:
            categories.append(joined)
    for fourwords in row.four_words:
        joined = ' '.join(fourwords)
        if joined in feature:
            categories.append(joined)
    return categories
match_df['Health1'] = df.apply(partial(match_word, HealthCare), axis=1)
match_df['HealthCare'] = match_df[match_df.columns[[1,2]]].apply(lambda x: ','.join(x.dropna().astype(str)),axis=1)
Code
Category.py
category = [('steam room', 'IN', 'HealthCare'),
            ('sauna', 'IN', 'HealthCare'),
            ('Jacuzzi', 'IN', 'HealthCare'),
            ('Aerobics', 'IN', 'HealthCare'),
            ('yoga room', 'IN', 'HealthCare')]
HealthCare = [e1 for (e1, rel, e2) in category if e2 == 'HealthCare']
Output:
ID HealthCare
1918916 Jacuzzi
1495638
1050651 Aerobics, Jacuzzi, yoga room
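Here, the code identifies and returns a value only if the feature in the category list is written with exactly the same letter case as in the dataset; otherwise it returns nothing. So I want my code to be case-insensitive, so that it also picks up "Steam Room" and "Sauna" under the HealthCare category. I tried using the `.lower()` function, but I'm not sure how to implement it.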
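One way to approach this, as a minimal sketch: compare lower-cased copies of both the n-grams and the category values, but return each phrase as it appears in the text, so the dataset itself is never modified. The helper names `ngrams_of` and `match_features` are hypothetical, and plain `str.split` stands in for `word_tokenize` to keep the example free of dependencies.

```python
def ngrams_of(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def match_features(text, features, max_n=4):
    """Return phrases from `text` that equal a feature, ignoring case.

    Only the comparison uses lower-cased copies; the returned phrases
    keep the spelling found in `text`.
    """
    wanted = {f.lower() for f in features}
    tokens = text.split()
    found = []
    for n in range(1, max_n + 1):  # single words up to four-word phrases
        for gram in ngrams_of(tokens, n):
            phrase = " ".join(gram)
            if phrase.lower() in wanted:
                found.append(phrase)
    return found


# Same values as the HealthCare list in Category.py
HealthCare = ["steam room", "sauna", "Jacuzzi", "Aerobics", "yoga room"]
text = "Health Club Steam Room Sauna Jacuzzi Pool Table"
print(match_features(text, HealthCare))
# ['Sauna', 'Jacuzzi', 'Steam Room']
```

Assuming the DataFrame from the question, something like `df["Clean_Data"].apply(lambda s: match_features(s, HealthCare))` would presumably produce the per-row lists.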
No, I should not modify my dataset values. I just want to match these words against the category values regardless of case.
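Well, you have already added columns to your dataset. I just edited my answer based on how I see it. You can either: a) set lower/upper-case variants for the 3 columns you are creating, or b) try to reproduce (in Python code) every possible case format in your Category.py. The latter seems like overkill. – Pelican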
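Sorry if my question was confusing. I understand your point, but my concern is that the case of my final output values should not differ from what I received in the dataset. If "Sauna" and "Steam Room" have initial caps, the output must keep them. I mean, if my dataset contains similar words in the future, my code has to be case-insensitive to detect them. :)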
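A regular expression compiled with `re.IGNORECASE` is another option that returns each match with the casing it has in the dataset. This is only a sketch; `build_pattern` and `find_features` are illustrative names, not part of the code in the question, and the word boundaries are an assumption about what should count as a match.

```python
import re


def build_pattern(features):
    """Compile one case-insensitive pattern covering all feature values."""
    # Longest alternatives first, so multi-word features are tried before shorter ones
    alternatives = sorted((re.escape(f) for f in features), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


def find_features(text, pattern):
    """Return the matched substrings exactly as they are spelled in `text`."""
    return pattern.findall(text)


# Same values as the HealthCare list in Category.py
HealthCare = ["steam room", "sauna", "Jacuzzi", "Aerobics", "yoga room"]
pattern = build_pattern(HealthCare)
print(find_features("Well equipped gymnasium Steam Room Sauna Jacuzzi", pattern))
# ['Steam Room', 'Sauna', 'Jacuzzi']
```

With pandas, the same pattern could presumably be applied per row via `df["Clean_Data"].apply(lambda s: find_features(s, pattern))`, leaving `Clean_Data` untouched.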