2017-08-02 26 views

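I have a blacklist like the one below, and I want to filter a Spark DataFrame by whether a string value contains any entry of the blacklist array.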

val blacklist: Array[String]=Array("one of a kind", "one of the", "industry leading", "industry's", "industry leader", "lifetime", "#1 ", "number 1", "number one", "Guarantee", "guaranteed", "guarantees", "Compete", "Competes", "competing", "Competed", "competitor", "competitors", "competition", "competitions", "competitive", "competitor's") 

I also have a DataFrame like this:

+------+---------------------------------+
| name | value                           |
+------+---------------------------------+
| atr1 | this is one of a kind product   |
| atr2 | this product is industry leader |
| atr3 | it is competitor's nightmare    |
+------+---------------------------------+

I want to filter the DataFrame down to the rows whose value contains any entry from the blacklist.

In the example above every row matches, so the result would be:

+------+---------------------------------+
| name | value                           |
+------+---------------------------------+
| atr1 | this is one of a kind product   |
| atr2 | this product is industry leader |
| atr3 | it is competitor's nightmare    |
+------+---------------------------------+

Answer


Given a DataFrame such as

+----+-------------------------------+
|name|value                          |
+----+-------------------------------+
|atr1|this is one of a kind product  |
|atr2|this product is industry leader|
|atr3|it is competitor's nightmare   |
|atr4|testing for filter             |
+----+-------------------------------+

You can define a udf function:

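import org.apache.spark.sql.functions._
// True when the value contains at least one blacklist entry (case-sensitive substring match); null values never match.
val blackListFilter = udf((value: String) => value != null && blacklist.exists(entry => value.contains(entry)))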

Then call it in a filter:

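// The $"value" syntax needs import spark.implicits._ in scope; col("value") is an equivalent alternative.
df.filter(blackListFilter($"value"))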
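The check that the udf performs is an ordinary substring test, so it can be tried out without Spark. The following is a minimal sketch under stated assumptions: `BlacklistSketch`, `blacklistSample`, `isBlacklisted`, and the `ignoreCase` switch are hypothetical names introduced here for illustration, and `blacklistSample` is a shortened copy of the array from the question.

```scala
object BlacklistSketch {
  // Shortened, illustrative copy of the question's blacklist (assumption).
  val blacklistSample: Array[String] =
    Array("one of a kind", "industry leader", "competitor's", "Guarantee")

  // True when `value` contains at least one blacklist entry as a substring.
  // `ignoreCase` is a hypothetical option, not part of the original answer.
  def isBlacklisted(value: String, ignoreCase: Boolean = false): Boolean =
    if (value == null) false // null values never match
    else if (ignoreCase)
      blacklistSample.exists(entry => value.toLowerCase.contains(entry.toLowerCase))
    else
      blacklistSample.exists(entry => value.contains(entry))
}
```

Because `String.contains` is case-sensitive, the entry "Guarantee" does not match a value written as "guarantee" unless the case-insensitive variant is used, which may explain why the blacklist in the question lists several capitalizations.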

Applied to the DataFrame above, the filter should return

+----+-------------------------------+
|name|value                          |
+----+-------------------------------+
|atr1|this is one of a kind product  |
|atr2|this product is industry leader|
|atr3|it is competitor's nightmare   |
+----+-------------------------------+
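As a possible alternative to the udf, the same filter can be built from Spark's built-in `Column.contains`, which keeps the predicate visible to the query optimizer. This is a sketch and not part of the original answer: `BlacklistColumnSketch` and `containsAny` are hypothetical names, the blacklist is shortened, and a local Spark session is assumed.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

object BlacklistColumnSketch {
  // OR together one `contains` test per entry.
  // Starting from lit(false) keeps an empty blacklist valid (nothing matches).
  def containsAny(column: Column, entries: Seq[String]): Column =
    entries.foldLeft(lit(false))((acc, entry) => acc || column.contains(entry))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("blacklist-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("atr1", "this is one of a kind product"),
      ("atr2", "this product is industry leader"),
      ("atr3", "it is competitor's nightmare"),
      ("atr4", "testing for filter")
    ).toDF("name", "value")

    // Shortened copy of the question's blacklist (assumption).
    val blacklist = Seq("one of a kind", "industry leader", "competitor's")

    // Keeps only rows whose value contains at least one blacklist entry.
    df.filter(containsAny(col("value"), blacklist)).show(false)

    spark.stop()
  }
}
```

With a very long blacklist the generated expression grows by one term per entry, so the udf may remain the more practical choice in that case.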