Filter a Spark DataFrame on string values that contain part of a blacklist array


val blacklist: Array[String]=Array("one of a kind", "one of the", "industry leading", "industry's", "industry leader", "lifetime", "#1 ", "number 1", "number one", "Guarantee", "guaranteed", "guarantees", "Compete", "Competes", "competing", "Competed", "competitor", "competitors", "competition", "competitions", "competitive", "competitor's")

and I have a DataFrame like this:

+----+-------------------------------+
|name|value                          |
+----+-------------------------------+
|atr1|this is one of a kind product  |
|atr2|this product is industry leader|
|atr3|it is competitor's nightmare   |
+----+-------------------------------+

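I want to filter the DataFrame so that it keeps every row whose value contains one of the blacklist entries.

In the case above, every row should be returned: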

+----+-------------------------------+
|name|value                          |
+----+-------------------------------+
|atr1|this is one of a kind product  |
|atr2|this product is industry leader|
|atr3|it is competitor's nightmare   |
+----+-------------------------------+

Source

2017-08-02 Narendra Prasad

Given a DataFrame such as

+----+-------------------------------+
|name|value                          |
+----+-------------------------------+
|atr1|this is one of a kind product  |
|atr2|this product is industry leader|
|atr3|it is competitor's nightmare   |
|atr4|testing for filter             |
+----+-------------------------------+

you can define a udf function

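import org.apache.spark.sql.functions._
// True when the value contains at least one blacklist entry.
// Null values are treated as non-matching instead of throwing a NullPointerException.
val blackListFilter = udf((value: String) => value != null && blacklist.exists(entry => value.contains(entry)))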

and call it in a filter to get the rows you need

df.filter(blackListFilter($"value"))

You should get

+----+-------------------------------+
|name|value                          |
+----+-------------------------------+
|atr1|this is one of a kind product  |
|atr2|this product is industry leader|
|atr3|it is competitor's nightmare   |
+----+-------------------------------+
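
The substring check that drives this result can be tried without a Spark session. The following is only a minimal sketch in plain Scala, assuming a shortened blacklist and the four sample rows held in an ordinary Seq; the names `BlacklistSketch`, `matchesBlacklist`, `rows` and `flagged` are hypothetical and do not come from the original post.

```scala
object BlacklistSketch {
  // Shortened blacklist for illustration; the full array from the question behaves the same way.
  val blacklist: Array[String] =
    Array("one of a kind", "industry leader", "competitor's", "Guarantee")

  // Hypothetical helper: true when the value contains at least one blacklist entry.
  // Null-safe, and case-sensitive because String.contains is case-sensitive.
  def matchesBlacklist(value: String): Boolean =
    value != null && blacklist.exists(entry => value.contains(entry))

  // The sample rows from the DataFrame above, as (name, value) pairs.
  val rows: Seq[(String, String)] = Seq(
    "atr1" -> "this is one of a kind product",
    "atr2" -> "this product is industry leader",
    "atr3" -> "it is competitor's nightmare",
    "atr4" -> "testing for filter"
  )

  // Keep only the names of the rows whose value matches the blacklist.
  val flagged: Seq[String] =
    rows.filter { case (_, value) => matchesBlacklist(value) }.map(_._1)

  def main(args: Array[String]): Unit =
    println(flagged.mkString(", ")) // prints "atr1, atr2, atr3"
}
```

Because the check is case-sensitive, a value such as "this is a guarantee" would not match the capitalized entry "Guarantee". If case-insensitive matching is wanted, one option is to lowercase both the value and the blacklist entries before comparing, which is a change from the behavior shown in the answer.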

Source

2017-08-02 16:08:24
