Scala/Spark: efficient partial string matching

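I am writing a small program in Spark using Scala and have run into a problem. I have a List/RDD of single-word strings and a List/RDD of sentences, which may or may not contain words from the single-word list, i.e.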

val singles = Array("this", "is") 
val sentence = Array("this Date", "is there something", "where are something", "this is a string")

I want to select the sentences that contain one or more of the words from singles, so that the result looks like this:

output[(this, Array(this Date, this is a string)),(is, Array(is there something, this is a string))]

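I thought of two approaches. One is to split each sentence and filter using .contains. The other is to split the sentences, format them into an RDD, and use .join to intersect the two RDDs. I am looking at around 50 single words and 5 million sentences; which method would be faster? Are there any other solutions? Could you also help me with the code? My code seems to produce no results (although it compiles and runs without errors).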

Source

2015-04-21 GameOfThrows

Given that each word will match 100K sentences on average, grouping may not really be an option. (word, sentence) would be a better final format. – maasg

You can create a set of the required keys, look those keys up in each sentence, and group by key.

val singles = Array("this", "is") 

val sentences = Array("this Date", 
         "is there something", 
         "where are something", 
         "this is a string") 

val rdd = sc.parallelize(sentences) // create RDD 

val keys = singles.toSet   // words required as keys. 

val result = rdd.flatMap{ sen => 
        val words = sen.split(" ").toSet; 
        val common = keys & words;  // intersect 
        common.map(x => (x, sen))  // map as key -> sen 
       } 
       .groupByKey.mapValues(_.toArray)  // group values for a key 
       .collect        // get rdd contents as array 

// result: 
// Array((this, Array(this Date, this is a string)), 
//  (is, Array(is there something, this is a string)))
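If grouping turns out to be too heavy (each word may match a very large number of sentences), the per-sentence step can be left as flat (word, sentence) pairs. The following is only a minimal sketch on plain Scala collections so it can be tried without a cluster; the object name WordSentencePairs and the helper matchPairs are hypothetical names introduced here, not part of Spark. The assumption is that the same function could be passed to rdd.flatMap in place of the closure above, with the groupByKey step dropped.

```scala
object WordSentencePairs {
  // Hypothetical helper: emits one (word, sentence) pair for every
  // wanted word that occurs as a whole word in the sentence.
  def matchPairs(keys: Set[String], sentence: String): Set[(String, String)] = {
    val words = sentence.split(" ").toSet        // distinct words of the sentence
    (keys & words).map(word => (word, sentence)) // intersect, then pair up
  }

  def main(args: Array[String]): Unit = {
    val keys = Set("this", "is")
    val sentences = Seq("this Date",
                        "is there something",
                        "where are something",
                        "this is a string")

    // Flat list of pairs, with no grouping step.
    val pairs = sentences.flatMap(s => matchPairs(keys, s))
    pairs.foreach(println)
  }
}
```

A sentence that contains none of the keys yields an empty set and so contributes nothing to the output.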

Source

2015-04-21 15:35:02

Much better than the accepted answer! – maasg

I just tried to solve your problem and ended up with this code:

// True if the word s occurs among the words in l.
def check(s: String, l: Array[String]): Boolean = l.contains(s)

val singles = sc.parallelize(Array("this", "is"))
val sentence = sc.parallelize(Array("this Date", "is there something", "where are something", "this is a string"))
val result = singles.cartesian(sentence)            // every (word, sentence) pair
        .filter(x => check(x._1, x._2.split(" ")))  // keep pairs where the sentence contains the word
        .groupByKey()
        .map(x => (x._1, x._2.mkString(", "))) // pay attention here(*)
result.foreach(println)

The last map line (*) is there only because without it I get output containing CompactBuffer, like this:

(is,CompactBuffer(is there something, this is a string))  
(this,CompactBuffer(this Date, this is a string))

With that map line (the one using mkString), I get more readable output, as follows:

(is,is there something, this is a string) 
(this,this Date, this is a string)

Hope it helps in some way.
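To see what cartesian followed by filter and groupByKey computes, the same logic can be sketched on plain Scala collections. This is only an illustration, not the Spark code itself; the object name CartesianSketch and the helper matchByCartesian are hypothetical names introduced here. Every word is paired with every sentence before filtering, so with about 50 words and 5 million sentences that is 250 million pairs to test.

```scala
object CartesianSketch {
  // Hypothetical helper: pairs every word with every sentence, keeps the
  // pairs where the sentence contains the word, then groups by word.
  def matchByCartesian(singles: Seq[String],
                       sentences: Seq[String]): Map[String, Seq[String]] = {
    val pairs = for {
      word     <- singles
      sentence <- sentences
      if sentence.split(" ").contains(word) // whole-word match
    } yield (word, sentence)

    // Group the surviving pairs by word, keeping only the sentences.
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2)) }
  }

  def main(args: Array[String]): Unit = {
    val singles = Seq("this", "is")
    val sentences = Seq("this Date",
                        "is there something",
                        "where are something",
                        "this is a string")

    matchByCartesian(singles, sentences).foreach {
      case (word, matched) => println((word, matched.mkString(", ")))
    }
  }
}
```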

Source

2015-04-21 12:58:54

A Cartesian product over 5 million sentences would be a tough cookie, but a nice answer nevertheless. – GameOfThrows

You may be right... but you could give it a try and see how it works... I admit this is just a "quick" answer, and I may be able to find a way to improve it. –

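This is actually not bad. Considering that it uses Spark RDDs, it ran only slightly slower than my version, which runs only on the master, but I think with more data yours would do much better than mine. Also, if you think about it carefully, Cartesian might be the most efficient way to search. – GameOfThrows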
