2017-09-25 94 views
0

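Find differences between two DataFrames based on a primary key in Spark Scala

I have two DataFrames in Spark. I run df1.except(df2) to find out whether any columns have changed between the two DataFrames.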

df1 looks like this:

|001000900|aaaaa BELLOWS CORPORATION||N| 
|001000905|ddddd DEPARTMENT OF LABOR AND EMPLOYMENT SECURITY|BUREAU OF COMPLIANCE|N| 
|001001049|gggg RAVIOLI MFG CO INC|SPINELLI BKY RAVIOLI PASTRY SP|N| 
|001001130|dddd ANGELES UNIFIED SCHOOL DISTRICT|TRANSPORTATION BRANCH|N| 
|001001143|ffff MUSIC PARTIES, INC||N| 
|001001155|BOSTON BRASS AND IRON CO||N| 
|001001171|HANCOCK MARINE, INC.||N| 
|001001184|TRILLION CORPORATION||N| 
|001001192|HAWAII STATE CHIROPRACTIC ASSOCIATION INC||N| 
|001001379|THE FRUIT SQUARE PEOPLE INC|L & M BAKERY|N| 
|001001416|J & S MARKET||N| 

df2 looks like this:

|001000145|PARADISE TAN||N| 
|001000306|SHRUT & ASCH LEATHER COMPANY, INC.||N| 
|001000355|HARRISON SPECIALTY CO., INC.||N| 
|001000363|LOUIS M. GERSON CO., INC.||N| 
|001000467|SAVE THE SEA TURTLES INTERNATIONAL|ADOPT THE BEACH HI|N| 
|001000504|DIRIGO SPICE CORPORATION|CUNNINGHAM SPICE|N| 
|001000744|FREEDMAN THREAD COMPANY|COLONIAL THREAD CO|N| 
|001000756|AFFORDABLE AIR CONDITIONING|P R ENTERPRISE|N| 
|001000900|CLIFLEX BELLOWS CORPORATION||N| 
|001000905|FLORIDA DEPARTMENT OF LABOR AND EMPLOYMENT SECURITY|BUREAU OF COMPLIANCE|N| 
|001001049|SPINELLI RAVIOLI MFG CO INC|SPINELLI BKY RAVIOLI PASTRY SP|N| 
|001001130|LOS ANGELES UNIFIED SCHOOL DISTRICT|TRANSPORTATION BRANCH|N| 
|001001143|TOSCO MUSIC PARTIES, INC||N| 
|001001155|BOSTON BRASS AND IRON CO||N| 

But what I actually need is to find the differences between the two DataFrames based on a single column.

I want my output to look like this:

+----------+-------+--------------------+--------------------+--------------------------+
|dunsnumber|filler1|        businessname|      tradestylename|registeredaddressindicator|
+----------+-------+--------------------+--------------------+--------------------------+
| 001001130|       |dddd ANGELES UNIF...|TRANSPORTATION BR...|                         N|
| 001000900|       |aaaaa BELLOWS COR...|                    |                         N|
| 001000905|       |ddddd DEPARTMENT ...|BUREAU OF COMPLIANCE|                         N|
| 001001143|       |ffff MUSIC PARTIE...|                    |                         N|
| 001001049|       |gggg RAVIOLI MFG ...|SPINELLI BKY RAVI...|                         N|
+----------+-------+--------------------+--------------------+--------------------------+

Here is my code:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

// `schema` is a StructType defined elsewhere (not shown in the question)

// Incremental file -> df1
val textRdd1 = sc.textFile("/home/cloudera/TRF/PCFP/INCR")
val rowRdd1 = textRdd1.map(line => Row.fromSeq(line.split("\\|", -1)))
var df1 = sqlContext.createDataFrame(rowRdd1, schema)

// Main file -> df2
val textRdd2 = sc.textFile("/home/cloudera/TRF/PCFP/MAIN")
val rowRdd2 = textRdd2.map(line => Row.fromSeq(line.split("\\|", -1)))
var df2 = sqlContext.createDataFrame(rowRdd2, schema)

// Attempt: rows of df1 that are not in df2, restricted to matching keys
val diffAnyColumnDF = df1.except(df2)
  .where(df1.col("dunsnumber") === df2.col("dunsnumber"))
  .show()

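So my primary key is "dunsnumber": I only want the rows where that key matches in both DataFrames and some other column has changed.

I hope my question is clear.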

+0

You should join them on the key and then use filter, or select, and apply your filtering logic. :)

+0

Do you want a subtract or a plain except? That is, should the resulting DataFrame come only from df1, or from both df1 and df2?

+0

@Avishek If the primary key is the same in both, then for that matching key, if any of the column values is different, that is the value I need.. – SUDARSHAN
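The join-and-filter idea from the comments could be sketched roughly as follows. This is only an illustration, not code from the thread: `changedRows` and its parameter names are hypothetical, and it assumes both DataFrames share the same column names, that there is at least one non-key column, and that the key is unique in the second DataFrame (otherwise the join duplicates rows).

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper: rows of `incr` whose key also exists in `main`
// and where at least one non-key column holds a different value.
def changedRows(incr: DataFrame, main: DataFrame, key: String): DataFrame = {
  val valueCols = incr.columns.filter(_ != key)
  val l = incr.as("l")
  val r = main.as("r")

  // Null-safe comparison (<=>), negated, so null vs. value counts as a change.
  // reduce assumes valueCols is non-empty.
  val anyDiff = valueCols
    .map(c => !(col(s"l.$c") <=> col(s"r.$c")))
    .reduce(_ || _)

  l.join(r, col(s"l.$key") === col(s"r.$key")) // keep only matching keys
    .where(anyDiff)                             // keep only changed rows
    .select(incr.columns.map(c => col(s"l.$c")): _*) // left-side columns only
}
```

With the DataFrames from the question, this would be called as `changedRows(df1, df2, "dunsnumber").show()`.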

Answers

0

Hi, so this also worked for me..

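// Rows of df1 that have no identical row in df2
val diffAnyColumnDF = df1.except(df2)
// Keep only those whose key also exists in df2
val addDF = diffAnyColumnDF.join(df2, Seq("dunsnumber"))
addDF.show()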
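An inner join like the one above also brings in the columns of df2. If only the df1 columns are wanted, as in the expected output in the question, a left semi join may be a closer fit. The sketch below is an assumption-laden variant, not part of the original answer: `changedRowsViaExcept` is a hypothetical name, and it relies on the `"leftsemi"` join type string being accepted by the Spark version in use. Note that `except` also removes duplicate rows.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper: rows of `incr` that have no identical row in `main`,
// restricted to keys that do exist in `main`. Only `incr` columns are returned.
def changedRowsViaExcept(incr: DataFrame, main: DataFrame, key: String): DataFrame =
  incr.except(main)                   // full-row difference (distinct rows)
    .join(main, Seq(key), "leftsemi") // keep keys present in `main`, left columns only
```

For the question's data this would be `changedRowsViaExcept(df1, df2, "dunsnumber").show()`.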
0

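A DataFrame has no subtract method. However, you can work around that: convert the data to RDDs, use the RDD subtract method, and convert the result back to a DataFrame.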
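A minimal sketch of that RDD round trip, assuming both DataFrames have the same schema (`subtractViaRdd` is a hypothetical name, not an existing API):

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper: rows of df1 that do not appear (as whole rows) in df2.
def subtractViaRdd(df1: DataFrame, df2: DataFrame): DataFrame = {
  // RDD[Row].subtract compares entire rows, not just the key column
  val remaining = df1.rdd.subtract(df2.rdd)
  // Rebuild a DataFrame using df1's schema
  df1.sqlContext.createDataFrame(remaining, df1.schema)
}
```

Because this compares whole rows, it also returns rows whose key is missing from df2 altogether, so a join on "dunsnumber" would still be needed afterwards to keep only the matching keys.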