Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

Copying example from this question: As a conceptual example, if I have two dataframes:

words     = [the, quick, fox, a, brown, fox]
stopWords = [the, a]

then I want the output to be, in any order:

words - stopWords = [quick, brown, fox, fox]

ExceptAll can do this in 2.4 but I cannot upgrade. The answer in the linked question is specific to a dataframe:

words.join(stopwords, words("id") === stopwords("id"), "left_outer")
     .where(stopwords("id").isNull)
     .select(words("id")).show()

as in you need to know the pkey and the other columns.

Can anyone come up with an answer that will work on any dataframe?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
295 views
Welcome To Ask or Share your Answers For Others

1 Answer

Here is an implementation for you all. I have tested in Spark 2.4.2, it should work for 2.3 too (not 100% sure)

    val df1 = spark.createDataset(Seq("the","quick","fox","a","brown","fox")).toDF("c1")
    val df2 = spark.createDataset(Seq("the","a")).toDF("c1")

    exceptAllCustom(df1, df2, Seq("c1")).show()


  def exceptAllCustom(df1 : DataFrame, df2 : DataFrame, pks : Seq[String]): DataFrame = {
    val notNullCondition = pks.foldLeft(lit(0==0))((column,cName) => column && df2(cName).isNull)
    val joinCondition = pks.foldLeft(lit(0==0))((column,cName) => column && df2(cName)=== df1(cName))
    val result = df1.join(df2, joinCondition, "left_outer")
       .where(notNullCondition)

    pks.foldLeft(result)((df,cName) => df.drop(df2(cName)))
  }

Result -

+-----+
|   c1|
+-----+
|quick|
|  fox|
|brown|
|  fox|
+-----+

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...