I can't figure this out, but I guess it's simple. I have a Spark DataFrame df with columns "A", "B" and "C". Now let's say I have an Array containing the names of the columns of this df:

val column_names = Array("A", "B", "C")

I'd like to do a df.select() in such a way that I can specify which columns not to select. Example: let's say I do not want to select column "B". I tried

df.select(column_names.filter(_!="B"))

but this does not work, as

org.apache.spark.sql.DataFrame cannot be applied to (Array[String])

Elsewhere it is suggested that it should work with a Seq instead. However, trying

df.select(column_names.filter(_!="B").toSeq)

results in

org.apache.spark.sql.DataFrame cannot be applied to (Seq[String]).

What am I doing wrong?

1 Answer

Since Spark 1.4 you can use the drop method:

Scala:

case class Point(x: Int, y: Int)
val df = sqlContext.createDataFrame(Point(0, 0) :: Point(1, 2) :: Nil)
df.drop("y")  // returns a new DataFrame without column "y"

Python:

df = sc.parallelize([(0, 0), (1, 2)]).toDF(["x", "y"])
df.drop("y")
## DataFrame[x: bigint]
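
As for the original select attempt: select is declared with varargs parameters (select(cols: Column*) and select(col: String, cols: String*)), so it cannot accept an Array or Seq directly; the collection has to be expanded with : _*. A minimal sketch, assuming the df and column_names from the question:

import org.apache.spark.sql.functions.col

val keep = column_names.filter(_ != "B")
df.select(keep.map(col): _*)           // expand the Array[Column] into varargs
// or use the (String, String*) overload:
df.select(keep.head, keep.tail: _*)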
