I've read through a number of Spark examples, and I can't seem to find out how to create an RDD with a key column and multiple value columns from a CSV file.

I've read a little bit about Spark SQL and don't think it's what I want in this case. I'm not looking for interactive analysis of this data; I'm after more of a batch-style processing job.

I'm interested in Java or Scala syntax.

Can you point me in the right direction?

1 Answer

Multiple-column RDD

There's no such thing, really, nor do you need one. You can create an RDD of objects of any type T. This type should model a record, so a record with multiple columns can be of type Array[String], Seq[AnyRef], or whatever best models your data. In Scala, the best choice (for type safety and code readability) is usually a case class that represents a record.

For example, if your CSV looks like this:

+---------+-------------------+--------+-------------+
| ID      | Name              | Age    | Department  |
+---------+-------------------+--------+-------------+
| 1       | John Smith        | 29     | Dev         |
| 2       | William Black     | 31     | Finance     |
| 3       | Nancy Stevens     | 32     | Dev         |
+---------+-------------------+--------+-------------+

You could, for example:

case class Record(id: Long, name: String, age: Int, department: String)

val input: RDD[String] = sparkContext.textFile("./file")   // assumes no header row
val parsed: RDD[Record] = input.map { line =>
  val fields = line.split(",").map(_.trim)
  Record(fields(0).toLong, fields(1), fields(2).toInt, fields(3))
}

Now you can conveniently perform transformations on this RDD. For example, to turn it into a PairRDD with the ID as the key, simply call keyBy:

val keyed: RDD[(Long, Record)] = parsed.keyBy(_.id)
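
From there, all the usual pair-RDD operations apply. As a minimal sketch (assuming the parsed RDD above; the name avgAgeByDept is just for illustration), keying by department instead of ID lets you compute the average age per department:

import org.apache.spark.rdd.RDD

// Average age per department, using standard pair-RDD operations.
val avgAgeByDept: RDD[(String, Double)] = parsed
  .map(r => (r.department, (r.age, 1)))                // (dept, (age, 1))
  .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))   // sum ages and counts
  .mapValues { case (sum, n) => sum.toDouble / n }     // mean age per dept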

That said, even though you're more interested in batch processing than in analysis, this could still be achieved more easily (and, depending on what you do with the RDD, perhaps perform better) using the DataFrame API. It has good facilities for reading CSVs safely (e.g. the spark-csv package) and for treating data as named columns without having to create a case class matching each type of record.
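
For illustration, here's a minimal sketch of the DataFrame route, assuming Spark 2.x, where spark-csv's functionality is built into spark.read (on Spark 1.x you'd attach the external spark-csv package and use the same options):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-batch").getOrCreate()

val df = spark.read
  .option("header", "true")       // first row holds column names
  .option("inferSchema", "true")  // infer numeric vs. string column types
  .csv("./file")

// Columns are addressed by name; no case class needed:
df.groupBy("Department").count().show()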

