I've read through a number of Spark examples, and I can't seem to find out how to create an RDD with a key column and multiple value columns from a CSV file.

I've read a little bit about Spark SQL and don't think it's what I want in this case. I'm not looking for interactive analysis of this data; I'm after more of a batch-style processing job.

I'm interested in Java or Scala syntax.

Can you point me in the right direction?

1 Answer

Multiple-column RDD

There's no such thing, really, nor do you need one. You can create an RDD of objects of any type T. This type should model a record, so a record with multiple columns can be of type Array[String], Seq[AnyRef], or whatever best models your data. In Scala, the best choice (for type safety and code readability) is usually a case class that represents a record.

For example, if your CSV looks like this:

+---------+-------------------+--------+-------------+
| ID      | Name              | Age    | Department  |
+---------+-------------------+--------+-------------+
| 1       | John Smith        | 29     | Dev         |
| 2       | William Black     | 31     | Finance     |
| 3       | Nancy Stevens     | 32     | Dev         |
+---------+-------------------+--------+-------------+

You could, for example:

case class Record(id: Long, name: String, age: Int, department: String)

val input: RDD[String] = sparkContext.textFile("./file")   // assumes no header row
val parsed: RDD[Record] = input.map { line =>
  val fields = line.split(",").map(_.trim)
  Record(fields(0).toLong, fields(1), fields(2).toInt, fields(3))
}

Now you can conveniently perform transformations on this RDD. For example, to turn it into a PairRDD with the ID as the key, simply call keyBy:

val keyed: RDD[(Long, Record)] = parsed.keyBy(_.id)
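
From there, all the usual pair-RDD operations apply. As a minimal sketch (assuming the parsed RDD above; the name avgAgeByDept is just for illustration), keying by department instead of ID lets you compute the average age per department:

import org.apache.spark.rdd.RDD

// Average age per department, using standard pair-RDD operations.
val avgAgeByDept: RDD[(String, Double)] = parsed
  .map(r => (r.department, (r.age, 1)))                // (dept, (age, 1))
  .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))   // sum ages and counts
  .mapValues { case (sum, n) => sum.toDouble / n }     // mean age per dept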

That said, even though you're more interested in batch processing than in analysis, this could still be achieved more easily (and, depending on what you do with the RDD, perhaps perform better) using the DataFrame API. It has good facilities for reading CSVs safely (e.g. the spark-csv package) and for treating data as named columns without having to create a case class matching each type of record.
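
For illustration, here's a minimal sketch of the DataFrame route, assuming Spark 2.x, where spark-csv's functionality is built into spark.read (on Spark 1.x you'd attach the external spark-csv package and use the same options):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-batch").getOrCreate()

val df = spark.read
  .option("header", "true")       // first row holds column names
  .option("inferSchema", "true")  // infer numeric vs. string column types
  .csv("./file")

// Columns are addressed by name; no case class needed:
df.groupBy("Department").count().show()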

