What would be an efficient way to read a CSV file in Apache Spark when the values contain the delimiter itself?

Below is my dataset:

ID,Name,Age,Add,ress,Salary
1,Ross,32,Ah,med,abad,2000
2,Rachel,25,Delhi,1500
3,Chandler,23,Kota,2000
4,Monika,25,Mumbai,6500
5,Mike,27,Bhopal,8500
6,Phoebe,22,MP,4500
7,Joey,24,Indore,10000

1 Answer

The data needs to be cleaned up first, as there is no way to systematically build a DataFrame when the field delimiter also appears, unquoted, inside the values.
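For illustration, here is a minimal sketch (assuming the file is saved as file.csv, as in the answer below) of why a plain CSV read fails: the header itself splits into six columns, and rows with embedded commas in the address no longer line up with it.

// naive read: Spark splits on every comma, so the header yields the
// six columns ID, Name, Age, Add, ress, Salary, while a row such as
// "1,Ross,32,Ah,med,abad,2000" has seven fields and cannot line up
val broken = spark.read.option("header", "true").csv("file.csv")
broken.show()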

One way to do that is to move the last column to the front and enclose the raw address data in quotes:

// in spark-shell, sc and spark are predefined; otherwise create a
// SparkSession and import spark.implicits._ for .toDS()
import spark.implicits._

val rdd = sc.textFile("file.csv")

// move the last column (Salary) to the front, so the comma-ridden
// address becomes the final field of each line
val rdd2 = rdd.map(s => s.substring(s.lastIndexOf(",") + 1)
               + "," + s.substring(0, s.lastIndexOf(",")))

// wrap everything after the fourth comma in double quotes: the first
// alternative matches the four leading fields and appends the opening
// quote; ".$" matches the last character and appends the closing one
val stringDataset = rdd2.map(s => s.replaceAll("(.*?,.*?,.*?,.*?,|.$)", "$1\"")).toDS()

// Spark's CSV reader honours the quotes, so the address stays intact
val df = spark.read.option("header", "true").csv(stringDataset)

df.show() outputs:

+------+---+--------+---+-----------+
|Salary| ID|    Name|Age|   Add,ress|
+------+---+--------+---+-----------+
|  2000|  1|    Ross| 32|Ah,med,abad|
|  1500|  2|  Rachel| 25|      Delhi|
|  2000|  3|Chandler| 23|       Kota|
|  6500|  4|  Monika| 25|     Mumbai|
|  8500|  5|    Mike| 27|     Bhopal|
|  4500|  6|  Phoebe| 22|         MP|
| 10000|  7|    Joey| 24|     Indore|
+------+---+--------+---+-----------+
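
If the original column layout matters, the Salary column can be moved back and the mangled Add,ress header renamed. A small follow-up sketch, where Address is an assumption about the intended header name:

// rename the split header ("Address" is assumed to be the intended
// name) and restore the original column order
val tidy = df
  .withColumnRenamed("Add,ress", "Address")
  .select("ID", "Name", "Age", "Address", "Salary")
tidy.show()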
