For example, the result of this:
df.filter("project = 'en'").select("title","count").groupBy("title").sum()
returns a DataFrame.
How can I save a Spark DataFrame as a CSV file on disk?
Apache Spark does not natively support writing CSV output to disk (prior to Spark 2.x).
You have four available solutions though:
You can convert your Dataframe into an RDD :
import org.apache.spark.sql.Row
def convertToReadableString(r: Row): String = r.mkString(",") // naive: joins fields with "," and does no quoting/escaping
df.rdd.map{ convertToReadableString }.saveAsTextFile(filepath)
This will create a folder at filepath. Under that path you'll find the partition files (e.g. part-000*).
What I usually do, if I want to concatenate all the partitions into one big CSV, is
cat filepath/part* > mycsvfile.csv
Some will use coalesce(1, false) to create one partition from the RDD. It's usually bad practice, since it funnels all of the data into a single partition, which can overwhelm the executor that ends up holding it.
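If you do use it anyway, a minimal sketch (reusing the convertToReadableString helper from above) could look like this:
df.rdd
  .map(convertToReadableString)   // format each Row as one CSV line
  .coalesce(1, shuffle = false)   // funnel everything into a single partition
  .saveAsTextFile(filepath)       // still writes a folder, now containing a single part-00000 file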
Note that df.rdd will return an RDD[Row].
With Spark < 2.0, you can use the Databricks spark-csv library:
Spark 1.4+:
df.write.format("com.databricks.spark.csv").save(filepath)
Spark 1.3:
df.save(filepath,"com.databricks.spark.csv")
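In either case the library has to be on the classpath, for example via --packages (the version and Scala suffix here are illustrative):
spark-shell --packages com.databricks:spark-csv_2.11:1.5.0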
With Spark 2.x the spark-csv package is not needed, as the CSV writer is included in Spark:
df.write.format("csv").save(filepath)
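The built-in writer also accepts options; as a minimal sketch, assuming you want a header row and a custom separator (filepath as above):
df.write
  .option("header", "true")   // write the column names as the first line
  .option("sep", ";")         // optional: override the default "," separator
  .csv(filepath)              // shorthand for format("csv").save(filepath)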
You can convert the DataFrame to a local pandas DataFrame with toPandas and then use its to_csv method (PySpark only). Beware that this collects the entire dataset into the driver's memory.
Note: solutions 1, 2 and 3 will result in CSV-format files (part-*) generated by the underlying Hadoop API that Spark calls when you invoke save. You will have one part-* file per partition.