I'm trying to write a DataFrame into a Hive table (on S3) in Overwrite mode (necessary for my application) and need to decide between two methods of `DataFrameWriter` (Spark / Scala). From what I can read in the documentation, `df.write.saveAsTable` differs from `df.write.insertInto` in the following respects:
- `saveAsTable` uses column-name based resolution while `insertInto` uses position-based resolution
- In Append mode, `saveAsTable` pays more attention to the underlying schema of the existing table to make certain resolutions
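To make the resolution difference concrete, here is a minimal sketch of what I understand the two calls to do. It assumes a live `SparkSession` and a hypothetical existing Hive table `db.target` with columns `(id INT, name STRING)`; the names are mine, not from any docs:

```scala
import org.apache.spark.sql.SparkSession

// Assumes Hive support is available; db.target is a hypothetical
// existing table with columns (id INT, name STRING).
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
import spark.implicits._

// Column order here is deliberately swapped relative to the table.
val df = Seq(("alice", 1), ("bob", 2)).toDF("name", "id")

// saveAsTable resolves by column NAME, so the swapped order should be
// harmless: df's "id" column still lands in the table's id column.
df.write.mode("append").saveAsTable("db.target")

// insertInto resolves by POSITION: df's first column ("name") is written
// into the table's first column ("id") -- a silent mismatch here.
df.write.insertInto("db.target")
```

If my reading is right, the second call would corrupt the data without raising an error, which is part of why I want the semantics nailed down.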
Overall, it gives me the impression that `saveAsTable` is just a smarter version of `insertInto`. Alternatively, depending on the use-case, one might prefer `insertInto`. But does each of these methods come with caveats of its own, like a performance penalty in the case of `saveAsTable` (since it packs in more features)? Are there any other differences in their behaviour apart from what is told (not very clearly) in the docs?
**EDIT-1**

Documentation says this regarding `insertInto`:

> Inserts the content of the DataFrame to the specified table

and this for `saveAsTable`:

> In the case the table already exists, behavior of this function depends on the save mode, specified by the mode function
Now I can list my doubts:

- Does `insertInto` always expect the table to exist?
- Do `SaveMode`s have any impact on `insertInto`?
- If the answer to the above is yes, then:
  - what's the difference between `saveAsTable` with `SaveMode.Append` and `insertInto`, given that the table already exists?
  - does `insertInto` with `SaveMode.Overwrite` make any sense?
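For concreteness, the overwrite call I have in mind looks like this (again assuming the hypothetical existing table `db.target` from above; whether this behaves sensibly is exactly what I'm asking, not something I'm asserting):

```scala
import org.apache.spark.sql.SaveMode

// Does this overwrite the whole table, fail, or get ignored because
// insertInto uses its own semantics? This is the doubt, not an answer.
df.write.mode(SaveMode.Overwrite).insertInto("db.target")
```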