First and foremost, don't use null in your Scala code unless you really have to for compatibility reasons.
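As a minimal sketch of the alternative, a possibly-missing value can be modeled with Option instead of null (the findUser function here is hypothetical):

```scala
// Sketch: prefer Option over null for values that may be absent.
def findUser(id: Int): Option[String] =
  if (id == 1) Some("alice") else None  // instead of returning null

// Callers handle absence explicitly, no NullPointerException possible:
val greeting = findUser(2).map("Hello, " + _).getOrElse("unknown user")
```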
Regarding your question: it is plain SQL. col("c1") === null is interpreted as c1 = NULL and, because NULL marks undefined values, the result is undefined for any value, including NULL itself.
spark.sql("SELECT NULL = NULL").show
+-------------+
|(NULL = NULL)|
+-------------+
| null|
+-------------+
spark.sql("SELECT NULL != NULL").show
+-------------------+
|(NOT (NULL = NULL))|
+-------------------+
| null|
+-------------------+
spark.sql("SELECT TRUE != NULL").show
+------------------------------------+
|(NOT (true = CAST(NULL AS BOOLEAN)))|
+------------------------------------+
| null|
+------------------------------------+
spark.sql("SELECT TRUE = NULL").show
+------------------------------+
|(true = CAST(NULL AS BOOLEAN))|
+------------------------------+
| null|
+------------------------------+
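The same undefined result surfaces in the DataFrame DSL; a small sketch, assuming a SparkSession named spark and a column c1:

```scala
import org.apache.spark.sql.functions.{col, lit}
import spark.implicits._  // assumes a SparkSession named `spark` is in scope

val df = Seq(Some(1), None).toDF("c1")
// c1 = NULL is undefined (null) for every row, including the NULL row itself
df.select(col("c1") === lit(null)).show()
```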
The only valid methods to check for NULL are:

IS NULL:
spark.sql("SELECT NULL IS NULL").show
+--------------+
|(NULL IS NULL)|
+--------------+
| true|
+--------------+
spark.sql("SELECT TRUE IS NULL").show
+--------------+
|(true IS NULL)|
+--------------+
| false|
+--------------+
IS NOT NULL:
spark.sql("SELECT NULL IS NOT NULL").show
+------------------+
|(NULL IS NOT NULL)|
+------------------+
| false|
+------------------+
spark.sql("SELECT TRUE IS NOT NULL").show
+------------------+
|(true IS NOT NULL)|
+------------------+
| true|
+------------------+
implemented in DataFrame
DSL as Column.isNull
and Column.isNotNull
respectively.
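In the DSL that looks like the following (df and the column name c1 are assumed):

```scala
import org.apache.spark.sql.functions.col

val withValue = df.filter(col("c1").isNotNull)  // WHERE c1 IS NOT NULL
val missing   = df.filter(col("c1").isNull)     // WHERE c1 IS NULL
```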
Note: for NULL-safe comparisons use IS DISTINCT FROM / IS NOT DISTINCT FROM:
spark.sql("SELECT NULL IS NOT DISTINCT FROM NULL").show
+---------------+
|(NULL <=> NULL)|
+---------------+
| true|
+---------------+
spark.sql("SELECT NULL IS NOT DISTINCT FROM TRUE").show
+--------------------------------+
|(CAST(NULL AS BOOLEAN) <=> true)|
+--------------------------------+
| false|
+--------------------------------+
or not(_ <=> _) / <=>:
spark.sql("SELECT NULL AS col1, NULL AS col2").select($"col1" <=> $"col2").show
+---------------+
|(col1 <=> col2)|
+---------------+
| true|
+---------------+
spark.sql("SELECT NULL AS col1, TRUE AS col2").select($"col1" <=> $"col2").show
+---------------+
|(col1 <=> col2)|
+---------------+
| false|
+---------------+
in SQL and DataFrame
DSL respectively.
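The symbolic <=> also has a named alias, Column.eqNullSafe (available since Spark 2.3), which reads better where symbolic operators are awkward; a sketch assuming two DataFrames df1 and df2 with a shared key column:

```scala
// NULL-safe join condition: NULL keys match NULL keys instead of being dropped
df1.join(df2, df1("key") <=> df2("key"))

// same condition, named method instead of the symbolic operator
df1.join(df2, df1("key").eqNullSafe(df2("key")))
```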
Related:
Including null values in an Apache Spark Join