I have a column in my Spark DataFrame:
 |-- topics_A: array (nullable = true)
 |    |-- element: string (containsNull = true)
I'm using CountVectorizer on it:
topic_vectorizer_A = CountVectorizer(inputCol="topics_A", outputCol="topics_vec_A")
I get a NullPointerException because the topics_A column sometimes contains null.
Is there a way around this? Filling the nulls with a zero-length array would work fine (although it will increase the data size quite a lot), but I can't work out how to do a fillna on an array column in PySpark.