Update: I want to achieve this in Scala Spark, not PySpark, so the answer suggested in the question "Convert KMeans "centers" output to PySpark dataframe" doesn't work (there is no tolist method).
I have trained a k-means model. With "model.clusterCenters" I can get the vector of each cluster center; the return type is Array[Vector]. With "model.transform" I can get the predicted cluster index for each sample vector as a new column.
Suppose the training DataFrame looks like this:
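For reference, the two calls mentioned above can be sketched as follows. This is only a minimal illustration: the object name KMeansInspect, the helper fitAndInspect, and the parameter k are hypothetical, and the column names "features" and "prediction" are assumed to match the tables in this question.

```scala
import org.apache.spark.ml.clustering.{KMeans, KMeansModel}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.DataFrame

object KMeansInspect {
  // Hypothetical helper: fits k-means and returns both outputs discussed here.
  def fitAndInspect(training: DataFrame, k: Int): (Array[Vector], DataFrame) = {
    val model: KMeansModel = new KMeans()
      .setK(k)
      .setFeaturesCol("features")     // assumed name of the vector column
      .setPredictionCol("prediction") // assumed name of the output column
      .fit(training)

    // One vector per cluster; the array position is the cluster index.
    val centers: Array[Vector] = model.clusterCenters

    // Input rows plus an integer "prediction" column.
    val predictions: DataFrame = model.transform(training)

    (centers, predictions)
  }
}
```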
+------------+--------------------+
|rest column |            features|
+------------+--------------------+
|     1646177|[231.8,232.1,233....|
|     1646177|[232.2,234.2,234....|
|     1646178|[241.1,234.1,244....|
|         ...|                 ...|
+------------+--------------------+
After model.transform(), I get the following:
+------------+--------------------+----------+
|rest column |            features|prediction|
+------------+--------------------+----------+
|     1646177|[231.8,232.1,233....|        01|
|     1646177|[232.2,234.2,234....|        01|
|     1646178|[232.1,234.1,234....|        02|
|         ...|                 ...|       ...|
+------------+--------------------+----------+
With "model.clusterCenters", I get an array of vectors like the following:
[230.99036144578324,231.08433734939757,231.3566265060241...]
[160.6,159.9,177.2...]
[69.3,70.1,70.6...]
...
where the number of cluster centers corresponds to the number of unique values (01, 02, ...) in the "prediction" column of the DataFrame generated by model.transform.
What I want to achieve is to combine the two, so that each row also carries the center of its predicted cluster. The expected result should look like the following:
+------------+--------------------+----------+---------------+
|rest column |            features|prediction|cluster centers|
+------------+--------------------+----------+---------------+
|     1646177|[231.8,232.1,233....|        01|[123,456,789...|
|     1646177|[232.2,234.2,234....|        01|[123,456,789...|
|     1646178|[232.1,234.1,234....|        02|[232,243,223...|
|         ...|                 ...|       ...|            ...|
+------------+--------------------+----------+---------------+
Any suggestion is appreciated!
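One possible direction, sketched here under assumptions rather than as a confirmed solution: in Spark ML the prediction column holds the zero-based integer index into model.clusterCenters, so the centers can be turned into a small lookup DataFrame keyed by that index and joined back onto the output of model.transform. The object name ClusterCenterJoin, the helper withClusterCenters, and the output column name "cluster_center" are hypothetical choices for illustration.

```scala
import org.apache.spark.ml.clustering.KMeansModel
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.broadcast

object ClusterCenterJoin {
  // Hypothetical helper: appends each row's cluster center as a vector column.
  // `transformed` is assumed to be the result of model.transform(...).
  def withClusterCenters(model: KMeansModel, transformed: DataFrame): DataFrame = {
    val spark = transformed.sparkSession
    import spark.implicits._

    // Use the model's own prediction column name in case it was customized.
    val predictionCol = model.getPredictionCol

    // Build (index, center) pairs; the array position is the cluster index.
    val centers = model.clusterCenters.zipWithIndex
      .map { case (center, idx) => (idx, center) }
      .toSeq
      .toDF(predictionCol, "cluster_center") // "cluster_center" is an assumed name

    // The lookup table has only k rows, so a broadcast join hint is reasonable.
    transformed.join(broadcast(centers), Seq(predictionCol), "left")
  }
}
```

Usage would be along the lines of ClusterCenterJoin.withClusterCenters(model, model.transform(df)). Note that joining on the prediction column moves it to the front of the result, so a select may be needed to restore the column order shown in the expected output. A UDF that indexes into a broadcast copy of the centers array would be an alternative to the join.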
question from:https://stackoverflow.com/questions/65913728/combine-model-clustercenters-and-model-transfrom-of-k-means-in-scala-spark