I have a dataframe constructed with the transform() method of the VectorAssembler class. In addition, I have a trained k-means model whose clusterCenters(i) method returns the i-th center point. Each center point has the same dimension as each row of the dataframe (when converted to a vector). The number of center points is twice the number of rows in the dataframe.
Now I want to calculate the cosine similarity between each row of the dataframe and each center point vector, and append the result to a list. Here is my code:
import scala.collection.mutable.ListBuffer
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.mllib.linalg.Vectors
import breeze.linalg.DenseVector

val cosine_list = ListBuffer(("sample_string", 0.0)) // first item shows the data structure of the list
for (i <- 0 until k) { // k: number of rows in the dataframe
  val cen0 = df.select("features").collect()(i).getAs[Vector](0)
  val cen0_new = Vectors.fromML(cen0) // convert ml Vector to mllib Vector for norm()
  for (j <- 0 until 2 * k) { // number of center points is 2 * number of rows in df
    val cen1 = model.clusterCenters(j) // the j-th center point vector
    val cen1_new = Vectors.fromML(cen1)
    val sqr_cen0 = Vectors.norm(cen0_new, 2)
    val sqr_cen1 = Vectors.norm(cen1_new, 2)
    val dot1 = DenseVector(cen0_new.toArray).dot(DenseVector(cen1_new.toArray))
    val cos = dot1 / (sqr_cen0 * sqr_cen1)
    val map_name = s"${i}_${j}"
    cosine_list.append((map_name, cos))
  }
}
The above code works, but it takes a lot of time (which of course also depends on the size of the data). My question: can this snippet be made more efficient (by using another API or otherwise)? Thanks in advance!
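One thing I have considered: the loop calls df.select("features").collect() once per row and recomputes the center norms on every inner iteration, so the driver re-collects the whole dataframe k times. A sketch of a reworked version (assuming the same df, model, and k as above; all other names are mine) that collects once and precomputes all norms up front, keeping the rest of the logic identical:

```scala
import org.apache.spark.ml.linalg.Vector

// Collect the feature vectors ONCE and work with plain arrays locally
val rows: Array[Array[Double]] =
  df.select("features").collect().map(_.getAs[Vector](0).toArray)
val centers: Array[Array[Double]] =
  model.clusterCenters.map(_.toArray)

// Plain dot product on arrays; avoids per-pair Vector conversions
def dot(a: Array[Double], b: Array[Double]): Double = {
  var s = 0.0
  var n = 0
  while (n < a.length) { s += a(n) * b(n); n += 1 }
  s
}

// Precompute every L2 norm once instead of inside the nested loop
val rowNorms = rows.map(a => math.sqrt(dot(a, a)))
val cenNorms = centers.map(a => math.sqrt(dot(a, a)))

val cosine_list = for {
  i <- rows.indices
  j <- centers.indices
} yield (s"${i}_${j}", dot(rows(i), centers(j)) / (rowNorms(i) * cenNorms(j)))
```

I'm not sure this is the idiomatic Spark way; if the dataframe is too large to collect to the driver, presumably the right approach would instead be to broadcast the (small) array of centers and compute the cosines inside a map over the dataframe, but the single-collect version above already removes the repeated collect() and norm calls.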
question from:https://stackoverflow.com/questions/65896652/improvement-of-scala-spark-code-snippet-for-calculating-cosine-similarity-in-ter