I have a dataframe constructed with the transform() method of the VectorAssembler class. In addition, I have a trained k-means model whose clusterCenters(i) method returns the i-th center point. Each center point has the same dimension as each row of the dataframe (when converted to a vector). The number of center points is twice the number of rows in the dataframe.
Now I want to calculate the cosine similarity between each row of the dataframe and each center point vector, and append the result to a list. Here is my code:
import scala.collection.mutable.ListBuffer
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.mllib.linalg.Vectors
import breeze.linalg.DenseVector

val cosine_list = ListBuffer(("sample_string", 0.0)) // first item shows the data structure of the list
for (i <- 0 until k) { // k: number of rows in the dataframe
  val cen0 = df.select("features").collect()(i).getAs[Vector](0)
  val cen0_new = Vectors.fromML(cen0) // convert ml Vector to mllib Vector for norm()
  for (j <- 0 until 2 * k) { // number of center points is 2 * number of rows in df
    val cen1 = model.clusterCenters(j) // the j-th center point vector
    val cen1_new = Vectors.fromML(cen1)
    val sqr_cen0 = Vectors.norm(cen0_new, 2)
    val sqr_cen1 = Vectors.norm(cen1_new, 2)
    val dot1 = DenseVector(cen0_new.toArray).dot(DenseVector(cen1_new.toArray))
    val cos = dot1 / (sqr_cen0 * sqr_cen1)
    val map_name = s"${i}_${j}"
    cosine_list.append((map_name, cos))
  }
}
The above code works, but it takes a lot of time (which of course also depends on the size of the data). My question: can this snippet be made more efficient (by using another API or otherwise)? Thanks in advance!
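One thing I have considered: the loop calls df.select("features").collect() once per row and recomputes the center norms on every inner iteration, so the driver re-collects the whole dataframe k times. A sketch of a reworked version (assuming the same df, model, and k as above; all other names are mine) that collects once and precomputes all norms up front, keeping the rest of the logic identical:

```scala
import org.apache.spark.ml.linalg.Vector

// Collect the feature vectors ONCE and work with plain arrays locally
val rows: Array[Array[Double]] =
  df.select("features").collect().map(_.getAs[Vector](0).toArray)
val centers: Array[Array[Double]] =
  model.clusterCenters.map(_.toArray)

// Plain dot product on arrays; avoids per-pair Vector conversions
def dot(a: Array[Double], b: Array[Double]): Double = {
  var s = 0.0
  var n = 0
  while (n < a.length) { s += a(n) * b(n); n += 1 }
  s
}

// Precompute every L2 norm once instead of inside the nested loop
val rowNorms = rows.map(a => math.sqrt(dot(a, a)))
val cenNorms = centers.map(a => math.sqrt(dot(a, a)))

val cosine_list = for {
  i <- rows.indices
  j <- centers.indices
} yield (s"${i}_${j}", dot(rows(i), centers(j)) / (rowNorms(i) * cenNorms(j)))
```

I'm not sure this is the idiomatic Spark way; if the dataframe is too large to collect to the driver, presumably the right approach would instead be to broadcast the (small) array of centers and compute the cosines inside a map over the dataframe, but the single-collect version above already removes the repeated collect() and norm calls.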
question from:https://stackoverflow.com/questions/65896652/improvement-of-scala-spark-code-snippet-for-calculating-cosine-similarity-in-ter