Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

Edit: The answer to this question is discussed at length in: Sum in Spark gone bad


In Compute Cost of Kmeans, we saw how one can compute the cost of a KMeans model. I was wondering whether we can also compute the Unbalanced factor?

If Spark provides no such functionality, is there an easy way to implement it?


I was not able to find a reference for the Unbalanced factor, but it should be similar to Yael's unbalanced_factor (the comments are mine):

// @hist: hist[i] is the number of points assigned to cluster i
// @n:    the number of clusters
double ivec_unbalanced_factor(const int *hist, long n) {
  int vw;
  double tot = 0, uf = 0;

  for (vw = 0 ; vw < n ; vw++) {
    tot += hist[vw];
    uf += hist[vw] * (double) hist[vw];
  }

  uf = uf * n / (tot * tot);

  return uf;

}

which I found here.

So the idea is that tot (for total) ends up equal to the number of points assigned to clusters (i.e. the size of our dataset), while uf (for unbalanced factor) accumulates the sum of the squared cluster sizes.

Finally, the factor is computed as uf = uf * n / (tot * tot);, i.e. n times the sum of squared cluster sizes, divided by the square of the total number of points.
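
To make the formula concrete, here is a minimal Python sketch of the same computation on a plain list of cluster sizes. The function name unbalanced_factor and the sample histograms are illustrative assumptions and are not part of Yael or Spark. With perfectly balanced clusters the factor is 1, and it grows toward n as the points concentrate in a single cluster.

```python
def unbalanced_factor(hist):
    """Return n * sum(h_i^2) / (sum(h_i))^2 for a list of cluster sizes.

    hist[i] is the number of points assigned to cluster i (hypothetical input).
    """
    n = len(hist)
    tot = sum(hist)
    if n == 0 or tot == 0:
        raise ValueError("hist must contain at least one assigned point")
    # Sum of squared cluster sizes, accumulated as floats like the C version.
    uf = sum(h * float(h) for h in hist)
    return uf * n / (float(tot) * tot)


# Illustrative histograms for 100 points spread over 4 clusters.
balanced = unbalanced_factor([25, 25, 25, 25])   # every cluster has the same size
skewed = unbalanced_factor([70, 10, 10, 10])     # one cluster dominates
degenerate = unbalanced_factor([100, 0, 0, 0])   # all points in one cluster

print(balanced, skewed, degenerate)
```

The three cases cover the range of the measure: the balanced histogram gives the minimum value of 1, and the degenerate one gives the maximum, which equals the number of clusters.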


1 Answer

In Python it could be something like:

# Assumes an RDD of (cluster, vector) tuples: the key is the cluster and the value is the feature vector.
def unbalancedFactor(rdd):
  # Number of points per cluster; cached because the three actions below reuse it.
  pdd = rdd.map(lambda x: (x[0], 1)).reduceByKey(lambda a, b: a + b).cache()
  n = pdd.count()                                   # number of (non-empty) clusters
  total = pdd.map(lambda x: x[1]).sum()             # total number of points
  uf = pdd.map(lambda x: x[1] * float(x[1])).sum()  # sum of squared cluster sizes

  return uf * n / (total * total)
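
Since the number of clusters is usually small, another option is to bring the per-cluster counts to the driver and finish the arithmetic locally. The sketch below assumes the same RDD of (cluster, vector) pairs and relies on RDD.countByKey(), which returns a dictionary of counts per key. The helper name unbalanced_factor_from_counts is hypothetical, and the Spark call appears only as a comment so that the block runs without a cluster.

```python
def unbalanced_factor_from_counts(counts):
    """counts: mapping from cluster id to the number of points assigned to it."""
    sizes = list(counts.values())
    n = len(sizes)
    total = sum(sizes)
    if total == 0:
        raise ValueError("no points assigned to any cluster")
    # Sum of squared cluster sizes.
    uf = sum(s * float(s) for s in sizes)
    return uf * n / (float(total) * total)


# With Spark (assumed usage): counts = rdd.countByKey()
counts = {0: 70, 1: 10, 2: 10, 3: 10}  # stand-in for the dictionary Spark would return
print(unbalanced_factor_from_counts(counts))
```

Note that a cluster with no assigned points never shows up as a key, so n here counts only the non-empty clusters. If empty clusters should count toward n, the known number of clusters k would have to be passed in instead.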
