Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

I'm trying to cluster a group of points in a probabilistic manner. Using below, I have a single set of xy points, which are recorded in X and Y. I want to cluster into groups using a reference point, which is displayed in X2 and Y2.

With the help of an answer the current approach is to measure the distance from the reference point and group using k-means. Although, it provides a method to cluster using the reference point, the hard cutoff and adherence to k clusters makes it somewhat unsuitable when dealing with numerous datasets. For instance, the number of clusters needed for this example is probably 3. But a separate example may different. I'd have to manually go through and alter k every time.

Given the non-probabilistic nature of k-means a separate option could be GMM. Is it possible to account for the reference point when modelling? If I attach the output below the underlying model isn't clustering as I'm hoping for.

If I look at the probability each point is within a group it's not clustered as I'd hoped. With this I run into the same problem with manually altering the amount of components. Because the points are distributed randomly, using “AIC” or “BIC” to select the appropriate number of clusters doesn't work. There is no optimal number.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

df = pd.DataFrame({                                   
    'X' : [-1.0,-1.0,0.5,0.0,0.0,2.0,3.0,5.0,0.0,-2.5,2.0,8.0,-10.5,15.0,-20.0,-32.0,-20.0,-20.0,-10.0,20.5,0.0,20.0,-30.0,-15.0,20.0,-15.0,-10.0],
    'Y' : [0.0,1.0,-0.5,0.5,-0.5,0.0,1.0,4.0,5.0,-3.5,-2.0,-8.0,-0.5,-10.5,-20.5,0.0,16.0,-15.0,5.0,13.5,20.0,-20.0,2.0,-17.5,-15,19.0,20.0],     
    'X2' : [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],
    'Y2' : [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],           
    })

enter image description here

k-means:

df['distance'] = np.sqrt(df['X']**2 + df['Y']**2)
df['distance'] = np.sqrt((df['X2'] - df['Y2'])**2 + (df['BallY'] - df['y_post'])**2)

model = KMeans(n_clusters = 2) 

model_data = np.array([df['distance'].values, np.zeros(df.shape[0])])
model.fit(model_data.T) 
df['group'] = model.labels_ 

plt.scatter(df['X'], df['Y'], c = model.labels_, cmap = 'bwr', marker = 'o', s = 5)
plt.scatter(df['X2'], df['Y2'], c ='k', marker = 'o', s = 5)

enter image description here

GMM:

Y_sklearn = df[['X','Y']].values

gmm = mixture.GaussianMixture(n_components=3, covariance_type='diag', random_state=42)
gmm.fit(Y_sklearn)
labels = gmm.predict(Y_sklearn)
df['group'] = labels
plt.scatter(Y_sklearn[:, 0], Y_sklearn[:, 1], c=labels, s=5, cmap='viridis');
plt.scatter(df['X2'], df['Y2'], c='red', marker = 'x', edgecolor = 'k', s = 5, zorder = 10)

proba = pd.DataFrame(gmm.predict_proba(Y_sklearn).round(2)).reset_index(drop = True)
df_pred = pd.concat([df, proba], axis = 1)

enter image description here

question from:https://stackoverflow.com/questions/65947535/cluster-groups-continuously-instead-of-discrete-python

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
182 views
Welcome To Ask or Share your Answers For Others

1 Answer

Do you mean density estimation? You can model your data as a Gaussian Mixture and then get a probability of a point to belong to the mixture. You can use sklearn.mixture.GaussianMixture for that. By changing number of components you can control how many clusters you will have. The metric to cluster on is Euclidian distance from the reference point. So the GMM model will provide you with prediction of which cluster the data point should be classified to.

Since your metric is 1d, you will get a set of Gaussian distributions, i.e. a set of means and variances. So you can easily calculate the probability of any point to be in certain cluster, just by calculating how far it is from the reference point and put the value in the normal distribution pdf formula.

To make image more clear, I'm changing the reference point to (-5, 5) and select number of clusters = 4. In order to get the best number of clusters, use some metric that minimizes total variance and penalizes growth of number of mixtures. For example argmin(model.covariances_.sum()*num_clusters)

import pandas as pd
from sklearn.mixture import GaussianMixture
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
df = pd.DataFrame({                                   
    'X' : [-1.0,-1.0,0.5,0.0,0.0,2.0,3.0,5.0,0.0,-2.5,2.0,8.0,-10.5,15.0,-20.0,-32.0,-20.0,-20.0,-10.0,20.5,0.0,20.0,-30.0,-15.0,20.0,-15.0,-10.0],
    'Y' : [0.0,1.0,-0.5,0.5,-0.5,0.0,1.0,4.0,5.0,-3.5,-2.0,-8.0,-0.5,-10.5,-20.5,0.0,16.0,-15.0,5.0,13.5,20.0,-20.0,2.0,-17.5,-15,19.0,20.0],     
    })

ref_X, ref_Y = -5, 5
dist  = np.sqrt((df.X-ref_X)**2 + (df.Y-ref_Y)**2)

n_mix = 4
gmm = GaussianMixture(n_mix)
model = gmm.fit(dist.values.reshape(-1,1))
x = np.linspace(-35., 35.)
y = np.linspace(-30., 30.)
X, Y = np.meshgrid(x, y)
XX = np.sqrt((X.ravel() - ref_X)**2 + (Y.ravel() - ref_Y)**2)
Z = model.score_samples(XX.reshape(-1,1))
Z = Z.reshape(X.shape)

# plot grid points probabilities
plt.set_cmap('plasma')
plt.contourf(X, Y, Z, 40)
plt.scatter(df.X, df.Y, c=model.predict(dist.values.reshape(-1,1)), edgecolor='black') 

enter image description here

You can read more here and here

P.S. score_samples() returns log likelihoods, use exp() to convert to probability


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

548k questions

547k answers

4 comments

86.3k users

...