I'm trying to cluster a group of points in a probabilistic manner. Using below, I have a single set of xy points, which are recorded in X
and Y
. I want to cluster into groups using a reference point, which is displayed in X2
and Y2
.
With the help of an answer the current approach is to measure the distance from the reference point and group using k-means. Although, it provides a method to cluster using the reference point, the hard cutoff and adherence to k clusters makes it somewhat unsuitable when dealing with numerous datasets. For instance, the number of clusters needed for this example is probably 3. But a separate example may different. I'd have to manually go through and alter k every time.
Given the non-probabilistic nature of k-means a separate option could be GMM
. Is it possible to account for the reference point when modelling? If I attach the output below the underlying model isn't clustering as I'm hoping for.
If I look at the probability each point is within a group it's not clustered as I'd hoped. With this I run into the same problem with manually altering the amount of components. Because the points are distributed randomly, using “AIC” or “BIC” to select the appropriate number of clusters doesn't work. There is no optimal number.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
df = pd.DataFrame({
'X' : [-1.0,-1.0,0.5,0.0,0.0,2.0,3.0,5.0,0.0,-2.5,2.0,8.0,-10.5,15.0,-20.0,-32.0,-20.0,-20.0,-10.0,20.5,0.0,20.0,-30.0,-15.0,20.0,-15.0,-10.0],
'Y' : [0.0,1.0,-0.5,0.5,-0.5,0.0,1.0,4.0,5.0,-3.5,-2.0,-8.0,-0.5,-10.5,-20.5,0.0,16.0,-15.0,5.0,13.5,20.0,-20.0,2.0,-17.5,-15,19.0,20.0],
'X2' : [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],
'Y2' : [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],
})
k-means:
df['distance'] = np.sqrt(df['X']**2 + df['Y']**2)
df['distance'] = np.sqrt((df['X2'] - df['Y2'])**2 + (df['BallY'] - df['y_post'])**2)
model = KMeans(n_clusters = 2)
model_data = np.array([df['distance'].values, np.zeros(df.shape[0])])
model.fit(model_data.T)
df['group'] = model.labels_
plt.scatter(df['X'], df['Y'], c = model.labels_, cmap = 'bwr', marker = 'o', s = 5)
plt.scatter(df['X2'], df['Y2'], c ='k', marker = 'o', s = 5)
GMM:
Y_sklearn = df[['X','Y']].values
gmm = mixture.GaussianMixture(n_components=3, covariance_type='diag', random_state=42)
gmm.fit(Y_sklearn)
labels = gmm.predict(Y_sklearn)
df['group'] = labels
plt.scatter(Y_sklearn[:, 0], Y_sklearn[:, 1], c=labels, s=5, cmap='viridis');
plt.scatter(df['X2'], df['Y2'], c='red', marker = 'x', edgecolor = 'k', s = 5, zorder = 10)
proba = pd.DataFrame(gmm.predict_proba(Y_sklearn).round(2)).reset_index(drop = True)
df_pred = pd.concat([df, proba], axis = 1)
question from:https://stackoverflow.com/questions/65947535/cluster-groups-continuously-instead-of-discrete-python