Clustering from correlation matrix
Prerequisites for annotating the produced dendrogram
Suppose we have a correlation matrix produced from the correlated_reactions function, as shown in the previous section. After a split of the reversible reactions to separate forward and reverse reactions, we need to map the produced reverse reactions to their KEGG terms (identical term, as the forward reaction).
The dictionary_map_reverse_reaction_id_to_pathway function is used when we split bidirectional reactions to separate forward and reverse reactions. It maps the reverse reaction to the corresponding pathway (the one that the forward reactions maps to). It enriches the dictionary created from the dictionary_reaction_id_to_pathway function and we can annotate the produced dendrogram with pathway information.
reaction_id_to_pathway_dictis a dictionary mapping reaction IDs (default-forward IDs) to pathway namesfor_rev_reactionsis a list of the splitted reactions (forward ID is the default and reverse ID has the “_rev” suffix)
group_map_100 = dictionary_map_reverse_reaction_id_to_pathway(
reaction_id_to_pathway_dict = bigg_to_pathway_dict,
for_rev_reactions = subset_extended_reactions_100)
print(group_map_100.get("PGM"))
print(group_map_100.get("PGM_rev"))
Glycolysis
Glycolysis
Hierarchical Clustering with Dendrogram Visualization
The default options to perform clustering in dingo-stats is the hierarchical clustering. This is a suggested method, however the user can procceed to clustering with any of the available methods, that are also presented in the next section.
The clustering_of_correlation_matrix function performs hierarchical clustering of the correlation matrix. It returns the dissimilarity_matrix (1sr arg.) calculated from the given correlation matrix, the integer labels (2nd arg.) corresponding to clusters and the clusters list (3rd arg.) which is a nested list containing reaction IDs grouped based on their cluster labels
correlation_matrixis a numpy 2D array of a correlation matrixreactionsis a list with the corresponding reaction IDs (must be in accordance with the order of rows in the correlation matrix)linkageis a string variable that defines the type of linkage. Available linkage types are: single, average, complete, ward.tis a float variable that defines a threshold that cuts the dendrogram at a specific height and produces clusterscorrectionis a boolean variable that ifTrueconverts the values of the the correlation matrix to absolute values.
(dissimilarity_matrix,
labels,
clusters) = clustering_of_correlation_matrix(
correlation_matrix = subset_mixed_correlation_matrix_100,
reactions = subset_extended_reactions_100,
linkage = "ward",
t = 1.0,
correction = False)
The plot_dendrogram function plots the dendrogram occured from hierarchical clustering
dissimilarity_matrixis a dissimilarity matrix calculated from theclustering_of_correlation_matrixfunctionreactionsis a list with reactions IDs, that must be in accordance with the order of rows in thedissimilarity_matrixshow_labelsis a boolean variable that ifTruelabels are shown in the x-axistis a float variable that defines a threshold that cuts the dendrogram at a specific heightlinkageis a string variable that defines the type of linkage. Available linkage types are: single, average, complete, ward.group_mapis a dictionary mapping reactions to pathways, produced from thedictionary_map_reverse_reaction_id_to_pathwayfunctionlabel_fontsizeis an integer variable that defines the size for the plotted labels (reaction IDs)heightis an integer variable that defines the height of the plotwidthis an integer variable that defines the width of the plottitleis the title for the dendrogram plot
plot_dendrogram(
dissimilarity_matrix = dissimilarity_matrix_100,
reactions = subset_extended_reactions_100,
show_labels = True,
t = 5,
linkage = "ward",
group_map = group_map_100,
label_fontsize = 8,
height = 600,
width = 1000,
title = "" )

Benchmarking with other clustering methods
In the clustering_of_correlation_matrix function we have implemented only the hierarchical (Agglomerative) clustering of the correlation matrix. However, other powerful clustering algorithms exist and we encourage users to test the different methods and decide which they are going to use.
First, we start with the elbow method to find the ideal number of clusters, as some of the clustering algorithms used below use this as a parameter.
Here, we can see that n_clusers = 3, seems to produce better results
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
X = subset_mixed_correlation_matrix_100.copy()
wcss = []
for i in range(1, 11):
kmeans = KMeans(n_clusters = i, init = 'k-means++',
random_state = 42)
kmeans.fit(X)
wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

Benchmark across KMeans, AgglomerativeClustering, DBSCAN, HDBSCAN and SpectralClustering clustering methods with the Silhouette, Calinski-Harabasz and Davies-Bouldin scores, reveals which method produces better clusters
Here, we benchmark across these clustering methods and provide some metrics to give insights on clustering quality:
SilhouetteScore: Scores closer to 1 are betterCalinski-HarabaszScore: Higher scores are betterDavies-BouldinScore: Lower scores are better
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN, SpectralClustering
import hdbscan
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score
import numpy as np
X = subset_mixed_correlation_matrix_100.copy()
n_clusters = 3
kmeans_labels = KMeans(n_clusters=n_clusters, random_state=42).fit_predict(X)
agg_labels = AgglomerativeClustering(n_clusters=n_clusters, metric='euclidean', linkage='average').fit_predict(X)
dbscan = DBSCAN(eps=0.5, min_samples=n_clusters)
dbscan_labels = dbscan.fit_predict(X)
hdb = hdbscan.HDBSCAN(min_cluster_size=n_clusters)
hdbscan_labels = hdb.fit_predict(X)
spectral_labels = SpectralClustering(n_clusters=n_clusters, affinity='nearest_neighbors', random_state=42).fit_predict(X)
clusterings = {}
clusterings["KMeans"] = kmeans_labels
clusterings["Agglomerative"] = agg_labels
clusterings["DBSCAN"] = dbscan_labels
clusterings["HDBSCAN"] = hdbscan_labels
clusterings["Spectral"] = spectral_labels
def evaluate(X, labels, name):
labels = np.array(labels)
if len(np.unique(labels)) < 2 or np.all(labels == -1):
print(f"\n{name}: Not enough clusters to evaluate.")
return
print(f"\n{name} Evaluation:")
# Scores closer to 1 are better
print(f"Silhouette Score: {silhouette_score(X, labels):.4f}")
# Higher scores are better
print(f"Calinski-Harabasz Score: {calinski_harabasz_score(X, labels):.4f}")
# Lower scores are better
print(f"Davies-Bouldin Score: {davies_bouldin_score(X, labels):.4f}")
for name, labels in clusterings.items():
evaluate(X, labels, name)
KMeans Evaluation:
Silhouette Score: 0.8567
Calinski-Harabasz Score: 321.7593
Davies-Bouldin Score: 0.3168
Agglomerative Evaluation:
Silhouette Score: 0.8567
Calinski-Harabasz Score: 321.7593
Davies-Bouldin Score: 0.3168
DBSCAN Evaluation:
Silhouette Score: 0.8567
Calinski-Harabasz Score: 321.7593
Davies-Bouldin Score: 0.3168
HDBSCAN Evaluation:
Silhouette Score: 0.7811
Calinski-Harabasz Score: 207.2086
Davies-Bouldin Score: 0.3169
Spectral Evaluation:
Silhouette Score: 0.8567
Calinski-Harabasz Score: 321.7593
Davies-Bouldin Score: 0.3168
PCA and tSNE methods to reduce the dimensions of the correlation matrix
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
X_tsne = TSNE(n_components=2, perplexity=3, random_state=42).fit_transform(X)
Define a function that will enable us to visualize clustering results
import numpy as np
def plot_clusters(embedding, labels, title, ax):
labels = np.array(labels)
unique_labels = np.unique(labels)
for lab in unique_labels:
mask = labels == lab
ax.scatter(embedding[mask, 0], embedding[mask, 1], label=f"Cluster {lab}", s=80)
ax.set_title(title)
ax.legend()
Visualize clustering results on a 2-dimensional space occured from the PCA algorithm
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.ravel()
for i, (name, labels) in enumerate(clusterings.items()):
plot_clusters(X_pca, labels, f"{name} (PCA)", axes[i])
plt.show()

Visualize clustering results on a 2-dimensional space occured from the tSNE algorithm
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.ravel()
for i, (name, labels) in enumerate(clusterings.items()):
plot_clusters(X_tsne, labels, f"{name} (tSNE)", axes[i])
plt.show()
