# Clustering from correlation matrix ## Prerequisites for annotating the produced dendrogram Suppose we have a correlation matrix produced from the `correlated_reactions` function, as shown in the previous section. After a split of the reversible reactions to separate forward and reverse reactions, we need to map the produced reverse reactions to their KEGG terms (identical term, as the forward reaction). The `dictionary_map_reverse_reaction_id_to_pathway` function is used when we split bidirectional reactions to separate forward and reverse reactions. It maps the reverse reaction to the corresponding pathway (the one that the forward reactions maps to). It enriches the dictionary created from the `dictionary_reaction_id_to_pathway` function and we can annotate the produced dendrogram with pathway information. - `reaction_id_to_pathway_dict` is a dictionary mapping reaction IDs (default-forward IDs) to pathway names - `for_rev_reactions` is a list of the splitted reactions (forward ID is the default and reverse ID has the "_rev" suffix) ```python group_map_100 = dictionary_map_reverse_reaction_id_to_pathway( reaction_id_to_pathway_dict = bigg_to_pathway_dict, for_rev_reactions = subset_extended_reactions_100) print(group_map_100.get("PGM")) print(group_map_100.get("PGM_rev")) ``` ```python Glycolysis Glycolysis ``` ## Hierarchical Clustering with Dendrogram Visualization The default options to perform clustering in `dingo-stats` is the hierarchical clustering. This is a suggested method, however the user can procceed to clustering with any of the available methods, that are also presented in the next section. The `clustering_of_correlation_matrix` function performs hierarchical clustering of the correlation matrix. It returns the `dissimilarity_matrix` (1sr arg.) calculated from the given correlation matrix, the integer `labels` (2nd arg.) corresponding to clusters and the `clusters` list (3rd arg.) which is a nested list containing reaction IDs grouped based on their cluster labels - `correlation_matrix` is a numpy 2D array of a correlation matrix - `reactions` is a list with the corresponding reaction IDs (must be in accordance with the order of rows in the correlation matrix) - `linkage` is a string variable that defines the type of linkage. Available linkage types are: single, average, complete, ward. - `t` is a float variable that defines a threshold that cuts the dendrogram at a specific height and produces clusters - `correction` is a boolean variable that if `True` converts the values of the the correlation matrix to absolute values. ```python (dissimilarity_matrix, labels, clusters) = clustering_of_correlation_matrix( correlation_matrix = subset_mixed_correlation_matrix_100, reactions = subset_extended_reactions_100, linkage = "ward", t = 1.0, correction = False) ``` The `plot_dendrogram` function plots the dendrogram occured from hierarchical clustering - `dissimilarity_matrix` is a dissimilarity matrix calculated from the `clustering_of_correlation_matrix` function - `reactions` is a list with reactions IDs, that must be in accordance with the order of rows in the `dissimilarity_matrix` - `show_labels` is a boolean variable that if `True` labels are shown in the x-axis - `t` is a float variable that defines a threshold that cuts the dendrogram at a specific height - `linkage` is a string variable that defines the type of linkage. Available linkage types are: single, average, complete, ward. - `group_map` is a dictionary mapping reactions to pathways, produced from the `dictionary_map_reverse_reaction_id_to_pathway` function - `label_fontsize` is an integer variable that defines the size for the plotted labels (reaction IDs) - `height` is an integer variable that defines the height of the plot - `width` is an integer variable that defines the width of the plot - `title` is the title for the dendrogram plot ```python plot_dendrogram( dissimilarity_matrix = dissimilarity_matrix_100, reactions = subset_extended_reactions_100, show_labels = True, t = 5, linkage = "ward", group_map = group_map_100, label_fontsize = 8, height = 600, width = 1000, title = "" ) ``` ![hierarchical_clustering_dendrogram](/img/hierarchical_clustering_dendrogram.png) ## Benchmarking with other clustering methods In the `clustering_of_correlation_matrix` function we have implemented only the hierarchical (Agglomerative) clustering of the correlation matrix. However, other powerful clustering algorithms exist and we encourage users to test the different methods and decide which they are going to use. First, we start with the elbow method to find the ideal number of clusters, as some of the clustering algorithms used below use this as a parameter. Here, we can see that `n_clusers` = 3, seems to produce better results ```python from sklearn.cluster import KMeans import matplotlib.pyplot as plt X = subset_mixed_correlation_matrix_100.copy() wcss = [] for i in range(1, 11): kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42) kmeans.fit(X) wcss.append(kmeans.inertia_) plt.plot(range(1, 11), wcss) plt.title('The Elbow Method') plt.xlabel('Number of clusters') plt.ylabel('WCSS') plt.show() ``` ![elbow_method_n_clusters](/img/elbow_method_n_clusters.png) Benchmark across `KMeans`, `AgglomerativeClustering`, `DBSCAN`, `HDBSCAN` and `SpectralClustering` clustering methods with the `Silhouette`, `Calinski-Harabasz` and `Davies-Bouldin` scores, reveals which method produces better clusters Here, we benchmark across these clustering methods and provide some metrics to give insights on clustering quality: - `Silhouette` Score: Scores closer to 1 are better - `Calinski-Harabasz` Score: Higher scores are better - `Davies-Bouldin` Score: Lower scores are better ```python from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN, SpectralClustering import hdbscan from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score import numpy as np X = subset_mixed_correlation_matrix_100.copy() n_clusters = 3 kmeans_labels = KMeans(n_clusters=n_clusters, random_state=42).fit_predict(X) agg_labels = AgglomerativeClustering(n_clusters=n_clusters, metric='euclidean', linkage='average').fit_predict(X) dbscan = DBSCAN(eps=0.5, min_samples=n_clusters) dbscan_labels = dbscan.fit_predict(X) hdb = hdbscan.HDBSCAN(min_cluster_size=n_clusters) hdbscan_labels = hdb.fit_predict(X) spectral_labels = SpectralClustering(n_clusters=n_clusters, affinity='nearest_neighbors', random_state=42).fit_predict(X) clusterings = {} clusterings["KMeans"] = kmeans_labels clusterings["Agglomerative"] = agg_labels clusterings["DBSCAN"] = dbscan_labels clusterings["HDBSCAN"] = hdbscan_labels clusterings["Spectral"] = spectral_labels def evaluate(X, labels, name): labels = np.array(labels) if len(np.unique(labels)) < 2 or np.all(labels == -1): print(f"\n{name}: Not enough clusters to evaluate.") return print(f"\n{name} Evaluation:") # Scores closer to 1 are better print(f"Silhouette Score: {silhouette_score(X, labels):.4f}") # Higher scores are better print(f"Calinski-Harabasz Score: {calinski_harabasz_score(X, labels):.4f}") # Lower scores are better print(f"Davies-Bouldin Score: {davies_bouldin_score(X, labels):.4f}") for name, labels in clusterings.items(): evaluate(X, labels, name) ``` ```python KMeans Evaluation: Silhouette Score: 0.8567 Calinski-Harabasz Score: 321.7593 Davies-Bouldin Score: 0.3168 Agglomerative Evaluation: Silhouette Score: 0.8567 Calinski-Harabasz Score: 321.7593 Davies-Bouldin Score: 0.3168 DBSCAN Evaluation: Silhouette Score: 0.8567 Calinski-Harabasz Score: 321.7593 Davies-Bouldin Score: 0.3168 HDBSCAN Evaluation: Silhouette Score: 0.7811 Calinski-Harabasz Score: 207.2086 Davies-Bouldin Score: 0.3169 Spectral Evaluation: Silhouette Score: 0.8567 Calinski-Harabasz Score: 321.7593 Davies-Bouldin Score: 0.3168 ``` PCA and tSNE methods to reduce the dimensions of the correlation matrix ```python from sklearn.decomposition import PCA from sklearn.manifold import TSNE pca = PCA(n_components=2) X_pca = pca.fit_transform(X) X_tsne = TSNE(n_components=2, perplexity=3, random_state=42).fit_transform(X) ``` Define a function that will enable us to visualize clustering results ```python import numpy as np def plot_clusters(embedding, labels, title, ax): labels = np.array(labels) unique_labels = np.unique(labels) for lab in unique_labels: mask = labels == lab ax.scatter(embedding[mask, 0], embedding[mask, 1], label=f"Cluster {lab}", s=80) ax.set_title(title) ax.legend() ``` Visualize clustering results on a 2-dimensional space occured from the PCA algorithm ```python import matplotlib.pyplot as plt fig, axes = plt.subplots(2, 3, figsize=(16, 10)) axes = axes.ravel() for i, (name, labels) in enumerate(clusterings.items()): plot_clusters(X_pca, labels, f"{name} (PCA)", axes[i]) plt.show() ``` ![pca_clustering_correlation_matrix](/img/pca_clustering_correlation_matrix.png) Visualize clustering results on a 2-dimensional space occured from the tSNE algorithm ```python import matplotlib.pyplot as plt fig, axes = plt.subplots(2, 3, figsize=(16, 10)) axes = axes.ravel() for i, (name, labels) in enumerate(clusterings.items()): plot_clusters(X_tsne, labels, f"{name} (tSNE)", axes[i]) plt.show() ``` ![tsne_clustering_correlation_matrix](/img/tsne_clustering_correlation_matrix.png)