
Draft:Clustergrammer



Clustergrammer


Clustergrammer is a web-based interactive tool for visualizing and analyzing high-dimensional data as heatmaps. It was developed by the Ma'ayan Laboratory at the Icahn School of Medicine at Mount Sinai. The tool addresses the limitations of static heatmaps by adding interactive features that support the analysis of complex biological datasets, such as genomics and proteomics data.

Introduction


Clustergrammer is a visualization tool designed for the high-dimensional data commonly encountered in computational biology and data science.[1] Unlike traditional static heatmaps, it lets users explore data interactively by zooming, panning, clustering, and reordering rows and columns. The tool applies to various domains, including gene expression analysis, protein interaction networks, and single-cell data visualization. By building on web technologies, it supports accessible, shareable visualizations that simplify the interpretation of complex datasets.[2]

Features


Interactive Heatmaps


Clustergrammer enables users to create interactive heatmaps that allow for dynamic exploration of data. Features include zooming, panning, filtering, reordering, search, and highlighting.[3]

Figure: An interactive heatmap generated with Clustergrammer to visualize gene expression data from the Cancer Cell Line Encyclopedia (CCLE). The recording demonstrates zooming, scrolling, panning, filtering, and reordering.
  • Zooming and Panning: Users can navigate large datasets efficiently, zooming in on specific regions of the heatmap to analyze fine-grained details or zooming out to observe broader patterns. Panning allows users to move across the dataset seamlessly, making it easier to explore different areas of interest.
  • Filtering and Reordering: Rows and columns can be reorganized using a variety of methods, such as hierarchical clustering, summation, variance, or alphabetical labels. This flexibility enables users to uncover patterns, relationships, or outliers in the data that might otherwise be overlooked in static representations.
  • Search and Highlighting: The tool includes robust search functions that allow users to locate specific rows, columns, or subsets of data quickly. Highlighting options enable users to emphasize particular features, facilitating comparative analysis and focused exploration.
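
The reordering options listed above can be illustrated with a short sketch. This is not Clustergrammer's own code; the function `order_rows` and the toy matrix are hypothetical, and the example only shows what ordering rows by sum, variance, or label means in practice.

```python
from statistics import pvariance

# Toy matrix: row label -> values across three samples (made-up numbers).
matrix = {
    "GENE_C": [1.0, 1.0, 1.0],
    "GENE_A": [0.0, 5.0, 10.0],
    "GENE_B": [4.0, 4.0, 8.0],
}

def order_rows(matrix, by):
    """Return row labels ordered by 'sum' or 'variance' (descending) or by 'label'."""
    if by == "sum":
        return sorted(matrix, key=lambda r: sum(matrix[r]), reverse=True)
    if by == "variance":
        return sorted(matrix, key=lambda r: pvariance(matrix[r]), reverse=True)
    if by == "label":
        return sorted(matrix)
    raise ValueError(f"unknown ordering: {by}")

by_sum = order_rows(matrix, "sum")            # largest row total first
by_variance = order_rows(matrix, "variance")  # most variable row first
by_label = order_rows(matrix, "label")        # alphabetical
```

Each ordering brings a different aspect of the same matrix to the top of the heatmap, which is why switching between them can expose patterns that a single fixed layout hides.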

Interactive Dimensionality Reduction


Dimensionality reduction techniques like Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE) simplify high-dimensional data for visualization. Clustergrammer enhances this process by allowing users to filter rows based on sum or variance, focusing on the most informative data points. This interactive filtering helps identify how specific dimensions affect clustering patterns. For smaller datasets, it uses animations to show the impact of these changes, aiding in data interpretation.
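
As a rough illustration of filtering rows by variance before dimensionality reduction, the sketch below keeps the most variable rows of a random toy matrix and projects the columns onto two principal components with a plain SVD. The function names and the data are assumptions made for this example; Clustergrammer's own implementation may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy matrix: 100 rows (e.g. genes) x 8 columns (e.g. samples), random values
# with a different spread per row so that row variances differ.
X = rng.normal(size=(100, 8)) * rng.uniform(0.1, 3.0, size=(100, 1))

def top_rows_by_variance(X, n_keep):
    """Keep the n_keep rows with the largest variance across columns."""
    order = np.argsort(X.var(axis=1))[::-1]
    return X[order[:n_keep]]

def pca_columns(X, n_components=2):
    """Project the columns (samples) onto the leading principal components."""
    samples = X.T                                  # one sample per row
    centered = samples - samples.mean(axis=0)      # center each feature
    U, S, _ = np.linalg.svd(centered, full_matrices=False)
    return U[:, :n_components] * S[:n_components]  # sample coordinates

filtered = top_rows_by_variance(X, n_keep=20)
embedding = pca_columns(filtered)                  # shape: (8 samples, 2 components)
```

Rerunning `pca_columns` with a different `n_keep` shows how the choice of rows changes the positions of the samples in the reduced space.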

Clustering Algorithms

Figure: An interactive Clustergrammer heatmap with clustering applied to the CCLE data. A slider selects the level that increases or decreases the number of clusters formed by hierarchical clustering, and the change is reflected in the interactive dendrogram.

Clustergrammer employs hierarchical clustering algorithms, with support for additional methods such as K-means clustering. Users can visualize dendrograms, toggle between clustering levels, and extract enriched clusters.

Interactive Dendrograms: Clustergrammer employs interactive dendrograms to represent hierarchical clustering of data rows and columns. Instead of displaying the entire tree, it shows one slice at a time using gray trapezoids. Users can adjust the dendrogram slider to explore different clustering levels, revealing larger or smaller clusters. Interacting with these trapezoids highlights specific clusters, provides detailed information, and allows exporting of row or column names. For gene-level data, users can send clustered genes to Enrichr for enrichment analysis, facilitating deeper biological insights.
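
The idea of viewing one slice of the tree at a time can be sketched with SciPy's hierarchical clustering functions. This is an illustration under assumptions, not Clustergrammer's implementation: the toy data, the average-linkage and Euclidean settings, and the helper `slice_tree` are all chosen for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: three well-separated groups of four rows each (made-up values).
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
rows = np.vstack([c + rng.normal(scale=0.5, size=(4, 2)) for c in centers])

# Build the full tree once (average linkage, Euclidean distance).
tree = linkage(rows, method="average", metric="euclidean")

def slice_tree(tree, n_clusters):
    """Cut the tree so that at most n_clusters flat clusters remain."""
    return fcluster(tree, t=n_clusters, criterion="maxclust")

coarse = slice_tree(tree, 2)  # fewer, larger clusters
fine = slice_tree(tree, 3)    # more, smaller clusters
```

Moving a dendrogram slider corresponds to changing the cut level: the tree is computed once, and each slider position yields a different flat grouping of the same rows.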

Customization Options


The tool provides various customization features:

  • Users can adjust the opacity, highlight categories, and crop data subsets for detailed exploration.
  • Integrations with external APIs, such as Enrichr, allow for enrichment analysis directly within the visualization.

Applications


1. High-Dimensional Data Visualization

Clustergrammer is a powerful tool for analyzing large and complex datasets by creating interactive heatmaps. These visualizations enable researchers to examine high-dimensional data intuitively, even when datasets contain thousands of rows and columns. This makes it particularly useful for summarizing, filtering, and interpreting large-scale experiments or studies.

2. Gene Expression Analysis

Widely used in genomics, Clustergrammer aids in analyzing gene expression data, including single-cell RNA sequencing (scRNA-seq) data.[4] By visualizing relationships among genes or samples, the tool helps researchers identify meaningful patterns, clusters, and correlations, offering insights into underlying biological processes or gene functions.

3. Biological Network Visualization

The tool is applied to represent biological networks such as protein-protein interactions, metabolic pathways, or gene regulatory networks. Clustergrammer’s clustering capabilities help pinpoint highly interconnected nodes or significant components, which are often critical in understanding the system's overall function or discovering key biomarkers.

4. Hierarchical Clustering

Clustergrammer supports hierarchical clustering, a method for organizing data into groups based on similarity. This is essential for categorizing features like genes, conditions, or samples into clusters, revealing relationships and structures within the data. Such clustering is especially valuable in understanding biological datasets, where interconnectedness is common.

5. Single-Cell Data Analysis

In single-cell studies, Clustergrammer is instrumental in exploring datasets derived from technologies like 10X Genomics. It allows researchers to classify cells based on gene expression signatures, visualize population structures, and assess how cells relate to one another, helping to uncover novel cell types or states.

6. Comparative Data Analysis

Clustergrammer facilitates the comparison of multiple datasets or experimental conditions. By visualizing and contrasting data in heatmaps, researchers can quickly identify similarities or differences between groups, aiding in hypothesis generation or validation.

Technical details


Architecture


Clustergrammer operates on a modular architecture comprising:

  • Backend: Built using Python, with key libraries such as NumPy and SciPy for data processing.
  • Frontend: Employs JavaScript and D3.js for rendering interactive visualizations.
  • Integration: The tool supports integration with Jupyter Notebooks and REST APIs, enabling seamless workflow incorporation.

The core libraries are Clustergrammer-PY and Clustergrammer-JS.

Core Components


Clustergrammer consists of two primary components: Clustergrammer-JS and Clustergrammer-PY.

Clustergrammer-JS


Clustergrammer-JS is a front-end JavaScript visualization library that generates interactive heatmaps in web browsers. Built on D3.js and SVG, it renders complex data in an explorable format with features such as:

  • Data filtering options in three categories: value-based, categorical, and interactive. Value filters apply thresholds to rows or columns, handle numerical criteria, and remove sparse data points. Category-based filters group data by metadata, toggle the visibility of specific groups, and filter on clustering outcomes. Interactive selections give manual control over rows and columns, show data subsets, and reorder content dynamically.
  • Customizable information displays on hover
  • Seamless web application integration
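
The value-based filtering described in the list above can be sketched in a language-agnostic way. The example below is written in Python for consistency with the rest of this article rather than in JavaScript, and the function `filter_rows` and its parameters are hypothetical, not part of the Clustergrammer-JS API.

```python
def filter_rows(matrix, min_abs_value=0.0, min_count=1):
    """Keep rows with at least `min_count` entries whose magnitude exceeds `min_abs_value`."""
    kept = {}
    for label, values in matrix.items():
        n_pass = sum(1 for v in values if abs(v) > min_abs_value)
        if n_pass >= min_count:
            kept[label] = values
    return kept

# Toy matrix with one sparse row and one all-zero row (made-up values).
matrix = {
    "row_1": [0.0, 0.0, 0.0, 0.1],
    "row_2": [2.5, 0.0, 3.1, 0.0],
    "row_3": [0.0, 0.0, 0.0, 0.0],
}

non_empty = filter_rows(matrix)                                      # drops all-zero rows
dense_rows = filter_rows(matrix, min_abs_value=1.0, min_count=2)     # stricter threshold
```

Raising the threshold or the required count removes progressively sparser rows, which reduces the size of the heatmap that has to be rendered.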

The library works with JSON data produced by Clustergrammer-PY and provides developers with the tools to embed dynamic visualizations in their web projects. Its source code and installation details are available on GitHub.[5]

Clustergrammer-PY


Clustergrammer-PY is a back-end Python package that enables users to create dynamic heatmap visualizations through automated data analysis. It processes input data and generates the JSON files that power the interactive web-based displays rendered by Clustergrammer-JS.

Key features include:

  • Data preprocessing capabilities like hierarchical clustering and multiple normalization options
  • Support for both file-based and DataFrame inputs
  • Integration with major scientific Python libraries: NumPy for matrix operations, Pandas for DataFrame processing, SciPy for statistical analysis, and scikit-learn for machine learning.
  • Cross-version compatibility: the package supports both Python 2.7 and Python 3.x.
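
One common normalization applied before clustering is a row-wise Z-score. The sketch below shows the arithmetic with NumPy; it is an illustration only, and the helper `zscore_rows` is a hypothetical name rather than a function of Clustergrammer-PY.

```python
import numpy as np

def zscore_rows(X):
    """Z-score each row: subtract the row mean, divide by the row standard deviation."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, keepdims=True)
    std[std == 0] = 1.0  # constant rows become all zeros instead of dividing by zero
    return (X - mean) / std

# Toy matrix (made-up values): rows 0 and 2 share a shape but differ in scale.
X = np.array([[1.0, 2.0, 3.0],
              [10.0, 10.0, 10.0],
              [0.0, 50.0, 100.0]])
Z = zscore_rows(X)
```

After normalization, rows 0 and 2 become identical, so a clustering step groups rows by the shape of their profile rather than by their absolute magnitude.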

The package handles data transformation and prepares structured JSON output suitable for visualization. Users can access it through the source code repository.[6]

Clustergrammer2


Clustergrammer2 is a specialized Jupyter widget that enables interactive visualization of high-dimensional datasets. Developed using widget-ts-cookiecutter[7] and the regl WebGL library,[8] it focuses on analyzing single-cell datasets, particularly RNA sequencing data. The tool also supports the exploration of large-scale data, such as gene expression patterns across thousands of cells.[9]

Implementation Guide


Clustergrammer is accessible through multiple platforms, including its web-based interface, Python API, and Jupyter Notebook integration. Below is a step-by-step guide to implementing Clustergrammer in various scenarios:

1. Using the Web Interface


The easiest way to use Clustergrammer is through its web interface:

  1. Visit the Clustergrammer Web Tool.[10]
  2. Upload a CSV or TSV file containing your high-dimensional data.
  3. Use the interactive heatmap to explore, filter, and cluster your data dynamically.
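
A file for upload can be prepared with Python's standard `csv` module. The sketch below assumes a common labeled-matrix layout (column labels in the first line, row labels in the first field of each following line); this layout and the helper `matrix_to_tsv` are assumptions for illustration, so the exact format expected by the web tool should be checked against the Clustergrammer documentation.

```python
import csv
import io

# Toy matrix (made-up values).
col_labels = ["sample_1", "sample_2", "sample_3"]
rows = {
    "GENE_A": [1.2, 0.0, 3.4],
    "GENE_B": [0.5, 2.1, 0.0],
}

def matrix_to_tsv(col_labels, rows):
    """Serialize a labeled matrix as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow([""] + col_labels)     # header: empty corner cell, then column labels
    for label, values in rows.items():
        writer.writerow([label] + values)  # row label followed by its values
    return buffer.getvalue()

tsv_text = matrix_to_tsv(col_labels, rows)
```

Writing `tsv_text` to a file with a `.tsv` or `.txt` extension produces a matrix that can then be selected in the upload form.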

2. Python API: Clustergrammer-PY


The Python API provides advanced users with full control over preprocessing and visualization. Follow these steps to use the API:

Step 1: Installation

Install the Clustergrammer-PY library using pip:

pip install clustergrammer

Step 2: Import the Library

Start by importing the Clustergrammer-PY module:

from clustergrammer import Network

Step 3: Load and Preprocess Data

Initialize the Network object and load the data:

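# Create an empty Network object
net = Network()
# `data` is assumed to be a pandas DataFrame with labeled rows and columns
net.load_df(data)

Step 4: Apply Clustering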

Use the built-in clustering algorithms:

net.cluster()

Step 5: Save and Visualize Results

Save the clustered data as a JSON file for visualization:

net.write_json_to_file('viz', 'clustergrammer_output.json')
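
The steps above pass a variable named `data` to `load_df` without defining it. A minimal sketch of one way to build such an input is shown below, assuming a pandas DataFrame whose index holds row labels and whose columns hold sample labels; the random values and label names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy expression matrix: 6 genes x 4 samples filled with random values.
rng = np.random.default_rng(42)
data = pd.DataFrame(
    rng.normal(size=(6, 4)),
    index=[f"GENE_{i}" for i in range(1, 7)],
    columns=[f"sample_{j}" for j in range(1, 5)],
)

# Optional preprocessing: keep only the four most variable rows before clustering.
most_variable = data.var(axis=1).sort_values(ascending=False).index[:4]
top = data.loc[most_variable]
```

Either `data` or the reduced `top` frame can then be passed to `net.load_df` in Step 3.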

3. Jupyter Notebook Integration


To visualize Clustergrammer heatmaps directly within Jupyter Notebooks, use the Clustergrammer2 widget:

1. Install the clustergrammer2 package:

pip install clustergrammer2

2. Import and use the widget in a Jupyter Notebook:

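# Import the Network object exposed by the package.
# Note: the import may differ between releases (some versions construct the
# object explicitly, e.g. Network(CGM2)); check the Clustergrammer2 repository.
from clustergrammer2 import net

# Load data (assumed to be a pandas DataFrame) into the widget
net.load_df(data)

# Display the interactive heatmap
net.widget()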

This integration allows for seamless interaction with heatmaps during data exploration.

4. Integration with REST APIs


Clustergrammer supports REST API endpoints for automation:

  1. Prepare a JSON-formatted data file as described in the Clustergrammer documentation.
  2. Use tools like curl or Python’s requests library to send POST requests to the API:
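import requests

# Placeholder endpoint; replace it with the actual Clustergrammer API URL
url = "https://clustergrammer_api_url"
# `data` is assumed to be a pandas DataFrame
payload = {"data": data.to_json()}

# Send the POST request and fail early on HTTP errors
response = requests.post(url, json=payload, timeout=30)
response.raise_for_status()

# Retrieve clustered data
clustered_data = response.json()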

Case studies


1) Analyzing the MNIST Dataset Using Clustergrammer

Figure: Analysis results for three clusters (Cluster 5, Cluster 9, and Cluster 12) from the MNIST dataset. Each cluster is shown with its majority-digit category and the corresponding counts of occurrences. These counts suggest that the clusters group digits by shared features, while overlapping counts, such as "Four" appearing in both Cluster 9 and Cluster 12, indicate potential feature similarity. The color-coded bars further highlight the distribution of majority digits within each cluster.

This case study demonstrates how Clustergrammer can enhance data analysis and visualization, focusing on the MNIST dataset, a widely used benchmark for handwritten digit classification. The objective was to explore clustering patterns and feature relationships within the dataset, leveraging Clustergrammer’s interactive heatmaps to uncover insights into the dataset’s structure, identify feature significance, and detect anomalies. By analyzing similarity matrices and dynamically clustering data, Clustergrammer enabled a deeper understanding of how pixel intensities and digit structures contribute to classification, providing a valuable tool for data exploration and machine learning workflows.

Data: The MNIST dataset, a benchmark for handwritten digit recognition, was analyzed using Clustergrammer to explore clustering patterns, feature relationships, and dataset quality. The dataset consists of 70,000 grayscale images of handwritten digits (0–9), with 60,000 used for training and 10,000 for testing. Each image, originally 28×28 pixels, was flattened into a 784-dimensional vector. Preprocessing included normalizing pixel intensity values to the range [0, 1].
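
The flattening and normalization described above can be sketched with NumPy. The random array below stands in for real MNIST images, and dividing by 255 is an assumption about how 8-bit pixel values were scaled to [0, 1]; the case study does not specify the exact method.

```python
import numpy as np

# Stand-in for a batch of MNIST images: five random 28x28 arrays of 8-bit pixels.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)

def preprocess(images):
    """Flatten each 28x28 image to a 784-vector and scale intensities to [0, 1]."""
    flat = images.reshape(len(images), -1).astype(np.float32)
    return flat / 255.0  # assumed scaling for 8-bit pixel values

vectors = preprocess(images)  # shape: (5, 784)
```

Each row of `vectors` then corresponds to one image and each column to one pixel position, which is the matrix layout a heatmap tool expects.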

Visualization Features: The heatmap allowed users to zoom and pan to focus on specific clusters for detailed analysis, to reorder rows and columns dynamically to highlight patterns, and to annotate the data with metadata, such as digit labels, for better interpretability. Bright regions in the heatmap corresponded to high-intensity pixels critical for classification, providing insights into feature importance. Additionally, isolated rows and columns highlighted outliers, such as mislabeled or poorly written digits.

Analysis Using Clustergrammer: This analysis demonstrated the effectiveness of Clustergrammer in visualizing and understanding high-dimensional data. It revealed insights into clustering patterns, feature significance, and dataset anomalies. The interactive visualization facilitated feature selection, anomaly detection, and dataset quality assessment for a complex dataset like MNIST. Clustergrammer's versatility makes it suitable for broader applications in computational biology, machine learning, and data science.

2) Lung Cancer Data Analysis

(a) Lung cancer cell lines (columns) were clustered based on a combination of PTMs and mRNA expression data (rows). (b) Zooming into a cluster containing Keratins with commonly up-regulated expression and post-translational modification in the NSCLC cluster. (c) Zooming into a cluster containing expression and methylation data for the lung associated transcription factor, NKX2-1.

This case study is taken from the Clustergrammer paper.[11] It demonstrates how Clustergrammer can enhance data analysis and visualization, focusing on a dataset collected from lung cancer cell lines. The goal was to analyze post-translational modifications (PTMs) and gene expression patterns across different types of lung cancer to identify relationships and biological mechanisms.

Data: Post-Translational Modifications (PTMs) are changes made to proteins after they are synthesized, including processes like phosphorylation (adding a phosphate group), acetylation (adding an acetyl group), and methylation (adding a methyl group). These modifications can alter protein function and play a critical role in cancer development. In this study, PTMs were measured in 42 lung cancer cell lines using Tandem Mass Tag (TMT) mass spectrometry, a technique for detecting protein changes. Additionally, gene expression data, which reflects the activity levels of genes in producing their products like mRNA or protein, was collected for 37 of these cell lines from the Cancer Cell Line Encyclopedia (CCLE). The analysis focused on two major types of lung cancer: Non-Small Cell Lung Cancer (NSCLC), which includes the majority of cases such as adenocarcinoma and squamous cell carcinoma, and Small Cell Lung Cancer (SCLC), a more aggressive but less common type.

Analysis Using Clustergrammer: Using Clustergrammer, patterns in the data were identified by clustering cell lines based on their post-translational modifications (PTMs) and gene expression levels. The analysis revealed that cell lines grouped into two major types: Non-Small Cell Lung Cancer (NSCLC) and Small Cell Lung Cancer (SCLC). Within these primary groups, further clustering was observed based on specific subtypes, such as adenocarcinoma and squamous cell carcinoma, as well as genetic mutations. Key observations included strong correlations between modifications (phosphorylation, acetylation, methylation) in keratin family proteins and their corresponding mRNA levels, suggesting a close link between protein modifications and gene activity. Additionally, the NKX2-1 transcription factor, a critical regulator in lung cancer, showed strong correlations between its methylation patterns, mRNA expression, and other lung-related genes such as SFTA3 and SOX2.
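
A correlation between a modification level and an mRNA level across cell lines is typically quantified with a coefficient such as Pearson's r. The sketch below shows that calculation on made-up numbers; the values are not taken from the study, and the source does not state which correlation measure was used.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

# Hypothetical measurements across five cell lines (illustrative values only).
ptm_level = [0.2, 1.1, 1.9, 3.2, 4.1]
mrna_level = [1.0, 2.1, 2.9, 4.2, 5.0]

r = pearson(ptm_level, mrna_level)  # close to 1 for these nearly linear values
```

Rows with a high coefficient tend to be placed next to each other by hierarchical clustering, which is how such relationships become visible as blocks in the heatmap.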

This study demonstrated how Clustergrammer can quickly identify important patterns in complex biological data, helping researchers understand differences between cancer types and potentially leading to better treatment strategies.

References

  1. ^ Clustergrammer documentation: https://clustergrammer.readthedocs.io/
  2. ^ Fernandez, Nicolas F.; Gundersen, Gregory W.; Rahman, Adeeb; Grimes, Mark L.; Rikova, Klarisa; Hornbeck, Peter; Ma’ayan, Avi (2017). "Clustergrammer, a web-based heatmap visualization and analysis tool for high-dimensional biological data". Scientific Data. 4: 170151. doi:10.1038/sdata.2017.151.
  3. ^ "Clustergrammer Documentation". Read the Docs. Retrieved 2024-11-19.
  4. ^ Jovic, D.; Liang, X.; Zeng, H.; Lin, L.; Xu, F.; Luo, Y. (2022). "single cell RNA". Clinical and Translational Medicine. 12 (3): e694. doi:10.1002/ctm2.694. PMC 8964935. PMID 35352511.
  5. ^ "Clustergrammer-JS GitHub Repository". GitHub. Retrieved 2024-11-19.
  6. ^ "Clustergrammer-PY GitHub Repository". GitHub. MaayanLab. Retrieved 2024-11-19.
  7. ^ "widget-ts-cookiecutter". GitHub.
  8. ^ "regl". GitHub.
  9. ^ "Clustergrammer2 GitHub Repository". GitHub. Icahn School of Medicine at Mount Sinai. Retrieved 2024-11-19.
  10. ^ "ClusterGrammer Webtool".
  11. ^ "Clustergrammer, a web-based heatmap visualization and analysis tool for high-dimensional biological data".