Draft:Clustergrammer
Submission declined on 6 December 2024 by Passengerpigeon (talk).
This draft has been resubmitted and is currently awaiting re-review.
- Comment: Wikipedia is not a repository for software user manuals. Passengerpigeon (talk) 03:09, 6 December 2024 (UTC)
Clustergrammer
Clustergrammer is a web-based interactive tool for visualizing and analyzing high-dimensional data as heatmaps, developed by the Ma'ayan Laboratory at the Icahn School of Medicine at Mount Sinai. The tool addresses the limitations of static heatmaps by adding interactive features that support the analysis of complex biological datasets, including genomics and proteomics data.
Introduction
Clustergrammer is a visualization tool designed for the high-dimensional data commonly encountered in computational biology and data science.[1] Unlike traditional static heatmaps, it lets users explore data interactively by zooming, panning, clustering, and reordering rows and columns. The tool applies to a range of domains, including gene expression analysis, protein interaction networks, and single-cell data visualization. Because it is built on web technologies, its visualizations can be shared and accessed in a browser, which simplifies the interpretation of complex datasets.[2]
Features
Interactive Heatmaps
Clustergrammer enables users to create interactive heatmaps that allow for dynamic exploration of data. Features include zooming, panning, filtering, reordering, search, and highlighting.[3]
- Zooming and Panning: Users can navigate large datasets efficiently, zooming in on specific regions of the heatmap to analyze fine-grained details or zooming out to observe broader patterns. Panning allows users to move across the dataset seamlessly, making it easier to explore different areas of interest.
- Filtering and Reordering: Rows and columns can be reorganized using a variety of methods, such as hierarchical clustering, summation, variance, or alphabetical labels. This flexibility enables users to uncover patterns, relationships, or outliers in the data that might otherwise be overlooked in static representations.
- Search and Highlighting: The tool includes robust search functions that allow users to locate specific rows, columns, or subsets of data quickly. Highlighting options enable users to emphasize particular features, facilitating comparative analysis and focused exploration.
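The reordering options listed above can be illustrated with a minimal, dependency-free sketch. It is not Clustergrammer's own code: the function name `reorder_rows`, the toy labels, and the made-up values are assumptions for illustration only. It ranks rows by sum, by variance, or alphabetically by label.

```python
from statistics import pvariance


def reorder_rows(matrix, labels, by="sum"):
    """Return the row labels in display order for the chosen ranking."""
    if by == "alpha":
        return sorted(labels)
    # Pick the per-row statistic: total signal or spread across columns.
    stat = {"sum": sum, "var": pvariance}[by]
    scores = {label: stat(row) for label, row in zip(labels, matrix)}
    # Largest score first, so the most prominent rows appear at the top.
    return sorted(labels, key=lambda label: scores[label], reverse=True)


# Toy matrix (hypothetical values): three rows by four columns.
labels = ["GENE_B", "GENE_A", "GENE_C"]
matrix = [
    [3.0, 3.0, 3.0, 3.0],  # largest sum, zero variance
    [0.0, 5.0, 0.0, 5.0],  # largest variance
    [2.0, 2.0, 3.0, 1.0],
]

by_sum = reorder_rows(matrix, labels, by="sum")
by_var = reorder_rows(matrix, labels, by="var")
by_alpha = reorder_rows(matrix, labels, by="alpha")
print(by_sum, by_var, by_alpha)
```

The same matrix yields a different row order under each ranking, which is why switching between them can surface patterns that a single fixed ordering hides.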
Interactive Dimensionality Reduction
Dimensionality reduction techniques like Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE) simplify high-dimensional data for visualization. Clustergrammer enhances this process by allowing users to filter rows based on sum or variance, focusing on the most informative data points. This interactive filtering helps identify how specific dimensions affect clustering patterns. For smaller datasets, it uses animations to show the impact of these changes, aiding in data interpretation.
Clustering Algorithms
Clustergrammer employs hierarchical clustering algorithms, with support for additional methods such as K-means clustering. Users can visualize dendrograms, toggle between clustering levels, and extract enriched clusters.
Interactive Dendrograms: Clustergrammer employs interactive dendrograms to represent hierarchical clustering of data rows and columns. Instead of displaying the entire tree, it shows one slice at a time using gray trapezoids. Users can adjust the dendrogram slider to explore different clustering levels, revealing larger or smaller clusters. Interacting with these trapezoids highlights specific clusters, provides detailed information, and allows exporting of row or column names. For gene-level data, users can send clustered genes to Enrichr for enrichment analysis, facilitating deeper biological insights.
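The idea of viewing one slice of the clustering tree at a time can be sketched in plain Python. This is an illustrative sketch only: single linkage, Euclidean distance, the helper names `agglomerate` and `slice_tree`, and the toy points are all assumptions and are not taken from Clustergrammer's implementation.

```python
from itertools import combinations
from math import dist


def agglomerate(points):
    """Single-linkage agglomerative clustering.

    Returns one partition per tree level: the first has every point in its
    own cluster, the last has a single cluster containing all points.
    """
    clusters = [frozenset([i]) for i in range(len(points))]
    levels = [list(clusters)]
    while len(clusters) > 1:
        # Merge the two clusters whose closest members are nearest.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: min(
                dist(points[i], points[j]) for i in pair[0] for j in pair[1]
            ),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        levels.append(list(clusters))
    return levels


def slice_tree(levels, n_clusters):
    """Return the partition with exactly n_clusters groups (one tree slice)."""
    return next(p for p in levels if len(p) == n_clusters)


# Toy 2-D points (hypothetical): two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
levels = agglomerate(points)
fine = slice_tree(levels, 3)    # slider low in the tree: smaller clusters
coarse = slice_tree(levels, 2)  # slider higher in the tree: larger clusters
print(fine, coarse)
```

Moving the slider corresponds to choosing a different `n_clusters`: the tree is computed once, and each slice is just a different partition read from it.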
Customization Options
The tool provides various customization features:
- Users can adjust the opacity, highlight categories, and crop data subsets for detailed exploration.
- Integrations with external APIs, such as Enrichr, allow for enrichment analysis directly within the visualization.
Applications
1. High-Dimensional Data Visualization
Clustergrammer is a powerful tool for analyzing large and complex datasets by creating interactive heatmaps. These visualizations enable researchers to examine high-dimensional data intuitively, even when datasets contain thousands of rows and columns. This makes it particularly useful for summarizing, filtering, and interpreting large-scale experiments or studies.
2. Gene Expression Analysis
Widely used in genomics, Clustergrammer aids in analyzing gene expression data, including single-cell RNA sequencing (scRNA-seq).[4] By visualizing relationships among genes or samples, the tool helps researchers identify meaningful patterns, clusters, and correlations, offering insights into underlying biological processes or gene functions.
3. Biological Network Visualization
The tool is applied to represent biological networks such as protein-protein interactions, metabolic pathways, or gene regulatory networks. Clustergrammer’s clustering capabilities help pinpoint highly interconnected nodes or significant components, which are often critical in understanding the system's overall function or discovering key biomarkers.
4. Hierarchical Clustering
Clustergrammer supports hierarchical clustering, a method for organizing data into groups based on similarity. This is essential for categorizing features like genes, conditions, or samples into clusters, revealing relationships and structures within the data. Such clustering is especially valuable in understanding biological datasets, where interconnectedness is common.
5. Single-Cell Data Analysis
In single-cell studies, Clustergrammer is instrumental in exploring datasets derived from technologies like 10X Genomics. It allows researchers to classify cells based on gene expression signatures, visualize population structures, and assess how cells relate to one another, helping to uncover novel cell types or states.
6. Comparative Data Analysis
Clustergrammer facilitates the comparison of multiple datasets or experimental conditions. By visualizing and contrasting data in heatmaps, researchers can quickly identify similarities or differences between groups, aiding in hypothesis generation or validation.
Technical details
Architecture
Clustergrammer operates on a modular architecture comprising:
- Backend: Built using Python, with key libraries such as NumPy and SciPy for data processing.
- Frontend: Employs JavaScript and D3.js for rendering interactive visualizations.
- Integration: The tool supports integration with Jupyter Notebooks and REST APIs, enabling seamless workflow incorporation.
The core libraries are Clustergrammer-PY and Clustergrammer-JS.
Core Components
Clustergrammer consists of two primary components: Clustergrammer-JS and Clustergrammer-PY.
Clustergrammer-JS
Clustergrammer-JS is a front-end JavaScript visualization library that generates interactive heatmaps in web browsers. Built on D3.js and SVG technology, it renders complex data in an explorable format with features like:
- Data filtering options in three main categories: value-based, categorical, and interactive. Value filters apply thresholds to rows or columns, handle numerical criteria, and remove sparse data points. Category-based filters group data by metadata, toggle the visibility of specific groups, and filter on clustering outcomes. Interactive selections give manual control over rows and columns, visualize data subsets, and reorder content dynamically.
- Customizable information displays on hover
- Seamless web application integration
The library works with JSON data produced by Clustergrammer-PY and gives developers the tools to embed dynamic visualizations in their web projects. Its source code and installation details are available in its GitHub repository.[5]
Clustergrammer-PY
Clustergrammer-PY is a back-end Python package that enables users to create dynamic heatmap visualizations through automated data analysis. It processes input data to generate JSON files that power interactive web-based displays via Clustergrammer-JS.
Key features include:
- Data preprocessing capabilities like hierarchical clustering and multiple normalization options
- Support for both file-based and DataFrame inputs
- Integration with major scientific Python libraries, including NumPy for matrix operations, Pandas for DataFrame processing, SciPy for statistical analysis, and scikit-learn for machine learning
- Cross-version compatibility with both Python 2.7 and Python 3.x
The package handles data transformation and prepares structured JSON output suitable for visualization. Users can access it through the source code repository.[6]
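The text above mentions normalization options without naming them. As one common example, row-wise z-scoring can be sketched with the standard library alone. The function name `zscore_rows` and the toy values are assumptions for illustration, and this is not presented as the package's own implementation.

```python
from statistics import mean, pstdev


def zscore_rows(matrix):
    """Z-score each row: subtract the row mean, divide by the row std dev."""
    normalized = []
    for row in matrix:
        mu, sigma = mean(row), pstdev(row)
        # A constant row has zero spread; map it to zeros instead of dividing by 0.
        normalized.append([(x - mu) / sigma if sigma else 0.0 for x in row])
    return normalized


# Toy matrix (hypothetical values): one varying row and one constant row.
matrix = [
    [1.0, 2.0, 3.0],
    [10.0, 10.0, 10.0],
]
z = zscore_rows(matrix)
print(z)
```

After this step every non-constant row has mean 0 and unit standard deviation, so rows measured on very different scales become comparable in the same heatmap color range.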
Clustergrammer2
Clustergrammer2 is a specialized Jupyter widget that enables interactive visualization of high-dimensional datasets. Developed using widget-ts-cookiecutter[7] and the regl WebGL library,[8] it focuses on analyzing single-cell datasets, particularly RNA sequencing data. The tool also supports the exploration of large-scale data, such as gene expression patterns across thousands of cells.[9]
Implementation Guide
Clustergrammer is accessible through multiple platforms, including its web-based interface, Python API, and Jupyter Notebook integration. Below is a step-by-step guide to implementing Clustergrammer in various scenarios:
1. Using the Web Interface
The easiest way to use Clustergrammer is through its web interface:
- Visit the Clustergrammer Web Tool.[10]
- Upload a CSV or TSV file containing your high-dimensional data.
- Use the interactive heatmap to explore, filter, and cluster your data dynamically.
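Preparing a file for upload can be sketched as follows. The layout used here (column labels in the first row, row labels in the first column, tab-separated) is an assumption for illustration, as are the helper name `write_matrix_tsv` and the toy labels and values; the exact accepted format should be checked against the Clustergrammer documentation.

```python
import csv
import io


def write_matrix_tsv(handle, row_labels, col_labels, matrix):
    """Write a labeled matrix as tab-separated text."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    # Header row: an empty corner cell followed by the column labels.
    writer.writerow([""] + col_labels)
    # One line per row: the row label followed by its values.
    for label, row in zip(row_labels, matrix):
        writer.writerow([label] + row)


# Write to an in-memory buffer; pass an open file handle to write to disk.
buffer = io.StringIO()
write_matrix_tsv(
    buffer,
    row_labels=["GENE_A", "GENE_B"],
    col_labels=["sample_1", "sample_2"],
    matrix=[[1.5, 0.0], [2.0, 3.25]],
)
tsv_text = buffer.getvalue()
print(tsv_text)
```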
2. Python API: Clustergrammer-PY
The Python API provides advanced users with full control over preprocessing and visualization. Follow these steps to use the API:
Step 1: Installation
Install the Clustergrammer-PY library using pip (the package is published on PyPI as clustergrammer):
pip install clustergrammer
Step 2: Import the Library
Start by importing the Clustergrammer-PY module:
from clustergrammer import Network
Step 3: Load and Preprocess Data
Initialize the Network object and load the data, where data is a pandas DataFrame whose index and columns hold the row and column labels:
net = Network()
net.load_df(data)
Step 4: Apply Clustering
Use the built-in clustering algorithms:
net.cluster()
Step 5: Save and Visualize Results
Save the clustered data as a JSON file for visualization:
net.write_json_to_file('viz', 'clustergrammer_output.json')
3. Jupyter Notebook Integration
To visualize Clustergrammer heatmaps directly within Jupyter Notebooks, use the Clustergrammer2 widget.
1. Install the clustergrammer2 package:
pip install clustergrammer2
2. Import and use the widget in a Jupyter Notebook:
from clustergrammer2 import Network, CGM2
# Create a Network object backed by the Clustergrammer2 widget
# (names follow the project README and may differ between releases)
net = Network(CGM2)
# Load a pandas DataFrame into the widget
net.load_df(data)
# Display the interactive heatmap
net.widget()
This integration allows for seamless interaction with heatmaps during data exploration.
4. Integration with REST APIs
Clustergrammer supports REST API endpoints for automation:
- Prepare a JSON-formatted data file as described in the Clustergrammer documentation.
- Use tools like curl or Python's requests library to send POST requests to the API:
import requests
# Define the API endpoint (placeholder) and the data payload;
# data is a pandas DataFrame
url = "https://clustergrammer_api_url"
payload = {"data": data.to_json()}
# Send the POST request and stop early on an HTTP error
response = requests.post(url, json=payload, timeout=30)
response.raise_for_status()
# Retrieve the clustered data
clustered_data = response.json()
Case studies
1) Analyzing the MNIST Dataset Using Clustergrammer
This case study demonstrates how Clustergrammer can enhance data analysis and visualization, focusing on the MNIST dataset, a widely used benchmark for handwritten digit classification. The objective was to explore clustering patterns and feature relationships within the dataset, using Clustergrammer’s interactive heatmaps to examine the dataset’s structure, identify feature significance, and detect anomalies. By analyzing similarity matrices and dynamically clustering data, Clustergrammer helped show how pixel intensities and digit structures contribute to classification.
Data: The MNIST dataset, a benchmark for handwritten digit recognition, was analyzed using Clustergrammer to explore clustering patterns, feature relationships, and dataset quality. The dataset consists of 70,000 grayscale images of handwritten digits (0–9), with 60,000 used for training and 10,000 for testing. Each image, originally 28×28 pixels, was flattened into a 784-dimensional vector. Preprocessing steps included normalization of pixel intensity values to the range [0, 1].
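The flattening and normalization steps described above can be sketched without any dependencies. The image here is a synthetic stand-in, not real MNIST data, and the function name `flatten_and_normalize` is an assumption for illustration; it relies on MNIST pixels being stored as integers from 0 to 255.

```python
def flatten_and_normalize(image):
    """Flatten a 2-D grid of 0-255 pixel values into one list scaled to [0, 1]."""
    return [pixel / 255.0 for row in image for pixel in row]


# Synthetic 28x28 image (hypothetical): all black except two pixels.
image = [[0] * 28 for _ in range(28)]
image[14][14] = 255  # one fully bright pixel
image[0][1] = 51     # one dim pixel

vector = flatten_and_normalize(image)
print(len(vector), max(vector), min(vector))
```

Each image becomes one row (or column) of the matrix passed to the heatmap, with the pixel at grid position (r, c) landing at index r * 28 + c of the vector.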
Visualization Features: The heatmap allowed users to zoom and pan to focus on specific clusters for detailed analysis, to reorder rows and columns dynamically to highlight patterns, and to annotate the data with metadata, such as digit labels, for better interpretability. Bright regions in the heatmap corresponded to high-intensity pixels critical for classification, providing insights into feature importance. Additionally, isolated rows and columns highlighted outliers, such as mislabelled or poorly written digits.
Analysis Using Clustergrammer: This analysis demonstrated the use of Clustergrammer for visualizing and understanding high-dimensional data. It revealed clustering patterns, feature significance, and dataset anomalies. The interactive visualization supported feature selection, anomaly detection, and dataset quality assessment for a complex dataset like MNIST. The same approach applies more broadly in computational biology, machine learning, and data science.
2) Lung Cancer Data Analysis
This case study is taken from the original Clustergrammer publication.[11] It demonstrates how Clustergrammer can enhance data analysis and visualization, focusing on a dataset collected from lung cancer cell lines. The goal was to analyze post-translational modifications (PTMs) and gene expression patterns across different types of lung cancer to identify relationships and biological mechanisms.
Data: Post-Translational Modifications (PTMs) are changes made to proteins after they are synthesized, including processes like phosphorylation (adding a phosphate group), acetylation (adding an acetyl group), and methylation (adding a methyl group). These modifications can alter protein function and play a critical role in cancer development. In this study, PTMs were measured in 42 lung cancer cell lines using Tandem Mass Tag (TMT) mass spectrometry, a technique for detecting protein changes. Additionally, gene expression data, which reflects the activity levels of genes in producing their products like mRNA or protein, was collected for 37 of these cell lines from the Cancer Cell Line Encyclopedia (CCLE). The analysis focused on two major types of lung cancer: Non-Small Cell Lung Cancer (NSCLC), which includes the majority of cases such as adenocarcinoma and squamous cell carcinoma, and Small Cell Lung Cancer (SCLC), a more aggressive but less common type.
Analysis Using Clustergrammer: Using Clustergrammer, patterns in the data were identified by clustering cell lines based on their post-translational modifications (PTMs) and gene expression levels. The analysis revealed that cell lines grouped into two major types: Non-Small Cell Lung Cancer (NSCLC) and Small Cell Lung Cancer (SCLC). Within these primary groups, further clustering was observed based on specific subtypes, such as adenocarcinoma and squamous cell carcinoma, as well as genetic mutations. Key observations included strong correlations between modifications (phosphorylation, acetylation, methylation) in keratin family proteins and their corresponding mRNA levels, suggesting a close link between protein modifications and gene activity. Additionally, the NKX2-1 transcription factor, a critical regulator in lung cancer, showed strong correlations between its methylation patterns, mRNA expression, and other lung-related genes such as SFTA3 and SOX2.
This study demonstrated how Clustergrammer can quickly identify important patterns in complex biological data, helping researchers understand differences between cancer types and potentially leading to better treatment strategies.
References
- ^ Clustergrammer documentation: https://clustergrammer.readthedocs.io/
- ^ Fernandez, Nicolas F.; Gundersen, Gregory W.; Rahman, Adeeb; Grimes, Mark L.; Rikova, Klarisa; Hornbeck, Peter; Ma’ayan, Avi (2017). "Clustergrammer, a web-based heatmap visualization and analysis tool for high-dimensional biological data". Scientific Data. 4: 170151. doi:10.1038/sdata.2017.151.
- ^ "Clustergrammer Documentation". Read the Docs. Retrieved 2024-11-19.
- ^ Jovic, D.; Liang, X.; Zeng, H.; Lin, L.; Xu, F.; Luo, Y. (2022). "Single-cell RNA sequencing technologies and applications: A brief review". Clinical and Translational Medicine. 12 (3): e694. doi:10.1002/ctm2.694. PMC 8964935. PMID 35352511.
- ^ "Clustergrammer-JS GitHub Repository". GitHub. Retrieved 2024-11-19.
- ^ "Clustergrammer-PY GitHub Repository". GitHub. MaayanLab. Retrieved 2024-11-19.
- ^ "widget-ts-cookiecutter". GitHub.
- ^ "regl". GitHub.
- ^ "Clustergrammer2 GitHub Repository". GitHub. Icahn School of Medicine at Mount Sinai. Retrieved 2024-11-19.
- ^ "ClusterGrammer Webtool".
- ^ "Clustergrammer, a web-based heatmap visualization and analysis tool for high-dimensional biological data".