The vast number of public sequencing experiments generated from human and various model organisms could be useful for understanding human genetics. However, the sequencing data remain difficult for the broad bioinformatics research community to reuse. Therefore, we created an efficient pipeline and extracted the transcript read counts, allelic read counts, and microbe read counts from >900,000 sequencing experiments generated by >20,000 studies deposited in the Sequence Read Archive (SRA). This resource could be useful for cataloging the transcriptome, variant, and microbial composition of >700,000 biosamples. To make the data more accessible to the community, we created a JupyterHub environment where any researcher can retrieve these -omic layers and conduct simple data analyses without a programming background.
Data landscape of SRA
(Left) SRA data volume (y-axis) across the library strategies with highest availability (x-axis) and species with highest availability (hue). (Right) SRA data volume (y-axis) is increasing over time (x-axis).
The landscape of reprocessed SRA data
(Left) Data availability of the allelic read counts data and microbe read count data extracted from the SRA. (Right) Data availability of the transcriptome data extracted from the SRA.
From data searching to analysis in minutes with our JupyterHub:
Feel free to contact me at email@example.com if you have any questions. I will try to reply within three days.
Expression analysis using our JupyterHub
This example shows how a user can query and compare the expression profiles of T-Cell and B-Cell samples using our JupyterHub within minutes. (Top) Workflow: the user inputs a list of experimental conditions as a free-text query, in this example “B-Cell, T-Cell”. (Middle) The query retrieves the expression profiles with annotations containing “B-Cell” or “T-Cell”. (Bottom) The user can click on the analysis links to generate various differential expression analyses.
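The free-text lookup described above can be sketched roughly as follows. This is only an illustration of substring matching against annotations, not the platform's actual implementation: the run IDs, the annotation strings, and the function names `parse_query` and `match_conditions` are all hypothetical.

```python
# Hypothetical annotation records; the real SkyMap metadata layout may differ.
annotations = {
    "SRR0000001": "CD4+ T-Cell, peripheral blood",
    "SRR0000002": "naive B-Cell",
    "SRR0000003": "HeLa cells",
    "SRR0000004": "activated t-cell",
}


def parse_query(query):
    """Split a comma-separated free-text query into lowercase terms."""
    return [term.strip().lower() for term in query.split(",") if term.strip()]


def match_conditions(annotations, query):
    """Map each query term to the run IDs whose annotation contains it."""
    terms = parse_query(query)
    hits = {term: [] for term in terms}
    for run_id, text in sorted(annotations.items()):
        lowered = text.lower()  # case-insensitive substring match
        for term in terms:
            if term in lowered:
                hits[term].append(run_id)
    return hits


groups = match_conditions(annotations, "B-Cell, T-Cell")
print(groups)
```

Each query term becomes one experimental group, so the two resulting lists of runs can be passed straight to a downstream comparison.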
Two steps to go from data searching to analysis
Example interactive analysis: PCA showing that the expression profiles are separated by the first three PCs (top left) and the first two PCs (top right). (Bottom left) Interactive volcano plot to identify DE genes. (Bottom right) Instead of querying by metadata, the user can also query by gene name to retrieve the studies with the highest expression levels.
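As a rough illustration of what sits behind a volcano plot, the sketch below computes a log2 fold change and a Welch t-statistic per gene for two groups. It is a simplification under stated assumptions: the expression values are made up, the pseudocount of 1 is an arbitrary choice, and a real volcano plot would put a p-value (usually multiple-testing corrected) on the y-axis rather than the raw t-statistic. The function names are hypothetical and do not come from the JupyterHub notebooks.

```python
import math
from statistics import mean, variance

# Hypothetical expression values per gene for two groups of samples.
expression = {
    "CD3E": {"t_cell": [120.0, 135.0, 110.0], "b_cell": [4.0, 6.0, 5.0]},
    "CD19": {"t_cell": [3.0, 2.0, 4.0], "b_cell": [90.0, 100.0, 95.0]},
    "ACTB": {"t_cell": [500.0, 520.0, 480.0], "b_cell": [510.0, 490.0, 505.0]},
}


def log2_fold_change(a, b, pseudocount=1.0):
    """log2 ratio of group means; the pseudocount avoids division by zero."""
    return math.log2((mean(a) + pseudocount) / (mean(b) + pseudocount))


def welch_t(a, b):
    """Welch's t-statistic (unequal variances) for two samples."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def rank_genes(expression, group_a, group_b):
    """Return (gene, log2FC, t) tuples sorted by absolute fold change."""
    rows = []
    for gene, groups in expression.items():
        a, b = groups[group_a], groups[group_b]
        rows.append((gene, log2_fold_change(a, b), welch_t(a, b)))
    return sorted(rows, key=lambda row: abs(row[1]), reverse=True)


ranked = rank_genes(expression, "t_cell", "b_cell")
for gene, lfc, t_stat in ranked:
    print(f"{gene}\tlog2FC={lfc:.2f}\tt={t_stat:.2f}")
```

In this toy data the two marker genes end up at the top of the ranking and the housekeeping gene at the bottom, which is the pattern a volcano plot makes visible at a glance.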
Variant extraction using our JupyterHub
If a user is interested in investigating the mutational landscape of BRAF V600 and its neighboring genomic sites across the existing public human studies, they can input the corresponding genomic site, chr7:140,753,336, into our platform. (Left) The platform then identifies all human sequencing studies with allelic read counts detected at the query position, along with the neighboring variants within the 15bp window, which helps in evaluating the studies that carry the query mutation. The size of each node represents the median mapping quality score (MAPQ). (Right) Example interactive study summary table: a screenshot of our interactive JupyterHub web interface, which allows the user to scroll through the available studies ranked by the number of sequencing experiments in which the alternative allele was detected.
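A minimal sketch of the positional query described above, assuming allelic read counts are available as flat records. The record layout, the accession IDs, the counts, and the function names are hypothetical. Interpreting the 15bp window as 15 bases on either side of the query is also an assumption, exposed here as the `flank` parameter.

```python
import re
from collections import defaultdict

# Hypothetical records: (study, run, chromosome, position, ref_reads, alt_reads)
records = [
    ("SRP000001", "SRR0000001", "chr7", 140753336, 40, 12),
    ("SRP000001", "SRR0000002", "chr7", 140753336, 35, 9),
    ("SRP000001", "SRR0000003", "chr7", 140753336, 50, 0),
    ("SRP000002", "SRR0000004", "chr7", 140753336, 20, 5),
    ("SRP000002", "SRR0000004", "chr7", 140753340, 18, 3),
    ("SRP000003", "SRR0000005", "chr7", 140753400, 22, 7),
]


def parse_site(site):
    """Parse a string such as 'chr7:140,753,336' into ('chr7', 140753336)."""
    match = re.fullmatch(r"(chr\w+):([\d,]+)", site.strip())
    if match is None:
        raise ValueError(f"unrecognised genomic site: {site!r}")
    return match.group(1), int(match.group(2).replace(",", ""))


def in_window(records, chrom, pos, flank=15):
    """Keep records on `chrom` within `flank` bases of `pos` (assumed window)."""
    return [r for r in records if r[2] == chrom and abs(r[3] - pos) <= flank]


def rank_studies(records, chrom, pos):
    """Rank studies by the number of runs with alt reads at the query site."""
    runs_with_alt = defaultdict(set)
    for study, run, r_chrom, r_pos, _ref, alt in records:
        if r_chrom == chrom and r_pos == pos and alt > 0:
            runs_with_alt[study].add(run)
    counts = ((study, len(runs)) for study, runs in runs_with_alt.items())
    return sorted(counts, key=lambda item: (-item[1], item[0]))


chrom, pos = parse_site("chr7:140,753,336")
neighbors = in_window(records, chrom, pos)
ranking = rank_studies(records, chrom, pos)
print(ranking)
```

Runs with zero alternative-allele reads at the site are still returned by `in_window`, but they do not count toward a study's rank.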
Comparing the microbe presence using our JupyterHub
In the following example, we show how you can identify differential microbe presence between experimental conditions. (Left) In this case, we compared the sequencing samples with annotations containing “HeLa Cells” and “B-Cells”. (Right) We were also able to validate the SRA experiments that claimed to be collected from HeLa cells, a cell line known to contain the HPV genome.
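One simple way to compare microbe presence between two groups is to count the samples in each group with at least one read assigned to the microbe, then run a one-sided Fisher's exact test on the resulting 2x2 table. The sketch below does this with made-up read counts. The presence threshold and the choice of test are assumptions for illustration and are not necessarily what the JupyterHub notebooks use.

```python
from math import comb

# Hypothetical HPV read counts per sample in each annotation group.
hpv_reads = {
    "HeLa Cells": [120, 85, 0, 240, 60],
    "B-Cells": [0, 0, 0, 0, 0],
}


def presence_table(reads_a, reads_b, min_reads=1):
    """Build the 2x2 table [[present_a, absent_a], [present_b, absent_b]]."""
    present_a = sum(1 for n in reads_a if n >= min_reads)
    present_b = sum(1 for n in reads_b if n >= min_reads)
    return [[present_a, len(reads_a) - present_a],
            [present_b, len(reads_b) - present_b]]


def fisher_exact_greater(table):
    """One-sided p-value: probability of at least table[0][0] 'present'
    samples in the first group, given fixed row and column totals."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)
    upper = min(row1, col1)
    numerator = sum(comb(row1, k) * comb(row2, col1 - k)
                    for k in range(a, upper + 1))
    return numerator / denominator


table = presence_table(hpv_reads["HeLa Cells"], hpv_reads["B-Cells"])
p_value = fisher_exact_greater(table)
print(table, round(p_value, 4))
```

A small p-value here only says that presence is enriched in the first group. Low read counts can also come from contamination or misassigned reads, so the threshold deserves care with real data.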
The following word map should give you a sense of the kind of data that is available in the SRA. The size of each BioSample attribute node represents the data availability on a log2 scale. The distances between the nodes in this t-SNE plot represent the textual semantic similarities between the metadata from the submitted annotations. For example, when you zoom into the attribute group “disease”, you can hover over the neighboring labels and see that the BioSample attribute “disease” is closely grouped with relevant attributes like “diagnosis” and “disease status”.
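To make the node-size scaling concrete, here is a small sketch that counts how often each BioSample attribute name occurs and converts the counts to a log2 scale. The sample records are invented. The `+ 1` inside the logarithm is an assumption added so that an attribute seen only once still gets a non-zero size, and the plot itself may scale differently.

```python
import math
from collections import Counter

# Hypothetical BioSample annotations (attribute name -> free-text value).
biosamples = [
    {"disease": "melanoma", "tissue": "skin"},
    {"disease": "none", "tissue": "blood", "diagnosis": "healthy"},
    {"tissue": "blood", "disease status": "control"},
    {"disease": "lymphoma", "tissue": "lymph node"},
]


def attribute_counts(biosamples):
    """Count how many BioSamples carry each attribute name."""
    return Counter(name for sample in biosamples for name in sample)


def node_sizes(counts):
    """Log2-scaled node size per attribute (the +1 offset is an assumption)."""
    return {name: math.log2(n + 1) for name, n in counts.items()}


counts = attribute_counts(biosamples)
sizes = node_sizes(counts)
print(sizes)
```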
For more information, please see: https://www.biorxiv.org/content/biorxiv/early/2018/09/12/414136.full.pdf
Key resource table
|SkyMap JupyterHub||This paper||http://hannahcarterlab.org/jupyterhub|
|SkyMap JupyterHub Jupyter notebooks and its Docker environment||This paper||https://github.com/brianyiktaktsui/CarterLabJupyterHub|
|SkyMap processing pipeline and the conda execution environment||This paper||https://github.com/brianyiktaktsui/Skymap|
|SkyMap FTP server hosting reprocessed data||This paper||ftp://download.hannahcarterlab.org/|
|SRA IDs explanations||NCBI||https://www.ncbi.nlm.nih.gov/sra|
- Most of the text on this webpage was copied from this manuscript I wrote: SkyMap JupyterHub: A cloud platform to query and analyze >900,000 re-processed public sequencing experiments from >20,000 studies
- Explanation of the human allelic read count extraction pipeline: Extracting allelic read counts from 250,000 human sequencing runs in Sequence Read Archive
- SRA metadata and NLP: Creating a scalable deep learning based Named Entity Recognition Model for biomedical textual data by repurposing BioSample free-text annotations
What do the different SRA accessions represent? (This section is copied and pasted from the official NCBI website: https://www.ncbi.nlm.nih.gov/books/NBK56913/)
There are 6 different SRA accession types:
|Accession Prefix||Accession Name||Definition||Example|
|SRA||SRA submission accession||The submission accession represents a virtual container that holds the objects represented by the other five accessions and is used to track the submission in the archive.||Since the SRA accession number is an artificial packaging construct, there is no example available since the SRA accession number has no specific response page|
|SRP||SRA study accession||A Study is an object that contains the project metadata describing a sequencing study or project. Imported from BioProject.||HTML|
|SRX||SRA experiment accession||An Experiment is an object that contains the metadata describing the library, platform selection, and processing parameters involved in a particular sequencing experiment.||HTML|
|SRR||SRA run accession||A Run is an object that contains actual sequencing data for a particular sequencing experiment. Experiments may contain many Runs depending on the number of sequencing instrument runs that were needed.||HTML|
|SRS||SRA sample accession||A Sample is an object that contains the metadata describing the physical sample upon which a sequencing experiment was performed. Imported from BioSample.||HTML|
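When working with mixed lists of identifiers, it can help to classify an accession by its prefix. The sketch below covers only the five prefixes listed in the table above. The function name and the returned labels are illustrative, and the pattern (a prefix followed by digits) is an assumption about the identifier format.

```python
import re

# Prefix -> object type, taken from the table above.
ACCESSION_TYPES = {
    "SRA": "submission",
    "SRP": "study",
    "SRX": "experiment",
    "SRR": "run",
    "SRS": "sample",
}


def classify_accession(accession):
    """Return the SRA object type for an accession, or None if unrecognised."""
    match = re.fullmatch(r"(SR[APXRS])(\d+)", accession.strip().upper())
    if match is None:
        return None
    return ACCESSION_TYPES[match.group(1)]


for example in ["SRR1234567", "srp000001", "GSM123456"]:
    print(example, "->", classify_accession(example))
```

Identifiers with any other prefix return `None` rather than raising, so the function can be used to filter a mixed list.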