The vast number of public sequencing experiments generated from human and various model organisms could be useful for understanding human genetics. However, the sequencing data remain difficult for the broad bioinformatics research community to reuse. We therefore built an efficient pipeline and extracted transcript read counts, allelic read counts, and microbe read counts from >900,000 sequencing experiments generated by >20,000 studies deposited in the Sequence Read Archive (SRA). This resource could be useful for cataloging the transcriptome, variants, and microbial composition of >700,000 biosamples. To make the data more accessible to the community, we created a JupyterHub environment where any researcher can retrieve the -omic layers and conduct simple data analysis without a programming background.

Data landscape of SRA

(Left) SRA data volume (y-axis) across the library strategies with highest availability (x-axis) and species with highest availability (hue). (Right) SRA data volume (y-axis) is increasing over time (x-axis).

The landscape of reprocessed SRA data

(Left) Data availability of the allelic read counts data and microbe read count data extracted from the SRA. (Right) Data availability of the transcriptome data extracted from the SRA.


Click here to try it!!!

From data searching to analysis within minutes with our JupyterHub:

Feel free to contact me at if you have any questions. I will try to reply within three days.

Expression analysis using our JupyterHub

This example shows how a user can query and compare the expression profiles of T-Cell and B-Cell samples using our JupyterHub within minutes. (top) Workflow: the user inputs a list of experimental conditions as a free-text query, in this example “B-Cell, T-Cell”, (middle) which retrieves the expression profiles with annotations containing “B-Cell” or “T-Cell”. (bottom) The user can then click on the analysis links to generate various differential expression analyses.
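The free-text query step described above can be illustrated with a minimal sketch. This is not the JupyterHub's actual implementation: the function names, the toy annotations, and the case-insensitive substring matching are all assumptions made for illustration.

```python
# Toy metadata: experiment ID -> free-text annotation (all values are made up).
annotations = {
    "EXP_1": "primary B-Cell from peripheral blood",
    "EXP_2": "CD4+ T-Cell, activated",
    "EXP_3": "HeLa cells",
    "EXP_4": "naive b-cell",
}


def parse_query(query):
    """Split a comma-separated free-text query into lowercase terms."""
    return [term.strip().lower() for term in query.split(",") if term.strip()]


def match_experiments(query, annotations):
    """Group experiment IDs under every query term their annotation contains."""
    terms = parse_query(query)
    groups = {term: [] for term in terms}
    for exp_id, text in annotations.items():
        lowered = text.lower()  # assumption: matching ignores case
        for term in terms:
            if term in lowered:
                groups[term].append(exp_id)
    return groups


groups = match_experiments("B-Cell, T-Cell", annotations)
print(groups)
```

Each resulting group could then be passed to a downstream comparison, such as a differential expression analysis.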

Two steps to go from data searching to analysis

Example interactive analysis: PCA showing that the expression profiles are separated by the first three PCs (top left) and first two PCs (top right). (bottom left) Interactive volcano plot to identify DE genes. (bottom right) The user can also query by different gene names to retrieve the studies with the highest expression levels, as opposed to querying by metadata.
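For readers unfamiliar with volcano plots, the x-axis is typically a log2 fold change between two groups. The sketch below shows one common way to compute it. The pseudocount, the gene names, and the expression values are assumptions for illustration, and the notebooks may compute this differently.

```python
import math
from statistics import mean


def log2_fold_change(group_a, group_b, pseudocount=1.0):
    """log2 ratio of mean expression in group_a over group_b.

    The pseudocount (an assumption here) avoids division by zero
    for genes that are not expressed in one group.
    """
    return math.log2((mean(group_a) + pseudocount) / (mean(group_b) + pseudocount))


# Toy expression values per gene and group (made up for illustration).
expression = {
    "GENE_X": {"t_cell": [7.0, 7.0], "b_cell": [1.0, 1.0]},
    "GENE_Y": {"t_cell": [3.0, 3.0], "b_cell": [3.0, 3.0]},
}

lfc = {
    gene: log2_fold_change(values["t_cell"], values["b_cell"])
    for gene, values in expression.items()
}
print(lfc)
```

A full volcano plot also needs a significance value per gene for the y-axis, which this sketch leaves out.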

Example analysis

Variant extraction using our JupyterHub

If a user is interested in investigating the mutational landscape of BRAF V600 and its neighboring genomic sites across the existing public human studies, they can input the corresponding genomic site, chr7:140,753,336, into our platform. (Left) The platform then identifies all human sequencing studies with allelic read counts detected at the query position, along with the neighboring variants within the 15bp window, which is informative for evaluating the studies that carry the query mutation. The size of each node represents the median mapping quality score (MAPQ). (Right) Example interactive study summary table: a screenshot of our interactive JupyterHub web interface, which allows the user to scroll through the available studies ranked by the number of sequencing experiments in which the alternative allele was detected.
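The positional query described above can be sketched as follows. The record layout, the study and experiment IDs, and the read counts are hypothetical, and the 15bp window is interpreted here as 15bp on either side of the query site, which is an assumption.

```python
from collections import Counter


def parse_site(site):
    """Parse a site such as 'chr7:140,753,336' into (chromosome, position)."""
    chrom, pos = site.split(":")
    return chrom, int(pos.replace(",", ""))


def in_window(record, chrom, pos, flank=15):
    """True if the record lies within `flank` bp of the query position."""
    return record["chrom"] == chrom and abs(record["pos"] - pos) <= flank


# Toy allelic read count records (all values are made up).
records = [
    {"study": "STUDY_A", "experiment": "EXP_1", "chrom": "chr7", "pos": 140753336, "alt_reads": 12},
    {"study": "STUDY_A", "experiment": "EXP_2", "chrom": "chr7", "pos": 140753336, "alt_reads": 4},
    {"study": "STUDY_A", "experiment": "EXP_1", "chrom": "chr7", "pos": 140753340, "alt_reads": 3},
    {"study": "STUDY_B", "experiment": "EXP_3", "chrom": "chr7", "pos": 140753336, "alt_reads": 7},
    {"study": "STUDY_B", "experiment": "EXP_4", "chrom": "chr7", "pos": 140753336, "alt_reads": 0},
    {"study": "STUDY_C", "experiment": "EXP_5", "chrom": "chr7", "pos": 140753400, "alt_reads": 9},
]

chrom, pos = parse_site("chr7:140,753,336")

# Records at the query site or in its neighborhood.
neighbors = [r for r in records if in_window(r, chrom, pos)]

# Experiments with the alternative allele detected exactly at the query site.
at_site = {
    (r["study"], r["experiment"])
    for r in neighbors
    if r["pos"] == pos and r["alt_reads"] > 0
}

# Rank studies by the number of such experiments, highest first.
ranking = Counter(study for study, _ in at_site).most_common()
print(ranking)
```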

Example analysis

Comparing the microbe presence using our JupyterHub

The following example shows how you can identify differential microbe presence between experimental conditions. (left) In this case, we compared the sequencing samples with annotations containing “HeLa Cells” and “B-Cells”. (right) We were also able to validate the SRA experiments annotated as collected from HeLa cells, a cell line known to contain the HPV genome.
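One simple way to compare microbe presence between two groups of samples is to compute, for each group, the fraction of samples with reads assigned to the microbe. The sketch below assumes a hypothetical per-sample dictionary of microbe read counts and a minimum read threshold of 1. The data are made up, and this is not necessarily the statistic used in the notebooks.

```python
def presence_rate(samples, microbe, min_reads=1):
    """Fraction of samples with at least `min_reads` reads assigned to `microbe`."""
    if not samples:
        return 0.0
    present = sum(1 for counts in samples if counts.get(microbe, 0) >= min_reads)
    return present / len(samples)


# Toy microbe read counts per sample (made up for illustration).
hela_samples = [{"HPV": 250}, {"HPV": 90}, {"HPV": 0}, {"HPV": 40}]
b_cell_samples = [{"HPV": 0}, {}, {"MICROBE_Z": 15}, {"HPV": 0}]

hela_rate = presence_rate(hela_samples, "HPV")
b_cell_rate = presence_rate(b_cell_samples, "HPV")
print(hela_rate, b_cell_rate)
```

A large gap between the two rates would flag the microbe as differentially present. The same check can also highlight samples whose annotation and microbial content disagree.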



The following word map should give you a sense of the kind of data that is available in the SRA. The size of each BioSample attribute node represents the data availability in log2 scale. The distances between the nodes in this t-SNE plot represent the textual semantic similarities between the metadata from the submitted annotations. For example, when you zoom into the attribute group “disease” you can hover over the neighboring labels and see that the BioSample attribute “disease” is closely grouped with relevant attributes like “diagnosis” and “disease status”.
For more information, please see:

Key resource table

| Resource | Source |
| --- | --- |
| SkyMap JupyterHub | This paper |
| SkyMap JupyterHub Jupyter notebooks and its Docker environment | This paper |
| SkyMap processing pipeline and the conda execution environment | This paper |
| SkyMap FTP server hosting reprocessed data | This paper |
| SRA IDs explanations | NCBI |

Associated manuscripts


What do the different SRA accessions represent? (This section is copied and pasted from the official NCBI website.)

There are 6 different SRA accession types:

| Accession Prefix | Accession Name | Definition | Example |
| --- | --- | --- | --- |
| SRA | SRA submission accession | The submission accession represents a virtual container that holds the objects represented by the other five accessions and is used to track the submission in the archive. | Since the SRA accession number is an artificial packaging construct, there is no example available since the SRA accession number has no specific response page |
| SRP | SRA study accession | A Study is an object that contains the project metadata describing a sequencing study or project. Imported from BioProject. | HTML |
| SRX | SRA experiment accession | An Experiment is an object that contains the metadata describing the library, platform selection, and processing parameters involved in a particular sequencing experiment. | HTML |
| SRR | SRA run accession | A Run is an object that contains actual sequencing data for a particular sequencing experiment. Experiments may contain many Runs depending on the number of sequencing instrument runs that were needed. | HTML |
| SRS | SRA sample accession | A Sample is an object that contains the metadata describing the physical sample upon which a sequencing experiment was performed. Imported from BioSample. | HTML |
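
When working with mixed lists of IDs, it can help to tell the accession types apart by prefix. The following is a minimal sketch that covers only the prefixes listed above. The function name and the example IDs are placeholders, not part of any NCBI or SkyMap API.

```python
# Mapping from accession prefix to the object type it identifies,
# limited to the prefixes listed in the table above.
ACCESSION_TYPES = {
    "SRA": "submission",
    "SRP": "study",
    "SRX": "experiment",
    "SRR": "run",
    "SRS": "sample",
}


def accession_type(accession):
    """Return the object type for an SRA accession, or None if the prefix is unknown."""
    return ACCESSION_TYPES.get(accession[:3].upper())


# Placeholder accession IDs used only to exercise the lookup.
for acc in ["SRR000001", "SRX000001", "XYZ000001"]:
    print(acc, accession_type(acc))
```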