Summary: The purpose of this JupyterHub is to let the community retrieve public data with ease. For example, you can 1) retrieve the expression of a gene in any of the public sequencing profiles in < 1 second, or 2) extract the allelic read counts of a particular sequencing profile in < 1 second. Skymap is a standalone database that aims to offer:

  1. A single data matrix for each omic layer for each species, spanning a total of >400k sequencing runs from all the public studies, produced by reprocessing petabytes of sequencing data.
  2. A biological metadata file that describes the relationships between the sequencing runs, together with the keywords extracted from over 3 million free-text annotations using NLP.
  3. A technical metadata file that describes the relationships between the sequencing runs.

 

Step-by-step guide to running our notebooks (takes < 1 min to complete):

  1. Click this button to log in with any of your Google accounts. We don’t ask for any personal information and no registration is involved; the account is used only for logging in.
  2. Click the “notebooks” directory.
  3. Click the example notebook “basicRNAseqAnalysis.ipynb”.
  4. Click “Run All” to execute the Python code cells in the notebook.

Feel free to change the notebook and run it your own way. For example, you can change the query gene from “TP53” to “GAPDH” to extract the expression level (TPM) of GAPDH from >100,000 sequencing runs.
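
Conceptually, swapping the query gene just selects a different row of a gene-by-run expression matrix. Here is a minimal sketch of that idea using a tiny made-up table; the dictionary `toy_tpm`, the run IDs, the TPM values and the helper `expression_for_gene` are all hypothetical stand-ins, and the real notebook uses its own loading code against the Skymap matrix.

```python
# Toy stand-in for a gene-by-run TPM matrix.
# The run IDs and values below are invented for illustration only.
toy_tpm = {
    "TP53": {"RUN_A": 12.5, "RUN_B": 30.0, "RUN_C": 0.0},
    "GAPDH": {"RUN_A": 950.0, "RUN_B": 1210.5, "RUN_C": 875.25},
}


def expression_for_gene(matrix, gene):
    """Return a {run_id: TPM} copy of one gene's row (hypothetical helper)."""
    if gene not in matrix:
        raise KeyError(f"gene {gene!r} not found in matrix")
    return dict(matrix[gene])


query_gene = "GAPDH"  # change this name to query a different gene
gapdh_tpm = expression_for_gene(toy_tpm, query_gene)

# Print one line per sequencing run, sorted by run ID.
for run_id, tpm in sorted(gapdh_tpm.items()):
    print(run_id, tpm)
```

In the example notebook, the equivalent change is editing the gene name in the query cell and choosing “Run All” again.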

 

Highlights:

  1. Skymap project
    1. GitHub (the primary documentation is here)
    2. Manuscript
    3. Notebooks:
      1. loadAllelicReadCountBySrrId.ipynb: Slice out allelic read counts (variants) from >250,000 public sequencing runs in <1 second.
      2. queryingRNAseqByGene.ipynb: Slice out the RNAseq expression level of a gene from >400,000 expression profiles in <1 second.

Feel free to email me if you have any questions; I will try to reply within three days: btsui@eng.ucsd.edu

GitHub page: details of the pipeline and more example analysis Jupyter notebooks.

Blogs:

  1. The rationale of Skymap project. 
  2. Why JupyterHub for data retrieval?

References: Introducing Skymap: Allelic read counts extracted from 250,000 human sequencing runs in Sequence Read Archive

Data: If you wish to have a local copy of Skymap, all the data are located in the JupyterHub (directory: ~/efs), which you can download using rsync.
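
As a rough sketch of what such an rsync transfer could look like, the snippet below only builds and prints the command rather than running it. It assumes you start from a terminal inside the JupyterHub and push to a machine you control over SSH; the destination `user@example.org:/data/skymap/` is a placeholder, and the helper `build_rsync_command` is hypothetical, not part of Skymap.

```python
import os
import shlex


def build_rsync_command(source, destination, dry_run=True):
    """Build an rsync argument list (hypothetical helper; nothing is executed)."""
    # -a: archive mode, -v: verbose, --partial: keep partially transferred files,
    # --progress: show per-file progress.
    cmd = ["rsync", "-av", "--partial", "--progress"]
    if dry_run:
        # --dry-run lists what would be transferred without copying anything.
        cmd.append("--dry-run")
    cmd += [source, destination]
    return cmd


# Expand "~" here because no shell will do it when the command is passed as a list.
source_dir = os.path.expanduser("~/efs/")
destination = "user@example.org:/data/skymap/"  # placeholder, replace with your own host

cmd = build_rsync_command(source_dir, destination)
print(shlex.join(cmd))
```

Running with `--dry-run` first is a cheap way to check the size of the transfer before committing to it; to actually copy, pass `dry_run=False` and hand the list to `subprocess.run`.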

Limitations of our JupyterHub:

  1. We don’t store your data. Once your session has been idle for more than an hour, your Kubernetes pod will be shut down and deleted, so you will probably want to download your own copy of any data you need.
  2. I didn’t set a memory limit, but your session will most likely die when its memory usage exceeds 8GB.
  3. I didn’t set a CPU limit, so in the unlikely case that the server is heavily loaded, you might see your notebook slow down.
  4. The JupyterHub might be taken down for maintenance at midnight (PST) every day.