Download FASTQ files from NCBI SRA


Overview

OmicSoft Suite with the Server on the Cloud add-on supports download of FASTQ files deposited to the NCBI SRA repository. FASTQ files can be retrieved to local Windows folders, Server folders, or mapped S3 cloud folders, with massive parallel computing from temporary cloud EC2 instances.

Prerequisites

You must be connected to your OmicSoft Server using OmicSoft Studio v12.4 or later.
Your OmicSoft Server must be v12.4 or later.
Your OmicSoft Server must have Server on the Cloud configuration in ArrayServer.cfg. Please see specific requirements at the bottom of this page; this should be configured by your OmicSoft Server administrator, using information provided by your AWS administrator.
Downloads require AWS "Requester pays" for data. If the cloud jobs run on AWS US-East-1 instances, this cost is negligible, but you should consider the cost to your AWS account before downloading data outside of AWS (e.g. downloading to a server folder).

Options

Which identifiers are supported?

List one or more SRA-recognized identifiers, one per line, using only upper-case letters (for example, srr1234567 will not be recognized). Supported identifiers include:

NCBI identifiers: SRR, SRX, SRS, SRP
EBI identifiers: ERR, ERX, ERS, ERP
DDBJ (DNA Data Bank of Japan) identifiers: DRR, DRX, DRS, DRP
BioSample accession: SAMN, SAMEA, SAMD
BioProject accession: PRJNA, PRJEB, PRJDB
GEO accessions: GPL, GSM, GSE, GDS (available in v12.5 and later)

For each GEO ID, the corresponding SRA ID(s) are found. Usually there is only one SRX (SRA experiment) for each GSM (GEO sample).
If a GEO ID is not found or has no associated SRA ID, this is logged, but execution continues.
Publicly-available data are currently supported for download; controlled-access data such as those that require dbGaP approval are not available.
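
As a pre-check before pasting a long ID list, the prefix rules above can be sketched in Python. This is only an illustration, not OmicSoft's own validation: the function names are hypothetical, and the assumption that every accession is a listed prefix followed by digits is ours. GEO prefixes apply only to v12.5 and later.

```python
import re

# Accession prefixes taken from the list above, grouped by source.
SUPPORTED_PREFIXES = {
    "NCBI SRA": ("SRR", "SRX", "SRS", "SRP"),
    "EBI": ("ERR", "ERX", "ERS", "ERP"),
    "DDBJ": ("DRR", "DRX", "DRS", "DRP"),
    "BioSample": ("SAMN", "SAMEA", "SAMD"),
    "BioProject": ("PRJNA", "PRJEB", "PRJDB"),
    "GEO": ("GPL", "GSM", "GSE", "GDS"),
}


def classify_identifier(identifier):
    """Return the source group for an identifier, or None if unsupported.

    Matching is case-sensitive, so lower-case input is rejected.
    Assumption (ours): the part after the prefix consists of digits only.
    """
    for group, prefixes in SUPPORTED_PREFIXES.items():
        for prefix in prefixes:
            if re.fullmatch(re.escape(prefix) + r"\d+", identifier):
                return group
    return None


def split_id_list(text):
    """Split a one-ID-per-line block into (valid, rejected) lists."""
    valid, rejected = [], []
    for line in text.splitlines():
        candidate = line.strip()
        if not candidate:
            continue  # ignore blank lines
        if classify_identifier(candidate) is not None:
            valid.append(candidate)
        else:
            rejected.append(candidate)
    return valid, rejected
```

A check like this only catches formatting problems. Whether an accession exists and is public still has to be confirmed with "Check availability and metadata".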

Check availability and metadata details by clicking Check availability and metadata after entering one or more identifiers.

Which reads to download?

Biological reads only - Appropriate in most cases. Retrieves one read (single-end) or two reads (paired-end) per run; this uses the "split-3" parameter.
- Combine reads within an experiment - If selected, multiple SRR (ERR, DRR, etc.) files for a single experiment ID (SRX, ERX, DRX, etc.) will be saved as a single file.
Biological + technical reads - Some datasets, particularly single-cell 'omics experiments, should be downloaded with this option, because they store information in one or more biological reads as well as one or more technical reads that carry the molecular identifiers; this uses the "split-reads" parameter.

Metadata download

Retrieve SRA metadata for samples - For every downloaded run/experiment file, all available tabular metadata from SRA will be retrieved and saved to the specified location as sample_metadata_sra.txt.
Retrieve metadata from GEO - For every downloaded sample, all available tabular metadata from GEO will be retrieved and saved to the specified location as GSEnnnnnn.GPLnnnnn.Design.txt including identifiers for the GEO entry and the SRA entry (to allow joining to the SRX IDs).
- Note: GEO metadata is retrieved only for SRA entities that have center_name=='GEO' and a value in the sample_name column (in the SRA metadata). The sample_name value is assumed to be the GEO sample ID.
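
To see in advance which runs would qualify for GEO metadata retrieval, the two conditions in the note can be applied to a downloaded sample_metadata_sra.txt. The sketch below assumes the file is tab-delimited with a header row; the `run` column and all accession values in the example are hypothetical placeholders, and only `center_name` and `sample_name` come from the description above.

```python
import csv
import io


def geo_linked_rows(metadata_text):
    """Yield metadata rows that meet both conditions from the note above."""
    reader = csv.DictReader(io.StringIO(metadata_text), delimiter="\t")
    for row in reader:
        center = (row.get("center_name") or "").strip()
        sample = (row.get("sample_name") or "").strip()
        # Keep the row only if the submitter is GEO and a sample name exists.
        if center == "GEO" and sample:
            yield row


# Hypothetical three-row table: only the first row satisfies both conditions.
example = (
    "run\tcenter_name\tsample_name\n"
    "SRR0000001\tGEO\tGSM0000001\n"
    "SRR0000002\tGEO\t\n"
    "SRR0000003\tOTHER\tsampleA\n"
)
selected = [row["sample_name"] for row in geo_linked_rows(example)]
print(selected)  # ['GSM0000001']
```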

Retrieval of metadata from GEO requires an NCBI Entrez API key, which can be generated by following these instructions: https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/ .

The API key can be defined either for the user in OmicSoft Studio (Tools | Preferences | Misc), or for all users of an OmicSoft Server installation with the parameter NcbiEntrezApiKey.

Download options

Use faster, parallel gzip compression - On modern systems with support for Zlib compression, compression of downloaded files will be parallelized, significantly improving retrieval time. Uncheck this option only if you are unable to extract the FASTQ files from the GZip archive with third-party tools that don't completely implement the standard.

Check availability and metadata

After entering one or more recognized IDs, click "Check availability and metadata" to view available information.

SRA metadata: Primarily technical data, with one row per run.

GEO metadata: Primarily sample-level data.

Key information from metadata preview:

Run-level information: Regardless of whether you input SRR, SRX, SRP, or other recognized information, detailed run-level (SRR or equivalent) information will be displayed for each ID.
Availability: Only public data are currently downloadable. Any unrecognized identifier will be listed as NotFound. Any restricted access (e.g. dbGaP) identifier will be Restricted.
Links to additional SRA details: SRA IDs are hotlinked to NCBI's SRA, so you can confirm details of the experiment, including whether technical reads are available.
Total size: Confirm that you have enough space for all the data you are about to download.
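
For local or server folders, the space check can be scripted. The following is a minimal sketch using Python's standard `shutil.disk_usage`; the helper names and the 20% safety margin are our own assumptions, and the size of the final FASTQ files may differ from the total shown in the preview.

```python
import shutil


def gb_to_bytes(size_gb):
    """Convert a size in GiB to bytes."""
    return int(size_gb * 1024 ** 3)


def has_enough_space(folder, required_bytes, safety_factor=1.2):
    """Return True if `folder` has room for the download plus a margin.

    safety_factor is a hypothetical margin (20% by default); adjust it
    to your own data and compression settings.
    """
    free_bytes = shutil.disk_usage(folder).free
    return free_bytes >= required_bytes * safety_factor
```

For example, `has_enough_space("/data/sra", gb_to_bytes(250))` (with a folder path of your own) reports whether a 250 GiB download would fit with the margin applied. This does not apply to S3 cloud folders.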

Output folder

Local folder download (Windows only) - While not connected to an OmicSoft Server, or with a local OmicSoft project open and active, select a local folder.
- Downloads will be executed using computational resources on your local computer.
Server folder download - While connected to your OmicSoft Server, select a mapped server folder.
- Downloads will be executed using computational resources on your OmicSoft Server.
Cloud folder download - While connected to your OmicSoft Server with properly-configured OmicSoft Server on the Cloud, select a mapped cloud folder.
- Downloads will be executed using on-demand AWS Cloud EC2 instances, one sample per instance.

Important: To download all samples in parallel using AWS cloud instances, your output folder must be an S3 folder. The interface will confirm this for you.

Once you've set all parameters, the OK button will be enabled and you can trigger downloads.

Output

Each SRA ID will be submitted as a separate job, up to the number of parallel jobs you selected. If more IDs were selected than available jobs, identifiers will be distributed among parallel jobs to optimize download.
When your job is complete, you will find all FASTQ files and SRA metadata in the specified folder.

Best practices

Optimizing parallel downloads - Each experiment or run will be downloaded as a separate job. If downloading to local or server folder, be mindful of available CPU and space resources. If downloading to cloud folder, a parallel EC2 instance will be temporarily generated for each sample, so it is generally safe to specify one parallel job per SRR/SRX.

Downloading biological reads vs biological + technical reads - Unless you are downloading scRNAseq data, you generally only need "biological reads"; technical reads usually hold UMIs or other tag information. You can check whether technical reads are available by clicking "Check availability and metadata", then clicking one of the SRA ID links to the NCBI run selector.

Next Steps

Once you've downloaded data, you may want to run Raw QC analysis of the data, or align data using OmicSoft's pipelines or your preferred NGS analysis tools.
Follow the OmicSoft RNA-Seq analysis Tutorial for one example of subsequent steps in your analysis.

Appendix: Configuration of your Server on the Cloud installation for SRA download

These steps should be completed by your OmicSoft Server administrator, working with your AWS administrator. Details of these parameters can be found in ArrayServer.cfg Configuration File in the [Cloud] section. Please be aware that your User Account must support "Requester Pays" to cover the costs of data egress. For this reason, we recommend downloading to S3 when possible.

Specify the following parameters in the [Cloud] section (all are specified as part of your Server on the Cloud configuration):
- Provider=Amazon
- Ami and AmiSnapshot, using an Ami that supports v12.4+
- Region, e.g. us-east-1
- InstanceProfileArn
- Access Key and Secret Key
- OmicsoftCloudDirectory, mapped to an accessible S3 bucket for the Access Key
- MaxInstanceCount and MaxInstanceCountPerJob. Since AWS EC2 instances are charged per minute, there is little reason not to allow generous settings such as MaxInstanceCount=500 and MaxInstanceCountPerJob=100.
- EnableDataEncryption=True
To download to a Cloud S3 location, you must also have one or more accessible S3 buckets mapped in [CloudFolder].
Currently, Server on the Cloud configurations using IAM roles without Access/Secret Keys do not support this function, as the NCBI download function requires Access Keys.

The following additional parameters may be specified in the [Options] section. SraMaxDownloadSizeInGb and SraToolkitVersion should generally be changed only if specifically instructed by QIAGEN tech support.

SraMaxDownloadSizeInGb (default=300) - The maximum size of a single SRA accession.
SraToolkitVersion (default=3.0.2) - The version of the SRA Toolkit that will be downloaded if not found locally. Changing it is not recommended unless advised by the QIAGEN Tech Support team.
NcbiEntrezApiKey (new in v. 12.5) - mandatory only for the new GEO metadata retrieval functionality, if the SRA download is being run on the server or on cloud job instances.

If you don't have an API key, you must first create an NCBI account at https://account.ncbi.nlm.nih.gov/signup/ using a Google, Microsoft, or login.gov account (SSO authentication). For background, see https://ncbiinsights.ncbi.nlm.nih.gov/2021/01/05/important-changes-ncbi-accounts-2021/

After that, an API key can be obtained as described at: https://support.nlm.nih.gov/knowledgebase/article/KA-05317/en-us

An API key allows up to 10 requests/second/IP to the NCBI Entrez web service, which is used to query the GEO metadata database. Concurrent requests above this limit may be blocked by NCBI.
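
If you write your own scripts against the Entrez web service with the same API key, a simple client-side throttle helps stay under the 10 requests/second limit. The class below is a minimal sketch with hypothetical names; it is not how OmicSoft schedules its own requests. Because the limit is per IP address, processes that share an IP also share the budget, and this throttle only covers a single process.

```python
import time


class RequestThrottle:
    """Space out calls so that at most `max_per_second` start per second."""

    def __init__(self, max_per_second=10, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock  # injectable for testing
        self._sleep = sleep  # injectable for testing
        self._next_allowed = None

    def wait(self):
        """Block until the next request may be sent, then reserve its slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        # The following request may start one interval after this one.
        self._next_allowed = now + self.min_interval
```

Call `throttle.wait()` immediately before each request. The first call returns at once, and later calls are spaced at least 0.1 s apart with the default setting.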

When the SRA download proc is executed on the server, if the NcbiEntrezApiKey parameter is not specified in the OScript (or Studio preferences), the value from ArrayServer.cfg is used.

The machines where the GEO metadata is retrieved must be able to access https://ftp.ncbi.nlm.nih.gov and https://eutils.ncbi.nlm.nih.gov (HTTPS protocol, port 443, outbound).
Note: If a proxy server is configured for the OmicSoft client/server after the SRA download has been run once, it will be ignored during any future SRA download process.
If the proxy server address must be set or changed later, first delete the NCBI SRA Toolkit configuration file from ~/.ncbi/ (on Linux) or C:\Users\<username>\.ncbi\ (on Windows); the config file will be re-created the next time the SRA download is run on the server or in Studio.
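
To verify the outbound access requirement for the two NCBI hosts, an administrator could run a quick TCP check on the machine in question. This sketch uses Python's standard `socket.create_connection` and assumes direct outbound access: it does not go through a proxy, so a failure here is not conclusive on proxied networks. The function name is hypothetical.

```python
import socket

# Hosts and port taken from the access requirement stated above.
NCBI_HOSTS = ("ftp.ncbi.nlm.nih.gov", "eutils.ncbi.nlm.nih.gov")
HTTPS_PORT = 443


def check_outbound(hosts=NCBI_HOSTS, port=HTTPS_PORT, timeout=5.0,
                   connect=socket.create_connection):
    """Return {host: True/False} for a plain TCP connection attempt."""
    results = {}
    for host in hosts:
        try:
            conn = connect((host, port), timeout=timeout)
        except OSError:
            # Covers DNS failures, refused connections, and timeouts.
            results[host] = False
        else:
            conn.close()
            results[host] = True
    return results
```

A successful TCP connection shows only that port 443 is reachable; it does not validate TLS or the API key.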

If no key is configured on the OmicSoft Server instance, an NCBI Entrez API key may be configured separately for each user in order to use the GEO metadata retrieval functionality or to download SRA data by GEO IDs. Set it in Studio under Analysis / Tools / Preferences, Misc tab.

Appendix: Oscript

If you routinely download datasets from SRA, you may prefer to submit download jobs via Oscript. Use this template to retrieve the data of interest.

Begin DownloadFastqFromNcbiSraCloud /Namespace=NgsLib /RunOnServer=True;
IDs
"SRR3184299
...
SRR3184302";
Options /OutputFolder="/CloudFolderQaQC/test_pearson/GSE78220" /CombineRuns=True /GetMetadata=True /IncludeTechnicalReads=False /ParallelJobNumber=5 
  /GetGeoMetadata=True /NcbiEntrezApiKey=......;

End;

Key oscript parameters for this function:

IDs: The user can enter a mix of SRA IDs and/or GEO IDs (GPL, GSM, GSE, GDS).
For each GEO ID, the corresponding SRA ID(s) are found using the NCBI Entrez Eutils web service. Usually there is only one SRX (SRA experiment) for each GSM (GEO sample).
If a GEO ID is not found or has no associated SRA ID, this is logged, but the execution continues.
ParallelJobNumber - The number of parallel jobs to start.
OutputFolder - If a Cloud folder is specified, downloads will be executed in parallel using temporary EC2 instances. If a Server folder is specified, downloads will be executed in parallel using your OmicSoft Server's resources. If a local folder is specified, downloads will be executed in parallel using your computer's resources.
IncludeTechnicalReads - If true, will return both biological and technical reads; if false, will return only biological reads. Unless the data are single cell transcriptomics, technical reads are rarely needed.
CombineRuns - If IncludeTechnicalReads=False, Experiment (SRX), Sample (SRS), or Project (SRP) identifiers are specified, and this parameter is set to true, Run (SRR) files will be combined into Experiments.
GetMetadata (default=True) - Will return available SRA metadata for every identifier as a tab-delimited text file.
GetGeoMetadata (default=False, only in v12.5 and later) - The GEO metadata matching each specified SRA experiment is downloaded as a tab-delimited file.
NcbiEntrezApiKey (only in v12.5 and later) - Mandatory only if GetGeoMetadata=True or if one of the IDs is a GEO accession, when the SRA download is executed on the same machine as Studio or Server.
UseParallelCompression (default=True, only in v12.5 and later) - Parallelize the creation of Gzip files from FASTQ files.
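
If you generate many download jobs, the template can be filled in programmatically. The Python helper below is a hypothetical sketch that only assembles the OScript text from the parameters documented above; it does not submit the job, and its name and defaults are our own. GetGeoMetadata and NcbiEntrezApiKey are written only when requested, since they exist only in v12.5 and later.

```python
def build_sra_download_oscript(ids, output_folder, parallel_jobs=5,
                               combine_runs=True, get_metadata=True,
                               include_technical_reads=False,
                               get_geo_metadata=False, entrez_api_key=None):
    """Return OScript text that follows the template shown above."""
    if not ids:
        raise ValueError("at least one identifier is required")
    options = [
        f'/OutputFolder="{output_folder}"',
        f"/CombineRuns={bool(combine_runs)}",
        f"/GetMetadata={bool(get_metadata)}",
        f"/IncludeTechnicalReads={bool(include_technical_reads)}",
        f"/ParallelJobNumber={int(parallel_jobs)}",
    ]
    if get_geo_metadata:
        options.append("/GetGeoMetadata=True")
    if entrez_api_key:
        # Omitted when empty; per the text above, the server then uses
        # the value from ArrayServer.cfg.
        options.append(f"/NcbiEntrezApiKey={entrez_api_key}")
    id_block = "\n".join(ids)  # one identifier per line
    return (
        "Begin DownloadFastqFromNcbiSraCloud /Namespace=NgsLib /RunOnServer=True;\n"
        "IDs\n"
        f'"{id_block}";\n'
        f"Options {' '.join(options)};\n"
        "End;\n"
    )


script = build_sra_download_oscript(
    ["SRR3184299", "SRR3184302"],
    "/CloudFolderQaQC/test_pearson/GSE78220",
)
print(script)
```

The generated text can be reviewed and then submitted in the same way as a hand-written script.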
