Introduction to Sequence Read Archive

NOTE:
Sequence Read Archive (SRA) is a publicly accessible repository that serves as a centralized resource for storing and sharing high-throughput sequencing data. It is maintained by the National Center for Biotechnology Information (NCBI), which is a part of the United States National Library of Medicine (NLM).


databases_sra.png

The SRA database (https://www.ncbi.nlm.nih.gov/sra) primarily stores next-generation sequencing (NGS) data generated on a variety of platforms, such as Illumina, Ion Torrent, PacBio, and others. These sequencing technologies generate vast amounts of raw sequence data, often referred to as reads, which can be used for diverse biological and biomedical research purposes.

The SRA database provides a comprehensive infrastructure to organize, store, and retrieve sequencing data. It supports data compression techniques to optimize storage space while maintaining data integrity. The stored data is annotated with essential metadata, including:

  • experimental details,
  • sample information,
  • sequencing platforms, and
  • library preparation methods.

Once the desired datasets are identified, users can download the raw sequencing data for further analysis using their preferred bioinformatics tools and pipelines.

Retrieve data from SRA

Researchers and institutions can submit their raw sequencing data to the SRA database, making it available for the scientific community. It enables scientists to access a vast collection of sequencing data from a wide range of organisms and research areas, promoting data reuse, meta-analyses, and the discovery of new insights.

Users can access the SRA database through the NCBI website or programmatically via application programming interfaces (APIs). Both approaches offer powerful search functionality, allowing users to find specific datasets based on keywords, accession numbers, study details, organism information, or experimental attributes.
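As a rough illustration of programmatic access, the sketch below builds a search URL for the NCBI E-utilities esearch endpoint. The helper name build_sra_search_url is hypothetical, and the encoding is deliberately minimal (spaces only), so treat it as a starting point rather than a complete client.

```shell
# Build an E-utilities "esearch" URL for the SRA database.
# build_sra_search_url is a hypothetical helper name used for illustration.
build_sra_search_url() (
  term=$1
  retmax=${2:-20}   # maximum number of record IDs to return
  base="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
  # Minimal encoding: spaces become "+"; other special characters are not handled.
  encoded=$(printf '%s' "$term" | tr ' ' '+')
  printf '%s?db=sra&term=%s&retmax=%s\n' "$base" "$encoded" "$retmax"
)

# Print the URL for a keyword search (nothing is downloaded here).
build_sra_search_url "bumble bee worker"
```

The printed URL can be opened in a browser or passed to a download tool to retrieve the matching record IDs.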

Search for data online

The SRA database offers flexible search capabilities, allowing users to search by:

  • keywords
  • data accession identifiers
  • advanced search builder

This ensures that users can efficiently find the datasets they need for their research.

KEYWORDS
Users can enter keywords related to their research interest in the search box on the SRA database homepage ⤴. These keywords can include gene names, organism names, disease terms, experimental conditions, or any other relevant information. The database will return results that match the provided keywords.

For example, let’s search for: bumble bee worker

databases_sra_basic_search.png

If you want to narrow down your search results, you can utilize various search filters available on the right and left-hand side of the search results page.

databases_sra_basic_search_results.png



DATA ACCESSION IDs
If users already know the specific accession numbers or study names of the datasets they are looking for, they can directly enter those identifiers into the search box. This allows for quick retrieval of the desired datasets without the need for extensive keyword searching.

NOTE:
There are several top-level SRA entities and their accessions:
  • STUDY, identifiers starting with SRP#, ERP#, DRP#
  • SAMPLE, identifiers starting with SRS#, ERS#, DRS#
  • EXPERIMENT, identifiers starting with SRX#, ERX#, DRX#
  • RUN, identifiers starting with SRR#, ERR#, DRR#
  The first letter of each accession denotes the source database: S for NCBI SRA, E for EBI, and D for DDBJ.

    You can also search for BioProjects and GEO Datasets:
  • PROJECT, identifiers starting with PRJNA#, GSE#
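The prefix rules above can be sketched as a small shell function. The name classify_accession is hypothetical, and the function only inspects the leading letters; it does not check that the accession actually exists.

```shell
# Classify an accession by its prefix, following the list above.
# classify_accession is a hypothetical helper name.
classify_accession() (
  acc=$1
  case "$acc" in
    PRJNA*) echo "PROJECT (BioProject)"; exit 0 ;;
    GSE*)   echo "PROJECT (GEO series)"; exit 0 ;;
  esac
  # Second and third letters give the entity type.
  case "$acc" in
    ?RP*) entity="STUDY" ;;
    ?RS*) entity="SAMPLE" ;;
    ?RX*) entity="EXPERIMENT" ;;
    ?RR*) entity="RUN" ;;
    *) echo "UNKNOWN"; exit 1 ;;
  esac
  # First letter gives the source database.
  case "$acc" in
    S*) origin="NCBI SRA" ;;
    E*) origin="EBI" ;;
    D*) origin="DDBJ" ;;
    *) echo "UNKNOWN"; exit 1 ;;
  esac
  echo "$entity ($origin)"
)

classify_accession SRR14567204
```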

    For example, let’s assume you just read the research article PMID: 34181797 ⤴ and found a Data Availability Statement referring to BioProject PRJNA730495.

    databases_sra_accession_search.png

    This time, the Search results show all items related to a single BioProject. At the top of the page, you can also find a direct link to the related publication.

    databases_sra_accession_search_results.png




    ADVANCED SEARCH BUILDER
    The SRA database provides an Advanced Search feature that allows users to construct more complex and customized queries. By clicking on the Advanced link below the search box, users can access the Advanced Search Builder. By utilizing this tool, users can create sophisticated queries to retrieve datasets that match their specific requirements.

    PRO TIP:
    Advanced Search Builder is especially helpful when users have complex search criteria or need to refine their search based on multiple parameters.


    databases_sra_advanced_search.png

    The Advanced Search Builder includes various fields and filters that users can combine to construct their queries. These include:

    • Keywords
      Users can enter specific keywords related to their research interests.
    • Filters
      Users can apply filters such as organism, platform, study type, library source, instrument model, and more to narrow down the search results.
    • Operators
      Users can use logical operators like AND, OR, and NOT to combine search terms and create more precise queries.

    To get started with Advanced Search Builder please follow the instructions provided in the official SRA docs ⤴.
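To illustrate how terms and operators combine into one query string, here is a minimal sketch. The helper join_terms is hypothetical, and the field tags [Organism], [Platform], and [Strategy] are assumptions; check the builder itself for the exact field names it offers.

```shell
# Join search terms with a logical operator such as AND, OR, or NOT.
# join_terms is a hypothetical helper; the field tags in the example call
# are assumed and should be checked against the Advanced Search Builder.
join_terms() (
  op=$1
  shift
  query=""
  for term in "$@"; do
    if [ -z "$query" ]; then
      query=$term
    else
      query="$query $op $term"
    fi
  done
  printf '%s\n' "$query"
)

join_terms AND '"Bombus terricola"[Organism]' '"illumina"[Platform]' '"rna seq"[Strategy]'
```

The resulting string can be pasted into the search box or used as the query of a command-line search.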

    Obtain run accessions SRR# identifiers

    In the SRA database, a Run (or Run accession) is a unique identifier assigned to the results of a specific sequencing run within a dataset. Each sequencing run corresponds to a set of raw data files generated from a single sequencing library.
    Run accessions (SRR#) are used to download raw sequencing data from the SRA database.

    You can obtain run accessions in a few ways:

    • A. manually from online SRA Search results
    • B. manually from online SRA Run Selector
    • C. automatically in the command line

    databases_sra_get_accessions.png

    A. manually from SRA Search results

    Once you have refined your SRA search online (see section Search for data online) and see the list of results, explore them and collect the run accessions.

    1. Review the Search results and check all the items you want to download.
      NOTE: If you do not check any box, by default all results will be prepared for download.

    2. Then, on the top of the page:
      • click the Send to button and select the File radio button
      • from the Format dropdown select the option Accession List
      • download this file by clicking the Create File button
      • make sure to move this file to the location where you want to download the raw datasets (i.e., from which you are running the SRA Toolkit)

    databases_sra_obtain_accessions.png

    By default, the downloaded file is named SraAccList.csv and contains a single column with Run accessions:

    
    acc
    SRR14567204
    SRR14567205
    SRR14567206
    SRR14567207
    SRR14567208
    ...
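
Because the accession list starts with an acc header line, it helps to strip that line before passing the identifiers to other tools. A minimal sketch, assuming the file has one accession per line as shown above (read_accessions is a hypothetical helper name):

```shell
# Print run accessions from an accession list, skipping the "acc" header,
# blank lines, and Windows-style carriage returns.
# read_accessions is a hypothetical helper name.
read_accessions() (
  tr -d '\r' < "$1" | grep -E '^[SED]RR[0-9]+$'
)
```

For example, read_accessions SraAccList.csv > SRR.numbers would write a clean list that contains only the identifiers.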
    

    B. manually from SRA Run Selector

    Alternatively, once you have refined your SRA search online (see section Search for data online) and see the list of results, you can easily switch to the SRA Run Selector view. This online SRA tool provides a powerful and interactive interface that allows users to filter and sort the sequencing runs based on their preferences.

    NOTE:
    The Run Selector view enhances the usability of the SRA database by presenting sequencing run information in a structured and interactive manner. It enables users to efficiently explore and select the sequencing runs that meet their specific research requirements, facilitating downstream analysis and data retrieval.


    databases_sra_run_selector.png

    Here are some key features and functionalities of the Run Selector:

    • Filtering
      Users can apply filters to refine the list of sequencing runs displayed in the table. Filters are available for various attributes such as organism, library strategy, platform, read length, and more. By selecting specific filter options, users can narrow down the runs to those that match their criteria.

    • Sorting
      Users can sort the sequencing runs based on specific columns in ascending or descending order. This allows for easy organization and identification of runs based on desired characteristics, such as release date or read length.

    • Selection
      Users can select one or multiple runs from the table to further explore or download the associated sequencing data. By selecting specific runs, users can proceed to download the data files or access more detailed information about the selected runs.

    • Run details
      Clicking on a specific Run in the table provides additional details about that particular sequencing run. This may include information such as:

      • the experiment design,
      • library construction protocols,
      • experimental factors,
      • and other relevant metadata associated with the run.

      From there, you can also display Reads and download FASTA/FASTQ files for a given Run accession.
      Note that downloads from this page are limited to runs of up to 5 Gbases (Total Bases). Larger runs need to be downloaded with the SRA Toolkit.

    databases_sra_run_explorer.png

    C. via command line using EDirect tools

    To obtain run accessions from the command line you will need the esearch and efetch tools, which are part of NCBI Entrez Direct ⤴ (EDirect), the command-line interface to the NCBI E-utilities.

    NOTE:
    The esearch command is used to query various NCBI databases, including the SRA database, to retrieve specific information or search results.
    The efetch command with the -format runinfo option will retrieve the information for the matching datasets, including the run accessions.
    These tools allow you to perform searches and retrieve data programmatically from the command line.


    INSTALL EDIRECT (E-utilities)
    To use the esearch and efetch commands, you need to install EDirect (the NCBI Entrez Direct utilities), which can be obtained from the NCBI website. You can find more information and installation instructions at this link: https://www.ncbi.nlm.nih.gov/books/NBK179288/#chapter6_Getting_Started

    RETRIEVE SRRs
    Once you have installed NCBI’s EDirect, you can use the esearch command to search the SRA database and pipe its results to the efetch command to obtain the necessary information, such as Run accessions.

    The general syntax is as follows:

    
    esearch -db sra -query "your search terms" | efetch -format runinfo
    

    Replace “your search terms” with the relevant keywords or search criteria.

    For example, to get the SRR numbers associated with the BioProject PRJNA730495:

    
    esearch -db sra -query PRJNA730495 | efetch -format runinfo > runinfo.csv
    

    The output file runinfo.csv will contain a table with various details, including the Run accessions (in the first column).

    
    Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Sample,BioSample,SampleType,TaxID,ScientificName,SampleName,g1k_pop_code,source,g1k_analysis_group,Subject_ID,Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
    SRR14567204,2021-06-25 18:25:06,2021-05-17 12:50:50,22110877,4422175400,22110877,200,1750,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-zq-24/SRR014/14567/SRR14567204/SRR14567204.lite.1,SRX10910477,,RNA-Seq,cDNA,TRANSCRIPTOMIC,PAIRED,0,0,ILLUMINA,Illumina HiSeq 4000,SRP320091,PRJNA730495,3,730495,SRS9000934,SAMN19232003,simple,207648,Bombus terricola,GSM5319034,,,,,,,no,,,,,GEO,SRA1233078,,public,EEDE9E10B2038FD62970B54B892532EE,C9297792C29DD79798BDE466149A11C3
    SRR14567205,2021-06-25 18:25:06,2021-05-17 12:54:45,25106715,5021343000,25106715,200,1975,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-zq-24/SRR014/14567/SRR14567205/SRR14567205.lite.1,SRX10910478,,RNA-Seq,cDNA,TRANSCRIPTOMIC,PAIRED,0,0,ILLUMINA,Illumina HiSeq 4000,SRP320091,PRJNA730495,3,730495,SRS9000933,SAMN19232002,simple,207648,Bombus terricola,GSM5319035,,,,,,,no,,,,,GEO,SRA1233078,,public,179BB8F87C8E7AEA09AFD31EA47B17CB,15F09D8C96B02C734C1FD0701A5BACF9
    SRR14567206,2021-06-25 18:25:06,2021-05-17 12:54:14,21165129,4233025800,21165129,200,1624,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-zq-24/SRR014/14567/SRR14567206/SRR14567206.lite.1,SRX10910479,,RNA-Seq,cDNA,TRANSCRIPTOMIC,PAIRED,0,0,ILLUMINA,Illumina HiSeq 4000,SRP320091,PRJNA730495,3,730495,SRS9000935,SAMN19232001,simple,207648,Bombus terricola,GSM5319036,,,,,,,no,,,,,GEO,SRA1233078,,public,F59A9C00EED65826E0F40DF85CA0E512,F76970E0700A354530E7C0F28B0E0599
    SRR14567207,2021-06-25 18:25:06,2021-05-17 12:53:18,22777492,4555498400,22777492,200,1775,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-zq-24/SRR014/14567/SRR14567207/SRR14567207.lite.1,SRX10910480,,RNA-Seq,cDNA,TRANSCRIPTOMIC,PAIRED,0,0,ILLUMINA,Illumina HiSeq 4000,SRP320091,PRJNA730495,3,730495,SRS9000936,SAMN19232000,simple,207648,Bombus terricola,GSM5319037,,,,,,,no,,,,,GEO,SRA1233078,,public,9A865DAAC8CB2F9336A553C44D371CCB,509CFAFFD98F8877B09D19E31305F0EE
    ...
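
The run table has many columns, so it is often easier to pull out a few of them by name. The sketch below does this with awk; runinfo_columns is a hypothetical helper, and it assumes that no field contains an embedded comma, which holds for the rows shown above but is not guaranteed in general.

```shell
# Print selected columns of a runinfo CSV, looked up by header name,
# as tab-separated values. runinfo_columns is a hypothetical helper name.
# Assumes no field contains an embedded comma.
runinfo_columns() (
  file=$1
  shift
  cols=$(printf '%s,' "$@")
  awk -F',' -v cols="${cols%,}" '
    NR == 1 {
      n = split(cols, want, ",")
      for (i = 1; i <= NF; i++) idx[$i] = i
      for (j = 1; j <= n; j++) {
        if (!(want[j] in idx)) {
          print "unknown column: " want[j] | "cat 1>&2"
          exit 1
        }
      }
      next
    }
    # Skip blank lines and any repeated header line.
    $1 != "" && $1 != "Run" {
      line = $(idx[want[1]])
      for (j = 2; j <= n; j++) line = line "\t" $(idx[want[j]])
      print line
    }
  ' "$file"
)
```

A call such as runinfo_columns runinfo.csv Run LibraryLayout size_MB would list each run with its layout and size, which is handy when estimating the required disk space.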
    

    You can modify the above command to retrieve only the SRR# identifiers:

    
    esearch -db sra -query PRJNA730495 | efetch -format runinfo | cut -d "," -f 1 > SRR.numbers
    

    The content of the SRR.numbers file will look like this:

    
    Run
    SRR14567204
    SRR14567205
    SRR14567206
    SRR14567207
    ...
    

    Download SRA data files

    To download raw data files from the SRA database, you have a few options. All of them assume that you have a list of Run accessions, i.e., the SRR# identifiers of the samples you are interested in.

    Please see section Obtain run accessions SRR# to learn how to retrieve identifiers of samples from SRA database.

    Once you have a file with all Run accessions SRR#, for example:

    
    SRR14567204
    SRR14567205
    SRR14567206
    SRR14567207
    

    please follow the instructions for the approach of your choice:

    A. using SRA Toolkit

    The NCBI’s SRA Toolkit provides command-line utilities to download and manipulate SRA data. This software suite has been configured to connect to NCBI databases (including SRA) and download data via FTP or HTTPS.

    Install SRA Toolkit on your local machine
    Visit the NCBI SRA Toolkit repository (https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit) to download and install the SRA Toolkit appropriate for your operating system.

    Load SRA Toolkit module on the HPC infrastructure
    On an HPC (High-Performance Computing) cluster, it is common for administrators to preinstall various software packages and tools to provide users with a consistent and accessible computing environment. The SRA Toolkit, including its command-line utilities, is one such tool that may be preinstalled on the HPC cluster.

    When the SRA Toolkit is installed on an HPC cluster, it is typically made available as a module. A module is a software management system that allows users to load or unload specific software packages and their associated environments.

    1. Check available modules:
      
      module avail sratoolkit
      

      This command displays the software variants that have been preinstalled and are ready to be loaded.
      For example:
      sratoolkit/3.0.0-6ikpzzj

    2. Load the selected SRA Toolkit module:
      
      module load sratoolkit/3.0.0-6ikpzzj
      

    Get started with SRA Toolkit tools
    Once the module is loaded, you can use tools available in the SRA Toolkit to download the raw Sequence Reads from the SRA database.
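
Before starting a long download, it can be worth confirming that the tools are actually on your PATH. A minimal sketch (missing_tools is a hypothetical helper; the tool names are the ones described in this section):

```shell
# Report which of the given commands cannot be found on PATH.
# Returns a non-zero status if at least one is missing.
# missing_tools is a hypothetical helper name.
missing_tools() (
  status=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      status=1
    fi
  done
  exit "$status"
)

missing_tools prefetch fastq-dump fasterq-dump vdb-validate ||
  echo "load the sratoolkit module or install the SRA Toolkit first"
```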

    The SRA Toolkit provides a set of command-line utilities that facilitate the retrieval, manipulation, and analysis of data from the NCBI’s Sequence Read Archive (SRA) database.

    PRO TIP:
    Use the name of the command-line utility followed by the -h flag to learn more about the options available for each tool.


    sratoolkit command | description
    prefetch | Downloads binary SRA data files based on accession numbers or a text file containing accession numbers. It retrieves the data files and saves them locally for further processing.
    fastq-dump | Converts SRA data files into FASTQ format, which is a common file format for storing sequence data. It extracts the raw sequence reads and quality scores from the SRA files.
    sam-dump | Converts SRA data files into SAM format, which is used for storing sequence alignment information. It allows you to extract alignment data from SRA files in a readable format.
    vdb-validate | Performs validation checks on SRA data files to ensure their integrity and adherence to formatting standards. It can be used to verify the integrity of downloaded SRA files.
    vdb-config | Modifies configuration settings for the SRA Toolkit, such as specifying the location of downloaded files or adjusting caching options.
    fasterq-dump | A faster alternative to fastq-dump, specifically designed for downloading and converting SRA data files into FASTQ format. It provides improved performance for large datasets.

    The toolkit offers additional utilities and options for more advanced operations, such as filtering, subsetting, and quality control of SRA data.

    To fetch a run in FASTQ format, you can use the fastq-dump command:

    
    fastq-dump SRR447882
    

    This downloads the run in SRA format and converts it to a FASTQ file for you. If the run is paired-end, you will still end up with a single FASTQ file because, by default, fastq-dump writes both reads of a pair joined into a single record. To write the reads to separate files instead, provide the --split-files argument.

    
    fastq-dump --split-files SRR447882
    

    The downloaded FASTQ files will have the SRR number prefixed to all header lines of the file:

    
    head -n 12 SRR447882_1.fastq
    @SRR447882.1.1 HWI-EAS313_0001:7:1:6:844 length=84
    ATTGATCATCGACCAGAGNCTCATACACCTCACCCCACATATGTTTCCTTGCCATAGATCACATTCTTGNNNNNNNGGTGGANA
    +SRR447882.1.1 HWI-EAS313_0001:7:1:6:844 length=84
    BBBBBB;BB?;>7;?<?B#AA3@CBAA?@BAA@)=6ABBBBB?ACA;0A=257?A7+;;&########################
    @SRR447882.2.1 HWI-EAS313_0001:7:1:6:730 length=84
    AGTTGATTGTGATATAGGNGTCTATCGACATTGATGCATAGGTCCTCTATTAAACTTGTTTTGTGATGTNNNNNNNTTTTTTNA
    +SRR447882.2.1 HWI-EAS313_0001:7:1:6:730 length=84
    A?@B:@CA:=?BCBC:2C#7>BACB??@4@B@<=>;'>@>3:86>=6@=B@B<;)@@###########################
    @SRR447882.3.1 HWI-EAS313_0001:7:1:6:1343 length=84
    CATCAATGCAAGGATTGTNCCATTGGTAACAATTCCACTCCTAACTTGTCAATTGATTTTCATATAACTNNNNNNNCCAAAANT
    +SRR447882.3.1 HWI-EAS313_0001:7:1:6:1343 length=84
    BCB@BBC+5BCA>BABBA#@4BCCA>?CBBB4CB(*ABB?ABBAACCB8ABBB?(<<B?:########################
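
Each record in the listing above occupies four lines (header, sequence, separator, qualities), so the number of reads is the line count divided by four. The sketch below uses that to count reads; count_fastq_reads is a hypothetical helper and assumes unwrapped, four-line records as in the example.

```shell
# Count FASTQ records in a plain or gzip-compressed file.
# Assumes every record spans exactly four lines.
# count_fastq_reads is a hypothetical helper name.
count_fastq_reads() (
  file=$1
  case "$file" in
    *.gz) lines=$(gzip -dc "$file" | wc -l) ;;
    *)    lines=$(wc -l < "$file") ;;
  esac
  echo $((lines / 4))
)
```

For a paired-end run written with --split-files, comparing count_fastq_reads on the _1 and _2 files is a quick way to check that both files contain the same number of reads.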
    

    Although this normally does not affect any programs, some programs might throw an error saying that they cannot process these FASTQ files. To avoid this, you can request the file in the original format (--origfmt). Also, note that if you are downloading files in bulk, you can save a lot of space by compressing them in gzip format (--gzip).

    
    fastq-dump --split-files --origfmt --gzip SRR447882
    

    The fastq-dump command can also provide:

    • Additional filtering or clipping of the downloaded reads: to remove reads with poor quality or to trim adapters. This works for single-end reads, but for paired-end reads it may treat the two reads of a pair differently, and the result might not be usable with mapping programs that need strict pairs.
    • Compressed output: either gzipped or bzipped files, using the --gzip or --bzip2 options.
    • FASTA format: by using the --fasta option.

    B. using Linux commands

    WARNING:
    The NCBI Team, on October 17, 2019, announced the decommission of the SRA Fuse/FTP site and encouraged the users to try the SRA Toolkit as the recommended method of downloading the data from SRA. [source ⤴]

    In relation to the above warning, we encourage you to choose an approach based on the use of the SRA Toolkit. See section A. using SRA Toolkit.

    In cases where you cannot run the SRA Toolkit or any other program to download the files, you can still try the built-in Linux commands such as wget and curl.

    WARNING:
    Note that downloading files from the SRA database using this approach is very slow (~1 MB/s). Also, runs larger than 5 Gbases cannot be downloaded using this method.
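
If you already have a runinfo.csv file (see section Obtain run accessions), you can check in advance which runs exceed this limit. The sketch below reads the bases column; it assumes that 5 Gbases means 5,000,000,000 bases and that no field contains an embedded comma. The helper name runs_over_limit is hypothetical.

```shell
# List runs whose "bases" value exceeds a limit (default: 5,000,000,000).
# Assumes "5 Gbases" means 5e9 bases and that no field contains a comma.
# runs_over_limit is a hypothetical helper name.
runs_over_limit() (
  file=$1
  limit=${2:-5000000000}
  awk -F',' -v limit="$limit" '
    # Locate the "bases" column from the header line.
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "bases") col = i; next }
    col && $col + 0 > limit + 0 { print $1 }
  ' "$file"
)
```

With the example rows shown earlier, SRR14567205 (5,021,343,000 bases) would be reported, while the other three runs stay under the limit.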

    The standard web link for downloading the SRA files is:
    https://trace.ncbi.nlm.nih.gov/Traces/sra-reads-be/fastq?acc=SRRNNNNNN

    PRO TIP:
    In the past the working link was also this one:
    http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=dload&run_list=SRRNNNNNN&format=fastq

    You need to replace the SRRNNNNNN with the actual SRR number for it to work.

    You can either use wget command:

    wget "https://trace.ncbi.nlm.nih.gov/Traces/sra-reads-be/fastq?acc=SRR14567204"
    

    or curl command:

    curl -O "https://trace.ncbi.nlm.nih.gov/Traces/sra-reads-be/fastq?acc=SRR14567204"
    

    If you have a large list of IDs, you can loop over it using a while loop:

    while read -r line; do
      wget "https://trace.ncbi.nlm.nih.gov/Traces/sra-reads-be/fastq?acc=${line}"
    done < list_of_ids
    
    PRO TIP:
    The datasets can also be downloaded from DDBJ or EMBL using the FTP links, but the transfer speeds might be affected if you're not near their servers.
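
When looping over a long list, it can help to preview the download URLs first. The following sketch prints one URL per accession using the link pattern quoted above, and skips header or blank lines; accession_urls is a hypothetical helper name.

```shell
# Print one download URL per run accession found in a list file.
# Lines that are not run accessions (headers, blanks) are skipped.
# accession_urls is a hypothetical helper name.
accession_urls() (
  base="https://trace.ncbi.nlm.nih.gov/Traces/sra-reads-be/fastq?acc="
  tr -d '\r' < "$1" | grep -E '^[SED]RR[0-9]+$' | while read -r acc; do
    printf '%s%s\n' "$base" "$acc"
  done
)
```

After reviewing the output, the list can be handed to wget, which reads URLs from standard input with the -i - option: accession_urls list_of_ids | wget -i -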

    C. using Aspera Connect

    Aspera uses high-speed file transfer to rapidly transfer large files and data sets over an existing WAN infrastructure.

    To get the SRA files:

    
    prefetch --max-size 100G --transport ascp --ascp-path "/path/to/aspera/3.6.2/bin/ascp|/path/to/aspera/3.6.2/etc/asperaweb_id_dsa.openssh" SRRNNNNNN
    

    This usually prefetches the SRA file to a folder named ncbi in your home directory. If your home directory does not have enough space to store all the data, you may want to create the directory elsewhere and add a soft link to it in your home directory. To do this:

    cd ~
    # create the directory on the larger file system
    mkdir -p /project/storage/your_dir/ncbi
    # link it into the home directory as ~/ncbi
    ln -s /project/storage/your_dir/ncbi ~/ncbi
    

    When you run this, you will have a directory named ncbi in your home, but the data is actually stored in /project/storage/your_dir/ncbi

    
    ncbi
    └── public
        └── sra
            ├── SRR006189.sra
            └── SRR006190.sra
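
The linking step above fails if something named ncbi already exists in your home directory. A slightly more defensive sketch is shown below; link_storage is a hypothetical helper, and the paths in the usage example are placeholders.

```shell
# Create a storage directory and link to it, unless the link name is taken.
# The link defaults to $HOME/ncbi. link_storage is a hypothetical helper name.
link_storage() (
  target=$1
  link=${2:-$HOME/ncbi}
  mkdir -p "$target" || exit 1
  if [ -e "$link" ] || [ -L "$link" ]; then
    echo "already exists: $link" >&2
    exit 1
  fi
  ln -s "$target" "$link"
)
```

For example, link_storage /project/storage/your_dir/ncbi creates the directory and the ~/ncbi link in one step, and leaves an existing ~/ncbi untouched.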
    

    Then you can convert the SRA files back to fastq format using fastq-dump command.

    for sra_file in ~/ncbi/public/sra/*.sra; do
      fastq-dump --split-files --origfmt --gzip "${sra_file}"
    done
    



    Ready-made examples

    NCBI Sequence Read Archive (SRA) stores sequence and quality data (FASTQ files) in aligned or unaligned formats from next-generation sequencing platforms. A BioProject is a collection of biological data related to a single initiative, originating from a single organization or from a consortium. A BioProject record provides users a single place to find links to the diverse data types generated for that project. Often, a single BioProject holds a considerable number of experiments, and it gets tedious to download them all individually.

    Here is a guide showing how to do this in an efficient way:

    First, we need the esearch program from Entrez Direct. Navigate to the home directory and run one of the following commands (the first uses curl, the second uses wget; you only need one of them):

    sh -c "$(curl -fsSL https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh)"
    sh -c "$(wget -q https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh -O -)"
    

    This will download a number of scripts and several precompiled programs into an “edirect” folder in the user’s home directory. Once you add it to your $PATH, the esearch command will be available for you from any location on the file system.

    WARNING:
    If you are working on the HPC cluster, make sure to encapsulate the following commands in the script for workload manager (typically SLURM or PBS) and submit the job to the queue. Alternatively, you can start the interactive session on the computing node:
    • for SLURM use: salloc -N 1 -n 4 -t 01:00:00
    • for PBS use: qsub -I

    In the next step, load the modules that are needed:

    
    module load sratoolkit
    module load parallel
    

    To get the SRR numbers associated with the project:

    # keep only run accessions (drops the "Run" header and any blank lines)
    esearch -db sra -query PRJNA730495 | efetch -format runinfo | cut -d "," -f 1 | grep -E '^[SED]RR[0-9]+$' > SRR.numbers
    

    To download the corresponding SRA data all at once in parallel (limit the number to 4 concurrent downloads) use the following command:

    # '::::' tells parallel to read the arguments from the file SRR.numbers
    parallel --jobs 4 "fastq-dump --split-files --origfmt --gzip {}" :::: SRR.numbers
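
If GNU parallel is not available on your system, a similar effect can be sketched with xargs. The helper run_per_accession below is hypothetical; note that the -P option is a GNU/BSD extension rather than part of POSIX, so check that your xargs supports it.

```shell
# Run a command once per accession, with up to $jobs processes at a time.
# Lines that are not run accessions (headers, blanks) are skipped.
# run_per_accession is a hypothetical helper name.
run_per_accession() (
  list=$1
  jobs=$2
  shift 2
  grep -E '^[SED]RR[0-9]+$' "$list" | xargs -P "$jobs" -n 1 "$@"
)
```

A call such as run_per_accession SRR.numbers 4 fastq-dump --split-files --origfmt --gzip appends each accession to the given command and keeps at most four downloads running concurrently.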
    

    Use ready-made script for the SLURM workload manager
    download_raw_samples.sh

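    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --cpus-per-task=8
    #SBATCH --time=01:00:00

    module load sratoolkit
    module load parallel

    project='PRJNA730495'

    # fetch run metadata for the project
    esearch -db sra -query "$project" | efetch -format runinfo > runinfo.csv
    # keep only run accessions (drops the "Run" header and any blank lines)
    cut -d "," -f 1 runinfo.csv | grep -E '^[SED]RR[0-9]+$' > SRR.numbers
    # -X 1000 limits each run to its first 1000 spots (quick test); remove it to download full runs
    parallel --jobs "${SLURM_CPUS_PER_TASK:-8}" fastq-dump --split-files --origfmt --gzip -X 1000 {} < SRR.numbers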
    

    Run in the command line:

    
    sbatch download_raw_samples.sh
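
Once the job has finished, it is worth checking that every run produced output. The sketch below lists accessions without a non-empty first-read file; it assumes the naming used by fastq-dump with --split-files and --gzip (for example SRR14567204_1.fastq.gz), and missing_fastq is a hypothetical helper name.

```shell
# List accessions that have no non-empty <run>_1.fastq.gz file in a directory.
# Assumes output naming from fastq-dump --split-files --gzip.
# missing_fastq is a hypothetical helper name.
missing_fastq() (
  list=$1
  dir=${2:-.}
  grep -E '^[SED]RR[0-9]+$' "$list" | while read -r acc; do
    [ -s "$dir/${acc}_1.fastq.gz" ] || echo "$acc"
  done
)
```

Running missing_fastq SRR.numbers in the download directory prints nothing when all runs are present; any accession it prints can be downloaded again.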
    
