Tutorial for NCBI PGAP
The NCBI Prokaryotic Genome Annotation Pipeline (PGAP) is designed to annotate bacterial and archaeal genomes.
PGAP is distributed as a Docker image, and NCBI's Quickstart page has instructions for running it with Docker. I ran PGAP on Nova, where Singularity is available as a module. Singularity can run Docker containers. Load the module:
    module load singularity
Download the PGAP launcher script and make it executable:
    curl -OL https://github.com/ncbi/pgap/raw/prod/scripts/pgap.py
    chmod +x pgap.py
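Point PGAP at your work directory, then download the pipeline and its supporting data:
    export PGAP_INPUT_DIR=/PATH/TO/YOUR/work/directory
    ./pgap.py --update
Test the installation on the MG37 test genome:
    ./pgap.py -r -o mg37_results YOUR/PATH/TO/test_genomes/MG37/input.yaml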
This may take a while to finish, depending on the node and the memory available.
I ran into two main errors while setting up and running PGAP.
Out of memory
This happens when the node does not have enough memory. According to NCBI, PGAP needs at least 32 GB of memory, and more is better.
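Before starting a run, it can help to confirm how much memory the node reports. The following is a minimal sketch, assuming a Linux node that exposes /proc/meminfo; the function names mem_total_gib and has_enough_memory are hypothetical helpers, not part of PGAP. Note that under a scheduler the memory granted to your job may be lower than the node total, and a node sold as "32 GB" usually reports slightly less than 32 GiB.

```shell
# Hypothetical helper: print total memory in whole GiB,
# read from a meminfo-style file (default: /proc/meminfo).
mem_total_gib() {
    local meminfo="${1:-/proc/meminfo}"
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' "$meminfo"
}

# Hypothetical helper: succeed if total memory is at least the
# required number of GiB (default: 32, the figure quoted above).
has_enough_memory() {
    local required_gib="${1:-32}"
    local meminfo="${2:-/proc/meminfo}"
    local total
    total="$(mem_total_gib "$meminfo")"
    [ "${total:-0}" -ge "$required_gib" ]
}
```

For example, `has_enough_memory 32 || echo "node reports less than 32 GiB"` prints a warning on a small node.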
Out of disk space
By default PGAP installs into your home directory, which usually has a small disk quota on HPC systems, so the installation runs out of space. To avoid this, change the installation directory to your work directory (the PGAP_INPUT_DIR setting above).
Alternatively, create a .pgap directory in your work directory and symlink it into your home directory.
You might still run into this error on an HPC system if your home directory has not been set up.
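The symlink alternative described above can be sketched as a small shell function. This is only an illustration: link_pgap_dir is a hypothetical name, and the function deliberately refuses to touch an existing .pgap entry rather than guessing what to do with it.

```shell
# Hypothetical helper: keep the .pgap directory on a larger filesystem
# and expose it in the home directory through a symbolic link.
# Usage: link_pgap_dir /PATH/TO/YOUR/work/directory "$HOME"
link_pgap_dir() {
    local work_dir="$1"
    local home_dir="$2"
    local target="$work_dir/.pgap"
    local link="$home_dir/.pgap"

    # Create the real directory in the work area.
    mkdir -p "$target" || return 1

    # Refuse to overwrite an existing directory, file, or link.
    if [ -e "$link" ] || [ -L "$link" ]; then
        echo "refusing to replace existing $link" >&2
        return 1
    fi

    ln -s "$target" "$link"
}
```

If a .pgap directory already exists in your home directory, move its contents to the work directory first and remove it before calling the function.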
Three files are needed to run PGAP: an assembly FASTA file, a metadata YAML file, and an input YAML file. You can write the YAML files in a text editor. For details on the input files, check NCBI's input files page.
I used PGAP to annotate the genome of an unknown species in the genus Spirochaeta. Here is an example run:
- Metadata YAML file (data_submol.yaml):
    organism:
        genus_species: 'Spirochaeta'
Additional information can be added to this metadata file. Check NCBI’s input files page for an example.
- Input YAML file (Input.yaml):
    fasta:
        class: File
        location: /PATH/TO/GENOME/ASSEMBLY/FILE/Spirochaete.fasta
    submol:
        class: File
        location: /PATH/TO/data_submol.yaml
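If you annotate several genomes, the two YAML files above can be generated from the shell instead of being typed by hand. The following is a rough sketch, assuming the same minimal fields as in the example; write_pgap_inputs is a hypothetical helper name, and the function does not validate its arguments.

```shell
# Hypothetical helper: write the metadata and input YAML files
# for one assembly, using the file names from the example above.
# Usage: write_pgap_inputs GENUS_SPECIES FASTA_PATH OUT_DIR
write_pgap_inputs() {
    local genus_species="$1"
    local fasta="$2"
    local out_dir="$3"

    mkdir -p "$out_dir" || return 1

    # Metadata file: organism name only.
    cat > "$out_dir/data_submol.yaml" <<EOF
organism:
    genus_species: '$genus_species'
EOF

    # Input file: points PGAP at the assembly and the metadata file.
    cat > "$out_dir/Input.yaml" <<EOF
fasta:
    class: File
    location: $fasta
submol:
    class: File
    location: $out_dir/data_submol.yaml
EOF
}
```

For the run described here, the call would look like `write_pgap_inputs 'Spirochaeta' /PATH/TO/GENOME/ASSEMBLY/FILE/Spirochaete.fasta /PATH/TO`.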
Once the input files are ready, run PGAP with the following command. This is an example of a run without additional options; see the other options with ./pgap.py -h.
    ./pgap.py -r -o output_directory_name Input.yaml
This can take hours, depending on the size of the genome and the memory allocated. On an HPC system it is best to run it through Slurm, i.e., write a job script and submit it.
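A job script for the run above might look like the following sketch. The snippet writes the script to a file named pgap_job.sh (a hypothetical name). The resource values (8 CPUs, 64 GB, 24 hours) are placeholders I chose for illustration, not recommendations from NCBI or from Nova; adjust them, and add a partition or account line if your cluster requires one.

```shell
# Write a Slurm batch script for the PGAP run described above.
# All #SBATCH values are illustrative assumptions; tune them for your cluster.
cat > pgap_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=pgap
#SBATCH --nodes=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=24:00:00
#SBATCH --output=pgap_%j.log

# Same setup steps as in the interactive session.
module load singularity
export PGAP_INPUT_DIR=/PATH/TO/YOUR/work/directory

./pgap.py -r -o output_directory_name Input.yaml
EOF

# Submit with: sbatch pgap_job.sh
```

Slurm replaces %j in the log file name with the job ID, so each submission writes to its own log.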