SNP calling for GBS data using Stacks pipeline

Introduction

Genotyping-by-sequencing (GBS) is a technique for rapidly detecting nucleotide variation across the whole genome. Because it sequences only a reduced representation of the genome, it can genotype large numbers of individuals simultaneously at very low cost. It works with or without a reference genome, and readily generates a wealth of useful population-level information for the genotyped individuals.

Dataset

In this tutorial, we will use the NAM population GBS data generated by Rodgers-Melnick et al. This GBS data was collected from 5,000 RIL individuals belonging to the NAM population (25 diverse maize lines crossed to B73). Originally, SNPs were called against the B73 V3 genome and the results were used for various purposes. Since a more recent version of the genome is now available, we can replicate the same analyses using a different tool (Stacks) and an updated genome. A follow-up tutorial will cover calling SNPs with TASSEL.

The dataset for this GBS experiment is available on NCBI as SRP009896, with 91 SRRs associated with it. The metadata for these SRRs is available on the CyVerse Data Commons, at this location:

/iplant/home/shared/panzea/genotypes/GBS/v27/

Specifically, you will need the AllZeaGBSv2.7_publicSamples_metadata20140411.xlsx file present in that directory. This file is essential not only for demultiplexing the SRR data you download, but also, if you are interested in downstream analyses, for its sample pedigree information.
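
The CyVerse Data Commons serves this path over iRODS, so one way to fetch the file is with the iRODS icommands. This is a minimal sketch; the anonymous-access connection details are assumptions, so verify them against the current CyVerse documentation:

# set up an iRODS session first with iinit (host, port, zone, and anonymous
# access are assumptions; check CyVerse's docs for current values)
iget /iplant/home/shared/panzea/genotypes/GBS/v27/AllZeaGBSv2.7_publicSamples_metadata20140411.xlsx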

Organization

The data is organized as follows:

stacks/
├── 1-data
├── 2-genome
└── 3-stacks
    ├── a-barcodes
    ├── b-demux
    ├── c-alignments
    ├── d-gstacks
    └── e-population
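
If you are setting this up from scratch, bash brace expansion creates the whole tree in one command:

mkdir -p stacks/{1-data,2-genome,3-stacks/{a-barcodes,b-demux,c-alignments,d-gstacks,e-population}}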

Programs required

  1. Stacks
  2. NCBI-edirect
  3. enaBrowserTools
  4. BWA
  5. samtools
  6. Picard
  7. VCFTools

Overview

Fig 1: Overview of this tutorial (workflow diagram)

Organize datasets

Fastq

First, download the 91 SRR datasets from the SRA database. For this, use the NCBI-edirect package to obtain the list of SRRs and enaBrowserTools to download them.

cd stacks/1-data
# pull the run accessions (SRRs) for project SRP009896
esearch -db sra -query SRP009896  |\
  efetch -format runinfo |\
  cut -d , -f 1 |\
  awk 'NF>0' |\
  grep -v Run > srr.ids
# download each run from ENA in fastq format
while read line; do
  enaDataGet -f fastq -a $line;
done<srr.ids

This will generate 91 folders, each with a single compressed fastq file.
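
A quick way to confirm that everything arrived:

# count the downloaded files; this should report 91
find . -name "*fastq.gz" | wc -l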

Barcodes

From the downloaded spreadsheet, you will need to extract the required information (flowcell, lane, barcode, and sample name) and generate a simple text file. You can use the SRR id as the flowcell to make the analyses simpler. Since samples with the same sample name appear in multiple SRR files, we need to process each SRR separately. The combined, extracted, and cleaned barcode file is provided here in its final form for reference. If you have your own data, it is essential that you create the barcode file in a similar format (a hypothetical excerpt is shown below). This file is placed in the 3-stacks directory, and it will be split into multiple files based on SRR ids.
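
For orientation, process_radtags ultimately needs a barcode and a sample name per line; the combined file here additionally carries the SRR id so it can be split per run. A purely hypothetical excerpt (the barcodes and sample names below are made up, and the column order may differ in the real file):

SRR391089	CTCG	Z002E0001
SRR391089	TGCA	Z002E0002
SRR391090	ACTA	Z002E0145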

cd 3-stacks
# split the combined barcode file into one barcode file per SRR
# (double quotes so that $line expands; single quotes would match literally)
while read line; do
  grep -Fw "$line" barcodes-gbs-data.tsv > a-barcodes/${line}-barcodes.txt;
done<../1-data/srr.ids

Genome

Download B73 version 4 from Gramene and build a BWA index:

cd 2-genome
wget ftp://ftp.gramene.org/pub/gramene/CURRENT_RELEASE/fasta/zea_mays/dna/Zea_mays.B73_RefGen_v4.dna.toplevel.fa.gz
gunzip Zea_mays.B73_RefGen_v4.dna.toplevel.fa.gz
module load bwa
bwa index Zea_mays.B73_RefGen_v4.dna.toplevel.fa
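
Indexing a maize-sized genome takes a while; once it finishes, BWA writes five companion index files next to the fasta, which you can sanity-check with:

ls Zea_mays.B73_RefGen_v4.dna.toplevel.fa.{amb,ann,bwt,pac,sa}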

Demultiplex

Now, we are ready to start the Stacks pipeline. We will first start by demultiplexing.

module load gcc/7.3.0-xegsmw4
module load stacks
# one process_radtags command per SRR; apeKI is the restriction enzyme used
# for this GBS library, and the output directory must exist beforehand
for srr in ../1-data/SRR*; do
  ID=$(basename $srr);
  mkdir -p ./b-demux/${ID};
  echo process_radtags -p ../1-data/${ID}/ -o ./b-demux/${ID}/ -b ./a-barcodes/${ID}-barcodes.txt -e apeKI -r -c -q;
done > process.cmds

Make Slurm scripts with makeSLURMp.py, increase the parallelism, and submit:

makeSLURMp.py 350 process.cmds
for sub in process*.sub; do
  sed -i 's/parallel -j 1/parallel -j 20/g' $sub;
  sbatch $sub
done
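
makeSLURMp.py wraps batches of commands into Slurm submission scripts that run them through GNU parallel (hence the sed adjusting the -j value). If you are not on a Slurm cluster, an equivalent minimal alternative, assuming GNU parallel is installed, is simply:

parallel -j 20 < process.cmds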

Align reads to genome

We will use the BWA program for the alignments.

Create a run script, run-bwa.sh:

#!/bin/bash
# align a single demultiplexed fastq file and write the SAM to c-alignments/
module load bwa
input=$1
output=$(basename $input | sed 's/.fq.gz/.sam/g')
index=/path/to/stacks/2-genome/Zea_mays.B73_RefGen_v4.dna.toplevel.fa
bwa mem -t 4 ${index} ${input} > c-alignments/${output}
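
The commands generated in the next step call the script directly, so make it executable first:

chmod +x run-bwa.sh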

Generate commands and submit:

for fq in $(find /path/to/stacks/3-stacks/b-demux -name "*.gz"); do
  echo ./run-bwa.sh $fq;
done > bwa.cmds
makeSLURMp.py 350 bwa.cmds
for sub in bwa*.sub; do
  sed -i 's/parallel -j 1/parallel -j 9/g' $sub;
  sbatch $sub
done

Alignments will be written to c-alignments directory.
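
Optionally, spot-check one of the SAM files before moving on:

module load samtools
# mapping-rate summary for the first alignment file found
samtools flagstat c-alignments/$(ls c-alignments | head -n 1)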

Calling SNPs

For the GBS SNP-calling pipeline, you need coordinate-sorted BAM files with read groups added to the BAM header. We will do this using samtools and Picard.

Create the script runSort-and-addReadgroups.sh:

#!/bin/bash
# convert a SAM file to a coordinate-sorted BAM, add read groups, and index it
module load samtools
module load picard
SAM="$1"
REF="/path/to/stacks/2-genome/Zea_mays.B73_RefGen_v4.dna.toplevel.fa"
ulimit -c unlimited
samtools view --threads 36 -b -o ${SAM%.*}.bam ${SAM}
samtools sort -o ${SAM%.*}_sorted.bam -T ${SAM%.*}_temp ${SAM%.*}.bam
# example file looks like this:
# Z002E0081-SRR391089_sorted.bam
bam="${SAM%.*}_sorted.bam"
# read-group id = SRR accession; sample name = everything before the last "-"
RGID=$(basename $bam | rev | cut -f 1 -d "-" | rev | sed 's/_sorted.bam//g')
RGSM=$(basename $bam | rev | cut -f 2- -d "-" | rev )
RGLB="${RGSM}-L001"
RGPU=001
echo -e "$RGID\t$RGSM\t$RGLB\t$RGPU"
java -Djava.io.tmpdir=$TMPDIR -Xmx50G -jar $PICARD_HOME/picard.jar AddOrReplaceReadGroups \
      I=${bam} \
      O=${bam%.*}_new.bam \
      RGID=$RGID \
      RGLB=$RGLB \
      RGPL=ILLUMINA \
      RGPU=$RGPU \
      RGSM=$RGSM
samtools index ${bam%.*}_new.bam
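
You can confirm that the read group landed in the header (using the example file name from the script's comment):

samtools view -H Z002E0081-SRR391089_sorted_new.bam | grep '^@RG'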

Generate commands and submit:

for sam in *.sam; do
  echo "./runSort-and-addReadgroups.sh $sam";
done > sort.cmds
makeSLURMp.py 350 sort.cmds
for sub in sort*.sub; do
  sed -i 's/parallel -j 1/parallel -j 9/g' $sub;
  sbatch $sub
done

Next, we run gstacks. This has to be run on all BAM files together, so the generated command will be really long; we will assemble it from a head, a middle, and a tail:

# one -B option per BAM; the doubled backslash prints a literal "\" so each
# line ends with a continuation character
for bam in *_new.bam; do
  echo "   -B $bam \\";
done > middle

The head file (head):

#!/bin/bash
ml purge
ml gcc/7.3.0-xegsmw4
ml stacks
gstacks \

The tail file (tail):

   -O /path/to/stacks/3-stacks/d-gstacks \
   -t 36

Combine them and submit:

# build the full gstacks command (overwriting any previous version)
cat head middle tail > gstacks.cmds
makeSLURMs.py 100000 gstacks.cmds
sbatch gstacks_0.sub

Finally, run the populations program.

The script, run-populations.sh:

#!/bin/bash
ml purge
ml gcc/7.3.0-xegsmw4
ml stacks
input="/path/to/stacks/3-stacks/d-gstacks"
output="/path/to/stacks/3-stacks/e-population"
populations \
   -P ${input} \
   -O ${output} \
   -t 36 \
   --batch-size 10 \
   --vcf \
   --phylip
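
Run this way, without a population map, populations treats all samples as a single population. If you want per-population statistics (for example, grouping RILs by NAM family), you can add -M popmap.tsv to the command above, where popmap.tsv is a tab-separated sample-to-population map. A hypothetical excerpt (sample names and groupings made up for illustration):

Z002E0001	B97
Z002E0002	B97
Z002E0145	CML52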

This will generate the final results in the e-population directory:

populations.fixed.phylip
populations.fixed.phylip.log
populations.haplotypes.tsv
populations.hapstats.tsv
populations.haps.vcf
populations.log
populations.log.distribs
populations.markers.tsv
populations.snps.vcf
populations.sumstats.tsv
populations.sumstats_summary.tsv
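
The populations.snps.vcf file is the per-SNP call set most downstream tools expect. As a quick sanity check, and to put VCFTools from the program list to use, you could summarize per-individual missingness; a minimal sketch:

cd e-population
module load vcftools
# writes snp_stats.imiss with one missing-data rate per individual
vcftools --vcf populations.snps.vcf --missing-indv --out snp_stats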