A pipeline for assessing methylation in a genome using nanopore sequencing
DNA methylation plays a role in chromatin compaction and gene expression. Finding which genes occupy methylated regions of the genome, may provide some insight into gene regulation. Here we will use nanopolish to identify methylation sites in a genome with nanopore reads.
Prerequisite software and data
- samtools – standard install on most HPCs
- minimap2 – a quick aligner for long reads
- nanopolish – I use a conda environment below to install.
- nanopore fast5 (raw) and fastq data
Install nanopolish
1
2
3
4
5
6
7
8
/work/gif/remkv6/USDA/01_NanoporeMethylation
git clone --recursive https://github.com/jts/nanopolish.git
cd nanopolish/
make
#python dependencies to use programs from the scripts folder
pip install -r scripts/requirements.txt --user
export PATH="/work/gif/remkv6/USDA/01_NanoporeMethylation/nanopolish:$PATH"
Create file structure requested by Nanopolish
The structure is to have two separate folders. One for your fastq reads and one for the corresponding fast5 data.
1
2
3
4
5
6
7
8
9
10
/work/gif/remkv6/USDA/01_NanoporeMethylation
mkdir ooze_1_1fast5 ;cd ooze_1_1fast5 ; for f in /work/gif/archiveNova/03_RevoltingBlobMethylationNanopore/1577/1577/1577/20201119_1838_X1_FAO51552_10ff60f0/fast5_pass/barcode03/* ; do ln -s $f ; done ;cd ../
mkdir ooze_1_1fastq ;cd ooze_1_1fastq ; for f in /work/gif/archiveNova/03_RevoltingBlobMethylationNanopore/1577/1577/1577/20201119_1838_X1_FAO51552_10ff60f0/fastq_pass/barcode03/* ; do ln -s $f ; done ;cd ../
mkdir ooze_1_2fast5 ;cd ooze_1_2fast5 ; for f in /work/gif/archiveNova/03_RevoltingBlobMethylationNanopore/1577/1577/1577_2/20201123_1600_X1_FAO68129_332eba1a/fast5_pass/barcode03/* ; do ln -s $f ; done ;cd ../
mkdir ooze_1_2fastq ;cd ooze_1_2fastq ; for f in /work/gif/archiveNova/03_RevoltingBlobMethylationNanopore/1577/1577/1577_2/20201123_1600_X1_FAO68129_332eba1a/fastq_pass/barcode03/* ; do ln -s $f ; done ;cd ../
mkdir ooze_2_1fast5 ;cd ooze_2_1fast5 ; for f in /work/gif/archiveNova/03_RevoltingBlobMethylationNanopore/1577/1577/1577/20201119_1838_X1_FAO51552_10ff60f0/fast5_pass/barcode04/* ; do ln -s $f ; done ;cd ../
mkdir ooze_2_1fastq ;cd ooze_2_1fastq ; for f in /work/gif/archiveNova/03_RevoltingBlobMethylationNanopore/1577/1577/1577/20201119_1838_X1_FAO51552_10ff60f0/fastq_pass/barcode04/* ; do ln -s $f ; done ;cd ../
mkdir ooze_2_2fast5 ;cd ooze_2_2fast5 ; for f in /work/gif/archiveNova/03_RevoltingBlobMethylationNanopore/1577/1577/1577_2/20201123_1600_X1_FAO68129_332eba1a/fast5_pass/barcode04/* ; do ln -s $f ; done ;cd ../
mkdir ooze_2_2fastq ;cd ooze_2_2fastq ; for f in /work/gif/archiveNova/03_RevoltingBlobMethylationNanopore/1577/1577/1577_2/20201123_1600_X1_FAO68129_332eba1a/fastq_pass/barcode04/* ; do ln -s $f ; done ;cd ../
Once the fast5 and fastq data is softlinked, you need to concatenate all of your fastq data for each sample
1
cd ooze_1_1fastq; cat *fastq >ooze1_1.fastq; cd ../ooze_1_2fastq; cat *fastq >ooze1_2.fastq ; cd ooze_2_1fastq; cat *fastq >ooze2_1.fastq; cd ../ooze_2_2fastq; cat *fastq >ooze2_2.fastq
Index with nanopolish
1
2
3
4
5
6
/work/gif/remkv6/USDA/01_NanoporeMethylation
ml miniconda3; source activate nanopolish
nanopolish index -d ooze_1_1fast5/ ooze_1_1fastq/ooze1_1.fastq &
nanopolish index -d ooze_1_2fast5/ ooze_1_2fastq/ooze1_2.fastq &
nanopolish index -d ooze_2_1fast5/ ooze_2_1fastq/ooze2_1.fastq &
nanopolish index -d ooze_2_2fast5/ ooze_2_2fastq/ooze2_2.fastq &
Align nanopore reads from each sample to your genome
My working directory
1
/work/gif/remkv6/USDA/01_NanoporeMethylation
Softlink your genome
1
ln -s /work/gif/remkv6/04_DovetailRevoltingBlobGenome/49_RenameChromosomes/01_Transfer2Box/RevoltingBlobgenome.fasta
Create the temp files needed to sort after alignment
1
for f in *fastq/*fastq; do mkdir ${f%.*}"tmp";done
Execute alignment of nanopore reads, sort and index
1
for f in *fastq; do echo "ml minimap2;ml samtools;minimap2 -a -x map-ont RevoltingBlobgenome.fasta "$f"| samtools sort -T "${f%.*}"tmp -o "${f%.*}".sorted.bam ;samtools index "${f%.*}".sorted.bam";done >align.sh
Alignment stats
If alignment is poor, you will not get a good assessment of the methylation changes in your sample. Mine was not great, I should tweak the parameters to get a better alignment.
1
2
3
4
/work/gif/remkv6/USDA/01_NanoporeMethylation
for f in *fastq/*bam; do samtools flagstat $f >>alignmentStats.txt;done
paste <(grep "total" alignmentStats.txt) <(grep "%" alignmentStats.txt) |less
Yes this is a poor alignment, but for demonstration purposes, I will continue. I’d prefer this % to be near 70 at least.
1
2
3
4
450024 + 0 in total (QC-passed reads + QC-failed reads) 214902 + 0 mapped (47.75% : N/A)
276447 + 0 in total (QC-passed reads + QC-failed reads) 126140 + 0 mapped (45.63% : N/A)
488875 + 0 in total (QC-passed reads + QC-failed reads) 267888 + 0 mapped (54.80% : N/A)
623642 + 0 in total (QC-passed reads + QC-failed reads) 326584 + 0 mapped (52.37% : N/A)
Assess different types of methylation for each sample
I create a script below that will assess all detectable types of methylation possible with nanopolish.
A list the script below uses.
1
2
3
4
5
vi types
###########
cpg
gpc
###########
Create scripts to create the run scripts for all different types of methylation.
1
less types |while read line; do echo "for f in *bam; do echo \"ml miniconda3; source activate nanopolish; export PATH=\\\"work/gif/remkv6/04_DovetailRevoltingBlobGenome/57_NanoporeMethylation/01_Alignment/nanopolish/:\\\$PATH\\\" ;nanopolish call-methylation -q "$line" -r \"\${f%%.*}\".fastq -b \"\$f\" -t 36 --progress -g RevoltingBlobgenome.fasta >\"\${f%..*}\"_methylation_"${line%.*}".tsv \";done >>AllMethylationScripts.sh" ;done |less
AllMethylationScripts.sh
Click to see content
for f in *fastq/*bam; do echo "ml miniconda3; source activate nanopolish; export PATH=\"work/gif/remkv6/04_DovetailRevoltingBlobGenome/57_NanoporeMethylation/01_Alignment/nanopolish/:\$PATH\" ;nanopolish call-methylation -q cpg -r "${f%%.*}".fastq -b "$f" -t 36 --progress -g RevoltingBlobgenome.fasta >"${f%..*}"_methylation_cpg.tsv ";done >>AllMethylationScripts.sh for f in *fastq/*bam; do echo "ml miniconda3; source activate nanopolish; export PATH=\"work/gif/remkv6/04_DovetailRevoltingBlobGenome/57_NanoporeMethylation/01_Alignment/nanopolish/:\$PATH\" ;nanopolish call-methylation -q gpc -r "${f%%.*}".fastq -b "$f" -t 36 --progress -g RevoltingBlobgenome.fasta >"${f%..*}"_methylation_gpc.tsv ";done >>AllMethylationScripts.sh ####################################################################################################################################################### The scripts above generate these. ################################################################################################################################################################### ml miniconda3; source activate nanopolish; export PATH="work/gif/remkv6/04_DovetailRevoltingBlobGenome/57_NanoporeMethylation/01_Alignment/nanopolish/:$PATH" ;nanopolish call-methylation -q cpg -r ooze_1_1fastq/ooze1_1.fastq -b ooze_1_1fastq/ooze1_1.sorted.bam -t 36 --progress -g RevoltingBlobgenome.fasta >ooze_1_1fastq/ooze1_1.sorted.bam_methylation_cpg.tsv ml miniconda3; source activate nanopolish; export PATH="work/gif/remkv6/04_DovetailRevoltingBlobGenome/57_NanoporeMethylation/01_Alignment/nanopolish/:$PATH" ;nanopolish call-methylation -q cpg -r ooze_1_2fastq/ooze1_2.fastq -b ooze_1_2fastq/ooze1_2.sorted.bam -t 36 --progress -g RevoltingBlobgenome.fasta >ooze_1_2fastq/ooze1_2.sorted.bam_methylation_cpg.tsv ml miniconda3; source activate nanopolish; export PATH="work/gif/remkv6/04_DovetailRevoltingBlobGenome/57_NanoporeMethylation/01_Alignment/nanopolish/:$PATH" ;nanopolish call-methylation -q cpg -r ooze_2_1fastq/ooze2_1.fastq -b ooze_2_1fastq/ooze2_1.sorted.bam -t 36 --progress -g RevoltingBlobgenome.fasta >ooze_2_1fastq/ooze2_1.sorted.bam_methylation_cpg.tsv ml miniconda3; source activate nanopolish; export PATH="work/gif/remkv6/04_DovetailRevoltingBlobGenome/57_NanoporeMethylation/01_Alignment/nanopolish/:$PATH" ;nanopolish call-methylation -q cpg -r ooze_2_2fastq/ooze2_2.fastq -b ooze_2_2fastq/ooze2_2.sorted.bam -t 36 --progress -g RevoltingBlobgenome.fasta >ooze_2_2fastq/ooze2_2.sorted.bam_methylation_cpg.tsv ml miniconda3; source activate nanopolish; export PATH="work/gif/remkv6/04_DovetailRevoltingBlobGenome/57_NanoporeMethylation/01_Alignment/nanopolish/:$PATH" ;nanopolish call-methylation -q gpc -r ooze_1_1fastq/ooze1_1.fastq -b ooze_1_1fastq/ooze1_1.sorted.bam -t 36 --progress -g RevoltingBlobgenome.fasta >ooze_1_1fastq/ooze1_1.sorted.bam_methylation_gpc.tsv ml miniconda3; source activate nanopolish; export PATH="work/gif/remkv6/04_DovetailRevoltingBlobGenome/57_NanoporeMethylation/01_Alignment/nanopolish/:$PATH" ;nanopolish call-methylation -q gpc -r ooze_1_2fastq/ooze1_2.fastq -b ooze_1_2fastq/ooze1_2.sorted.bam -t 36 --progress -g RevoltingBlobgenome.fasta >ooze_1_2fastq/ooze1_2.sorted.bam_methylation_gpc.tsv ml miniconda3; source activate nanopolish; export PATH="work/gif/remkv6/04_DovetailRevoltingBlobGenome/57_NanoporeMethylation/01_Alignment/nanopolish/:$PATH" ;nanopolish call-methylation -q gpc -r ooze_2_1fastq/ooze2_1.fastq -b ooze_2_1fastq/ooze2_1.sorted.bam -t 36 --progress -g RevoltingBlobgenome.fasta >ooze_2_1fastq/ooze2_1.sorted.bam_methylation_gpc.tsv ml miniconda3; source activate nanopolish; export PATH="work/gif/remkv6/04_DovetailRevoltingBlobGenome/57_NanoporeMethylation/01_Alignment/nanopolish/:$PATH" ;nanopolish call-methylation -q gpc -r ooze_2_2fastq/ooze2_2.fastq -b ooze_2_2fastq/ooze2_2.sorted.bam -t 36 --progress -g RevoltingBlobgenome.fasta >ooze_2_2fastq/ooze2_2.sorted.bam_methylation_gpc.tsv
Get the methylation frequencies
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
/work/gif/remkv6/USDA/01_NanoporeMethylation
for f in *fastq/*tsv;do ln -s $f;done
for f in *tsv; do /work/gif/remkv6/04_DovetailRevoltingBlobGenome/57_NanoporeMethylation/01_Alignment/nanopolish/scripts/calculate_methylation_frequency.py $f >${f%.*}_frequency.tsv; done
#some error in the script that this corrects
for f in *_frequency.tsv ; do sed -i 's/num_CG_in_group/num_motifs_in_group/g' $f;done
#makes part of the script below
paste <(ls -1 Ooze1*gpc_frequency.tsv |tr "\n" " " |sed 's/^/<(cat /g' |sed 's/$/)/g' ) <(ls -1 Ooze2*gpc_frequency.tsv |tr "\n" " " |sed 's/^/<(cat /g' |sed 's/$/)/g') |while read line; do echo "python3 /work/gif/remkv6/04_DovetailRevoltingBlobGenome/57_NanoporeMethylation/methylation_example/compare_methylation.py "$line;done |less
#had to manually add the awk command, as it was difficult to code in.
python3 /work/gif/remkv6/04_DovetailRevoltingBlobGenome/57_NanoporeMethylation/methylation_example/compare_methylation.py <(cat Ooze1_1.sorted.bam_methylation_gpc_frequency.tsv Ooze1_2.sorted.bam_methylation_gpc_frequency.tsv |awk '{if (NR==1) {print} else if(substr($1,1,1)=="C") {print}else {next}}') <(cat Ooze2_1.sorted.bam_methylation_gpc_frequency.tsv Ooze2_2.sorted.bam_methylation_gpc_frequency.tsv |awk '{if (NR==1) {print} else if(substr($1,1,1)=="C") {print}else {next}}') >gpcCompare.tsv
python3 /work/gif/remkv6/04_DovetailRevoltingBlobGenome/57_NanoporeMethylation/methylation_example/compare_methylation.py <(cat Ooze1_1.sorted.bam_methylation_cpg_frequency.tsv Ooze1_2.sorted.bam_methylation_cpg_frequency.tsv |awk '{if (NR==1) {print} else if(substr($1,1,1)=="C") {print}else {next}}') <(cat Ooze2_1.sorted.bam_methylation_cpg_frequency.tsv Ooze2_2.sorted.bam_methylation_cpg_frequency.tsv |awk '{if (NR==1) {print} else if(substr($1,1,1)=="C") {print}else {next}}') >cpgCompare.tsv
Identify promoter sequences upstream of genes
DNA methylation typically affects the expression of genes by methylating promoter sequences. The easiest way to get this is with your genome, is to use bedtools flank with your gene prediction gff to assess methylation 2kb upstream of gene (a proxied promoter).
Softlink my gene prediction/annotation
1
ln -s /work/gif/remkv6/04_DovetailRevoltingBlobGenome/49_RenameChromosomes/01_Transfer2Box/OrderedRevoltingBlobGenePredictions.gff3
Get the “contig_name length”
1
2
ln -s /work/gif/remkv6/04_DovetailRevoltingBlobGenome/49_RenameChromosomes/01_Transfer2Box/RevoltingBlobgenome.fasta
ml bioawk; bioawk -c fastx '{print $name,length($seq)}' RevoltingBlobgenome.fasta >chrom.sizes
Make promoter coordinates in a bed format that bedtools will accept.
1
ml bedtools2; bedtools flank -i <(awk '$3=="gene"' OrderedRevoltingBlobGenePredictions.gff3) -g chrom.sizes -l 2000 -r 0 -s > genes.2kb.promoters.bed
Compare methylation among samples with promoter sequences
This is some testing based on at least 5 aligned reads to a region, and distinguishes differences between samples based on columns 3 and 5, and assesses how many times a promoter is methylated
CPG methylation
1
2
less cpgCompare.tsv |awk 'NR>1 &&$2>5 && $4>5 && $3>$5' |sort -k3,3nr -k5,5n |sed 's/:/\t/1' |sed 's/-/\t/1' |tr "\t" " " |sed 's/ /\t/1' |sed 's/ /\t/1' |sed 's/ /\t/1' |bedtools intersect -wo -a - -b genes.2kb.promoters.bed |cut -f 13 |sed 's/ID=//g' |sed 's/;/\t/g' |cut -f 1 |sort|uniq -c |sort -k1,1nr >Sample_1CPG.tsv
less cpgCompare.tsv |awk 'NR>1 &&$2>5 && $4>5 && $3<$5' |sort -k3,3nr -k5,5n |sed 's/:/\t/1' |sed 's/-/\t/1' |tr "\t" " " |sed 's/ /\t/1' |sed 's/ /\t/1' |sed 's/ /\t/1' |bedtools intersect -wo -a - -b genes.2kb.promoters.bed |cut -f 13 |sed 's/ID=//g' |sed 's/;/\t/g' |cut -f 1 |sort|uniq -c |sort -k1,1nr >Sample_2CPG.tsv
Sample_1CPG.tsv Click to see content
14 BLOB19558 13 BLOB19559 13 BLOB19560 12 BLOB14134 12 BLOB19557 8 BLOB19569 6 BLOB06502 6 BLOB07680 6 BLOB19562 4 BLOB04612 4 BLOB14593 3 BLOB20119 3 BLOB21251 2 BLOB01050 2 BLOB01732 2 BLOB16387 2 BLOB16388 2 BLOB19309 2 BLOB19568 1 BLOB00857 1 BLOB02492 1 BLOB12086 1 BLOB14594 1 BLOB14601 1 BLOB16386 1 BLOB19561 1 BLOB20672 1 BLOB21356
Sample_2CPG.tsv Click to see content
26 BLOB19557 23 BLOB14134 21 BLOB19559 20 BLOB19560 19 BLOB19558 18 BLOB19562 14 BLOB06502 7 BLOB19569 6 BLOB20119 5 BLOB14594 5 BLOB21251 4 BLOB01050 4 BLOB14593 4 BLOB19568 3 BLOB00857 3 BLOB01732 3 BLOB16388 3 BLOB20672 2 BLOB07680 2 BLOB16386 2 BLOB16387 2 BLOB19561 1 BLOB04612 1 BLOB09973 1 BLOB09974 1 BLOB13689 1 BLOB14595 1 BLOB19309
GPC methylation
1
2
3
less gpcCompare.tsv |awk 'NR>1 &&$2>5 && $4>5 && $3<$5' |sort -k3,3nr -k5,5n |sed 's/:/\t/1' |sed 's/-/\t/1' |tr "\t" " " |sed 's/ /\t/1' |sed 's/ /\t/1' |sed 's/ /\t/1' |bedtools intersect -wo -a - -b genes.2kb.promoters.bed |cut -f 13 |sed 's/ID=//g' |sed 's/;/\t/g' |cut -f 1 |sort|uniq -c |sort -k1,1nr >Sample_1GPC.tsv
less gpcCompare.tsv |awk 'NR>1 &&$2>5 && $4>5 && $3>$5' |sort -k3,3nr -k5,5n |sed 's/:/\t/1' |sed 's/-/\t/1' |tr "\t" " " |sed 's/ /\t/1' |sed 's/ /\t/1' |sed 's/ /\t/1' |bedtools intersect -wo -a - -b genes.2kb.promoters.bed |cut -f 13 |sed 's/ID=//g' |sed 's/;/\t/g' |cut -f 1 |sort|uniq -c |sort -k1,1nr >Sample_2GPC.tsv
Sample_1GPC.tsv Click to see content
10 BLOB19569 7 BLOB19557 7 BLOB19560 7 BLOB19562 6 BLOB04612 6 BLOB19558 5 BLOB02447 5 BLOB14593 5 BLOB19559 5 BLOB20119 3 BLOB01050 3 BLOB01732 3 BLOB16386 3 BLOB19568 2 BLOB01731 2 BLOB09973 2 BLOB09974 2 BLOB14594 2 BLOB14595 2 BLOB16388 2 BLOB19309 2 BLOB21251 2 BLOB22228 1 BLOB00081 1 BLOB00082 1 BLOB06502 1 BLOB12086 1 BLOB16387 1 BLOB21356
Sample_2GPC.tsv Click to see content
27 BLOB19559 27 BLOB19560 26 BLOB19558 20 BLOB19557 18 BLOB19562 11 BLOB04612 10 BLOB20119 7 BLOB14593 7 BLOB16386 7 BLOB21251 6 BLOB16387 6 BLOB16388 5 BLOB01050 5 BLOB02447 5 BLOB06502 5 BLOB13689 5 BLOB14134 5 BLOB19569 3 BLOB01732 3 BLOB19309 3 BLOB19561 3 BLOB19568 2 BLOB09973 2 BLOB09974 2 BLOB14594 2 BLOB14595 2 BLOB20672 1 BLOB02492 1 BLOB02922 1 BLOB18593 1 BLOB18594 1 BLOB20967 1 BLOB22377 1 BLOB22378
References
- https://github.com/jts/nanopolish
- https://nanopolish.readthedocs.io/en/latest/