Positive, Neutral, Negative Selection with Codeml using Multiple Genome Annotations
Have you ever wanted to know which genes across a set of genomes are undergoing positive (diversifying), neutral, or negative (purifying) selection? Here is a tutorial that utilizes multiple genomes and gene predictions to assess this phenomenon.
I modeled this analysis after a previously published work that demonstrated positive selection among E. coli genes –>Original Research
Prerequisite software
All of these were already installed on my HPC.
1
2
3
4
5
6
7
cufflinks/2.2.1
cdbfasta/2017-03-16
orthofinder/2.5.2
diamond/2.0.4
clustalo/1.2.4
pal2nal.v14
paml/4.9h
Download your datasets – genomes and annotations
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
20_TestDataset/
GCF_000007405.1 -- shigella flexneri
GCF_001721125.1 -- E. coli -- Food-borne Pathogen Omics Research Center, FORC
GCF_003697165.2 -- E. coli -- University of Arkansas for Medical Sciences
GCF_013357365.1 -- E. coli -- USDA-ARS, Western Regional Research Center
GCF_000005845.2 -- E. coli -- Univ of Wisconsin
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/697/165/GCF_003697165.2_ASM369716v2/GCF_003697165.2_ASM369716v2_genomic.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/697/165/GCF_003697165.2_ASM369716v2/GCF_003697165.2_ASM369716v2_genomic.gff.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/007/405/GCF_000007405.1_ASM740v1/GCF_000007405.1_ASM740v1_genomic.gff.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/007/405/GCF_000007405.1_ASM740v1/GCF_000007405.1_ASM740v1_genomic.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/721/125/GCF_001721125.1_ASM172112v1/GCF_001721125.1_ASM172112v1_genomic.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/721/125/GCF_001721125.1_ASM172112v1/GCF_001721125.1_ASM172112v1_genomic.gff.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/013/357/365/GCF_013357365.1_ASM1335736v1/GCF_013357365.1_ASM1335736v1_genomic.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/013/357/365/GCF_013357365.1_ASM1335736v1/GCF_013357365.1_ASM1335736v1_genomic.gff.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_protein.faa.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_translated_cds.faa.gz
Extract proteins and transcripts from genome and gff
I always create my own transcripts and proteins from the gff, as in rare occurrences proteins and transcripts do not concur.
1
2
3
4
5
6
7
8
9
20_TestDataset/
module load cufflinks
gffread -g GCF_000007405.1_ASM740v1_genomic.fna GCF_000007405.1_ASM740v1_genomic.gff -x GCF_000007405.1_Transcripts.fasta -y GCF_000007405.1_Proteins.fasta
gffread -g GCF_001721125.1_ASM172112v1_genomic.fna GCF_001721125.1_ASM172112v1_genomic.gff -x GCF_001721125.1_Transcripts.fasta -y GCF_001721125.1_Proteins.fasta
gffread -g GCF_003697165.2_ASM369716v2_genomic.fna GCF_003697165.2_ASM369716v2_genomic.gff -y GCF_003697165.2_Proteins.fasta -x GCF_003697165.2_Transcripts.fasta
gffread -g GCF_013357365.1_ASM1335736v1_genomic.fna GCF_013357365.1_ASM1335736v1_genomic.gff -y GCF_013357365.1_Proteins.fasta -x GCF_013357365.1_Transcripts.fasta
gffread -g GCF_000005845.2_ASM584v2_genomic.fna GCF_000005845.2_ASM584v2_genomic.gff -x GCF_000005845.2_Transcripts.fasta -y GCF_000005845.2_Proteins.fasta
Format the fasta files so they are easier to use
You need to make sure that the proteins from each species have a unique name. i.e. species one and species two cannot have a protein named g123456.1.
1
2
3
20_TestDataset/
mkdir 01_CleanProteins
This removes periods and everything outside of the first column of the fasta header.
1
for f in *Proteins.fasta; do sed '/^[^>]/s/\.//g' $f |awk '{print $1}' >01_CleanProteins/Clean${f};done
Run orthofinder to classify orthologous groups
If you have a really large dataset, you can set orthofinder to use pre-computed BLAST/Diamond results
1
2
20_TestDataset/
ml diamond;ml orthofinder; orthofinder -a 36 -t 36 -f 01_CleanProteins
Extract gene family proteins and transcripts
1
20_TestDataset/
Make a giant fasta file containing all proteins from every species
1
2
cat 01_CleanProteins/*fasta >AllCleanProteins.fasta
ml cdbfasta; cdbfasta AllCleanProteins.fasta
Make a giant fasta file containing all transcripts from every species
1
2
for f in *_Transcripts.fasta; do awk '{print $1}' $f >Clean${f};done
ml cdbfasta; cdbfasta AllCleanTranscripts.fasta
Extract proteins and transcripts by looping through the tree files and formatting to get gene names. This can be painful at times, as orthofinder does add the fasta file name to the gene names.
1
for f in 01_CleanProteins/Results_Feb09/Orthologues_Feb09/Gene_Trees/*tree.txt; do sed 's/(//g' $f |sed 's/,/\n/g' |sed 's/:.*//g' |sort|uniq|sed 's/Proteins_/\t/1'|cut -f 2 |cdbyank AllCleanProteins.fasta.cidx >${f}_Proteins.fasta;done
Do the same for the transcripts, thus it is much easier if your protein and transcript names match
1
for f in 01_CleanProteins/Results_Feb09/Orthologues_Feb09/Gene_Trees/*; do sed 's/(//g' $f |sed 's/,/\n/g' |sed 's/:.*//g' |sort|uniq|sed 's/Proteins_/\t/1'|cut -f 2 |cdbyank AllCleanTranscripts.fasta.cidx >${f}_Transcripts.fasta;done
Generate protein alignments with clustalo
1
2
3
4
20_TestDataset/01_CleanProteins/Results_Feb09/Orthologues_Feb09/Gene_Trees/
ml clustalo/1.2.4-iy2mf6y
for f in *_Proteins.fasta; do echo "ml clustalo;clustalo -i "$f" -o "${f%.*}"alignment";done >generateAlignment.sh
Run pal2nal – convert alignments into paml format
The suggestion is to use pal2nal with “-nogap”, though codeml has a way to deal with these. However, if you do not have sequence in your pal2nal output, likely you had gaps along the entire sequence across all of the species analyzed.
1
2
3
4
5
6
20_TestDataset/01_CleanProteins/Results_Feb09/Orthologues_Feb09/Gene_Trees/
wget http://www.bork.embl.de/pal2nal/distribution/pal2nal.v14.tar.gz
tar -zxvf pal2nal.v14.tar.gz
for f in *alignment; do echo "pal2nal.v14/pal2nal.pl "$f" "${f%_*}"_Transcripts.fasta -output paml -nogap >"${f%_*}"_pal2nal" ; done >pal2nal.sh
Understanding Codeml parameters
There are lots of options that can be tweaked to test for positive selection. There are many more that I did not mention, and I only mention the ones for which I have a rudimentary understanding.
- runmode = 0 means I am specifying user trees
- seqtype = 1 means I am using codon sequences
- CodonFreq = 0 means I am assuming equal codon frequencies (codon substitution model)
- model = 0 whether ω should be allowed to vary among branches in the tree
- NSsites = 0 1 2 7 8 whether ω should be allowed to vary among sites , several of these can be specified at once.
- icode = 0 is specifies that I want to use the universal genetic code
- fix_kappa = 0 – initial transition/transversion ratio value set to 0
- kappa = 0 – assumption starting value for expectation of transitions/transversions
- fix_omega = 1 – fixes omega to test for significance between current and foreground branch
- omega = 0.001 – not 100% sure here, but likely the initial value set for omega in current branch
- RateAncestor = 0 – ancestral sequence not constructed
- clock = 0 – assumption of a molecular clock is not allowed, rates are free to vary from branch to branch
- Small_Diff = 5e-08 – supposed to use multiple values to make sure your results are not sensitive to this number uses 1e-8 to 1e-5.
- Malpha = 0 – when substitution rates are assumed to vary from site to site, Malpha specifies whether one gamma distribution will be applied across all sites
- method = 0 – nucleotide substitution model
- ndata = 1 – number of separate datasets
Prepare Codeml control files and run Codeml
1
20_TestDataset/01_CleanProteins/Results_Feb09/Orthologues_Feb09/Gene_Trees/
Generate the control files appropriately named, but on one line
1
for f in OG*pal2nal; do echo "seqfile = ../"$f" \t treefile = ../"${f%_*}"\t outfile = "${f%.*}".out \t verbose = 1 \t runmode = 0 \t seqtype = 1 \t CodonFreq = 0 \t model = 0 \t NSsites = 0 1 2 7 8 \t icode = 0 \t RateAncestor = 0 \t clock = 0 \t \t method = 0 \t ndata = 1" > ${f%.*}"codeml.ctl";done
Move orthgroup named files to new folders, rename the control files to codeml.ctl, as codeml only works on codeml.ctl
1
for f in OG*ctl; do mkdir ${f}dir; mv $f ${f}dir/codeml.ctl; done
Transform the single line of tab separated values to a newline separated document
1
sed -i 's/\\t/\n/g' *dir/codeml.ctl
Run codeml – Creates an sh script that can be separated to submit to multiple nodes.
1
for f in *ctldir; do echo "ml paml/4.9h-m5kkb2e; cd "$f"; codeml; cd ../ ";done >IterateCodeml.sh
Which orthogroups are demonstrating positive selection
To avoid digging into each folder, I extract the log-likelihood for each model tested for each orthgroup.
- ntime is the number of branch lengths
- np is the number of parameters
- The third value is your log likelihood value
1
20_TestDataset/01_CleanProteins/Results_Feb09/Orthologues_Feb09/Gene_Trees/
This prints the file name and grabs the values we want (np and Log Liklihood)
1
grep --with-filename "lnL" OG*_treecodeml.ctldir/*out |awk '{print $1,$4,$5}' |sed 's/)://g' |tr " " "\t" |awk '{ if (NR % 5 == 0) {print $0"#"} else {print}}' |tr "\n" "\t" |sed 's/#/\n/g' |awk '{print $1,$2,$3,$5,$6,$8,$9,$11,$12,$14,$15}' |less
Example output
1
2
3
OG0000004_treecodeml.ctldir/OG0000004_tree.out:lnL(ntime: 61 -235.745781 62 -235.745804 64 -235.745781 62 -235.753920 64 -235.753943
OG0000008_treecodeml.ctldir/OG0000008_tree.out:lnL(ntime: 55 -918.513428 56 -917.211691 58 -917.211691 56 -916.214085 58 -916.214672
OG0000009_treecodeml.ctldir/OG0000009_tree.out:lnL(ntime: 55 -740.121154 56 -719.787244 58 -719.530605 56 -720.923654 58 -719.128922
Create separate files for each test, as the number of parameters varys among and between models.
1
20_TestDataset/01_CleanProteins/Results_Feb09/Orthologues_Feb09/Gene_Trees/
M0 vs M1 – Support for neutral selection can be identified if M1 provides a better fit than M0
1
grep --with-filename "lnL" OG*_treecodeml.ctldir/*out |awk '{print $1,$4,$5}' |sed 's/)://g' |tr " " "\t" |awk '{ if (NR % 5 == 0) {print $0"#"} else {print}}' |tr "\n" "\t" |sed 's/#/\n/g' |awk '{print $1,$2,$3,$5,$6,$8,$9,$11,$12,$14,$15}' |awk '{print $1,($4-$2),(2*($5+(-1*$3))) }' | tr " " "\t" >M0M1ChiSquares.tab
M1 vs M2 – Support for positive selection can be identified if M2 provides a better fit than M1
1
grep --with-filename "lnL" OG*_treecodeml.ctldir/*out |awk '{print $1,$4,$5}' |sed 's/)://g' |tr " " "\t" |awk '{ if (NR % 5 == 0) {print $0"#"} else {print}}' |tr "\n" "\t" |sed 's/#/\n/g' |awk '{print $1,$2,$3,$5,$6,$8,$9,$11,$12,$14,$15}' |awk '{if ($6>$4) {print $1,($6-$4),(2*($7+(-1*$5)))}else {print "FLIP"$1,($4-$6),(2*($5+(-1*$7)))} }' | tr " " "\t" >M1M2ChiSquares.tab
M7 vs M8 –The M8–M7 comparison offers a more stringent test for positive selection
1
grep --with-filename "lnL" OG*_treecodeml.ctldir/*out |awk '{print $1,$4,$5}' |sed 's/)://g' |tr " " "\t" |awk '{ if (NR % 5 == 0) {print $0"#"} else {print}}' |tr "\n" "\t" |sed 's/#/\n/g' |awk '{print $1,$2,$3,$5,$6,$8,$9,$11,$12,$14,$15}' |awk '{if($10>$8){print $1, ($10-$8), (2*($11+(-1*$9)))} else {print $1,($8-$10), (2*($9+(-1*$11)))} }' | tr " " "\t" >M7M8ChiSquares.tab
Create a chi-squared test database of critical values for significance
We need the significance values for the chi square tests, so with the paml module loaded type: chi2 Then we select the degres of freedom column with the 0.0500 header name and copy it to a chi2lists file (tab separated here).
chi2lists
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
1 3.8415
2 5.9915
3 7.8147
4 9.4877
5 11.0705
6 12.5916
7 14.0671
8 15.5073
9 16.9190
10 18.3070
11 19.6751
12 21.0261
13 22.3620
14 23.6848
15 24.9958
Tests for neutral selection comparing models M0 and M1a
1
20_TestDataset/01_CleanProteins/Results_Feb09/Orthologues_Feb09/Gene_Trees/
Is model M1 better than null(M0)? In how many orthogroups? –neutral selection
1
2
3
4
less M0M1ChiSquares.tab |awk '{print $2}' |while read line; do grep -w "$line" chi2lists ;done |paste M0M1ChiSquares.tab - |awk '$3>$5' |wc
91 455 6998
less M0M1ChiSquares.tab |awk '{print $2}' |while read line; do grep -w "$line" chi2lists ;done |paste M0M1ChiSquares.tab - |awk '$3>$5' |sed 's/:/\t/g' |cut -f 1 |sed 's/FLIP//g' |while read line; do grep --with-filename -A 100 "Model 1" $line |grep "\*" - |grep -v "branch" - |grep -v "Positively" -;done > NeutralSelectedSitesM1.txt
Number of orthogroups with significant log ratio threshold and Naive Empirical Bayes (NEB) analysis
1
2
less NeutralSelectedSitesM1.txt |awk '{print $1}' |sort|uniq|wc
89 89 4272
Positive selection tests comparing models M1a and M2a
1
20_TestDataset/01_CleanProteins/Results_Feb09/Orthologues_Feb09/Gene_Trees/
How many orthogroups had significant chi squared tests for log likelihood?
1
2
3
4
less M1M2ChiSquares.tab |awk '{print $2}' |while read line; do grep -w "$line" chi2lists ;done |paste M1M2ChiSquares.tab - |awk '$3>$5' |wc
105 525 8500
less M1M2ChiSquares.tab |awk '{print $2}' |while read line; do grep -w "$line" chi2lists ;done |paste M1M2ChiSquares.tab - |awk '$3>$5' |sed 's/:/\t/g' |cut -f 1 |sed 's/FLIP//g' |while read line; do grep --with-filename -A 100 "Model 2" $line |grep "\*" - |grep -v "branch" - |grep -v "Positively" -;done > PositivelySelectedSitesM2.txt
Number of orthogroups with significant log ratio threshold and Naive Empirical Bayes (NEB) analysis
1
2
less PositivelySelectedSitesM2.txt |awk '{print $1}' |sort |uniq|wc
101 101 4848
Number of positively selected sites among the 101 orthogroups
1
2
wc PositivelySelectedSitesM2.txt
1133 6089 94230 PositivelySelectedSitesM2.txt
Positive selection tests comparing models M7 and M8 – a more stringent site model
1
20_TestDataset/01_CleanProteins/Results_Feb09/Orthologues_Feb09/Gene_Trees/
How many orthogroups had significant chi squared tests for log likelihood?
1
2
less M7M8ChiSquares.tab |awk '{print $2}' |while read line; do grep -w "$line" chi2lists ;done |paste M7M8ChiSquares.tab - |awk '$3>$5' |sed 's/:/\t/g' |cut -f 1 |wc
1825 1825 85775
Positively selected sites – only 106 comparisons had sites with significance
1
less M7M8ChiSquares.tab |awk '{print $2}' |while read line; do grep -w "$line" chi2lists ;done |paste M7M8ChiSquares.tab - |awk '$3>$5' |sed 's/:/\t/g' |cut -f 1 |while read line; do grep --with-filename -A 500 "Model 8" $line |grep "\*" - |grep -v "branch" - |grep -v "Positively" - ;done >PositivelySelectedSitesM8.txt
Number of orthgroups with a significant log ratio threshold and Naive Emperical Bayes (NEB) analysis
1
2
less PositivelySelectedSitesM8.txt |awk '{print $1}' |sort|uniq|wc
301 301 14448
Number of positively selected sites among the 106 orthogroups
1
2
3
wc PositivelySelectedSitesM8.txt
5245 26637 431185 PositivelySelectedSitesM8.txt
How many of the M0 models show positive selection across the entire orthogroup
Some of these values are ~999.000, which usually means your Ds was 0, so they can be tossed.
1
2
20_TestDataset/01_CleanProteins/Results_Feb09/Orthologues_Feb09/Gene_Trees/
grep "omega" *dir/*out |awk '$4>1 && $4<998' >PositivelySelectedOrthogroupsM0.txt
How many orthogroups have positive selection under the M0 model – single dN/dS value
1
2
wc PositivelySelectedOrthogroupsM0.txt
81 324 5837 PositivelySelectedOrthogroupsM0.txt
Results compared with previous a previous study
Note that this is not an exact comparison, as I used different genome’s from the same species. Here I compare my results to the published results demonstrating genes under positive selection in Escherichia coli. Table of previous study
Create a text list gene names from the linked table above.
previouslypublished.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
fluA
eaeH
nmpC
ubiF
ompF
ompA
ycgV
yciD
rzpR
tra8_2
yddk
pqqL
nohA
purR
yeeU
yeeV
mdtC
ompC
insA_6
rfaC
wecD
lamB
tra8_3
Now lets extract the gene names from the positively selected orthogroups – including M0
Grabs all the orthogroup names, and uses them to concatenate all the orthogroup trees
1
2
cat PositivelySelected* |sed 's/_/\t/g' |awk '{print $1}' |sort |uniq |sed 's/$/_tree.txt/g' |tr "\n" " " |sed 's/^/cat /g' |sed 's/$/ >AllPositiveTress.txt/g' >concatenatePositiveSelectionTrees.sh
sh concatenatePositiveSelectionTrees.sh
Need to concatenate all of the gff’s, as some genes in the orthogroups may not have functional annotations
1
cat ../../../../*gff >../../../../Allgffs.gff
This grabs the functional gene names from the concatenated gffs.
1
less AllPositiveTress.txt |sed 's/(//g' $f |sed 's/,/\n/g' |sed 's/:.*//g' |sort|uniq|sed 's/Proteins_/\t/1'|cut -f 2 |while read line; do grep $line ../../../../Allgffs.gff ;done |awk '$3=="gene"' |sed 's/Name=/\t/g' |sed 's/;/\t/2' |cut -f 10 |sort |uniq >E.coliGenesUnderPositiveSelection.txt
Using grep to adjust for differences in capitalization
1
2
grep -f E.coliGenesUnderPositiveSelection.txt previouslypublished.txt |wc
23 23 121
All 23 positively selected genes in the previously published table were confirmed here. An addition 180 orthogroups also found to demonstrate positive selection in this analysis.
Now lets extract the gene names from the positively selected orthogroups – Excluding M0
Grabs all the orthogroup names, and uses them to concatenate all the orthogroup trees
1
2
cat PositivelySelectedSitesM2.txt PositivelySelectedSitesM8.txt |sed 's/_/\t/g' |awk '{print $1}' |sort |uniq |sed 's/$/_tree.txt/g' |tr "\n" " " |sed 's/^/cat /g' |sed 's/$/ >AllSignificantPositiveTrees.txt/g' >concatenateSignificantPositiveSelectionTrees.sh
sh concatenateSignificantPositiveSelectionTrees.sh
This grabs the functional gene names from the concatenated gffs.
1
less AllSignificantPositiveTrees.txt |sed 's/(//g' $f |sed 's/,/\n/g' |sed 's/:.*//g' |sort|uniq|sed 's/Proteins_/\t/1'|cut -f 2 |while read line; do grep $line ../../../../Allgffs.gff ;done |awk '$3=="gene"' |sed 's/Name=/\t/g' |sed 's/;/\t/2' |cut -f 10 |sort |uniq >E.coliGenesUnderSignificantPositiveSelection.txt
Using grep to adjust for differences in capitalization
1
2
grep -f E.coliGenesUnderSignificantPositiveSelection.txt previouslypublished.txt |wc
23 23 121
How many orthogroups had positive selection between the M1vsM2 and M7vsM8 tests?
1
2
3
cat PositivelySelectedSitesM2.txt PositivelySelectedSitesM8.txt |sed 's/_/\t/g' |awk '{print $1}' |sort |uniq|wc
365 365 3650
All 23 positively selected genes in the previously published table were confirmed here. An addition 342 orthogroups also found to demonstrate positive selection with significance in this analysis.
Further interpretation
Of the 3,996 orthogroups surveyed, 89 orthogroups were confirmed to have neutral selection with model M1, 101 orthogroups had positive selection with model M2, 301 orthogroups had positive selection with model M8, and 81 were found to have positive selection across the entire orthogroup with model M0.
Next steps would be to survey your significant orthogroups to identify which areas of positive/neutral selection are significant to your project. Do these positively selected proteins play a role in some adaptive function? Do they correspond to protein-protein binding domains, protein-DNA binding domains, protein-RNA binding domains? Are these proteins anomalous annotation errors or do they have a functional definition? There are lots of things to explore, your mind is the only limit.
References
- The PAML manual was tremendously helpful in increasing my understanding –PAML manual
- An earlier version of the paml manual that has some useful descriptions of input files –Earlier PAML manual
- I endlessly queried biostars to help improve my understanding – Biostars
- This tutorial for pairwise comparisons is what got me running initially – Pairwise codeml tutorial
-
Here is another set of tutorials that was helpful in understanding the value to the statistical method – Worked examples of CODEML
-
This paper was crucial in my understanding of the log ratio tests EasyCodeml Research Publication
- A codeml tutorial that helps in the understanding the output of codeml and a detailed explanation of each model –Codeml tutorials
Disclaimer
I am a biologist with bioinformatics expertise, not a statistician. I do not consider myself an expert in running or interpreting codeml. That being said, I spent lots of time reading to understand this material and feel it is accurate.