GeneOverlap R package / Fisher’s exact testing
Identifying genes that are significantly associated with a feature or characteristic of a transcriptome/genome is a common bioinformatics test.
The GeneOverlap package offers a quick and easy way to accomplish this task.
Perhaps you want to see if a mutation has LTR retrotransposons
Grab relevant files
1
2
3
4
5
6
7
8
#get your genome's gff
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.3_TAIR10/GCF_000001735.3_TAIR10_genomic.gff.gz
#grab your differentially expressed gene list. (I took this from the RNA-seq tutorial on DESEQ)
less DEGArabidopsisSiva.csv
#get your LTR retrotransposon coordinates. (I took this from the LTR_finder tutorial)
ln -s ../08_LTRFinder/LTRfinder.bed
###Rename the scaffolds so that the DESEQ analysis genes will match the genes identified in LTRfinder analysis Since the gene overlap will essentially be found using bedtools, the scaffold names need to match.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
#the easiest and fastest way to do this is to remove all other columns except scaffold name, so sed isnt searching the entire gff
less GCF_000001735.3_TAIR10_genomic.gff |awk '$3=="gene" {print $1}' >GeneScaffNames
Then grab the unique scaffold names and renamed them with sed.
less GCF_000001735.3_TAIR10_genomic.gff |awk '$3=="gene" {print $1}' |sort|uniq|less
sed -i 's/NC_000932.1/Plastid/g' GeneScaffNames
sed -i 's/NC_001284.2/Mitochondria/g' GeneScaffNames
sed -i 's/NC_003070.9/Chr1/g' GeneScaffNames
sed -i 's/NC_003071.7/Chr2/g' GeneScaffNames
sed -i 's/NC_003074.8/Chr3/g' GeneScaffNames
sed -i 's/NC_003075.7/Chr4/g' GeneScaffNames
sed -i 's/NC_003076.8/Chr5/g' GeneScaffNames
paste GeneScaffNames <(awk '$3=="gene"' GCF_000001735.3_TAIR10_genomic.gff) |awk '{print $1,$3,$4,$5,$6,$7,$8,$9,$10}' |tr " " "\t" >MOD.gff
Create gene lists for GeneOverlap
Now we need to create some genes lists: genes in ltr retrotransposons, genes in the genome, and a set of genes that are differentially upregulated (say top 10 percentile of genes in log fold change).
1
2
3
4
5
6
7
8
9
10
11
bedtools intersect -wo -a <(sed 's/^/Chr/g' LTRfinder.bed) -b MOD.gff |awk '{print $13}' |sed 's/ID=//g' |sed 's/;/\t/g' |cut -f 1 |sort|uniq >LTRGene.list
#total count of all genes in the genome
less MOD.gff |awk '{print $9}' |sed 's/ID=//g' |sed 's/;/\t/g' |cut -f 1 |sort|uniq|wc -l >All.genes
#figure out the top 10 percentile
less DEGArabidopsisSiva.csv |wc
38159 38159 868807
#so find the top 3816 genes with the high log fold change
less DEGArabidopsisSiva.csv |sed 's/,/\t/g' |sort -k2,2nr |awk 'NR<3817' >Top10Percentile
#what the heck, lets find the lowest also
less DEGArabidopsisSiva.csv |sed 's/,/\t/g' |sort -k2,2nr |awk 'NR>34343' >Bottom10Percentile
Time to load it all into R
1
2
3
4
5
6
module load R
library(GeneOverlap)
LTRGene.list <- read.table("LTRGene.list")
Top10Percentile <- read.table("Top10Percentile")
Bottom10Percentile <- read.table("Bottom10Percentile")
All.genes <- read.table("All.genes")
Perform the comparison
1
2
3
4
go.obj <- newGeneOverlap(LTRGene.list$V1, Top10Percentile$V1, genome.size=All.genes)
go.obj1 <- testGeneOverlap(go.obj)
go.obj <- newGeneOverlap(LTRGene.list$V1, Bottom10Percentile$V1, genome.size=All.genes)
go.obj2 <- testGeneOverlap(go.obj)
Concatenate the output and push into a file.
1
2
3
4
5
6
7
AllCombine <- c(go.obj1, go.obj2)
library(RJSONIO)
exportJSON <- toJSON(AllCombine)
write(exportJSON,"LTRExpression.out")
#exit R
cat <(ls -1 Top10Percentile ) <(ls -1 Bottom10Percentile ) |paste - <(less LTRExpression.out|sed 's/\[/\t/g' |awk 'NF<15' |sed 's/{//g' |sed 's/}//g' |sed 's/]//g' |sed 's/"//g' |sed 's/,/\t/g' |sed '/^$/d' | tr "\n" "\t" |sed 's/true/true\n/g' |sed 's/false/false\n/g') |less
### Results
1
2
3
Top10Percentile notA: 29431 inA: 68 notA: 3805 inA: 11 odds.ratio: 1.2512 pval: 0.29241 Jaccard: 0.0028321 is.tested: true
Bottom10Percentile notA: 29427 inA: 72 notA: 3809 inA: 7 odds.ratio: 0.75111 pval: 0.81441 Jaccard: 0.0018004 is.tested: true
Well, it looks like there are 11 LTR retrotransposon associated genes that are upregulated in expression and 7 that are downregulated. As a whole, LTR retrotransposon gene expression does not seem to be associated with the changes that have occurred between the two arabidopsis lines analyzed here.