Tandem Duplication Detection in a genome
There many different ways to detect tandem duplications(TD) in a genome, although each is unique in the ability to detect TD. Here I will cover how to use a software that can identify more ancient tandem duplications in a genome, and one that can only identify recent and highly similar TD. Redtandem essentially chains smaller alignments together, so that more diverged and ancient TD can be identified. Mummer is essentially a self-alignment of the genome, that readily identifies recent TD, but misses ancient TD that have diverged in nucleotide similarity.
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-13-83
Obtain your genome file
1
2
module load GIF2/redtandem/ReDtandem
ln -s ../08_LTRFinder/TAIR10_chr_all.fas
REDTANDEM – find ancient duplications
Redtandem seems to frequently get hung up on small satellite sequences, so I mask my genome with repeatmodeler/repeatmasker first (see RepeatModeler_RepeatMasker tutorial).
Format the fasta file headers so they match the format necessary to run redtandem.pl
1
2
3
4
5
6
7
8
9
10
11
12
1. The FASTA header should be formatted as ">ath1_1-9639975" where
* ath : identifies the species considered
* 1 : is the chromosome number
* 1-9639975 : positions of the first and last nucleotide in the
sequence.
2. We also need to remove all lower-case letters from the sequence.
#Here is how I changed the TAIR10 fasta headers to the correct format
sed -e 's/a/A/g' -e 's/c/C/g' -e 's/g/G/g' -e 's/t/T/g' TAIR10_chr_all.fas.masked | bioawk -c fastx '{print ">ath"$name"_1-"length($seq)" \n"$seq}' TAIR10_chr_all.fas.masked |fold -79 >TAIR10Masked.reformatted.fasta
Now it is time to run redtandem. This took 33 minutes for this masked arabidopsis sequence
1
perl /shared/software/GIF/programs/redtandem/ReDtandem/ReDtandem.pl --species ath --dnafile TAIR10Masked.reformatted.fasta
Redtandem output
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
#chrom start end u_start u_end numDupli dupli
1 147972 156372 148332 149762 2 147972..150120,154224..156372,
1 428520 439980 437962 439554 3 428520..430966,434762..437208,437534..439980,
1 453020 458137 455693 458024 2 453020..455578,455579..458137,
1 494704 497486 494934 495847 2 494704..496076,496114..497486,
1 523538 535050 529802 534543 2 523538..529294,529294..535050,
1 540988 544488 541221 542507 2 540988..542738,542738..544488,
1 639609 641357 640662 640947 4 639609..639977,640103..640471,640620..640988,640989..641357,
1 657884 666072 660741 663549 3 657884..660502,660835..663453,663454..666072,
1 669140 673972 668973 671724 2 669140..671556,671556..673972,
1 779859 793763 780284 781977 2 779859..782401,791221..793763,
1 786444 791597 789191 791429 2 786444..789020,789021..791597,
1 1165235 1178464 1177142 1178199 2 1165235..1166823,1176876..1178464,
1 1248981 1289958 1287751 1289516 2 1248981..1251631,1287308..1289958,
1 1636170 1647433 1636511 1637868 2 1636170..1638208,1645395..1647433,
1 1689725 1696943 1690264 1692413 2 1689725..1692951,1693717..1696943,
1 1701014 1704107 1703183 1703922 2 1701014..1702124,1702997..1704107,
1 1701904 1704801 1703991 1704639 2 1701904..1702878,1703827..1704801,
1 1824178 1828590 1824530 1826034 2 1824178..1826384,1826384..1828590,
1 1843523 1858296 1847872 1849786 4 1843523..1846397,1847391..1850265,1850958..1853832,1855422..1858296,
1 1922068 1930673 1922412 1926028 2 1922068..1926370,1926371..1930673,
1 1935216 1940681 1935520 1936731 2 1935216..1937034,1938863..1940681,
1 2025534 2037336 2035801 2037200 4 2025534..2027206,2032166..2033838,2033992..2035664,2035664..2037336,
1 2291092 2299901 2291361 2295228 2 2291092..2295496,2295497..2299901,
1 2408858 2413171 2412187 2412974 2 2408858..2410040,2411989..2413171,
1 2454770 2465569 2455250 2457163 3 2454770..2457642,2458227..2461099,2462697..2465569,
1 2482344 2488116 2485315 2488032 2 2482344..2485230,2485230..2488116,
1 2522766 2530165 2523363 2525744 2 2522766..2526340,2526591..2530165,
1 2658263 2665830 2663067 2665277 2 2658263..2661581,2662512..2665830,
1 2848303 2877283 2848432 2852017 4 2848303..2852145,2852146..2855988,2867013..2870855,2873441..2877283,
1 2909853 2920925 2916063 2920253 2 2909853..2915389,2915389..2920925,
1 2914257 2921827 2920265 2921514 2 2914257..2916133,2919951..2921827,
1 3055446 3066439 3055291 3061098 2 3055446..3060942,3060943..3066439,
1 3788689 3821367 3797744 3799427 5 3788689..3791215,3793942..3796468,3797322..3799848,3810016..3812542,3818841..3821367,
1 3796791 3814695 3800306 3800859 4 3796791..3797623,3800166..3800998,3812843..3813675,3813863..3814695,
1 3822846 3825559 3824246 3825518 2 3822846..3824202,3824203..3825559,
1 3851875 3859448 3851899 3855638 2 3851875..3855661,3855662..3859448,
1 4196848 4222596 4197232 4198763 2 4196848..4199146,4220298..4222596,
1 4478291 4483976 4482018 4483584 2 4478291..4480643,4481624..4483976,
1 4604651 4607835 4606590 4607586 2 4604651..4606147,4606339..4607835,
1 4619722 4638888 4633655 4636956 5 4619722..4622110,4625656..4628044,4631357..4633745,4634111..4636499,4636500..4638888,
1 4818392 4832076 4822784 4824028 4 4818392..4820260,4822471..4824339,4827856..4829724,4830208..4832076,
1 4833456 4837371 4833729 4834814 2 4833456..4835086,4835741..4837371,
1 4849847 4854580 4852548 4854248 2 4849847..4852213,4852214..4854580,
1 4861119 4871779 4868866 4871223 3 4861119..4864589,4864839..4868309,4868309..4871779,
1 5022895 5031658 5028068 5030940 2 5022895..5027205,5027348..5031658,
1 5127690 5134376 5128088 5129673 2 5127690..5130070,5131996..5134376,
1 5152011 5160449 5154634 5155596 4 5152011..5153457,5154391..5155837,5157101..5158547,5159003..5160449,
1 5211861 5227594 5224483 5226924 4 5211861..5215643,5215655..5219437,5220029..5223811,5223812..5227594,
1 5366916 5381071 5370417 5374382 4 5366916..5370188,5370763..5374035,5374527..5377799,5377799..5381071,
1 5426471 5434100 5431523 5433519 2 5426471..5429631,5430940..5434100,
1 5517047 5562362 5532271 5534876 7 5517047..5519973,5522377..5525303,5525303..5528229,5528627..5531553,5532110..5535036,5535602..5538528,5559436..5562362,
#truncated for example purposes
reformat the redtandem.out to a bed file for comparative purposes
1
less redtandem.outlJHm |awk '{print $1,$6,$7}' | tr ".." " " |tr "," "\n" |sed '/^$/d' |awk -v I=0 '{if(NF>2) {I=$1" "$2;print $0} else {print I,$1,$2}}' |awk '{print $1,$3,$4,$2}' >redtandem.bed
Mummer – Find recent duplications or assembly artifacts
Run mummer (unmasked genome is ok here)
1
mummer -mum -l 500 -b -n -qthreads 16 TAIR10_chr_all.fas TAIR10_chr_all.fas >SelfAlignmentsMummer.out
Reformat the mummer output so that we have something for comparative purposes. Also remove the exact whole chromosome alignments to self.
1
2
3
less SelfAlignmentsMummer.out |awk -v source=0 '{if(substr($1,1,1)==">") {source=$0;} else {print source,$0}}' |awk '{if($3=="Reverse") {print $2,$6,$6+$7"\t"$4,$5,$5+$7} else {print $2,$5,$5+$6"\t"$3,$4,$4+$6}}' |sort -k1,1V -k2,3n |awk '{if($1==$4 && $2==$5 && $3==$6) {next} else {print $0}}' >DuplicatedRegions.txt
awk '{print $1,$2,$3,$4":"$5"-"$6}' DuplicatedRegions.txt >DuplicatedRegions.bed
How do these datasets correllate?
hint: only the recent ones will be seen
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
bedtools intersect -wo -a <(less DuplicatedRegions.bed|tr " " "\t") -b <(less ../10_RedTandem/redtandem.bed |tr " " "\t") |less
1 13386657 13387307 5:10440852-10441502 1 13386618 13387900 5 650
1 13388447 13389095 2:4883882-4884530 1 13387901 13389183 5 648
1 21412731 21413328 1:9677328-9677925 1 21407689 21413155 2 424
1 21412731 21413328 1:9677328-9677925 1 21413321 21418787 2 7
1 24230387 24236076 1:6198742-6204431 1 24233150 24236638 2 2926
1 26643151 26645519 3:22696765-22699133 1 26644607 26646275 2 912
1 26645451 26646587 1:21528398-21529534 1 26644607 26646275 2 824
1 29646499 29647005 1:889432-889938 1 29645831 29647847 4 506
2 10634309 10634911 3:7279349-7279951 2 10633847 10634447 2 138
2 10634309 10634911 3:7279349-7279951 2 10634447 10635047 2 464
2 16079514 16080346 2:3626656-3627488 2 16078489 16081707 2 832
2 16412777 16414454 mitochondria:245475-247152 2 16413740 16416842 2 714
2 16414795 16416555 mitochondria:247520-249280 2 16413740 16416842 2 1760
2 16416552 16417703 mitochondria:249278-250429 2 16413740 16416842 2 290
2 16418728 16419317 mitochondria:251452-252041 2 16418927 16422029 2 390
2 16419544 16420129 mitochondria:252269-252854 2 16418927 16422029 2 585
2 16420374 16422112 mitochondria:253100-254838 2 16418927 16422029 2 1655
3 7367650 7368298 3:16221068-16221716 3 7366465 7368387 8 648
3 17386970 17387523 3:6056936-6057489 3 17387437 17391893 2 86
3 17387524 17388804 3:6057490-6058770 3 17387437 17391893 2 1280
3 17619672 17620314 3:9471268-9471910 3 17617857 17622855 6 642
5 8502569 8503246 5:10203683-10204360 5 8499963 8503197 2 628
5 14858156 14858803 5:67559-68206 5 14858281 14860575 6 522
5 15132063 15132594 1:15049737-15050268 5 15131517 15134069 5 531
5 15157469 15158050 5:11938756-11939337 5 15157873 15160579 2 177
5 15158931 15159949 5:11940219-11941237 5 15157873 15160579 2 1018
5 15260088 15261003 5:11838684-11839599 5 15260164 15263320 3 839
5 18767018 18767554 3:14601781-14602317 5 18763504 18769106 2 536
5 23994417 23995938 4:15556173-15557694 5 23994143 23996001 2 1521
Woah, only 29 regions seem to overlap between the two outputs