Using Inverted Repeats Finder to identify DNA transposon borders in genomes

Inverted Repeats Finder (IRF) is designed primarily to identify inverted repeats in a genome. When its output is combined with the output of other repeat prediction software, full-length DNA transposons can be identified. Software download

Obtain your genome

# I will be using the Arabidopsis TAIR10 genome assembly (the assembly that the Araport11 annotation is based on)
TAIR10_chr_all.fas

IRF parameters

Please use: irf File Match Mismatch Delta PM PI Minscore Maxlength MaxLoop [options]

Where: (all weights, penalties, and scores are positive)
  File = sequences input file
  Match  = matching weight
  Mismatch  = mismatching penalty
  Delta = indel penalty
  PM = match probability (whole number)
  PI = indel probability (whole number)
  Minscore = minimum alignment score to report
  MaxLength = maximum stem length to report (10,000 minimum and no upper limit, but system will run out of memory if this is too large)
  MaxLoop = filters results to have loop less than this value (will not give you more results unless you increase -t4,-t5,-t7 as well)
  [options] = one or more of the following :
               -m    masked sequence file
               -f    flanking sequence
               -d    data file
               -h    suppress HTML output

               -l    lowercase letters do not participate in a k-tuple match, but can be part of an alignment
               -gt   allow the GT match (gt matching weight must follow immediately after the switch)
               -mr   target is mirror repeats
               -r    set the identity value of the redundancy algorithm (value 60 to 100 must follow immediately after the switch)

               -la   lookahead test enabled. Results are slightly different as a repeat might be found at a different interval. Faster.
               -a3   perform a third alignment going inward. Produces longer or better alignments. Slower.
               -a4   same as a3 but alignment is of maximum narrowband width. Slightly better results than a3. Much slower.
               -i1   Do not stop once a repeat is found at a certain interval and try larger intervals at nearby centers. Better(?) results. Slower.
               -i2   Do not stop once a repeat is found at a certain interval and try all intervals at same and nearby centers. Better(?) results. Much slower.
               -r0   do not eliminate redundancy from the output
               -r2   modified redundancy algorithm, does not remove stuff which is redundant to redundant. Slower and not good for TA repeat regions, would not leave the largest, but a whole bunch.

               -t4   set the maximum loop separation for tuple of length4 (default 154, separation <=1,000 must follow)
               -t5   set the maximum loop separation for tuple of length5 (default 813, separation <=10,000 must follow)
               -t7   set the maximum loop separation for tuple of length7 (default 14800, limited by your system's memory, make sure you increase maxloop to the same value)
               -ngs  more compact .dat output on multisequence files, returns 0 on success.

Note the sequence file should be in FASTA format:

>Name of sequence
   aggaaacctg ccatggcctc ctggtgagct gtcctcatcc actgctcgct gcctctccag
   atactctgac ccatggatcc cctgggtgca gccaagccac aatggccatg gcgccgctgt
   actcccaccc gccccaccct cctgatcctg ctatggacat ggcctttcca catccctgtg
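
Before running IRF it can help to confirm that the input really parses as FASTA. The following is a minimal sketch on a small made-up file (example.fas and example.lengths are hypothetical names; in practice point the same commands at TAIR10_chr_all.fas). It counts the header lines and reports the length of each sequence.

```shell
# Build a tiny FASTA file for illustration (example.fas is a made-up name;
# in practice run these commands on TAIR10_chr_all.fas instead).
cat > example.fas <<'EOF'
>seq1 first example sequence
aggaaacctgccatggcctcctggtgagct
gtcctcatccactgctcgctgcctctccag
>seq2 second example sequence
atactctgacccatggatcc
EOF

# Count the sequences: every record starts with a ">" header line.
n_seqs=$(grep -c '^>' example.fas)
echo "sequences: $n_seqs"

# Report the length of each sequence. Spaces and tabs inside sequence
# lines are removed before counting, so blocked layouts are handled too.
awk '/^>/ { if (name != "") print name, len; name = substr($1, 2); len = 0; next }
     { gsub(/[ \t]/, ""); len += length($0) }
     END { if (name != "") print name, len }' example.fas > example.lengths
cat example.lengths
```

If the sequence count or the lengths are not what you expect for your assembly, fix the input file before spending time on an IRF run.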

Run Inverted Repeats Finder on the genome

# Everything here is default except the -t7 parameter, which is memory dependent. -ngs gives a more compact output that is easier to parse.
irf TAIR10_chr_all.fas  2 3 5 80 10 40 500000 10000 -t7 20000 -ngs >irf.out

Sample output of irf.out

@1 CHROMOSOME dumped from ADB: Feb/3/09 16:9; last updated: 2009-02-02
22795 22834 40 22868 22906 39 33 95.0000 2.5000 68 75.9494 24.0506 76.3158 23.6842 0.0000 45702 45702 AAGTCGTAAAACTGAAATCTAAAAATGAAAAGATTTCGAT ATCGAAATCTTTTCATTTTTAGTTTTCAGTTTTACACTT
28517 28584 68 28623 28686 64 38 77.9412 5.8824 53 55.3030 44.6970 56.6038 43.3962 0.0000 57207 57206 GAGGACTTACATGGCCTCAAGTCACCTGTGGTGTTGTGCAAGAAGGAGAAGCAAAGTCTGTCTATGTA TACAAGACCGGCTTTTCTTCTACTTCTTGCACAACCTGAGGTTATTGAGGCTATACAAGTCTTC
113932 113997 66 114100 114165 66 102 79.1045 2.9851 60 74.2424 25.7576 79.2453 20.7547 0.0000 228097 228097 CATCTACATTGGACATATTAATGGGGTGTCTTCTACCATAATAAAATATTAAGAAAATTAAAATAT ATATTTTAGTTTTTTTAATAATTTCTTACGTGGGAAGACATCCAATATAATTGTGCAATGTGGATG
213810 213850 41 213893 213933 41 42 87.8049 0.0000 57 90.2439 9.7561 94.4444 5.5556 0.0000 427743 427743 TTTCTCTGGATAATATTATTATTTTAAAATAATAATAAAAT ATTTAATTATTATTTTAAGAAAATAATATTATGCATAGAAA
234010 234049 40 234116 234155 40 66 87.5000 0.0000 55 60.0000 40.0000 60.0000 40.0000 0.0000 468165 468165 GAGATGATGCGCAAATGCGGATATAAAGGTATATCATGAC GTCCTGATTTACATTGATATCCGCATTTGCGCATCATTTC
257700 257859 160 258480 258638 159 620 93.7500 0.6250 268 73.6677 26.3323 74.6667 25.3333 0.0000 516339 516339 ATTTACAAATGGGAATTTAATGAAAAAAACCCTCAACTTTGGCCAGATCCATTTTTAAACCCTAAACTATGTTTTTAGTAAAACAAATTCTGAACTAAAACCTGTTAATAAACTTAACCCCATAGTAATTAAATATTAATGGGATATTAATTTCCAAAAA TTTTTGGAAATTAATATCTCATTAATATTTAATTACTGTGGGGTTAAGTTTATTAACAGGTTTTAGTTCAGGGTTTGTTTAACTAAAAACATAGTTCAGGATTTAAAAATGGATCTGGCCAAAGTTGAGGGTTTTTTTCATTAAATTCCCTTTACAAAT
447046 447134 89 447562 447648 87 427 80.8989 2.2472 89 83.5227 16.4773 86.1111 13.8889 0.0000 894696 894696 CGTGCCAATGAATTTTGATGCTATAAACAAAAAAATATAGTTTAATATTTTAATAAATAATGTAAACATAACAAAAAAATTATTTATTA TATTAATTATTTTTTTTGTTATGTTTATAATATTTATTAAAATATTAAATCATATTTTCGCGTATGGTATCAAATTTTTTTGGCACG
452958 452985 28 453007 453034 28 21 92.8571 0.0000 46 71.4286 28.5714 73.0769 26.9231 0.0000 905992 905992 TATTTACTTGATAGAATGGGCCTATAAT ATTATAGGCCCATGATATCAAGTAAATA
512778 512862 85 512935 513019 85 72 91.7647 0.0000 135 64.7059 35.2941 65.3846 34.6154 0.0000 1025797 1025797 ATAGTTGATTTCTAATTTAACCTATAAATTATCGTTGATTCGGCCAAATCGACTCACCATTAACACTTCTTAACAGCTCTCCTAA TTAGGAGCGCTATTAAGAAGTGTTAACGGTGAGTCGATTTGGCCGAGTCAACGATAAATTATAGGTTAAATTAGAAGTCAACGAT
512695 512818 124 513008 513132 125 189 91.2000 0.8000 193 73.8956 26.1044 75.4386 24.5614 0.0000 1025826 1025827 AAAATGTTATTTAATACCTGAACTTTCAAAAAATGGTCAAATTAACCGTGAATTCTTGAAATGACCGTTTTATACCTCAACAAATAGTTGATTTCTAATTTAACCTATAAATTATCGTTGATTC GAAGTCAACGATAAAGTATAGGTTAAATTAGAAGTCAATTTTTTGTTGAGGTATAAAACGACAATTTCAAGAGTTTACGGTTAATTTGACCATTTTTTGAAAGTTCAGGTATTAAATAACATTTT
591098 591118 21 591192 591212 21 73 100.0000 0.0000 42 57.1429 42.8571 57.1429 42.8571 0.0000 1182310 1182310 TGTTTGCTGATTAGAGAGAGC GCTCTCTCTAATCAGCAAACA
599484 599551 68 599579 599646 68 27 85.2941 0.0000 86 73.5294 26.4706 77.5862 22.4138 0.0000 1199130 1199130 GATCATCATTATTGATGATCTCTTAAAACAATTCTTATGCTAAGAGACATGTTTTATAACTAACAAAA TTTTGTTAATTATAAGACATATCTCTTAACATAAGAGTTATGTTAAGAGACCATCAATAAGGATGGTC
639991 640140 150 640189 640339 151 48 68.8312 4.5455 54 85.7143 14.2857 87.7358 12.2642 0.0000 1280329 1280329 GTCCAATTAGTTTACACAAAATTTAAAATTTTAACACATATAATAAAAAACTTTATAAAGTTTTAATAGTAGTAATACAAAATATAGTTTTAAAAACATTTTTGAATGAAATAAAAATAAGTGTTAAAAAGTTAAATTAATTGTAAACTA TAGTTTACACAAAATTTAAAATTTTAACACATATAATAAAAAACTTTATAAAGTTTTAATAGTAGTAATATAAAATATAGTTTTAAAAACATTTTTGAATAAAATAAAAATAAGTGTTAAAAAGTTAAATTTAGTGTAAACCATATTGAAC
640182 640222 41 640298 640339 42 75 83.3333 2.3810 47 79.5181 20.4819 80.0000 20.0000 0.0000 1280520 1280520 GTCCAATTAGTTTACACAAAATTTAAAATTTTAACACATAT ATAAGTGTTAAAAAGTTAAATTTAGTGTAAACCATATTGAAC
640186 640222 37 640685 640722 38 462 84.2105 2.6316 44 85.3333 14.6667 84.3750 15.6250 0.0000 1280907 1280907 AATTAGTTTACACAAAATTTAAAATTTTAACACATAT ATAAGTGTTAAAAAGTTAAATTAAGTGTAAACTATATT
639995 640031 37 640685 640722 38 653 84.2105 2.6316 44 85.3333 14.6667 84.3750 15.6250 0.0000 1280716 1280716 AATTAGTTTACACAAAATTTAAAATTTTAACACATAT ATAAGTGTTAAAAAGTTAAATTAAGTGTAAACTATATT
759201 759258 58 759263 759316 54 4 80.0000 13.3333 44 90.1786 9.8214 91.6667 8.3333 0.0000 1518521 1518519 TTATTTTAAAGAATTGAAACTTTAAAATGTTTCAAGAAATTATAAATATTATAACTTT AAAGTTAAATATTATAATTTTAAAAACTTTTTATAAAGTTTATCTTTAAATTAA
759263 759317 55 759339 759393 55 21 79.3103 10.3448 44 93.6364 6.3636 95.6522 4.3478 0.0000 1518656 1518657 AAAGTTAAATATTATAATTTTAAAAACTTTTTATAAAGTTTATCTTTAAATTAAA TTTATTTTAAAAAATTGAAACTTTAAAAAGTTTAAAAAATTATAAATTAAATTTT
782370 782399 30 782575 782604 30 175 86.6667 0.0000 40 80.0000 20.0000 80.7692 19.2308 0.0000 1564974 1564974 AAGTGTTAACATTTTAAATTTTGTGTAAAC GTTTACACATAATTTTAAATTTGAACACTT
784928 784953 26 785411 785436 26 457 92.3077 0.0000 42 84.6154 15.3846 87.5000 12.5000 0.0000 1570364 1570364 TACAGTAAAACCTCTATAAATTAATA TATTAATTTATAAAGATTTTACTGTA
786267 786340 74 788670 788740 71 2329 81.0811 4.0541 72 73.7931 26.2069 76.6667 23.3333 0.0000 1575010 1575007 ATAAACAAAAATTCGGATAAACAAATCCATGTCTATTCCGATATATTTTTATAATCTGGATTAGTAATTCGAAA TTTCACATTAGTCATCCGGTTTATAAAATATCGGAAAAGACATTGAATTGTTTATCCCAACTTTTGTTTAT
780281 781794 1514 791699 793297 1599 9904 80.5742 9.8351 1362 68.6155 31.3845 70.1289 29.8711 0.0000 1573493 1573542 TTGCAAAGCCCAACAGTTGGAGTTCCAGCTCACCCACTAGCCGTTATTACGCAGGCTAATTTTTCAACGTGCGAGAAACCAGATTTCCTTTTTCCCGAAGACTATTTAGGGAATTTCCTCTCTTGAGCCAAATTTTCCTTTTTCCCAAGGGGGGTATTTAGGTATTTTCGTCTCTCTCGGCCAAGTTTCCTTTTTCCGTTCCTCCGTAATAAAGCTTTCATTCTTTTTGTAGGTTTCCATTAGAAGTTCAGAAATACTGTTTATAAATACAAATCCAGAGTCAGAGATCCATTCGATAGCACTTTCCTCTCTTTCTCTCATCGTAAAACCGTTCTTCGTTTTTTCTATTTCTCTCGTCTCTTTCTTCTACCCAGATCACTGATCAACCGTGAATTTTACTCTACTCTTTCATCAATTTTACTGTGTTTAATTCATAAAACAAAACAAACGATCGATTTTTGTTTTTTTTTGGGGATGAAGATCAGAAGCAAAAGCAAAGGCTATCGTCAAGAGATCAAATCAACTTGCACCGACTCCAACACTCTGTTCTCAAAAGGTACTTTTCGTCTTGTTGGAATTTTTGGCCAAATAACGTCTTTTTTTTATATGGTATCTAATTACTGTATTCTGGTTGATCGCAAAACAAAGAATTCTCATGTTTAAAGTGGGTTTCGTCTAATCCAATTTGAATTTTAGGTCTTATATTATTTCCCTAGCTCTAACCCTAATCATATATTTCTCAAAGTTTTTGATTTTTATTCAAGATTCCTTCTAATCAATTATTTCTTCTTTGTATGTAGATGTGGTTGCAACTTCGGAGCTTTTTTCATCTGTGATATCGTAGCGTGTGAAAGACGGATAATTCTACAGAGACAATGCATCATCCCATTCGGTCAGGTCAATTTTCATCCCTATTTTGTTTATAATCTATAATCTTATAGATAGTTTAATTACAAAAAAGGTTTTTATACTTTTGATTTTTCAAAACAAAATATATTTCTCAAAGTTTTGATTTTTAAAAGTTTTCAACTCTTTTTTATTTTTCTTTTTTTTTATAAATACTAAATTAATATGACATTTTGAAATCTAACCGGATTTTGGATCCCGCCATTTAACCTTGGTCATTAGCGATTCCACAAATCCAACAAGTTTTTCACATCTTCATGTATATTTAGCATTTATTTTTCTTAAAAAGTTTTCAACTCTTTTTTTTTTTTTTTTAAAGTTTTATAAATACTAAATTAATATGACGTTGTGAAATCTAACCGGATTTTGGATCCCGCCATTTAACCTTGGTCATTAGCGATTTCACAAATCCAACATGTTTTTCACATCTTCATGTATATTTAACATTTATTTATTTTTTTAAAAATTCAACTCTTTTTTTAAAGGGTTATATAAATATTAAATTAATATGATGTCGTGAAATCTAACCGGATTTTGGATCCCGCCATTGAACCGTGGTTATTAGCGATTCCACAAATCCAACATGTTTTTCATATTTTCATATATAT 
ATATACATGAAGATGTGAAAAACATGTTGGATTTGTGGAATCGCTAATGACCACGGTTAAATGGCGGGATCCAAAATCCGGTTATATTTCACAACGTTGTATTAATTTAGTACTTGTAGAACTTTTTTTAAAAAATAATCAAAAAAGAAAGAGGTGAAAACCTTTTAAGAAAAGAAAGTAAAAAAAATATTAAGCTAAATATATATGAAGATGTGAAAAACATGTTGGATTTGTAGAATCGCTAATGACCACGATTAAATGGCGAGATCCAAAATCCGGTTAGATTTCACAACGTAATATTAATTTAGTACTTATAAAACGTTTTTAAAAAAAAATCCAAAAAGCAAGAGTTGAATACTTATTAAGAAAAGAAAGTAAAAAAATACTAAACTAAATATATATGAAGATGTGAAAAACATGTTGGATTTGTGGAATCGCTAATGACCACGGTTAAATGGCGGGATCCAAAATCCGGTTAGATTTCACAATGTCATATTAATTTAATATTATAGAAAAAAAAAACAAAAGAGAGAGTTGAAAACCATTAAAAATTTTTTTTTTGAGAAATATATTTTGTTTTGAAAAATCAAAAGTATAAAAACCTTTTTTGTAATTAACCTATCTATAAGATTATAGATTATAAACAAGGTAGGGATGAAAATTAACCTGACCGAATGGGATGATACATTGTCTCTGTAGAATTATCCGTCTTTCACACGCTACGATATCACAGATGAAAAAAGCTCCGAAGTTGCAACCACATCTACATACAAAGAAGAAATAATTGATTAGAAGGAATCTTGAATAAAAATCAAAAACTTTGAGAAATATATGATTAGGGTTAGAGGGAAATAATATAAGACCTAAAATTCAAATTGGATTTGAAGAAAACCACTTTGAACATGAGAATTCTTTGTTTTTGCGATCAACCAGAATAATTAGATACTAAAGATTAACAAACCATATAAAAAAGACGTTGATTTGGCCGAAAATTCCAACAAGACGTAGATAAAAAAACAAAAAAAAACAGAAACAGAGATAAGGAAAAAAAATACCTTTTGAGAACAAAGTGTTGGAGTCCGCGCAAGTTGATTTGATCTCTTGACGATAGCCTTTGCTTTTTTTGCTTCTGATCGTCATCCCCAAAAAACAAAAATCGACCGTATCTTTTGTTTCATGAATCAAACACTGTAAAATTCACGACTAATATAGGTAGTGCTATCGAATGAATCTCTGAACGTTTTTCTATATGAGAAAAGGAAGAGAGAAAACAAAGAAAGTTTTTCTGATATCAGAAAAGAGAGAGAGAGAAGTCGCTATCGAATGGATCTCTGATGCTGGAGTTATACTTATAAACAATTTTATGTACGTTGAATCAGAAACCCAACATAAAACCTAGAGAAAAAGGAAAACTTGGCTGTGAGAGAGTCGAAATTACCTAAATACCCCATTGCAGAAAAGGAAAACTCGGCCGAGAGAGCCGAAATTACATAAATATCCCCATCGAAAAAAGGAAACACTATTTTTGACCGTTGGAAATTTCGTTAACGTAATAACGCCTAGTGGCTGAGCTGGAACTCCAACTGTTGGGCTTTGTAA
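
The -ngs output is whitespace-delimited, so it can be filtered with awk before any further processing. Judging from the sample above, columns 1 and 2 are the start and end of the left arm, column 3 its length, columns 4 to 6 the same for the right arm, and column 7 the loop length. The sketch below additionally assumes that column 8 is the percent identity between the arms; check this against the IRF documentation before relying on it. The thresholds (90 and 50) and the file names are arbitrary, for illustration only.

```shell
# A header line and three records copied from the sample output above
# (irf_sample.out is a made-up file name).
cat > irf_sample.out <<'EOF'
@1 CHROMOSOME dumped from ADB: Feb/3/09 16:9; last updated: 2009-02-02
22795 22834 40 22868 22906 39 33 95.0000 2.5000 68 75.9494 24.0506 76.3158 23.6842 0.0000 45702 45702 AAGTCGTAAAACTGAAATCTAAAAATGAAAAGATTTCGAT ATCGAAATCTTTTCATTTTTAGTTTTCAGTTTTACACTT
28517 28584 68 28623 28686 64 38 77.9412 5.8824 53 55.3030 44.6970 56.6038 43.3962 0.0000 57207 57206 GAGGACTTACATGGCCTCAAGTCACCTGTGGTGTTGTGCAAGAAGGAGAAGCAAAGTCTGTCTATGTA TACAAGACCGGCTTTTCTTCTACTTCTTGCACAACCTGAGGTTATTGAGGCTATACAAGTCTTC
591098 591118 21 591192 591212 21 73 100.0000 0.0000 42 57.1429 42.8571 57.1429 42.8571 0.0000 1182310 1182310 TGTTTGCTGATTAGAGAGAGC GCTCTCTCTAATCAGCAAACA
EOF

# Keep "@" header lines, plus records whose column 8 (assumed: percent
# identity) is at least min_ident and whose loop (column 7) is at most max_loop.
awk -v min_ident=90 -v max_loop=50 '
  /^@/ { print; next }
  NF >= 8 && $8 + 0 >= min_ident && $7 + 0 <= max_loop
' irf_sample.out > irf_filtered.out

cat irf_filtered.out
```

Because the header lines are passed through unchanged, the filtered file has the same layout as irf.out and can be used in its place in later steps.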

To determine whether these inverted repeats are in fact the terminal inverted repeats of DNA transposons, we need to perform a bedtools intersect against a repeat annotation of the same genome produced by another tool (REPET or RepeatModeler). The easiest way to do this is to convert irf.out into a BED file that lists, for each inverted repeat, the start position of the left border and the end position of the right border.

Here is an example of how to convert the above to a bed file (assuming multiple chromosomes).

# Convert irf.out to a three-column BED file: chromosome, left border start, right border end.
#  1. awk: for record lines (first character is not "@"), keep column 1 (left border start) and column 5 (right border end); print "@" chromosome lines unchanged.
#  2. sed: remove everything after CHROMOSOME on the chromosome lines.
#  3. awk: on chromosome lines, swap the two fields so that "@1 CHROMOSOME" becomes "CHROMOSOME@1".
#  4. sed: delete the "@".
#  5. awk: a line starting with "C" sets the current chromosome name; every other line is printed as name, left border, right border.
#  6. tr: convert spaces to tabs.
awk '{if (substr($1,1,1)!="@") {print $1,$5} else {print $0}}' irf.out |sed 's/CHROMOSOME.*/CHROMOSOME/g' |awk '{if (substr($1,1,1)=="@") {print $2$1} else {print $0}}' |sed 's/@//g' |awk -v name=0 '{if(substr($1,1,1)=="C") {name=$1} else {print name,$1,$2}}' |tr " " "\t" >IRF.bed
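
Two details are worth checking before the intersect step. BED start coordinates are 0-based, whereas the IRF positions appear to be 1-based and inclusive (an assumption, based on the arm lengths in the sample output), and the chromosome names must match those in the other repeat annotation. The sketch below uses small stand-in files with made-up names (IRF_example.bed, repeats_example.bed) and only calls bedtools if it is installed.

```shell
# Stand-in for the IRF.bed produced above (tab-separated: chromosome,
# left border start, right border end). File names here are made up.
printf 'CHROMOSOME1\t22795\t22906\nCHROMOSOME1\t591098\t591212\n' > IRF_example.bed

# BED starts are 0-based. Assuming the IRF positions are 1-based and inclusive,
# subtract 1 from the start and leave the end unchanged.
awk 'BEGIN { FS = OFS = "\t" } { print $1, $2 - 1, $3 }' IRF_example.bed > IRF_example.0based.bed

# Hypothetical repeat annotation in BED format (for example exported from
# REPET or RepeatModeler results). Its chromosome names must match the IRF file.
printf 'CHROMOSOME1\t22700\t23000\tDNA_TE_candidate\n' > repeats_example.bed

# Report IRF spans that overlap an annotated repeat, if bedtools is available.
# -wa and -wb print the full entry from each file for every overlap.
if command -v bedtools >/dev/null 2>&1; then
    bedtools intersect -a IRF_example.0based.bed -b repeats_example.bed -wa -wb > IRF_vs_repeats.txt
else
    echo "bedtools not found; skipping the intersect step" >&2
fi
```

If the other annotation names its chromosomes differently (for example Chr1 rather than CHROMOSOME1), rename one of the two files first, otherwise the intersect will silently return nothing.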
