RepeatModeler is a de novo repeat identification program that produces a library of repeat family consensus sequences, which RepeatMasker can then use to mask repeats in a genome.
Two things to keep in mind with this software. First, it can take a long time on large genomes (a genome over 1 Gb can take 96 hours or more on a 16-CPU node). Second, you need to pass the right files and parameters so that the repeats are not only grouped by family, but also annotated.
RepeatModeler: http://www.repeatmasker.org/RepeatModeler/
RepeatMasker: http://www.repeatmasker.org/RMDownload.html
Get your genome and unzip
#I will be using the Arabidopsis TAIR10 genome assembly
wget https://www.arabidopsis.org/download_files/Genes/TAIR10_genome_release/TAIR10_chromosome_files/TAIR10_chr_all.fas
TAIR10_chr_all.fas
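Before starting a long run, it can help to confirm that the download is a complete FASTA file. The sketch below counts sequences and bases with awk; `fasta_stats` is a hypothetical helper name, not part of RepeatModeler or RepeatMasker. For this genome the numbers should agree with the "sequences" and "total length" lines that RepeatMasker reports in its summary table.

```shell
# fasta_stats is a hypothetical helper (not a RepeatModeler/RepeatMasker tool).
# It prints the number of sequences and the total number of bases in a FASTA file.
fasta_stats() {
    awk '
        /^>/ { nseq++; next }                            # header line: one more sequence
        { gsub(/[ \t\r]/, ""); bases += length($0) }     # sequence line: count its characters
        END { printf "sequences: %d\ntotal length: %.0f bp\n", nseq, bases }
    ' "$1"
}

# Only run on the genome if it is present in the current directory.
if [ -f TAIR10_chr_all.fas ]; then
    fasta_stats TAIR10_chr_all.fas
fi
```

The count includes every character on a sequence line, so N runs are part of the total.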
Run RepeatModeler
This can take longer than 96 hours on one node with 16 threads if the genome is larger than 1 Gb. For Arabidopsis it took 9.5 hours.
# make up a name for your database, choose your search engine, the number of threads, and the genome file
module load GIF2/repeatmodeler/1.0.8
BuildDatabase -name TAIR10_chr_all.DB -engine rmblast TAIR10_chr_all.fas
RepeatModeler -database TAIR10_chr_all.DB -engine ncbi -pa 16
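RepeatModeler writes its results into a working directory named RM_ followed by a process ID and a timestamp, so the exact name differs on every run. The sketch below looks for the newest `consensi.fa.classified` under the current directory; `find_classified_lib` is a hypothetical helper name, and it assumes the RM_* directories sit directly under the directory you pass in.

```shell
# find_classified_lib is a hypothetical helper (not part of RepeatModeler).
# It prints the most recently modified RM_*/consensi.fa.classified under a directory.
find_classified_lib() {
    dir="${1:-.}"
    newest=""
    for f in "$dir"/RM_*/consensi.fa.classified; do
        [ -e "$f" ] || continue                  # the glob matched nothing
        if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
            newest="$f"                          # keep the file with the latest mtime
        fi
    done
    if [ -n "$newest" ]; then
        printf '%s\n' "$newest"
    fi
    return 0
}

lib=$(find_classified_lib .)
if [ -n "$lib" ]; then
    echo "classified library: $lib"
    echo "repeat families: $(grep -c '^>' "$lib")"   # one FASTA header per family
fi
```

If nothing is printed, the run has either not finished or did not reach the classification step.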
Run RepeatMasker
This can take about 24-48 hours to finish on a genome over 1 Gb. However, the Arabidopsis run below took 14 minutes with 16 threads.
#I moved to a different directory, so I softlinked the classified library. Make sure you use consensi.fa.classified rather than consensi.fa.
#With the unclassified file, RepeatMasker will still mask the repeats, but they will be unannotated and you will not get a table of classified elements after the run.
ln -s RM_3245.WedMay21605262018/consensi.fa.classified
#This will produce a gff for the repeat mapping, a masked fasta file, and a table summarizing the repeats found in the genome.
module load GIF2/repeatmasker/4.0.6
RepeatMasker -pa 16 -gff -lib consensi.fa.classified TAIR10_chr_all.fas
This is what I find in Arabidopsis.
==================================================
file name: TAIR10_chr_all.fas
sequences:             7
total length:  119667750 bp  (119482146 bp excl N/X-runs)
GC level:         36.06 %
bases masked:   17878420 bp ( 14.94 %)
==================================================
               number of       length   percentage
               elements*     occupied  of sequence
--------------------------------------------------
SINEs:               381        91996 bp    0.08 %
      ALUs             0            0 bp    0.00 %
      MIRs             0            0 bp    0.00 %
LINEs:              2505      1306247 bp    1.09 %
      LINE1         2407      1277415 bp    1.07 %
      LINE2            0            0 bp    0.00 %
      L3/CR1           0            0 bp    0.00 %
LTR elements:       6875      7287909 bp    6.09 %
      ERVL             0            0 bp    0.00 %
      ERVL-MaLRs       0            0 bp    0.00 %
      ERV_classI       0            0 bp    0.00 %
      ERV_classII      0            0 bp    0.00 %
DNA elements:       5916      2718611 bp    2.27 %
      hAT-Charlie    435       163286 bp    0.14 %
      TcMar-Tigger     0            0 bp    0.00 %
Unclassified:       7261      3594123 bp    3.00 %
Total interspersed repeats:  14998886 bp   12.53 %
Small RNA:           442       115046 bp    0.10 %
Satellites:         1064       981082 bp    0.82 %
Simple repeats:    35831      1435203 bp    1.20 %
Low complexity:     9032       443907 bp    0.37 %
==================================================
* most repeats fragmented by insertions or deletions
  have been counted as one element
The query species was assumed to be homo
RepeatMasker version open-4.0.6 , default mode
run with rmblastn version 2.2.27+
The query was compared to classified sequences in "consensi.fa.classified"
RepBase Update 20160829, RM database version 20160829
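As a quick sanity check on the table, the "Total interspersed repeats" row is the sum of the five top-level classes (SINEs, LINEs, LTR elements, DNA elements, Unclassified), and the masked percentage is bases masked divided by total length. The sketch below redoes that arithmetic with the numbers copied from the table; the variable names are only illustrative.

```shell
# Values copied from the summary table (all lengths in bp).
total=119667750
masked=17878420
sine_bp=91996
line_bp=1306247
ltr_bp=7287909
dna_bp=2718611
unclassified_bp=3594123

# Interspersed repeats = sum of the five top-level element classes.
interspersed=$((sine_bp + line_bp + ltr_bp + dna_bp + unclassified_bp))
echo "total interspersed: $interspersed bp"    # prints 14998886 bp

# Percent of the genome masked, rounded to two decimals like the table.
pct=$(awk -v m="$masked" -v t="$total" 'BEGIN { printf "%.2f", 100 * m / t }')
echo "bases masked: $pct %"                    # prints 14.94 %
```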
RepeatMasker also writes a GFF file (requested with the -gff option) that can be used for many other genomic comparisons.
##gff-version 2
##date 2018-05-03
##sequence-region TAIR10_chr_all.fas
1 RepeatMasker similarity 1 115 13.1 + . Target "Motif:A-rich" 1 107
1 RepeatMasker similarity 1066 1097 10.0 + . Target "Motif:(C)n" 1 32
1 RepeatMasker similarity 1155 1187 17.1 + . Target "Motif:(TTTCTT)n" 1 33
1 RepeatMasker similarity 4291 4328 8.4 + . Target "Motif:(AT)n" 1 38
1 RepeatMasker similarity 5680 5702 9.3 + . Target "Motif:(T)n" 1 23
1 RepeatMasker similarity 8669 8699 0.0 + . Target "Motif:(CT)n" 1 31
1 RepeatMasker similarity 9961 10030 20.7 + . Target "Motif:(AT)n" 1 70
1 RepeatMasker similarity 10814 10885 28.7 + . Target "Motif:(AT)n" 1 71
1 RepeatMasker similarity 11915 11960 12.0 + . Target "Motif:(ATC)n" 1 46
1 RepeatMasker similarity 11985 12001 0.0 + . Target "Motif:(GAA)n" 1 17
1 RepeatMasker similarity 12875 12917 16.5 + . Target "Motif:(TTCTTG)n" 1 44
1 RepeatMasker similarity 12985 12997 21.8 + . Target "Motif:(TGGTTTT)n" 1 36
1 RepeatMasker similarity 12998 13021 8.9 + . Target "Motif:(T)n" 1 24
1 RepeatMasker similarity 13346 13368 4.5 + . Target "Motif:(AG)n" 1 23
1 RepeatMasker similarity 15436 15469 10.8 + . Target "Motif:(ATT)n" 1 32
1 RepeatMasker similarity 15872 15895 4.3 + . Target "Motif:(TA)n" 1 25
1 RepeatMasker similarity 16804 16838 0.0 + . Target "Motif:(TTG)n" 1 31
1 RepeatMasker similarity 17009 17256 7.3 + . Target "Motif:rnd-5_family-1313" 1203 1449
1 RepeatMasker similarity 17256 17735 6.9 + . Target "Motif:rnd-5_family-1313" 1230 1808
1 RepeatMasker similarity 17817 18078 17.5 + . Target "Motif:rnd-4_family-1372" 2109 2389
1 RepeatMasker similarity 18100 18642 3.0 + . Target "Motif:rnd-5_family-843" 1 549
1 RepeatMasker similarity 18661 18731 2.8 + . Target "Motif:rnd-5_family-843" 1156 1226
1 RepeatMasker similarity 20510 20557 19.0 + . Target "Motif:(AT)n" 1 50
1 RepeatMasker similarity 23109 23167 17.3 + . Target "Motif:GA-rich" 1 52
1 RepeatMasker similarity 34438 34456 5.6 + . Target "Motif:(TTG)n" 1 19
1 RepeatMasker similarity 37736 37777 0.0 + . Target "Motif:(GAA)n" 1 41
1 RepeatMasker similarity 37790 37825 6.9 + . Target "Motif:GA-rich" 1 34
1 RepeatMasker similarity 41361 41385 0.0 + . Target "Motif:(TA)n" 1 25
1 RepeatMasker similarity 41549 41567 5.5 + . Target "Motif:(TA)n" 1 19
1 RepeatMasker similarity 41716 41760 30.0 + . Target "Motif:A-rich" 1 45
1 RepeatMasker similarity 42444 42488 18.2 + . Target "Motif:(T)n" 1 45
1 RepeatMasker similarity 43659 43713 30.3 + . Target "Motif:(AATTTT)n" 1 54
1 RepeatMasker similarity 46523 46564 22.0 + . Target "Motif:(TTC)n" 1 42
1 RepeatMasker similarity 46832 46864 18.1 + . Target "Motif:A-rich" 1 32
1 RepeatMasker similarity 47067 47290 17.8 - . Target "Motif:rnd-5_family-3415" 1427 1673
1 RepeatMasker similarity 49387 49415 15.3 + . Target "Motif:(TTCTT)n" 1 30
1 RepeatMasker similarity 50433 50515 29.6 + . Target "Motif:(CAG)n" 1 83
1 RepeatMasker similarity 53368 53434 20.4 + . Target "Motif:(TAATTTG)n" 1 69
1 RepeatMasker similarity 55677 55919 7.9 + . Target "Motif:rnd-5_family-7947" 1 242
1 RepeatMasker similarity 55880 56020 20.0 + . Target "Motif:rnd-5_family-2172" 1 117
1 RepeatMasker similarity 56021 56236 14.9 + . Target "Motif:rnd-5_family-1066" 747 946
1 RepeatMasker similarity 56237 56293 20.0 + . Target "Motif:rnd-5_family-2172" 118 165
1 RepeatMasker similarity 56311 56576 6.8 + . Target "Motif:rnd-5_family-4425" 1129 1398
1 RepeatMasker similarity 59061 59095 19.6 + . Target "Motif:A-rich" 1 35
1 RepeatMasker similarity 61867 61893 16.6 + . Target "Motif:(T)n" 1 27
1 RepeatMasker similarity 62347 62372 0.0 + . Target "Motif:(AAG)n" 1 26
1 RepeatMasker similarity 62392 62803 16.2 + . Target "Motif:rnd-5_family-2965" 1 531
1 RepeatMasker similarity 63507 63531 0.0 + . Target "Motif:(TCTTTC)n" 1 25
1 RepeatMasker similarity 64859 64872 27.7 + . Target "Motif:A-rich" 1 37
#truncated file for visualization of gff
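One simple use of the GFF is to tally how many hits and how many bases each repeat name accounts for. The sketch below is one way to do that with awk, assuming the layout shown above (start in column 4, end in column 5, and the quoted "Motif:" name as the tenth whitespace-separated field). `gff_repeat_summary` is a hypothetical helper name, and the input file name TAIR10_chr_all.fas.out.gff is an assumption based on RepeatMasker's usual output naming.

```shell
# gff_repeat_summary is a hypothetical helper (not part of RepeatMasker).
# Output columns (tab-separated): repeat name, number of hits, total bp covered.
gff_repeat_summary() {
    awk '
        /^#/ { next }                        # skip the ## header and comment lines
        NF >= 10 {
            name = $10
            gsub(/"|Motif:/, "", name)       # strip the quotes and the Motif: prefix
            count[name]++
            bp[name] += $5 - $4 + 1          # GFF coordinates are 1-based and inclusive
        }
        END {
            for (n in count) printf "%s\t%d\t%d\n", n, count[n], bp[n]
        }
    ' "$1" | sort -t "$(printf '\t')" -k3,3nr -k1,1
}

# Assumed output name; adjust if your GFF is called something else.
if [ -f TAIR10_chr_all.fas.out.gff ]; then
    gff_repeat_summary TAIR10_chr_all.fas.out.gff | head
fi
```

Rows are sorted by total bp, largest first. Overlapping hits are counted separately, so the bp totals here can differ slightly from the summary table.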