Clean up the MAKER GFF file

First, remove the FASTA sequences from the GFF file

grep -n "##FASTA" maker_output.gff3
# this returns the line number where "##FASTA" occurs; let's say it is line 10500,
# which means you need to keep everything before that line
head -n 10499 maker_output.gff3 > maker_output_noseq.gff3
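
The lookup and the truncation can also be done in one pass, without copying the line number by hand. This is a minimal sketch, assuming the `##FASTA` directive starts at the beginning of a line; the function name `strip_fasta` is hypothetical and not part of MAKER.

```shell
# strip_fasta (hypothetical helper): print everything that precedes the
# "##FASTA" directive. If the directive is absent, the whole file is printed.
strip_fasta() {
    awk '/^##FASTA/ { exit } { print }' "$1"
}

# Example usage, with the file names from the step above:
# strip_fasta maker_output.gff3 > maker_output_noseq.gff3
```

Stopping at the directive itself, rather than at a hard-coded line number, means the same command can be reused on other MAKER outputs.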

Second, remove features that are not standard to GFF3 (alternative predictions that are not in the final set)

# next, keep only those features that have "maker" in the second field.
# You can't select them using grep, as that would mess up the GFF3 structure;
# the other way is to exclude everything that is not "maker".
# So, get the list of "words" (sources) that you want to exclude:

grep -v "^#" maker_output_noseq.gff3 | cut -f 2 |sort | uniq

# every word returned by this command, other than "maker", should be excluded
# from your final output (the entire line),
# so build a command to exclude them using the example below

file=maker_output_noseq.gff3
awk '$2 !~ /augustus/' $file > ${file}.1
awk '$2 !~ /blastn/' ${file}.1 > ${file}.2
awk '$2 !~ /blastx/' ${file}.2 > ${file}.3
awk '$2 !~ /est2genome/' ${file}.3 > ${file}.4
awk '$2 !~ /genemark/' ${file}.4 > ${file}.5
awk '$2 !~ /protein2genome/' ${file}.5 > ${file}.6
awk '$2 !~ /snap/' ${file}.6 > ${file}.7
rm ${file}.1 ${file}.2 ${file}.3 ${file}.4 ${file}.5 ${file}.6
mv ${file}.7 ${file%.*}_clean.gff3
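
The chain of awk commands above can also be collapsed into a single pass that needs no intermediate files. The following is only a sketch: `drop_sources` is a hypothetical helper name, and it assumes the source names contain no regular-expression metacharacters. As in the chain above, each name is matched as a substring of the second column, so `augustus` also removes `augustus_masked`.

```shell
# drop_sources (hypothetical helper): print every line of a GFF3 file whose
# second, tab-separated column matches none of the given source names.
# Comment lines contain no tabs, so they pass through unchanged.
# Call it with at least one source name.
drop_sources() {
    infile=$1
    shift
    # Join the remaining arguments into one regex, e.g. "augustus|blastn|snap".
    pattern=$(IFS='|'; printf '%s' "$*")
    awk -F'\t' -v pat="$pattern" '$2 !~ pat' "$infile"
}

# Example usage, with the sources excluded above:
# drop_sources maker_output_noseq.gff3 augustus blastn blastx est2genome \
#     genemark protein2genome snap > maker_output_noseq_clean.gff3
```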

Finally, make it valid GFF3 using genometools

module load genometools
gt gff3 -sort -tidy -addids yes -o maker-final-output.gff3 maker_output_noseq_clean.gff3
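
As a quick sanity check on the result, the source listing used earlier can be turned into a tally per source. This is a sketch only; `count_sources` is a hypothetical helper name. If the filtering worked, the sources you excluded should no longer appear in the output.

```shell
# count_sources (hypothetical helper): count how many feature lines each
# source (column 2) contributes, skipping comment lines.
count_sources() {
    grep -v '^#' "$1" | cut -f 2 | sort | uniq -c
}

# Example usage, with the final file from the step above:
# count_sources maker-final-output.gff3
```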