Submission Guidelines for Chromosomal level Assemblies to ENA
This is part 3 of the ENA data submission guide. Here we show how to submit a genome assembly and connect it to your existing BioProject, BioExperiments, and SRA datasets. This assumes that you have already completed parts 1 and 2 (BioProject creation and raw data submission). To begin, you will need the following:
- Genome Assembly (scaffolds in fasta format)
- AGP (A Golden Path) file that explains how these scaffolds should be placed and oriented to make Pseudomolecules.
- Chromosome list file (a simple tabular file that lists the chromosome names specified in the AGP file)
- Manifest file, a predefined text file, with the metadata for genome submission.
- Optical map file, if you have one.
All of these files need to be gzipped. To prepare the manifest file, you will need the IDs for your BioProject, SRA runs, etc. We go through preparing each of these files in detail below.
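As a minimal sketch of the compression step: the function below gzips each file it is given while keeping the original, then tests the archive. The function name and the demo file name are placeholders, not part of the ENA tooling.

```shell
#!/bin/bash
# Compress each file for submission while keeping the uncompressed original.
compress_for_ena() {
    local f
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "missing: $f" >&2
            return 1
        fi
        gzip -c "$f" > "$f.gz"        # -c writes to stdout, so the input file is kept
        gzip -t "$f.gz" || return 1   # -t tests the integrity of the new archive
    done
}

# Demo on a throwaway file (hypothetical name) so the sketch runs anywhere.
gz_demo=$(mktemp -d)
printf '>scaf_1\nACGT\n' > "$gz_demo/genome.fasta"
compress_for_ena "$gz_demo/genome.fasta" && ls "$gz_demo"
```

In practice you would call it with your own scaffold, AGP, and chromosome list files.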
1. Prepare your scaffolds:
The scaffold file should be in fasta format and compressed (gzipped). If your scaffolds include chloroplast or mitochondrial genomes, identify them as such in the chromosome list file.
It gets complicated if some scaffolds are not ordered or oriented, so please do not use this method if that is your case. However, it is okay to have sequences that do not belong to any chromosome (unplaced). It is also important that the scaffold names follow the naming convention and match the IDs in the AGP file.
Sequence names should be:
- unique within the submission (fasta and AGP)
- consistent between files
Here is an example scaffold file in fasta format (just one line of sequence provided for brevity)
>scaf_1
AACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAA..
>scaf_2
AAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTA..
>scaf_3
GACCATTTTTATGTCGCGTTCCGCCACACGTGTTTTTGTCCCCGGAGCACCTTAAAGCGGTTCTTGGCCTCCCGCGAG..
>scaf_4
TAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAG..
>scaf_5
CCCTGACACCCGTTTTCGGAATGGGTGACGTGCTGCAACGAAATTGCACGAAACCACCCCAAACATGAGTTTTGGACC..
>scaf_6
GGATTGGGCATGTTCGTTGCGAAAAACGAAGAAATGGTTCCGGTGGCAAAAACTCGTGCTTTGTATGCACCCCGACAA..
>scaf_7
AAGTCCTTCACAAAACGGCGATAGAAGCCAGCAAGTCCTAGGAAACTCCGCACCTGTGTGATAGCCTTTGGCATAGGC..
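The two naming rules above (unique names, consistent between files) can be checked locally before submission. The following is a rough sketch, not an ENA tool: the function names and the tiny demo files are made up for illustration, and it treats every AGP line whose component type is not N or U (the gap types) as a sequence component.

```shell
#!/bin/bash
# List sequence names that occur more than once in a FASTA file.
duplicate_fasta_ids() {
    grep '^>' "$1" | sed 's/^>//' | cut -d ' ' -f 1 | sort | uniq -d
}

# List AGP component IDs (column 6 of non-gap lines) that have no FASTA record.
agp_ids_missing_from_fasta() {
    local fasta="$1" agp="$2" ids
    ids=$(mktemp)
    grep '^>' "$fasta" | sed 's/^>//' | cut -d ' ' -f 1 | sort -u > "$ids"
    awk -F '\t' '!/^#/ && $5 != "N" && $5 != "U" {print $6}' "$agp" |
        sort -u | comm -23 - "$ids"
    rm -f "$ids"
}

# Demo data: scaf_2 is duplicated, and the AGP refers to a scaf_9 that is absent.
name_demo=$(mktemp -d)
printf '>scaf_1\nAACC\n>scaf_2\nGGTT\n>scaf_2\nTTAA\n' > "$name_demo/scaffolds.fasta"
printf 'chr1\t1\t4\t1\tW\tscaf_1\t1\t4\t+\nchr1\t5\t104\t2\tU\t100\tscaffold\tyes\tmap\nchr1\t105\t108\t3\tW\tscaf_9\t1\t4\t-\n' > "$name_demo/genome.agp"

echo "duplicated names:"
duplicate_fasta_ids "$name_demo/scaffolds.fasta"
echo "in AGP but not in FASTA:"
agp_ids_missing_from_fasta "$name_demo/scaffolds.fasta" "$name_demo/genome.agp"
```

Both functions print nothing when the files are consistent.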
2. Format the AGP file to match AGP requirements by ENA/NCBI
The AGP files generated by some programs are not compatible with ENA submission. Moreover, ENA requires different names for scaffolds that are unplaced (explained below). The format description can be found here. You can validate your AGP files with this validation webtool. If you prefer a local, command-line version, it is available to download here. Since our AGP files were created with ALLMAPS, we had three issues: (1) the object name (column 1) was identical to the component_id (column 6); (2) gaps identified as map in the AGP file failed to validate with the error: invalid value for gap_type (column 7); and (3) unplaced scaffolds had an orientation of ?. To fix these issues, we can run the following script.
Also note that ENA no longer accepts unplaced scaffolds in the AGP file; the unplaced scaffolds may remain in the submitted fasta file, but those scaffolds should not be included in the AGP.
script fixAGP.sh
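#!/bin/bash
# Usage: fixAGP.sh input.agp > fixed.agp
# 1. set unknown orientation (?) in column 9 to +
# 2. rename objects in column 1 from scaf_/scaf-alt to scaffold_/scaffold-alt
# 3. replace gap_type map (column 7) with scaffold
# 4. add map as the linkage evidence after the linkage field (yes)
agp="$1"
awk 'BEGIN{OFS=FS="\t"}$9=="?"{$9="+"}{print}' "${agp}" |\
sed '/^scaf_/s/^scaf_/scaffold_/g' |\
sed '/^scaf-alt/s/^scaf-alt/scaffold-alt/g' |\
awk 'BEGIN{OFS=FS="\t"}$7=="map"{$7="scaffold"}{print}' |\
sed 's/\tyes\t/\tyes\tmap/g'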
Run it as:
bash fixAGP.sh genome-allmaps.agp > genome-allmaps-fixed.agp
Now the new AGP file should validate without any errors. If it fails to validate, you will have to go back and fix the errors before proceeding; you will not be able to submit the files otherwise.
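Before turning to the validator, the three ALLMAPS issues listed earlier can be counted locally. This is a rough sketch and not a substitute for the official validator: the function name and the demo file are made up, and the column numbers follow the FIELDS header that ALLMAPS writes.

```shell
#!/bin/bash
# Count AGP lines that show each of the three issues described earlier.
agp_issue_counts() {
    awk -F '\t' '
        /^#/ { next }                                  # skip header comments
        { is_gap = ($5 == "N" || $5 == "U") }          # N and U are the gap component types
        !is_gap && $1 == $6   { same++ }               # object equals component_id
        is_gap && $7 == "map" { gap++ }                # gap_type given as map
        !is_gap && $9 == "?"  { unknown++ }            # orientation given as ?
        END {
            printf "object_equals_component\t%d\n", same
            printf "gap_type_map\t%d\n", gap
            printf "orientation_unknown\t%d\n", unknown
        }
    ' "$1"
}

# Demo: a three-line AGP (made-up coordinates) that shows each issue once.
issue_demo=$(mktemp -d)
printf 'chr1\t1\t4\t1\tW\tscaf_8\t1\t4\t+\nchr1\t5\t104\t2\tU\t100\tmap\tyes\t\nscaf_21\t1\t4\t1\tW\tscaf_21\t1\t4\t?\n' > "$issue_demo/raw.agp"
agp_issue_counts "$issue_demo/raw.agp"
```

Running it on the fixed file should report zero for all three counts.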
The fixed AGP file looks like this:
# COMMENT: Generated by ALLMAPS v0.7.6 (2019-05-31)
# COMMAND: python -m jcvi.assembly.allmaps path combined.bed /work/LAS/mhufford-lab/arnstrm/Canu_1.8/genetic_maps/B73_plat/B73_index/B73.PLATINUM.scaffolds.fasta
# FIELDS: object, object_beg, object_end, part_number, component_type, component_id/gap_length, component_beg/gap_type, component_end/linkage, orientation/linkage_evidence
chr1 1 137725724 1 W scaf_8 1 137725724 +
chr1 137725725 137725824 2 U 100 scaffold yes map
chr1 137725825 250635676 3 W scaf_10 1 112909852 -
chr1 250635677 250635776 4 U 100 scaffold yes map
chr1 250635777 308452471 5 W scaf_12 1 57816695 -
chr2 1 736277 1 W scaf_20 1 736277 -
chr2 736278 736377 2 U 100 scaffold yes map
chr2 736378 243675191 3 W scaf_1 1 242938814 +
chr3 1 238017767 1 W scaf_2 1 238017767 +
chr4 1 17691841 1 W scaf_18 1 17691841 -
chr4 17691842 17691941 2 U 100 scaffold yes map
chr4 17691942 250330460 3 W scaf_3 1 232638519 -
chr5 1 200585289 1 W scaf_4 1 200585289 -
chr5 200585290 200585389 2 U 100 scaffold yes map
chr5 200585390 226353449 3 W scaf_16 1 25768060 -
chr6 1 16971076 1 W scaf_19 1 16971076 -
chr6 16971077 16971176 2 U 100 scaffold yes map
chr6 16971177 58763533 3 W scaf_14 1 41792357 +
chr6 58763534 58763633 4 U 100 scaffold yes map
chr6 58763634 181357234 5 W scaf_9 1 122593601 +
chr7 1 57105421 1 W scaf_13 1 57105421 -
chr7 57105422 57105521 2 U 100 scaffold yes map
chr7 57105522 158340868 3 W scaf_11 1 101235347 +
chr7 158340869 158340968 4 U 100 scaffold yes map
chr7 158340969 185808916 5 W scaf_15 1 27467948 +
chr8 1 160853524 1 W scaf_6 1 160853524 +
chr8 160853525 160853624 2 U 100 scaffold yes map
chr8 160853625 182411202 3 W scaf_17 1 21557578 -
chr9 1 163004744 1 W scaf_5 1 163004744 +
chr10 1 152435371 1 W scaf_7 1 152435371 +
scaffold_21 1 712123 1 W scaf_21 1 712123 +
scaffold_22 1 570058 1 W scaf_22 1 570058 +
scaffold_23 1 435198 1 W scaf_23 1 435198 +
(partial file)
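The example above still lists unplaced objects (scaffold_21 to scaffold_23). Since, as noted earlier, ENA no longer accepts unplaced scaffolds in the AGP file, one way to drop them is sketched below. It assumes that chromosome objects are named chr1, chr2, and so on, as in this example; adjust the pattern if yours differ. The function name and demo file are hypothetical.

```shell
#!/bin/bash
# Keep comment lines and lines whose object (column 1) is a chromosome.
# Assumes chromosome objects are named chr1, chr2, ... as in the example above.
drop_unplaced() {
    awk -F '\t' '/^#/ || $1 ~ /^chr[0-9]+$/' "$1"
}

# Demo: one placed and one unplaced object (made-up coordinates).
placed_demo=$(mktemp -d)
printf '# COMMENT: demo\nchr10\t1\t4\t1\tW\tscaf_7\t1\t4\t+\nscaffold_21\t1\t4\t1\tW\tscaf_21\t1\t4\t+\n' > "$placed_demo/fixed.agp"
drop_unplaced "$placed_demo/fixed.agp" > "$placed_demo/placed-only.agp"
cat "$placed_demo/placed-only.agp"
```

The unplaced sequences themselves stay in the fasta file; only their AGP lines are removed.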
3. Prepare Chromosome list file:
This file is the simplest to generate (assuming you don't have organellar genomes). In our case, we had 10 chromosomes, and the list file was generated as follows:
for f in {1..10}; do
echo -e "chr$f\t$f\tChromosome";
done > chr-list.txt
The contents of chr-list.txt:
chr1 1 Chromosome
chr2 2 Chromosome
chr3 3 Chromosome
chr4 4 Chromosome
chr5 5 Chromosome
chr6 6 Chromosome
chr7 7 Chromosome
chr8 8 Chromosome
chr9 9 Chromosome
chr10 10 Chromosome
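Because the names in the first column must match the object names in the AGP file, a quick cross-check can catch mismatches early. The sketch below is only an illustration with a made-up function name and demo files; it compares column 1 of the chromosome list with column 1 of the AGP file.

```shell
#!/bin/bash
# Print names that appear in only one of the two files:
# chromosome list (column 1) and AGP objects (column 1).
compare_chr_list_to_agp() {
    local list="$1" agp="$2" a b
    a=$(mktemp)
    b=$(mktemp)
    cut -f 1 "$list" | sort -u > "$a"
    grep -v '^#' "$agp" | cut -f 1 | sort -u > "$b"
    comm -23 "$a" "$b" | sed 's/^/only in chromosome list: /'
    comm -13 "$a" "$b" | sed 's/^/only in AGP: /'
    rm -f "$a" "$b"
}

# Demo: chr2 is missing from the AGP, and the AGP has an extra object.
list_demo=$(mktemp -d)
printf 'chr1\t1\tChromosome\nchr2\t2\tChromosome\n' > "$list_demo/chr-list.txt"
printf 'chr1\t1\t4\t1\tW\tscaf_8\t1\t4\t+\nscaffold_21\t1\t4\t1\tW\tscaf_21\t1\t4\t+\n' > "$list_demo/genome.agp"
compare_chr_list_to_agp "$list_demo/chr-list.txt" "$list_demo/genome.agp"
```

No output means the two files agree on the object names.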
4. Generate a manifest file
A manifest file ties together all the information that is required for the submission. Again, it is a simple text file with a value column and an attributes column. The value can only be one of the specific terms allowed by ENA. If you have more than one genome to submit, the easiest way is to put all the information in tabular format and generate the manifest files with a script. This avoids typos and keeps the files consistent.
In our case, we had 27 genomes to submit, so all the required information was added to a single tabular text file (see example here: completed_manifest-trimmed.txt).
This is then broken down for each genome with two short loops:
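# Row 1 holds the field names; rows 2-28 hold one genome each.
# Pair the header with each genome row and transpose into two columns.
for i in {2..28}; do
    awk -v x=${i} 'NR==1 || NR==x' completed_manifest-trimmed.txt |\
    datamash transpose > manifest_${i}.txt;
done
# Rename each manifest after its genome (e.g. B73 from Zm-B73-REFERENCE-NAM-5.0).
# This only prints the mv commands; remove "echo" to actually rename the files.
for f in manifest_*.txt; do
    g=$(grep "ASSEMBLYNAME" "$f" |cut -f 2 |cut -f 2 -d "-");
    echo mv "$f" manifest_${g}.txt;
done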
An example manifest file looks like this:
SAMPLE ERS3371164
STUDY PRJEB32225
ASSEMBLYNAME Zm-B73-REFERENCE-NAM-5.0
ASSEMBLY_TYPE clone or isolate
COVERAGE 83
PROGRAM Canu 1.8
PLATFORM PacBio SEQUEL
MINGAPLENGTH 13
MOLECULETYPE genomic DNA
DESCRIPTION whole genome assembly of B73
RUN_REF ERR3288215,ERR3288215,ERR3288216
FASTA B73.PLATINUM.scaffolds-v1.fasta.gz
AGP B73.PLATINUM-pg_and_ibm-based_v1.agp.gz
CHROMOSOME_LIST chr-list.txt.gz
(all run ref info is not included for brevity)
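Since the manifest names the data files, it can be worth confirming that each one exists and is a valid gzip archive before running the submission tool. The sketch below is an illustration only: the function name is made up, and it assumes the data files sit in the same directory as the manifest.

```shell
#!/bin/bash
# For each file field in a manifest, check that the named file exists
# next to the manifest (an assumption) and is a readable gzip archive.
check_manifest_files() {
    local manifest="$1" dir field path status=0
    dir=$(dirname "$manifest")
    while read -r field path; do
        case "$field" in
            FASTA|AGP|CHROMOSOME_LIST)
                if [ ! -f "$dir/$path" ]; then
                    echo "$field: $path not found"
                    status=1
                elif ! gzip -t "$dir/$path" 2>/dev/null; then
                    echo "$field: $path is not valid gzip"
                    status=1
                else
                    echo "$field: $path ok"
                fi
                ;;
        esac
    done < "$manifest"
    return "$status"
}

# Demo: the FASTA exists, the AGP does not (file names are placeholders).
mf_demo=$(mktemp -d)
printf '>scaf_1\nACGT\n' | gzip -c > "$mf_demo/genome.fasta.gz"
printf 'STUDY\tPRJEB32225\nFASTA\tgenome.fasta.gz\nAGP\tgenome.agp.gz\n' > "$mf_demo/manifest_demo.txt"
check_manifest_files "$mf_demo/manifest_demo.txt" || echo "fix the manifest before submitting"
```

The function returns a non-zero status if any file is missing or corrupt, so it can gate a submission loop.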
5. Submit!
You will need the Java command-line program webin-cli for this. It can be obtained from the enasequence GitHub repo:
wget https://github.com/enasequence/webin-cli/releases/download/v1.8.11/webin-cli-1.8.11.jar
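You can validate your submission before actually submitting it. The commands below were run with an older release of the jar (1.8.6, as the output shows); substitute the path to the jar you downloaded.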
java -jar ../webin-cli-1.8.6.jar \
-validate \
-ascp \
-manifest=manifest_CML52.txt \
-context=genome \
-username=YOURUSERNAME \
-password=YOURPASSWORD
The stdout:
INFO : Your application version is 1.8.6
INFO : A new application version is available. Please download the latest version 1.8.11 from https://github.com/enasequence/webin-cli/releases
INFO : Creating report file: /work/LAS/mhufford-lab/arnstrm/Canu_1.8/EBI-submission-files/manifest-reports/./webin-cli.report
INFO : The submission has been validated successfully.
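With many manifests (27 in our case), the same call can be wrapped in a loop. The sketch below uses only the webin-cli options shown above; the function name and the variables WEBIN_JAR, WEBIN_USER, WEBIN_PASS, and DRY_RUN are hypothetical names chosen for this example. With DRY_RUN set, the commands are printed rather than executed, so they can be reviewed first.

```shell
#!/bin/bash
# Print (dry run) or execute one webin-cli call per manifest.
# WEBIN_JAR, WEBIN_USER, WEBIN_PASS and DRY_RUN are hypothetical variable names.
webin_all() {
    local action="$1"      # -validate or -submit
    shift
    local m
    for m in "$@"; do
        ${DRY_RUN:+echo} java -jar "${WEBIN_JAR:-webin-cli-1.8.11.jar}" \
            "$action" -ascp \
            -manifest="$m" \
            -context=genome \
            -username="${WEBIN_USER:-YOURUSERNAME}" \
            -password="${WEBIN_PASS:-YOURPASSWORD}"
    done
}

DRY_RUN=1   # set to an empty value to actually run java
webin_all -validate manifest_B73.txt manifest_CML52.txt
```

Keeping the credentials in variables also avoids retyping them for every manifest.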
Once validated you can submit the genome as follows:
java -jar ../webin-cli-1.8.6.jar \
-submit \
-ascp \
-manifest=manifest_CML52.txt \
-context=genome \
-username=YOURUSERNAME \
-password=YOURPASSWORD
The stdout:
INFO : Your application version is 1.8.6
INFO : Creating report file: /work/LAS/mhufford-lab/arnstrm/Canu_1.8/EBI-submission-files/Scaffolds/./webin-cli.report
INFO : Submission has not been validated previously.
INFO : The submission has been validated successfully.
INFO : Invoking: ascp --file-checksum=md5 -d --mode=send --overwrite=always -QT -l300M --host=webin.ebi.ac.uk --user="YOURUSERNAME" --src-base="/work/LAS/mhufford-lab/arnstrm/Canu_1.8/EBI-submission-files/Scaffolds" --file-list="/tmp/FILE14052397021471900767LIST" "webin-cli/genome/Zm-CML52-REFERENCE-NAM-1.0"
Completed: 650401K bytes transferred in 19 seconds
(271681K bits/sec), in 3 files.
INFO : Files have been uploaded to webin.ebi.ac.uk.
INFO : The submission has been completed successfully. The following analysis accession was assigned to the submission: ERZ1028866
After a day or two, you should get a confirmation email stating that your submission has been received and an accession has been assigned.
6. Optical maps
Many genome assembly protocols now use optical maps to help guide the construction of scaffolds. The following protocol can be used to submit an optical map along with your genome. These protocols can also be reviewed here.
To submit an analysis programmatically, two XML files must be generated to describe the submission.
- Analysis XML - used for describing the analysis you would like to submit
- Submission XML - tells ENA how to process this submission
In this example, we are submitting the optical map file for the B104 genome. We will upload the optical map file B104_golden_maps.cmap.gz via Aspera ascp (described in Section 2, Submitting raw datasets to ENA); then we will submit both XML files.
Submit the optical map file:
First, get a checksum for your optical map (md5 is the macOS command; on Linux the equivalent is md5sum):
md5 B104_golden_maps.cmap.gz
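The checksum command differs between systems, so a small wrapper can help if the same steps are run on both macOS and Linux. This is a sketch with a made-up function name; it prints only the digest, which is the value that goes into the checksum attribute of the analysis XML.

```shell
#!/bin/bash
# Print the MD5 digest of a file using whichever tool is installed.
md5_of() {
    if command -v md5sum >/dev/null 2>&1; then
        md5sum "$1" | cut -d ' ' -f 1    # GNU coreutils (Linux)
    elif command -v md5 >/dev/null 2>&1; then
        md5 -q "$1"                      # BSD/macOS; -q prints only the digest
    else
        echo "no md5 tool found" >&2
        return 1
    fi
}

# Demo on a throwaway file; for the real submission, pass your .cmap.gz file.
md5_demo=$(mktemp)
printf 'hello\n' > "$md5_demo"
md5_of "$md5_demo"
```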
Next, submit your optical map using Aspera:
ascp -QT -l300M -L- B104_golden_maps.cmap.gz Webin-00000@webin.ebi.ac.uk:.
Here is the format for both the optical map XML and the submission XML.
B104_optical_map.xml:
<ANALYSIS_SET>
    <ANALYSIS alias="B104_BioNanao_map">
        <TITLE>B104 BioNano Optical Mapping data</TITLE>
        <DESCRIPTION>Zea mays cultivar B104 Optical Mapping data produced by BioNano</DESCRIPTION>
        <STUDY_REF accession="PRJEB44462"/>
        <SAMPLE_REF accession="ERS6402692"/>
        <ANALYSIS_TYPE>
            <GENOME_MAP>
                <PROGRAM>Saphyr</PROGRAM>
                <PLATFORM>BioNano</PLATFORM>
            </GENOME_MAP>
        </ANALYSIS_TYPE>
        <FILES>
            <FILE filename="B104_golden_maps.cmap.gz" filetype="BioNano_native" checksum_method="MD5" checksum="19274394578djdhw9184hg9"/>
        </FILES>
    </ANALYSIS>
</ANALYSIS_SET>
submission.xml:
<SUBMISSION>
    <ACTIONS>
        <ACTION>
            <ADD/>
        </ACTION>
    </ACTIONS>
</SUBMISSION>
Note that nothing needs to be added to the submission.xml file beyond what is written above.
Test the submission to make sure all files are correct:
curl -u Webin-00000:PaSSW0rd -F "SUBMISSION=@submission.xml" -F "ANALYSIS=@B104_optical_map.xml" "https://wwwdev.ebi.ac.uk/ena/submit/drop-box/submit/"
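To know whether the submission was successful, look at the first line of the receipt XML that the service returns: the success attribute of the RECEIPT element is either true or false.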
Once you are happy with the result of the test submission, you can use the curl command again, this time with the production service. Simply change wwwdev.ebi.ac.uk in the URL to www.ebi.ac.uk:
curl -u Webin-00000:PaSSW0rd -F "SUBMISSION=@submission.xml" -F "ANALYSIS=@B104_optical_map.xml" "https://www.ebi.ac.uk/ena/submit/drop-box/submit/"
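Either curl call prints the server's response to the terminal. If that response is saved to a file (for example by adding > receipt.xml to the command), the outcome can be checked in a script. The sketch below assumes the response is a receipt XML whose RECEIPT element carries a success attribute; the function name is made up, and the two sample receipts are invented stand-ins for a real response.

```shell
#!/bin/bash
# Return 0 if a saved receipt reports success="true", non-zero otherwise.
# Assumes the response contains a RECEIPT element with a success attribute.
receipt_ok() {
    grep -q '<RECEIPT[^>]*success="true"' "$1"
}

# Demo with two made-up receipts (not real ENA output).
rc_demo=$(mktemp -d)
printf '<RECEIPT receiptDate="2021-01-01T00:00:00" success="true">\n</RECEIPT>\n' > "$rc_demo/good.xml"
printf '<RECEIPT receiptDate="2021-01-01T00:00:00" success="false">\n</RECEIPT>\n' > "$rc_demo/bad.xml"

for r in "$rc_demo/good.xml" "$rc_demo/bad.xml"; do
    if receipt_ok "$r"; then
        echo "$(basename "$r"): accepted"
    else
        echo "$(basename "$r"): rejected, read the messages in the receipt"
    fi
done
```

Running the check against the test service first keeps failed attempts out of the production service.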