Submission Guidelines for Chromosome-Level Assemblies to ENA

This is part 3 of the ENA data submission series. Here we show how to submit the genome and connect it to your existing BioProject, BioExperiments, and SRA datasets. It assumes that you have already completed parts 1 and 2 (BioProject creation and raw data submission). To begin, you will need the following files:

Genome Assembly (scaffolds in fasta format)
AGP (A Golden Path) file that explains how these scaffolds should be placed and oriented to make pseudomolecules.
Chromosome list file (a simple tabular file that lists the chromosome names used in the AGP file)
Manifest file, a predefined text file with the metadata for the genome submission.
Optical map file, if you have one.

All of these files need to be gzipped. To prepare the manifest file, you will need the IDs for your BioProject, SRA runs, etc. We will go through preparing each of these files in detail below.
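A minimal sketch of the compression step, assuming a standard gzip is installed. The helper name compress_inputs is hypothetical, and you would pass it your own scaffold, AGP, and chromosome list files.

```shell
# compress_inputs FILE...
# Hypothetical helper: gzip each file (keeping the original) and verify the result.
compress_inputs() {
  local f
  for f in "$@"; do
    # -c writes to stdout, so the uncompressed original is left in place
    gzip -c "$f" > "$f.gz" || return 1
    # -t tests the integrity of the compressed file
    gzip -t "$f.gz" || return 1
    echo "compressed: $f.gz"
  done
}
```

For example, `compress_inputs chr-list.txt` would write chr-list.txt.gz next to the original file.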

1. Prepare your scaffolds:

The scaffold file should be in fasta format and compressed (gzipped). If your scaffolds include chloroplast or mitochondrial genomes, these should be indicated in the chromosome list file. The submission gets complicated if some scaffolds are not ordered or oriented, so please do not use this method if that is your case. However, it is fine to have sequences that do not belong to any chromosome (unplaced). It is also important that the scaffold names follow the naming convention and match the IDs in the AGP file. Sequence names should be:

unique within the submission (fasta and AGP)
consistent between files
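The two naming rules above can be checked before submission. The following is a rough sketch, assuming uncompressed fasta and AGP files and a tab-delimited AGP; the function name check_names is hypothetical.

```shell
# check_names FASTA AGP
# Hypothetical helper: reports duplicated FASTA names and AGP component ids
# that have no matching FASTA record. Returns non-zero if anything is reported.
check_names() {
  local fasta="$1" agp="$2" names dups missing status=0
  names=$(mktemp)
  # Sequence name = header text after ">" up to the first whitespace
  grep '^>' "$fasta" | sed 's/^>//; s/[[:space:]].*$//' | sort > "$names"
  dups=$(uniq -d "$names")
  if [ -n "$dups" ]; then
    echo "duplicate FASTA names:"
    echo "$dups"
    status=1
  fi
  # Component ids sit in column 6 of non-gap lines (component type is not N or U)
  missing=$(awk -F'\t' -v names="$names" '
    BEGIN { while ((getline n < names) > 0) seen[n] = 1 }
    /^#/ { next }
    NF >= 6 && $5 != "N" && $5 != "U" && !($6 in seen) { print $6 }' "$agp" | sort -u)
  rm -f "$names"
  if [ -n "$missing" ]; then
    echo "AGP components missing from FASTA:"
    echo "$missing"
    status=1
  fi
  return "$status"
}
```

A silent run with exit status 0 means the names are unique and every AGP component was found in the fasta file.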

Here is an example scaffold file in fasta format (just one line of sequence provided for brevity)

>scaf_1
AACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAA..
>scaf_2
AAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTA..
>scaf_3
GACCATTTTTATGTCGCGTTCCGCCACACGTGTTTTTGTCCCCGGAGCACCTTAAAGCGGTTCTTGGCCTCCCGCGAG..
>scaf_4
TAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAG..
>scaf_5
CCCTGACACCCGTTTTCGGAATGGGTGACGTGCTGCAACGAAATTGCACGAAACCACCCCAAACATGAGTTTTGGACC..
>scaf_6
GGATTGGGCATGTTCGTTGCGAAAAACGAAGAAATGGTTCCGGTGGCAAAAACTCGTGCTTTGTATGCACCCCGACAA..
>scaf_7
AAGTCCTTCACAAAACGGCGATAGAAGCCAGCAAGTCCTAGGAAACTCCGCACCTGTGTGATAGCCTTTGGCATAGGC..

2. Format the AGP file to match AGP requirements by ENA/NCBI

The AGP files generated by some programs are not compatible with ENA submission. Moreover, ENA requires different names for scaffolds that are unplaced (explained below). The format description can be found here. You can validate your AGP files with this validation webtool. If you prefer a local, command-line version, it is available to download here. Since our AGP files were created with ALLMAPS, we ran into three issues: (1) the object name (column 1) was identical to the component_id (column 6); (2) gaps identified as map in the AGP file failed to validate with the error "invalid value for gap_type (column 7)"; and (3) unplaced scaffolds had an orientation of ?. To fix these issues, we can run the following script.

Also note that ENA no longer accepts unplaced scaffolds in the AGP file; the unplaced scaffolds may remain in the submitted fasta file, but those scaffolds should not be included in the AGP.

Script fixAGP.sh:

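#!/bin/bash
# Usage: fixAGP.sh input.agp > fixed.agp
# 1. replace the orientation "?" (column 9) with "+"
# 2. rename objects at the start of a line from scaf_/scaf-alt to
#    scaffold_/scaffold-alt so they differ from the component_id (column 6)
# 3. change gap_type "map" (column 7) to "scaffold"
# 4. add "map" as linkage evidence after the linkage value "yes"
agp="$1"
awk 'BEGIN{OFS=FS="\t"}$9=="?"{$9="+"}{print}' "${agp}" |\
sed '/^scaf_/s/^scaf_/scaffold_/g' |\
sed '/^scaf-alt/s/^scaf-alt/scaffold-alt/g' |\
awk 'BEGIN{OFS=FS="\t"}$7=="map"{$7="scaffold"}{print}' |\
sed 's/\tyes\t/\tyes\tmap/g'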

Run it as:

fixAGP.sh genome-allmaps.agp > genome-allmaps-fixed.agp

Now the new AGP file should validate without any errors. If it fails to validate, you will have to go back and fix the errors before proceeding. You will not be able to submit the files without fixing them.
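As a quick local pre-check before running the official validator, the three ALLMAPS issues listed above can be flagged with awk. This is only a sketch, assuming a tab-delimited AGP; the function name agp_precheck is hypothetical, and it does not replace the ENA/NCBI validation tools.

```shell
# agp_precheck AGP
# Hypothetical helper: prints one "line N: reason" message per problem found.
agp_precheck() {
  awk -F'\t' '
    /^#/ { next }    # skip comment lines
    # component lines: object name must differ from the component id
    $5 != "N" && $5 != "U" && $1 == $6 { print "line " NR ": object name equals component_id (" $1 ")" }
    # gap lines: "map" in column 7 is the gap_type value that failed validation
    ($5 == "N" || $5 == "U") && $7 == "map" { print "line " NR ": gap_type is map" }
    # component lines: orientation "?" in column 9
    $5 != "N" && $5 != "U" && $9 == "?" { print "line " NR ": orientation is ?" }
  ' "$1"
}
```

Running it on the raw ALLMAPS output and again on the fixed file is one way to confirm that the script changed what it was meant to change.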

The fixed AGP file looks like this:

# COMMENT: Generated by ALLMAPS v0.7.6 (2019-05-31)
# COMMAND: python -m jcvi.assembly.allmaps path combined.bed /work/LAS/mhufford-lab/arnstrm/Canu_1.8/genetic_maps/B73_plat/B73_index/B73.PLATINUM.scaffolds.fasta
# FIELDS: object, object_beg, object_end, part_number, component_type, component_id/gap_length, component_beg/gap_type, component_end/linkage, orientation/linkage_evidence
chr1    1       137725724       1       W       scaf_8  1       137725724       +
chr1    137725725       137725824       2       U       100     scaffold        yes     map
chr1    137725825       250635676       3       W       scaf_10 1       112909852       -
chr1    250635677       250635776       4       U       100     scaffold        yes     map
chr1    250635777       308452471       5       W       scaf_12 1       57816695        -
chr2    1       736277  1       W       scaf_20 1       736277  -
chr2    736278  736377  2       U       100     scaffold        yes     map
chr2    736378  243675191       3       W       scaf_1  1       242938814       +
chr3    1       238017767       1       W       scaf_2  1       238017767       +
chr4    1       17691841        1       W       scaf_18 1       17691841        -
chr4    17691842        17691941        2       U       100     scaffold        yes     map
chr4    17691942        250330460       3       W       scaf_3  1       232638519       -
chr5    1       200585289       1       W       scaf_4  1       200585289       -
chr5    200585290       200585389       2       U       100     scaffold        yes     map
chr5    200585390       226353449       3       W       scaf_16 1       25768060        -
chr6    1       16971076        1       W       scaf_19 1       16971076        -
chr6    16971077        16971176        2       U       100     scaffold        yes     map
chr6    16971177        58763533        3       W       scaf_14 1       41792357        +
chr6    58763534        58763633        4       U       100     scaffold        yes     map
chr6    58763634        181357234       5       W       scaf_9  1       122593601       +
chr7    1       57105421        1       W       scaf_13 1       57105421        -
chr7    57105422        57105521        2       U       100     scaffold        yes     map
chr7    57105522        158340868       3       W       scaf_11 1       101235347       +
chr7    158340869       158340968       4       U       100     scaffold        yes     map
chr7    158340969       185808916       5       W       scaf_15 1       27467948        +
chr8    1       160853524       1       W       scaf_6  1       160853524       +
chr8    160853525       160853624       2       U       100     scaffold        yes     map
chr8    160853625       182411202       3       W       scaf_17 1       21557578        -
chr9    1       163004744       1       W       scaf_5  1       163004744       +
chr10   1       152435371       1       W       scaf_7  1       152435371       +
scaffold_21     1       712123  1       W       scaf_21 1       712123  +
scaffold_22     1       570058  1       W       scaf_22 1       570058  +
scaffold_23     1       435198  1       W       scaf_23 1       435198  +

(partial file)
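Because ENA no longer accepts unplaced scaffolds in the AGP file (see the note before the script), records such as scaffold_21 above would need to be removed before submission. One possible way to do this is sketched below, assuming that every chromosome object name starts with a common prefix ("chr" in this example); the function name drop_unplaced is hypothetical.

```shell
# drop_unplaced AGP [PREFIX]
# Hypothetical helper: keeps comment lines and records whose object name
# (column 1) starts with PREFIX (default: chr); everything else is dropped.
drop_unplaced() {
  awk -F'\t' -v p="${2:-chr}" '/^#/ || index($1, p) == 1' "$1"
}
```

For example, `drop_unplaced genome-allmaps-fixed.agp > genome-allmaps-placed.agp`, where the output file name is a placeholder. The unplaced scaffolds stay in the fasta file.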

3. Prepare Chromosome list file:

This file is the simplest (assuming you don’t have organellar genomes) and is easy to generate. In our case, we had 10 chromosomes, and the list file was generated as follows:

for f in {1..10}; do
  echo -e "chr$f\t$f\tChromosome";
done > chr-list.txt

The contents of chr-list.txt:

chr1    1       Chromosome
chr2    2       Chromosome
chr3    3       Chromosome
chr4    4       Chromosome
chr5    5       Chromosome
chr6    6       Chromosome
chr7    7       Chromosome
chr8    8       Chromosome
chr9    9       Chromosome
chr10   10      Chromosome
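Since the chromosome list must use the same object names as the AGP file, a small cross-check can catch mismatches early. The sketch below assumes both files are uncompressed and tab-delimited; the function name check_chr_list is hypothetical.

```shell
# check_chr_list CHRLIST AGP
# Hypothetical helper: prints names found in one file but not the other.
# Prints nothing when the two files agree.
check_chr_list() {
  local a b
  a=$(mktemp); b=$(mktemp)
  cut -f1 "$1" | sort -u > "$a"                    # names in the chromosome list
  grep -v '^#' "$2" | cut -f1 | sort -u > "$b"     # object names in the AGP
  # comm -23: lines only in the first file; comm -13: lines only in the second
  comm -23 "$a" "$b" | sed 's/^/in chromosome list but not in AGP: /'
  comm -13 "$a" "$b" | sed 's/^/in AGP but not in chromosome list: /'
  rm -f "$a" "$b"
}
```

Usage would look like `check_chr_list chr-list.txt genome-allmaps-fixed.agp`.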

4. Generate a manifest file

A manifest file ties together all the information required for the submission. Again, it is a simple text file with two columns: the field name and its value. The field names, and for some fields the values, must be terms allowed by ENA. If you have more than one genome to submit, the easiest way is to put all the information in a table and generate the manifest files with a script. This avoids typos and keeps the files consistent.

In our case, we had 27 genomes to submit, so all the required information was added in a single tabular text file (see example here: completed_manifest-trimmed.txt)

This is then broken down for each genome with a simple one-liner:

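# Row 1 holds the field names; rows 2-28 hold one genome each.
# Pair the header with each genome row and transpose into a two-column manifest.
for i in {2..28}; do
  awk -v x=${i} 'NR==1 || NR==x' completed_manifest-trimmed.txt |\
  datamash transpose > manifest_${i}.txt;
done

# Rename each manifest after the second "-"-delimited part of ASSEMBLYNAME.
# This only prints the mv commands; remove "echo" to actually rename the files.
for f in manifest_*.txt; do
  g=$(grep "ASSEMBLYNAME" "$f" |cut -f 2 |cut -f 2 -d "-");
  echo mv "$f" "manifest_${g}.txt";
done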

An example manifest file looks like this:

SAMPLE  ERS3371164
STUDY   PRJEB32225
ASSEMBLYNAME    Zm-B73-REFERENCE-NAM-5.0
ASSEMBLY_TYPE   clone or isolate
COVERAGE        83
PROGRAM Canu 1.8
PLATFORM        PacBio SEQUEL
MINGAPLENGTH    13
MOLECULETYPE    genomic DNA
DESCRIPTION     whole genome assembly of B73
RUN_REF ERR3288215,ERR3288215,ERR3288216
FASTA   B73.PLATINUM.scaffolds-v1.fasta.gz
AGP     B73.PLATINUM-pg_and_ibm-based_v1.agp.gz
CHROMOSOME_LIST chr-list.txt.gz

(all run ref info is not included for brevity)
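Before validating with ENA, it can help to confirm that the files named in the manifest actually exist. The following is a sketch only: the function name check_manifest is hypothetical, and it assumes that file paths are resolved relative to the directory given as the second argument (the current directory by default).

```shell
# check_manifest MANIFEST [INPUT_DIR]
# Hypothetical helper: checks that the FASTA, AGP and CHROMOSOME_LIST files
# named in the manifest exist under INPUT_DIR (default: current directory).
check_manifest() {
  local manifest="$1" dir="${2:-.}" key value status=0
  # Each manifest line is "FIELD<whitespace>VALUE"; the || clause handles a
  # final line that has no trailing newline
  while read -r key value || [ -n "$key" ]; do
    case "$key" in
      FASTA|AGP|CHROMOSOME_LIST)
        if [ ! -f "$dir/$value" ]; then
          echo "missing file for $key: $value"
          status=1
        fi ;;
    esac
  done < "$manifest"
  return "$status"
}
```

With many genomes, this could be run in a loop such as `for m in manifest_*.txt; do check_manifest "$m"; done`.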

5. Submit!

You will need the Java command-line program webin-cli for this. It can be obtained from this enasequence GitHub repo:

wget https://github.com/enasequence/webin-cli/releases/download/v1.8.11/webin-cli-1.8.11.jar

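You can validate your submission before actually submitting it. (The examples below were run with the older version 1.8.6; use the name of the jar file you downloaded.)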

java -jar ../webin-cli-1.8.6.jar \
   -validate \
   -ascp \
   -manifest=manifest_CML52.txt \
   -context=genome \
   -username=YOURUSERNAME \
   -password=YOURPASSWORD

The stdout:

INFO : Your application version is 1.8.6
INFO : A new application version is available. Please download the latest version 1.8.11 from https://github.com/enasequence/webin-cli/releases
INFO : Creating report file: /work/LAS/mhufford-lab/arnstrm/Canu_1.8/EBI-submission-files/manifest-reports/./webin-cli.report
INFO : The submission has been validated successfully.

Once validated you can submit the genome as follows:

java -jar ../webin-cli-1.8.6.jar \
   -submit \
   -ascp \
   -manifest=manifest_CML52.txt \
   -context=genome \
   -username=YOURUSERNAME \
   -password=YOURPASSWORD

The stdout:

INFO : Your application version is 1.8.6
INFO : Creating report file: /work/LAS/mhufford-lab/arnstrm/Canu_1.8/EBI-submission-files/Scaffolds/./webin-cli.report
INFO : Submission has not been validated previously.
INFO : The submission has been validated successfully.
INFO : Invoking: ascp --file-checksum=md5 -d --mode=send --overwrite=always -QT -l300M --host=webin.ebi.ac.uk --user="YOURUSERNAME" --src-base="/work/LAS/mhufford-lab/arnstrm/Canu_1.8/EBI-submission-files/Scaffolds" --file-list="/tmp/FILE14052397021471900767LIST" "webin-cli/genome/Zm-CML52-REFERENCE-NAM-1.0"

Completed: 650401K bytes transferred in 19 seconds
 (271681K bits/sec), in 3 files.
INFO : Files have been uploaded to webin.ebi.ac.uk.
INFO : The submission has been completed successfully. The following analysis accession was assigned to the submission: ERZ1028866

After a day or two, you should receive a confirmation email stating that your submission has been received and that an accession has been assigned.
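With many genomes, the validate-then-submit sequence above can be wrapped in a small function. This is a sketch under several assumptions: webin-cli is assumed to exit with a non-zero status when validation fails, WEBIN_CMD is a hypothetical variable holding the launch command (adjust the jar path), and WEBIN_USER and WEBIN_PASS are hypothetical environment variables holding your credentials. Only the flags shown in the commands above are used.

```shell
# Command used to launch webin-cli; override WEBIN_CMD to point at your jar.
WEBIN_CMD="${WEBIN_CMD:-java -jar webin-cli-1.8.11.jar}"

# submit_genome MANIFEST
# Hypothetical helper: validate first, and submit only if validation succeeds.
submit_genome() {
  local manifest="$1"
  # $WEBIN_CMD is left unquoted on purpose so "java -jar ..." splits into words
  $WEBIN_CMD -validate -ascp -manifest="$manifest" -context=genome \
    -username="$WEBIN_USER" -password="$WEBIN_PASS" || {
      echo "validation failed: $manifest" >&2
      return 1
    }
  $WEBIN_CMD -submit -ascp -manifest="$manifest" -context=genome \
    -username="$WEBIN_USER" -password="$WEBIN_PASS"
}
```

A loop such as `for m in manifest_*.txt; do submit_genome "$m" || break; done` would then stop at the first manifest that fails.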

6. Optical maps

Many genome assembly protocols now use optical maps to help guide the construction of scaffolds. The following protocol can be used to submit an optical map along with your genome. These protocols can also be reviewed here.

To submit an analysis programmatically, two XML files must be generated to describe the submission.

Analysis XML - used for describing the analysis you would like to submit
Submission XML - tells ENA how to process this submission

In this example, we are submitting the optical map file for the B104 genome. We will submit the optical map file B104_golden_maps.cmap.gz via Aspera ascp (described in Section 2, Submitting raw datasets to ENA); then we will submit both xml files.

Submit the optical map file:

First, get a checksum for your optical map (md5 is the macOS command; on Linux, use md5sum instead):

md5 B104_golden_maps.cmap.gz
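If the same steps need to run on both Linux and macOS, a small wrapper can pick whichever tool is installed and print only the digest. The function name file_md5 is hypothetical.

```shell
# file_md5 FILE
# Hypothetical helper: print only the MD5 hex digest of FILE.
file_md5() {
  if command -v md5sum >/dev/null 2>&1; then
    md5sum "$1" | cut -d' ' -f1    # GNU coreutils (Linux): "digest  filename"
  else
    md5 -q "$1"                    # BSD/macOS: -q prints the digest only
  fi
}
```

The printed digest is the value that goes into the checksum attribute of the analysis XML.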

Next, submit your optical map using Aspera:

ascp -QT -l300M -L- B104_golden_maps.cmap.gz Webin-00000@webin.ebi.ac.uk:.

Here is the format for both the optical map xml and the submission xml:

B104_optical_map.xml:

<ANALYSIS_SET>
    <ANALYSIS alias="B104_BioNanao_map">
        <TITLE>B104 BioNano Optical Mapping data</TITLE>
        <DESCRIPTION>Zea mays cultivar B104 Optical Mapping data produced by BioNano</DESCRIPTION>
        <STUDY_REF accession="PRJEB44462"/>        
        <SAMPLE_REF accession="ERS6402692"/>      
        <ANALYSIS_TYPE>
            <GENOME_MAP>
                <PROGRAM>Saphyr</PROGRAM>
                <PLATFORM>BioNano</PLATFORM>
            </GENOME_MAP>
        </ANALYSIS_TYPE>
        <FILES>
            <FILE filename="B104_golden_maps.cmap.gz" filetype="BioNano_native" checksum_method="MD5" checksum="19274394578djdhw9184hg9"/>
        </FILES>
    </ANALYSIS>
</ANALYSIS_SET>

submission.xml:

<SUBMISSION>
   <ACTIONS>
      <ACTION>
         <ADD/>
      </ACTION>
   </ACTIONS>
</SUBMISSION>

Note that nothing needs to be added to the submission.xml file beyond what is shown above.
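Before testing the submission, it may be worth confirming that the checksum recorded in the analysis XML matches the one computed for the uploaded file. The sketch below extracts the recorded value; it assumes each FILE element sits on a single line, as in the example above, and the function name xml_checksum is hypothetical.

```shell
# xml_checksum XML FILENAME
# Hypothetical helper: print the checksum attribute of the FILE element
# whose filename attribute equals FILENAME.
xml_checksum() {
  # The leading space in ' checksum="' keeps checksum_method from matching
  grep "filename=\"$2\"" "$1" | sed -n 's/.* checksum="\([^"]*\)".*/\1/p'
}
```

The output of `xml_checksum B104_optical_map.xml B104_golden_maps.cmap.gz` should be identical to the digest reported by the md5 command for that file.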

Test the submission to make sure all files are correct:

curl -u Webin-00000:PaSSW0rd -F "SUBMISSION=@submission.xml" -F "ANALYSIS=@B104_optical_map.xml" "https://wwwdev.ebi.ac.uk/ena/submit/drop-box/submit/"

To find out whether the submission was successful, look at the first line of the receipt XML returned by the command. The attribute success will have the value true or false. If the value is false, the submission did not succeed. In that case, check the rest of the receipt for error messages, make corrections, and try the submission again.
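When scripting the submission, the receipt check can be automated. The sketch below assumes the receipt has been saved to a file (for example by appending `> receipt.xml` to the curl command) and that error messages appear inside ERROR elements; both the file name and the function name receipt_ok are hypothetical.

```shell
# receipt_ok RECEIPT_FILE
# Hypothetical helper: succeeds only if the receipt reports success="true";
# otherwise prints any <ERROR> elements found and returns non-zero.
receipt_ok() {
  if grep -q 'success="true"' "$1"; then
    echo "submission succeeded"
  else
    echo "submission failed; messages from the receipt:"
    grep -o '<ERROR>[^<]*</ERROR>' "$1"
    return 1
  fi
}
```

The non-zero return value lets a script stop before repeating the submission against the production service.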

Once you are happy with the result of the test submission, you can run the curl command again, this time against the production service. Simply change the host in the URL from wwwdev.ebi.ac.uk to www.ebi.ac.uk:

curl -u Webin-00000:PaSSW0rd -F "SUBMISSION=@submission.xml" -F "ANALYSIS=@B104_optical_map.xml" "https://www.ebi.ac.uk/ena/submit/drop-box/submit/"

ENA data submission part 3

Arun Seetharam

Margaret Woodhouse