Genome scaffolding with the Juicer, JuiceBox, 3D-DNA pipeline
Here at the ISUGIF, we have used multiple different contig assemblers and had success with this pipeline. Typically we assemble the genome with long reads and polish (long or illumina reads) prior to running the pipeline. I do not know if this is something that affects the downstream product, but it is an important consideration, considering that long read assemblies are typically error prone.
Another thing to consider is the type of HiC you are getting, we’ve used both endonuclease and restriction enzyme generated HiC with success. If you plan to use the HiC for anything further than HiC, you will want to specify this. The steps are a bit different, as with an endonuclease you can just say “-s none”. With restriction enzyme generated HiC, you will have to create a restriction_sites/ folder and run your genome you will have to creat the proposed restriction enzyme cut sites for your genome.
Gather the necessary files, setup the juicer file structure, and run juicer
Resources
It is not always clear where to find the information you need to run the pipeline, so I will lay out the many resources that I accessed while learning to run this pipeline.
There is lots of information to be had here at the juicer github site.
- https://github.com/aidenlab/juicer/wiki
If you have problems with installation and/or need help with understanding a result. The aiden lab forum is an invaluable resource.
- https://www.aidenlab.org/forum.html
Software Dependencies
1
2
3
4
5
6
BWA -- burrows-Wheeler Aligner
GNU core utils
Java >1.7
python
bioawk
Folder structure
This is the folder structure that we are trying to achieve with a few modifications. Juicer was already installed as a module, so the initial setup of recommended for juicer did not apply here. You are going to a folder with four subfolders containing: HiC reads, bwa index of your genome, predicted restriction enzyme cuts for your genome (if needed), and your fastq files split into appropriate sizes.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
├── 01_JuicerSetup
│ ├── chrom.sizes
│ ├── fastq
│ │ ├── hic_R1.fastq
│ │ └── hic_R2.fastq
│ ├── references
│ │ ├── Big.MysteryGenomePilon3.PILON.fasta
│ │ ├── Big.MysteryGenomePilon3.PILON.fasta.amb
│ │ ├── Big.MysteryGenomePilon3.PILON.fasta.ann
│ │ ├── Big.MysteryGenomePilon3.PILON.fasta.bwt
│ │ ├── Big.MysteryGenomePilon3.PILON.fasta.pac
│ │ └── Big.MysteryGenomePilon3.PILON.fasta.sa
│ ├── restriction_sites
│ | └── Big.MysteryGenomePilon3.PILON.fasta_DpnII.txt
| └── splits/
| | ├── x000_R1.fastq
| | ├── x000_R2.fastq
| | ├── x001_R1.fastq
| | ├── x001_R2.fastq
| | ├── x002_R1.fastq
| | ├── x002_R2.fastq
| | ├── x003_R1.fastq
| | ├── x003_R2.fastq
| | ├── x004_R1.fastq
| | ├── x004_R2.fastq
| | ├── x005_R1.fastq
| | ├── x005_R2.fastq
| | ├── x006_R1.fastq
| | ├── x006_R2.fastq
| | ├── x007_R1.fastq
| | ├── x007_R2.fastq
| | ├── x008_R1.fastq
| | ├── x008_R2.fastq
| | ├── x009_R1.fastq
| | ├── x009_R2.fastq
| | ├── x010_R1.fastq
| | ├── x010_R2.fastq
| | ├── x011_R1.fastq
| | ├── x011_R2.fastq
| | ├── x012_R1.fastq
| | ├── x012_R2.fastq
| | ├── x013_R1.fastq
| | ├── x013_R2.fastq
| | ├── x014_R1.fastq
| | └── x014_R2.fastq
softlink and index the genome sequence in the references/ folder
1
2
3
4
5
#index the references in this explicitly named folder.
mkdir references; cd references
ln -s /work/gif/archiveResults/JustJammin/WhiteMysteryGenome/Big.MysteryGenomePilon3.PILON.fasta
module load bwa/0.7.17-zhcbtza
bwa index Big.MysteryGenomePilon3.PILON.fasta
Create restriction enzyme proposed cutting sites
I did not do this for this genome assembly, as my HiC was generated with an endonuclease.
1
2
3
mkdir restriction_sites; cd restriction_sites
python ../../juicer/misc/generate_site_positions.py DpnII Big.MysteryGenomePilon3.PILON.fasta ../references/Big.MysteryGenomePilon3.PILON.fasta
cd ../
Get your HiC reads ready
While juicer does support splitting gzipped files, we had problems getting juicer to split the reads. Thus I worked around this to make juicer assume the fastq files are already split.
- fastq files should look like filename_R1.fastq and filename_R2.fastq
1
2
3
4
5
6
7
8
9
mkdir fastq;cd fastq
ln -s /work/gif/archiveNova/JustJammin/Mystery/whiteMystery/HiC/JustJammin/White_S1_L001_R1_001.fastq hic_R1.fastq
ln -s /work/gif/archiveNova/JustJammin/Mystery/whiteMystery/HiC/JustJammin/White_S1_L001_R2_001.fastq hic_R2.fastq
cd ../
mkdir splits
cd splits/
split -a 3 -l 90000000 -d --additional-suffix=_R2.fastq ../fastq/hic_R2.fastq &
split -a 3 -l 90000000 -d --additional-suffix=_R1.fastq ../fastq/hic_R1.fastq &
cd ../
Create a chromosome sizes file in the format “name\tsequence_length”
1
2
3
module load bioawk
bioawk -c fastx '{print $name"\t"length($seq)}' references/Big.MysteryGenomePilon3.PILON.fasta >chrom.sizes
Submit a juicer run
Move to the submission directory and submit The trick here is setting juicer’s short, medium, and long queue times, as they will not likely be default for your system.
1
2
3
cd ../
#/02_WhiteMysteryPseudomolecule/01_JuicerSetup
ml jdk/10.0.2_13-fr57jru; ml juicer; juicer.sh -d 01_JuicerSetup -p chrom_sizes -s none -z 01_JuicerSetup/references/Big.MysteryGenomePilon3.PILON.fasta -q short -Q 2:00:00 -l medium -L 12:00:00 -t 36
Troubleshooting a juicer run
Your stdout and stderr will tell you almost nothing about your run. If at the end of your run, you are seeing large merged_sort.txt and a merged_nodups.txt files, then you likely succeeded. Note that these files are usually hundreds of MB.
If you do not have a merged_sort.txt, the first thing to check is if your split fastq files (in splits/) each aligned and created a sam file. If you want to know why they failed, then you’ll have to investigate the debug folder.
If you have a merged_sort.txt file, but lack the merged_nodups.txt, then the read deduplication step failed. In most cases this will be a memory issue that can be fixed by creating a greater number of splits of your fastq files.
You will need to go into the debug folder, which will be /02_WhiteMysteryPseudomolecule/01_JuicerSetup/debug for me.
If juicer does not complete, you can restart from the last completed stage using -S with one of the following options “chimeric”, “merge”, “dedup”, “final”, “postproc”, or “early”.
Set up and run 3D-DNA assembly
Install 3d-dna
When I install I usually use singularity containers or conda evironments to address my lack of sudo privileges. LastZ installation is not necessary, as it is only useful for highly heterozygous genomes.
Create conda environment for installation
1
2
3
4
5
6
#/02_WhiteMysteryPseudomolecule/03_3D-DNA
#create a conda environment, as I am using a remote hpc.
module load miniconda3/4.3.30-qdauveb
conda create -n 3d-dna
source activate 3d-dna
Install lastz inside the 3d-dna environment
1
2
3
4
5
6
git clone https://github.com/lastz/lastz.git
cd lastz/
cd src/
make
make install
make test
Install Python packages
1
2
3
4
# these python packages are needed.
pip install numpy --user
pip install scipy --user
pip install matplotlib --user
Clone and set path to 3d-dna
1
2
3
4
#/02_WhiteMysteryPseudomolecule/03_3D-DNA
git clone https://github.com/theaidenlab/3d-dna.git
PATH=$PATH:/02_WhiteMysteryPseudomolecule/03_3D-DNA/lastz
PATH=$PATH:/02_WhiteMysteryPseudomolecule/03_3D-DNA/3d-dna
3d-dna assembly that deals with uneven repeat coverage in the genome
1
2
3
module load miniconda3/4.3.30-qdauveb;source activate 3d-dna;module load jdk;module load parallel;cd /02_WhiteMysteryPseudomolecule/03_3D-DNA/3d-dna;bash run-asm-pipeline.sh --editor-repeat-coverage 20 /02_WhiteMysteryPseudomolecule/01_JuicerSetup/references/Big.MysteryGenomePilon3.PILON.fasta /02_WhiteMysteryPseudomolecule/01_JuicerSetup/aligned/merged_nodups.txt
3d-dna assembly with different ways to modify coverage
1
module load miniconda3/4.3.30-qdauveb;source activate 3d-dna;module load jdk;module load parallel;cd /02_WhiteMysteryPseudomolecule/03_3D-DNA/3d-dna;bash run-asm-pipeline.sh --editor-saturation-centile 5 /02_WhiteMysteryPseudomolecule/01_JuicerSetup/references/Big.MysteryGenomePilon3.PILON.fasta /02_WhiteMysteryPseudomolecule/01_JuicerSetup/aligned/merged_nodups.txt
3d-dna assembly for an assembly considered highly heterozygous
1
module load miniconda3/4.3.30-qdauveb;source activate 3d-dna;module load jdk;module load parallel;cd /02_WhiteMysteryPseudomolecule/03_3D-DNA/3d-dna;bash run-asm-pipeline.sh -m diploid /02_WhiteMysteryPseudomolecule/01_JuicerSetup/references/Big.MysteryGenomePilon3.PILON.fasta /02_WhiteMysteryPseudomolecule/01_JuicerSetup/aligned/merged_nodups.txt
3d-dna assembly with poor quality HiC
This is the option you should try if your HiC coverage is low or suboptimal in some way.
1
2
3
4
5
6
If you see an error that is something like this, "Unknown resolution: BP_100000", then you need to modify the resolution that 3d-dna uses for this step or modify the normalization methods, which are described in the Aiden forum. "One way to see the overcorrection of vanilla coverage is to look at supplemental figure S1b. Places where coverage was high in the original map are "overcompensated" and become depleted. With square root, this effect is dampened, but the best remains KR. At high resolution, we suggest VC as the alternative to KR if KR doesn't work for some reason; at low resolution, square root VC better approximates KR"
modify the normalization for your HiC reads at this line. Change to NONE, VC, or sqrtVC
"norm="KR"
vi /02_WhiteMysteryPseudomolecule/04_3ddnaRepCov20/3d-dna/split/run-asm-splitter.sh
If changing normalization doesn’t work, then the “ADDITIONAL OPTIONS” in need to be modified until there is enough coverage for 3d-dna to operate. These are in run-asm-pipeline.sh
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
-q|--mapq mapq Mapq threshold for scaffolding and visualization (default is 1).
**misjoin detector**
--editor-coarse-resolution editor_coarse_resolution
Misjoin editor coarse matrix resolution, should be one of the following: 2500000, 1000000, 500000, 250000, 100000, 50000, 25000, 10000, 5000, 1000 (default is 25000).
--editor-coarse-region editor_coarse_region
Misjoin editor triangular motif region size (default is 125000).
--editor-coarse-stringency editor_coarse_stringency
Misjoin editor stringency parameter (default is 55).
--editor-saturation-centile editor_saturation_centile
Misjoin editor saturation parameter (default is 5).
--editor-fine-resolution editor_fine_resiolution
Misjoin editor fine matrix resolution, should be one of the following: 2500000, 1000000, 500000, 250000, 100000, 50000, 25000, 10000, 5000, 1000 (default is 1000).
--editor-repeat-coverage editor_repeat_coverage
Misjoin editor threshold repeat coverage (default is 2).
Juicebox
File requirements
For juicebox, you’ll need two files from 3d-dna, your final .hic file and your final .assembly file.
1
2
Big.MysteryGenomePilon3.PILON.final.hic
Big.MysteryGenomePilon3.PILON.final.assembly
Download juicebox for your system (1.11.08 currently)
- https://github.com/aidenlab/Juicebox/wiki/Download
While I did want to include a video of edits, the aiden lab does a pretty good job of documenting how to use juicebox, once it is loaded.
Check out these videos to discover how to read the HiC maps, spot inversions, and get a general feel for juicebox.
- https://www.youtube.com/watch?v=Nj7RhQZHM18
- https://www.youtube.com/watch?v=IMmVp8FodmY
Juicebox can be a resource hog and crashes occasionally, so be sure to frequently save your modifications by exporting the assembly. Also note, that if juicebox crashes and you need to reload your .hic and .assembly files, be sure to always load .hic, then load the original .assembly file, then load your modified .assembly file.
Before editing with Juicebox
After editing with Juicebox
3d-dna finalize
Since the assembly files will be overwritten if we run files will be overwritten in the 3d-dna assembly, I copied the old 3ddna folder and removed all data that was not script related for the final assembly process of 3d-dna.
1
2
3
4
/03_White_Mystery/06_3ddna_finalize/3d-dna
# I use the additional options of -i 500 to force 3d-dna to try to incorporate smaller contigs.
module load miniconda3/4.3.30-qdauveb;source activate 3d-dna;module load jdk;module load parallel;bash run-asm-pipeline-post-review.sh --sort-output -s seal -i 500 -r Big.MysteryGenomePilon3.PILON.donefinal.review.assembly.assembly /02_WhiteMysteryPseudomolecule/01_JuicerSetup/references/Big.MysteryGenomePilon3.PILON.fasta /02_WhiteMysteryPseudomolecule/01_JuicerSetup/aligned/merged_nodups.txt
Assembly results
Before juicer, juicebox, and 3d-dna
1
2
3
4
5
6
7
8
9
10
11
12
13
Number of scaffolds 23058
Total size of scaffolds 1204812274
Longest scaffold 4090936
Shortest scaffold 92
Number of scaffolds > 1K nt 18200 78.9%
Number of scaffolds > 10K nt 7623 33.1%
Number of scaffolds > 100K nt 2445 10.6%
Number of scaffolds > 1M nt 166 0.7%
Number of scaffolds > 10M nt 0 0.0%
Mean scaffold size 52251
Median scaffold size 3118
N50 scaffold length 444717
L50 scaffold count 721
After juicer, juicebox, and 3d-dna
1
2
3
4
5
6
7
8
9
10
11
12
13
Number of scaffolds 19758
Total size of scaffolds 1207942174
Longest scaffold 87406776
Shortest scaffold 92
Number of scaffolds > 1K nt 14801 74.9%
Number of scaffolds > 10K nt 2796 14.2%
Number of scaffolds > 100K nt 26 0.1%
Number of scaffolds > 1M nt 18 0.1%
Number of scaffolds > 10M nt 18 0.1%
Mean scaffold size 61137
Median scaffold size 2160
N50 scaffold length 60392520
L50 scaffold count 9