Arabidopsis thaliana dataset
Background
Arabidopsis thaliana has a genome size of approximately 135 Mb.
The data and experiment are described in the article: Chromosome-level assembly of Arabidopsis thaliana Ler reveals the extent of translocation and inversion polymorphisms. The PacBio reads were taken from Pacific Biosciences Model Organism Genome Sequencing-Arabidopsis thaliana P4C2.
Files
Run | Instrument | Layout | Insert (bp) | Read length | Total reads | Bases (Mbp) |
---|---|---|---|---|---|---|
SRR3157034 | HiSeq 2000 | paired-end | 0 | 100x2 | 93,446,768 | 17,823 |
SRR3166543 | HiSeq 2000 | paired-end | 0 | 100x2 | 162,362,560 | 30,968 |
SRR3156163 | HiSeq 2000 | mate-pair | 8,000 | 100x2 | 51,332,776 | 9,790 |
SRR3156596 | HiSeq 2000 | mate-pair | 20,000 | 100x2 | 61,030,552 | 11,640 |
SRR1284771 | PacBio RSII | Single | NA | unknown | 163,482 | 2,3 |
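As a rough sanity check on the table above, sequencing depth can be estimated by dividing the total bases of a run by the genome size. The sketch below assumes the approximate 135 Mb genome size quoted in the Background; the helper name `coverage` is illustrative and is not part of the sra-toolkit.

```bash
# Estimate sequencing depth as (total bases) / (genome size).
# Both arguments are in megabases; 135 is the approximate genome size
# from the Background section, so the result is only a rough estimate.
coverage() {
    awk -v bases="$1" -v genome="$2" 'BEGIN { printf "%.1f\n", bases / genome }'
}

coverage 17823 135   # SRR3157034, prints 132.0
coverage 30968 135   # SRR3166543, prints 229.4
```

The values passed in are the Bases (Mbp) column for the two paired-end runs.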
How to download the data from SRA
The reads are downloaded from SRA using the sra-toolkit.
- Create a file named srr.ids containing the SRA IDs, one per line:

      SRR3156163
      SRR3156596
      SRR3157034
      SRR3166543
- Assuming the sra-toolkit is installed as a module, load it and run the following loop on the command line:

      module load sra-toolkit
      while read -r line; do
        fastq-dump --split-files --origfmt "${line}"
      done < srr.ids
- The PacBio data have to be downloaded separately:

      fastq-dump --table SEQUENCE --origfmt SRR3156160
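Before starting a long download, it can help to preview the commands the loop will run. The following is a minimal sketch, not part of the original workflow: the function name `print_download_commands` is hypothetical, and it only prints the `fastq-dump` calls (with the same options used above) without executing them.

```bash
# Print (rather than run) one fastq-dump command per accession in an ID file.
# Blank lines and lines starting with '#' are skipped.
print_download_commands() {
    ids_file="$1"
    # The "|| [ -n ... ]" part also handles a last line with no trailing newline.
    while IFS= read -r srr || [ -n "$srr" ]; do
        case "$srr" in
            ''|'#'*) continue ;;
        esac
        printf 'fastq-dump --split-files --origfmt %s\n' "$srr"
    done < "$ids_file"
}

# Recreate the ID file from the first step and preview the commands.
printf '%s\n' SRR3156163 SRR3156596 SRR3157034 SRR3166543 > srr.ids
print_download_commands srr.ids
```

Once the printed commands look right, the output can be piped to `sh` to run them one after another.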
Note: sra-toolkit will create a folder named ncbi in your home directory (/home/userid/ncbi). If you have a disk storage limit on your home directory (most supercomputers do), you will want to move that folder to a different location and then create a softlink to it in your home folder.

Error example:

      2019-04-16T19:43:49 fastq-dump.2.8.1 err: unknown while writing file within file system module - unknown system error errno=Disk quota exceeded(122)
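The move-and-softlink step from the note can be sketched as a small helper. This is an illustration under stated assumptions: the function name `relocate_ncbi_dir` is hypothetical, and the destination in the commented example is a placeholder that must be replaced with a real location on your system that has enough quota.

```bash
# Move a directory to a location with more space and leave a symlink behind.
#   $1: the existing directory, e.g. "$HOME/ncbi"
#   $2: absolute path of the new parent directory (placeholder in the example)
relocate_ncbi_dir() {
    src="$1"
    dest_parent="$2"
    # Refuse to act if the source is missing or is already a symlink.
    [ -d "$src" ] && [ ! -L "$src" ] || { echo "nothing to move: $src" >&2; return 1; }
    mkdir -p "$dest_parent" || return 1
    mv "$src" "$dest_parent/" || return 1
    ln -s "$dest_parent/$(basename "$src")" "$src"
}

# Example (the destination is a placeholder; adjust it before running):
# relocate_ncbi_dir "$HOME/ncbi" "/path/to/large/storage"
```

Because the symlink keeps the original path valid, `fastq-dump` can keep writing to the ncbi folder while the data live outside the home directory.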
Assembly statistics for Arabidopsis
Statistic | Value |
---|---|
Total sequence length | 118,890,721 |
Total ungapped length | 117,113,196 |
Gaps between scaffolds | 0 |
Number of scaffolds | 30 |
Scaffold N50 | 22,588,203 |
Scaffold L50 | 3 |
Number of contigs | 525 |
Contig N50 | 1,193,183 |
Contig L50 | 27 |
Total number of chromosomes and plasmids | 7 |
Number of component sequences (WGS or clone) | 30 |
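For readers unfamiliar with the N50 and L50 rows, the sketch below shows how these statistics are conventionally computed from a list of sequence lengths: N50 is the length of the sequence at which the running total of lengths, sorted longest first, reaches half of the total, and L50 is how many sequences that takes. The function name `n50_l50` is illustrative, and the input lengths are toy values, not the lengths from this assembly.

```bash
# Compute N50 and L50 from sequence lengths read on stdin (one integer per line).
n50_l50() {
    sort -rn | awk '
        { len[NR] = $1; total += $1 }
        END {
            half = total / 2
            for (i = 1; i <= NR; i++) {
                running += len[i]
                # First sequence at which the cumulative length reaches half the total.
                if (running >= half) { printf "N50=%d L50=%d\n", len[i], i; exit }
            }
        }'
}

# Toy example: total is 290, half is 145, reached at the second-longest sequence.
printf '%s\n' 100 80 50 30 20 10 | n50_l50   # prints N50=80 L50=2
```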