Arabidopsis thaliana dataset
Background
Arabidopsis thaliana has a genome size of approximately 135 Mb.
These data and the experiment are described in the article "Chromosome-level assembly of Arabidopsis thaliana Ler reveals the extent of translocation and inversion polymorphisms". The PacBio reads were taken from the Pacific Biosciences Model Organism Genome Sequencing release for Arabidopsis thaliana (P4C2).
Files
| Run | Instrument | Layout | Insert (bp) | Read length (bp) | Total reads | Bases (Mbp) |
|---|---|---|---|---|---|---|
| SRR3157034 | HiSeq 2000 | paired-end | 0 | 100x2 | 93,446,768 | 17,823 |
| SRR3166543 | HiSeq 2000 | paired-end | 0 | 100x2 | 162,362,560 | 30,968 |
| SRR3156163 | HiSeq 2000 | mate-pair | 8,000 | 100x2 | 51,332,776 | 9,790 |
| SRR3156596 | HiSeq 2000 | mate-pair | 20,000 | 100x2 | 61,030,552 | 11,640 |
| SRR1284771 | PacBio RSII | Single | NA | unknown | 163,482 | 2,3 |
How to download the data from SRA
Downloading from SRA will be performed using the sra-toolkit.
- Create a file containing the SRA run IDs, one per line, and name it `srr.ids`:

      SRR3156163
      SRR3156596
      SRR3157034
      SRR3166543
- Assuming the sra-toolkit is installed, load the module and run the following bash loop on the command line:

      module load sra-toolkit
      while read -r line; do
        fastq-dump --split-files --origfmt "${line}"
      done < srr.ids
- The PacBio data has to be downloaded separately:

      fastq-dump --table SEQUENCE --origfmt SRR3156160
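Before starting a long download, it can help to preview the commands the loop above will issue. The following is a minimal dry-run sketch, not part of the original workflow: it writes the same `srr.ids` file and prints each `fastq-dump` command to a file instead of executing it. The output file name `commands.txt` is an assumption chosen for illustration.

```shell
# Write the run IDs, one per line (same IDs as listed above).
cat > srr.ids <<'EOF'
SRR3156163
SRR3156596
SRR3157034
SRR3166543
EOF

# Dry run: print each fastq-dump command rather than executing it.
# commands.txt is a hypothetical file name used only for this preview.
while read -r id; do
  # Skip blank lines so they do not produce empty commands.
  if [ -n "$id" ]; then
    echo "fastq-dump --split-files --origfmt $id"
  fi
done < srr.ids > commands.txt

cat commands.txt
```

Once the printed commands look right, the real loop can be run as shown in the steps above.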
Note: sra-toolkit will create a folder named `ncbi` in your home directory (`/home/userid/ncbi`). If you have a disk storage limit on your home directory (most supercomputers do), you will want to move that folder to a different location and then create a softlink in your home folder.

Error example:

    2019-04-16T19:43:49 fastq-dump.2.8.1 err: unknown while writing file within file system module - unknown system error errno=Disk quota exceeded(122)
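The move-and-softlink step described in the note could look roughly like the sketch below. The function name `relocate_ncbi` and the destination path in the usage comment are hypothetical; substitute a location where you actually have enough quota.

```shell
# Move a directory to a new location and leave a symlink at the old path.
# relocate_ncbi is a hypothetical helper name, not part of sra-toolkit.
relocate_ncbi() {
  src=$1    # e.g. "$HOME/ncbi"
  dest=$2   # new location on a filesystem with more space

  # Nothing to do if the old path is already a symlink.
  if [ -L "$src" ]; then
    echo "$src is already a symlink" >&2
    return 0
  fi

  mkdir -p "$(dirname "$dest")"
  if [ -d "$src" ]; then
    # Refuse to clobber an existing destination.
    if [ -e "$dest" ]; then
      echo "refusing to overwrite $dest" >&2
      return 1
    fi
    mv "$src" "$dest"
  else
    # No folder yet: create the destination so later downloads land there.
    mkdir -p "$dest"
  fi
  ln -s "$dest" "$src"
}

# Usage (the destination path is an assumption, adjust to your system):
# relocate_ncbi "$HOME/ncbi" /path/to/large/storage/ncbi
```

After this, anything the toolkit writes under the old path ends up in the new location through the symlink.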
Assembly statistics for Arabidopsis
| Statistic | Value |
|---|---|
| Total sequence length | 118,890,721 |
| Total ungapped length | 117,113,196 |
| Gaps between scaffolds | 0 |
| Number of scaffolds | 30 |
| Scaffold N50 | 22,588,203 |
| Scaffold L50 | 3 |
| Number of contigs | 525 |
| Contig N50 | 1,193,183 |
| Contig L50 | 27 |
| Total number of chromosomes and plasmids | 7 |
| Number of component sequences (WGS or clone) | 30 |
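For readers unfamiliar with the N50 and L50 rows in the table, the sketch below shows how these two statistics are conventionally defined: sort the sequence lengths from longest to shortest, then accumulate them until at least half of the total length is covered. N50 is the length of the sequence that crosses that threshold and L50 is how many sequences were needed. The function name `n50_l50` and the toy lengths are made up for illustration; this is not necessarily the tool that produced the values in the table.

```shell
# Read one sequence length per line on stdin and print N50 and L50.
# n50_l50 is a hypothetical helper name used only for this sketch.
n50_l50() {
  sort -nr | awk '
    { len[NR] = $1; total += $1 }
    END {
      for (i = 1; i <= NR; i++) {
        cum += len[i]
        # Stop at the first sequence where the running sum covers half the total.
        if (cum * 2 >= total) {
          printf "N50=%d L50=%d\n", len[i], i
          exit
        }
      }
    }'
}

# Toy example with made-up lengths (total 300, so the threshold is 150).
printf '%s\n' 80 70 50 40 30 20 10 | n50_l50
```

With the toy input, the two longest sequences (80 + 70 = 150) reach half of the total, so the function reports `N50=70 L50=2`.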