Arabidopsis thaliana dataset

Andrew Severin

His PhD was in Biophysics/NMR spectroscopy. He did a Bioinformatics Postdoc in Soybean genetics and now runs the Genome Informatics Facility at Iowa State University. He is passionate about evolution and the science behind the genome. There is so much we don't know about how the elements in a genome interact to create the fine balance of gene expression, modification and 3D structure that create the dynamic range of phenotypes we observe. As sequencing technology continues to improve and the cost continues to decrease, we will be able to ask more complex questions that increase our understanding via comparative and translational genomics.

Background

Arabidopsis thaliana has a genome size of approximately 135 Mb.

The data and experiment are described in the article "Chromosome-level assembly of Arabidopsis thaliana Ler reveals the extent of translocation and inversion polymorphisms". The PacBio reads were taken from the Pacific Biosciences Model Organism Genome Sequencing data for Arabidopsis thaliana P4C2.

Files

Run	Instrument	Layout	Insert size (bp)	Read length (bp)	Total reads	Bases (Mbp)
SRR3157034	HiSeq 2000	paired-end	0	100x2	93,446,768	17,823
SRR3166543	HiSeq 2000	paired-end	0	100x2	162,362,560	30,968
SRR3156163	HiSeq 2000	mate-pair	8,000	100x2	51,332,776	9,790
SRR3156596	HiSeq 2000	mate-pair	20,000	100x2	61,030,552	11,640
SRR1284771	PacBio RSII	Single	NA	unknown	163,482	2,3

How to download the data from SRA

The data are downloaded from SRA using the sra-toolkit.

Create a file containing the SRA run IDs of the Illumina data, one per line, and name it srr.ids:

SRR3156163
SRR3156596
SRR3157034
SRR3166543
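
One way to create this file from the command line is sketched below. It only writes the four run IDs listed above, one per line, and then counts the lines as a quick check; writing the file in a text editor works just as well.

```bash
# Write one SRA run accession per line to srr.ids (the four Illumina runs).
printf '%s\n' SRR3156163 SRR3156596 SRR3157034 SRR3166543 > srr.ids

# Quick check: the file should contain four lines.
wc -l < srr.ids
```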

Assuming the sra-toolkit is installed as a module, load it and run the following bash loop on the command line.

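module load sra-toolkit
# Download each run listed in srr.ids; --split-files writes the two reads of a pair to separate files
while read -r line; do
  fastq-dump --split-files --origfmt "${line}"
done < srr.ids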

The PacBio data has to be downloaded separately:
```
fastq-dump --table SEQUENCE --origfmt SRR3156160
```
Note: sra-toolkit will create a folder named ncbi in your home directory (/home/userid/ncbi). If you have a disk storage limit on your home directory (most supercomputers do), move that folder to a different location and then create a symbolic link (softlink) to it in your home folder.
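
The move-and-link step in the note can be sketched as a small bash function. The function name relocate_with_symlink and the destination path in the usage comment are hypothetical; substitute whatever larger storage area your system provides.

```bash
# Hypothetical helper: move a directory to a larger filesystem and leave a
# symbolic link behind at the original location.
relocate_with_symlink() {
  local src="$1"    # original location, e.g. "$HOME/ncbi"
  local dest="$2"   # new location on project or scratch storage (assumed path)

  if [ -L "$src" ]; then
    echo "$src is already a symbolic link; nothing to do" >&2
    return 0
  fi
  if [ -e "$dest" ]; then
    echo "$dest already exists; refusing to overwrite" >&2
    return 1
  fi
  # Create the source folder if it does not exist yet, and the parent of dest.
  mkdir -p "$src" "$(dirname "$dest")" || return 1
  # Move the folder, then link the old path to the new location.
  mv "$src" "$dest" && ln -s "$dest" "$src"
}

# Example usage (the destination is a placeholder, adjust to your system):
# relocate_with_symlink "$HOME/ncbi" "/work/$USER/ncbi"
```

Because the old path remains valid through the link, sra-toolkit should keep working without further configuration, while the files themselves count against the quota of the new location.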

Error example: 2019-04-16T19:43:49 fastq-dump.2.8.1 err: unknown while writing file within file system module - unknown system error errno=Disk quota exceeded(122)

Assembly statistics for Arabidopsis

Statistics	value
Total sequence length	118,890,721
Total ungapped length	117,113,196
Gaps between scaffolds	0
Number of scaffolds	30
Scaffold N50	22,588,203
Scaffold L50	3
Number of contigs	525
Contig N50	1,193,183
Contig L50	27
Total number of chromosomes and plasmids	7
Number of component sequences (WGS or clone)	30
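
For readers unfamiliar with the N50 and L50 rows, the sketch below computes both from a list of sequence lengths using the common definition: sort the lengths from longest to shortest and accumulate them until at least half of the total length is reached. The length at that point is the N50 and the number of sequences used is the L50. The function name n50_l50 and the toy lengths are illustrative only and are not taken from the Arabidopsis assembly; how the table values above were computed is not stated here.

```bash
# Print N50 and L50 for sequence lengths given one per line on stdin.
n50_l50() {
  sort -rn | awk '
    { len[NR] = $1; total += $1 }
    END {
      running = 0
      for (i = 1; i <= NR; i++) {
        running += len[i]
        if (running * 2 >= total) {   # reached at least half of the total length
          printf "N50=%d L50=%d\n", len[i], i
          exit
        }
      }
    }'
}

# Toy example with made-up lengths (total 300, half is 150):
printf '%s\n' 80 70 50 40 30 20 10 | n50_l50
# prints: N50=70 L50=2
```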
