====== Generating counts data ======
Trinity package provides align_and_estimate_abundance.pl program to generate abundance estimates.
This program takes the reads file and the transcriptome to generate counts. So, we have to generate such commands for every sample. (Some samples have SE data and some have PE data). To make it easier, I have separate commands to SE and PE datasets.
Files required:
When there are multiple input files, here is one way to generate the align_and_estimate commands for all of them at once. For better understanding, here is a sample file name WBC-B1-22_ACTTGA_L002_R1_001.fastq.gz
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
## Paired end
ls WBC-B1* |\
grep -v 'L005' |\
paste - - |\
tr '_' ' ' |\
awk '{print $1"_"$2"_"$3"_"$4"_"$5,$6"_"$7"_"$8"_"$9"_"$10,$1}' |\
awk '{print "align_and_estimate_abundance.pl --est_method RSEM --aln_method bowtie --trinity_mode --transcripts transcripts.fa \\";print "--thread_count 16 \\";print "--seqType fq --left "$1,"--right "$2" \\";print "--output_prefix "$3,"--output_dir "$3" &";print "wait";print " "}' \
> WBC-B1-PE-commands.RSEM
ls WBC-B2* |\
grep -v 'L005' |\
paste - - |\
tr '_' ' ' |\
awk '{print $1"_"$2"_"$3"_"$4"_"$5,$6"_"$7"_"$8"_"$9"_"$10,$1}' |\
awk '{print "align_and_estimate_abundance.pl --est_method RSEM --aln_method bowtie --trinity_mode --transcripts transcripts.fa \\";print "--thread_count 16 \\";print "--seqType fq --left "$1,"--right "$2" \\";print "--output_prefix "$3,"--output_dir "$3" &";print "wait";print " "}' \
> WBC-B2-PE-commands.RSEM
## Single end
ls *L005_R1* |\
tr '_' ' ' |\
awk '{print $1"_"$2"_"$3"_"$4"_"$5,$6"_"$7"_"$8"_"$9"_"$10,$1"SE"}' |\
awk '{print "align_and_estimate_abundance.pl --est_method RSEM --aln_method bowtie --trinity_mode --transcripts transcripts.fa \\";print "--thread_count 16 \\";print "--seqType fq --single "$1," \\";print "--output_prefix "$3,"--output_dir "$3" &";print "wait";print " "}' \
> WBC-SE-commands.RSEM
##Bean vs corn
ls WBC-mg* |\
paste - - |\
tr '_' ' ' |\
awk '{print $1"_"$2"_"$3"_"$4"_"$5,$6"_"$7"_"$8"_"$9"_"$10,$1}' |\
awk '{print "align_and_estimate_abundance.pl --est_method RSEM --aln_method bowtie --trinity_mode --transcripts transcripts.fa \\";print "--thread_count 64 \\";print "--seqType fq --left "$1,"--right "$2" \\";print "--output_prefix "$3,"--output_dir "$3" &";print "wait";print " "}' \
> WBC-mgcommands.RSEM
ls WBC-mg* |\
paste - - |\
tr '_' ' ' |\
awk '{print $1"_"$2"_"$3"_"$4"_"$5,$6"_"$7"_"$8"_"$9"_"$10,$1}' |\
awk '{print "align_and_estimate_abundance.pl --est_method RSEM --aln_method bowtie --trinity_mode --transcripts transcripts.fa \\";print "--thread_count 64 \\";print "--seqType fq --left "$1,"--right "$2" \\";print "--output_prefix "$3,"--output_dir "$3" &";print "wait";print " "}' \
> WBC-mgcommands.RSEM
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=16
#SBATCH --time=48:00:00
#SBATCH --mem=64G
#SBATCH --mail-user=<username@iastate.edu>
#SBATCH --mail-type=begin
#SBATCH --mail-type=end
#SBATCH --job-name=alignestimateWGC
#SBATCH --error=alignestimateWBC-B2.%J.err
#SBATCH --output=alignestimateWBC-B2.%J.out
cd ${SLURM_SUBMIT_DIR}
module load trinityrnaseq/gcc/64/2.1.1
align_and_estimate_abundance.pl \
--est_method RSEM \
--aln_method bowtie \
--trinity_mode \
--transcripts transcripts.fa \
--thread_count 16 \
--seqType fq \
--left WBC-B2-01_AGTTCC_L002_R1_001.fastq.gz \
--right WBC-B2-01_AGTTCC_L002_R2_001.fastq.gz \
--output_prefix WBC-B2-01 \
--output_dir WBC-B2-01 &
wait
align_and_estimate_abundance.pl \
--est_method RSEM \
--aln_method bowtie \
--trinity_mode \
--transcripts transcripts.fa \
--thread_count 16 \
--seqType fq \
--left WBC-B2-02_ATGTCA_L002_R1_001.fastq.gz \
--right WBC-B2-02_ATGTCA_L002_R2_001.fastq.gz \
--output_prefix WBC-B2-02 --output_dir WBC-B2-02 &
wait
align_and_estimate_abundance.pl \
--est_method RSEM \
--aln_method bowtie \
--trinity_mode \
--transcripts transcripts.fa \
--thread_count 16 \
--seqType fq \
--left WBC-B2-03_CCGTCC_L002_R1_001.fastq.gz \
--right WBC-B2-03_CCGTCC_L002_R2_001.fastq.gz \
--output_prefix WBC-B2-03 --output_dir WBC-B2-03 &
wait
Once the align_and_estimate_abundance jobs are done, each folder has *.isoforms.results
file. Concatenate these files to generate counts matrix.
1
2
3
4
5
module load trinityrnaseq/gcc/64/2.1.1
/software/apps/trinityrnaseq/gcc/64/2.1.1/trinityrnaseq-2.1.1/util/abundance_estimates_to_matrix.pl \
--est_method RSEM \
--out_prefix all */*isoform*results
Look at the ERROR and OUTPUT files from RSEM to see whether the reads aligned well to the transcriptome. Pay attention to what percentage of reads in each sample aligned. For samples that have very low mapping rate, exclude them from downstream analyses.