6. Analyzing RNAseq expression with Salmon

We will use salmon to quantify expression. Salmon is a new breed of software for quantifying RNAseq reads that is both really fast and takes transcript length into consideration (Patro et al. 2015).

Additional references:

https://github.com/ngs-docs/2015-nov-adv-rna/blob/master/salmon.rst

http://angus.readthedocs.io/en/2016/rob_quant/tut.html

https://2016-aug-nonmodel-rnaseq.readthedocs.io/en/latest/quantification.html


Be sure you have loaded the right Python packages

source ~/pondenv/bin/activate

Run Salmon

First, build an index for your new transcriptome:

cd ${PROJECT}
mkdir -p quant
cd quant
ln -s ${PROJECT}/assembly/trinity_out_dir/Trinity.fasta .
salmon index --index nema --transcripts Trinity.fasta --type quasi

And also link in the QC reads (produced in 1. Quality Trimming and Filtering Your Sequences):

ln -s ../quality/*R1*.qc.fq.gz .
ln -s ../quality/*R2*.qc.fq.gz .

Then, run the salmon command:

for R1 in *R1*.qc.fq.gz
do
  sample=$(basename $R1 extract.qc.fq.gz)
  echo sample is $sample, R1 is $R1
  R2=${R1/R1/R2}
  echo R2 is $R2
  salmon quant -i nema -p 2 -l IU -1 <(gunzip -c $R1) -2 <(gunzip -c $R2) -o ${sample}quant
done

This will create a bunch of directories named something like 0Hour_ATCACG_L002001.quant, containing a bunch of files. Take a look at what files there are:

find 0Hour_ATCACG_L002_R1_001* -type f

You should see:

0Hour_ATCACG_L002_R1_001.extract.quant/quant.sf
0Hour_ATCACG_L002_R1_001.extract.quant/aux_info/observed_bias.gz
0Hour_ATCACG_L002_R1_001.extract.quant/aux_info/observed_bias_3p.gz
0Hour_ATCACG_L002_R1_001.extract.quant/aux_info/fld.gz
0Hour_ATCACG_L002_R1_001.extract.quant/aux_info/expected_bias.gz
0Hour_ATCACG_L002_R1_001.extract.quant/aux_info/meta_info.json
0Hour_ATCACG_L002_R1_001.extract.quant/cmd_info.json
0Hour_ATCACG_L002_R1_001.extract.quant/logs/salmon_quant.log
0Hour_ATCACG_L002_R1_001.extract.quant/libParams/flenDist.txt
0Hour_ATCACG_L002_R1_001.extract.quant/lib_format_counts.json
0Hour_ATCACG_L002_R1_001.extract.quant.counts

The two most interesting files are salmon_quant.log and quant.sf. The latter contains the counts; the former contains the log information from running things.

Working with the counts

The quant.sf files actually contain the relevant information about expression – take a look:

head -20 0Hour_ATCACG_L002_R1_001.extract.quant/quant.sf

The first column contains the transcript names, and the fifth column is what edgeR etc will want - the “raw counts”. However, they’re not in a convenient location / format for edgeR to use; let’s fix that.

Now, grab the script...

curl -L -O https://raw.githubusercontent.com/dib-lab/eel-pond/master/gather-counts.py

and run it:

python ./gather-counts.py

This will give you a bunch of .counts files, processed from the quant.sf files and named for the directory they are in.


LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github. Presentations (PPT/PDF) and PDFs are the property of their respective owners and are under the terms indicated within the presentation.