5. Annotating your transcriptome


Dammit is an annotation pipeline written with pydoit by Camille Scott. Dammit translates an unannotated transcriptome with Transdecoder then uses the following protein databases as evidence for annotation: Pfam-A, Rfam, OrthoDB, uniref90 (uniref is optional with --full).

If a protein dataset is available, this can also be supplied to the dammit pipeline with --user-databases as optional evidence for annotation.

In addition, BUSCO v2 is run (just like in 4-evaluating-assembly.rst – NOTE maybe this is not necessary to run in step 4?), which will compare the gene content in your transcriptome with a lineage-specific data set. The output is a proportion of your transcriptome that matches with the data set, which can be used as an estimate of the completeness of your transcriptome based on evolutionary expectation (Simho et al. 2015). There are several lineage-specific datasets available from the authors of BUSCO. We will use the metazoa dataset for this transcriptome.

Install stuff

# ami-002f0f6a
# Ubuntu 15.10, m4.xlarge with 100GB storage
# In addition to install from 0-install-aws.rst, do these:

sudo apt-get -y install python3-dev python-numpy ruby hmmer unzip \
    infernal ncbi-blast+ liburi-escape-xs-perl emboss liburi-perl \
    build-essential libsm6 libxrender1 libfontconfig1 \
    parallel libx11-dev
sudo gem install crb-blast

Install miniconda

curl -LO https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b
source ~/.bashrc

Create a python 3 environment for dammit

conda create -y -n dammit anaconda python=3.5
source activate dammit

Install shmlast

conda install -y --file <(curl https://raw.githubusercontent.com/camillescott/shmlast/master/environment.txt)
pip install --upgrade pip
pip install shmlast

Install last

curl -LO http://last.cbrc.jp/last-658.zip
unzip last-658.zip
pushd last-658 && make && make install prefix=~ && popd

Install the proper version of GNU parallel:

(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
sudo cp /home/ubuntu/bin/parallel /usr/bin/parallel


curl -LO https://github.com/TransDecoder/TransDecoder/archive/2.0.1.tar.gz
tar -xvzf 2.0.1.tar.gz
cd TransDecoder-2.0.1; make


git clone https://gitlab.com/ezlab/busco.git

Put everything in the path:

echo export PATH=$HOME/last-658/src:$PATH >> /home/ubuntu/miniconda3/bin/activate
echo export PATH=$HOME/last-658/scripts:$PATH >> /home/ubuntu/miniconda3/bin/activate
echo export PATH=$HOME/busco:$PATH >> /home/ubuntu/miniconda3/bin/activate
echo export PATH=$HOME/TransDecoder-2.0.1:$PATH >> /home/ubuntu/miniconda3/bin/activate

Install the proper version of matplotlib

pip install https://pypi.python.org/packages/source/m/matplotlib/matplotlib-1.5.1.tar.gz

Finally, install dammit from the refactor/1.0 branch

pip install https://github.com/camillescott/dammit/archive/refactor/1.0.zip

Install databases (this step alone takes ~15-20 min) # Is there a faster install? # Don’t need everything?

dammit databases --install

By default, the metazoan busco group will be installed. For the eukaryota database, use this:

dammit databases --install --busco-group eukaryota

Make a directory for annotation and put files there

mkdir -p annotation
cd annotation
ln -s ${PROJECT}/assembly/trinity_out_dir/Trinity.fasta .

Download a custom Nematostella vectensis protein database available from JGI:

curl -LO ftp://ftp.ebi.ac.uk/pub/databases/reference_proteomes/QfO/Eukaryota/UP000001593_45351.fasta.gz
#curl -LO ftp://ftp.jgi-psf.org/pub/JGI_data/Nematostella_vectensis/v1.0/annotation/proteins.Nemve1FilteredModels1.fasta.gz
gunzip proteins.Nemve1FilteredModels1.fasta.gz

Putnam NH, Srivastava M, Hellsten U, Dirks B, Chapman J, Salamov A, Terry A, Shapiro H, Lindquist E, Kapitonov VV, Jurka J, Genikhovich G, Grigoriev IV, Lucas SM, Steele RE, Finnerty JR, Technau U, Martindale MQ, Rokhsar DS. (2007) Sea anemone genome reveals ancestral eumetazoan gene repertoire and genomic organization. Science. 317, 86-94.

Run the dammit pipeline

# after trinity

source activate dammit

Run the command:

dammit annotate Trinity.fasta --busco-group metazoa --user-databases proteins.Nemve1FilteredModels1.fasta --n_threads 2 | tee dammit_Trinity_outfile.log

If dammit runs successfully, there will be a directory Trinity.fasta.dammit with ~dozen files inside, including Trinity.fasta.dammit.gff3, Trinity.fasta.dammit.fasta and a data frame matching new annotated contig id with the previous assembler-generated contig id: Trinity.fasta.dammit.namemap.csv. If the above dammit command is run again, there will be a message: **Pipeline is already completed!**

LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github. Presentations (PPT/PDF) and PDFs are the property of their respective owners and are under the terms indicated within the presentation.