Installation and configuration: Amazon Web Services¶
Boot up an m4.xlarge machine from Amazon Web Services running Ubuntu 15.10 (e.g. us-west AMI ami-05384865 or us-east ami-002f0f6a), and increase the root volume size to ~100 GB. The m4.xlarge machines have 16 GB of RAM and 4 CPUs, which is enough to complete the assembly of the Nematostella data set. If you are using your own data, be aware of your space requirements and obtain an appropriately sized machine (“instance”) and storage (“volume”).
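Once the machine is up, it can be worth confirming that it really has the resources described above before starting a long assembly. The following is a minimal sketch, assuming a Linux instance with /proc/meminfo available. The function names (meets_minimum, check_instance) and the free-disk floor of 50 GB are made up for illustration and are not part of this protocol; the CPU and RAM figures are the m4.xlarge numbers quoted above.

```shell
# Thresholds: CPU and RAM follow the m4.xlarge description above.
min_cpus=4
min_ram_gb=15      # a "16 GB" instance usually reports a little less usable memory
min_disk_gb=50     # hypothetical floor for free space; choose one that fits your data

# meets_minimum LABEL ACTUAL REQUIRED
# Print a status line; return non-zero if ACTUAL is below REQUIRED.
meets_minimum() {
    if [ "$2" -ge "$3" ]; then
        echo "OK   $1: $2 (need >= $3)"
    else
        echo "LOW  $1: $2 (need >= $3)"
        return 1
    fi
}

# check_instance: compare this machine against the thresholds above.
# Prints one line per resource; returns non-zero if any resource is short.
check_instance() {
    local cpus ram_gb disk_gb status=0
    cpus=$(nproc)
    # MemTotal is reported in kB; convert to whole GB (rounded down)
    ram_gb=$(awk '/^MemTotal:/ {printf "%d", $2 / 1024 / 1024}' /proc/meminfo)
    # Column 4 of "df -Pk" is the available space in 1 kB blocks
    disk_gb=$(df -Pk / | awk 'NR == 2 {printf "%d", $4 / 1024 / 1024}')
    meets_minimum "CPUs" "$cpus" "$min_cpus" || status=1
    meets_minimum "RAM (GB)" "$ram_gb" "$min_ram_gb" || status=1
    meets_minimum "free disk on / (GB)" "$disk_gb" "$min_disk_gb" || status=1
    return "$status"
}
```

After pasting these definitions into your shell, run check_instance; any line starting with LOW points at a resource that may be too small for your data.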
Install software¶
On the new machine, run the following commands to update the package lists and install the base software:
sudo apt-get update && \
sudo apt-get -y install screen git curl gcc make g++ python-dev unzip \
default-jre pkg-config libncurses5-dev r-base-core r-cran-gplots \
python-matplotlib python-pip python-virtualenv sysstat fastqc \
trimmomatic bowtie samtools blast2 wget bowtie2 openjdk-8-jre \
hmmer ruby
Install khmer from its source code.
cd ~/
python2.7 -m virtualenv pondenv
source pondenv/bin/activate
cd pondenv
pip install -U setuptools
git clone --branch v2.0 https://github.com/dib-lab/khmer.git
cd khmer
make install
The use of virtualenv allows us to install Python software without having root access. If you come back to this protocol in a different terminal session, you will need to run:
source ~/pondenv/bin/activate
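If you are not sure whether the virtualenv is active in the current terminal, you can check the VIRTUAL_ENV variable, which the activate script sets. The sketch below assumes the environment directory is named pondenv, as above; the function names pondenv_active and ensure_pondenv are hypothetical helpers, not part of virtualenv.

```shell
# pondenv_active: return 0 if a virtualenv whose directory is named "pondenv"
# is active in this shell, non-zero otherwise.
pondenv_active() {
    [ -n "${VIRTUAL_ENV:-}" ] && [ "$(basename "$VIRTUAL_ENV")" = "pondenv" ]
}

# ensure_pondenv: print a reminder if the environment is not active.
ensure_pondenv() {
    if pondenv_active; then
        echo "pondenv is active"
    else
        echo "pondenv is not active; run: source ~/pondenv/bin/activate"
        return 1
    fi
}
```

Running ensure_pondenv at the start of a session tells you whether you still need to source the activate script.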
Installing Trinity¶
To install Trinity:
cd ${HOME}
wget https://github.com/trinityrnaseq/trinityrnaseq/archive/Trinity-v2.3.2.tar.gz \
-O trinity.tar.gz
tar xzf trinity.tar.gz
cd trinityrnaseq*/
make |& tee trinity-build.log
Assuming it succeeds, modify the path appropriately in your virtualenv activation setup:
echo "export PATH=\"\$PATH:$(pwd)\"" >> ~/pondenv/bin/activate
source ~/pondenv/bin/activate
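Note that appending to the activate script with echo adds a new line every time the command is run, so repeating an install step leaves duplicate PATH entries behind. One possible way around this is a small helper that only appends when the line is not already there. This is a sketch, not part of the protocol: add_to_activate is a hypothetical name, and the default file location is assumed to be the pondenv created above.

```shell
# add_to_activate DIR [ACTIVATE_FILE]
# Append an 'export PATH="$PATH:DIR"' line to the activate script,
# unless exactly that line is already present.
add_to_activate() {
    local dir="$1"
    local activate="${2:-$HOME/pondenv/bin/activate}"
    # \$PATH is kept literal so it is expanded when the script is sourced
    local line="export PATH=\"\$PATH:$dir\""
    if grep -qxF -- "$line" "$activate" 2>/dev/null; then
        echo "already present: $dir"
    else
        echo "$line" >> "$activate"
        echo "added: $dir"
    fi
}
```

For example, from inside the Trinity directory, add_to_activate "$(pwd)" followed by source ~/pondenv/bin/activate would have the same effect as the commands above, and running it a second time would change nothing.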
You will also need to set the default Java version to 1.8:
sudo update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
Install transrate¶
We use transrate to evaluate assemblies. Install it, together with NCBI BLAST+ 2.3.0:
cd
curl -LO https://bintray.com/artifact/download/blahah/generic/transrate-1.0.3-linux-x86_64.tar.gz
tar -zxf transrate-1.0.3-linux-x86_64.tar.gz
echo 'export PATH=$PATH:"$HOME/transrate-1.0.3-linux-x86_64"' >> ~/pondenv/bin/activate
curl -LO ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.3.0/ncbi-blast-2.3.0+-x64-linux.tar.gz
tar -zxf ncbi-blast-2.3.0+-x64-linux.tar.gz
echo 'export PATH="$HOME/ncbi-blast-2.3.0+/bin:$PATH"' >> ~/pondenv/bin/activate
source ~/pondenv/bin/activate
Install busco¶
Clone BUSCO and download the metazoa and eukaryota lineage data sets:
cd
git clone https://gitlab.com/ezlab/busco.git
cd busco
echo "export PATH=\"\$PATH:$(pwd)\"" >> ~/pondenv/bin/activate
curl -OL http://busco.ezlab.org/datasets/metazoa_odb9.tar.gz
curl -OL http://busco.ezlab.org/datasets/eukaryota_odb9.tar.gz
tar -xzvf metazoa_odb9.tar.gz
tar -xzvf eukaryota_odb9.tar.gz
source ~/pondenv/bin/activate
Install salmon¶
We will use Salmon to quantify expression of transcripts. Salmon is a new breed of software for quantifying RNAseq reads that is both very fast and takes transcript length into account (Patro et al. 2015).
cd
curl -LO https://github.com/COMBINE-lab/salmon/releases/download/v0.7.2/Salmon-0.7.2_linux_x86_64.tar.gz
tar -xvzf Salmon-0.7.2_linux_x86_64.tar.gz
cd Salmon*/bin
echo "export PATH=\"\$PATH:$(pwd)\"" >> ~/pondenv/bin/activate
source ~/pondenv/bin/activate
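With everything installed, a quick way to confirm that the PATH changes took effect is to ask the shell where it finds each program. The sketch below uses the standard command -v builtin; check_tools is a hypothetical helper name, and the executable names in the usage note (Trinity, transrate, blastn, salmon) are the ones we would expect from the packages above, so adjust them if your versions differ.

```shell
# check_tools NAME...
# Report where each command is found on the PATH.
# Returns 1 if any of them is missing, 0 otherwise.
check_tools() {
    local missing=0 tool path
    for tool in "$@"; do
        if path=$(command -v "$tool"); then
            echo "found    $tool -> $path"
        else
            echo "MISSING  $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, check_tools Trinity transrate blastn salmon should print a "found" line for each program; a MISSING line usually means the activate script was not re-sourced or the corresponding PATH line was written from the wrong directory.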
Load your data onto /mnt/work/data¶
Load your data into /mnt/work/data. You may need to make the /mnt/ directory writeable first:
sudo chmod a+rwxt /mnt
Then create the subdirectories:
cd /mnt
mkdir -p work work/data
cd /mnt/work/data
Define your $PROJECT variable to be the location of your work directory; in this case, it will be /mnt/work:
export PROJECT=/mnt/work
Now load your data in!
Note
If you want to try things out with a small test data set, you can use a subset of the Nematostella data from Tulin et al. (2013):
cd /mnt/work
curl -O https://s3.amazonaws.com/public.ged.msu.edu/mrnaseq-subset.tar
cd data
tar xvf ../mrnaseq-subset.tar
Check that your data is where it should be¶
Check:
ls $PROJECT/data
If you see all the files you think you should, good! Otherwise, debug.
If you’re using the Tulin et al. subset downloaded above, you should see a bunch of files like:
0Hour_ATCACG_L002_R1_001.fastq.gz
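Beyond listing the directory, it can help to confirm that each compressed file is intact, since an interrupted download or copy often leaves a truncated .gz file behind. The following is a minimal sketch that assumes your reads are stored as *.fastq.gz files, like the test data; check_fastq_dir is a hypothetical helper name. It relies only on gzip -t, which tests the integrity of a compressed file without writing anything.

```shell
# check_fastq_dir DIR
# Test every *.fastq.gz file in DIR with "gzip -t".
# Returns 0 only if at least one file was found and none is corrupt.
check_fastq_dir() {
    local dir="$1" f count=0 bad=0
    for f in "$dir"/*.fastq.gz; do
        [ -e "$f" ] || continue        # the glob matched nothing
        count=$((count + 1))
        if gzip -t "$f" 2>/dev/null; then
            echo "ok       $f"
        else
            echo "CORRUPT  $f"
            bad=$((bad + 1))
        fi
    done
    echo "$count file(s) checked, $bad corrupt"
    [ "$count" -gt 0 ] && [ "$bad" -eq 0 ]
}
```

Run it as check_fastq_dir "$PROJECT/data". If any file is reported as CORRUPT, or no files are found at all, reload the data before moving on.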
LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github. Presentations (PPT/PDF) and PDFs are the property of their respective owners and are under the terms indicated within the presentation.