.. _week_Seven: *************************** Week Seven 28-Feb-12 *************************** ======= Tuesday ======= .. note:: Learning Goals: * Log in to lonestar.tacc.teragrid.org * recreate working directories * submit a job through qsub Lecture --------- **Texas Advanced Computing Center: TACC** **lonestar.tacc.teragrid.org** Log in to the TACC lonestar cluster lonestar.tacc.teragrid.org:: You should have received login details from XSEDE for your new account. jjv5$ ssh tg801771@lonestar.tacc.teragrid.org Make sure we are using the bash shell:: login1$ echo $SHELL /bin/bash # If needed we can change the defualt shell to bash: login1$ chsh -l /bin/sh /bin/bash /sbin/nologin /bin/tcsh /bin/csh /bin/ksh /bin/zsh **Recreate directory structure** Make new directories for quiz,homework,projects:: login2$ mkdir quiz homework projects login2$ ls homework projects quiz Create any other directories as needed **Transfer files from AWS EC2 server to lonestar** Open a second terminal window:: # Log in to the EC2 server $ ssh ec2-23-20-18-242.compute-1.amazonaws.com jjv5@ec2-23-20-18-242.compute-1.amazonaws.com's password: $ cd lectures/ $ ls week5 $ cd week5/ $ ls Thurs Tues $ cd Thurs/ $ ls # use sftp to connect to lonestar $ sftp tg801771@lonestar.tacc.teragrid.org Connecting to lonestar.tacc.teragrid.org... The authenticity of host 'lonestar.tacc.teragrid.org (129.114.53.21)' can't be established. RSA key fingerprint is 5c:36:42:99:aa:2d:52:58:70:3a:20:c2:3a:33:e4:2f. Are you sure you want to continue connecting (yes/no)? yes Warning: Permanently added 'lonestar.tacc.teragrid.org,129.114.53.21' (RSA) to the list of known hosts. Password: # transfer files as needed sftp> cd lectures sftp> cd week5 sftp> cd Thurs sftp> lls 4 example2.py example4.py myNumbers.txt runBlast.sh week5.fa.blast.asn example1.py example3.py example5.py parseBlast.py week5.fa sftp> put runBlast.sh Uploading runBlast.sh to /home1/00921/tg801771/lectures/week5/Thurs/runBlast.sh runBlast.sh 100% 815 0.8KB/s 00:00 sftp> **Create a job script** Create the script runHello.sh shown below:: #!/bin/bash #$ -pe 1way 12 # 12 cores per node - must take them all #$ -q development # Queue name #$ -N helloWorld #$ -A TG-MCB120034 #$ -V # inherit submission env #$ -j y # combine stderr & stdout into stdout #$ -o $JOB_NAME.o$JOB_ID # Name of the output file (eg. myMPI.oJobID) #$ -l h_rt=00:05:00 # Run time (hh:mm:ss) #$ -M jjv5.jjv5@gmail.com #$ -m bea echo "Hello, I am running" date hostname **Submit the job to the development queue** The queue is specifiec in the job script itself:: qsub runHello.sh Monitor the job with th qstat command:: login2$ qstat job-ID prior name user state submit/start at queue slots ja-task-ID ----------------------------------------------------------------------------------------------------------------- 479531 0.00000 helloWorld tg801771 qw 02/28/2012 05:10:32 12 Homework --------- **Reading** - http://www.tacc.utexas.edu/user-services/user-guides/lonestar-user-guide - read the first several sections through File Systems - pay attention to the File Systems section description of $WORK,$HOME,$SCRATCH - the rest of the material is just an overview and will not be on a quiz **Exercises** **Turn In** - transfer the bash shell scripts, python programs and 16S blast database from the AWS EC2 server to lonestar create a shell script to run on lonestar using qsub that does the following:: cd to your home directory list all files print the date - be sure to use the proper qsub options and resource specifications in your script - use the development queue to make sure the job runs properly - when you are sure it runs correctly, change the queue name to 'normal' - use qstat to monitor how long it takes before your job runs in the 'normal' queue ======= Thursday ======= .. note:: Learning Goals: * Run BLAST from shell script * reformat and parse BLAST output * look up taxon ID from blast hit GI Video ------ .. raw:: html Lecture --------- **Run BLAST vs 16S database**:: #!/bin/bash # James Vincent # March 1, 2012 # # # Run blast on week5.fa vs 16SMicrobial database # Reformat output to include Query seq-id, subject seq-id, score and e-value # # Database DB=/home/jjv5/blast/databases/16SMicrobial # Query QUERY=week5.fa OUTFILE=$QUERY.blast.asn # BLAST output format: 11 is ASN, 6 is table no header OUTFMT=11 # BLAST programs BLASTN=/mnt/blast/ncbi-blast-2.2.25+/bin/blastn BLAST_FORMATTER=/mnt/blast/ncbi-blast-2.2.25+/bin/blast_formatter BLASTDBCMD=/mnt/blast/ncbi-blast-2.2.25+/bin/blastdbcmd # Run blast #$BLASTN -db $DB -query $QUERY -outfmt $OUTFMT -out $OUTFILE # Reformat ASN to hit custom hit table #$BLAST_FORMATTER -archive $OUTFILE -outfmt "6 qseqid sseqid evalue bitscore" -out $OUTFILE.table #$BLAST_FORMATTER -archive $OUTFILE -outfmt "6 sgi" -out $OUTFILE.table.sgi # default format values: 'qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore' # for i in `cat week5.fa.blast.asn.table.sgi` do $BLASTDBCMD -db $DB -entry $i -outfmt "%g %T " done **Parse BLAST output with python program**:: #!/usr/bin/env python """ James Vincent Mar 1 , 2012 parseBlast.py Open a text file loop over lines split lines into fields Sum numbers from certain field """ import sys # Get file name myInfileName = sys.argv[1] infile = open(myInfileName) mySum = 0.0 myCount = 0 # loop over each line in the file for thisLine in infile.readlines(): # BLAST input file has hit lines like this: # 1 gi|219856848|ref|NR_024667.1| 0.0 2551 myFields = thisLine.strip().split() thisScore = int(myFields[3]) # Accumulate scores greater than 3 if thisScore > 2600: # accumulate scores mySum = mySum + thisScore # count number of scores matching myCount = myCount + 1 # Print sum, count and average print "Sum is: ",mySum print "Count is: ",myCount print "Average is: ",mySum/myCount **Use blastdbcmd to look up taxonid**:: #!/bin/bash # James Vincent # March 1, 2012 # # # Get taxonimic ID of BLAST hit from GI number # # Database DB=/home/jjv5/blast/databases/16SMicrobial # Query QUERY=week5.fa OUTFILE=$QUERY.blast.asn.table # BLAST programs BLASTDBCMD=/mnt/blast/ncbi-blast-2.2.25+/bin/blastdbcmd # Extract taxon ID $BLASTDBCMD -db $DB -entry 310975189 -outfmt "%g %T " **Look up taxon ids from all blast hits**:: #!/bin/bash # James Vincent # March 1, 2012 # # Loop over all lines in BLAST output with hit GI # Get taxonimic ID of BLAST hit from GI number # # Database DB=/home/jjv5/blast/databases/16SMicrobial # Query QUERY=week5.fa OUTFILE=$QUERY.blast.asn.table # BLAST programs BLASTDBCMD=/mnt/blast/ncbi-blast-2.2.25+/bin/blastdbcmd # Extract taxon ID for i in `cat $OUTFILE.sgi` do $BLASTDBCMD -db $DB -entry $i -outfmt "%g %T " done **Look up taxon id from python program**:: Homework --------- **Reading** - http://docs.python.org/library/subprocess.html - http://www.ncbi.nlm.nih.gov/books/NBK1763/ Section 4.6.10 blastdbcmd - http://bash.cyberciti.biz/guide/For_loop **Exercises** **Turn In** - Complete the exercises in http://bash.cyberciti.biz/guide/For_loop - Put each example in a separate shell script and leav them in your homework directory - Write a python program that parses BLAST output in table format:: Use the input file week5.fa from class to run BLAST against 16S database Reformat the BLAST output into table format that includes the bitscore In your python program select only BLAST output hits that have a bitscore greater than 2000