.. _week_Eleven:


***************************
Week Eleven       27-Mar-12
***************************

=======
Tuesday
=======

.. note:: Learning Goals:

   * Create complete python program to parse BLAST with taxonomy
   * Reformat BLAST output to include percent identity
   * Modify python program to filter hits based on % identity
   * Output top hits based on different hit criteria
|
|
|
Video
-----
.. raw:: html

 <iframe width=675 height=400  src="http://www.uvm.edu/~epscor/jwplayer.php?video=video/3_27_2012_JSC_bio2710.mp4" frameborder="0" allowfullscreen></iframe>

|
|


Lecture
---------

**Full python program to count hits and fetch taxonomy**::
          
   
   #!/usr/bin/env python

   """

   James Vincent
   Mar 27 , 2012

   parseBlast.py
   Version 2.0

   Open a BLAST text file
   loop over lines
   keep count of hits per GI
   split lines into fields
   Extract GI number from each line
   Convert GI to taxon ID number
   convert taxon ID to genus and species text strings

   Output: top five hits by count of GI with genus and species for each
   """

   import sys
   from collections import Counter
   from Bio import Entrez


   topHits = 10                  # Number of top hits to output
   Entrez.email = "jim@uvm.edu"  # Always tell NCBI who you are

   def getTaxonID(giNumber):
      """ return taxon id from entrez given input gi number """

      myTaxonList = Entrez.read(Entrez.elink(dbfrom="nucleotide",db="taxonomy",id=giNumber))
      myTaxonID = myTaxonList[0]["LinkSetDb"][0]["Link"][0]["Id"]
      
      return myTaxonID


   def getTaxonGenusSpecies(taxonID):
      """ return genus as string from Entrez given input taxon ID number """

      handle = Entrez.read(Entrez.esummary(db="taxonomy", id=taxonID))
      myGenus = handle[0]["Genus"]
      mySpecies = handle[0]["Species"]

      return myGenus,mySpecies


   def getScore(blastLine):
      """ parse blast output line and return score """

      # BLAST input file has hit lines like this:
      # fmt "6 qseqid sseqid evalue bitscore"
      # 1        gi|219856848|ref|NR_024667.1|   0.0     2551
      myFields = blastLine.strip().split()
    
      thisScore  = float(myFields[3])
    
      return thisScore


   def getGI(blastLine):
      """ parse blast output line and return GI of hit """

      # blast line looks like this:
      # 1        gi|219856848|ref|NR_024667.1|   0.0     2551

      # spit line into fields by white space
      myFields = blastLine.strip().split()

      # split subject (hit) ID field on |
      myIDs = myFields[1].split("|")
      thisGI= myIDs[1]

      return thisGI


   # Get input file name
   myInfileName = sys.argv[1]
   infile = open(myInfileName)

   # Counter for hit GIs
   myGICounter = Counter()

   for thisLine in infile.readlines():
      # Extract score 
      thisScore = getScore(thisLine)
      # Extract GI
      thisGI  = getGI(thisLine)
      # Count this GI hit
      myGICounter[thisGI] += 1

      
   # Print top five hits
   # print myGICounter.most_common(5)
   for thisItem in myGICounter.most_common(topHits):
      thisGI = thisItem[0]
      thisCount = thisItem[1]

      # Get taxon ID
      taxonID = getTaxonID(thisGI)
      # Get genus, species 
      genus,species =  getTaxonGenusSpecies(taxonID)
      print thisGI,thisCount,taxonID,genus,species

|
|
|


**Reformat BLAST output to include percent identity**::

    $BLAST_FORMATTER -archive $OUTFILE -outfmt "6 qseqid sseqid qlen length pident evalue bitscore " -out $OUTFILE.table.pident

|
|
**Add functions to parse percent idenity and hit length**::
   
   #!/usr/bin/env python

   """

   James Vincent
   Mar 27 , 2012

   parseBlast.py
   Version 3.0

   Open a BLAST text file
   loop over lines
   keep count of hits per GI
   split lines into fields
   Extract GI number from each line
   Extract hit length and % idenity
   Only count hits that meet criteria for length and % identity

   Convert GI to taxon ID number
   convert taxon ID to genus and species text strings

   Output: top five hits by count of GI with genus and species for each
   """

   import sys
   from collections import Counter
   from Bio import Entrez


   topHits = 10                  # Number of top hits to output
   minHitIdentity = 100.0         # Min percent idenity of hit to keep

   Entrez.email = "jim@uvm.edu"  # Always tell NCBI who you are

   def getTaxonID(giNumber):
      """ return taxon id from entrez given input gi number """

      myTaxonList = Entrez.read(Entrez.elink(dbfrom="nucleotide",db="taxonomy",id=giNumber))
      myTaxonID = myTaxonList[0]["LinkSetDb"][0]["Link"][0]["Id"]
      
      return myTaxonID


   def getTaxonGenusSpecies(taxonID):
      """ return genus as string from Entrez given input taxon ID number """

      handle = Entrez.read(Entrez.esummary(db="taxonomy", id=taxonID))
      myGenus = handle[0]["Genus"]
      mySpecies = handle[0]["Species"]

      return myGenus,mySpecies

   #  -- BLAST parsing functions assume
   #     blast table format lines look like this
   # qseqid sseqid qlen length pident evalue bitscore 
   #1	gi|343198516|ref|NR_042772.1|	100	100	100.00	1e-47	 185
   #1	gi|219878377|ref|NR_025516.1|	100	90	96.67	4e-37	 150

   def getQueryLength(blastLine):
      """ return length of query from blast line """

      myFields = blastLine.strip().split()
    
      thisQueryLength  = int(myFields[2])
    
      return thisQueryLength


   def getHitLength(blastLine):
      """ return length of hit from blast line """

      myFields = blastLine.strip().split()
    
      thisHitLength  = int(myFields[3])
    
      return thisHitLength


   def getIdent(blastLine):
      """ parse blast output line and return percent identity of alignment """

      myFields = blastLine.strip().split()
    
      thisIdent  = float(myFields[4])
    
      return thisIdent


   def getGI(blastLine):
      """ parse blast output line and return GI of hit """

      myFields = blastLine.strip().split()

      # split subject (hit) ID field on |
      myIDs = myFields[1].split("|")
      thisGI= myIDs[1]

      return thisGI


   # Get input file name
   myInfileName = sys.argv[1]
   infile = open(myInfileName)

   # Counter for hit GIs

   myGICounter = Counter()

   for thisLine in infile.readlines():

      # Extract GI
      thisGI  = getGI(thisLine)

      # Extract hit  and query length
      thisHitLength  = getHitLength(thisLine)
      thisQueryLength  = getQueryLength(thisLine)

      # Extract hit percent identity
      thisIdent  = getIdent(thisLine)

      # Count this GI hit based on conditions
      if (thisHitLength == thisQueryLength)  and (thisIdent >= minHitIdentity):
	 myGICounter[thisGI] += 1


   # Print top five hits
   # print myGICounter.most_common(5)
   for thisItem in myGICounter.most_common(topHits):
      thisGI = thisItem[0]
      thisCount = thisItem[1]
	      
      # Get taxon ID
      taxonID = getTaxonID(thisGI)
      # Get genus, species 
      genus,species =  getTaxonGenusSpecies(taxonID)
      print thisGI,thisCount,taxonID,genus,species

|
|
|


Homework
--------

**Reading**

  Review programs in lecture materials for this week and last


**Exercises**

  Make sure you can write the programs shown in lecture


**Turn In**

Write a python program that parses BLAST table formatted output:
  
  * Use week9.fa BLAST results
  * Format BLAST results in table format and include percent identity
  * Count number of hits at 100% identity and same length as query for each query ID (not GI)
  * Write output that contains two columns: QUERY ID and number of hits::

       # Number of hits at 100% identity, full length of query
       # QUERY ID  Number Hits
       1	23
       2	21
       3	20
       4	14
       5	12


=======
Thursday
=======

|
|
|

.. note:: Learning Goals:

   * Create complete python program to parse BLAST table output and explore hit counts
   * Output hit table with per query hits and %
   * Output top hits based on different hit criteria

|
|
Video
-----
.. raw:: html

 <iframe width=675 height=400  src="http://www.uvm.edu/~epscor/jwplayer.php?video=video/3_29_2012_JSC_bio2710.mp4" frameborder="0" allowfullscreen></iframe>

|
|


Lecture
---------

**Full python program to count hits at 100% length and 100% identity per query**::

   
   #!/usr/bin/env python

   """

   James Vincent
   Mar 29 , 2012

   parseBlastQueryHits.py
   Version 1.0

   Open a BLAST text file
   loop over lines
   split lines into fields
   keep count of hits per query that are 100% length of query
   and 100% identity
   Output two column table with query ID, number of hits

   """

   import sys
   from collections import Counter
   from Bio import Entrez


   topHits = 10                  # Number of top hits to output
   minHitIdentity = 100.0         # Min percent idenity of hit to keep

   Entrez.email = "jim@uvm.edu"  # Always tell NCBI who you are

   def getTaxonID(giNumber):
      """ return taxon id from entrez given input gi number """

      myTaxonList = Entrez.read(Entrez.elink(dbfrom="nucleotide",db="taxonomy",id=giNumber))
      myTaxonID = myTaxonList[0]["LinkSetDb"][0]["Link"][0]["Id"]
      
      return myTaxonID


   def getTaxonGenusSpecies(taxonID):
      """ return genus as string from Entrez given input taxon ID number """

      handle = Entrez.read(Entrez.esummary(db="taxonomy", id=taxonID))
      myGenus = handle[0]["Genus"]
      mySpecies = handle[0]["Species"]

      return myGenus,mySpecies

   #  -- BLAST parsing functions assume
   #     blast table format lines look like this
   # qseqid sseqid qlen length pident evalue bitscore 
   #1	gi|343198516|ref|NR_042772.1|	100	100	100.00	1e-47	 185
   #1	gi|219878377|ref|NR_025516.1|	100	90	96.67	4e-37	 150

   def getQueryLength(blastLine):
      """ return length of query from blast line """

      myFields = blastLine.strip().split()
    
      thisQueryLength  = int(myFields[2])
    
      return thisQueryLength


   def getHitLength(blastLine):
      """ return length of hit from blast line """

      myFields = blastLine.strip().split()
    
      thisHitLength  = int(myFields[3])
    
      return thisHitLength


   def getIdent(blastLine):
      """ parse blast output line and return percent identity of alignment """

      myFields = blastLine.strip().split()
    
      thisIdent  = float(myFields[4])
    
      return thisIdent

   def getGI(blastLine):
      """ parse blast output line and return GI of hit """

      myFields = blastLine.strip().split()

      # split subject (hit) ID field on |
      myIDs = myFields[1].split("|")
      thisGI= myIDs[1]

      return thisGI


   def getQueryID(blastLine):
      """ parse blast output line and return query ID of hit """

      myFields = blastLine.strip().split()

      # split subject (hit) ID field on |
      myQueryID = myFields[0]

      return myQueryID


   # Get input file name
   myInfileName = sys.argv[1]
   infile = open(myInfileName)

   # Counter for hit GIs

   #myGICounter = Counter()
   myQueryCounter = Counter()

   for thisLine in infile.readlines():

      # Extract GI
      # thisGI  = getGI(thisLine)

      # Extract query ID
      thisQueryID  = str(getQueryID(thisLine))

      # Extract hit  and query length
      thisHitLength  = getHitLength(thisLine)
      thisQueryLength  = getQueryLength(thisLine)

      # Extract hit percent identity
      thisIdent  = getIdent(thisLine)

      # Count this GI hit based on conditions
      if (thisHitLength == thisQueryLength)  and (thisIdent >= minHitIdentity):
	 myQueryCounter[thisQueryID] += 1
	
    
   for thisKey in myQueryCounter.keys():
      print thisKey,'\t',myQueryCounter[thisKey]


**Submit multiple jobs to queue on lonestar**::

   
   #!/bin/bash

   #$ -V                # inherit shell environment 
   #$ -l h_rt=00:05:00  # wall time limit
   #$ -q development    # run in development queue - 1 hour max
   #$ -pe 1way 12       # Number of nodes and cores
   #$ -A TG-MCB110143   # TACC account 

   #$ -N Week11_Thurs   # Name of job and stdout files
   #$ -cwd              # Run job from the directory it was submitted
   #$ -j y                

   #------------------------
   #
   # James Vincent
   # March 29, 2012
   #
   # Run blast on week9.fa vs 16SMicrobial database
   # Contains 100 reads from 16S of s_mutans and 100 reads from m.komossens
   # randomly distributed over 16S rRNA, 100bp length reads
   # Reformat output to include Query seq-id, subject seq-id, score and e-value
   #
   #------------------------


   # BLAST programs and variables

   # TACC lonestar uses module system to provide blast
   module load blast
   module load python


   # Database
   DB=$WORK/JSCBIO2710/blast/databases/16SMicrobial

   # Query
   QUERY=week9.fa
   OUTFILE=$QUERY.blast.asn

   # BLAST output format: 11 is ASN, 6 is table no header
   OUTFMT=11

   # BLAST programs loaded by module command
   BLASTN=blastn
   BLAST_FORMATTER=blast_formatter
   BLASTDBCMD=blastdbcmd

   # ---!! Beware comments followed by $:  #$ will be interpeted by qsub 
   # ---   add extra space:  #  $ 

   # Run blast
   $BLASTN -db $DB -query $QUERY -outfmt $OUTFMT -out $OUTFILE

   # Reformat ASN to hit custom hit table
   $BLAST_FORMATTER -archive $OUTFILE -outfmt "6 qseqid sseqid qlen length pident evalue bitscore " -out $OUTFILE.table.pident

   # Parse BLAST output with python program to get best hits
   .parseBlast_v3.py  $OUTFILE.table.pident

**Rename jobs in queue script to keep track**::

   This line in qsub script:

   #$ -N Week11_Thurs   # Name of job and stdout files

   is the name used to create the job output files:

   Week11_Thurs.o522740  
   Week11_Thurs.po522740

  
   Change the job name to clearly describe different jobs:

   #$ -N Week11_Control1   # Name of job and stdout files

   #$ -N Week11_Control2   # Name of job and stdout files

   #$ -N Week11_Sample1    # Name of job and stdout files


Homework
--------

**Reading**
**Exercises**
  
  Write the programs from this week (given above) on your own.
  Make sure you can write small functions and complete programs

**Turn In**::

  Run BLAST jobs on the three FASTA files given below.

  Copy the fasta files to your own working directory.

  Modify your job script to keep track of which job you are running.

  Convert the BLAST output to table format and inclue percent identity.

  Parse the BLAST table output with your program (example given in Tues above).
  to produce one line of output for each hit GI : GI, Count of hits, taxonID, genus, species
  

  Turn in the resulting table for each of three FASTA files:

  /work/00921/tg801771/JSCBIO2710/lectures/week11/Thurs/control1.fa
  /work/00921/tg801771/JSCBIO2710/lectures/week11/Thurs/control2.fa
  /work/00921/tg801771/JSCBIO2710/lectures/week11/Thurs/control3.fa