.. _week_Ten:


***************************
Week Ten          20-Mar-12
***************************

=======
Tuesday
=======

.. note:: Learning Goals:

   * Use Counter to count top BLAST hits
   * Learn to use wget program 
   * Install BioPython, make sure it works
   * Create sample BioPython program to call NCBI efetch utils

|
|
|
Video
-----
.. raw:: html

 <iframe width=675 height=400  src="http://www.uvm.edu/~epscor/jwplayer.php?video=video/3_20_2012_JSC_bio2710.mp4" frameborder="0" allowfullscreen></iframe>

|
|


Lecture
---------

**Use Counter object to count BLAST hits**::
       
       #!/usr/bin/env python

       """

       James Vincent
       March 20, 2012

       Parse BLAST output in table format
       Extract GI numbers and scores
       Count number of times each hit GI occurs

       """

       import sys
       from collections import Counter


       def getScore(oneLine):
	  """  Extract the score valeu from this line
	       Lines look like this: Score is last number.
	
	       1	gi|265679037|ref|NR_029345.1|	0.0	2846
	       1	gi|310975117|ref|NR_036981.1|	0.0	2819
	 
	   """
	
	  myFields = oneLine.split() 
	  thisScore = float(myFields[3])

	  return thisScore

       def getGI(oneLine):
	  """ Extract GI number from BLAST table format line

	       1	gi|265679037|ref|NR_029345.1|	0.0	2846
	       1	gi|310975117|ref|NR_036981.1|	0.0	2819
	  """

	  myFields = oneLine.split() 
	  myIDVals = myFields[1]
	  myGI = myIDVals.split('|')[1]

	  return myGI


       # Get input BLAST formatted file
       blastInfileName = sys.argv[1]
       infile = open(blastInfileName)

       # Counter for hit GIs
       myGICounter = Counter()

       # Loop over each line in the file
       for thisLine in infile.readlines():
	  # Extract score 
	  thisScore = getScore(thisLine)
	  # Extract G
	  thisGI  = getGI(thisLine)
	  myGICounter[thisGI] += 1
	  
       # Print top five hits
       print myGICounter.most_common(5)
   
|
|
|

**Fetch files with wget**::


       You can download week9.fa from this link:

       http://bit.ly/yoK27C
   
       Copy directly to lonestar with  wget
       
       Log in to lonestar and move to a directory where you 
       would like to place the file, then call wget:

       wget <URL> -O <outputFileName>

       Example:
      
       login1$ cd /work/00921/tg801771/JSCBIO2710/lectures/week9/Thurs/
       login1$  wget http://bit.ly/yoK27C -O week9.fa
|
|
|       
**Retrieve BioPython with wget**::

   Open a browser:  biopython.org   

   Find Download link, follow

   Under Files look for latest source tarball version:  biopython-1.59.tar.gz 8,377 Kb -- Source Tarball
 
   Hold mouse over the file download active link and copy the link address (ctrl-clik or two finger click...)

   Move to your $WORK directory on lonestar, create a directory called Software and move into that

   Retrieve the biopython download with wget:

   wget http://biopython.org/DIST/biopython-1.59.tar.gz

   Without any other options the file will be downloaded with the same name
|
|
|
**Install BioPython**::

   
   login1$ cd $WORK
   login1$ mkdir Software
   login1$ cd Software
   login1$ wget http://biopython.org/DIST/biopython-1.59.tar.gz

     --2012-03-13 04:48:40--  http://biopython.org/DIST/biopython-1.59.tar.gz
     Resolving biopython.org... 208.94.50.58
     Connecting to biopython.org|208.94.50.58|:80... connected.
     HTTP request sent, awaiting response... 200 OK
     Length: 8577492 (8.2M) [application/x-gzip]
     Saving to: `biopython-1.59.tar.gz'
     
     100%[================================================================>] 8,577,492   1.42M/s   in 6.1s    
     
     2012-03-13 04:48:46 (1.34 MB/s) - `biopython-1.59.tar.gz' saved [8577492/8577492]

   login1$ tar zxf biopython-1.58.tar.gz 
   login1$ cd biopython-1.58
   login1$ module load python
   login1$ python ./setup.py install --user

     running install
     running build
     running build_py
     running build_ext
     running install_lib
     running install_egg_info
     Removing /home1/00921/tg801771/.local/lib/python2.7/site-packages/biopython-1.58-py2.7.egg-info
     Writing /home1/00921/tg801771/.local/lib/python2.7/site-packages/biopython-1.58-py2.7.egg-info

   login1$ 
   login1$ python
   Python 2.7.1 (r271:86832, Sep 20 2011, 09:42:22) 
   [GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2
   Type "help", "copyright", "credits" or "license" for more information.
   >>> from Bio import Entrez
   >>> 


**Sample eUtils program from Biopython**::

    
    #!/usr/bin/env python

    """

    James Vincent
    Mar 20 , 2012

    efetchSample.py

    Demonstrate use of BioPython efetch

    http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc96

    """

    import sys
    from Bio import Entrez

    Entrez.email = "jim@uvm.edu"     # Always tell NCBI who you are
    handle = Entrez.efetch(db="nucleotide", id="186972394", rettype="gb", retmode="text")
    print handle.read()    


Homework
--------

**Reading**

Biopython Chapter 8  Accessing NCBI’s Entrez databases:
- http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc96   

**Exercises**

**Turn In**

=======
Thursday
=======

|
|
|
Video
-----
.. raw:: html

 <iframe width=675 height=400  src="http://www.uvm.edu/~epscor/jwplayer.php?video=video/3_22_2012_JSC_bio2710.mp4" frameborder="0" allowfullscreen></iframe>

|
|
Lecture
---------

.. note:: Learning Goals:

   * Create simple BioPython program to call NCBI elink utils
   * Create simple BioPython program to call NCBI esummary utils
   * Use elink,esummary together to retrieve taxonomic names
   * Make taxonomic profile of top ten hits from BLAST week9 output

**Build program to fetch taxonomy step by step**::

   * Write simple program that calls elink with one GI
   * Write program that calls esummary with single taxon ID
   * Convert simple programs into single function calls
   * Add functions calls to existing BLAST parsing program


**Use elink to fetch one taxon ID**::

    
    #!/usr/bin/env python

    """
    James Vincent
    Mar 22 , 2012

    elink1.py

    Demonstrate use of BioPython elink
    to retrieve a taxonomic ID number from a GI number

    http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc96
    """

    import sys
    import pprint
    from Bio import Entrez

    Entrez.email = "jim@uvm.edu"     # Always tell NCBI who you are

    # Fetch one taxon ID from entrez given input gi number
    handle = Entrez.elink(dbfrom="nucleotide",db="taxonomy",id="343198516")

    print

    # default is to return XML in an open file handle
    for line in handle.readlines():
      print line,

    print

**Results**::

   
   login2$ ./elink1.py 

   <?xml version="1.0"?>
   <!DOCTYPE eLinkResult PUBLIC "-//NLM//DTD eLinkResult, 23 November 2010//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eLink_101123.dtd">
   <eLinkResult>

	   <LinkSet>
		   <DbFrom>nuccore</DbFrom>
		   <IdList>
			   <Id>343198516</Id>
		   </IdList>
		   <LinkSetDb>
			   <DbTo>taxonomy</DbTo>
			   <LinkName>nuccore_taxonomy</LinkName>
			   <Link>
				   <Id>1309</Id>
			   </Link>
		   </LinkSetDb>
	   </LinkSet>
   </eLinkResult>


**Convert XML from Entrez into easy to use python data structure**::

   
   #!/usr/bin/env python

   """

   James Vincent
   Mar 22 , 2012

   elink.py

   Demonstrate use of BioPython elink
   to retrieve a taxonomic ID number from a GI number

   http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc96
   """

   import sys
   import pprint
   from Bio import Entrez

   Entrez.email = "jim@uvm.edu"     # Always tell NCBI who you are

   # Fetch one taxon ID from entrez given input gi number
   # default is to return XML in an open file handle
   handle = Entrez.elink(dbfrom="nucleotide",db="taxonomy",id="343198516")

   # Entrez.read call parses XML into python data structure
   record = Entrez.read(handle)
   print
   print record
   print
   print
   for thisKey in record[0].keys():
     print 'Key: ',thisKey, '\tValue: ',record[0][thisKey]

   print '\n\n'

**Results**::

    login2$ ./elink2.py 

    
    [{u'DbFrom': 'nuccore', u'IdList': ['343198516'], u'LinkSetDbHistory': [], u'LinkSetDb': [{u'DbTo': 'taxonomy', u'Link': [{u'Id': '1309'}], u'LinkName': 'nuccore_taxonomy'}]}]


    Key:  DbFrom 	Value:  nuccore
    Key:  IdList 	Value:  ['343198516']
    Key:  LinkSetDbHistory 	Value:  []
    Key:  LinkSetDb 	Value:  [{u'DbTo': 'taxonomy', u'Link': [{u'Id': '1309'}], u'LinkName': 'nuccore_taxonomy'}]

   
**Extract taxon ID from python data structure**::
 
    
    #!/usr/bin/env python

    """

    James Vincent
    Mar 22 , 2012

    elink3.py

    Demonstrate use of BioPython elink
    to retrieve a taxonomic ID number from a GI number

    http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc96
    """

    import sys
    import pprint
    from Bio import Entrez

    Entrez.email = "jim@uvm.edu"     # Always tell NCBI who you are

    # Fetch one taxon ID from entrez given input gi number
    # default is to return XML in an open file handle
    handle = Entrez.elink(dbfrom="nucleotide",db="taxonomy",id="343198516")

    # Entrez.read call parses XML into python data structure
    record = Entrez.read(handle)

    myTaxonID = record[0]["LinkSetDb"][0]["Link"][0]["Id"]
    print 'taxon ID: ',myTaxonID

**Results**::

  login2$ ./elink3.py 
  taxon ID:  1309
  login2$  


**Use esummary to fetch taxonomic names**::

   
   #!/usr/bin/env python

   """

   James Vincent
   Mar 22 , 2012

   esummary1.py

   Demonstrate use of BioPython Entrez esummary
   http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc96
   """

   import sys
   from Bio import Entrez

   Entrez.email = "jim@uvm.edu"     # Always tell NCBI who you are

   # Get genus and species  as string from Entrez given input taxon ID number """

   handle = Entrez.read(Entrez.esummary(db="taxonomy", id="1309"))
   print
   print handle   
   print

**Results**::
 
    
    login2$ ./esummary1.py 

    [{'StructNumber': 101, 'Division': 'firmicutes', 'ProtNumber': 13860, 'CommonName': '', 'TaxId': 1309, 'Rank': 'species', 'Species': 'mutans', u'Item': [], 'GeneNumber': 4030, 'ScientificName': 'Streptococcus mutans', 'GenNumber': 4, 'Subsp': '', 'NucNumber': 13851, 'Genus': 'Streptococcus', u'Id': '1309'}]


**Extract genus and species as strings from data structure**::

    
    #!/usr/bin/env python

    """

    James Vincent
    Mar 22 , 2012

    esummary1.py

    Demonstrate use of BioPython Entrez esummary
    http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc96
    """

    import sys
    from Bio import Entrez

    Entrez.email = "jim@uvm.edu"     # Always tell NCBI who you are

    # Get genus and species  as string from Entrez given input taxon ID number """
    myTaxonID = 1309
    handle = Entrez.read(Entrez.esummary(db="taxonomy", id=myTaxonID))
    myGenus = handle[0]["Genus"]
    mySpecies = handle[0]["Species"]

    print 'Taxon ID: ',myTaxonID,' Genus: ',myGenus,' Species: ',mySpecies

**Results**::

   
   login2$ ./esummary2.py 
   Taxon ID:  1309  Genus:  Streptococcus  Species:  mutans
   login2$ 


Homework
--------

**Reading**

Biopython Chapter 8  Accessing NCBI’s Entrez databases:
- http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc96   

**Exercises**

- Write the programs given above for Thursday's lecture

**Turn In**

Write a single python program based on BLAST parser for table formatted output:
  * write a function to parse each line and return the GI
  * use a Counter object to count how many times each GI number was seen
  * Write a function (based on the programs from Thurs lecture above) ::
     
      Given a GI number as input returns a taxon ID
  * Write a second function (based on the programs from Thurs lecture above) ::
     
      Given a taxon ID number as input returns genus and species as text strings

  * Extract the top five GI numbers and their counts and do the following::
  
     Extract just the GI number from your counter
     Fetch the taxon ID from the GI number
     Fetch genus and species from taxon ID
     print GI number, count of hits, genus, species