.. _week_Ten: *************************** Week Ten 20-Mar-12 *************************** ======= Tuesday ======= .. note:: Learning Goals: * Use Counter to count top BLAST hits * Learn to use wget program * Install BioPython, make sure it works * Create sample BioPython program to call NCBI efetch utils | | | Video ----- .. raw:: html | | Lecture --------- **Use Counter object to count BLAST hits**:: #!/usr/bin/env python """ James Vincent March 20, 2012 Parse BLAST output in table format Extract GI numbers and scores Count number of times each hit GI occurs """ import sys from collections import Counter def getScore(oneLine): """ Extract the score valeu from this line Lines look like this: Score is last number. 1 gi|265679037|ref|NR_029345.1| 0.0 2846 1 gi|310975117|ref|NR_036981.1| 0.0 2819 """ myFields = oneLine.split() thisScore = float(myFields[3]) return thisScore def getGI(oneLine): """ Extract GI number from BLAST table format line 1 gi|265679037|ref|NR_029345.1| 0.0 2846 1 gi|310975117|ref|NR_036981.1| 0.0 2819 """ myFields = oneLine.split() myIDVals = myFields[1] myGI = myIDVals.split('|')[1] return myGI # Get input BLAST formatted file blastInfileName = sys.argv[1] infile = open(blastInfileName) # Counter for hit GIs myGICounter = Counter() # Loop over each line in the file for thisLine in infile.readlines(): # Extract score thisScore = getScore(thisLine) # Extract G thisGI = getGI(thisLine) myGICounter[thisGI] += 1 # Print top five hits print myGICounter.most_common(5) | | | **Fetch files with wget**:: You can download week9.fa from this link: http://bit.ly/yoK27C Copy directly to lonestar with wget Log in to lonestar and move to a directory where you would like to place the file, then call wget: wget -O Example: login1$ cd /work/00921/tg801771/JSCBIO2710/lectures/week9/Thurs/ login1$ wget http://bit.ly/yoK27C -O week9.fa | | | **Retrieve BioPython with wget**:: Open a browser: biopython.org Find Download link, follow Under Files look for latest source tarball version: biopython-1.59.tar.gz 8,377 Kb -- Source Tarball Hold mouse over the file download active link and copy the link address (ctrl-clik or two finger click...) Move to your $WORK directory on lonestar, create a directory called Software and move into that Retrieve the biopython download with wget: wget http://biopython.org/DIST/biopython-1.59.tar.gz Without any other options the file will be downloaded with the same name | | | **Install BioPython**:: login1$ cd $WORK login1$ mkdir Software login1$ cd Software login1$ wget http://biopython.org/DIST/biopython-1.59.tar.gz --2012-03-13 04:48:40-- http://biopython.org/DIST/biopython-1.59.tar.gz Resolving biopython.org... 208.94.50.58 Connecting to biopython.org|208.94.50.58|:80... connected. HTTP request sent, awaiting response... 200 OK Length: 8577492 (8.2M) [application/x-gzip] Saving to: `biopython-1.59.tar.gz' 100%[================================================================>] 8,577,492 1.42M/s in 6.1s 2012-03-13 04:48:46 (1.34 MB/s) - `biopython-1.59.tar.gz' saved [8577492/8577492] login1$ tar zxf biopython-1.58.tar.gz login1$ cd biopython-1.58 login1$ module load python login1$ python ./setup.py install --user running install running build running build_py running build_ext running install_lib running install_egg_info Removing /home1/00921/tg801771/.local/lib/python2.7/site-packages/biopython-1.58-py2.7.egg-info Writing /home1/00921/tg801771/.local/lib/python2.7/site-packages/biopython-1.58-py2.7.egg-info login1$ login1$ python Python 2.7.1 (r271:86832, Sep 20 2011, 09:42:22) [GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio import Entrez >>> **Sample eUtils program from Biopython**:: #!/usr/bin/env python """ James Vincent Mar 20 , 2012 efetchSample.py Demonstrate use of BioPython efetch http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc96 """ import sys from Bio import Entrez Entrez.email = "jim@uvm.edu" # Always tell NCBI who you are handle = Entrez.efetch(db="nucleotide", id="186972394", rettype="gb", retmode="text") print handle.read() Homework -------- **Reading** Biopython Chapter 8 Accessing NCBI’s Entrez databases: - http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc96 **Exercises** **Turn In** ======= Thursday ======= | | | Video ----- .. raw:: html | | Lecture --------- .. note:: Learning Goals: * Create simple BioPython program to call NCBI elink utils * Create simple BioPython program to call NCBI esummary utils * Use elink,esummary together to retrieve taxonomic names * Make taxonomic profile of top ten hits from BLAST week9 output **Build program to fetch taxonomy step by step**:: * Write simple program that calls elink with one GI * Write program that calls esummary with single taxon ID * Convert simple programs into single function calls * Add functions calls to existing BLAST parsing program **Use elink to fetch one taxon ID**:: #!/usr/bin/env python """ James Vincent Mar 22 , 2012 elink1.py Demonstrate use of BioPython elink to retrieve a taxonomic ID number from a GI number http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc96 """ import sys import pprint from Bio import Entrez Entrez.email = "jim@uvm.edu" # Always tell NCBI who you are # Fetch one taxon ID from entrez given input gi number handle = Entrez.elink(dbfrom="nucleotide",db="taxonomy",id="343198516") print # default is to return XML in an open file handle for line in handle.readlines(): print line, print **Results**:: login2$ ./elink1.py nuccore 343198516 taxonomy nuccore_taxonomy 1309 **Convert XML from Entrez into easy to use python data structure**:: #!/usr/bin/env python """ James Vincent Mar 22 , 2012 elink.py Demonstrate use of BioPython elink to retrieve a taxonomic ID number from a GI number http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc96 """ import sys import pprint from Bio import Entrez Entrez.email = "jim@uvm.edu" # Always tell NCBI who you are # Fetch one taxon ID from entrez given input gi number # default is to return XML in an open file handle handle = Entrez.elink(dbfrom="nucleotide",db="taxonomy",id="343198516") # Entrez.read call parses XML into python data structure record = Entrez.read(handle) print print record print print for thisKey in record[0].keys(): print 'Key: ',thisKey, '\tValue: ',record[0][thisKey] print '\n\n' **Results**:: login2$ ./elink2.py [{u'DbFrom': 'nuccore', u'IdList': ['343198516'], u'LinkSetDbHistory': [], u'LinkSetDb': [{u'DbTo': 'taxonomy', u'Link': [{u'Id': '1309'}], u'LinkName': 'nuccore_taxonomy'}]}] Key: DbFrom Value: nuccore Key: IdList Value: ['343198516'] Key: LinkSetDbHistory Value: [] Key: LinkSetDb Value: [{u'DbTo': 'taxonomy', u'Link': [{u'Id': '1309'}], u'LinkName': 'nuccore_taxonomy'}] **Extract taxon ID from python data structure**:: #!/usr/bin/env python """ James Vincent Mar 22 , 2012 elink3.py Demonstrate use of BioPython elink to retrieve a taxonomic ID number from a GI number http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc96 """ import sys import pprint from Bio import Entrez Entrez.email = "jim@uvm.edu" # Always tell NCBI who you are # Fetch one taxon ID from entrez given input gi number # default is to return XML in an open file handle handle = Entrez.elink(dbfrom="nucleotide",db="taxonomy",id="343198516") # Entrez.read call parses XML into python data structure record = Entrez.read(handle) myTaxonID = record[0]["LinkSetDb"][0]["Link"][0]["Id"] print 'taxon ID: ',myTaxonID **Results**:: login2$ ./elink3.py taxon ID: 1309 login2$ **Use esummary to fetch taxonomic names**:: #!/usr/bin/env python """ James Vincent Mar 22 , 2012 esummary1.py Demonstrate use of BioPython Entrez esummary http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc96 """ import sys from Bio import Entrez Entrez.email = "jim@uvm.edu" # Always tell NCBI who you are # Get genus and species as string from Entrez given input taxon ID number """ handle = Entrez.read(Entrez.esummary(db="taxonomy", id="1309")) print print handle print **Results**:: login2$ ./esummary1.py [{'StructNumber': 101, 'Division': 'firmicutes', 'ProtNumber': 13860, 'CommonName': '', 'TaxId': 1309, 'Rank': 'species', 'Species': 'mutans', u'Item': [], 'GeneNumber': 4030, 'ScientificName': 'Streptococcus mutans', 'GenNumber': 4, 'Subsp': '', 'NucNumber': 13851, 'Genus': 'Streptococcus', u'Id': '1309'}] **Extract genus and species as strings from data structure**:: #!/usr/bin/env python """ James Vincent Mar 22 , 2012 esummary1.py Demonstrate use of BioPython Entrez esummary http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc96 """ import sys from Bio import Entrez Entrez.email = "jim@uvm.edu" # Always tell NCBI who you are # Get genus and species as string from Entrez given input taxon ID number """ myTaxonID = 1309 handle = Entrez.read(Entrez.esummary(db="taxonomy", id=myTaxonID)) myGenus = handle[0]["Genus"] mySpecies = handle[0]["Species"] print 'Taxon ID: ',myTaxonID,' Genus: ',myGenus,' Species: ',mySpecies **Results**:: login2$ ./esummary2.py Taxon ID: 1309 Genus: Streptococcus Species: mutans login2$ Homework -------- **Reading** Biopython Chapter 8 Accessing NCBI’s Entrez databases: - http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc96 **Exercises** - Write the programs given above for Thursday's lecture **Turn In** Write a single python program based on BLAST parser for table formatted output: * write a function to parse each line and return the GI * use a Counter object to count how many times each GI number was seen * Write a function (based on the programs from Thurs lecture above) :: Given a GI number as input returns a taxon ID * Write a second function (based on the programs from Thurs lecture above) :: Given a taxon ID number as input returns genus and species as text strings * Extract the top five GI numbers and their counts and do the following:: Extract just the GI number from your counter Fetch the taxon ID from the GI number Fetch genus and species from taxon ID print GI number, count of hits, genus, species