Dr. Neil Sarkar

Assistant Professor


Computational Techniques for Studying Contemporary Biomedical and Biodiversity Research Questions

Our research focuses on developing and adapting computational techniques for studying contemporary biomedical and biodiversity research questions. To this end, we are actively involved with developing algorithms and tools to organize different types of biological data such that they can subsequently be used to postulate testable hypotheses. Specific research areas include:

1. Genomics, Sequence Analysis, and Phylogenetics
The deluge of readily available genomic data poses an increasing challenge for traditional evolutionary biology methods to organize life, either from the perspective of genes or organisms. Much of our research involves the continual development of pipelines and algorithms that efficiently posit phylogenetically justifiable classifications. One such algorithm, the Characteristic Attribute Organization System (CAOS), enables character-based classification (e.g., Maximum Parsimony or Maximum Likelihood) of new data onto an existing hierarchical structure without requiring the instantiation of a new tree search. In addition to furthering the CAOS algorithm (which is now being applied to character-based DNA Barcoding), some of our contemporary work focuses on developing pipelines to automate tree-searching algorithms for "simultaneous analyses" (analyses that concurrently explore multiple possible tree topologies from multiple data partitions; we call this pipeline the "Automated Simultaneous Analysis Pipeline" [ASAP]). This has also led to the development of evaluation techniques that address statistical or sampling biases that may occur in multiple-partition analyses. We have applied these studies to rodent malaria, coxsackievirus, and chiton data sets to explore the utility and robustness of these new character-based tree-building and evaluation methods.
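To illustrate the general idea of placing new data onto an existing hierarchy without a new tree search, the following is a minimal Python sketch. It is not the CAOS implementation: the `Node` class, the `diagnostics` mapping (alignment position to expected character state), the scoring rule, and the toy tree are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A clade in a fixed, previously inferred tree (hypothetical structure)."""
    name: str
    # Assumed diagnostic characters: alignment position -> expected state.
    diagnostics: dict[int, str] = field(default_factory=dict)
    children: list["Node"] = field(default_factory=list)


def support(node: Node, seq: str) -> int:
    """Count how many of the node's diagnostic characters the sequence matches."""
    return sum(
        1
        for pos, state in node.diagnostics.items()
        if pos < len(seq) and seq[pos] == state
    )


def classify(root: Node, seq: str) -> list[str]:
    """Descend the existing tree, following the best-supported child at each
    level. Stop when no child is supported or the best score is tied."""
    path = [root.name]
    node = root
    while node.children:
        scored = [(support(child, seq), child) for child in node.children]
        best = max(score for score, _ in scored)
        winners = [child for score, child in scored if score == best]
        if best == 0 or len(winners) > 1:
            break  # no support or ambiguous placement: stay at current node
        node = winners[0]
        path.append(node.name)
    return path


# Toy tree with made-up diagnostic characters.
tree = Node("root", children=[
    Node("A", {0: "A", 3: "G"}, [Node("A1", {5: "T"}), Node("A2", {5: "C"})]),
    Node("B", {0: "C"}),
])

print(classify(tree, "ACGGAT"))
```

The tree itself is never rebuilt; a new sequence is only compared against characters already associated with each node, which is why placement is cheap relative to a full tree search.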

2. Knowledge Representation
With the availability of multiple types of biological data (e.g., genetic, morphological, and phenotypic), one of the grand challenges in computational biology is the representation of these data within a knowledge framework. Using the existing biomedical ontological infrastructure, we have been working towards the creation of new biological ontologies for mediating between knowledge sources and use in evolutionary biology investigations (e.g., phylogenetics). A primary motivation for this work is the modernizing of archival knowledge (e.g., data labels used to describe museum specimens and knowledge embedded in historical scientific literature) into a form that can be used in contemporary studies. A second motivation of our work is to provide ontologically justifiable linkages between gene sequences. Because of the volume of genomic information being produced, there is a need for reconciling different gene predictions. As such, we are devising methods and techniques to combine sequence similarity information with experimental evidence to produce ontological "alignments" that can be used for categorizing gene classes. Finally, we are using ontological strategies to mediate between heterogeneous knowledge sources. For example, we have worked within the biomedical domain to create methods for traversing between genetic knowledge (as represented by the Gene Ontology) and clinical knowledge (as represented by ontologies such as SNOMED-CT).

3. Natural Language Processing
Reliable identification of biomedical entities within natural language text (e.g., literature or clinical notes) is an essential step in enabling subsequent archival, indexing, and knowledge inferencing initiatives. We have focused much of our recent energy on the development of Named Entity Recognition tools for the identification of biomedical concepts within natural language text. Most of our recent work has been in the area of identifying scientific names from biomedical literature. In particular, we have been developing and advancing a new class of named entity recognition algorithms, termed "Taxonomic Name Recognition" (TNR). We are continuing the advancement of TNR algorithms, including the incorporation of methods for addressing taxonomic name changes (synonyms) and spelling variants, as well as semantic-based disambiguation algorithms (e.g., using context to help distinguish between homonymic strings).
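As a rough illustration of what recognizing scientific names in text can involve, here is a minimal Python sketch. It is not one of the TNR algorithms described above: the `KNOWN_GENERA` lexicon, the `BINOMIAL` pattern, and the abbreviation-resolution rule are assumptions chosen to keep the example small. A practical system would also need to handle synonyms, spelling variants, and homonyms.

```python
import re

# Hypothetical mini-lexicon; a real system would draw on a large
# taxonomic name authority rather than a hard-coded set.
KNOWN_GENERA = {"Plasmodium", "Mus", "Homo"}

# Candidate Latin binomial: capitalized genus (or abbreviated "P.")
# followed by a lowercase species epithet.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+|[A-Z]\.)\s+([a-z]{3,})\b")


def find_taxonomic_names(text: str) -> list[str]:
    """Return binomials whose genus is in the lexicon, expanding abbreviated
    genera (e.g. "P. yoelii") from the most recent full mention."""
    names = []
    last_genus_by_initial: dict[str, str] = {}
    for match in BINOMIAL.finditer(text):
        genus, epithet = match.group(1), match.group(2)
        if genus.endswith("."):
            # Abbreviated genus: resolve against an earlier full mention.
            resolved = last_genus_by_initial.get(genus[0])
            if resolved is None:
                continue  # no antecedent, so the abbreviation is skipped
            genus = resolved
        elif genus in KNOWN_GENERA:
            last_genus_by_initial[genus[0]] = genus
        else:
            continue  # capitalized word that is not a known genus
        names.append(f"{genus} {epithet}")
    return names


sample = (
    "Rodent malaria parasites such as Plasmodium berghei and P. yoelii "
    "infect Mus musculus. The results were clear."
)
print(find_taxonomic_names(sample))
```

The lexicon check is what rejects ordinary capitalized phrases such as "Rodent malaria"; the pattern alone would accept them.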

4. Information Retrieval
Many contemporary information retrieval methods are being developed for dealing with existing popular electronic resources (e.g., PubMed). However, a great deal of valuable (and potentially essential) knowledge remains locked away in numerous unique archival resources. We are working towards development of a single framework to incorporate existing indexing strategies to organize archival literature. Recently, we have generalized this framework into a system that can be used to index any form of digital object, from genomic information to images to literature. The framework leverages a combination of natural language processing tools and knowledge representation approaches to identify concepts across a range of natural language text documents (e.g., health reports and biomedical literature). One of the ultimate aims of our work is to integrate semantic approaches into existing library cataloguing paradigms. Two central questions that drive our work in this area are: (1) What are the particular types of knowledge that non-literature data types (e.g., genomic) possess? and (2) Can knowledge from archival resources be analyzed within a phylogenetic framework, leading to a more complete understanding of how living organisms relate to one another, or used to track some evolutionary process? In exploring each of these questions, we are developing interfaces that combine different retrieval strategies to identify pertinent knowledge.

In summary, a common theme in our research is to advance methods and algorithms for their practical application across biomedical and biodiversity domains. We take a pragmatic approach to developing state-of-the-art computational approaches that are put into use by practicing evolutionary biologists and clinicians. Furthermore, we continually aim to identify research questions that can be addressed through the creation of new (and the advancement of existing) computational methods.

N309 Given Courtyard


Dr. Sarkar received his B.Sc. in Microbiology from the Lyman Briggs College at Michigan State University and his Ph.D. in Biomedical Informatics from the College of Physicians and Surgeons at Columbia University. He also received an MLIS from Syracuse University’s iSchool. Prior to joining the University of Vermont in 2009, he held scientific appointments at the American Museum of Natural History (NYC) and the Marine Biological Laboratory (Woods Hole, MA).


Sarkar IN. Biodiversity informatics: organizing and linking information across the spectrum of life. Brief Bioinform. 2007 Sep;8(5):347-57.

Sarkar IN, Egan MG, Coruzzi G, Lee EK, DeSalle R. Automated simultaneous analysis phylogenetics (ASAP): an enabling tool for phlyogenomics. BMC Bioinformatics. 2008 Feb 19;9:103.

Sarkar IN, Planet PJ, DeSalle R. CAOS software for use in character based DNA barcoding. Molecular Ecology Resources. 2008. 8(6):1256-1259.

Sarkar IN and Rindflesch TC. Discovering protein similarity using natural language processing. Proceedings AMIA Symposium 677-681. 2002.

Sarkar IN. Biomedical informatics and translational medicine. J Transl Med. 2010 Feb 26;8(1):22.
