The GenBank Flat File Format

 


 


 
TUTORIAL (20-30 minutes)

The GenBank flat files contain a large amount of information. However, it is presented in a highly telegraphic style. Although the information in itself is generally straightforward, some background on the format aids in decoding the information.

To begin this tutorial, go to the NCBI home page. From the Search pull-down menu, select "Nucleotide" and click "Go". The "Entrez Nucleotide" page should appear. Type U49845 into the search box and click the "Go" button.

When the search results are displayed, there should be a link to "U49845", the TCP1-beta gene in Saccharomyces cerevisiae (baker's yeast). Click on this link. The nucleotide record whose accession number is U49845 should appear.

 Look at this sample file and its organization during the rest of this tutorial.



 
What is a Flat File?

A flat file is simply a text file containing data in the form of "fields" and "records". Such a flat file can be created with any spreadsheet or text editor (as long as the application can save it in a useful format).

The fields and records of a flat file are separated by "delimiters" (i.e., a special character between each one). A "tab" is often used as the "field delimiter" and a "carriage return" as the "record delimiter".
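As a minimal sketch of the field/record idea, the Python snippet below splits tab-delimited text into records and fields. The function name parse_flat_file and the sample rows are hypothetical and only illustrate the concept; GenBank records themselves use a keyword-based layout rather than tabs.

```python
def parse_flat_file(text, field_delimiter="\t"):
    """Split flat-file text into records (lines) and fields (columns)."""
    # splitlines() accepts "\n", "\r" and "\r\n" as record delimiters.
    return [line.split(field_delimiter) for line in text.splitlines() if line]


# Hypothetical two-record file with three fields per record.
sample = "U49845\tSaccharomyces cerevisiae\t5028\nAF123456\tExample organism\t1200\n"
rows = parse_flat_file(sample)
for row in rows:
    print(row)
```

Each inner list is one record; each string inside it is one field.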


 
In the sections below it will be essential to actually look at an NCBI flat file to see the format which is used. This can be done easily by clicking on an icon as shown to the left, which will launch an image in a separate page, showing a copy of the NCBI flat file corresponding to the tutorial.



 
LOCUS Section

Locus Name  (1)

The locus name was originally designed to help group entries with similar sequences: 

  • the first three characters usually designated the organism; 
  • the fourth and fifth characters were used to show other group designations, such as gene product; 
However, the ten characters in the locus name are no longer sufficient to represent the amount of information originally intended to be contained in the Locus name. The only rule now applied in assigning a Locus name is that it must be unique. For example:
  • For GenBank records that have 6-character accessions (e.g., U12345), the locus name is usually the first letter of the genus and species name followed by the accession number. 
  • For 8-character accessions (e.g., AF123456), the locus name is just the accession number.
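The convention described in the two bullets above can be sketched as follows. The function suggest_locus_name is a hypothetical helper that mirrors the "usual" pattern only; it is not the rule NCBI applies, since the only firm requirement is uniqueness.

```python
def suggest_locus_name(genus, species, accession):
    """Build a locus name following the usual (not guaranteed) convention."""
    if len(accession) == 6:
        # First letter of genus + first letter of species + accession.
        return (genus[0] + species[0] + accession).upper()
    # Longer accessions are used unchanged as the locus name.
    return accession.upper()


print(suggest_locus_name("Saccharomyces", "cerevisiae", "U49845"))
```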


Sequence Length (2)

Number of nucleotide base pairs (or amino acid residues) in the sequence record.

  • The maximum length on an individual GenBank record is 350 kb (with some exceptions, as noted in section 1.3.2 of the release notes for GenBank 112.0) - longer sequences must be submitted as multiple records. 
  • The minimum length required for submission is 50 bp, although there might be some shorter records from past years. 

Molecule Type (3)

The type of molecule that was sequenced. Each GenBank record must contain contiguous sequence data from a single molecule type. The various molecule types can include:

  • genomic DNA
  • genomic RNA
  • precursor RNA
  • mRNA (cDNA)
  • ribosomal RNA
  • transfer RNA
  • small nuclear RNA
  • small cytoplasmic RNA

GenBank Division (4)
 

The GenBank database is divided into 17 divisions:
  1. PRI - primate sequences
  2. ROD - rodent sequences
  3. MAM - other mammalian sequences
  4. VRT - other vertebrate sequences
  5. INV - invertebrate sequences
  6. PLN - plant, fungal, and algal sequences
  7. BCT - bacterial sequences
  8. VRL - viral sequences
  9. PHG - bacteriophage sequences
10. SYN - synthetic sequences
11. UNA - unannotated sequences
12. EST - EST sequences (expressed sequence tags)
13. PAT - patent sequences
14. STS - STS sequences (sequence tagged sites)
15. GSS - GSS sequences (genome survey sequences)
16. HTG - HTGS sequences (high throughput genomic sequences)
17. HTC - unfinished high-throughput cDNA sequencing

Some of the divisions contain sequences from specific groups of organisms, while others (EST, GSS, HTG, etc.) contain data generated by specific sequencing technologies from many different organisms. 

The organismal divisions are historical and do not reflect the current NCBI Taxonomy. Instead, they merely serve as a convenient way to divide GenBank into smaller pieces for those who want to FTP the database. Because of this, and because sequences from a particular organism can exist in technology-based divisions such as EST, HTG, etc., the NCBI Taxonomy Browser should be used for retrieving all sequences from a particular organism.


Modification Date  (5) 

The date in the LOCUS field is the date of last modification. In some cases, it might correspond to the release date, but there is no way to tell just by looking at the record. 

If you need to know the first date of public availability for a specific sequence record, send a message to info@ncbi.nlm.nih.gov. We will check the history of the record for you, and let you know the date of first public release. If the sequence was originally submitted to our collaborators at DDBJ or EMBL, rather than to GenBank, we will ask them to send the release date information to you.
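The five LOCUS elements described above (locus name, sequence length, molecule type, division, modification date) can be pulled out of the LOCUS line with a whitespace split. The following is a sketch under assumptions: the sample line is modeled on the U49845 record, and exact spacing and extra tokens vary between GenBank releases, so the parser takes the date and division from the end of the line rather than from fixed columns. The names parse_locus_line, DIVISIONS and MONTHS are illustrative.

```python
from datetime import date

MONTHS = {m: i + 1 for i, m in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split())}

# Division codes as listed in the GenBank Division section.
DIVISIONS = {
    "PRI": "primate sequences", "ROD": "rodent sequences",
    "MAM": "other mammalian sequences", "VRT": "other vertebrate sequences",
    "INV": "invertebrate sequences",
    "PLN": "plant, fungal, and algal sequences",
    "BCT": "bacterial sequences", "VRL": "viral sequences",
    "PHG": "bacteriophage sequences", "SYN": "synthetic sequences",
    "UNA": "unannotated sequences", "EST": "EST sequences",
    "PAT": "patent sequences", "STS": "STS sequences",
    "GSS": "GSS sequences", "HTG": "HTGS sequences",
    "HTC": "unfinished high-throughput cDNA sequencing",
}


def parse_locus_line(line):
    """Return the LOCUS fields as a dict (whitespace-based, not column-based)."""
    tokens = line.split()
    if len(tokens) < 7 or tokens[0] != "LOCUS":
        raise ValueError("not a LOCUS line: %r" % line)
    day, month, year = tokens[-1].split("-")
    return {
        "locus_name": tokens[1],
        "length": int(tokens[2]),
        "unit": tokens[3],            # "bp" for nucleotide records
        "molecule_type": tokens[4],
        "division": tokens[-2],
        "modified": date(int(year), MONTHS[month.upper()], int(day)),
    }


locus = parse_locus_line(
    "LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999")
print(locus["locus_name"], locus["length"], DIVISIONS[locus["division"]])
```

Remember that the parsed date is the date of last modification, not necessarily the release date.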


DEFINITION (6) 

Brief description of sequence; includes information such as: 

  • source organism
  • gene name/protein name
  • some description of the sequence's function (if the sequence is non-coding). 
If the sequence has a coding region (CDS), the description may be followed by a completeness qualifier, such as "complete cds." 

In this example, the DNA sequence encodes 3 genes: 

  • a partial sequence for TCP1-beta (pink background)
  • a full sequence for Axl2p (blue background)
  • a full sequence for Rev7p (green background)

ACCESSION  (7) 
The unique identifier for a sequence record. An accession number applies to the complete record and is usually a combination of a letter(s) and numbers, such as 
  • a single letter followed by five digits (e.g., U12345)
  • or two letters followed by six digits (e.g., AF123456)
Accession numbers do not change, even if information in the record is changed at the author's request. Sometimes, however, an original accession number might become secondary to a newer accession number, if the authors make a new submission that combines previous sequences, or if for some reason a new submission supersedes an earlier record.
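The two accession layouts listed above can be checked with a regular expression. This is a sketch that covers only those two patterns (one letter plus five digits, two letters plus six digits); the name is_classic_accession is hypothetical, and other accession layouts are deliberately rejected.

```python
import re

# One letter + five digits, or two letters + six digits.
ACCESSION_RE = re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}")


def is_classic_accession(text):
    """Return True if text matches one of the two layouts described above."""
    return ACCESSION_RE.fullmatch(text) is not None


for candidate in ("U12345", "AF123456", "U1234", "AF12345"):
    print(candidate, is_classic_accession(candidate))
```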

Note: compare accession number with Sequence Identifiers such as Version and GI for nucleotide sequences, and ProteinID and GI for amino acid sequences.


VERSION  (8)

A nucleotide sequence identification number that represents a single, specific sequence in the GenBank database. This identification number uses the accession.version format implemented by GenBank/EMBL/DDBJ in February 1999. If there is any change to the sequence data (even a single base), the version number is incremented by one, e.g., U12345.1 --> U12345.2, but the accession portion remains the same.

The accession.version system of sequence identifiers runs parallel to the GI number system. That is, when any change is made to a sequence, it receives a new GI number AND an increase to its version number. 
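A short sketch of how an accession.version identifier can be taken apart. The version suffix is treated as an integer counter rather than a decimal fraction, so the identifier after U12345.9 would be U12345.10. The helper names split_version and next_version are hypothetical.

```python
def split_version(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version as int)."""
    accession, sep, version = identifier.rpartition(".")
    if not sep or not accession or not version.isdigit():
        raise ValueError("not in accession.version format: %r" % identifier)
    return accession, int(version)


def next_version(identifier):
    """Identifier the sequence would carry after one sequence change."""
    accession, version = split_version(identifier)
    return "%s.%d" % (accession, version + 1)


print(split_version("U12345.1"))
print(next_version("U12345.1"))
```

The accession part is unchanged by next_version, which matches the rule that only the version suffix moves when the sequence data change.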


GI  (9)

"GenInfo Identifier" sequence identification number, in this case, for the nucleotide sequence. If a sequence changes in any way, a new GI number will be assigned.  A separate GI number is also assigned to each protein translation within a nucleotide sequence record, and a new GI is assigned if the protein translation changes in any way (see below).

GI numbers start with "g" if assigned by GenBank, "d" if assigned by DDBJ or "e" if assigned by EMBL.

GI sequence identifiers run parallel to the new accession / version system of sequence identifiers. For more information, see the description of Version, above, and section 3.4.7 of the current GenBank release notes.


KEYWORDS  (10) 

Word or phrase describing the sequence. If no keywords are included in the entry, the field contains only a period.

The Keyword field is present in sequence records primarily for historical reasons, and is not based on a controlled vocabulary. Keywords are generally present in older records. They are not included in newer records unless:

(1) they are not redundant with any feature, qualifier, or other information present in the record 
(2) the submitter specifically asks for them to be added, and (1) is true
(3) the sequence needs to be tagged as an EST, STS, GSS or HTG.
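The "only a period" case can be handled when reading the field. The sketch below assumes that multiple keywords are separated by semicolons and that the list ends with a period; that separator is an assumption not stated in this tutorial, and parse_keywords is a hypothetical helper.

```python
def parse_keywords(field_text):
    """Return the keywords in a KEYWORDS field as a list (possibly empty)."""
    text = " ".join(field_text.split())   # collapse wrapped lines
    if text.endswith("."):
        text = text[:-1]                   # drop the terminating period
    return [kw.strip() for kw in text.split(";") if kw.strip()]


print(parse_keywords("."))
print(parse_keywords("EST; expressed sequence tag."))
```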

SOURCE  (11) 

Free-format information including an abbreviated form of the organism name - this may be a common name or scientific name.  It is sometimes followed by a molecule type. 


Organism (12) 

The formal scientific name for the source organism (genus and species, where appropriate) and its lineage, based on the phylogenetic classification scheme used in the NCBI Taxonomy Database. If the complete lineage of an organism is very long, an abbreviated lineage will be shown in the GenBank record and the complete lineage will be available in the Taxonomy Database. 
 




 
REFERENCES Section


REFERENCE (13) 

Publications by the authors of the sequence that discuss the data reported in the record. References are automatically sorted within the record based on date of publication, showing the oldest references first.
The last citation in the References field contains information about the submission itself, rather than a literature citation (see Direct Submission, below).



Authors  (13) 

List of authors in the order in which they appear in the cited article.


Title  (13) 

Title of the published work, or tentative title of an unpublished work.


Journal  (13) 

MEDLINE abbreviation of the journal name. (Full spellings can be obtained from the PubMed Journal Browser.)


MEDLINE  (13) 

MEDLINE is the NLM's premier bibliographic database covering the fields of medicine, nursing, dentistry, veterinary medicine, the health care system, and the preclinical sciences.  MEDLINE contains bibliographic citations and author abstracts from more than 4,600 biomedical journals published in the United States and 70 other countries. The file contains over 11 million citations dating back to the mid-1960's. Coverage is worldwide, but most records are from English-language sources or have English abstracts.

PubMed, available via the NCBI Entrez retrieval system, was developed by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM), located at the National Institutes of Health (NIH).  Entrez is the text based search and retrieval system used at NCBI for all major databases including PubMed, Nucleotide, and Protein Sequences, Protein Structures, Complete Genomes, Taxonomy, OMIM and many others. 


Direct Submission  (13) 

Contact information of the submitter, such as institute/department and postal address. This is always the last citation in the References field.  Some older records do not contain the "Direct Submission" reference. However, it is required in all new records.



 
FEATURES TABLE Section  (19) 

Definitions of the elements of the Features Table:
    1. Feature Key
    2. Location
    3. Qualifiers

Source  (20) 

The "Source" Feature Key is a mandatory feature in each record. The Location / Qualifiers for the "Source" feature in this record are:

  • the "base span" Qualifier, which also gives the length of the sequence (21).
  • "scientific name" Qualifier of the source organism (22).
  • the Location of the sequence (23).
  • the "Taxon ID number" Qualifier(24).
"Taxon" is a non-specific term for any level of classification; in different usages it may refer to a species, genus, family, order, class, etc.

The Taxon ID number is a stable, unique identification number for the taxon of the source organism. A taxonomy ID number is assigned to each taxon (species, genus, family, etc.) in the NCBI Taxonomy Database.

Entrez Search Field: The Taxonomy ID number is not searchable in the Organism search field of Entrez, but is searchable in the Taxonomy Browser.


 

If provided by submitter, the "Source" feature can also include other Qualifiers such as: 

  • map location 
  • strain
  • clone
  • tissue type, etc.
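Qualifiers in the Features Table are written as /name="value" lines, so the "Source" qualifiers above can be collected into a dictionary. This is a sketch under assumptions: the sample lines are modeled on the source feature of the U49845 record, each qualifier fits on one line (wrapped values such as long translations are not handled), and the function names are hypothetical.

```python
import re

QUALIFIER_RE = re.compile(r"/(\w+)=(.*)")


def parse_qualifiers(lines):
    """Collect single-line /name="value" qualifiers into {name: [values]}."""
    qualifiers = {}
    for line in lines:
        match = QUALIFIER_RE.fullmatch(line.strip())
        if match is None:
            continue  # not a single-line qualifier; ignored in this sketch
        name, value = match.group(1), match.group(2).strip('"')
        qualifiers.setdefault(name, []).append(value)
    return qualifiers


def taxon_id(qualifiers):
    """Return the Taxon ID from a db_xref="taxon:N" qualifier, or None."""
    for xref in qualifiers.get("db_xref", []):
        database, _, identifier = xref.partition(":")
        if database == "taxon":
            return int(identifier)
    return None


source_lines = [
    '                     /organism="Saccharomyces cerevisiae"',
    '                     /db_xref="taxon:4932"',
    '                     /chromosome="IX"',
    '                     /map="9"',
]
source = parse_qualifiers(source_lines)
print(source["organism"][0], taxon_id(source))
```

Values are stored in lists because a qualifier such as db_xref may appear more than once in a feature.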

 Gene  (25) 

The "Gene" Feature Key indicates a region of biological interest that has been identified as a gene and for which a name has been assigned. There are 2 "Gene" features in this record (color-coded with blue and green backgrounds).

The "Gene" feature coded with a blue background has two Location / Qualifiers:

  • The base span for the gene feature is determined from the furthest 5' and 3' elements of the gene. If the coding sequence of the gene is located on the strand complementary to the sequence in the record, the word "complement" will appear before the base span.
  • The name assigned to the gene (AXL2)
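The base span and the "complement" marker can be read from a location string as sketched below. The example strings are modeled on the features of the sample record; only simple start..end spans are handled (compound forms such as join(...) are out of scope), and parse_simple_location is a hypothetical helper.

```python
import re


def parse_simple_location(location):
    """Return (start, end, strand) for 'a..b' or 'complement(a..b)'."""
    text = location.strip()
    strand = "+"
    if text.startswith("complement(") and text.endswith(")"):
        strand = "-"                       # feature lies on the other strand
        text = text[len("complement("):-1]
    # "<" and ">" mark partial features that extend past the record.
    match = re.fullmatch(r"<?(\d+)\.\.>?(\d+)", text)
    if match is None:
        raise ValueError("unsupported location: %r" % location)
    return int(match.group(1)), int(match.group(2)), strand


for loc in ("687..3158", "complement(3300..4037)", "<1..206"):
    print(loc, parse_simple_location(loc))
```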

CDS (26) 

The "CDS" Feature Key provides information about a region of nucleotides that corresponds to the sequence of amino acids in a protein, including the start codon and the stop codon.

The "CDS" feature which is coded with a blue background has 9 Location / Qualifiers:

  • the numbered positions of the base pairs of the coding regions.
  • the name of the gene.
  • a note indicating the nature and cellular location of the gene product.
  • the position of the start codon. In this case, the coding sequence begins at base 687 and the codon start = 1, so the start codon occupies bases 687-689. The start codon is AUG (RNA) or ATG (DNA).
  • the function of the gene product
  • the Accession Number for the record in the NCBI Protein Database (27).
  • the GI number for the record in the NCBI Protein Database (28).
  • the translation of the nucleotide coding sequence (29). The translation is given in the one-letter amino acid code.
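The start-codon arithmetic in the list above (coding region beginning at base 687, codon start of 1) can be written out as a small function. This is a sketch for a feature on the forward strand; first_codon_span is a hypothetical name.

```python
def first_codon_span(cds_start, codon_start=1):
    """Return the 1-based (first, last) bases of the first complete codon."""
    if codon_start not in (1, 2, 3):
        raise ValueError("codon_start must be 1, 2 or 3")
    first = cds_start + codon_start - 1   # codon_start = 1 means no offset
    return first, first + 2               # a codon spans three bases


print(first_codon_span(687, 1))
```

With a codon start of 2 or 3, the first complete codon is shifted one or two bases into the feature, which is how partial coding regions keep the correct reading frame.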



 
BASE COUNT and ORIGIN Section  

BASE COUNT  (30)

The "BASE COUNT" section gives the number of each of the four bases (a, c, g, t) present in the DNA sequence.
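A base count of this kind can be recomputed from any sequence string. The sketch below uses a short illustrative fragment rather than a full record, and base_count is a hypothetical helper.

```python
from collections import Counter


def base_count(sequence):
    """Count a, c, g and t in a DNA sequence (case-insensitive)."""
    counts = Counter(sequence.lower())
    # Report the four bases in a fixed order, with 0 for any that are absent.
    return {base: counts.get(base, 0) for base in "acgt"}


print(base_count("gatcctccat"))
```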



ORIGIN  (31)

The "ORIGIN" section provides the exact base sequence from which all the annotations in the Features Table are derived.
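In the flat file, the ORIGIN sequence is printed as numbered lines of ten-base blocks, so it has to be reassembled before use. The sketch below assumes that layout (a position label followed by space-separated blocks, with "//" ending the record); the sample lines are illustrative and parse_origin is a hypothetical helper.

```python
def parse_origin(lines):
    """Join the base blocks of an ORIGIN section into one sequence string."""
    blocks = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "//":
            break                      # end-of-record marker
        if not parts[0].isdigit():
            continue                   # e.g. the "ORIGIN" keyword line
        blocks.extend(parts[1:])       # drop the leading position label
    return "".join(blocks)


origin_lines = [
    "ORIGIN",
    "        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg",
    "       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct",
    "//",
]
sequence = parse_origin(origin_lines)
print(len(sequence), sequence[:10])
```

Because feature locations are 1-based, base N of a feature corresponds to sequence[N - 1] in the reassembled string.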


