Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, Second Edition
Andreas D. Baxevanis, B.F. Francis Ouellette
Copyright © 2001 John Wiley & Sons, Inc.
ISBNs: 0-471-38390-2 (Hardback); 0-471-38391-0 (Paper); 0-471-22392-1 (Electronic)
THE NCBI DATA MODEL
James M. Ostell
National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health
Bethesda, Maryland
Sarah J. Wheelan
Department of Molecular Biology and Genetics
The Johns Hopkins School of Medicine
Baltimore, Maryland
Jonathan A. Kans
National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health
Bethesda, Maryland
INTRODUCTION
Why Use a Data Model?
Most biologists are familiar with the use of animal models to study human diseases.
Although a disease that occurs in humans may not be found in exactly the same
form in animals, often an animal disease shares enough attributes with a human
counterpart to allow data gathered on the animal disease to be used to make inferences
about the process in humans. Mathematical models describing the forces involved
in musculoskeletal motions can be built by imagining that muscles are combinations
of springs and hydraulic pistons and bones are lever arms, and, often,
such models allow meaningful predictions to be made and tested about the obviously
much more complex biological system under consideration. The more closely and
elegantly a model follows a real phenomenon, the more useful it is in predicting or
understanding the natural phenomenon it is intended to mimic.
In this same vein, some 12 years ago, the National Center for Biotechnology
Information (NCBI) introduced a new model for sequence-related information. This
new and more powerful model made possible the rapid development of software and
the integration of databases that underlie the popular Entrez retrieval system and on
which the GenBank database is now built (cf. Chapter 7 for more information on
Entrez). The advantages of the model (e.g., the ability to move effortlessly from the
published literature to DNA sequences to the proteins they encode, to chromosome
maps of the genes, and to the three-dimensional structures of the proteins) have been
apparent for years to biologists using Entrez, but very few biologists understand the
foundation on which this model is built. As genome information becomes richer and
more complex, more of the real, underlying data model is appearing in common
representations such as GenBank files. Without going into great detail, this chapter
attempts to present a practical guide to the principles of the NCBI data model and
its importance to biologists at the bench.
Some Examples of the Model
The GenBank flatfile is a ‘‘DNA-centered’’ report, meaning that a region of DNA
coding for a protein is represented by a ‘‘CDS feature,’’ or ‘‘coding region,’’ on the
DNA. A qualifier (/translation=“MLLYY”) describes a sequence of amino
acids produced by translating the CDS. A limited set of additional features of the
DNA, such as mat_peptide, are occasionally used in GenBank flatfiles to describe
cleavage products of the (possibly unnamed) protein that is described by a
/translation, but clearly this is not a satisfactory solution. Conversely, most
protein sequence databases present a ‘‘protein-centered’’ view in which the connection
to the encoding gene may be completely lost or may be only indirectly referenced
by an accession number. Often, these connections do not provide the
exact codon-to-amino acid correspondences that are important in performing mutation
analysis.
The NCBI data model deals directly with the two sequences involved: a DNA
sequence and a protein sequence. The translation process is represented as a link
between the two sequences rather than an annotation on one with respect to the
other. Protein-related annotations, such as peptide cleavage products, are represented
as features annotated directly on the protein sequence. In this way, it becomes very
natural to analyze the protein sequences derived from translations of CDS features
by BLAST or any other sequence search tool without losing the precise linkage back
to the gene. A collection of a DNA sequence and its translation products is called a
Nuc-prot set, and this is how such data are represented by NCBI. The GenBank flatfile
format that many readers are already accustomed to is simply a particular style of
report, one that is more ‘‘human-readable’’ and that ultimately flattens the connected
collection of sequences back into the familiar one-sequence, DNA-centered view.
The navigation provided by tools such as Entrez much more directly reflects the
underlying structure of such data. The protein sequences derived from GenBank
translations that are returned by BLAST searches are, in fact, the protein sequences
from the Nuc-prot sets described above.
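As a rough illustration of this linkage, the following Python sketch models a Nuc-prot set as a DNA sequence, a protein sequence, and a CDS link between the two. The names used here (NucProtSet, CdsLink, codon_for_residue) are hypothetical and are not taken from the NCBI specification, the sequences are toy values, and the sketch assumes a single unspliced coding interval on the plus strand.

```python
from dataclasses import dataclass


@dataclass
class CdsLink:
    """Hypothetical link from a DNA interval to the protein it encodes."""
    dna_id: str
    protein_id: str
    start: int  # 0-based offset of the first coding base on the DNA


@dataclass
class NucProtSet:
    """Hypothetical container: one DNA sequence plus its translation products."""
    dna: dict       # sequence id -> nucleotide string
    proteins: dict  # sequence id -> amino acid string
    links: list     # CdsLink objects

    def codon_for_residue(self, protein_id, residue):
        """Return (dna_id, start, end, codon) for a 0-based residue index.

        start and end are 0-based, half-open offsets on the DNA.
        """
        for link in self.links:
            if link.protein_id == protein_id:
                begin = link.start + 3 * residue
                codon = self.dna[link.dna_id][begin:begin + 3]
                return link.dna_id, begin, begin + 3, codon
        raise KeyError(protein_id)


# Toy data: two flanking bases, then codons for M L L Y Y and a stop.
nps = NucProtSet(
    dna={"dna1": "GGATGCTTCTGTACTACTAAGG"},
    proteins={"prot1": "MLLYY"},
    links=[CdsLink("dna1", "prot1", start=2)],
)
print(nps.codon_for_residue("prot1", 0))
```

Because the protein is a sequence in its own right, it can be searched or annotated directly, and the link still recovers the exact codon behind any residue.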
The standard GenBank format can also hide the multiple-sequence nature of
some DNA sequences. For example, three genomic exons of a particular gene are
sequenced, and partial flanking, noncoding regions around the exons may also be
available, but the full-length sequences of these intronic sequences may not yet be
available. Because the exons are not in their complete genomic context, there would
be three GenBank flatfiles in this case, one for each exon. There is no explicit
representation of the complete set of sequences over that genomic region; these three
exons come in genomic order and are separated by a certain length of unsequenced
DNA. In GenBank format there would be a Segment line of the form SEGMENT 1
of 3 in the first record, SEGMENT 2 of 3 in the second, and SEGMENT 3 of 3 in
the third, but this only tells the user that the lines are part of some undefined, ordered
series (Fig. 2.1A). Out of the whole GenBank release, one locates the correct Segment
records to place together by an algorithm involving the LOCUS name. All segments
that go together use the same first combination of letters, ending with the numbers
appropriate to the segment, e.g., HSDDT1, HSDDT2, and HSDDT3. Obviously, this
complicated arrangement can result in problems when LOCUS names include numbers
that inadvertently interfere with such series. In addition, there is no one sequence
record that describes the whole assembled series, and there is no way to describe
the distance between the individual pieces. There is no segmenting convention in
the EMBL sequence database at all, so records derived from that source or distributed
in that format lack even this imperfect information.
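The name-based grouping just described might be sketched in Python as follows. This only illustrates the idea of stripping trailing digits from the LOCUS name and grouping on what remains; it is not the algorithm actually used, and the function name group_segments is an assumption made for this example.

```python
import re
from collections import defaultdict


def group_segments(locus_names):
    """Group LOCUS names that share a prefix and differ only in trailing digits."""
    groups = defaultdict(list)
    for name in locus_names:
        # Lazy prefix, then every trailing digit taken as the segment number.
        match = re.fullmatch(r"(.*?)(\d+)", name)
        if match:
            prefix, number = match.group(1), int(match.group(2))
            groups[prefix].append((number, name))
    # Order the members of each group by segment number.
    return {p: [n for _, n in sorted(members)] for p, members in groups.items()}


print(group_segments(["HSDDT2", "HSDDT1", "HSDDT3"]))
```

The weakness is easy to see: a hypothetical, unsegmented record named HSP70 would be read as segment 70 of a series called HSP, which is exactly the kind of interference mentioned above.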
The NCBI data model defines a sequence type that directly represents such a
segmented series, called a ‘‘segmented sequence.’’ Rather than containing the letters
A, G, C, and T, the segmented sequence contains instructions on how it can be built
from other sequences. Considering again the example above, the segmented sequence
would contain the instructions ‘‘take all of HSDDT1, then a gap of unknown length,
then all of HSDDT2, then a gap of unknown length, then all of HSDDT3.’’ The
segmented sequence itself can have a name (e.g., HSDDT), an accession number,
features, citations, and comments, like any other GenBank record. Data of this type
are commonly stored in a so-called ‘‘Seg-set’’ containing the sequences HSDDT,
HSDDT1, HSDDT2, HSDDT3 and all of their connections and features. When the
GenBank release is made, as in the case of Nuc-prot sets, the Seg-sets are broken
up into multiple records, and the segmented sequence itself is not visible. However,
GenBank, EMBL, and DDBJ have recently agreed on a way to represent these
constructed assemblies, and they will be placed in a new CON division, with CON
standing for ‘‘contig’’ (Fig. 2.1B). In the Entrez graphical view of segmented sequences,
the segmented sequence is shown as a line connecting all of its component
sequences (Fig. 2.1C).
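A minimal Python sketch of a segmented sequence follows. It stores build instructions rather than residues, as described above. The class names (Segment, Gap, SegmentedSequence), the use of N to draw a gap, and the residues in the example are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    """Instruction: take all of another sequence, named by identifier."""
    seq_id: str


@dataclass
class Gap:
    """Instruction: a gap; length is None when the size is unknown."""
    length: Optional[int] = None


@dataclass
class SegmentedSequence:
    """Holds build instructions instead of the letters A, G, C, and T."""
    name: str
    parts: list  # Segment and Gap objects, in order

    def assemble(self, store, unknown_gap_length=100):
        """Build a residue string; a gap of unknown size is drawn as a run of N."""
        pieces = []
        for part in self.parts:
            if isinstance(part, Segment):
                pieces.append(store[part.seq_id])
            else:
                size = unknown_gap_length if part.length is None else part.length
                pieces.append("N" * size)
        return "".join(pieces)


# Placeholder residues; the real HSDDT records are much longer.
store = {"HSDDT1": "ACGT", "HSDDT2": "GGCC", "HSDDT3": "TTAA"}
hsddt = SegmentedSequence(
    "HSDDT",
    [Segment("HSDDT1"), Gap(), Segment("HSDDT2"), Gap(), Segment("HSDDT3")],
)
print(hsddt.assemble(store, unknown_gap_length=3))
```

Drawing an unknown gap at an arbitrary width is a display choice only; the instructions themselves still record that the length is unknown.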
An NCBI segmented sequence does not require that there be gaps between the
individual pieces. In fact the pieces can overlap, unlike the case of a segmented
series in GenBank format. This makes the segmented sequence ideal for representing
large sequences such as bacterial genomes, which may be many megabases in length.
This is what currently is done within the Entrez Genomes division for bacterial
genomes, as well as other complete chromosomes such as yeast. The NCBI Software
Toolkit (Ostell, 1996) contains functions that can gather the data that a segmented
sequence refers to ‘‘on the fly,’’ including constituent sequence and features, and this
information can automatically be remapped from the coordinates of a small, individual
record to that of a complete chromosome. This makes it possible to provide
graphical views, GenBank flatfile views, or FASTA views or to perform analyses on
whole chromosomes quite easily, even though data exist only in small, individual
pieces. This ability to readily assemble a set of related sequences on demand for any
region of a very large chromosome has already proven to be valuable for bacterial
genomes. Assembly on demand will become more and more important as larger and
larger regions are sequenced, perhaps by many different groups, and the notion that
an investigator will be working on one huge sequence record becomes completely
impractical.
Figure 2.1. (A) Selected parts of GenBank-formatted records in a segmented sequence.
GenBank format historically indicates merely that records are part of some ordered series;
it offers no information on what the other components are or how they are connected.
To see the complete view of these records, see
http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?uid=6849043&form=6&db=n&Dopt=g.
(B) Representation of segmented sequences
in the new CON (contig) division. A new extension of GenBank format allows the
details of the construction of segmented records to be presented. The CONTIG line can
include individual accessions, gaps of known length, and gaps of unknown length. The
individual components can still be displayed in the traditional form, although no features
or sequences are present in this format. (C) Graphical representation of a segmented sequence.
This view displays features mapped to the coordinates of the segmented sequence.
The segments include all exonic and untranslated regions plus 20 base pairs of sequence
at the ends of each intron. The segment gaps cover the remaining intronic sequence.
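The coordinate remapping mentioned above can be sketched in a few lines of Python. This is not the NCBI Software Toolkit interface; build_layout and remap_feature are hypothetical names, and the sketch assumes components laid end to end on the plus strand with gaps of known length and no overlaps.

```python
def build_layout(parts):
    """Return {seq_id: offset} for an ordered list of (seq_id, length) pairs.

    A seq_id of None marks a gap, which advances the position but gets no entry.
    """
    layout, position = {}, 0
    for seq_id, length in parts:
        if seq_id is not None:
            layout[seq_id] = position
        position += length
    return layout


def remap_feature(layout, seq_id, start, end):
    """Translate an interval on a component record into assembled coordinates."""
    offset = layout[seq_id]
    return offset + start, offset + end


# Two placeholder records separated by a 100-base gap.
layout = build_layout([("rec1", 500), (None, 100), ("rec2", 300)])
print(remap_feature(layout, "rec2", 10, 40))
```

A feature stored once on the small record can thus be displayed at chromosome scale without copying or editing the record itself.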
What Does ASN.1 Have to Do With It?
The NCBI data model is often referred to as, and confused with, the ‘‘NCBI ASN.1’’
or ‘‘ASN.1 Data Model.’’ Abstract Syntax Notation 1 (ASN.1) is an International
Organization for Standardization (ISO) standard for describing structured data. It
encodes data in a way that permits computers and software systems of all types to
reliably exchange both the structure and the content of the entries. Saying that a data
model is written in ASN.1 is like saying a computer program is written in C or
FORTRAN. The statement identifies the language; it does not say what the program
does. The familiar GenBank flatfile was really designed for humans to read, from a
DNA-centered viewpoint. ASN.1 is designed for a computer to read and is amenable
to describing complicated data relationships in a very specific way. NCBI describes
and processes data using the ASN.1 format. Based on that single, common format,
a number of human-readable formats and tools are produced, such as Entrez,
GenBank, and the BLAST databases. Without the existence of a common format
such as this, the neighboring and hard-link relationships that Entrez depends on
would not be possible. This chapter deals with the structure and content of the NCBI
data model and its implications for biomedical databases and tools. Detailed discussions
about the choice of ASN.1 for this task and its overall form can be found
elsewhere (Ostell, 1995).
What to Define?
We have alluded to how the NCBI data model defines sequences in a way that
supports a richer and more explicit description of the experimental data than can be
obtained with the GenBank format. The details of the model are important, and will
be expanded on in the ensuing discussion. At this point, we need to pause and briefly
describe the reasoning and general principles behind the model as a whole.
There are two main reasons for putting data on a computer: retrieval and discovery.
Retrieval is basically being able to get back out what was put in. Amassing
sequence information without providing a way to retrieve it makes the sequence
information, in essence, useless. Although this is important, it is even more valuable
to be able to get back from the system more knowledge than was put in to begin
with—that is, to be able to use the information to make biological discoveries.
Scientists can make these kinds of discoveries by discerning connections between
two pieces of information that were not known when the pieces were entered separately
into the database or by performing computations on the data that offer new
insight into the records. In the NCBI data model, the emphasis is on facilitating
discovery; that means the data must be defined in a way that is amenable to both
linkage and computation.
A second, general consideration for the model is stability. NCBI is a US Government
agency, not a group supported year-to-year by competitive grants. Thus, the
NCBI staff takes a very long-term view of its role in supporting bioinformatics
efforts. NCBI provides large-scale information systems that will support scientific
inquiry well into the future. As anyone who is involved in biomedical research
knows, many major conceptual and technical revolutions can happen when dealing
with such a long time span. Somehow, NCBI must address these changing views
and needs with software and data that may have been created years (or decades)
earlier. For that reason, basic observations have been chosen as the central data
elements, with interpretations and nomenclature (elements more subject to change)
being placed outside the basic, core representation of the data.
Taking all factors into account, NCBI uses four core data elements: bibliographic
citations, DNA sequences, protein sequences, and three-dimensional structures. In
addition, two projects (taxonomy and genome maps) are more interpretive but nonetheless
are so important as organizing and linking resources that NCBI has built a
considerable base in these areas as well.
PUBs: PUBLICATIONS OR PERISH
Publication is at the core of every scientific endeavor. It is the common process
whereby scientific information is reviewed, evaluated, distributed, and entered into
the permanent record of scientific progress. Publications serve as vital links between
factual databases of different structures or content domains (e.g., a record in a sequence
database and a record in a genetic database may cite the same article). They
serve as valuable entry points into factual databases (‘‘I have read an article about
this, now I want to see the primary data’’).
Publications also act as essential annotation of function and context to records
in factual databases. One reason for this is that factual databases have a structure
that is essential for efficient use of the database but may not have the representational
capacity to set forward the full biological, experimental, or historical context of a
particular record. In contrast, the published paper is limited only by language and
contains much fuller and more detailed explanatory information than will ever be in
a record in a factual database. Perhaps more importantly, authors are evaluated by
their scientific peers based on the content of their published papers, not by the content
of the associated database records. Despite the best of intentions, scientists move on
and database records become static, even though the knowledge about them has
expanded, and there is very little incentive for busy scientists to learn a database
system and keep records based on their own laboratory studies up to date.
Generally, the form and content of citations have not been thought about carefully
by those designing factual databases, and the quality, form, and content of
citations can vary widely from one database to the next. Awareness of the importance
of having a link to the published literature and the realization that bibliographic
citations are much less volatile than scientific knowledge led to a decision that a
careful and complete job of defining citations was a worthwhile endeavor. Some
components of the publication specification described below may be of particular
interest to scientists or users of the NCBI databases, but a full discussion of all the
issues leading to the decisions governing the specifications themselves would require
another chapter in itself.
Authors
Author names are represented in many formats by various databases: last name only,
last name and initials, last name-comma-initials, last name and first name, all authors
with initials and the last with a full first name, with or without honorifics (Ph.D.)
or suffixes (Jr., III), to name only a few. Some bibliographic databases (such as
MEDLINE) might represent only a fixed number of authors. Although this inconsistency
is merely ugly to a human reader, it poses severe problems for database systems
incorporating names from many sources and providing functions as simple as looking
up citations by author last name, such as Entrez does. For this reason, the specification
provides two alternative forms of author name representation: one a simple
string and the other a structured form with fields for last name, first name, and so
on. When data are submitted directly to NCBI or in cases when there is a consistent
format of author names from a particular source (such as MEDLINE), the structured
form is used. When the form cannot be deciphered, the author name remains as a
string. This limits its use for retrieval but at least allows data to be viewed when the
record is retrieved by other means.
Even the structured form of author names must support diversity, since some
sources give only initials whereas others provide a first and middle name. This is
mentioned to specifically emphasize two points. First, the NCBI data model is designed
both to direct our view of the data into a more useful form and to accommodate
the available existing data. (This pair of functions can be confusing to people
reading the specification and seeing alternative forms of the same data defined.)
Second, software developers must be aware of this range of representations and
accommodate whatever form had to be used when a particular source was being
converted. In general, NCBI tries to get as much of the data into a uniform, structured
form as possible but carries the rest in a less optimal way rather than losing it
altogether.
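The two alternative forms of author name can be pictured with a short Python sketch. The field names below are assumptions chosen for illustration rather than the names used in the specification, and the example values are arbitrary.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class StructuredName:
    """Parsed form; only the last name is required, since sources differ."""
    last: str
    first: Optional[str] = None
    initials: Optional[str] = None
    suffix: Optional[str] = None


# An author is either a parsed name or the raw, undeciphered string.
Author = Union[StructuredName, str]


def matches_last_name(author, last_name):
    """Only the structured form can be looked up reliably by last name."""
    if isinstance(author, StructuredName):
        return author.last.lower() == last_name.lower()
    return False  # an unparsed string can be displayed but not indexed


authors = [StructuredName(last="Ostell", initials="J.M."), "Kans JA"]
print([matches_last_name(a, "ostell") for a in authors])
```

Software that reads such records has to handle both branches, which is the burden on developers noted above.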
Author affiliations (i.e., authors’ institutional addresses) are even more complicated.
As with author names, there is the problem of supporting both structured forms
and unparsed strings. However, even sources with reasonably consistent author name
conventions often produce affiliation information that cannot be parsed from text into
a structured format. In addition, there may be an affiliation associated with the whole
author list, or there may be different affiliations associated with each author. The
NCBI data model allows for both scenarios. At the time of this writing, only the first
form is supported in either MEDLINE or GenBank, although both types may appear in published
articles.
Articles
The most commonly cited bibliographic entity in biological science is an article in
a journal; therefore, the citation formats of most biological databases are defined
with that type in mind. However, ‘‘articles’’ can also appear in books, manuscripts,
theses, and now in electronic journals as well. The data model defines the fields
necessary to cite a book, a journal, or a manuscript. An article citation occupies one
field; other fields display additional information necessary to uniquely identify the
article in the book, journal, or manuscript—the author(s) of the article (as opposed
to the author or editor of the book), the title of the article, page numbers, and so on.
There is an important distinction between the fields necessary to uniquely identify
a published article from a citation and those necessary to describe the same
article meaningfully to a database user. The NCBI Citation Matching Service takes
fields from a citation and attempts to locate the article to which they refer. In this
process, a successful match would involve only correctly matching the journal title,
the year, the first page of the article, and the last name of an author of the article.
Other information (e.g., article title, volume, issue, full pages, author list) is useful
to look at but very often is either not available or outright incorrect. Once again, the
data model must allow the minimum information set to come in as a citation, be
matched against MEDLINE, and then be replaced by a citation having the full set
of desired fields obtained from MEDLINE to produce accurate, useful data for consumption
by the scientific public.
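The minimal matching rule described above (journal title, year, first page, and the last name of any one author) might look like the following Python sketch. It does not reproduce the NCBI Citation Matching Service; match_citation and normalize are hypothetical names, and the stored record is a placeholder rather than a real citation.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(str(text).lower().split())


def match_citation(query, records):
    """Return the first record agreeing on journal, year, first page, and
    any one author last name; return None when nothing matches."""
    for record in records:
        if (
            normalize(record["journal"]) == normalize(query["journal"])
            and int(record["year"]) == int(query["year"])
            and normalize(record["first_page"]) == normalize(query["first_page"])
            and normalize(query["author"]) in {normalize(a) for a in record["authors"]}
        ):
            return record
    return None


# Placeholder record, not a real citation.
records = [{
    "journal": "Example Journal", "year": 2000, "first_page": "101",
    "authors": ["Doe", "Roe"], "title": "Placeholder title", "id": 1,
}]
query = {"journal": "example  journal", "year": "2000", "first_page": 101, "author": "ROE"}
print(match_citation(query, records)["title"])
```

The incoming query carries only four fields; the record returned supplies the title, full author list, and identifier that replace it.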
Patents
With the advent of patented sequences it became necessary to cite a patent as a
bibliographic entity instead of an article. The data model supports a very complete
patent citation, a format developed in cooperation with the US Patent Office. In
practice, however, patented sequences tend to have limited value to the scientific
public. Because a patent is a legal document, not a scientific one, its purpose is to
present and support the claims of the patent, not to fully describe the biology of the
sequence itself. It is often prepared in a lawyer’s office, not by the scientist who did
the research. The sequences presented in the patent may function only to illustrate
some discrete aspect of the patent, rather than being the focus of the document.
Organism information, location of biological features, and so on may not appear at
all if they are not germane to the patent. Thus far, the vast majority of sequences
appearing in patents also appear in a more useful form (to scientists) in the public
databases.
In NCBI’s view, the main purpose of listing patented sequences in GenBank is
to be able to retrieve sequences by similarity searches that may serve to locate patents
related to a given sequence. To make a legal determination in the case, however, one
would still have to examine the full text of the patent. To evaluate the biology of
the sequence, one generally must locate information other than that contained in the
patent. Thus, the critical linkage is between the sequence and its patent number.
Additional fields in the patent citation itself may be of some interest, such as the
title of the patent and the names of the inventors.
Citing Electronic Data Submission
A relatively new class of citations comprises the act of data submission to a database,
such as GenBank. This is an act of publication, similar but not identical to the
publication of an article in a journal. In some cases, data submission precedes article
publication by a considerable period of time, or a publication regarding a particular
sequence may never appear in press. Because of this, there is a separate citation
designed for deposited sequence data. The submission citation, because it is indeed
an act of publication, may have an author list, showing the names of scientists who
worked on the record. This may or may not be the same as the author list on a
subsequently published paper also cited in the same record. In most cases, the scientist
who submitted the data to the database is also an author on the submission
citation. (In the case of large sequencing centers, this may not always be the case.)
Finally, NCBI has begun the practice of citing the update of a record with a submission
citation as well. A comment can be included with the update, briefly describing
the changes made in the record. All the submission citations can be retained
in the record, providing a history of the record over time.
MEDLINE and PubMed Identifiers
Once an article citation has been matched to MEDLINE, the simplest and most
reliable key to point to the article is the MEDLINE unique identifier (MUID). This
is simply an integer number. NCBI provides many services that use MUID to retrieve
the citation and abstract from MEDLINE, to link together data citing the same article,
or to provide Web hyperlinks.
Recently, in concert with MEDLINE and a large number of publishers, NCBI
has introduced PubMed. PubMed contains all of MEDLINE, as well as citations
provided directly by the publishers. As such, PubMed contains more recent articles
than MEDLINE, as well as articles that may never appear in MEDLINE because of
their subject matter. This development led NCBI to introduce a new article identifier,
called a PubMed identifier (PMID). Articles appearing in MEDLINE will have both
a PMID and an MUID. Articles appearing only in PubMed will have only a PMID.
PMID serves the same purpose as MUID in providing a simple, reliable link to the
citation, a means of linking records together, and a means of setting up hyperlinks.
Publishers have also started to send information on ahead-of-print articles to
PubMed, so this information may now appear before the printed journal. A new
project, PubMed Central, is meant to allow electronic publication to occur in lieu
of or ahead of publication in a traditional, printed journal. PubMed Central records
contain the full text of the article, not just the abstract, and include all figures and
references.
The NCBI data model stores most citations as a collection called a Pub-equiv,
a set of equivalent citations that includes a reliable identifier (PMID or MUID) and
the citation itself. The presence of the citation form allows a useful display without
an extra retrieval from the database, whereas the identifier provides a reliable key
for linking or indexing the same citation in the record.
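A Pub-equiv can be pictured as one small record that carries both forms side by side. In the Python sketch below, the class layout is hypothetical, the citation is reduced to a plain string, and the choice to prefer PMID over MUID as the linking key is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PubEquiv:
    """Hypothetical model: equivalent forms of one citation, kept together."""
    pmid: Optional[int] = None
    muid: Optional[int] = None
    citation: Optional[str] = None  # display form, available without a lookup

    def key(self):
        """Return a reliable identifier for linking, or None if there is none."""
        if self.pmid is not None:
            return ("pmid", self.pmid)
        if self.muid is not None:
            return ("muid", self.muid)
        return None


# Placeholder identifiers and text.
pub = PubEquiv(pmid=1, muid=2, citation="Doe J. Example Journal 2000; 101.")
print(pub.key(), "|", pub.citation)
```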
SEQ-IDs: WHAT’S IN A NAME?
The NCBI data model defines a whole class of objects called Sequence Identifiers
(Seq-id). There has to be a whole class of such objects because NCBI integrates
sequence data from many sources that name sequence records in different ways and
where, of course, the individual names have different meanings. In one simple case,
PIR, SWISS-PROT, and the nucleotide sequence databases all use a string called an
‘‘accession number,’’ all having a similar format. Just saying ‘‘A10234’’ is not
enough to uniquely identify a sequence record from the collection of all these databases.
One must distinguish ‘‘A10234’’ in SWISS-PROT from ‘‘A10234’’ in PIR.
(The DDBJ/EMBL/GenBank nucleotide databases share a common set of accession
numbers; therefore, ‘‘A12345’’ in EMBL is the same as ‘‘A12345’’ in GenBank or
DDBJ.) To further complicate matters, although the sequence databases define their
records as containing a single sequence, PDB records contain a single structure,
which may contain more than one sequence. Because of this, a PDB Seq-id contains
a molecule name and a chain ID to identify a single unique sequence. The subsections
that follow describe the form and use of a few commonly used types of Seq-ids.
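The need for a class of identifiers, rather than a bare string, can be shown with a short Python sketch. The class names, the lowercase database labels, and the PDB values are illustrative assumptions, not the actual Seq-id definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessionId:
    """An accession string qualified by the database that issued it."""
    database: str
    accession: str


@dataclass(frozen=True)
class PdbId:
    """A PDB record holds a structure, so a chain is needed to pick one sequence."""
    molecule: str
    chain: str


# The three nucleotide databases share one set of accession numbers.
SHARED_NUCLEOTIDE = {"genbank", "embl", "ddbj"}


def same_record(x, y):
    """True when two AccessionId values point at the same record."""
    if x.accession != y.accession:
        return False
    if x.database in SHARED_NUCLEOTIDE and y.database in SHARED_NUCLEOTIDE:
        return True
    return x.database == y.database


print(same_record(AccessionId("swissprot", "A10234"), AccessionId("pir", "A10234")))
print(same_record(AccessionId("embl", "A12345"), AccessionId("genbank", "A12345")))
print(PdbId("EXAMPLE", "A") != PdbId("EXAMPLE", "B"))
```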
Locus Name
The locus appears on the LOCUS line in GenBank and DDBJ records and in the ID
line in EMBL records. These originally were the only identifier of a discrete
GenBank record. Like a genetic locus name, it was intended to act both as a unique
identifier for the record and as a mnemonic for the function and source organism of
the sequence. Because the LOCUS line is in a fixed format, the locus name is restricted
to ten or fewer numbers and uppercase letters. For many years in GenBank,
the first three letters of the name were an organism code and the remaining letters
a code for the gene (e.g., HUMHBB was used for ‘‘human β-globin region’’). However,
as with genetic locus names, locus names were changed when the function of
a region was discovered to be different from what was originally thought. This
instability in locus names is obviously a problem for an identifier for retrieval. In
addition, as the number of sequences and organisms represented in GenBank increased
geometrically over the years, it became impossible to invent and update such
mnemonic names in an efficient and timely manner. At this point, the locus name is
dying out as a useful name in GenBank, although it continues to appear prominently
on the first line of the flatfile to avoid breaking the established format.
Accession Number
Because of the difficulties in using the locus/ID name as the unique identifier for a
nucleotide sequence record, the International Nucleotide Sequence Database Collaborators
(DDBJ/EMBL/GenBank) introduced the accession number. It intentionally
carries no biological meaning, to ensure that it will remain (relatively) stable. It
originally consisted of one uppercase letter followed by five digits. New accessions
consist of two uppercase letters followed by six digits. The first letters were allocated
to the individual collaborating databases so that accession numbers would be unique
across the Collaboration (e.g., an entry beginning with a ‘‘U’’ was from GenBank).
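The two accession formats described above are simple enough to check with a regular expression. The Python sketch below covers only those two patterns (one letter plus five digits, or two letters plus six digits) and makes no claim about other formats; the function name is_accession is an assumption.

```python
import re

# One uppercase letter + five digits, or two uppercase letters + six digits.
ACCESSION = re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}")


def is_accession(text):
    """True if text has one of the two accession formats described in the text."""
    return ACCESSION.fullmatch(text) is not None


for candidate in ["U00001", "AB123456", "u00001", "U0001"]:
    print(candidate, is_accession(candidate))
```

A format check says nothing about whether a record with that accession exists; it only rejects strings that cannot be accessions of these two kinds.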
The accession number was an improvement over the locus/ID name, but, with
use, problems and deficiencies became apparent. For example, although the accession
is stable over time, many users noticed that the sequence retrieved by a particular
accession was not always the same. This is because the accession identifies the whole
database record. If the sequence in a record was updated (say by the insertion of
1000 bp at the beginning), the accession number did not change, as it was an updated
version of the same record. If one had analyzed the original sequence and recorded
that at position 100 of accession U00001 there was a putative protein-binding site,
after the update a completely different sequence would be found at position 100!
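The effect is easy to reproduce with toy data. In the Python sketch below, the sequences are placeholders and 1-based positions are converted to 0-based string indexes by subtracting one.

```python
# Placeholder sequence of 200 bases standing in for the original record.
original = "ACGT" * 50
# The update inserts 1000 bases at the beginning of the record.
updated = "G" * 1000 + original

position = 100  # 1-based position recorded against the accession

before = original[position - 1]
after = updated[position - 1]
shifted = updated[1000 + position - 1]

print(before, after, shifted)
```

The base originally at position 100 is still present, but it now sits at position 1100, and nothing in the accession number signals the shift.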
The accession number appears on the ACCESSION line of the GenBank record.
The first accession on the line, called the ‘‘primary’’ accession, is the key for retrieving
this record. Most records have only this type of accession number. However,
other accessions may follow the primary accession on the ACCESSION line. These
‘‘secondary’’ accessions are intended to give some notion of the history of the record.
For example, if U00001 and U00002 were merged into a single updated record, then
U00001 would be the primary accession on the new record and U00002 would
appear as a secondary accession. In standard practice, the U00002 record would be
removed from GenBank, since the older record had become obsolete, and the secondary
accessions would allow users to retrieve whatever records superseded the old
one. It should also be noted that, historically, secondary accession numbers do not
always mean the same thing; therefore, users should exercise care in their interpretations.
(Policies at individual databases differed, and even shifted over time in a
given database.) The use of secondary accession numbers also caused problems in
that there was still not enough information to determine exactly what happened and
why. Nonetheless, the accession number remains the most controlled and reliable
way to point to a record in DDBJ/EMBL/GenBank.
gi Number
In 1992, NCBI began assigning GenInfo Identifiers (gi) to all sequences processed
into Entrez, including nucleotide sequences from DDBJ/EMBL/GenBank, the protein
sequences from the translated CDS features, protein sequences from SWISS-PROT,
PIR, PRF, PDB, patents, and others. The gi is assigned in addition to the accession
number provided by the source database. Although the form and meaning of the
accession Seq-id varied depending on the source, the meaning and form of the gi are
the same for all sequences regardless of the source.
The gi is simply an integer number, sometimes referred to as a GI number. It is
an identifier for a particular sequence only. Suppose a sequence enters GenBank
and is given an accession number U00001. When the sequence is processed internally
at NCBI, it enters a database called ID. ID determines that it has not seen U00001
before and assigns it a gi number—for example, 54. Later, the submitter might
update the record by changing the citation, so U00001 enters ID again. ID, recognizing
the record, retrieves the first U00001 and compares its sequence with the new
one. If the two are completely identical, ID reassigns gi 54 to the record. If the
sequence differs in any way, even by a single base pair, it is given a new gi number,
say 88. However, the new sequence retains accession number U00001 because of
the semantics of the source database. At this time, ID marks the old record (gi 54)
with the date it was replaced and adds a ‘‘history’’ indicating that it was replaced
by gi 88. ID also adds a history to gi 88 indicating that it replaced gi 54.
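The assignment rule just described can be sketched in a few lines of Python. This is only a toy model, not NCBI's implementation: the class IdDatabase, its submit method, and its attribute names are hypothetical, gi numbers are simply handed out in sequence, and the replacement date is omitted for brevity.

```python
class IdDatabase:
    """Toy stand-in for the ID database (hypothetical names throughout)."""

    def __init__(self, first_gi=1):
        self._next_gi = first_gi
        self.current = {}      # accession -> gi of the live record
        self.sequence = {}     # gi -> sequence string
        self.replaced_by = {}  # history: old gi -> newer gi
        self.replaces = {}     # history: new gi -> older gi

    def _new_gi(self, seq):
        gi = self._next_gi
        self._next_gi += 1
        self.sequence[gi] = seq
        return gi

    def submit(self, accession, seq):
        """Return the gi for this accession, minting a new one if the sequence changed."""
        old_gi = self.current.get(accession)
        if old_gi is not None and self.sequence[old_gi] == seq:
            return old_gi  # identical sequence: the gi is reassigned unchanged
        new_gi = self._new_gi(seq)
        if old_gi is not None:
            # The sequence differs, so record the history in both directions.
            self.replaced_by[old_gi] = new_gi
            self.replaces[new_gi] = old_gi
        self.current[accession] = new_gi
        return new_gi


db = IdDatabase(first_gi=54)
print(db.submit("U00001", "ACGT"))  # 54
print(db.submit("U00001", "ACGT"))  # 54 (citation-only update, same sequence)
print(db.submit("U00001", "ACGA"))  # 55 (one base changed, new gi)
```

The accession stays attached to the record throughout, while the gi changes only when the sequence itself changes.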
The gi number serves three major purposes:
• It provides a single identifier across sequences from many sources.
• It provides an identifier that specifies an exact sequence. Anyone who analyzes
gi 54 and stores the analysis can be sure that it will be valid as long as U00001
has gi 54 attached to it.
• It is stable and retrievable. NCBI keeps the last version of every gi number.
Because the history is included in the record, anyone who discovers that gi
54 is no longer part of the GenBank release can still retrieve it from ID through
NCBI and examine the history to see that it was replaced by gi 88. Upon
aligning gi 54 to gi 88 to determine their relationship, a researcher may decide
to remap the former analysis to gi 88 or perhaps to reanalyze the data. This
can be done at any time, not just at GenBank release time, because gi 54 will
always be available from ID.
For these reasons, all internal processing of sequences at NCBI, from computing
Entrez sequence neighbors to determining when new sequence should be processed
or producing the BLAST databases, is based on gi numbers.
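The way a stored analysis can be brought up to date from the history can be sketched as follows. The function latest_gi and the dictionary layout are assumptions made for illustration; the only idea taken from the text is that each retired gi records the gi that replaced it.

```python
def latest_gi(gi, replaced_by):
    """Follow 'replaced by' links until reaching a gi that has no successor."""
    seen = set()
    while gi in replaced_by:
        if gi in seen:
            raise ValueError("cycle in replacement history")
        seen.add(gi)
        gi = replaced_by[gi]
    return gi


history = {54: 88}             # gi 54 was replaced by gi 88
print(latest_gi(54, history))  # 88
print(latest_gi(88, history))  # 88 (already current)
```

Finding the current gi is only the first step; the researcher still has to align the old and new sequences to decide whether the earlier analysis can be remapped.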
Accession.Version Combined Identifier
Recently, the members of the International Nucleotide Sequence Database Collaboration
(GenBank, EMBL, and DDBJ) introduced a ‘‘better’’ sequence identifier, one
that combines an accession (which identifies a particular sequence record) with a
version number (which tracks changes to the sequence itself). It is expected that this
kind of Seq-id will become the preferred method of citing sequences.
Users will still be able to retrieve a record based on the accession number alone,
without having to specify a particular version. In that case, the latest version of the
record will be obtained by default, which is the current behavior for queries using
Entrez and other retrieval programs.
Scientists who are analyzing sequences in the database (e.g., aligning all alcohol
dehydrogenase sequences from a particular taxonomic group) and wish to have their
conclusions remain valid over time will want to reference sequences by accession
and the given version number. Subsequent modification of one of the sequences by
its owner (e.g., 5′ extension during a study of the gene’s regulation) will result in
the version number being incremented appropriately. The analysis that cited accession
and version remains valid because a query using both the accession and version
will return the desired record.
Combining accession and version makes it clear to the casual user that a sequence
has changed since an analysis was done. Also, determining how many times
a sequence has changed becomes trivial with a version number. The accession.version
number appears on the VERSION line of the GenBank flatfile. For sequence retrieval,
the accession.version is simply mapped to the appropriate gi number, which remains
the underlying tracking identifier at NCBI.
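The retrieval behavior described above can be sketched in Python. The function names and the lookup table are hypothetical, and the gi values reuse the illustrative numbers from the gi discussion; the sketch assumes that version numbers are positive integers and that an accession given without a version resolves to the highest version on file.

```python
def split_accession_version(seq_id):
    """Return (accession, version); version is None when it was not given."""
    accession, dot, version = seq_id.partition(".")
    if not accession:
        raise ValueError("missing accession")
    if not dot:
        return accession, None
    if not (version.isascii() and version.isdigit()):
        raise ValueError("version must be an integer")
    return accession, int(version)


def resolve_gi(seq_id, versions):
    """Map an accession or accession.version to a gi.

    versions: accession -> {version number: gi} (hypothetical lookup table).
    """
    accession, version = split_accession_version(seq_id)
    known = versions[accession]
    if version is None:
        version = max(known)  # accession alone: latest version by default
    return known[version]


table = {"U00001": {1: 54, 2: 88}}
print(resolve_gi("U00001", table))    # 88
print(resolve_gi("U00001.1", table))  # 54
```

An analysis that stored U00001.1 keeps pointing at the sequence it was run on, whereas one that stored only U00001 silently follows every update.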
Accession Numbers on Protein Sequences
The International Sequence Database Collaborators also started assigning accession.version
numbers to protein sequences within the records. Previously, it was
difficult to reliably cite the translated product of a given coding region feature, except
by its gi number. This limited the usefulness of translated products found in BLAST
results, for example. These sequences will now have the same status as protein
sequences submitted directly to the protein databases, and they have the benefit of
direct linkage to the nucleotide sequence in which they are encoded, showing up as
a CDS feature’s /protein_id qualifier in the flatfile view. Protein accessions in
these records consist of three uppercase letters followed by five digits and an integer
indicating the version.
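The format just described (three uppercase letters, five digits, and an integer version) is easy to check mechanically. The following Python sketch assumes that the version is separated from the accession by a period, as in nucleotide accession.version identifiers; the function name and the example identifier AAA12345.1 are made up for illustration.

```python
import re

# Three uppercase letters, five digits, a period, and an integer version.
PROTEIN_ID = re.compile(r"([A-Z]{3}\d{5})\.(\d+)")


def parse_protein_id(text):
    """Return (accession, version) or None if the text does not fit the pattern."""
    match = PROTEIN_ID.fullmatch(text)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


print(parse_protein_id("AAA12345.1"))  # ('AAA12345', 1)
print(parse_protein_id("U00001.1"))    # None
```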
Reference Seq-id
The NCBI RefSeq project provides a curated, nonredundant set of reference sequence
standards for naturally occurring biological molecules, ranging from chromosomes
to transcripts to proteins. RefSeq identifiers are in accession.version form but are
prefixed with NC_ (chromosomes), NM_ (mRNAs), NP_ (proteins), or NT_ (constructed
genomic contigs). The NG_ prefix will be used for genomic regions or gene
clusters (e.g., immunoglobulin region) in the future. RefSeq records are a stable
reference point for functional annotation, point mutation analysis, gene expression
studies, and polymorphism discovery.
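Because the prefix encodes the kind of molecule, a program can classify a RefSeq identifier without retrieving the record. The Python sketch below assumes that the two-letter prefix is followed by an underscore (as in NM_000000.1, a made-up identifier) and covers only the prefixes named above; the function name is hypothetical.

```python
REFSEQ_PREFIXES = {
    "NC": "chromosome",
    "NM": "mRNA",
    "NP": "protein",
    "NT": "constructed genomic contig",
    "NG": "genomic region or gene cluster",
}


def refseq_kind(seq_id):
    """Return the molecule class for a RefSeq identifier, or None if unrecognized."""
    prefix, underscore, _rest = seq_id.partition("_")
    if not underscore:
        return None  # no underscore, so not in the assumed RefSeq form
    return REFSEQ_PREFIXES.get(prefix)


print(refseq_kind("NM_000000.1"))  # mRNA
print(refseq_kind("U00001.1"))     # None
```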
General Seq-id
The General Seq-id is meant to be used by genome centers and other groups as a
way of identifying their sequences. Some of these sequences may never appear in
public databases, and others may be preliminary data that eventually will be submitted.
For example, records of human chromosomes in the Entrez Genomes division
contain multiple physical and genetic maps, in addition to sequence components.
The physical maps are generated by various groups, and they use General Seq-ids
to identify the proper group.
Local Seq-id
The Local sequence identifier is most prominently used in the data submission tool
Sequin (see Chapter 4). Each sequence will eventually get an accession.version
identifier and a gi number, but only when the completed submission has
been processed by one of the public databases. During the submission process, Sequin
assigns a local identifier to each sequence. Because many of the software tools
made by NCBI require a sequence identifier, having a local Seq-id allows the use
of these tools without having to first submit data to a public database.
BIOSEQs: SEQUENCES
The Bioseq, or biological sequence, is a central element in the NCBI data model. It
comprises a single, continuous molecule of either nucleic acid or protein, thereby
defining a linear, integer coordinate system for the sequence. A Bioseq must have at
least one sequence identifier (Seq-id). It has information on the physical type of
molecule (DNA, RNA, or protein). It may also have annotations, such as biological
features referring to specific locations on specific Bioseqs, as well as descriptors.
Figure 2.2. Classes of Bioseqs. All Bioseqs represent a single, continuous molecule of nucleic
acid or protein, although the complete sequence may not be known. In a virtual
Bioseq, the type of molecule is known, but the sequence is not known, and the precise
length may not be known (e.g., from the size of a band on an electrophoresis gel). A raw
Bioseq contains a single contiguous string of bases or residues. A segmented Bioseq points
to its components, which are other raw or virtual Bioseqs (e.g., sequenced exons and undetermined
introns). A constructed sequence takes its original components and subsumes
them, resulting in a Bioseq that contains the string of bases or residues and a ‘‘history’’ of
how it was built. A map Bioseq places genes or physical markers, rather than sequence, on
its coordinates. A delta Bioseq can represent a segmented sequence but without the requirement
of assigning identifiers to each component (including gaps of known length),
although separate raw sequences can still be referenced as components. The delta sequence
is used for unfinished high-throughput genome sequences (HTGS) from genome centers
and for genomic contigs.
Descriptors provide additional information, such as the organism from which the
molecule was obtained. Information in the descriptors describes the entire Bioseq.
However, the Bioseq isn’t necessarily a fully sequenced molecule. It may be a
segmented sequence in which, for example, the exons have been sequenced but not
all of the intronic sequences have been determined. It could also be a genetic or
physical map, where only a few landmarks have been positioned.
Sequences are the Same
All Bioseqs have an integer coordinate system, with an integer length value, even if
the actual sequence has not been completely determined. Thus, for physical maps,
or for exons in highly spliced genes, the spacing between markers or exons may be
known only from a band on a gel. Although the coordinates of a fully sequenced
chromosome are known exactly, those in a genetic or physical map are a best guess,
with the possibility of significant error from the ‘‘real’’ coordinates.
Nevertheless, any Bioseq can be annotated with the same kinds of information.
For example, a gene feature can be placed on a region of sequenced DNA or at a
discrete location on a physical map. The map and the sequence can then be aligned
on the basis of their common gene features. This greatly simplifies the task of writing
software that can display these seemingly disparate kinds of data.
Sequences are Different
Despite the benefits derived from having a common coordinate system, the different
Bioseq classes do differ in the way they are represented. The most common classes
(Fig. 2.2) are described briefly below.
Virtual Bioseq. In the virtual Bioseq, the molecule type is known, and its
length and topology (e.g., linear, circular) may also be known, but the actual sequence
is not known. A virtual Bioseq can represent an intron in a genomic molecule
in which only the exon sequences have been determined. The length of the putative
sequence may be known only by the size of a band on an agarose gel.
Raw Bioseq. This is what most people would think of as a sequence, a single
contiguous string of bases or residues, in which the actual sequence is known. The
length is obviously known in this case, matching the number of bases or residues in
the sequence.
Segmented Bioseq. A segmented Bioseq does not contain raw sequences but
instead contains the identifiers of other Bioseqs from which it is made. This type of
Bioseq can be used to represent a genomic sequence in which only the exons are
known. The ‘‘parts’’ in the segmented Bioseq would be the individual, raw Bioseqs
representing the exons and the virtual Bioseqs representing the introns.
Delta Bioseq. Delta Bioseqs are used to represent the unfinished high-throughput
genome sequences (HTGS) derived at the various genome sequencing centers.
Using delta Bioseqs instead of segmented Bioseqs means that only one Seq-id is
needed for the entire sequence, even though subregions of the Bioseq are not known
at the sequence level. Implicitly, then, even at the early stages of their presence in
the databases, delta Bioseqs maintain the same accession number.
Map Bioseq. Used to represent genetic and physical maps, a map Bioseq is
similar to a virtual Bioseq in that it has a molecule type, perhaps a topology, and a
length that may be a very rough estimate of the molecule’s actual length. This information
merely supplies the coordinate system, a property of every Bioseq. Given
this coordinate system for a genetic map, we estimate the positions of genes on it
based on genetic evidence. The table of the resulting gene features is the essential
data of the map Bioseq, just as bases or residues constitute the raw Bioseq’s data.
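The relationship between raw, virtual, and segmented Bioseqs can be made concrete with a small Python sketch. The class and field names below are hypothetical and are not the names used in the NCBI specification; the point is only that every class exposes a length, so that a segmented Bioseq can derive its own coordinate system from the parts it references.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Bioseq:
    seq_id: str
    mol_type: str  # "DNA", "RNA", or "protein"


@dataclass
class RawBioseq(Bioseq):
    residues: str = ""

    @property
    def length(self) -> int:
        return len(self.residues)  # length is known exactly


@dataclass
class VirtualBioseq(Bioseq):
    length: Optional[int] = None   # an estimate, or None if unknown


@dataclass
class SegmentedBioseq(Bioseq):
    part_ids: List[str] = field(default_factory=list)  # identifiers, not sequence


def segmented_length(seg: SegmentedBioseq, lookup: Dict[str, Bioseq]) -> Optional[int]:
    """Sum the lengths of the parts; None if any part has no known length."""
    total = 0
    for part_id in seg.part_ids:
        part_length = lookup[part_id].length
        if part_length is None:
            return None
        total += part_length
    return total


parts = {
    "exon1": RawBioseq("exon1", "DNA", "ATGGCC"),
    "intron1": VirtualBioseq("intron1", "DNA", 500),  # sized from a gel band
    "exon2": RawBioseq("exon2", "DNA", "GGTTAA"),
}
gene = SegmentedBioseq("gene", "DNA", ["exon1", "intron1", "exon2"])
print(segmented_length(gene, parts))  # 512
```

A delta Bioseq would differ mainly in that the components need not carry identifiers of their own, which this sketch does not attempt to model.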
BIOSEQ-SETs: COLLECTIONS OF SEQUENCES
A biological sequence is often most appropriately stored in the context of other,
related sequences. For example, a nucleotide sequence and the sequences of the
protein products it encodes naturally belong in a set. The NCBI data model provides
the Bioseq-set for this purpose.
A Bioseq-set can have a list of descriptors. When packaged on a Bioseq, a
descriptor applies to all of that Bioseq. When packaged on a Bioseq-set, the descriptor
applies to every Bioseq in the set. This arrangement is convenient for attaching
publications and biological source information, which are expected on all sequences
but frequently are identical within sets of sequences. For example, both the DNA
and protein sequences are obviously from the same organism, so this descriptor
information can be applied to the set. The same logic may apply to a publication.
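The scoping rule for descriptors can be sketched as a walk down a nested structure. In the Python below, the dictionary layout (the keys "id", "descr", and "members") and the function name are hypothetical; the sketch also assumes that when the same descriptor appears at both levels, the one packaged closer to the Bioseq takes precedence.

```python
def effective_descriptors(node, bioseq_id, inherited=None):
    """Collect every descriptor that applies to one Bioseq inside a set."""
    merged = dict(inherited or {})
    merged.update(node.get("descr", {}))  # descriptors packaged at this level
    if node.get("id") == bioseq_id:
        return merged
    for member in node.get("members", []):
        found = effective_descriptors(member, bioseq_id, merged)
        if found is not None:
            return found
    return None  # the Bioseq is not in this branch


nuc_prot = {
    "descr": {"organism": "Escherichia coli"},  # stated once for the whole set
    "members": [
        {"id": "nuc1", "descr": {"molecule": "DNA"}},
        {"id": "prot1", "descr": {"molecule": "protein"}},
    ],
}
print(effective_descriptors(nuc_prot, "prot1"))
# {'organism': 'Escherichia coli', 'molecule': 'protein'}
```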
The most common Bioseq-sets are described in the sections that follow.
Nucleotide/Protein Sets
The Nuc-prot set, containing a nucleotide and one or more protein products, is the
type of set most frequently produced by a Sequin data submission. The component
Bioseqs are connected by coding sequence region (CDS) features that describe how
translation from nucleotide to protein sequence is to proceed. In a traditional nucleotide
or protein sequence database, these records might have cross-references to each
other to indicate this relationship. The Nuc-prot set makes this explicit by packaging
them together. It also allows descriptive information that applies to all sequences
(e.g., the organism or publication citation) to be entered once (see Seq-descr: Describing
the Sequence, below).
Population and Phylogenetic Studies
A major class of sequence submissions represents the results of population or phylogenetic
studies. Such research involves sequencing the same gene from a number
of individuals in the same species (population study) or in different species (phylogenetic
study). An alignment of the individual sequences may also be submitted (see
Seq-align: Alignments, below). If the gene encodes a protein, the components of the
Population or Phylogenetic Bioseq-set may themselves be Nuc-prot sets.
Other Bioseq-sets
A Seg set contains a segmented Bioseq and a Parts Bioseq-set, which in turn contains
the raw Bioseqs that are referenced by the segmented Bioseq. This may constitute
the nucleotide component of a Nuc-prot set.
An Equiv Bioseq-set is used in the Entrez Genomes division to hold multiple
equivalent Bioseqs. For example, human chromosomes have one or more genetic
maps, physical maps derived by different methods and a segmented Bioseq on which
‘‘islands’’ of sequenced regions are placed. An alignment between the various Bioseqs
is made based on references to any available common markers.
SEQ-ANNOT: ANNOTATING THE SEQUENCE
A Seq-annot is a self-contained package of sequence annotations or information that
refers to specific locations on specific Bioseqs. It may contain a feature table, a set
of sequence alignments, or a set of graphs of attributes along the sequence.
Multiple Seq-annots can be placed on a Bioseq or on a Bioseq-set. Each Seq-annot
can have specific attribution. For example, PowerBLAST (Zhang and Madden,
1997) produces a Seq-annot containing sequence alignments, and each Seq-annot is
named based on the BLAST program used (e.g., BLASTN, BLASTX, etc.). The
individual blocks of alignments are visible in the Entrez and Sequin viewers.
Because the components of a Seq-annot have specific references to locations on
Bioseqs, the Seq-annot can stand alone or be exchanged with other scientists, and it
need not reside in a sequence record. The scope of descriptors, on the other hand,
does depend on where they are packaged. Thus, information about Bioseqs can be
created, exchanged, and compared independently of the Bioseq itself. This is an
important attribute of the Seq-annot and of the NCBI data model.
Seq-feat: Features
A sequence feature (Seq-feat) is a block of structured data explicitly attached to a
region of a Bioseq through one or two sequence locations (Seq-locs). The Seq-feat
itself can carry information common to all features. For example, there are flags to
indicate whether a feature is partial (i.e., goes beyond the end of the sequence of
the Bioseq), whether there is a biological exception (e.g., RNA editing that explains
why a codon on the genomic sequence does not translate to the expected amino
acid), and whether the feature was experimentally determined (e.g., an mRNA was
isolated from a proposed coding region).
A feature must always have a location. This is the Seq-loc that states where on
the sequence the feature resides. A coding region’s location usually starts at the ATG
and ends at the terminator codon. The location can have more than one interval if
it is on a genomic sequence and mRNA splicing occurs. In cases of alternative
splicing, separate coding region features are created, with one multi-interval Seq-loc
for each isolated molecular species.
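A multi-interval location can be illustrated with a short Python sketch that joins the intervals of a spliced coding region. The representation is an assumption made for illustration: intervals are (start, stop) pairs in 1-based inclusive coordinates, listed in order along the plus strand. The function name and the toy sequence are likewise made up.

```python
def extract_location(sequence, intervals):
    """Join the pieces of a sequence named by a multi-interval location."""
    return "".join(sequence[start - 1:stop] for start, stop in intervals)


# Toy genomic sequence: 2 flanking bases, exon, 12-base intron, exon, 2 flanking bases.
genomic = "CC" + "ATGGCC" + "GTAAGTTTTCAG" + "GGTTAA" + "CC"
cds_location = [(3, 8), (21, 26)]
print(extract_location(genomic, cds_location))  # ATGGCCGGTTAA
```

Two alternatively spliced products would simply be two features over the same sequence, each with its own list of intervals.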
Optionally, a feature may have a product. For a coding region, the product Seq-loc
points to the resulting protein sequence. This is the link that allows the data
model to separately maintain the nucleotide and protein sequences, with annotation
on each sequence appropriate to that molecule. An mRNA feature on a genomic
sequence could have as its product an mRNA Bioseq whose sequence reflects the
results of posttranscriptional RNA editing. Features also have information unique to
the kind of feature. For example, the CDS feature has fields for the genetic code
and reading frame, whereas the tRNA feature has information on the amino acid
transferred.
This design completely modularizes the components required by each feature
type. If a particular feature type calls for a new field, no other field is affected. A
new feature type, even a very complex one, can be added without changing the
existing features. This means that software used to display feature locations on a
sequence need consider only the location field common to all features.
Although the DDBJ/EMBL/GenBank feature table allows numerous kinds of
features to be included (see Chapter 3), the NCBI data model treats some features
as ‘‘more equal’’ than others. Specifically, certain features directly model the central
dogma of molecular biology and are most likely to be used in making connections
between records and in discovering new information by computation. These features
are discussed next.
Genes. A gene is a feature in its own right. In the past, it was merely a qualifier
on other features. The Gene feature indicates the location of a gene, a heritable region
of nucleic acid sequence that confers a measurable phenotype. That phenotype may
be achieved by many components of the gene being studied, including, but not
limited to, coding regions, promoters, enhancers, and terminators. The Gene feature
is meant to approximately cover the region of nucleic acid considered by workers
in the field to be the gene. This admittedly fuzzy concept has an appealing simplicity,
and it fits in well with higher-level views of genes such as genetic maps. It has
practical utility in the era of large genomic sequencing when a biologist may wish
to see just the ‘‘xyz gene’’ and not a whole chromosome. The Gene feature may also
contain cross-references to genetic databases, where more detailed information on
the gene may be found.
RNAs. An RNA feature can describe both coding intermediates (e.g., mRNAs)
and structural RNAs (e.g., tRNAs, rRNAs). The locations of an mRNA and the
corresponding coding region (CDS) completely determine the locations of 5′ and 3′
untranslated regions (UTRs), exons, and introns.
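How the remaining elements follow from the mRNA and CDS locations can be sketched in Python. The sketch assumes a plus-strand feature, 1-based inclusive (start, stop) intervals listed in order, and a CDS summarized by its first and last coordinates; the function names and coordinates are hypothetical.

```python
def introns(mrna_intervals):
    """Introns are the gaps between consecutive mRNA intervals (the exons)."""
    return [
        (left_stop + 1, right_start - 1)
        for (_, left_stop), (right_start, _) in zip(mrna_intervals, mrna_intervals[1:])
    ]


def utrs(mrna_intervals, cds_start, cds_stop):
    """Return (five_prime, three_prime) UTR intervals on the plus strand."""
    five = [(s, min(e, cds_start - 1)) for s, e in mrna_intervals if s < cds_start]
    three = [(max(s, cds_stop + 1), e) for s, e in mrna_intervals if e > cds_stop]
    return five, three


mrna = [(1, 100), (201, 300)]  # two exons
print(introns(mrna))           # [(101, 200)]
print(utrs(mrna, 41, 260))     # ([(1, 40)], [(261, 300)])
```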
Figure 2.3. The Coding Region (CDS) feature links specific regions on a nucleotide sequence
with its encoded protein product. All features in the NCBI data model have a ‘‘location’’
field, which is usually one or more intervals on a sequence. (Multiple intervals on
a CDS feature would correspond to individual exons.) Features may optionally have a ‘‘product’’
field, which for a CDS feature is the entirety of the resulting protein sequence. The
CDS feature also contains a field for the genetic code. This appears in the GenBank flat
file as a /transl_table qualifier. In this example, the Bacterial genetic code (code 11) is
indicated. A CDS may also have translation exceptions indicating that a particular residue
is not what is expected, given the codon and the genetic code. In this example, residue
196 in the protein is selenocysteine, indicated by the /transl_except qualifier. NCBI
software includes functions for converting between codon locations and residue locations,
using the CDS as its guide. This capability is used to support the historical conventions of
GenBank format, allowing a signal peptide, annotated on the protein sequence, to appear
in the GenBank flat file with a location on the nucleotide sequence.
Coding Regions. A Coding Region (CDS) feature in the NCBI data model
can be thought of as ‘‘instructions to translate’’ a nucleic acid into its protein product,
via a genetic code (Fig. 2.3). A coding region serves as a link between the nucleotide
and protein. It is important to note that several situations can provide exceptions to
the classical colinearity of gene and protein. Translational stuttering (ribosomal slippage),
for example, merely results in the presence of overlapping intervals in the
feature’s location Seq-loc.
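The role of the coding region as a link between the two coordinate systems can be sketched as a conversion from a residue number to the nucleotide positions of its codon. This is not the NCBI toolkit function; the name codon_positions is hypothetical, and the sketch assumes a plus-strand CDS in reading frame 1 with 1-based inclusive intervals.

```python
def codon_positions(intervals, residue_number):
    """Return the three nucleotide positions that encode one protein residue."""
    if residue_number < 1:
        raise ValueError("residue numbers start at 1")
    # Lay out every coding position in translation order. Overlapping
    # intervals (as with ribosomal slippage) simply repeat positions.
    positions = [p for start, stop in intervals for p in range(start, stop + 1)]
    first = 3 * (residue_number - 1)
    codon = positions[first:first + 3]
    if len(codon) != 3:
        raise ValueError("residue lies outside the coding region")
    return tuple(codon)


cds_intervals = [(3, 9), (22, 26)]        # 7 coding bases, then 5 more
print(codon_positions(cds_intervals, 1))  # (3, 4, 5)
print(codon_positions(cds_intervals, 3))  # (9, 22, 23), a codon split by an intron
```

Applying the same conversion to both ends of a feature annotated on the protein, such as a signal peptide, gives its location on the nucleotide sequence.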
The genetic code is assumed to be universal unless explicitly given in the Coding
Region feature. When the genetic code is not followed at specific positions in the
sequence—for example, when alternative initiation codons are used in the first position,
when suppressor tRNAs bypass a terminator, or when selenocysteine is added
—the Coding Region feature allows these anomalies to be indicated.
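The interplay of a default genetic code and position-specific exceptions can be sketched in Python. The function and its parameters are hypothetical; the table is the standard genetic code written in the conventional TCAG codon order, and exceptions are keyed by residue number in the spirit of the selenocysteine example in Figure 2.3. The input is assumed to be a plus-strand coding sequence in frame 1.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), STANDARD_AAS)
}


def translate(cds, code=None, exceptions=None):
    """Translate a coding sequence, stopping at the first terminator.

    code: codon -> residue table; the standard code is used when omitted.
    exceptions: residue number -> residue, overriding the table at that position.
    """
    code = STANDARD_CODE if code is None else code
    exceptions = exceptions or {}
    protein = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        residue_number = i // 3 + 1
        if residue_number in exceptions:
            aa = exceptions[residue_number]
        else:
            aa = code[cds[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


cds = "ATGTGAGGCTAA"
print(translate(cds))                       # M   (TGA read as a terminator)
print(translate(cds, exceptions={2: "U"}))  # MUG (TGA read as selenocysteine)
```

A nonstandard code, such as a mitochondrial one, would be supplied as a different table through the code parameter, leaving the exceptions mechanism for anomalies at individual positions.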
Proteins. A Protein feature names (or at least describes) a protein or proteolytic
product of a protein. A single protein Bioseq may have many Protein features on it.
It may have one over its full length describing a pro-peptide, the primary product of
translation. (The name in this feature is used for the /product qualifier in the
CDS feature that produces the protein.) It may have a shorter protein feature describing
the mature peptide or, in the case of viral polyproteins, several mature
peptide features. Signal peptides that guide a protein through a membrane may also
be indicated.
Others. Several other features are less commonly used. A Region feature provides
a simple way to name a region of a chromosome (e.g., ‘‘major histocompatibility
complex’’) or a domain on a polypeptide. A Bond feature annotates a bond
between two residues in a protein (e.g., disulfide). A Site feature annotates a known
site (e.g., active, binding, glycosylation, methylation, phosphorylation).
Finally, numerous features exist in the table of legal features, covering many
aspects of biology. However, they are less likely than the above-mentioned features
to be used for making connections between records or for making discoveries based
on computation.