Collection of Articles

Wednesday, 04 January 2012

AQUACULTURE BIOINFORMATICS

ATTENTION
THIS IS THE ANSWER I WROTE TO THE FINAL EXAM (UAS) QUESTION . . . ! ! ! !
PLEASE SEE BELOW:

Identification and localization of the structural proteins of anguillid herpesvirus 1

Many of the known fish herpesviruses have important aquaculture species as their natural host, and may cause serious disease and mortality. Anguillid herpesvirus 1 (AngHV-1) causes a hemorrhagic disease in the European eel, Anguilla anguilla. Despite their importance, fundamental molecular knowledge of fish herpesviruses is still limited. In this study we describe the identification and localization of the structural proteins of AngHV-1. Purified virions were fractionated into a capsid-tegument fraction and an envelope fraction, and premature capsids were isolated from infected cells. Proteins were extracted by different methods and identified by mass spectrometry. A total of 40 structural proteins were identified, of which 7 could be assigned to the capsid, 11 to the envelope, and 22 to the tegument. The identification and localization of these proteins allowed functional predictions. Our findings include the identification of the putative capsid triplex protein 1, the predominant tegument protein, and the major antigenic envelope proteins. Eighteen of the 40 AngHV-1 structural proteins had sequence homologues in the related Cyprinid herpesvirus 3 (CyHV-3). Conservation of fish herpesvirus structural genes appeared to be high for the capsid proteins, limited for the tegument proteins, and low for the envelope proteins. The identification and localization of the structural proteins of AngHV-1 in this study adds to the fundamental knowledge of members of the Alloherpesviridae family, especially of the genus Cyprinivirus.

For more details, please see the journal here

From the summary of the journal above, the following conclusions can be drawn:

Cyprinid herpesvirus 3 (CyHV-3)?
Cyprinid herpesvirus 3 is a herpesvirus of carp that is related to the virus studied in the journal. The virus that attacks eels is anguillid herpesvirus 1 (AngHV-1), which affects the internal organ systems so that the eel suffers hemorrhaging.

The results of this study will therefore facilitate more function-oriented characterization of the proteins of interest. In addition, this information is essential for further studies of the pathobiology of this virus, and will support the development of specific diagnostic tools and vaccines. The virus can be identified using bioinformatics systems in the fisheries field, so that the disease can be controlled and production yields can increase.

So bioinformatics in aquaculture serves to identify and resolve the cases that arise in the fisheries field.

SOURCE:
http://www.springerlink.com/journals/

NAME: LUTHFI ADHI VIRNANTO

STUDENT ID (NIM): 26010210120032

STUDY PROGRAM: AQUACULTURE '10

Tuesday, 03 January 2012

BIOINFORMATICS in FISHERIES

Bioinformatics is the science of applying computational techniques to manage and analyze biological information. The field covers the application of mathematical, statistical, and informatics methods to solve biological problems, especially those involving DNA and amino acid sequences.

Bioinformatics combines computer science with biology in order to analyze biological data.

Advances in DNA-based technologies such as genome sequencing have led to an explosion of genetic information produced by researchers. Managing this flood of genetic information requires computer science, which gave rise to a new field called bioinformatics. Bioinformatics software and websites are expected to help research on the molecular biology of cultured organisms, so that the research is easier to carry out and its results are more valid. The use of bioinformatics software in research is expected to increase the productivity of aquaculture.

CONTENT

In principle, the molecular biology approach can be pursued at three different molecular levels. The first is study at the DNA level (genome annotation), which identifies the genes in a genome and then analyzes their location and function. The second is study at the RNA level (transcriptomics), which examines all the transcripts (the products of gene transcription) produced by a genome. The third is study at the protein level (proteomics), which examines all the proteins (the products of RNA translation) produced by a genome. All three aim to improve the quality and quantity of aquaculture production.
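
The three levels above follow the flow from DNA to RNA to protein. As a minimal sketch of that flow (not a method taken from the cited journals), the Python below transcribes a short DNA string and translates it with the standard genetic code. The function names and the example sequence are made up for illustration.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """DNA level to RNA level: replace thymine with uracil."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """RNA level to protein level: read codons until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


example_dna = "ATGGCCTTTTAA"  # hypothetical coding sequence
example_mrna = transcribe(example_dna)
example_protein = translate(example_mrna)
```

Here `example_mrna` stands for the transcript level and `example_protein` for the protein level of the same hypothetical gene.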

PROTEOMICS

Genome function can be studied at the protein level, or translation level, by analyzing all the proteins produced by an organism. Large-scale protein analysis of this kind is known as proteomics. In fisheries, protein analysis can be used to formulate fish feed based on the proteins contained in the body of the fish. The sequences obtained are usually matched using the BLASTn program to identify the amino acid components they encode.

CLOSING

Molecular approaches using bioinformatics have been applied in fisheries, covering genome annotation, transcriptomics, and proteomics. For genome annotation, the bioinformatics program used is BLAST; for transcriptomics, it is the construction of microarray chips; and for proteomics, it is the analysis of the amino acids encoded by genes using the BLASTn program.

Conclusion from the journal: bioinformatics in fisheries serves to identify genes in fish, which are then engineered to produce new, improved fish varieties that increase the productivity of aquaculture.

The journals can be viewed in full below:

http://www.bbrp2b.kkp.go.id/publikasi/prosiding/2008/brawijaya/17.%20STUDI%20BIOINFORMATIKA%20MIKROBA%20Streptomyces%20PENYANDI%20GEN%20TGase.pdf

http://biotech-uria.synthasite.com/Genomikafungsionalkelautan

http://www.biomedcentral.com/1471-2164/9/508

http://isjd.pdii.lipi.go.id/admin/jurnal/3208183190.pdf

Thursday, 15 December 2011

What are GIS and remote sensing systems . . . ? ? ?

Search results on Geographic Information Systems (GIS; SIG in Indonesian) can be accessed at http://journal.ui.ac.id/upload/artikel/05-Penggunaan%20Metode%20Analisa_Bangun.PDF

Research Method

Geographic Information Systems (GIS) have been known since the early 1960s in Canada and the United States, where they were widely used for Land Information Systems. Today GIS is widely used for other purposes such as regional development, mapping, and environmental work.
GIS began to be used in Indonesia in the early 1980s, mainly for map making, regional management, environmental analysis, and agrarian affairs. The technology is essentially characterized by its ability to input, store, process, and present data in a computer system, where the data may be images, text, or numbers.
A GIS is of no use unless the five components that make up the system (hardware, software, data, operators, and procedures) are all in place, so each component must truly meet its criteria.
For the data component in particular, the data must comply with the requirements, meaning that they must be accurate, complete, up to date, and correct. Such data, that is, data with good validity, can be obtained through correct data collection and processing procedures that follow the established criteria.
Two methods can be used to analyze field data: an analytical approach using statistical methods and a graphical approach using remote sensing methods [2].
Image processing was carried out as follows [3]:
1. Contrast enhancement.
Enhancement was applied to each band (XS-1, XS-2, XS-3) using linear and exponential methods. Contrast enhancement (contrast stretching) does not affect the original values of the image.
2. Construction of the Red-Green-Blue composite.
The composite is built from the bands (XS-1, XS-2, XS-3) with the best visual contrast. The contrast of the RGB composite as a whole is improved by adjusting the contrast of each of its constituent single bands.
3. Classification.
Classification was performed using two bands (XS-2 and XS-3) with the bidimensional histogram method.
4. Geometric correction.
Geometric correction was performed by overlaying the classified image on a topographic map.
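
Steps 1 and 2 above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the processing chain used in the cited study: the pixel values are hypothetical, only the linear method is shown, and the assignment of XS-3, XS-2, and XS-1 to red, green, and blue is an assumed false-color convention that the source does not specify.

```python
def linear_stretch(band, out_min=0, out_max=255):
    """Return a linearly stretched copy of a band; the input is not modified."""
    flat = [value for row in band for value in row]
    low, high = min(flat), max(flat)
    if high == low:  # flat image: nothing to stretch
        return [[out_min for _ in row] for row in band]
    scale = (out_max - out_min) / (high - low)
    return [[round(out_min + (value - low) * scale) for value in row]
            for row in band]


def rgb_composite(red, green, blue):
    """Stack three single bands into rows of (R, G, B) pixel tuples."""
    return [list(zip(r, g, b)) for r, g, b in zip(red, green, blue)]


# Hypothetical 2 x 2 digital numbers for the three bands.
xs1 = [[50, 60], [70, 100]]
xs2 = [[20, 30], [40, 45]]
xs3 = [[80, 90], [100, 120]]

stretched_xs1 = linear_stretch(xs1)
composite = rgb_composite(
    linear_stretch(xs3),  # assumed red channel
    linear_stretch(xs2),  # assumed green channel
    stretched_xs1,        # assumed blue channel
)
```

Because `linear_stretch` returns a new list, the original band values stay available for classification, which matches the remark that contrast stretching does not affect the original image values.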
Statistical analysis was carried out to find relationships between mangrove species based on vegetation characteristics and the physical and chemical characteristics of the ecosystem (represented by temperature, pH, Cl content, suspended solids (SS), BOD, COD, and salinity) at both high and low tide; for soil, the parameters used were granulometry, salinity, and NaCl.
The statistical analysis has two parts. The first is a procedure to eliminate autocorrelation among the variables using Principal Component Analysis (PCA). PCA transforms the variables into a new, smaller set of variables that can explain the variability of the data. The second part is the statistical analysis used to build the model.
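
To illustrate how PCA replaces correlated variables with fewer new ones, here is a minimal sketch for the two-variable case, written in plain Python. The salinity and chloride values are hypothetical, and the function name is made up for illustration; the study itself analyzed more variables and would have used a general multivariate routine.

```python
import math


def pca_two_variables(xs, ys):
    """Return the two eigenvalues of the covariance matrix and the share
    of total variance explained by the first principal component."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    syy = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    # Eigenvalues of a symmetric 2 x 2 matrix from its trace and determinant.
    half_trace = (sxx + syy) / 2
    determinant = sxx * syy - sxy ** 2
    root = math.sqrt(max(half_trace ** 2 - determinant, 0.0))
    first, second = half_trace + root, half_trace - root
    total = first + second
    ratio = first / total if total > 0 else 0.0
    return (first, second), ratio


# Hypothetical, perfectly correlated measurements at four stations.
salinity = [10.0, 20.0, 30.0, 40.0]
chloride = [5.5, 11.0, 16.5, 22.0]
eigenvalues, explained = pca_two_variables(salinity, chloride)
```

With perfectly correlated inputs, nearly all the variance falls on the first component, so one new variable can stand in for the two original ones.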

Reference List:
[1] Hartono, B. Muljo Sukojo, Monitoring Mangrove Disappearance by Remote Sensing: A Case Study in Surabaya, East Java Indonesia. The Indonesia Journal of Geography, 1991.
[2] T.M. Lillesand, R.W. Kiefer, Penginderaan Jarak Jauh dan Interpretasi Citra. Gajahmada University Press, Yogyakarta, 1990.
[3] W.R. Dillon, M. Goldstein, Multivariate Analysis, Methods and Applications. John Wiley and Sons, Inc., New York, 1984.
[4] Bangun Muljo Sukojo, Analyse Ecologique Des Mangroves de Java (Indonesie) et Cartographie Par Teledetection Satellitaire, These Universite Toulouse 3, 1991.

The summary of the results above is:

To analyze the ecosystem of the area, two factors are highly dominant, namely soil and water. There are also other physical factors, such as tidal influence and wind, and non-physical factors, such as social, economic, and cultural ones.

The parameters used for water analysis include temperature (T), pH, Cl content, suspended solids (SS), and salinity (Sal.), at both low and high tide; for soil, the parameters used were granulometry, salinity, and NaCl.

Functions of GIS
Based on its original design, the main function of GIS is to perform spatial data analysis. From the standpoint of geographic data processing, GIS is not a new invention. Geographic data processing has long been carried out in many fields of study; what distinguishes GIS from the earlier processing is only its use of digital data.

Wednesday, 14 December 2011

NCBI

Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, Second Edition

Andreas D. Baxevanis, B.F. Francis Ouellette

ISBNs: 0-471-38390-2 (Hardback); 0-471-38391-0 (Paper); 0-471-22392-1 (Electronic)

THE NCBI DATA MODEL

James M. Ostell

National Center for Biotechnology Information

National Library of Medicine

National Institutes of Health

Bethesda, Maryland

Sarah J. Wheelan

Department of Molecular Biology and Genetics

The Johns Hopkins School of Medicine

Baltimore, Maryland

Jonathan A. Kans

National Center for Biotechnology Information

National Library of Medicine

National Institutes of Health

Bethesda, Maryland

INTRODUCTION

Why Use a Data Model?

Most biologists are familiar with the use of animal models to study human diseases.

Although a disease that occurs in humans may not be found in exactly the same

form in animals, often an animal disease shares enough attributes with a human

counterpart to allow data gathered on the animal disease to be used to make inferences

about the process in humans. Mathematical models describing the forces involved

in musculoskeletal motions can be built by imagining that muscles are combinations

of springs and hydraulic pistons and bones are lever arms, and, often times,

such models allow meaningful predictions to be made and tested about the obviously

much more complex biological system under consideration. The more closely and

elegantly a model follows a real phenomenon, the more useful it is in predicting or

understanding the natural phenomenon it is intended to mimic.

In this same vein, some 12 years ago, the National Center for Biotechnology

Information (NCBI) introduced a new model for sequence-related information. This

new and more powerful model made possible the rapid development of software and

the integration of databases that underlie the popular Entrez retrieval system and on

which the GenBank database is now built (cf. Chapter 7 for more information on

Entrez). The advantages of the model (e.g., the ability to move effortlessly from the

published literature to DNA sequences to the proteins they encode, to chromosome

maps of the genes, and to the three-dimensional structures of the proteins) have been

apparent for years to biologists using Entrez, but very few biologists understand the

foundation on which this model is built. As genome information becomes richer and

more complex, more of the real, underlying data model is appearing in common

representations such as GenBank files. Without going into great detail, this chapter

attempts to present a practical guide to the principles of the NCBI data model and

its importance to biologists at the bench.

Some Examples of the Model

The GenBank flatfile is a “DNA-centered” report, meaning that a region of DNA

coding for a protein is represented by a “CDS feature,” or “coding region,” on the

DNA. A qualifier (/translation="MLLYY") describes a sequence of amino

acids produced by translating the CDS. A limited set of additional features of the

DNA, such as mat_peptide, are occasionally used in GenBank flatfiles to describe

cleavage products of the (possibly unnamed) protein that is described by a

/translation, but clearly this is not a satisfactory solution. Conversely, most

protein sequence databases present a “protein-centered” view in which the connection

to the encoding gene may be completely lost or may be only indirectly referenced

by an accession number. Often times, these connections do not provide the

exact codon-to-amino acid correspondences that are important in performing mutation

analysis.

The NCBI data model deals directly with the two sequences involved: a DNA

sequence and a protein sequence. The translation process is represented as a link

between the two sequences rather than an annotation on one with respect to the

other. Protein-related annotations, such as peptide cleavage products, are represented

as features annotated directly on the protein sequence. In this way, it becomes very

natural to analyze the protein sequences derived from translations of CDS features

by BLAST or any other sequence search tool without losing the precise linkage back

to the gene. A collection of a DNA sequence and its translation products is called a

Nuc-prot set, and this is how such data is represented by NCBI. The GenBank flatfile

format that many readers are already accustomed to is simply a particular style of

report, one that is more “human-readable” and that ultimately flattens the connected

collection of sequences back into the familiar one-sequence, DNA-centered view.

The navigation provided by tools such as Entrez much more directly reflects the

underlying structure of such data. The protein sequences derived from GenBank

translations that are returned by BLAST searches are, in fact, the protein sequences

from the Nuc-prot sets described above.
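
As a rough illustration of the idea (not NCBI's actual ASN.1 specification or toolkit API), a Nuc-prot set can be pictured as two sequence records plus an explicit link between them. All class names, identifiers, and the example DNA below are hypothetical; only the translation MLLYY and the mat_peptide feature name come from the text.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Feature:
    """An annotation placed on one sequence, in that sequence's coordinates."""
    label: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


@dataclass
class BioSequence:
    seq_id: str
    residues: str
    features: List[Feature] = field(default_factory=list)


@dataclass
class CodingLink:
    """Links a coding region on the DNA to the protein it encodes."""
    dna_id: str
    protein_id: str
    cds_start: int  # 0-based position of the first coding base on the DNA

    def codon_span(self, residue_index: int) -> Tuple[int, int]:
        """Return the DNA span (start, end) of the codon for one residue."""
        first = self.cds_start + 3 * residue_index
        return first, first + 3


@dataclass
class NucProtSet:
    dna: BioSequence
    proteins: List[BioSequence]
    links: List[CodingLink]


# Hypothetical DNA: two flanking bases, a CDS encoding MLLYY, a stop codon.
dna = BioSequence("dna1", "GGATGCTGCTGTATTATTAACC")
# The cleavage product is annotated on the protein, not on the DNA.
protein = BioSequence("prot1", "MLLYY", [Feature("mat_peptide", 1, 5)])
link = CodingLink("dna1", "prot1", cds_start=2)
nuc_prot = NucProtSet(dna, [protein], [link])

start, end = link.codon_span(3)           # fourth residue, a tyrosine
codon = nuc_prot.dna.residues[start:end]  # exact codon behind that residue
```

Keeping the link as its own object is what preserves the codon-to-amino-acid correspondence: the protein can be searched as an ordinary sequence, and any residue can still be traced back to its codon on the DNA.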

The standard GenBank format can also hide the multiple-sequence nature of

some DNA sequences. For example, three genomic exons of a particular gene are

sequenced, and partial flanking, noncoding regions around the exons may also be

available, but the full-length sequences of these intronic sequences may not yet be

available. Because the exons are not in their complete genomic context, there would

be three GenBank flatfiles in this case, one for each exon. There is no explicit

representation of the complete set of sequences over that genomic region; these three

exons come in genomic order and are separated by a certain length of unsequenced

DNA. In GenBank format there would be a Segment line of the form SEGMENT 1

of 3 in the first record, SEGMENT 2 of 3 in the second, and SEGMENT 3 of 3 in

the third, but this only tells the user that the lines are part of some undefined, ordered

series (Fig. 2.1A). Out of the whole GenBank release, one locates the correct Segment

records to place together by an algorithm involving the LOCUS name. All segments

that go together use the same first combination of letters, ending with the numbers

appropriate to the segment, e.g., HSDDT1, HSDDT2, and HSDDT3. Obviously, this

complicated arrangement can result in problems when LOCUS names include numbers

that inadvertently interfere with such series. In addition, there is no one sequence

record that describes the whole assembled series, and there is no way to describe

the distance between the individual pieces. There is no segmenting convention in

the EMBL sequence database at all, so records derived from that source or distributed

in that format lack even this imperfect information.
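
The text does not spell out the LOCUS-name algorithm, so the following is only a guess at its general shape: group names by the letters before a trailing number and order them by that number. The function name is hypothetical, and HSDDT4 is an invented, unrelated record added to show how a numbered name can be pulled into the wrong series.

```python
import re
from collections import defaultdict


def group_segments(locus_names):
    """Group LOCUS names by the prefix before their trailing digits."""
    groups = defaultdict(list)
    for name in locus_names:
        match = re.fullmatch(r"(.*\D)(\d+)", name)
        if match:
            prefix, number = match.group(1), int(match.group(2))
            groups[prefix].append((number, name))
    # Order each group by its numeric suffix.
    return {prefix: [name for _, name in sorted(members)]
            for prefix, members in groups.items()}


# HSDDT1 to HSDDT3 are the segments from the text; HSDDT4 is a
# hypothetical unrelated record whose name happens to end in a number.
names = ["HSDDT2", "HSDDT1", "HSDDT4", "HSDDT3"]
series = group_segments(names)
```

The heuristic has no way to tell that the fourth record does not belong to the three-segment series, which is the kind of interference the paragraph warns about.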

The NCBI data model defines a sequence type that directly represents such a

segmented series, called a “segmented sequence.” Rather than containing the letters

A, G, C, and T, the segmented sequence contains instructions on how it can be built

from other sequences. Considering again the example above, the segmented sequence

would contain the instructions “take all of HSDDT1, then a gap of unknown length,

then all of HSDDT2, then a gap of unknown length, then all of HSDDT3.” The

segmented sequence itself can have a name (e.g., HSDDT), an accession number,

features, citations, and comments, like any other GenBank record. Data of this type

are commonly stored in a so-called “Seg-set” containing the sequences HSDDT,

HSDDT1, HSDDT2, HSDDT3 and all of their connections and features. When the

GenBank release is made, as in the case of Nuc-prot sets, the Seg-sets are broken

up into multiple records, and the segmented sequence itself is not visible. However,

GenBank, EMBL, and DDBJ have recently agreed on a way to represent these

constructed assemblies, and they will be placed in a new CON division, with CON

standing for “contig” (Fig. 2.1B). In the Entrez graphical view of segmented sequences,

the segmented sequence is shown as a line connecting all of its component

sequences (Fig. 2.1C).
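
A segmented sequence of this kind can be sketched as a list of build instructions rather than a string of bases. The classes below are a hypothetical illustration in Python, not the NCBI data model's own definitions; only the HSDDT names come from the text.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class SeqRef:
    """Instruction: take all of another sequence record."""
    seq_id: str


@dataclass
class Gap:
    """Instruction: leave a gap; length None means the length is unknown."""
    length: Optional[int] = None


Part = Union[SeqRef, Gap]


@dataclass
class SegmentedSequence:
    name: str
    parts: List[Part]

    def instructions(self) -> str:
        """Spell out how the sequence is built from its parts."""
        phrases = []
        for part in self.parts:
            if isinstance(part, SeqRef):
                phrases.append(f"all of {part.seq_id}")
            elif part.length is None:
                phrases.append("a gap of unknown length")
            else:
                phrases.append(f"a gap of {part.length} bases")
        return "take " + ", then ".join(phrases)

    def component_ids(self) -> List[str]:
        """List the records that a Seg-set would need to hold."""
        return [p.seq_id for p in self.parts if isinstance(p, SeqRef)]


hsddt = SegmentedSequence(
    "HSDDT",
    [SeqRef("HSDDT1"), Gap(), SeqRef("HSDDT2"), Gap(), SeqRef("HSDDT3")],
)
```

Calling `hsddt.instructions()` reproduces the sentence quoted in the text, and `hsddt.component_ids()` names the three records that, together with HSDDT itself, would make up the Seg-set.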

An NCBI segmented sequence does not require that there be gaps between the

individual pieces. In fact the pieces can overlap, unlike the case of a segmented

series in GenBank format. This makes the segmented sequence ideal for representing

large sequences such as bacterial genomes, which may be many megabases in length.

This is what currently is done within the Entrez Genomes division for bacterial

genomes, as well as other complete chromosomes such as yeast. The NCBI Software

Toolkit (Ostell, 1996) contains functions that can gather the data that a segmented

sequence refers to “on the fly,” including constituent sequence and features, and this

information can automatically be remapped from the coordinates of a small, individual

record to that of a complete chromosome. This makes it possible to provide

graphical views, GenBank flatfile views, or FASTA views or to perform analyses on

whole chromosomes quite easily, even though data exist only in small, individual

pieces. This ability to readily assemble a set of related sequences on demand for any

region of a very large chromosome has already proven to be valuable for bacterial

genomes. Assembly on demand will become more and more important as larger and

larger regions are sequenced, perhaps by many different groups, and the notion that

an investigator will be working on one huge sequence record becomes completely

impractical.

Figure 2.1. (A) Selected parts of GenBank-formatted records in a segmented sequence.

GenBank format historically indicates merely that records are part of some ordered series;

it offers no information on what the other components are or how they are connected.

To see the complete view of these records, see http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?uid=6849043&form=6&db=n&Dopt=g.

(B) Representation of segmented sequences

in the new CON (contig) division. A new extension of GenBank format allows the

details of the construction of segmented records to be presented. The CONTIG line can

include individual accessions, gaps of known length, and gaps of unknown length. The

individual components can still be displayed in the traditional form, although no features

or sequences are present in this format. (C) Graphical representation of a segmented sequence.

This view displays features mapped to the coordinates of the segmented sequence.

The segments include all exonic and untranslated regions plus 20 base pairs of sequence

at the ends of each intron. The segment gaps cover the remaining intronic sequence.
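
The remapping of features from the coordinates of a small record to those of a complete chromosome can be sketched as follows. This is not the NCBI Software Toolkit's API; the function names, record identifiers, and lengths are hypothetical, and the sketch assumes the pieces are laid end to end with no gaps or overlaps.

```python
def build_offsets(pieces):
    """Map each record id to its start position on the assembled sequence.

    pieces: list of (record_id, length) in assembly order, assumed to be
    contiguous (no gaps, no overlaps).
    """
    offsets, position = {}, 0
    for record_id, length in pieces:
        offsets[record_id] = position
        position += length
    return offsets


def remap_feature(feature, offsets):
    """Convert (record_id, start, end, label) to assembled coordinates."""
    record_id, start, end, label = feature
    shift = offsets[record_id]
    return start + shift, end + shift, label


# Hypothetical chromosome stored as three individual records.
pieces = [("rec1", 1000), ("rec2", 500), ("rec3", 2000)]
offsets = build_offsets(pieces)

# A feature annotated on the third record, in that record's own coordinates.
gene_on_chromosome = remap_feature(("rec3", 10, 40, "geneX"), offsets)
```

Only the small offset table has to be kept for the whole chromosome; sequence and features are fetched from the individual records when a region is requested.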

What Does ASN.1 Have to Do With It?

The NCBI data model is often referred to as, and confused with, the “NCBI ASN.1”

or “ASN.1 Data Model.” Abstract Syntax Notation One (ASN.1) is an International

Organization for Standardization (ISO) standard for describing structured data. It

encodes data in a way that permits computers and software systems of all types to

reliably exchange both the structure and the content of the entries. Saying that a data

model is written in ASN.1 is like saying a computer program is written in C or

FORTRAN. The statement identifies the language; it does not say what the program

does. The familiar GenBank flatfile was really designed for humans to read, from a

DNA-centered viewpoint. ASN.1 is designed for a computer to read and is amenable

to describing complicated data relationships in a very specific way. NCBI describes

and processes data using the ASN.1 format. Based on that single, common format,

a number of human-readable formats and tools are produced, such as Entrez,

GenBank, and the BLAST databases. Without the existence of a common format

such as this, the neighboring and hard-link relationships that Entrez depends on

would not be possible. This chapter deals with the structure and content of the NCBI

data model and its implications for biomedical databases and tools. Detailed discussions

about the choice of ASN.1 for this task and its overall form can be found

elsewhere (Ostell, 1995).

What to Define?

We have alluded to how the NCBI data model defines sequences in a way that

supports a richer and more explicit description of the experimental data than can be

obtained with the GenBank format. The details of the model are important, and will

be expanded on in the ensuing discussion. At this point, we need to pause and briefly

describe the reasoning and general principles behind the model as a whole.

There are two main reasons for putting data on a computer: retrieval and discovery.

Retrieval is basically being able to get back out what was put in. Amassing

sequence information without providing a way to retrieve it makes the sequence

information, in essence, useless. Although this is important, it is even more valuable

to be able to get back from the system more knowledge than was put in to begin

with—that is, to be able to use the information to make biological discoveries.

Scientists can make these kinds of discoveries by discerning connections between

two pieces of information that were not known when the pieces were entered separately

into the database or by performing computations on the data that offer new

insight into the records. In the NCBI data model, the emphasis is on facilitating

discovery; that means the data must be defined in a way that is amenable to both

linkage and computation.

A second, general consideration for the model is stability. NCBI is a US Government

agency, not a group supported year-to-year by competitive grants. Thus, the

NCBI staff takes a very long-term view of its role in supporting bioinformatics

efforts. NCBI provides large-scale information systems that will support scientific

inquiry well into the future. As anyone who is involved in biomedical research

knows, many major conceptual and technical revolutions can happen when dealing

with such a long time span. Somehow, NCBI must address these changing views

and needs with software and data that may have been created years (or decades)

earlier. For that reason, basic observations have been chosen as the central data

elements, with interpretations and nomenclature (elements more subject to change)

being placed outside the basic, core representation of the data.

Taking all factors into account, NCBI uses four core data elements: bibliographic

citations, DNA sequences, protein sequences, and three-dimensional structures. In

addition, two projects (taxonomy and genome maps) are more interpretive but nonetheless

are so important as organizing and linking resources that NCBI has built a

considerable base in these areas as well.

PUBs: PUBLICATIONS OR PERISH

Publication is at the core of every scientific endeavor. It is the common process

whereby scientific information is reviewed, evaluated, distributed, and entered into

the permanent record of scientific progress. Publications serve as vital links between

factual databases of different structures or content domains (e.g., a record in a sequence

database and a record in a genetic database may cite the same article). They

serve as valuable entry points into factual databases (“I have read an article about

this, now I want to see the primary data”).

Publications also act as essential annotation of function and context to records

in factual databases. One reason for this is that factual databases have a structure

that is essential for efficient use of the database but may not have the representational

capacity to set forward the full biological, experimental, or historical context of a

particular record. In contrast, the published paper is limited only by language and

contains much fuller and more detailed explanatory information than will ever be in

a record in a factual database. Perhaps more importantly, authors are evaluated by

their scientific peers based on the content of their published papers, not by the content

of the associated database records. Despite the best of intentions, scientists move on

and database records become static, even though the knowledge about them has

expanded, and there is very little incentive for busy scientists to learn a database

system and keep records based on their own laboratory studies up to date.

Generally, the form and content of citations have not been thought about carefully

by those designing factual databases, and the quality, form, and content of

citations can vary widely from one database to the next. Awareness of the importance

of having a link to the published literature and the realization that bibliographic

citations are much less volatile than scientific knowledge led to a decision that a

careful and complete job of defining citations was a worthwhile endeavor. Some

components of the publication specification described below may be of particular

interest to scientists or users of the NCBI databases, but a full discussion of all the

issues leading to the decisions governing the specifications themselves would require

another chapter in itself.

Authors

Author names are represented in many formats by various databases: last name only,

last name and initials, last name-comma-initials, last name and first name, all authors

with initials and the last with a full first name, with or without honorifics (Ph.D.)

or suffixes (Jr., III), to name only a few. Some bibliographic databases (such as

MEDLINE) might represent only a fixed number of authors. Although this inconsistency

is merely ugly to a human reader, it poses severe problems for database systems

incorporating names from many sources and providing functions as simple as looking

up citations by author last name, such as Entrez does. For this reason, the specification

provides two alternative forms of author name representation: one a simple

string and the other a structured form with fields for last name, first name, and so

on. When data are submitted directly to NCBI or in cases when there is a consistent

format of author names from a particular source (such as MEDLINE), the structured

form is used. When the form cannot be deciphered, the author name remains as a

string. This limits its use for retrieval but at least allows data to be viewed when the

record is retrieved by other means.
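
The two alternative forms of author name can be pictured as a small union type with a parser that falls back to the raw string. This is a hypothetical Python sketch, not the NCBI specification: the class names and the single "Last, F.M." pattern are assumptions, and a real converter would need one rule per source format.

```python
import re
from dataclasses import dataclass
from typing import List, Union


@dataclass
class StructuredName:
    last: str
    initials: str


@dataclass
class RawName:
    text: str  # kept as given because it could not be deciphered


AuthorName = Union[StructuredName, RawName]

# Assumed source convention: last name, comma, dotted initials.
_LAST_COMMA_INITIALS = re.compile(r"^([A-Z][A-Za-z'\-]+),\s*((?:[A-Z]\.\s*)+)$")


def parse_author(text: str) -> AuthorName:
    """Use the structured form when the pattern fits, else keep the string."""
    match = _LAST_COMMA_INITIALS.match(text.strip())
    if match:
        return StructuredName(match.group(1), match.group(2).replace(" ", ""))
    return RawName(text)


def find_by_last_name(authors: List[AuthorName], last: str) -> List[StructuredName]:
    """Retrieval by last name only works for the structured form."""
    return [a for a in authors
            if isinstance(a, StructuredName) and a.last.lower() == last.lower()]


authors = [parse_author("Ostell, J.M."), parse_author("the Genome Consortium")]
hits = find_by_last_name(authors, "ostell")
```

The unparsed entry is still stored and can be displayed with its record, but it never appears in `hits`, which mirrors the trade-off described above.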

Even the structured form of author names must support diversity, since some

sources give only initials whereas others provide a first and middle name. This is

mentioned to specifically emphasize two points. First, the NCBI data model is designed

both to direct our view of the data into a more useful form and to accommodate

the available existing data. (This pair of functions can be confusing to people

reading the specification and seeing alternative forms of the same data defined.)

Second, software developers must be aware of this range of representations and

accommodate whatever form had to be used when a particular source was being

converted. In general, NCBI tries to get as much of the data into a uniform, structured

form as possible but carries the rest in a less optimal way rather than losing it

altogether.

Author affiliations (i.e., authors’ institutional addresses) are even more complicated.

As with author names, there is the problem of supporting both structured forms

and unparsed strings. However, even sources with reasonably consistent author name

conventions often produce affiliation information that cannot be parsed from text into

a structured format. In addition, there may be an affiliation associated with the whole

author list, or there may be different affiliations associated with each author. The

NCBI data model allows for both scenarios. At the time of this writing, only the first

form is supported in either MEDLINE or GenBank, although both types may appear in published

articles.

Articles

The most commonly cited bibliographic entity in biological science is an article in

a journal; therefore, the citation formats of most biological databases are defined

with that type in mind. However, “articles” can also appear in books, manuscripts,

theses, and now in electronic journals as well. The data model defines the fields

necessary to cite a book, a journal, or a manuscript. An article citation occupies one

field; other fields display additional information necessary to uniquely identify the

article in the book, journal, or manuscript—the author(s) of the article (as opposed

to the author or editor of the book), the title of the article, page numbers, and so on.

There is an important distinction between the fields necessary to uniquely identify

a published article from a citation and those necessary to describe the same

article meaningfully to a database user. The NCBI Citation Matching Service takes

fields from a citation and attempts to locate the article to which they refer. In this

process, a successful match would involve only correctly matching the journal title,

the year, the first page of the article, and the last name of an author of the article.

Other information (e.g., article title, volume, issue, full pages, author list) is useful

to look at but very often is either not available or outright incorrect. Once again, the

data model must allow the minimum information set to come in as a citation, be

matched against MEDLINE, and then be replaced by a citation having the full set

of desired fields obtained from MEDLINE to produce accurate, useful data for consumption

by the scientific public.
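
The minimum match set described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Citation Matching Service itself: the `Citation` class, the `match_key` and `find_match` functions, and the sample data are all invented for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Citation:
    # Hypothetical record; field names are illustrative, not NCBI's schema.
    journal: str
    year: int
    first_page: str
    author_last_names: Tuple[str, ...]
    title: Optional[str] = None  # nice to display, not needed for matching


def match_key(c: Citation) -> Tuple[str, int, str]:
    """Journal title, year, and first page, normalized for comparison."""
    return (c.journal.strip().lower(), c.year, c.first_page.strip())


def find_match(query: Citation, database: List[Citation]) -> Optional[Citation]:
    """Return the first entry sharing the key and at least one author last name."""
    wanted = {name.lower() for name in query.author_last_names}
    for candidate in database:
        same_place = match_key(candidate) == match_key(query)
        shared_author = wanted & {n.lower() for n in candidate.author_last_names}
        if same_place and shared_author:
            return candidate
    return None


# Invented data: a full database entry and a sparse incoming citation.
full = Citation("Example J", 1999, "101", ("Smith", "Jones"), title="An invented title")
sparse = Citation("example j ", 1999, "101", ("Jones",))
matched = find_match(sparse, [full])  # the sparse citation is replaced by `full`
```

Only the four matching fields are compared; the title and the rest of the author list come back from the matched entry rather than from the incoming citation.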

Patents

With the advent of patented sequences it became necessary to cite a patent as a

bibliographic entity instead of an article. The data model supports a very complete

patent citation, a format developed in cooperation with the US Patent Office. In

practice, however, patented sequences tend to have limited value to the scientific

public. Because a patent is a legal document, not a scientific one, its purpose is to

present and support the claims of the patent, not to fully describe the biology of the

sequence itself. It is often prepared in a lawyer’s office, not by the scientist who did

the research. The sequences presented in the patent may function only to illustrate

some discrete aspect of the patent, rather than being the focus of the document.

Organism information, location of biological features, and so on may not appear at

all if they are not germane to the patent. Thus far, the vast majority of sequences

appearing in patents also appear in a more useful form (to scientists) in the public

databases.

In NCBI’s view, the main purpose of listing patented sequences in GenBank is

to be able to retrieve sequences by similarity searches that may serve to locate patents

related to a given sequence. To make a legal determination in the case, however, one

would still have to examine the full text of the patent. To evaluate the biology of

the sequence, one generally must locate information other than that contained in the

patent. Thus, the critical linkage is between the sequence and its patent number.

Additional fields in the patent citation itself may be of some interest, such as the

title of the patent and the names of the inventors.

Citing Electronic Data Submission

A relatively new class of citations comprises the act of data submission to a database,

such as GenBank. This is an act of publication, similar but not identical to the

publication of an article in a journal. In some cases, data submission precedes article

publication by a considerable period of time, or a publication regarding a particular

sequence may never appear in press. Because of this, there is a separate citation

designed for deposited sequence data. The submission citation, because it is indeed

an act of publication, may have an author list, showing the names of scientists who

worked on the record. This may or may not be the same as the author list on a

subsequently published paper also cited in the same record. In most cases, the scientist

who submitted the data to the database is also an author on the submission

citation. (In the case of large sequencing centers, this may not always be the case.)

Finally, NCBI has begun the practice of citing the update of a record with a submission

citation as well. A comment can be included with the update, briefly describing

the changes made in the record. All the submission citations can be retained

in the record, providing a history of the record over time.

MEDLINE and PubMed Identifiers

Once an article citation has been matched to MEDLINE, the simplest and most

reliable key to point to the article is the MEDLINE unique identifier (MUID). This

is simply an integer number. NCBI provides many services that use MUID to retrieve

the citation and abstract from MEDLINE, to link together data citing the same article,

or to provide Web hyperlinks.

Recently, in concert with MEDLINE and a large number of publishers, NCBI

has introduced PubMed. PubMed contains all of MEDLINE, as well as citations

provided directly by the publishers. As such, PubMed contains more recent articles

than MEDLINE, as well as articles that may never appear in MEDLINE because of

their subject matter. This development led NCBI to introduce a new article identifier,

called a PubMed identifier (PMID). Articles appearing in MEDLINE will have both

a PMID and an MUID. Articles appearing only in PubMed will have only a PMID.

PMID serves the same purpose as MUID in providing a simple, reliable link to the

citation, a means of linking records together, and a means of setting up hyperlinks.

Publishers have also started to send information on ahead-of-print articles to

PubMed, so this information may now appear before the printed journal. A new

project, PubMed Central, is meant to allow electronic publication to occur in lieu

of or ahead of publication in a traditional, printed journal. PubMed Central records

contain the full text of the article, not just the abstract, and include all figures and

references.

The NCBI data model stores most citations as a collection called a Pub-equiv,

a set of equivalent citations that includes a reliable identifier (PMID or MUID) and

the citation itself. The presence of the citation form allows a useful display without

an extra retrieval from the database, whereas the identifier provides a reliable key

for linking or indexing the same citation in the record.

SEQ-IDs: WHAT’S IN A NAME?

The NCBI data model defines a whole class of objects called Sequence Identifiers

(Seq-id). There has to be a whole class of such objects because NCBI integrates

sequence data from many sources that name sequence records in different ways and

where, of course, the individual names have different meanings. In one simple case,

PIR, SWISS-PROT, and the nucleotide sequence databases all use a string called an

‘‘accession number,’’ all having a similar format. Just saying ‘‘A10234’’ is not

enough to uniquely identify a sequence record from the collection of all these databases.

One must distinguish ‘‘A10234’’ in SWISS-PROT from ‘‘A10234’’ in PIR.

(The DDBJ/EMBL/GenBank nucleotide databases share a common set of accession

numbers; therefore, ‘‘A12345’’ in EMBL is the same as ‘‘A12345’’ in GenBank or

DDBJ.) To further complicate matters, although the sequence databases define their

records as containing a single sequence, PDB records contain a single structure,

which may contain more than one sequence. Because of this, a PDB Seq-id contains

a molecule name and a chain ID to identify a single unique sequence. The subsections

that follow describe the form and use of a few commonly used types of Seq-ids.
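
One way to picture why a whole class of identifiers is needed is to treat the source database as part of the identifier. The sketch below is an assumption-laden illustration in Python, not the actual Seq-id definition: the `SeqId` class, its field names, the source tags, and the PDB name `1XYZ` are all made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeqId:
    # Hypothetical tagged identifier: the source is part of the identity.
    source: str      # e.g. "swissprot", "pir", "insdc", "pdb"
    name: str        # accession, or molecule name for PDB
    chain: str = ""  # only used for PDB, where one record may hold several chains


def insdc(accession: str) -> SeqId:
    """DDBJ, EMBL, and GenBank share accessions, so all three map to one tag."""
    return SeqId("insdc", accession)


sp = SeqId("swissprot", "A10234")
pir = SeqId("pir", "A10234")
chain_a = SeqId("pdb", "1XYZ", chain="A")  # placeholder structure name
chain_b = SeqId("pdb", "1XYZ", chain="B")
```

With this shape, the same string from two databases compares as two different identifiers, while the shared nucleotide accession compares equal no matter which collaborating database supplied it.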

Locus Name

The locus appears on the LOCUS line in GenBank and DDBJ records and in the ID

line in EMBL records. These originally were the only identifier of a discrete

GenBank record. Like a genetic locus name, it was intended to act both as a unique

identifier for the record and as a mnemonic for the function and source organism of

the sequence. Because the LOCUS line is in a fixed format, the locus name is restricted

to ten or fewer numbers and uppercase letters. For many years in GenBank,

the first three letters of the name were an organism code and the remaining letters

a code for the gene (e.g., HUMHBB was used for ‘‘human β-globin region’’). However,

as with genetic locus names, locus names were changed when the function of

a region was discovered to be different from what was originally thought. This

instability in locus names is obviously a problem for an identifier for retrieval. In

addition, as the number of sequences and organisms represented in GenBank increased

geometrically over the years, it became impossible to invent and update such

mnemonic names in an efficient and timely manner. At this point, the locus name is

dying out as a useful name in GenBank, although it continues to appear prominently

on the first line of the flatfile to avoid breaking the established format.

Accession Number

Because of the difficulties in using the locus/ID name as the unique identifier for a

nucleotide sequence record, the International Nucleotide Sequence Database Collaborators

(DDBJ/EMBL/GenBank) introduced the accession number. It intentionally

carries no biological meaning, to ensure that it will remain (relatively) stable. It

originally consisted of one uppercase letter followed by five digits. New accessions

consist of two uppercase letters followed by six digits. The first letters were allocated

to the individual collaborating databases so that accession numbers would be unique

across the Collaboration (e.g., an entry beginning with a ‘‘U’’ was from GenBank).
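
The two accession formats just described are simple enough to check with a regular expression. The following Python sketch covers only the two forms named in the text (formats introduced later are out of scope), and the function name `is_nucleotide_accession` is invented for illustration.

```python
import re

# One uppercase letter + five digits (original form), or
# two uppercase letters + six digits (newer form).
NUCLEOTIDE_ACCESSION = re.compile(r"[A-Z][0-9]{5}|[A-Z]{2}[0-9]{6}")


def is_nucleotide_accession(text: str) -> bool:
    """True if the whole string has one of the two formats described above."""
    return NUCLEOTIDE_ACCESSION.fullmatch(text) is not None
```

A locus name such as HUMHBB fails this check, which reflects the design choice in the text: the accession deliberately carries no biological meaning.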

The accession number was an improvement over the locus/ID name, but, with

use, problems and deficiencies became apparent. For example, although the accession

is stable over time, many users noticed that the sequence retrieved by a particular

accession was not always the same. This is because the accession identifies the whole

database record. If the sequence in a record was updated (say by the insertion of

1000 bp at the beginning), the accession number did not change, as it was an updated

version of the same record. If one had analyzed the original sequence and recorded

that at position 100 of accession U00001 there was a putative protein-binding site,

after the update a completely different sequence would be found at position 100!

The accession number appears on the ACCESSION line of the GenBank record.

The first accession on the line, called the ‘‘primary’’ accession, is the key for retrieving

this record. Most records have only this type of accession number. However,

other accessions may follow the primary accession on the ACCESSION line. These

‘‘secondary’’ accessions are intended to give some notion of the history of the record.

For example, if U00001 and U00002 were merged into a single updated record, then

U00001 would be the primary accession on the new record and U00002 would

appear as a secondary accession. In standard practice, the U00002 record would be

removed from GenBank, since the older record had become obsolete, and the secondary

accessions would allow users to retrieve whatever records superseded the old

one. It should also be noted that, historically, secondary accession numbers do not

always mean the same thing; therefore, users should exercise care in their interpretations.

(Policies at individual databases differed, and even shifted over time in a

given database.) The use of secondary accession numbers also caused problems in

that there was still not enough information to determine exactly what happened and

why. Nonetheless, the accession number remains the most controlled and reliable

way to point to a record in DDBJ/EMBL/GenBank.

gi Number

In 1992, NCBI began assigning GenInfo Identifiers (gi) to all sequences processed

into Entrez, including nucleotide sequences from DDBJ/EMBL/GenBank, the protein

sequences from the translated CDS features, protein sequences from SWISS-PROT,

PIR, PRF, PDB, patents, and others. The gi is assigned in addition to the accession

number provided by the source database. Although the form and meaning of the

accession Seq-id varied depending on the source, the meaning and form of the gi are

the same for all sequences regardless of the source.

The gi is simply an integer number, sometimes referred to as a GI number. It is

an identifier for a particular sequence only. Suppose a sequence enters GenBank

and is given an accession number U00001. When the sequence is processed internally

at NCBI, it enters a database called ID. ID determines that it has not seen U00001

before and assigns it a gi number—for example, 54. Later, the submitter might

update the record by changing the citation, so U00001 enters ID again. ID, recognizing

the record, retrieves the first U00001 and compares its sequence with the new

one. If the two are completely identical, ID reassigns gi 54 to the record. If the

sequence differs in any way, even by a single base pair, it is given a new gi number,

say 88. However, the new sequence retains accession number U00001 because of

the semantics of the source database. At this time, ID marks the old record (gi 54)

with the date it was replaced and adds a ‘‘history’’ indicating that it was replaced

by gi 88. ID also adds a history to gi 88 indicating that it replaced gi 54.
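
The bookkeeping in this example can be mimicked with a toy model. The Python class below is a sketch of the behaviour described in the text, not NCBI's ID database: the class name, its attributes, and the sequential gi counter are assumptions, and the replacement date is omitted for brevity.

```python
class IdDatabase:
    """Toy model: a gi follows the exact sequence, the accession follows the record."""

    def __init__(self):
        self._next_gi = 1
        self.current = {}    # accession -> gi of the live sequence
        self.sequences = {}  # gi -> sequence string (old gis are kept)
        self.history = {}    # gi -> {"replaced_by": gi} and/or {"replaces": gi}

    def submit(self, accession, sequence):
        old_gi = self.current.get(accession)
        if old_gi is not None and self.sequences[old_gi] == sequence:
            return old_gi  # identical sequence: the same gi is reassigned
        new_gi = self._next_gi
        self._next_gi += 1
        self.sequences[new_gi] = sequence
        self.current[accession] = new_gi
        if old_gi is not None:
            # Any difference, even one base, yields a new gi plus history links.
            self.history.setdefault(old_gi, {})["replaced_by"] = new_gi
            self.history.setdefault(new_gi, {})["replaces"] = old_gi
        return new_gi


db = IdDatabase()
first = db.submit("U00001", "ACGT")     # new record, new gi
same = db.submit("U00001", "ACGT")      # e.g. only the citation changed
changed = db.submit("U00001", "ACGTA")  # sequence changed by one base
```

The old sequence stays retrievable under its original gi, and the history entries link the two versions in both directions.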

The gi number serves three major purposes:

• It provides a single identifier across sequences from many sources.

• It provides an identifier that specifies an exact sequence. Anyone who analyzes

gi 54 and stores the analysis can be sure that it will be valid as long as U00001

has gi 54 attached to it.

• It is stable and retrievable. NCBI keeps the last version of every gi number.

Because the history is included in the record, anyone who discovers that gi

54 is no longer part of the GenBank release can still retrieve it from ID through

NCBI and examine the history to see that it was replaced by gi 88. Upon

aligning gi 54 to gi 88 to determine their relationship, a researcher may decide

to remap the former analysis to gi 88 or perhaps to reanalyze the data. This

can be done at any time, not just at GenBank release time, because gi 54 will

always be available from ID.

For these reasons, all internal processing of sequences at NCBI, from computing

Entrez sequence neighbors to determining when new sequence should be processed

or producing the BLAST databases, is based on gi numbers.

Accession.Version Combined Identifier

Recently, the members of the International Nucleotide Sequence Database Collaboration

(GenBank, EMBL, and DDBJ) introduced a ‘‘better’’ sequence identifier, one

that combines an accession (which identifies a particular sequence record) with a

version number (which tracks changes to the sequence itself). It is expected that this

kind of Seq-id will become the preferred method of citing sequences.

Users will still be able to retrieve a record based on the accession number alone,

without having to specify a particular version. In that case, the latest version of the

record will be obtained by default, which is the current behavior for queries using

Entrez and other retrieval programs.

Scientists who are analyzing sequences in the database (e.g., aligning all alcohol

dehydrogenase sequences from a particular taxonomic group) and wish to have their

conclusions remain valid over time will want to reference sequences by accession

and the given version number. Subsequent modification of one of the sequences by

its owner (e.g., 5′ extension during a study of the gene’s regulation) will result in

the version number being incremented appropriately. The analysis that cited accession

and version remains valid because a query using both the accession and version

will return the desired record.

Combining accession and version makes it clear to the casual user that a sequence

has changed since an analysis was done. Also, determining how many times

a sequence has changed becomes trivial with a version number. The accession.version

number appears on the VERSION line of the GenBank flatfile. For sequence retrieval,

the accession.version is simply mapped to the appropriate gi number, which remains

the underlying tracking identifier at NCBI.
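
Splitting a combined identifier into its two parts is straightforward. The Python sketch below assumes the `accession.version` text form with a single dot; the names `AccessionVersion` and `parse_accession_version` are invented, and a missing version is taken to mean "latest", as described above.

```python
from typing import NamedTuple, Optional


class AccessionVersion(NamedTuple):
    accession: str
    version: Optional[int]  # None means "whatever the latest version is"


def parse_accession_version(text: str) -> AccessionVersion:
    """Split 'U00001.2' into ('U00001', 2); a bare accession has version None."""
    accession, dot, version = text.partition(".")
    if not accession:
        raise ValueError(f"missing accession in {text!r}")
    if not dot:
        return AccessionVersion(accession, None)
    if not (version.isascii() and version.isdigit()):
        raise ValueError(f"version must be an integer in {text!r}")
    return AccessionVersion(accession, int(version))
```

An analysis that stores both fields can later compare its stored version with the current one to see at a glance whether, and how many times, the sequence has changed.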

Accession Numbers on Protein Sequences

The International Sequence Database Collaborators also started assigning

accession.version numbers to protein sequences within the records. Previously, it was

difficult to reliably cite the translated product of a given coding region feature, except

by its gi number. This limited the usefulness of translated products found in BLAST

results, for example. These sequences will now have the same status as protein

sequences submitted directly to the protein databases, and they have the benefit of

direct linkage to the nucleotide sequence in which they are encoded, showing up as

a CDS feature’s /protein_id qualifier in the flatfile view. Protein accessions in

these records consist of three uppercase letters followed by five digits and an integer

indicating the version.

Reference Seq-id

The NCBI RefSeq project provides a curated, nonredundant set of reference sequence

standards for naturally occurring biological molecules, ranging from chromosomes

to transcripts to proteins. RefSeq identifiers are in accession.version form but are

prefixed with NC_ (chromosomes), NM_ (mRNAs), NP_ (proteins), or NT_ (constructed

genomic contigs). The NG_ prefix will be used for genomic regions or gene

clusters (e.g., immunoglobulin region) in the future. RefSeq records are a stable

reference point for functional annotation, point mutation analysis, gene expression

studies, and polymorphism discovery.
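
Because the prefix encodes the kind of molecule, a lookup table is enough to classify a RefSeq identifier. The Python sketch below lists only the prefixes named in the text, written as they appear in records (two letters followed by an underscore); the function name and the sample accession `NM_000000.1` are placeholders for illustration.

```python
# Prefixes named in the text; real RefSeq identifiers join prefix and digits
# with an underscore, e.g. "NM_" followed by the number and ".version".
REFSEQ_PREFIXES = {
    "NC": "chromosome",
    "NM": "mRNA",
    "NP": "protein",
    "NT": "constructed genomic contig",
    "NG": "genomic region or gene cluster",
}


def refseq_molecule_type(identifier: str):
    """Return the molecule type for a RefSeq-style identifier, else None."""
    prefix, underscore, _rest = identifier.partition("_")
    if not underscore:
        return None  # no underscore: not in RefSeq form
    return REFSEQ_PREFIXES.get(prefix)


kind = refseq_molecule_type("NM_000000.1")  # placeholder accession
```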

General Seq-id

The General Seq-id is meant to be used by genome centers and other groups as a

way of identifying their sequences. Some of these sequences may never appear in

public databases, and others may be preliminary data that eventually will be submitted.

For example, records of human chromosomes in the Entrez Genomes division

contain multiple physical and genetic maps, in addition to sequence components.

The physical maps are generated by various groups, and they use General Seq-ids

to identify the proper group.

Local Seq-id

The Local sequence identifier is most prominently used in the data submission tool

Sequin (see Chapter 4). Each sequence will eventually get an

accession.version identifier and a gi number, but only when the completed submission has

been processed by one of the public databases. During the submission process, Sequin

assigns a local identifier to each sequence. Because many of the software tools

made by NCBI require a sequence identifier, having a local Seq-id allows the use

of these tools without having to first submit data to a public database.

BIOSEQs: SEQUENCES

The Bioseq, or biological sequence, is a central element in the NCBI data model. It

comprises a single, continuous molecule of either nucleic acid or protein, thereby

defining a linear, integer coordinate system for the sequence. A Bioseq must have at

least one sequence identifier (Seq-id). It has information on the physical type of

molecule (DNA, RNA, or protein). It may also have annotations, such as biological

features referring to specific locations on specific Bioseqs, as well as descriptors.

Figure 2.2. Classes of Bioseqs. All Bioseqs represent a single, continuous molecule of nucleic

acid or protein, although the complete sequence may not be known. In a virtual

Bioseq, the type of molecule is known, but the sequence is not known, and the precise

length may not be known (e.g., from the size of a band on an electrophoresis gel). A raw

Bioseq contains a single contiguous string of bases or residues. A segmented Bioseq points

to its components, which are other raw or virtual Bioseqs (e.g., sequenced exons and undetermined

introns). A constructed sequence takes its original components and subsumes

them, resulting in a Bioseq that contains the string of bases or residues and a ‘‘history’’ of

how it was built. A map Bioseq places genes or physical markers, rather than sequence, on

its coordinates. A delta Bioseq can represent a segmented sequence but without the requirement

of assigning identifiers to each component (including gaps of known length),

although separate raw sequences can still be referenced as components. The delta sequence

is used for unfinished high-throughput genome sequences (HTGS) from genome centers

and for genomic contigs.

Descriptors provide additional information, such as the organism from which the

molecule was obtained. Information in the descriptors describes the entire Bioseq.

However, the Bioseq isn’t necessarily a fully sequenced molecule. It may be a

segmented sequence in which, for example, the exons have been sequenced but not

all of the intronic sequences have been determined. It could also be a genetic or

physical map, where only a few landmarks have been positioned.

Sequences are the Same

All Bioseqs have an integer coordinate system, with an integer length value, even if

the actual sequence has not been completely determined. Thus, for physical maps,

or for exons in highly spliced genes, the spacing between markers or exons may be

known only from a band on a gel. Although the coordinates of a fully sequenced

chromosome are known exactly, those in a genetic or physical map are a best guess,

with the possibility of significant error from the ‘‘real’’ coordinates.

Nevertheless, any Bioseq can be annotated with the same kinds of information.

For example, a gene feature can be placed on a region of sequenced DNA or at a

discrete location on a physical map. The map and the sequence can then be aligned

on the basis of their common gene features. This greatly simplifies the task of writing

software that can display these seemingly disparate kinds of data.

Sequences are Different

Despite the benefits derived from having a common coordinate system, the different

Bioseq classes do differ in the way they are represented. The most common classes

(Fig. 2.2) are described briefly below.

Virtual Bioseq. In the virtual Bioseq, the molecule type is known, and its

length and topology (e.g., linear, circular) may also be known, but the actual sequence

is not known. A virtual Bioseq can represent an intron in a genomic molecule

in which only the exon sequences have been determined. The length of the putative

sequence may be known only by the size of a band on an agarose gel.

Raw Bioseq. This is what most people would think of as a sequence, a single

contiguous string of bases or residues, in which the actual sequence is known. The

length is obviously known in this case, matching the number of bases or residues in

the sequence.

Segmented Bioseq. A segmented Bioseq does not contain raw sequences but

instead contains the identifiers of other Bioseqs from which it is made. This type of

Bioseq can be used to represent a genomic sequence in which only the exons are

known. The ‘‘parts’’ in the segmented Bioseq would be the individual, raw Bioseqs

representing the exons and the virtual Bioseqs representing the introns.

Delta Bioseq. Delta Bioseqs are used to represent the unfinished high-throughput

genome sequences (HTGS) derived at the various genome sequencing centers.

Using delta Bioseqs instead of segmented Bioseqs means that only one Seq-id is

needed for the entire sequence, even though subregions of the Bioseq are not known

at the sequence level. Implicitly, then, even at the early stages of their presence in

the databases, delta Bioseqs maintain the same accession number.

Map Bioseq. Used to represent genetic and physical maps, a map Bioseq is

similar to a virtual Bioseq in that it has a molecule type, perhaps a topology, and a

length that may be a very rough estimate of the molecule’s actual length. This information

merely supplies the coordinate system, a property of every Bioseq. Given

this coordinate system for a genetic map, we estimate the positions of genes on it

based on genetic evidence. The table of the resulting gene features is the essential

data of the map Bioseq, just as bases or residues constitute the raw Bioseq’s data.
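
The relationship among the raw, virtual, and segmented classes can be sketched with a few Python dataclasses. This is a simplified illustration, not the NCBI object definitions: the class and field names are invented, delta and map Bioseqs are left out, and an unknown length is represented as `None`.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Bioseq:
    seq_id: str
    mol_type: str                 # "dna", "rna", or "protein"
    length: Optional[int] = None  # integer coordinate system; may be an estimate


@dataclass
class VirtualBioseq(Bioseq):
    """Molecule type known, sequence unknown; length may come from a gel band."""


@dataclass
class RawBioseq(Bioseq):
    residues: str = ""

    def __post_init__(self):
        self.length = len(self.residues)  # length always matches the data


@dataclass
class SegmentedBioseq(Bioseq):
    parts: List[Bioseq] = field(default_factory=list)  # references, not residues

    def __post_init__(self):
        lengths = [p.length for p in self.parts]
        self.length = None if None in lengths else sum(lengths)


exon1 = RawBioseq("exon1", "dna", residues="ATGGCC")
intron = VirtualBioseq("intron1", "dna", length=100)  # size estimated, not sequenced
exon2 = RawBioseq("exon2", "dna", residues="TTTTAA")
gene = SegmentedBioseq("gene", "dna", parts=[exon1, intron, exon2])
```

Every class shares the same `length` field, so code that lays features out on coordinates does not need to know which kind of Bioseq it was given.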

BIOSEQ-SETs: COLLECTIONS OF SEQUENCES

A biological sequence is often most appropriately stored in the context of other,

related sequences. For example, a nucleotide sequence and the sequences of the

protein products it encodes naturally belong in a set. The NCBI data model provides

the Bioseq-set for this purpose.

A Bioseq-set can have a list of descriptors. When packaged on a Bioseq, a

descriptor applies to all of that Bioseq. When packaged on a Bioseq-set, the descriptor

applies to every Bioseq in the set. This arrangement is convenient for attaching

publications and biological source information, which are expected on all sequences

but frequently are identical within sets of sequences. For example, both the DNA

and protein sequences are obviously from the same organism, so this descriptor

information can be applied to the set. The same logic may apply to a publication.

The most common Bioseq-sets are described in the sections that follow.
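
The scoping rule for descriptors can be illustrated with a small Python sketch. The classes below are invented for the example, and one detail is an assumption not stated in the text: when the same descriptor key appears on both the set and a member, the member's value is taken.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Bioseq:
    seq_id: str
    descriptors: Dict[str, str] = field(default_factory=dict)


@dataclass
class BioseqSet:
    members: List[Bioseq]
    descriptors: Dict[str, str] = field(default_factory=dict)

    def descriptors_for(self, seq_id: str) -> Dict[str, str]:
        """Descriptors that apply to one member: set-level plus its own."""
        for member in self.members:
            if member.seq_id == seq_id:
                merged = dict(self.descriptors)    # applies to every member
                merged.update(member.descriptors)  # assumption: member wins on conflict
                return merged
        raise KeyError(seq_id)


nuc = Bioseq("nuc1", {"molecule": "DNA"})
prot = Bioseq("prot1", {"molecule": "protein"})
nuc_prot = BioseqSet([nuc, prot], {"organism": "Example organism"})  # stated once
prot_descr = nuc_prot.descriptors_for("prot1")
```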

Nucleotide/Protein Sets

The Nuc-prot set, containing a nucleotide and one or more protein products, is the

type of set most frequently produced by a Sequin data submission. The component

Bioseqs are connected by coding sequence region (CDS) features that describe how

translation from nucleotide to protein sequence is to proceed. In a traditional nucleotide

or protein sequence database, these records might have cross-references to each

other to indicate this relationship. The Nuc-prot set makes this explicit by packaging

them together. It also allows descriptive information that applies to all sequences

(e.g., the organism or publication citation) to be entered once (see Seq-descr: Describing

the Sequence, below).

Population and Phylogenetic Studies

A major class of sequence submissions represents the results of population or phylogenetic

studies. Such research involves sequencing the same gene from a number

of individuals in the same species (population study) or in different species (phylogenetic

study). An alignment of the individual sequences may also be submitted (see

Seq-align: Alignments, below). If the gene encodes a protein, the components of the

Population or Phylogenetic Bioseq-set may themselves be Nuc-prot sets.

Other Bioseq-sets

A Seg set contains a segmented Bioseq and a Parts Bioseq-set, which in turn contains

the raw Bioseqs that are referenced by the segmented Bioseq. This may constitute

the nucleotide component of a Nuc-prot set.

An Equiv Bioseq-set is used in the Entrez Genomes division to hold multiple

equivalent Bioseqs. For example, human chromosomes have one or more genetic

maps, physical maps derived by different methods, and a segmented Bioseq on which

‘‘islands’’ of sequenced regions are placed. An alignment between the various Bioseqs

is made based on references to any available common markers.

SEQ-ANNOT: ANNOTATING THE SEQUENCE

A Seq-annot is a self-contained package of sequence annotations or information that

refers to specific locations on specific Bioseqs. It may contain a feature table, a set

of sequence alignments, or a set of graphs of attributes along the sequence.

Multiple Seq-annots can be placed on a Bioseq or on a Bioseq-set. Each Seq-annot

can have specific attribution. For example, PowerBLAST (Zhang and Madden,

1997) produces a Seq-annot containing sequence alignments, and each Seq-annot is

named based on the BLAST program used (e.g., BLASTN, BLASTX, etc.). The

individual blocks of alignments are visible in the Entrez and Sequin viewers.

Because the components of a Seq-annot have specific references to locations on

Bioseqs, the Seq-annot can stand alone or be exchanged with other scientists, and it

need not reside in a sequence record. The scope of descriptors, on the other hand,

does depend on where they are packaged. Thus, information about Bioseqs can be

created, exchanged, and compared independently of the Bioseq itself. This is an

important attribute of the Seq-annot and of the NCBI data model.

Seq-feat: Features

A sequence feature (Seq-feat) is a block of structured data explicitly attached to a

region of a Bioseq through one or two sequence locations (Seq-locs). The Seq-feat

itself can carry information common to all features. For example, there are flags to

indicate whether a feature is partial (i.e., goes beyond the end of the sequence of

the Bioseq), whether there is a biological exception (e.g., RNA editing that explains

why a codon on the genomic sequence does not translate to the expected amino

acid), and whether the feature was experimentally determined (e.g., an mRNA was

isolated from a proposed coding region).

A feature must always have a location. This is the Seq-loc that states where on

the sequence the feature resides. A coding region’s location usually starts at the ATG

and ends at the terminator codon. The location can have more than one interval if

it is on a genomic sequence and mRNA splicing occurs. In cases of alternative

splicing, separate coding region features are created, with one multi-interval Seq-loc

for each isolated molecular species.

Optionally, a feature may have a product. For a coding region, the product Seq-loc

points to the resulting protein sequence. This is the link that allows the data

model to separately maintain the nucleotide and protein sequences, with annotation

on each sequence appropriate to that molecule. An mRNA feature on a genomic

sequence could have as its product an mRNA Bioseq whose sequence reflects the

results of posttranscriptional RNA editing. Features also have information unique to

the kind of feature. For example, the CDS feature has fields for the genetic code

and reading frame, whereas the tRNA feature has information on the amino acid

transferred.

This design completely modularizes the components required by each feature

type. If a particular feature type calls for a new field, no other field is affected. A

new feature type, even a very complex one, can be added without changing the

existing features. This means that software used to display feature locations on a

sequence need consider only the location field common to all features.
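
A stripped-down feature record shows how the common location field works. The Python sketch below is illustrative only: the `SeqFeat` fields are a small invented subset, intervals are 0-based and half-open by assumption, and minus-strand features are ignored.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # (start, end), 0-based half-open: an assumption here


@dataclass
class SeqFeat:
    kind: str                         # "gene", "cds", "mrna", ...
    seq_id: str                       # Bioseq the location refers to
    location: List[Interval]          # more than one interval when spliced
    product_id: Optional[str] = None  # e.g. the protein made by a CDS
    partial: bool = False


def feature_span(feat: SeqFeat) -> Interval:
    """Extent needed to draw any feature, whatever its kind."""
    return (min(s for s, _ in feat.location), max(e for _, e in feat.location))


def spliced_sequence(feat: SeqFeat, sequence: str) -> str:
    """Join the intervals of the location in order."""
    return "".join(sequence[s:e] for s, e in feat.location)


genomic = "ccATGaaaTAAgg"
cds = SeqFeat("cds", "nuc1", [(2, 5), (8, 11)], product_id="prot1")
span = feature_span(cds)
joined = spliced_sequence(cds, genomic)
```

`feature_span` never looks at `kind` or `product_id`, which is the point made above: display code depends only on the field that all features share.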

Although the DDBJ/EMBL/GenBank feature table allows numerous kinds of

features to be included (see Chapter 3), the NCBI data model treats some features

as ‘‘more equal’’ than others. Specifically, certain features directly model the central

dogma of molecular biology and are most likely to be used in making connections

between records and in discovering new information by computation. These features

are discussed next.

Genes. A gene is a feature in its own right. In the past, it was merely a qualifier

on other features. The Gene feature indicates the location of a gene, a heritable region

of nucleic acid sequence that confers a measurable phenotype. That phenotype may

be achieved by many components of the gene being studied, including, but not

limited to, coding regions, promoters, enhancers, and terminators. The Gene feature

is meant to approximately cover the region of nucleic acid considered by workers

in the field to be the gene. This admittedly fuzzy concept has an appealing simplicity,

and it fits in well with higher-level views of genes such as genetic maps. It has

practical utility in the era of large genomic sequencing when a biologist may wish

to see just the ‘‘xyz gene’’ and not a whole chromosome. The Gene feature may also

contain cross-references to genetic databases, where more detailed information on

the gene may be found.

RNAs. An RNA feature can describe both coding intermediates (e.g., mRNAs)

and structural RNAs (e.g., tRNAs, rRNAs). The locations of an mRNA and the

corresponding coding region (CDS) completely determine the locations of 5′ and 3′

untranslated regions (UTRs), exons, and introns.
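
How the two locations determine the rest can be shown with simple interval arithmetic. The Python sketch below assumes plus-strand features whose intervals are sorted, 0-based, and half-open; the function names and coordinates are invented for the example.

```python
def utrs(mrna, cds):
    """Return (five_prime, three_prime) UTR intervals from mRNA and CDS locations."""
    cds_start = cds[0][0]
    cds_end = cds[-1][1]
    # Parts of the mRNA exons lying before the CDS start or after the CDS end.
    five = [(s, min(e, cds_start)) for s, e in mrna if s < cds_start]
    three = [(max(s, cds_end), e) for s, e in mrna if e > cds_end]
    return five, three


def introns(mrna):
    """Gaps between consecutive mRNA intervals (the exons)."""
    return [(mrna[i][1], mrna[i + 1][0]) for i in range(len(mrna) - 1)]


mrna = [(0, 50), (100, 180), (300, 400)]   # three exons
cds = [(30, 50), (100, 180), (300, 350)]   # coding part of those exons
five_utr, three_utr = utrs(mrna, cds)
gaps = introns(mrna)
```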

Figure 2.3. The Coding Region (CDS) feature links specific regions on a nucleotide sequence

with its encoded protein product. All features in the NCBI data model have a ‘‘location’’

field, which is usually one or more intervals on a sequence. (Multiple intervals on

a CDS feature would correspond to individual exons.) Features may optionally have a ‘‘product’’

field, which for a CDS feature is the entirety of the resulting protein sequence. The

CDS feature also contains a field for the genetic code. This appears in the GenBank flat

file as a /transl_table qualifier. In this example, the Bacterial genetic code (code 11) is

indicated. A CDS may also have translation exceptions indicating that a particular residue

is not what is expected, given the codon and the genetic code. In this example, residue

196 in the protein is selenocysteine, indicated by the /transl_except qualifier. NCBI

software includes functions for converting between codon locations and residue locations,

using the CDS as its guide. This capability is used to support the historical conventions of

GenBank format, allowing a signal peptide, annotated on the protein sequence, to appear

in the GenBank flat file with a location on the nucleotide sequence.
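
The conversion between residue locations and nucleotide locations mentioned in the caption can be sketched as follows. This is a minimal illustration and not the NCBI software itself: it assumes a plus-strand CDS given as 0-based, half-open intervals, and the function name `protein_to_nucleotide` is hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 0-based, half-open [start, end), plus strand (assumption)


def protein_to_nucleotide(cds: List[Interval], res_start: int, res_end: int) -> List[Interval]:
    """Map protein residues [res_start, res_end) to nucleotide intervals through a CDS."""
    # Offsets of the wanted residues within the spliced coding sequence.
    want_start, want_end = 3 * res_start, 3 * res_end
    mapped: List[Interval] = []
    offset = 0  # number of coding bases consumed by the earlier CDS intervals
    for start, end in cds:
        length = end - start
        lo = max(want_start, offset)
        hi = min(want_end, offset + length)
        if lo < hi:  # this CDS interval overlaps the wanted residues
            mapped.append((start + lo - offset, start + hi - offset))
        offset += length
    return mapped


# A two-exon CDS: 10 codons in the first interval, 20 in the second.
cds = [(10, 40), (100, 160)]
# A hypothetical 15-residue signal peptide annotated on the protein.
signal_peptide = protein_to_nucleotide(cds, 0, 15)
print(signal_peptide)  # [(10, 40), (100, 115)]
```

The feature on the protein spans one interval, but its nucleotide location is split across two exons, which is why the mapping has to walk the CDS intervals in order.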

Coding Regions. A Coding Region (CDS) feature in the NCBI data model

can be thought of as “instructions to translate” a nucleic acid into its protein product,

via a genetic code (Fig. 2.3). A coding region serves as a link between the nucleotide

and protein. It is important to note that several situations can provide exceptions to

the classical colinearity of gene and protein. Translational stuttering (ribosomal slippage),

for example, merely results in the presence of overlapping intervals in the

feature’s location Seq-loc.
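
To see why overlapping intervals are enough to describe ribosomal slippage, consider the sketch below. It assumes a plus-strand location expressed as 0-based, half-open intervals; the sequence, the coordinates, and the function name `spliced_coding_sequence` are invented for illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 0-based, half-open [start, end), plus strand (assumption)


def spliced_coding_sequence(seq: str, location: List[Interval]) -> str:
    """Concatenate the intervals of a CDS location in order, overlaps included."""
    return "".join(seq[start:end] for start, end in location)


# Hypothetical 17-base sequence with a -1 slip: base 8 is read twice,
# so the second interval starts one base before the first one ends.
seq = "ATGAAACCCGGGTTTAA"
location = [(0, 9), (8, 17)]
coding = spliced_coding_sequence(seq, location)
codons = [coding[i:i + 3] for i in range(0, len(coding), 3)]
print(coding)  # ATGAAACCCCGGGTTTAA
print(codons)  # ['ATG', 'AAA', 'CCC', 'CGG', 'GTT', 'TAA']
```

The 17-base template yields an 18-base coding sequence because the overlapping base is counted in both intervals, so no special frameshift field is needed.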

The genetic code is assumed to be universal unless explicitly given in the Coding

Region feature. When the genetic code is not followed at specific positions in the

sequence—for example, when alternative initiation codons are used in the first position,

when suppressor tRNAs bypass a terminator, or when selenocysteine is added

—the Coding Region feature allows these anomalies to be indicated.
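
A minimal sketch of translation with a genetic code table and per-position exceptions is given below. The table is the standard code written in the compact TCAG-ordered form that NCBI uses for its translation tables. The function name `translate` and its parameters `exceptions` and `alt_start` are hypothetical, and the input is assumed to contain only uppercase A, C, G, and T.

```python
from itertools import product
from typing import Dict, Optional

BASES = "TCAG"
# Standard genetic code (translation table 1), one letter per codon in TCAG order.
STANDARD_AAS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
STANDARD_CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), STANDARD_AAS)}


def translate(coding_seq: str,
              code: Optional[Dict[str, str]] = None,
              exceptions: Optional[Dict[int, str]] = None,
              alt_start: bool = False) -> str:
    """Translate a spliced coding sequence, honoring per-residue exceptions."""
    code = code or STANDARD_CODE          # universal code unless one is given
    exceptions = exceptions or {}         # 0-based residue index -> residue letter
    protein = []
    for i in range(0, len(coding_seq) - 2, 3):
        index = i // 3
        if index in exceptions:           # e.g. selenocysteine at a TGA codon
            aa = exceptions[index]
        elif index == 0 and alt_start:    # alternative initiation codon read as Met
            aa = "M"
        else:
            aa = code[coding_seq[i:i + 3]]
        if aa == "*":                     # stop at the first unsuppressed terminator
            break
        protein.append(aa)
    return "".join(protein)


cds_seq = "GTGAAATGACCCTAA"  # codons: GTG AAA TGA CCC TAA
plain = translate(cds_seq)
annotated = translate(cds_seq, exceptions={2: "U"}, alt_start=True)
print(plain)      # VK
print(annotated)  # MKUP
```

Without the exceptions, translation starts with valine and stops at the internal TGA; with them, the GTG start is read as methionine and the TGA as selenocysteine (U).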

Proteins. A Protein feature names (or at least describes) a protein or proteolytic

product of a protein. A single protein Bioseq may have many Protein features on it.

It may have one over its full length describing a pro-peptide, the primary product of

translation. (The name in this feature is used for the /product qualifier in the

CDS feature that produces the protein.) It may have a shorter protein feature describing

the mature peptide or, in the case of viral polyproteins, several mature

peptide features. Signal peptides that guide a protein through a membrane may also

be indicated.

Others. Several other features are less commonly used. A Region feature provides

a simple way to name a region of a chromosome (e.g., “major histocompatibility

complex”) or a domain on a polypeptide. A Bond feature annotates a bond

between two residues in a protein (e.g., disulfide). A Site feature annotates a known

site (e.g., active, binding, glycosylation, methylation, phosphorylation).

Finally, numerous features exist in the table of legal features, covering many

aspects of biology. However, they are less likely than the above-mentioned features

to be used for making connections between records or for making discoveries based

on computation.