Bioinformatics Research Laboratory
IBI Biosolutions Pvt. Ltd, Panchkula - 134109, INDIA
GLOSSARY

Accession number : An identifier supplied by the curators of the major biological databases upon submission of a novel entry that uniquely identifies that sequence entry.

Alignment : The result of a comparison of two or more gene or protein sequences in order to determine their degree of base or amino acid similarity. Sequence alignments are used to determine the similarity, homology, function or other degree of relatedness between two or more genes or gene products.

Alignment score : A numerical value that describes the overall quality of an alignment. Higher numbers correspond to higher similarity.

Annotation : A combination of comments, notations, references, and citations, either in free format or utilizing a controlled vocabulary, that together describe all the experimental and inferred information about a gene or protein. Annotations can also be applied to the description of other biological systems. Batch, automated annotation of bulk biological sequence is one of the key uses of Bioinformatics tools.

Base pair : A pair of nitrogenous bases (a purine and a pyrimidine), held together by hydrogen bonds, that form the core of DNA and RNA, i.e. the A:T, G:C and A:U interactions.

Bioinformatics : The field of endeavor that relates to the collection, organization and analysis of large amounts of biological data using networks of computers and databases (usually with reference to the genome project and DNA sequence information).

Base-pairing : The attachment of one polynucleotide to another, or of one part of a polynucleotide to another part of the same polynucleotide, by base pairs.

Bit score : The value S', derived from the raw alignment score S, in which the statistical properties of the scoring system used have been taken into account.

BLAST : A set of programs used to perform fast similarity searches.

Block : Conserved ungapped patterns approximately 3-60 amino acids in length in a set of related proteins.

Browser : A program used to access sites on the WWW. Hyper Text Markup Language (HTML) enables browsers to represent a web page the same way regardless of computer platform.

Binding site : A specific DNA or RNA sequence to which a protein or protein complex binds. Some examples of protein binding sites are promoters, ribosome entry sites, and replication origins.

Blunt ends : Digestion of double-stranded DNA by many restriction enzymes (e.g. EcoRV) generates ends without any single-stranded sequences. Such ends are called blunt ends.

Cell : The basic unit of any living organism.

Cell Cycle : The life cycle of a cell, marked by cell division and separated into four phases: G1, S, G2, and M. DNA replication is confined to the S (synthesis) phase, and chromosomal separation to the M (mitotic) phase.

Conserved sequence : A DNA or protein sequence with a high degree of sequence identity to other sequences in that species, or in others.

Consensus sequence : A single sequence delineated from an alignment of multiple constituent sequences that represents a "best fit" for all those sequences. A "voting" or other selection procedure is used to determine which residue (nucleotide or amino acid) is placed at a given position in the event that not all of the constituent sequences have the identical residue at that position.

Central dogma : The path of information flow in DNA organisms (DNA -> RNA -> proteins).

Codon : A sequence of three nucleotides in messenger RNA that codes for a single amino acid.

Contig : A length of contiguous sequence assembled from partial, overlapping sequences, generated from a "shotgun" sequencing project. Contigs are typically created computationally, by comparing the overlapping ends of several sequencing reads generated by restriction enzyme digestion of a segment of genomic DNA. The creation of contigs in the presence of sequencing errors, ambiguities and repeats is one of the most computationally challenging aspects of the role of Bioinformatics in genome analysis.

Complementary sequence : A sequence of bases that can form a double-stranded structure by matching base pairs.

Convergence : The end-point of any algorithm that uses iteration or recursion to guide a series of data processing steps. An algorithm is usually said to have reached convergence when the difference between the computed and observed steps falls below a pre-defined threshold.

Computational biology : A field incorporating computer science and biology.

Data : Information manipulated by a computer program.

Database : A collection of data arranged for ease and speed of search and retrieval. Also called data bank.

Data Cleaning : A process whereby
automated or semi-automated algorithms are used to process experimental data, including noise, experimental errors and other artifacts, in order to generate and store high-quality data for use in subsequent analysis. Data cleaning is typically required in high-throughput sequencing where compression or other experimental artifacts limit the amount of sequence data generated from each sequencing run or "read."

Data Processing : Data processing is defined as the systematic performance of operations upon data such as handling, merging, sorting, and computing. The semantic content of the original data should not be changed, but the semantic content of the processed data may be changed.

Data Warehouses : Vast arrays of heterogeneous (biological) data, stored within a single logical data repository, that are accessible to different querying and manipulation methods.

Database : Any file system by which data gets stored following a logical process. (See also relational database.)

Deconvolution : Mathematical procedure to separate out the overlapping effects of molecules such as mixtures of compounds in a high-throughput screen, or mixtures of cDNAs in a high density array.

Data type : A description of a variable defined by two properties: a domain, which is the set of values that belong to that type, and a set of operations, which defines the behavior of that type.

DNA sequence : The relative order of base pairs in any sample of DNA, whether in a DNA fragment, a gene, a chromosome, or an entire genome.

Domain (protein) : A region of special biological interest within a single protein sequence. However, a domain may also be defined as a region within the three-dimensional structure of a protein that may encompass regions of several distinct protein sequences that accomplishes a specific function. A domain class is a group of domains that share a common set of well-defined properties or characteristics.

Enzyme : A class of proteins that
are capable of catalyzing chemical reactions (the making or breaking of chemical bonds). They do so by orienting their substrates into a suitable geometry in a particular location (the active site) where electrophilic or nucleophilic amino acid residues can participate in the reaction. Enzymes are protein catalysts that speed up chemical reactions that would otherwise be prohibitively slow under physiological conditions.

E-value : For a given score, the number of hits in a database search that we expect to see by chance with this score or better.

Entrez : An online retrieval system for searching several linked databases, provided by the National Center for Biotechnology Information (NCBI).

Expression (gene or protein) : A measure of the presence, amount, and time-course of one or more gene products in a particular cell or tissue. Expression studies are typically performed at the RNA (mRNA) or protein level in order to determine the number, type, and level of genes that may be up-regulated or down-regulated during a cellular process, in response to an external stimulus, or in sickness or disease. Gene chips and proteomics now allow the study of expression profiles of sets of genes or even entire genomes.

FASTA : A database search tool used to compare a nucleotide or peptide sequence to a sequence database.

Field : A single unit of data stored as part of a database record.

Functional domain : A well-defined region within a protein that can perform a specific function.

Gaps (affine gaps) : A gap is defined
as any maximal, consecutive run of spaces in a single string of a given alignment. Gaps help create alignments that better conform to underlying biological models and more closely fit patterns that one expects to find in a meaningful alignment. The idea is to take into account the number of contiguous gaps, and not only the number of spaces, when calculating an alignment. Affine gaps contain a component for gap insertion and a component for gap extension, where the extension penalty is usually much lower than the insertion penalty. This mimics biological reality, as multiple gaps would imply multiple mutations, but a single mutation can lead to a long gap quite easily.

Gap penalties : The penalty applied to a similarity score for the introduction of an insertion or deletion gap, the extension of a gap, or both. Gap penalties are usually subtracted from a cumulative score being determined for the comparison of two or more sequences via an optimization algorithm that attempts to maximize that score.

Gene : Classically, a unit of inheritance. In practice, a gene is a segment of DNA on a chromosome that encodes a protein and all the regulatory sequences (promoter) required to control expression of that protein.

GenBank : Data bank of genetic
sequences operated by a division of the National Institutes of Health.

Genetic code : The mapping of all possible codons into the 20 amino acids, including the start and stop codons.

Genome : The complete genetic content of an organism.

Genomic DNA (sequence) : DNA sequence typically obtained from mammalian or other higher-order species, which includes both intron and exon (coding) sequence, as well as non-coding regulatory sequences such as promoter and enhancer sequences.

GC content : The percentage of nucleotides in a genome that are guanine (G) or cytosine (C).

Gene Ontology (GO) : "The goal of the Gene Ontology Consortium is to produce a dynamic controlled vocabulary that can be applied to all eukaryotes even as knowledge of gene and protein roles in cells is accumulating and changing."

Genotype : A description of the genetic composition of an individual, including the alleles that do not show as outward characteristics.

Global alignment : The alignment of two nucleic acid or protein sequences over their entire length.

Gene families : Subsets of genes containing homologous sequences which usually correlate with a common function.

Gene library : A collection of cloned DNA fragments created by restriction endonuclease digestion that represent part or all of an organism's genome.

Gene product : The product, either RNA or protein, that results from expression of a gene. The amount of gene product reflects the activity of the gene.

Homology : Two or more biological species, systems or molecules that share a common evolutionary ancestor. In general usage: two or more gene or protein sequences that share a significant degree of similarity, typically measured by the amount of identity (in the case of DNA), or conservative replacements (in the case of protein), that they register along their lengths. Sequence "homology" searches are typically performed with a query DNA or protein sequence to identify known genes or gene products that share significant similarity and hence might inform on the ancestry, heritage and possible function of the query gene.

Homolog : One of a pair of chromosomes, in which one is obtained from the organism's maternal parent and the other from the paternal parent; found in diploid cells.

Homologous genes : Genes that share a common evolutionary ancestor.

Homologous protein : A protein that is related to another by common evolutionary history.

Homology domain : A region in a protein sequence with similarity to an otherwise unrelated protein. This term should be used only if the region is of a size sufficient to form a domain.

High-scoring segment pair (HSP) : Local alignments with no gaps that achieve one of the highest alignment scores in a given search.

Home page : A document on the World
Wide Web that acts as a front page or point of welcome to a collection of documents that may introduce an individual, organization, or point of interest.

Hypothesis : An idea that can be experimentally tested; an idea with the lowest level of confidence.

Introns : Introns are sections of DNA that will be spliced out after transcription, but before the RNA is used. Introns are common in eukaryotic RNAs of all types, but are found in prokaryotic tRNA and rRNA genes only.

Identity : The extent to which two nucleotide or amino acid sequences are invariant.

Junk DNA : Term used to describe the excess DNA that is present in the genome beyond that required to encode proteins. A misleading term since these regions are likely to be involved in gene regulation, and other as yet unidentified functions.

Ligand : Any small molecule that binds to a protein or receptor; the cognate partner of many cellular proteins, enzymes, and receptors.

Locus : The specific position occupied by a gene on a chromosome. At a given locus, any one of the variant forms of a gene may be present. The variants are said to be alleles of that gene.

Local alignment : An optimal alignment
that includes the most similar local region or regions.

Local similarity : An alignment technique where we ask for those subunits of two sequences that exhibit most similarity. In biology, this is normally termed local alignment.

Meiosis : A process within the cell nucleus that results in the reduction of the chromosome number from diploid (two copies of each chromosome) to haploid (a single copy) through two reductive divisions in germ cells.

Mitosis : The nuclear division that results in the replication of the genetic material and its redistribution into each of the daughter cells during cell division.

Multiple alignment : An alignment of three or more sequences, with gaps inserted in the sequences such that residues with common structural position and/or ancestral residues are aligned in the same column of a multiple alignment.

Messenger RNA (mRNA) : The complementary RNA copy of DNA formed from a single-stranded DNA template during transcription that migrates from the nucleus to the cytoplasm where it is processed into a sequence carrying the information to code for a polypeptide domain.

Motif : A conserved element of a protein sequence alignment that usually correlates with a particular function. Motifs are generated from a local multiple protein sequence alignment corresponding to a region whose function or structure is known. It is sufficient that it is conserved, and is hence likely to be predictive of any subsequent occurrence of such a structural/functional region in any other novel protein sequence.

Mutation : An inheritable alteration to the genome that includes genetic (point or single base) changes, or larger scale alterations such as chromosomal deletions or rearrangements.

Match : In sequence alignment,
the existence of the same base in a homologous position in both sequences.

Mismatch : A position in a double-stranded DNA molecule where base-pairing does not occur because the nucleotides are not complementary.

Nucleotide : A nucleic acid unit composed of a five carbon sugar joined to a phosphate group and a nitrogen base.

National Center for Biotechnology Information (NCBI) : A unit of the National Library of Medicine (NLM), National Institutes of Health (NIH).

National Institutes of Health (NIH) : An agency of the U.S. Department of Health and Human Services, Public Health Service; supports research and training to improve health.

National Library of Medicine (NLM) : Part of the National Institutes of Health (NIH). Includes NCBI.

Noncoding RNA : An RNA molecule that does not code for a protein.

Nucleic acid : The term first used to describe the acidic chemical compound isolated from the nuclei of eukaryotic cells. Now used specifically to describe a polymeric molecule comprising nucleotide monomers, such as DNA and RNA.

Organism : An individual, composed of organ systems (if multicellular). Multiple organisms make up a population.

Orthologous : Homologous sequences in different species that arose from a common ancestral gene during speciation.

Overlap : A part of a sequence that is common to sequences derived from separate sequencing experiments.
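The idea of an overlap, and its use in joining reads into a longer contiguous sequence, can be illustrated with a minimal Python sketch. The function names `longest_overlap` and `merge_reads`, the `min_len` parameter and the two example reads are hypothetical and chosen only for illustration; real assemblers must also tolerate sequencing errors and repeats, which this exact-match sketch does not attempt.

```python
def longest_overlap(left: str, right: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of `left` that is
    also a prefix of `right` (exact matching only)."""
    max_len = min(len(left), len(right))
    # Try the longest candidate first and stop at the first match.
    for size in range(max_len, max(min_len, 1) - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def merge_reads(left: str, right: str, min_len: int = 3) -> str:
    """Join two reads, writing their shared overlap only once."""
    size = longest_overlap(left, right, min_len)
    return left + right[size:]


# Hypothetical reads sharing the five bases "TACGT".
read_a = "ATGCGTACGT"
read_b = "TACGTTAGCA"
print(longest_overlap(read_a, read_b))  # 5
print(merge_reads(read_a, read_b))      # ATGCGTACGTTAGCA
```

Requiring a minimum overlap length (`min_len`) is one simple way to avoid joining reads on a match of only one or two bases, which would often occur by chance.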

Overlapping genes : Two genes whose coding regions overlap.

Pattern : Molecular biological patterns usually occur at the level of the characters making up the gene or protein sequence. A pattern language must be defined in order to apply different criteria to different positions of a sequence. In order to have position-specific comparison done by a computer, a pattern-matching algorithm must allow alternative residues at a given position, repetitions of a residue, exclusion of alternative residues, weighting, and ideally, combinatorial representation.

Profile : Sequence profiles are usually derived from multiple alignments of sequences with a known relationship, and consist of tables of position-specific scores and gap-penalties. Each position in the profile contains scores for all of the possible amino acids, as well as one penalty score for opening and one for continuing a gap at the specified position. Attempts have been made to further improve the sensitivity of the profile by refining the procedures to construct a profile starting from a given multiple alignment. Other representations for sequence domains or motifs do not necessarily require the presence of a correct and complete multiple alignment, such as hidden Markov models.

Protein families : Sets of proteins that share a common evolutionary origin, reflected by their relatedness in function, which is usually reflected by similarities in sequence or in primary, secondary or tertiary structure. Subsets of proteins with related structure and function.

P value : The probability of an alignment occurring with the score in question or better.

Pairwise alignment : In a pairwise alignment, two sequences are padded by gaps, to achieve the same length, and to display maximum similarity/conservation on a character-by-character basis.

Paralogous : Homologous sequences (sequences that share a common evolutionary ancestor) that diverged by gene duplication, as opposed to orthologs, which diverged by speciation.

Perl : An interpreted computer
language for easily manipulating text, files and processes.

Phylogeny : A classification scheme that indicates the evolutionary relationships between organisms.

PROSITE : A database of "patterns" (regular expressions) specific for various protein motifs.

Query (sequence) : A DNA, RNA or protein sequence used to search a sequence database in order to identify close or remote family members (homologs) of known function, or sequences with similar active sites or regions (analogs), from which the function of the query may be deduced.

Reading frame : A sequence of codons beginning with an initiation codon and ending with a termination codon, typically of at least 150 bases (50 amino acids) coding for a polypeptide or protein chain (see ORF and URF).

Recursion : An algorithmic procedure whereby an algorithm calls on itself to perform a calculation until the result exceeds a threshold, in which case the algorithm exits. Recursion is a powerful procedure with which to process data and is computationally quite efficient.

Repeats (repeat sequences) : Repeat sequences and approximate repeats occur throughout the DNA of higher organisms (mammals). For example, the Alu sequences, of length about 300 characters, appear hundreds of thousands of times in human DNA with about 87% homology to a consensus Alu string. Some short substrings such as TATA-boxes, poly-A and (TG)* also appear more often than by chance. Repeat sequences may also occur within genes, as mutations or alterations to those genes. Repetitive sequences, especially mobile elements, have many applications in genetic research. DNA transposons and retroposons are routinely used for insertional mutagenesis, gene mapping, gene tagging, and gene transfer in several model systems.

Repetitive elements : Repetitive
elements provide important clues about chromosome dynamics, evolutionary forces, and mechanisms for exchange of genetic information between organisms. The most ubiquitous class of repetitive elements in the DNA sequence in primate genomes is the Alu family of interspersed repeats, which have arisen in the last 65 million years of evolution. Alu repeats belong to a class of sequences defined as short interspersed elements (SINEs). Approximately 500,000 Alu SINEs exist within the human genome, representing about 5% of the genome by mass.

Signal sequence (leader sequence) : A short sequence added to the amino-terminal end of a polypeptide chain that forms an amphipathic helix allowing the nascent polypeptide to migrate through membranes such as the endoplasmic reticulum or the cell membrane. It is cleaved from the polypeptide after the protein has crossed the membrane.

Single nucleotide polymorphisms (SNPs) : Variations of single base pairs scattered throughout the human genome that serve as measures of the genetic diversity in humans. About 1 million SNPs are estimated to be present in the human genome, and SNPs are useful markers for gene mapping studies.

Single-pass sequencing : Rapid sequencing of large segments of the genome of an organism by isolating as many expressed (cDNA) sequences as possible and performing single sequencer runs on their 5' or 3' ends. Single-pass sequencing typically results in individual, error-prone sequencing reads of 400-700 bases, depending on the type of sequencer used. However, if many of these are generated from numerous clones from different tissues, they may be overlapped and assembled to remove the errors and generate a contiguous sequence for the entire expressed gene.

Site : Sites in sequences can be
located either in DNA (e.g. binding sites, cleavage sites) or in proteins. In order to identify a site in DNA, ambiguity symbols are used to allow several different symbols at one position. Proteins, however, need a different mechanism (see Pattern). Restriction enzyme cleavage sites, for instance, have the following properties: limited length (typically, less than 20 base pairs); definition of the cleavage site and its appearance (3', 5' overhang or blunt); definition of the binding site.

Structure prediction : Algorithms that predict the secondary, tertiary and sometimes even quaternary structure of proteins from their sequences. Determining protein structure from sequence has been dubbed "the second half of the Genetic Code" since it is the folded tertiary structure of a protein that governs how it functions as a gene product. As yet most structure prediction methods are only partially successful, and typically work best for certain well-defined classes of proteins.

Unidentified reading frame (URF) : An open reading frame encoding a protein of undefined function.

Variable numbers of tandem repeats (VNTRs) : DNA sequence blocks of 2-60 base pairs which are repeated from two to more than 20 times in different individuals. This polymorphism makes VNTRs very useful DNA markers used in genomic mapping, linkage analysis and also DNA fingerprinting.

Variation (genetic) : Variation in genetic sequences and the detection of DNA sequence variants genome-wide allow studies relating the distribution of sequence variation to a population history. This in turn allows one to determine the density of SNPs or other markers needed for gene mapping studies. Quantitation of these variations, together with analytical tools for studying sequence variation, also relates genetic variations to phenotype.
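As a worked illustration of the affine gap scheme defined under "Gaps (affine gaps)" in this glossary, the following Python sketch totals the gap penalty for one row of an alignment. It is only a sketch under stated assumptions: the function name `affine_gap_penalty`, the use of `-` as the gap character, the penalty values 10.0 and 0.5, and the convention that the opening penalty covers the first position of a gap are all chosen for illustration and are not taken from any particular alignment program.

```python
import re


def affine_gap_penalty(aligned: str,
                       gap_open: float = 10.0,
                       gap_extend: float = 0.5) -> float:
    """Total affine gap penalty for one row of an alignment.

    Each maximal run of '-' is one gap. Assumed convention: the
    first position of a gap costs `gap_open`, and every further
    position in the same gap costs `gap_extend`.
    """
    total = 0.0
    for run in re.finditer(r"-+", aligned):
        length = run.end() - run.start()
        total += gap_open + gap_extend * (length - 1)
    return total


# One long gap of four spaces versus four separate single-space gaps.
print(affine_gap_penalty("ACG----TA"))  # 11.5
print(affine_gap_penalty("A-C-G-T-A"))  # 40.0
```

Both rows contain four spaces, but the single long gap is penalized far less than four scattered gaps, which is the behavior the glossary entry describes.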
© 2007 IBI Biosolutions Pvt. Ltd. Disclaimer: These databases and associated information are protected by copyright. This server and its associated data and services are for academic, non-commercial use only. IBI has no liability for the use of results, data or information which have been provided through this server. Neither the use for commercial purposes, nor the redistribution of these database files to third parties, nor the distribution of parts of files or derivative products to any third parties is permitted. Commercial users may contact IBI Biosolutions Pvt. Ltd.