Workshop Laboratory 3

Pairwise Sequence Alignment


Pairwise comparison and alignment of protein or nucleic acid sequences is the foundation upon which most other bioinformatics tools are built. In this exercise, we will briefly introduce the concept of dynamic programming, which is the algorithm that allows for efficient and complete comparison of two (or more) biological sequences. We will also investigate the effects of various parameters on the results of these comparisons and begin to look at database searching using a dynamic programming technique known as the Smith-Waterman algorithm.


Homology vs. Similarity

At this point, it will be useful to touch on a basic concept that will be assumed throughout these exercises: the distinction between the terms "homologous" and "similar." Most of what we will be doing is determining the similarity between two sequences. Homology may be inferred from similarity, and sequences with a very high degree of similarity can usually be inferred to be homologous, but similarity does not always imply homology.

It is important to note that identical protein sequences result in identical 3-D structures. So it follows that similar sequences may result in similar structures, and this is usually the case. The converse, however, is not true: identical 3-D structures do not necessarily indicate identical sequences. It is because of this that there is a distinction between "homology" and "similarity". There are examples of proteins in the databases that have nearly identical 3-D structures, and are therefore homologous, but do not exhibit significant (or detectable) sequence similarity. Here is an example.

So, we will usually stick to "sequence similarity" and talk about "protein homology" only when we know something special about the protein sequences, or we find similarity scores that warrant it.
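Unlike homology, sequence similarity is something we can measure directly. As a minimal sketch, assuming two sequences that have already been aligned (with "-" marking a gap), percent identity could be computed as follows. The function name and the choice to skip gapped columns are assumptions made for illustration; other conventions for the denominator exist, and real programs usually score similarity with a substitution matrix rather than plain identity.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of ungapped alignment columns that hold identical residues.

    Assumes both strings come from the same alignment, so they have equal
    length. Columns with a gap in either sequence are ignored (an assumption).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only the columns where both sequences have a residue.
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b)
             if x != gap and y != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


if __name__ == "__main__":
    print(percent_identity("ACGT", "ACGA"))
    print(percent_identity("AC-GT", "ACTGA"))
```

A high value from a function like this supports an inference of homology, but it does not prove it, and a low value does not rule it out.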

"The Best Global Alignment"

The question that may be most often asked when working with sequence alignments is: "What is the best global alignment of these two sequences?" Ultimately, the answer depends on whether or not the sequence alignment agrees with the 3-D structural alignment of the proteins being considered. This issue will be addressed in a demonstration of homology modelling later in the workshop. Other factors, such as conserved sequence motifs, phylogenetic information, and known structural domains, may also help you determine the value of different alignments. But in the end, you are the only judge of whether or not a given global alignment is valid. There are no foolproof measures of alignment quality that can be used to compare different results, especially for pairwise alignments. There are statistics that may be employed, but these are currently known to be valid only for local alignments.
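To see why "best" is not absolute, it helps to look at how the optimal global alignment score changes with the scoring parameters. The following is a minimal sketch of a global (Needleman-Wunsch style) score computed by dynamic programming. The match, mismatch, and linear gap values are illustrative assumptions, not the parameters of any particular program, and real protein comparisons would use a substitution matrix instead of a single match/mismatch pair.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of a and b (score only, no traceback).

    Uses a simple match/mismatch score and a linear gap penalty; all three
    default values are assumptions chosen for illustration.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Mild gap penalty: shifting one sequence (3 matches, 2 gaps) is optimal.
    print(global_alignment_score("ACGT", "CGTA", gap=-1))
    # Harsh gap penalty: the ungapped alignment (4 mismatches) is optimal.
    print(global_alignment_score("ACGT", "CGTA", gap=-4))
```

The same two sequences yield a different optimal alignment depending on the gap penalty alone, which is why the score by itself cannot tell you which alignment is biologically correct.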

DNA vs. Protein Analysis

In these exercises, we really don't address DNA sequence analysis for a variety of reasons. The first is that we simply don't have time to cover everything. The second, and most important, reason is that the use of protein sequences for analyses and database queries is always preferred to the use of nucleic acid sequences.

It has been estimated that database searches using DNA sequences can identify sequences diverged by only about 200 million years while searches with protein sequences can "look back" more than a billion years. This is because there is more information per residue in a protein sequence than in a DNA sequence. For instance, a protein sequence correctly translated from DNA carries with it information about reading frame, start and stop codons and codon redundancy in addition to its 3-dimensional structure and chemical properties.

While such information can be deciphered from a DNA sequence, it is not inherently a part of the sequence (i.e., we have to find ORFs, etc.). What's more, a protein sequence carries all this information in 1/3 (or less) the number of residues of a DNA sequence.
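The points about reading frame, codon redundancy, and the 3-to-1 length ratio can be illustrated with a small translation sketch. It uses the standard genetic code; the two DNA sequences and the function name are invented for illustration and are not taken from any database.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate DNA in the frame starting at its first base ('*' = stop).

    Trailing bases that do not fill a whole codon are ignored.
    """
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


if __name__ == "__main__":
    # Two hypothetical coding sequences that differ at 4 of 15 positions.
    dna_a = "ATGGCTCTGTGGTAA"
    dna_b = "ATGGCATTATGGTGA"
    print(translate(dna_a), translate(dna_b))  # same protein from both
    print(len(dna_a), len(translate(dna_a)))   # 15 bases, 5 codons
    print(translate(dna_a[1:]))                # wrong frame, different result
```

Because of codon redundancy, the two DNA sequences differ while their translations are identical, and shifting the reading frame by a single base produces an unrelated peptide. This is the sense in which a correctly translated protein sequence already "knows" its reading frame.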

And finally, most analysis programs treat biological sequences as simple strings of text. Protein sequences, as simple strings of text, are generally more complex and distinguishable. DNA sequences, on the other hand, in the absence of knowledge about correct reading frames and start and stop locations, are far less rich as simple text strings.
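A rough way to quantify "richness as text" is to compare the two alphabets. The sketch below assumes that every residue is equally likely and independent of its neighbors, which is not true of real sequences, so the numbers are only an upper-bound illustration. The function names are hypothetical.

```python
import math


def bits_per_residue(alphabet_size):
    """Maximum information per position, assuming uniform residue frequencies."""
    return math.log2(alphabet_size)


def chance_match_probability(alphabet_size, length):
    """Probability that two random words of this length are identical,
    assuming uniform, independent residues (an idealized model)."""
    return (1.0 / alphabet_size) ** length


if __name__ == "__main__":
    for name, size in (("DNA", 4), ("protein", 20)):
        print(name,
              round(bits_per_residue(size), 2),
              chance_match_probability(size, 10))
```

Under this idealized model a DNA position carries at most 2 bits and a protein position a little over 4 bits, and a 10-letter protein word is far less likely to match by chance than a 10-letter DNA word.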

 

We will first take a look at the Dynamic Programming algorithm for sequence comparison that most analysis programs are built upon.




Developed and Maintained by Mark S. Whitsitt
Last Updated: Saturday, June 06, 1998 12:29 PM