Sequence Alignment Terms

Alignment

The comparison of two or more nucleotide or protein sequences to determine their degree of similarity. Commonly used to infer functional or evolutionary relationships between genes and proteins.

Assembly

The process of combining short DNA sequence fragments into larger units by looking for overlaps between different fragments. Often required because the length of the genes studied exceeds the length of the sequence fragments produced by DNA sequencing machines. Also used to combine several fragments that cover the same region, for example in forward and reverse directions, with the goal of reducing errors in the consensus sequence.
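
The overlap search described above can be illustrated with a minimal sketch. It assumes error-free fragments that overlap exactly, suffix to prefix; real assemblers tolerate mismatches and use quality values. The function names and the min_overlap parameter are hypothetical and chosen only for this example.

```python
def longest_overlap(left, right, min_overlap=3):
    """Return the length of the longest suffix of `left` that is also a
    prefix of `right`, or 0 if it is shorter than `min_overlap`."""
    max_len = min(len(left), len(right))
    # Try the longest possible overlap first, then shrink.
    for length in range(max_len, min_overlap - 1, -1):
        if left[-length:] == right[:length]:
            return length
    return 0


def merge_fragments(left, right, min_overlap=3):
    """Join two fragments into one longer sequence if they overlap.
    Returns None when no sufficient overlap is found."""
    length = longest_overlap(left, right, min_overlap)
    if length == 0:
        return None
    # Keep `left` whole and append only the non-overlapping tail of `right`.
    return left + right[length:]


# Hypothetical fragments sharing the four bases "TGCA".
merged = merge_fragments("ACGTTGCA", "TGCAGGTC")
```

Here the two fragments share a four-base overlap, so the merged sequence is shorter than the two fragments laid end to end.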

Consensus Sequence

A single sequence generated from an alignment or assembly of sequence fragments that is the "best fit" for the given sequences. Historically, majority ("vote based") and inclusive methods were most commonly used to determine the consensus sequence. For sequence assemblies, these methods have often been replaced by quality-based consensus methods. Quality-based consensus sequences are typically more accurate than majority-based sequences, and can drastically reduce the need for manual editing of sequence assemblies.
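
The difference between majority-based and quality-based consensus can be sketched as follows. This is a simplified illustration, not the method of any particular assembler: it assumes the fragments are already aligned to equal length, and it assumes that summing per-base quality scores is an acceptable stand-in for a real quality-based method. All names and data are hypothetical.

```python
from collections import Counter


def majority_consensus(rows):
    """Pick the most frequent base in each alignment column."""
    consensus = []
    for column in zip(*rows):
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


def quality_consensus(rows, quals):
    """Pick, in each column, the base with the highest summed quality.
    `quals` holds one list of per-base scores per aligned row."""
    consensus = []
    for bases, scores in zip(zip(*rows), zip(*quals)):
        totals = {}
        for base, score in zip(bases, scores):
            totals[base] = totals.get(base, 0) + score
        consensus.append(max(totals, key=totals.get))
    return "".join(consensus)


# Hypothetical example: two low-quality reads disagree with one
# high-quality read in the third column.
rows = ["ACGT", "ACTT", "ACTT"]
quals = [[40, 40, 40, 40], [40, 40, 5, 40], [40, 40, 5, 40]]
by_vote = majority_consensus(rows)
by_quality = quality_consensus(rows, quals)
```

In this example the vote follows the two low-quality reads, while the quality-weighted call follows the single high-quality read.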

Contig

The result of a sequence assembly or alignment that shows the arrangement of the fragments to form a contiguous large sequence.

Dynamic Programming

A method from computer science for finding the optimal alignment between sequences. For two sequences, the algorithm builds a two-dimensional matrix of scores based on the identity or similarity of bases (or amino acids) in the two sequences, and then traces the highest-scoring path through the matrix to obtain the alignment. A commonly used dynamic programming method is the Needleman-Wunsch algorithm.
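
A minimal sketch of the Needleman-Wunsch algorithm shows how the score matrix is filled and then traced back. The scoring values (match +1, mismatch -1, linear gap penalty -1) are assumptions chosen for illustration; practical tools typically use substitution matrices and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of `a` and `b`.
    Returns (score, aligned_a, aligned_b) with '-' marking gaps."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against gaps only.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the matrix: each cell is the best of diagonal, up, and left.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal path.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


total, top, bottom = needleman_wunsch("ACGT", "AGT")
```

When several paths have the same score, this sketch returns only one of them; the order of the checks in the traceback decides which.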

Global Alignments

Global alignments attempt to align every base (or amino acid) in each aligned sequence.

Local Alignments

Local alignments align only similar regions between sequences, and leave regions with too many differences unaligned. Local alignments can be better suited for the alignment of very dissimilar sequences. In sequence assembly, the program Phrap demonstrated that local alignments can be used to reduce or eliminate the need to remove low-quality sequence (end clipping) before assembly.
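
One well-known dynamic programming method for local alignment is the Smith-Waterman algorithm. It is not named in the definition above and is used here only as an illustration; it is not a description of how Phrap works. The scoring values (match +2, mismatch -1, gap -2) are assumptions. The key difference from a global alignment is that cell scores never drop below zero, so dissimilar flanking regions are simply left out.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment of `a` and `b`.
    Returns (score, aligned_a, aligned_b) for the best-scoring region."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Flooring at 0 lets an alignment restart anywhere.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score reaches 0.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Hypothetical reads that share a core region but have unrelated ends.
local_score, region_a, region_b = smith_waterman(
    "TTTTACGTACGTCCCC", "GGGGACGTACGTAAAA")
```

In this example only the shared core is reported, and the unrelated ends of both sequences are left unaligned.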

Multiple Sequence Alignment

Multiple sequence alignment refers to the process of aligning three or more nucleotide or protein sequences to identify similarities between the sequences. Alignments that include many sequences can be computationally intensive, and require more sophisticated algorithms than pairwise alignments.

Pairwise Alignment

In pairwise sequence alignment, exactly two nucleotide or protein sequences are aligned to each other to determine the similarity between the two sequences.

Word-based Alignment Methods

Word-based alignment methods are an optimization often used in sequence alignment and assembly. Instead of examining every single nucleotide or amino acid, "words" of a fixed length are analyzed. This can lead to substantial reductions in memory use and alignment times. One common application is to use the number of shared words between two sequences to estimate their similarity in the early phases of sequence alignment, or to identify sequences that share overlaps in sequence assembly.
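
Counting shared words can be sketched in a few lines. The word length of 4 and the function names are assumptions made for this example; actual programs choose the word length to suit the data and usually index word positions as well as word identity.

```python
def words(seq, k):
    """Return the set of all words of length `k` found in `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_word_count(a, b, k=4):
    """Number of distinct length-`k` words present in both sequences.
    A higher count suggests the sequences are more likely to be similar
    or to overlap, and are worth aligning in full."""
    return len(words(a, k) & words(b, k))


# Hypothetical sequences: the first pair shares a region, the second does not.
related = shared_word_count("ACGTACGTGG", "TTACGTACAA")
unrelated = shared_word_count("ACGTACGTGG", "GGGGGGGG")
```

A cheap comparison like this can be used to rank candidate pairs before running a full dynamic programming alignment on the most promising ones.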