Degenerate DNA sequences as regular expressions

DNA molecules are described using a string made up of four different letters, each representing a nucleotide base on one strand of the double helix: A, C, G and T. Sometimes, however, there is a need to represent several possible nucleotides in a given position. One example is when the sequencing does not work perfectly. Another is when a binding site for proteins is described.

There are several ways to represent these ambiguous or degenerate positions. Take for example this description of the AGL1 binding site motif from AGRIS:


In this sequence we see two different methods for describing the degenerate positions. In some positions we see the letter N, which means that any of the four nucleotides matches in that position. Positions where more than one, but not all four, bases are allowed are represented using parentheses and slashes. Although the meaning is obvious, this convention is, as far as I know, considered non-standard. The use of the letter N, however, is an accepted standard. There are also standard one-letter representations for all possible combinations of two and three nucleotides. For example, the letter D is used to represent a position where A, G and T are allowed. This means that the fourth position in the above motif could be represented using D instead of (A/G/T).
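As a minimal sketch of how the two notations relate, the snippet below collapses each parentheses-and-slash group into its one-letter code. The dictionary lists the standard IUPAC nucleotide codes; the function name `to_one_letter` and the example motif are made up for illustration, and the motif is not the AGRIS entry.

```python
import re

# Standard IUPAC one-letter nucleotide codes and the bases each stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT",
    "N": "ACGT",
}

# Reverse lookup: set of allowed bases -> one-letter code.
CODE_FOR = {frozenset(bases): code for code, bases in IUPAC.items()}

def to_one_letter(motif):
    """Replace each (X/Y/...) group with its one-letter IUPAC code."""
    def collapse(match):
        bases = frozenset(match.group(1).split("/"))
        return CODE_FOR[bases]
    return re.sub(r"\(([ACGT](?:/[ACGT])+)\)", collapse, motif)

# Hypothetical motif, for illustration only.
example = "NTT(A/G/T)C(A/T)"
print(to_one_letter(example))  # NTTDCW
```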

Suppose now that you want to find out if this motif occurs in a DNA sequence. A simple text search with the motif as it is described above will obviously not do. One obvious solution would be to turn the motif into a regular expression. There are many ways in which the above motif could be described using a regular expression, but I will take advantage of the fact that the motif already is very similar to a regular expression pattern. The only thing we need to do is convert it into the correct syntax.

In regular expressions the ambiguities can be described using square brackets. Position 4 in the motif becomes [AGT] instead of (A/G/T). A simple find-replace that swaps the parentheses for square brackets and removes the slashes will do the trick for this kind of notation. The positions described with the single letter N can similarly be replaced with [ACGT].
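The find-replace described above could look roughly like this in Python. It is only a sketch: `motif_to_regex`, the short motif and the target sequence are made-up examples, and the motif is not the AGRIS entry.

```python
import re

def motif_to_regex(motif):
    """Turn parentheses-and-slash notation into regex character classes."""
    # (A/G/T) -> [AGT]: swap the brackets and drop the slashes.
    pattern = motif.replace("(", "[").replace(")", "]").replace("/", "")
    # N -> [ACGT]: any of the four bases.
    return pattern.replace("N", "[ACGT]")

# Hypothetical motif and sequence, for illustration only.
pattern = motif_to_regex("NTT(A/G/T)C")
print(pattern)  # [ACGT]TT[AGT]C

sequence = "GGATTGCAA"
match = re.search(pattern, sequence)
print(match.group(), match.start())  # ATTGC 2
```

Note that this only handles N among the one-letter codes; a motif that already uses letters such as D or W would need those translated as well.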

But wait! What if the sequence you want to find the motif in itself contains ambiguities? Let’s at least hope the ambiguities are represented using the single-letter standard and not the parentheses-and-slash method. First, to avoid matching the wrong motifs, we have to establish that an A in the motif only matches an A in the sequence, and does not match D, for example, even though an A is possible in that position. This is already implied, as an A in a regular expression of course does not match anything other than an A.

What we need to do is change the square-bracket ambiguities in the regular expression so that they also match the appropriate ambiguity codes in the target sequence. Take the letter N, which is now encoded as [ACGT]. Of course we need to add an N, but that’s not all. Every other letter in the code stands for a subset of the four bases, so we need to add them all. N therefore becomes [ACGTRYMKSWHBVDN]. Similarly, D becomes [AGTRKWD]: besides A, G, T and D itself, it includes R (A or G), K (G or T) and W (A or T), the codes that cannot stand for a C.
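The rule can be stated as: a letter in the target sequence may match a motif position only if every base it stands for is allowed at that position. A small sketch of that subset check follows; the helper name `extended_class` is my own, and the dictionary holds the standard IUPAC codes. Under this rule R (A or G) also ends up in the class for D, since it cannot stand for a C.

```python
# Standard IUPAC one-letter nucleotide codes and the bases each stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT",
    "N": "ACGT",
}

def extended_class(code):
    """Character class of all codes whose bases are a subset of `code`'s."""
    allowed = set(IUPAC[code])
    members = "".join(c for c, bases in IUPAC.items() if set(bases) <= allowed)
    # A plain base such as A only has itself as a subset, so no brackets.
    return members if len(members) == 1 else "[" + members + "]"

print(extended_class("N"))  # [ACGTRYMKSWHBVDN]
print(extended_class("D"))  # [AGTRKWD]
print(extended_class("A"))  # A
```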

But would it not be easier to represent D, which according to the link above means ‘not C’, as [^C]? The problem is that [^C] also matches letters that do not exclude C. It would, for instance, match N, which of course stands for all nucleotides, including C.
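A quick check makes the pitfall concrete. The target sequence here is invented, and the explicit class is the one the subset rule gives for D:

```python
import re

# Hypothetical target sequence with an ambiguous N in the middle.
target = "TTNCC"

# [^C] accepts N, although N may well be a C.
negated = re.search("TT[^C]CC", target)
# The explicit class lists only codes that cannot stand for a C.
explicit = re.search("TT[AGTRKWD]CC", target)

print(negated is not None)   # True
print(explicit is not None)  # False
```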

Here is a complete table of regular expression patterns for all degenerate DNA bases:

Nucleotide Regexp pattern
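The rows of such a table do not have to be typed by hand. As a sketch, assuming the standard IUPAC codes and the subset rule described earlier, they can be generated and then applied to a motif written with one-letter codes. The names `pattern_for`, `TABLE` and `motif_to_regex` are my own, and the motif and sequence are made up.

```python
import re

# Standard IUPAC one-letter nucleotide codes and the bases each stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT",
    "N": "ACGT",
}

def pattern_for(code):
    """Regex pattern matching every code that is a subset of `code`."""
    allowed = set(IUPAC[code])
    members = "".join(c for c, bases in IUPAC.items() if set(bases) <= allowed)
    return members if len(members) == 1 else "[" + members + "]"

# One row per nucleotide code.
TABLE = {code: pattern_for(code) for code in IUPAC}
for code, pattern in TABLE.items():
    print(code, pattern)

def motif_to_regex(motif):
    """Translate a motif written in one-letter codes, letter by letter."""
    return "".join(TABLE[letter] for letter in motif)

# Hypothetical motif and ambiguous target sequence, for illustration only.
regex = motif_to_regex("TTDC")
print(regex)  # TT[AGTRKWD]C
hit = re.search(regex, "GATTRCA")
print(hit.group(), hit.start())  # TTRC 2
```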

Thus the AGL1 binding site motif above becomes