Regular Expressions for Biologists: Key Points

Introduction

Regular expressions are a way of describing patterns in text.
Most text editors and many other tools include a regular expression engine for searching text with these patterns.
Regular expressions are often offered as a mode of find/replace that can be turned on and off by the user.

Regex Fundamentals

Wrap characters in [] to define a set of valid matches for a given position.
Use - between two characters to define a range of characters to match.
Use ^ at the start of a set to invert it, indicating that the given characters should be excluded from a match.
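
As a minimal sketch of the three points above, the following uses Python's built-in re module. The sequence and the coordinate string are made-up illustrations, not data from the lesson.

```python
import re

# Hypothetical example sequence; N, R and Y stand for ambiguous base calls.
sequence = "ATGCNNATRYGC"

# [ACGT] matches any one of the four unambiguous DNA bases.
unambiguous = re.findall(r"[ACGT]", sequence)

# [^ACGT] inverts the set: any character that is NOT A, C, G or T.
ambiguous = re.findall(r"[^ACGT]", sequence)

# [0-9] uses a range to match any single digit.
digits = re.findall(r"[0-9]", "chr7:117")

print(unambiguous)
print(ambiguous)
print(digits)
```

The same set syntax should work unchanged in the find/replace dialog of most editors that offer a regular expression mode.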

Tokens and Wildcards

Use the \b token to match a word boundary, and ^ and $ to match the beginning and end of a line respectively.
\ has special meaning in regular expressions, so \\ should be used to specify a literal backslash in a pattern.
. matches any single character in that position.
When composing a regular expression, it is good practice to be as specific as possible about what you want to match.
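
The tokens above can be tried out with a short sketch in Python's re module. All of the strings here are invented for illustration. The re.MULTILINE flag is a Python-specific detail that makes ^ and $ apply to each line rather than to the whole string.

```python
import re

# \b marks a word boundary, so only the standalone word "cat" matches.
whole_word = re.findall(r"\bcat\b", "cat catalog concat")

# ^ and $ anchor to the start and end of a line (per line with re.MULTILINE).
fasta = ">seq1\nATGC\n>seq2\nGGCC"
headers = re.findall(r"^>.*$", fasta, flags=re.MULTILINE)

# \\ in the pattern matches one literal backslash in the text.
path = r"C:\data\reads.fastq"
backslashes = re.findall(r"\\", path)

# . accepts any character, including ones you may not have intended.
loose = re.findall(r"A.G", "ATG A-G AAG")

# A set is more specific: only a DNA base is accepted in the middle position.
strict = re.findall(r"A[ACGT]G", "ATG A-G AAG")

print(whole_word, headers, len(backslashes), loose, strict)
```

The last two patterns show why being specific matters: the wildcard version also picks up "A-G", which the set-based version rejects.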

Repeated Matches

? indicates that the preceding character or set should be treated as optional in this position.
* indicates that the preceding character or set should appear 0 or more times in this position.
+ indicates that the preceding character or set should appear 1 or more times in this position.
{2,4} indicates that the preceding character or set should appear at least twice but no more than four times in this position.
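
The four quantifiers can be compared side by side in a small Python sketch. The test strings are hypothetical, and \b is used in the last pattern only to keep each run of letters separate.

```python
import re

# ? makes the preceding character optional: both spellings match.
optional = re.findall(r"colou?r", "color colour")

# * allows zero or more of the preceding character.
zero_or_more = re.findall(r"GA*T", "GT GAT GAAT")

# + requires at least one of the preceding character (here, a run of A).
one_or_more = re.findall(r"A+", "GCAAAAAGC")

# {2,4} requires between two and four repeats; runs of 1 or 5 are rejected.
bounded = re.findall(r"\bA{2,4}\b", "A AA AAAA AAAAA")

print(optional, zero_or_more, one_or_more, bounded)
```

Without the \b boundaries, A{2,4} would still match the first four characters of the five-letter run, because a pattern can match part of a longer string.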

Capture Groups and References

Capture groups are defined within () in a regular expression.
The left-most capture group in a regular expression is referred to with \1 in the replacement string, the next with \2, and so on.
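
A minimal sketch of capture groups and references, using re.sub from Python's re module. The species name and sample label are illustrative assumptions. In Python the replacement is written as a raw string so that the backslash reaches the regex engine intact, and some editors write these references as $1 and $2 instead.

```python
import re

# Two capture groups: group 1 is the genus, group 2 is the species.
name = "Homo sapiens"
swapped = re.sub(r"([A-Za-z]+) ([A-Za-z]+)", r"\2, \1", name)

# Groups can be reused in any order, alongside new literal text.
# Hypothetical label layout: sample_<number>_<tissue>
label = "sample_12_liver"
relabelled = re.sub(r"sample_([0-9]+)_([a-z]+)", r"\2-\1", label)

print(swapped)
print(relabelled)
```

Groups are numbered by the position of their opening parenthesis, counting from the left.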

Alternative Matches

Alternative strings to match can be combined with |.
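
As a short illustration of alternation, the sketch below uses Python's re module. The codon list and the sentence are made up for the example.

```python
import re

# | separates alternatives: any one of the three stop codons matches.
stop_codon = re.compile(r"TAA|TAG|TGA")

codons = ["ATG", "GCC", "TAA", "TGA", "TTT"]

# fullmatch requires the whole codon to equal one of the alternatives.
stops = [codon for codon in codons if stop_codon.fullmatch(codon)]

# Alternatives may differ in length and content.
animals = re.findall(r"mouse|mice", "one mouse, two mice")

print(stops)
print(animals)
```

When the alternatives differ by only a single character, a set such as TA[AG] may be a more compact way to express the same choice.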