Regular Expressions Definition – regex, regexp
Regular expressions are the patterns used to find or find and replace text on the command line. Regular expressions are used in most modern programming languages and the syntax is usually very similar.
Regular expressions in Perl
In this tutorial we will be using perl on the command line to showcase how to use regular expressions. Perl one-liners is a great tool to add to your bioinformatics toolkit to find, replace, extract and generally manipulate strings.
Perl can be executed on the command line like this:
1
perl –ne 'if (/regular_expression/) { print $_.”\n”;}’ input_file_name
perl
calls the perl program-ne
these are parameters to the perl program-n
feeds the perl program each line of the file-p
feeds the perl program each line and prints every line.-e
stands for execute what is between the { }if (/regular_expression/)
is the portion of the perl command where you put a the string you are trying to find.{execute me}
is the portion of the perl command where you tell perl what to do. In the example above, each line of the file is printed to the screen.$_
variable in perl that represents the entire line that was fed to the perl program
Regular Expression Syntax
Like a program language regular expressions have syntax that can represent more than just letters and numbers but specific patterns of letters and numbers. This syntax will allow you to build a pattern that will match multiple strings or words or sentences.
Example Text
This file has 12 lines of text containig data about an individuals Name, phone number, address and SS#.
Copy this text into a file named address.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
ZEnderX 1-515-999-4321ZZIX
1000 ZBattlesXchool driveZZplayX
SS# 11-27-33-47-57
Bean 1-515-999-3149
1010 Bats drive
SS# 11-22-33-44-55
ZPetraX 1-515-999-1234ZZhouseX
10X0 Battleschool drive ZinX
SS# 21-22-23-24-25
Andrew 1-515-294-1320
206 ZScienceX IZZ.X
SS# 11-02-33-04-50
SS# 11-02-33-04-50
Simple Example that requires no syntax
Recall from above
1
perl -ne 'if (/regular_expression/) { print $_."\n";}' input_file_name
To find the line that contains Bean you would type
1
perl -ne 'if (/Bean/) {print $_."\n";}' address.txt
Result
1
Bean 1-515-999-3149
Other examples
1
2
3
perl -ne 'if (/515/) {print $_."\n";}' address.txt
perl -ne 'if (/Battleschool/) {print $_."\n";}' address.txt
perl -ne 'if (/Science/) {print $_."\n";}' address.txt
Example with Simple Syntax .
, *
, ?
the .
symbol in the regex pattern can represent any character
To find all addresses that have the pattern 10X0 Where X represents any number or character
1
perl -ne 'if(/10.0/) {print $_}' address.txt
Result
1
2
3
1000 ZBattlesXchool driveZZplayX
1010 Bats drive
10X0 Battleschool drive ZinX
To match multiple characters we can use the *
symbol. In the example below it will find all lines with Z followed by 0 or more characters then an X
1
perl -ne 'if(/Z.*X/) {print $_}' address.txt
Result prints all lines that contain this pattern
1
2
3
4
5
ZEnderX 1-515-999-4321ZIX
1000 ZBattlesXchool driveZZplayX
ZPetraX 1-515-999-1234ZZhouseX
10X0 Battleschool drive ZinX
206 ZScienceX IZZ.X
If however we wanted just the characters between Z and X we will need to designate a variable in Perl. This is achieved by putting the portion of the pattern we want to recall later inside parentheses ()
1
perl -ne 'if(/Z(.*)X/) {print $1."\n"}' address.txt
Result
1
2
3
4
5
EnderX 1-515-999-4321ZZI
BattlesXchool driveZZplay
PetraX 1-515-999-1234ZZhouse
in
ScienceX IZZ.
Using .*
will find the pattern with the longest match even if a shorter match is present. This is called Greedy Matching
Using .*?
will find the pattern with the shortest match. This is called Lazy Matching
1
perl -ne 'if(/Z(.*?)X/) {print $1."\n"}' address.txt
Result. As you can see it finds the shortest match contained between the letters Z and X.
1
2
3
4
5
Ender
Battles
Petra
in
Science
We can add additional variables in the pattern to capture secondary regions.
1
perl -ne 'if(/Z(.*?)X.*ZZ(.*)X/) {print $2."\n"}' address.txt
Result. As you can see this sentence was hard to see until you extracted the text between the pattern ZZ and X.
1
2
3
4
I
play
house
.
Special RegExp Characters
You may want to include these characters in the regular expression you are trying to match. However, since these characters also have other meanings in perl and programs in general we have to tell the program that this is to be interpreted as a character in the pattern and not part of the syntax of the program. This is done by “escaping” the special character by preceding it with a back slash ().
Character | Name |
---|---|
\ |
(backslash) |
^ |
(caret) |
$ |
(dollar sign) |
. |
(dot) |
\| |
(pipe) |
? |
(question mark) |
* |
(asterisk) |
+ |
(plus sign) |
( |
(open parenthesis) |
) |
(closed parenthesis) |
[ |
(open square bracket) |
] |
(closed square bracket) |
{ |
(open brace) |
} |
(closed brace) |
Character Classes
Below is a table of some of the most common character classes that are used in regular expression.
Class | Description |
---|---|
[A-Z] |
any single capital letter between A and Z |
[a-z] |
any lower case letter between A and Z |
[0-9] |
any number between 0 and 9 |
\s |
matches any whitespace character |
\w |
matches a word character equivelent to [A-Za-z0-9_] |
[:ascii:] |
Matches any character in the ASCII character set |
\d |
matches any number equivalent to [0-9] |
\d{3} |
matches any string of 3 numbers |
\d{2,5} |
matches any string of 2 to 5 numbers |
A more comprehensive list of character classes can be found here.