Python RegEx

In this tutorial, you will learn about Python RegEx with the help of examples.

In Python, regular expressions (RegEx) are patterns used to match character combinations in strings. For example,

^s...e$

Here, we have defined a RegEx pattern. It matches any five-character string that starts with s and ends with e.

The RegEx pattern ^s...e$ can be used to match against strings:

  • sense - match
  • shade - match
  • seize - match
  • Sense - no match
  • science - no match
  • swift - no match

Example: Python RegEx

To work with RegEx in Python, we first need to import a module named re.

Let's see an example,

# import re module
import re

# regex pattern
pattern = '^s...e$'

# test string
string1 = 'shade'
string2 = 'science'

# use re.match() to match pattern
result1 = re.match(pattern, string1)
result2 = re.match(pattern, string2)

# print boolean value
print('shade:', bool(result1))  # True
print('science:', bool(result2))  # False

# Output:
# shade: True
# science: False

In the above example, we first imported a module named re and used the re.match() function to check whether each string matches the pattern, starting from the beginning of the string.

Here, re.match() takes two arguments:

  • pattern - the regular expression to be matched
  • string1 / string2 - the string in which the pattern is checked

The pattern ^s...e$ means any five-character string starting with s and ending with e. Since,

  • 'shade' - matches the pattern, bool() returns True
  • 'science' - does not match the pattern, bool() returns False

MetaCharacters in Python Regular Expression

The characters that are interpreted in a special way by a RegEx engine are metacharacters.

Here's a list of metacharacters with a short description:

Metacharacter Description
[ ] specifies a set of characters we wish to match
. matches any single character
^ checks if a string starts with a certain character
$ checks if a string ends with a certain character
* matches zero or more occurrences of the pattern to its left
+ matches one or more occurrences of the pattern to its left
? matches zero or one occurrence of the pattern to its left
( ) groups sub-patterns
\ used to escape various characters including all metacharacters
| used for alternation (or operator)

MetaCharacters Examples:

[ ] - Square Brackets

Expression String Match?


[xyz]
x 1 match
hey 1 match
hello No match
proxy 2 matches

Here, [xyz] will match if the string you are trying to match contains any of the characters x, y, or z.

We can also specify a range of characters using - inside square brackets.

For example, [w-z] is the same as [wxyz] and similarly [1-4] is the same as [1234].
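
The table above can be checked with a short sketch. It assumes re.findall(), which is covered later in this tutorial, to list every match; the variable names are only illustrative.

```python
import re

# re.findall() lists every non-overlapping match of the pattern
matches_in_proxy = re.findall(r'[xyz]', 'proxy')
matches_in_hello = re.findall(r'[xyz]', 'hello')

# the range [w-z] behaves like the explicit set [wxyz]
range_matches = re.findall(r'[w-z]', 'proxy')

print(matches_in_proxy)  # ['x', 'y']
print(matches_in_hello)  # []
print(range_matches)     # ['x', 'y']
```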


. - Period

Expression String Match?


...
hey 1 match
python 2 matches (contains 6 characters)
a No match
sa No match

We can see that . matches any single character (except newline '\n').
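
As a rough illustration of the table above, the sketch below uses re.findall() (covered later in this tutorial) to show how ... consumes three characters per match. The variable names are hypothetical.

```python
import re

# each match consumes exactly three characters
chunks = re.findall(r'...', 'python')
too_short = re.findall(r'...', 'sa')

# by default, . does not match the newline character
without_newline = re.findall(r'.', 'a\nb')

print(chunks)           # ['pyt', 'hon']
print(too_short)        # []
print(without_newline)  # ['a', 'b']
```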


^ - Caret

Expression String Match?


^s
s 1 match
swift 1 match
tsunami No match
case No match

Here, ^ is used to check if a string starts with a certain character.


$ - Dollar

Expression String Match?


s$
s 1 match
kicks 1 match
sick No match
case No match

Above, $ checks if a string ends with a certain character or not.
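
The two anchors can be tried side by side. This is a minimal sketch that assumes re.search() (introduced later in this tutorial) and wraps each result in bool(); the variable names are illustrative.

```python
import re

# ^s succeeds only when the string begins with s
starts_swift = bool(re.search(r'^s', 'swift'))
starts_tsunami = bool(re.search(r'^s', 'tsunami'))

# s$ succeeds only when the string ends with s
ends_kicks = bool(re.search(r's$', 'kicks'))
ends_case = bool(re.search(r's$', 'case'))

print(starts_swift, starts_tsunami)  # True False
print(ends_kicks, ends_case)         # True False
```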


* - Star

Expression String Match?


hel*o
heo 1 match
hello 1 match
hola No match
hell No match (not ending with o)

Here, * matches zero or more occurrences of the pattern to its left.


+ - Plus

Expression String Match?


hel+o
helo 1 match
hellllo 1 match
hola No match
heo No match (zero occurrence)

We can see above that + matches one or more occurrences of the pattern to its left.


? - Question Mark

Expression String Match?


hel?o
heo 1 match (zero occurrence)
helo 1 match (one occurrence)
sayhelo 1 match
hello No match (more than one occurrence)

Here, ? matches zero or one occurrence of the pattern to its left.
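
To compare the three quantifiers directly, the following sketch runs hel*o, hel+o, and hel?o against the same words. It assumes re.search() (introduced later in this tutorial); the names words, patterns, and results are hypothetical.

```python
import re

words = ['heo', 'helo', 'hello']
patterns = [r'hel*o', r'hel+o', r'hel?o']

# for each pattern, collect the words in which it is found
results = {p: [w for w in words if re.search(p, w)] for p in patterns}

for pattern, matched in results.items():
    print(pattern, matched)
# hel*o ['heo', 'helo', 'hello']
# hel+o ['helo', 'hello']
# hel?o ['heo', 'helo']
```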


| - Alternation

Expression String Match?


s|a
cat 1 match (a in cat)
case 2 matches (a and s both in case)
lit No match
red No match

Here, s|a matches any string that contains either s or a.


() - Group

Expression String Match?


(c|l|t)an
can 1 match
lan 1 match
tan 1 match
caan No match

In the above example, (c|l|t)an matches any string that contains c, l, or t followed by an.
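
A short sketch of alternation with and without a group follows. It assumes re.findall(), re.search(), and the match object's group() method, all covered later in this tutorial; the variable names are illustrative.

```python
import re

# without a group, | chooses between the single characters s and a
either = re.findall(r's|a', 'case')

# the parentheses limit the alternation to the first character only
grouped = re.search(r'(c|l|t)an', 'lan')
no_match = re.search(r'(c|l|t)an', 'caan')

print(either)            # ['a', 's']
print(grouped.group(1))  # l
print(no_match)          # None
```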


Python Special Sequences

A special sequence is \ followed by a special character which makes commonly used patterns easier to write.

Here's a list of special sequences with a short description:

Special Sequence Description
\A matches if the specified characters are at the start of a string
\b matches if the specified characters are at the beginning or end of a word
\B matches if the specified characters are not at the beginning or end of a word
\d matches any decimal digit
\D matches any character that is not a decimal digit
\s matches where a string contains any whitespace character
\S matches where a string contains any non-whitespace character
\w matches any alphanumeric character or underscore
\W matches any character that is not alphanumeric or an underscore
\Z matches if the specified characters are at the end of a string

Special Sequence Examples:

\A

Expression String Match?


\Aan
an ocean Match
at sea No match

Here, \Aan matches only if an is at the start of the string.


\b

Expression String Match?


\bdis
diss track Match
a disco Match
adisco No Match

nt\b
bent Match
aunt Match
act No Match

We can see that \b matches if the specified characters are at a word boundary:

  • \bdis - matches if dis is at the beginning of a word
  • nt\b - matches if nt is at the end of a word

\B

Expression String Match?


\Bdis
diss track No Match
a disco No Match
adisco Match

nt\B
bent No Match
aunt No Match
ants Match

We can see that \B is the opposite of \b. That is, it matches if the specified characters are not at the beginning or end of a word.
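
The boundary checks can be tried with a small sketch. The helper found() is hypothetical and simply wraps re.search() (introduced later in this tutorial) in bool().

```python
import re

def found(pattern, text):
    """Return True if pattern occurs anywhere in text (hypothetical helper)."""
    return bool(re.search(pattern, text))

print(found(r'\bdis', 'a disco'))  # True: dis starts a word
print(found(r'\bdis', 'adisco'))   # False: dis is inside a word
print(found(r'\Bdis', 'adisco'))   # True
print(found(r'nt\b', 'bent'))      # True: nt ends a word
print(found(r'nt\B', 'bent'))      # False
print(found(r'nt\B', 'ants'))      # True: nt is followed by s
```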


\d

Expression String Match?


\d
h3llo 1 Match
hello No Match

Here, \d matches any decimal digit [0-9].


\D

Expression String Match?


\D
1234 No Match
h3llo 4 Matches

We can see that \D is the opposite of \d. That is, it matches any character that is not a decimal digit.
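
A minimal sketch of \d and \D on the strings from the tables above, assuming re.findall() (covered later in this tutorial):

```python
import re

digits = re.findall(r'\d', 'h3llo')
non_digits = re.findall(r'\D', 'h3llo')
only_digits = re.findall(r'\D', '1234')

print(digits)       # ['3']
print(non_digits)   # ['h', 'l', 'l', 'o']
print(only_digits)  # []
```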


\s

Expression String Match?


\s
hello world 1 Match
helloworld No Match

Here, \s matches where a string contains any whitespace character.


\S

Expression String Match?


\S
x y 2 Matches
x 1 Match

Here, \S matches where a string contains any non-whitespace character.


\w

Expression String Match?

\w
67%;gt 4 Matches
!>%" No Match

Here, \w matches any alphanumeric character (letters and digits) as well as the underscore _.


\W

Expression String Match?

\W
!>%" 4 Matches
hello No Match

\W is the opposite of \w. It matches any character that is not a letter, a digit, or an underscore.
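
The sketch below shows how \w and \W divide up a string; note that Python's re module counts the underscore as a word character. The sample string is made up for illustration.

```python
import re

sample = 'a_1 %'
word_chars = re.findall(r'\w', sample)
non_word_chars = re.findall(r'\W', sample)

print(word_chars)      # ['a', '_', '1']
print(non_word_chars)  # [' ', '%']
```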


\Z

Expression String Match?

coding\Z
I love coding 1 Match
coding is fun No Match

Here, coding\Z matches only if 'coding' is at the end of the string.
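
One place where \Z differs from $ is a string that ends with a newline. The sketch below illustrates this under the default flags; the sample text is an assumption added for the demonstration.

```python
import re

text = 'I love coding\n'

# $ also matches just before a newline that ends the string
dollar_result = bool(re.search(r'coding$', text))

# \Z matches only at the very end of the string
z_result = bool(re.search(r'coding\Z', text))

print(dollar_result)  # True
print(z_result)       # False
```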


The re.search() Function

In Python, the re.search() function scans a string for the regex pattern and returns a match object for the first occurrence, or None if the pattern is not found.

It differs from re.match(), which checks for a match only at the beginning of the string.

Let's see an example,

import re

# test string
string1 = 'Nepal is beautiful'
string2 = 'Datamentor for beginners'

# check if 'Nepal' is at the beginning of string1
result1 = re.search(r'\ANepal', string1)  # match object

# check if 'beginners' is at the beginning of string2
result2 = re.search(r'\Abeginners', string2)  # None

# print boolean value
print('Result for string1:', bool(result1)) # True
print('Result for string2:', bool(result2)) # False

Output

Result for string1: True
Result for string2: False

In the above example, we first imported a module named re and used the re.search() function to search for the pattern.

Here, re.search() takes two arguments:

  • \ANepal and \Abeginners - \A matches if the given word is at the start of a string
  • string1 and string2 - the string in which the pattern is checked

Since,

  • 'Nepal' is at the beginning of string1, bool() returns True
  • 'beginners' is not at the beginning of string2, bool() returns False
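
To make the difference between re.match() and re.search() concrete, here is a minimal sketch that applies the same unanchored pattern with both functions. The variable names are illustrative.

```python
import re

text = 'Datamentor for beginners'

# re.match() only tries to match at the start of the string
match_result = re.match(r'beginners', text)

# re.search() scans forward until it finds a match
search_result = re.search(r'beginners', text)

print(match_result)           # None
print(search_result.group())  # beginners
```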

The re.split() Function

The re.split() function in Python splits the string at each match and returns a list. For example,

import re

# test string
string1 = 'Nepal is beautiful'

# split string1 at each whitespace character
result1 = re.split(r'\s', string1)

# print the resulting list
print(result1)

# Output:  ['Nepal', 'is', 'beautiful']

In the above example, we have used the re.split() function to split the string named string1.

Here, re.split('\s', string1) splits string1 at each white-space character.

Note: We can use other special sequences inside re.split() to split the given string.
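
As a sketch of the note above, the example below splits on runs of digits and also uses the optional maxsplit argument of re.split() to limit the number of splits. The sample strings are made up for illustration.

```python
import re

# split on runs of digits
parts = re.split(r'\d+', 'one1two22three')

# maxsplit limits the number of splits that are made
first_split = re.split(r'\s', 'Nepal is beautiful', maxsplit=1)

print(parts)        # ['one', 'two', 'three']
print(first_split)  # ['Nepal', 'is beautiful']
```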


The re.findall() Function

In Python, the re.findall() function returns a list of strings containing all matches. For example,

import re

string1 = 'H3ll0 W0R1D'
pattern = r'\D+'

# extract non-digits from a string
result = re.findall(pattern, string1) 
print(result)

# Output: ['H', 'll', ' W', 'R', 'D']

Here, the re.findall() function returns a list of the runs of non-digit characters found in string1.

Note: re.findall() returns an empty list if the pattern is not found in the string.
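
The empty-list case can be handled with an ordinary truthiness check, as in this small sketch (the variable names are hypothetical):

```python
import re

found_digits = re.findall(r'\d+', 'H3ll0 W0R1D')
no_digits = re.findall(r'\d+', 'Hello World')

print(found_digits)  # ['3', '0', '0', '1']
print(no_digits)     # []

# an empty list is falsy, so it can be tested directly
if not no_digits:
    print('no digits found')
```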


The re.sub() Function

The re.sub() function in Python returns a string after replacing the matched occurrence in a string with a replacement string. For example,

import re

string1 = 'Hello World'

# replacement string 
replace = 'Hola'

# matches 'Hello' only at the start of the string
pattern = r'\AHello'

# replace 'Hello' with 'Hola'
result = re.sub(pattern, replace, string1)
print(result)

# Output: Hola World

In the above example, we have used the re.sub() function to replace 'Hello' with 'Hola' in string1.

re.sub() returns the original string if the pattern is not found.
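
A brief sketch of both behaviours follows: replacing every occurrence, limiting replacements with the optional count argument, and getting the original string back when nothing matches. The patterns and replacement text are chosen only for illustration.

```python
import re

text = 'Hello World'

replace_all = re.sub(r'o', '0', text)
replace_first = re.sub(r'o', '0', text, count=1)
unchanged = re.sub(r'\AWorld', 'Mundo', text)

print(replace_all)    # Hell0 W0rld
print(replace_first)  # Hell0 World
print(unchanged)      # Hello World
```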


Python Match Object

The match object in Python contains all the information about the search and the result. For example,

import re

# test string
string1 = 'Nepal is beautiful'

# result contains match object
result = re.search(r'\ANepal', string1)

print(result)

Output

<re.Match object; span=(0, 5), match='Nepal'>

Here, the result variable contains a match object.
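
Because re.search() returns None when nothing matches, it is usually worth checking the result before calling match object methods. The helper describe() below is a hypothetical sketch of that check.

```python
import re

def describe(pattern, text):
    """Return the matched text, or a message when nothing matches."""
    result = re.search(pattern, text)
    if result is None:
        return 'no match'
    return result.group()

print(describe(r'\ANepal', 'Nepal is beautiful'))  # Nepal
print(describe(r'\ANepal', 'Visit Nepal'))         # no match
```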


Methods and Attributes of Python Match Object

Some of the commonly used methods and attributes of match objects are:

match.group()

The group() method returns the matched substring. When a group number is passed, it returns the text captured by that parenthesized group. For example,

import re

string1 = 'Employee ID 2032 1111'

# two digits followed by a space followed by three digits
pattern = r'(\d{2}) (\d{3})'

# match variable contains a Match object.
match = re.search(pattern, string1) 

# get substring
print('Whole Substring:', match.group())

# get first part of substring
print('First part of substring:', match.group(1))

# get second part of substring
print('Second part of substring:', match.group(2))

Output

Whole Substring: 32 111
First part of substring: 32
Second part of substring: 111

In the above example, we have used the group() function to return the matched substring from the string named string1.

Here, the pattern '(\d{2}) (\d{3})' means: two digits followed by a space followed by three digits. Since the pattern is not anchored, the search finds '32 111' inside '2032 1111'.

To get the matched substring we have used

  • match.group() - to get the whole substring
  • match.group(1) - to get first part of substring
  • match.group(2) - to get second part of substring

match.start(), match.end(), and match.span()

  • The start() method returns the index where the matched substring starts
  • The end() method returns the index just past the last character of the matched substring
  • The span() method returns a tuple containing the start and end indices of the matched substring

Let's see an example,

import re

string = 'Employee ID 2032 1111'

# two digits followed by a space followed by three digits
pattern = r'(\d{2}) (\d{3})'

# match variable contains a Match object.
match = re.search(pattern, string) 

print('Matched Substring Start Index:', match.start())

print('Matched Substring End Index:', match.end())

print('Tuple of Matched Substring Start and End Index:', match.span())

Output

Matched Substring Start Index: 14
Matched Substring End Index: 20
Tuple of Matched Substring Start and End Index: (14, 20)
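
Since end() is one past the last matched character, the two indices can be used directly as a slice. The sketch below also passes a group number to start() and span(), which reports the position of that group instead of the whole match; the variable names are illustrative.

```python
import re

string = 'Employee ID 2032 1111'
match = re.search(r'(\d{2}) (\d{3})', string)

# slicing with start() and end() reproduces group()
whole = string[match.start():match.end()]

# passing a group number gives the position of that group
first_start = match.start(1)
second_span = match.span(2)

print(whole)        # 32 111
print(first_start)  # 14
print(second_span)  # (17, 20)
```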

Raw String in Python

A raw string is useful if we want to treat the backslash (\) as a literal character.

For example, '\n' is a new line whereas r'\n' means two characters: a backslash \ followed by n.

Let's understand raw string with the help of an example,

import re

# \n to get new line
string1 = 'Hello\nWorld'
print("Escape Character:", string1)

# prefix r to treat \n as a normal character
string2 = r'Hello\nWorld'
print('Raw String:', string2)

Output

Escape Character: Hello
World
Raw String: Hello\nWorld

Using r prefix before RegEx

In Python, we can prefix r before a regular expression so that its backslashes are passed to the RegEx engine unchanged. For example,

import re

# test string
string1 = '\t Programming \n is \r fun.'

pattern = r'[\t\n\r]'

# find \t,\n, and \r in string1
result = re.findall(pattern, string1) 

print(result)
# Output: ['\t', '\n', '\r']

Here, first we have prefixed r before the regular expression pattern as

pattern = r'[\t\n\r]'

And used re.findall() to return a list of strings containing all matches.
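
The r prefix matters most for sequences such as \d or a literal backslash, where an ordinary string would need doubled backslashes. A minimal sketch, using a made-up Windows-style path as the test string:

```python
import re

# a raw string keeps the backslash, so these two literals are equal
same_pattern = r'\d+' == '\\d+'

# matching a literal backslash needs \\ in the pattern
path = r'C:\temp\new'
backslashes = re.findall(r'\\', path)

print(same_pattern)      # True
print(len(backslashes))  # 2
```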
