So, what are Regular Expressions, or Regex? Are they any good? What are they used for?

Beginner programmers usually need a moment to grasp what exactly Regex is, and veteran programmers keep forgetting how it actually works.

This tutorial aims to provide a clear explanation of most things Regex and a clean reference point that programmers can return to whenever they need to refresh their memory.

Regex, or Regular Expressions, is like a small language inside programming languages. It has its own little syntax (its expressions), and it makes text processing and string searching much easier.

Although this tutorial is Python-oriented, you can use Regex in most programming languages. Regex is a life-saver, so enjoy this tutorial and add Regex to your skill set.

Where Is It Used?

  • Text searching
  • Research
  • Data Science
  • Database search and manipulation
  • Web requests, web crawling
  • API access
  • All kinds of other queries

Let’s dive right into it.

Estimated Time

20 mins

Skill Level

Upper intermediate

Functions

findall, search

Course Provider

Provided by HolyPython.com

For instance, let's say you have a huge text with tons of weird characters in it and you want to clean it up fast. No problem:

With the little structure below, you can tell your computer to match only the lowercase letters from a to z.

[a-z]

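What if you also need capital letters, as you normally would? Simple:

Just put both ranges inside the same brackets:

[a-zA-Z]

Note that [a-z][A-Z] would mean something else: a lowercase letter immediately followed by a capital letter.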

As you can see, Regex is very straightforward, almost like a magical elixir into which you throw the components you need. If only it were more memorable… But once you master Regex, which doesn't take too long, you can always quickly refresh your memory and continue where you left off.

What if you want to filter out only the numbers from 0 to 9 with Regex?

With the structure below, you can tell your computer to match only the digits from 0 to 9.

[0-9]
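
Here is a small sketch of the three ranges above in Python, using re.findall. The sample text is made up purely for illustration.

```python
import re

# Hypothetical sample text, chosen only for illustration.
txt = "Room 42, Floor B"

lower = re.findall("[a-z]", txt)       # lowercase letters only
letters = re.findall("[a-zA-Z]", txt)  # lowercase and uppercase letters
digits = re.findall("[0-9]", txt)      # digits only

print(lower)    # ['o', 'o', 'm', 'l', 'o', 'o', 'r']
print(letters)  # ['R', 'o', 'o', 'm', 'F', 'l', 'o', 'o', 'r', 'B']
print(digits)   # ['4', '2']
```

Each bracket expression matches exactly one character at a time, which is why the results are lists of single characters.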

Example

This simple example demonstrates the usage of Regex in Python:

import re

txt = "50 Best Ever French Films - Jan 10 2020"
data = re.findall("[A-Z]", txt)  # every uppercase letter, one match per character
print(data)

['B', 'E', 'F', 'F', 'J']

If you just list characters inside brackets, as below, you will match only the characters listed:

[abcde]

 

If you want every character except a, b, c, d and e instead, just add ^ at the beginning:

[^abcde]
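
A quick sketch of a character set and its negated version side by side. The sample phrase is hypothetical.

```python
import re

txt = "a bad cafe"  # hypothetical sample phrase

inside = re.findall("[abcde]", txt)    # only characters listed in the set
outside = re.findall("[^abcde]", txt)  # every character NOT listed in the set

print(inside)   # ['a', 'b', 'a', 'd', 'c', 'a', 'e']
print(outside)  # [' ', ' ', 'f']
```

Notice that the negated set also picks up the spaces, since a space is not one of a, b, c, d or e.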

 

\s : Whitespace

\s : matches whitespace (spaces, tabs, newlines).
\S : matches non-whitespace, if you'd rather have everything that is not whitespace.

Tip: How to remember \s? It simply stands for "space", although "whitespace" is the more accurate description.
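
A minimal sketch of \s and \S in Python. The sample text is made up, and the patterns are written as raw strings (r"...") so that Python leaves the backslash alone.

```python
import re

txt = "a b\tc\nd"  # hypothetical text with a space, a tab and a newline

spaces = re.findall(r"\s", txt)      # whitespace characters
non_spaces = re.findall(r"\S", txt)  # everything that is not whitespace

print(spaces)      # [' ', '\t', '\n']
print(non_spaces)  # ['a', 'b', 'c', 'd']
```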

\d : Digits

\d : matches digits.
\D : matches non-digits. If you'd rather match non-digits, just use the capital letter.

Tip: How to remember \d ? This one is easy. d stands for “digit“.
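
The same idea for \d and \D, as a short sketch with a made-up sample string:

```python
import re

txt = "Apollo 11"  # hypothetical sample text

digits = re.findall(r"\d", txt)      # one match per digit
non_digits = re.findall(r"\D", txt)  # one match per non-digit (the space counts too)

print(digits)      # ['1', '1']
print(non_digits)  # ['A', 'p', 'o', 'l', 'l', 'o', ' ']
```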

\w : Alphanumeric

\w : matches word-class characters (that is, alphanumeric characters).
\W : matches non-word characters. If you'd rather match those, just use the capital letter.

 

Tip: How to remember \w? w stands for "word-class". A word-class character is essentially an alphanumeric character: it includes all the letters [a-zA-Z], all the digits [0-9], and the underscore "_" as well.

If you have renamed a file recently, you may have noticed that many special characters are not allowed in file names, while alphanumeric characters and the underscore are always safe.
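
A small sketch of \w and \W. The sample string is hypothetical and was picked so that it contains an underscore, a hyphen and an exclamation mark.

```python
import re

txt = "user_name-42!"  # hypothetical sample text

word_chars = re.findall(r"\w", txt)  # letters, digits and the underscore
non_word = re.findall(r"\W", txt)    # everything else

print(word_chars)  # ['u', 's', 'e', 'r', '_', 'n', 'a', 'm', 'e', '4', '2']
print(non_word)    # ['-', '!']
```

The underscore lands in the word-class list, while the hyphen and the exclamation mark do not.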

. : Any Character

. : the dot matches any character there is (except a newline, by default).

Tip: By adding repetition characters to dot you can capture everything in a text.

Mini example:

“Turing10 40 – $%^ Curcuit —“

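If you apply .* to the text above with re.findall, you'll get the whole text back as a single match, followed by an empty string (because .* can also match zero characters at the very end of the text).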
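
One detail worth sketching: in Python the dot skips newline characters unless you pass the re.DOTALL flag. The two-line sample text below is made up for illustration.

```python
import re

txt = "ab\ncd"  # hypothetical two-line text

default_dot = re.findall(".", txt)                   # newline is skipped
dotall_dot = re.findall(".", txt, flags=re.DOTALL)   # newline is matched too

print(default_dot)  # ['a', 'b', 'c', 'd']
print(dotall_dot)   # ['a', 'b', '\n', 'c', 'd']
```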

* : Zero or more

* : Adding this character after an expression makes that expression match zero or more times, for as long as it keeps matching.

 
Mini example:

“TuringMachine1940Transistors645”

If you apply \D* to the text above you'll get:
['TuringMachine', '', '', '', '', 'Transistors', '', '', '', '']

Non-digits are matched zero or more times until the expression stops matching, and then matching resumes at the next position. As you can see, even the positions where no non-digit is found are represented as empty strings, since we used * (zero or more times).

+ : One or more

+ : Adding this character after an expression makes that expression match one or more times, for as long as it keeps matching.

 
Mini example:

“TuringMachine1940Transistors645”

If you apply \D+ to the text above you'll get:
['TuringMachine', 'Transistors']

Non-digits are matched one or more times until the expression stops matching, and so on.

? : Zero or One

? : Adding this character after an expression makes that expression optional: it matches zero times or one time.

 
Mini example:

“TuringMachine1940Transistors645”

If you apply \D? to the text above you'll get:
['T', 'u', 'r', 'i', 'n', 'g', 'M', 'a', 'c', 'h', 'i', 'n', 'e', '', '', '', '', 'T', 'r', 'a', 'n', 's', 'i', 's', 't', 'o', 'r', 's', '', '', '', '']

Each non-digit is matched on its own (at most one character per match), and each position without a non-digit yields an empty string.
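
To compare the three quantifiers side by side, here is a short sketch on a tiny made-up string. It assumes Python 3.7 or newer, where an empty match is also reported right after a non-empty one.

```python
import re

txt = "ab12"  # hypothetical sample text

star = re.findall(r"\D*", txt)      # zero or more non-digits
plus = re.findall(r"\D+", txt)      # one or more non-digits
optional = re.findall(r"\D?", txt)  # zero or one non-digit

print(star)      # ['ab', '', '', '']
print(plus)      # ['ab']
print(optional)  # ['a', 'b', '', '', '']
```

Only + refuses to match "nothing", which is why it is usually the one you want when you are collecting words or numbers.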

{} : Repetition

{3} : Adding this after an expression makes the expression before it repeat exactly 3 times.

{6,} : This will apply a repetition of 6 or more. For instance, it could be used to catch words with a length of 6 characters or more. 

{6,9} : This will apply a repetition of 6 to 9. For instance, it could be used to catch words with a length of 6 characters up to 9 characters (inclusive).

Mini example:

“TuringMachine1940Transistors645”

If you apply \d{3,4} to the text above you'll get:
['1940', '645']

Runs of 3 or 4 digits are caught.
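
A sketch of the three repetition forms. The sentence is hypothetical, and [a-zA-Z] is used here as a simple stand-in for "a letter".

```python
import re

txt = "Alan Turing studied computation"  # hypothetical sample sentence

six_plus = re.findall(r"[a-zA-Z]{6,}", txt)      # runs of 6 or more letters
six_to_nine = re.findall(r"[a-zA-Z]{6,9}", txt)  # runs of 6 to 9 letters
exactly_three = re.findall(r"\d{3}", "1940 645") # exactly 3 digits in a row

print(six_plus)       # ['Turing', 'studied', 'computation']
print(six_to_nine)    # ['Turing', 'studied', 'computati']
print(exactly_three)  # ['194', '645']
```

Keep in mind that an upper limit does not check what comes next: {6,9} simply stops after 9 letters, so the 11-letter word "computation" is cut to "computati", and {3} takes the first three digits of "1940".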

Regular Expression Examples

Here are some examples to take a look at:

INTRO

Ranges:
[a-z]  [0-9]  [abcde]  [^abcde]
abcde    12345

Short labels:
\s    Whitespace
\d    Digits
\w    Alphanumeric
.      Any Character

Repetitions:
*     Zero or more repetitions
+     One or more repetitions
?      Optional character
{m}  &  {m,n}  Repetitions

EXAMPLES

Escape characters:
\. Period  \+ Plus  \* Star etc.

Advanced Concepts:
^    :  Starts with   $  :  Ends with
(..) : Group
(x|y) x or y

Example 1: Digits only

import re

txt = "50 Best Ever French Films - Jan 10 2020"
data = re.findall(r"\d", txt)  # raw string keeps the backslash intact; one match per digit
print(data)

['5', '0', '1', '0', '2', '0', '2', '0']

What if you want the full numbers, as you usually would? Simple, just modify your regex as below. The + makes sure that one or more consecutive digits are captured together until the next non-digit character is encountered; the non-digits are skipped, and so on.
r"\d+"

['50', '10', '2020']

Wondering what would happen if you used * instead of +? Since * means "zero or more", the regex also matches at every position where there is no digit, and each of those matches shows up as an empty string.

r"\d*"

['50', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '10', '', '2020', '']

Example 2: Different regex approaches

Let’s try to get the words (only with letters) in the text by trying different approaches.

An attempt with the non-digits regex to demonstrate its use:

 Using \D+ will give you everything except digits, in runs of consecutive characters.

import re

txt = "50 Best Ever French Films - Jan 10 2020"
data = re.findall(r"\D+", txt)  # each run of consecutive non-digit characters
print(data)

[' Best Ever French Films - Jan ', ' ']

And you can achieve similar, but not the same, results with "[A-Z][a-z]+", which matches a capital letter followed by one or more lowercase letters. Take a look:

import re

txt = "50 Best Ever French Films - Jan 10 2020"
data = re.findall("[A-Z][a-z]+", txt)  # a capital letter, then one or more lowercase letters
print(data)

['Best', 'Ever', 'French', 'Films', 'Jan']

So you can see that the non-digit pattern gives all the non-digits, including spaces and special characters, but "[A-Z][a-z]+" gives only the capitalized words.

Example 3

Letter ranges on their own:

import re

txt = "50 Best Ever French Films - Jan 10 2020"
data = re.findall("[A-Z][a-z]+", txt)  # a capital letter, then one or more lowercase letters
print(data)

['Best', 'Ever', 'French', 'Films', 'Jan']

\ : Escape Character w/ backslash

Some characters are reserved for special meanings in regular expressions (such as *, +, ?, $ etc.). If you'd like to match those characters literally in a text, you need to put a backslash (the escape character) in front of them:

\? : matches a question mark
\+ : matches a plus sign
\* : matches a star sign
\. : matches a dot character
\$ : matches a dollar sign

and so on…
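
A short sketch of why escaping matters. The sample strings are made up. The last lines use re.escape, a helper from Python's re module that this tutorial does not otherwise cover; it adds the backslashes for you.

```python
import re

prices = "5.00 5x00"  # hypothetical sample text

# Unescaped dot: matches ANY character between "5" and "00".
loose = re.findall("5.00", prices)
# Escaped dot: matches only a literal "." there.
strict = re.findall(r"5\.00", prices)

print(loose)   # ['5.00', '5x00']
print(strict)  # ['5.00']

# re.escape builds the escaped pattern from a literal string.
txt = "Total: $5.00 (+tax)?"
literal = re.findall(re.escape("$5.00"), txt)
print(literal)  # ['$5.00']
```

Escaping by hand is fine for short patterns; re.escape may be handy when the literal text comes from somewhere else, such as user input.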

Advanced Concepts

A few more characters have special meanings that control where and how a pattern matches:

^ : Starts with
$ : Ends with
() : Grouping
(x|y) : x or y

and so on…

 

Tip: If you use ^ as the first character inside brackets, it means "except" or "not" rather than "starts with". E.g., [^ABC] matches any character except A, B and C.
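
Here is a compact sketch of these concepts together. The alternative word "Ocean" is made up just to have a second option for the "or" pattern.

```python
import re

txt = "Antarctica Space Observatory"

starts = re.findall(r"^A\w+", txt)              # a word at the very start, beginning with "A"
ends = re.findall(r"\w+y$", txt)                # a word at the very end, ending with "y"
either = re.findall(r"Space|Ocean", txt)        # "Space" or "Ocean" ("Ocean" is hypothetical)
group = re.findall(r"(\w+) Observatory", txt)   # only the part in parentheses is returned
capitals = re.findall(r"[^a-z ]", txt)          # ^ inside brackets: NOT a lowercase letter or space

print(starts)    # ['Antarctica']
print(ends)      # ['Observatory']
print(either)    # ['Space']
print(group)     # ['Space']
print(capitals)  # ['A', 'S', 'O']
```

When a pattern contains one group, re.findall returns just the group, which is useful for pulling a value out of its surrounding context.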

Example 4: Or

Below you can see the "or" operator (|) in action: the pattern matches "19", "64" or "five", wherever any of them occurs in the text.

import re

txt = "TuringMachine1940Transistors645 ?fivethree"
data = re.findall("19|64|five", txt)  # any one of the three alternatives
print(data)

['19', '64', 'five']

Example 5: Group

You can see the escape character used with a question mark below. There is also grouping with parentheses, so that the question mark and "five" are matched, but only the group is shown in the results.

import re

txt = "TuringMachine1940Transistors645 ?fivethree"
data = re.findall(r"\?five(\S+)", txt)  # literal "?five", then capture the non-whitespace that follows
print(data)

['three']

Example 6: Escaping

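You can see the escape character used with a question mark below: the pattern matches a word that starts with "O" and is followed by a literal question mark.

import re

txt = "Antarctica Space Observatory?"
data = re.findall(r"O\w+\?", txt)  # "O", one or more word characters, then a literal "?"
print(data)

['Observatory?']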

Example 7: "Starts with" or "not"?

In this example you can see two uses of the "^" sign: first "starts with A", and then "not a".

import re

txt = "Antarctica Space Observatory"
data = re.findall("^A[^a]+", txt)  # "A" at the start, then one or more characters that are not "a"
print(data)

['Ant']

Note that the match stops when "a" is encountered.

Regular Expression (Regex) Exercises