Python Regex Lesson
Although it is not recognized as a Turing-complete programming language, Regex (short for Regular Expressions) still feels like a “mini programming language” with its own weird (but incredibly useful!) syntax that’s used to parse, analyze and manipulate textual content.
Regex can be used with many different programming languages, including Python, and it’s a very desirable skill for pretty much anyone who is into coding or pursuing a professional programming career.
In this Python lesson we will learn how to use regex for text processing through examples.
We will also categorize regex methods and expressions so this page can serve as a guide or reference in the future in case you struggle to recall specific regex expressions. (A very common thing.)
Estimated Time
20 mins
Skill Level
Upper intermediate
Functions
findall
Course Provider
Provided by HolyPython.com
Used Where?
- Text searching
- Research
- Data Science
- Database search and manipulation
- Web requests, web crawling
- API access
- All kinds of other queries
Starting with regex "syntax"
For instance, let’s say you have a huge text with tons of weird characters in it and you want to clean it fast. No problem:
By using this little structure below, you can tell your computer to only match lowercase letters from a to z.
[a-z]
What if you also need capital letters, as you normally would? Simple, just put both ranges inside the same brackets:
[a-zA-Z]
(Writing [a-z][A-Z] instead would mean something different: one lowercase letter immediately followed by one uppercase letter.)
As you can see, Regex is very straightforward, almost like a magical elixir into which you throw the components you need. If only it were more memorable… But once you master Regex, which doesn’t take too long, you can always quickly refresh your memory and continue where you left off.
What if you want to pick out only the digits from 0 to 9 with Regex?
By using the structure below, you can tell your computer to only catch digits from 0 to 9.
[0-9]
Example: First regex implementation in Python
This simple example demonstrates the usage of Regex in Python:
import re
txt = "50 Best Ever French Films - Jan 10 2020"
data = re.findall("[A-Z]", txt)  # every single capital letter
print(data)
['B', 'E', 'F', 'F', 'J']
One-letter results are often not so helpful, so we need a way to repeat a pattern. In regex this is handled by quantifiers such as the following:
- + : Add this for one or more repetitions
- * : Add this for zero or more repetitions
- {m} : Add this for a specific number (m) of repetitions
- {m,n} : Add this for a specific range (m to n) of repetitions
We will elaborate on these methods below with more examples, but here is a simple one:
import re
txt = "50 Best Ever French Films - Jan 10 2020"
data = re.findall("[A-Z][a-z]{2}", txt)  # a capital letter plus exactly two lowercase letters
print(data)
['Bes', 'Eve', 'Fre', 'Fil', 'Jan']
By the way, one thing I find very useful when implementing regex solutions is to start with simple tasks and build the regex incrementally. Since regex has a very compact syntax, it can get messy if you try to write everything in one big step.
This can depend on personal taste, but I find it very useful to add one piece of the pattern first, quickly test it and then continue with the next step of my implementation.
Also, as usual, if you practice regex a lot for a while and do some meaningful projects, you will confidently recall it even if you come back to it after a long time. It takes a little while, so don’t be discouraged.
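The incremental approach described above can be sketched roughly like this, reusing the sample text from this lesson. The variable names step1 to step3 are only illustrative; the idea is to print and check each intermediate pattern before adding the next piece.

```python
import re

txt = "50 Best Ever French Films - Jan 10 2020"

# Step 1: start with the simplest piece, a single capital letter.
step1 = re.findall(r"[A-Z]", txt)
print(step1)  # ['B', 'E', 'F', 'F', 'J']

# Step 2: add one lowercase letter after the capital and test again.
step2 = re.findall(r"[A-Z][a-z]", txt)
print(step2)  # ['Be', 'Ev', 'Fr', 'Fi', 'Ja']

# Step 3: let the lowercase part repeat to capture whole words.
step3 = re.findall(r"[A-Z][a-z]+", txt)
print(step3)  # ['Best', 'Ever', 'French', 'Films', 'Jan']
```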
If you just list characters inside brackets as below, you will only match those characters:
[abcde]
If you want everything except the characters abcde, just include ^ at the beginning:
[^abcde]
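As a small sketch of the two bracket forms above, here they are applied with re.findall. The sample string is made up for this illustration.

```python
import re

sample = "decade 2020"  # hypothetical sample text

# Only the characters a, b, c, d and e are matched.
inside = re.findall("[abcde]", sample)
print(inside)  # ['d', 'e', 'c', 'a', 'd', 'e']

# With ^ at the start of the brackets, everything else is matched instead.
outside = re.findall("[^abcde]", sample)
print(outside)  # [' ', '2', '0', '2', '0']
```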
\s : Whitespace
\s : will show whitespace.
\S : Non-whitespace. Use the capital letter if you’d rather show non-whitespace.
Tip: How to remember \s? Simply, it stands for “space”. Although whitespace is a bit more explanatory.
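Here is a minimal sketch of \s and \S in Python. The sample string is an assumption chosen so that it contains a space, a tab and a newline, all of which count as whitespace.

```python
import re

sample = "Jan 10\t2020\n"  # hypothetical sample with a space, a tab and a newline

# \s matches each whitespace character individually.
whitespace = re.findall(r"\s", sample)
print(whitespace)  # [' ', '\t', '\n']

# \S+ matches runs of non-whitespace, which splits the text into chunks.
chunks = re.findall(r"\S+", sample)
print(chunks)  # ['Jan', '10', '2020']
```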
\d : Digits
\d : will filter digits.
\D : Non-digits. Use the capital letter if you’d rather filter non-digits.
Tip: How to remember \d? This one is easy. d stands for “digit”.
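A short sketch of \d next to \D, using a fragment of this lesson's sample text:

```python
import re

sample = "Jan 10 2020"

# \d matches every digit, one at a time.
digits = re.findall(r"\d", sample)
print(digits)  # ['1', '0', '2', '0', '2', '0']

# \D matches every character that is not a digit, spaces included.
non_digits = re.findall(r"\D", sample)
print(non_digits)  # ['J', 'a', 'n', ' ', ' ']
```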
\w : Alphanumeric
\w : will show word-class characters (this means alphanumeric characters).
\W : Non-word. Use the capital letter if you’d rather show non-word characters.
Tip: How to remember \w? w stands for “word-class”. A word-class character is the same thing as an alphanumeric character. It includes all the letters [a-zA-Z], all the digits [0-9] and the underscore “_” as well. (In Python 3, \w also matches letters and digits from other alphabets unless you pass the re.ASCII flag.)
If you tried to rename a file recently, you may have noticed that some special characters are not allowed in file names, while word-class characters are always a safe choice.
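To see which characters count as word-class, here is a small sketch on a made-up file name. Note that the underscore stays inside the match while the hyphen and the dot do not.

```python
import re

sample = "file_name-v2.txt"  # hypothetical file name

# \w+ matches runs of letters, digits and underscores.
words = re.findall(r"\w+", sample)
print(words)  # ['file_name', 'v2', 'txt']

# \W matches everything that is not a word-class character.
separators = re.findall(r"\W", sample)
print(separators)  # ['-', '.']
```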
. : Any Character
. : dot will capture any character except a newline (unless the re.DOTALL flag is used).
Regex Tip: By adding repetition characters to the dot you can capture everything in a line of text.
"Turing10 40 – $%^ Circuit —"
If you apply .* to the text above with re.findall you’ll get everything back, plus an empty string matched at the very end of the text:
['Turing10 40 – $%^ Circuit —', '']
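One detail worth checking yourself is how the dot treats line breaks. The sketch below uses a made-up two-line string; by default the dot stops at a newline, and passing the re.DOTALL flag lets it match newlines too.

```python
import re

sample = "line one\nline two"  # hypothetical two-line text

# By default "." does not match the newline, so each line is a separate match.
per_line = re.findall(r".+", sample)
print(per_line)  # ['line one', 'line two']

# With re.DOTALL the dot also matches "\n", so the whole text is one match.
whole = re.findall(r".+", sample, flags=re.DOTALL)
print(whole)  # ['line one\nline two']
```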
* : Zero or more
* : Adding this character to your regular expression will cause it to match 0 or more times until your expression is not matched any more.
"TuringMachine1940Transistors645"
If you apply \D* to the text above you’ll get:
['TuringMachine', '', '', '', '', 'Transistors', '', '', '', '']
Non-digits are matched zero or more times until the regular expression doesn’t match, and so on. As you can see, even the positions that don’t match (each digit, plus the end of the text) are represented as an empty string since we used * (zero or more times).
+ : One or more
+ : Adding this character to your regular expression will cause it to match 1 or more times until your expression is not matched any more.
"TuringMachine1940Transistors645"
If you apply \D+ to the text above you’ll get:
['TuringMachine', 'Transistors']
Non-digits are matched one or more times until the regular expression doesn’t match, and so on.
? : Zero or One
? : Adding this character to your regular expression will cause the expression before it to match 0 or 1 times, which makes it optional.
"TuringMachine1940Transistors645"
If you apply \D? to the text above you’ll get:
['T', 'u', 'r', 'i', 'n', 'g', 'M', 'a', 'c', 'h', 'i', 'n', 'e', '', '', '', '', 'T', 'r', 'a', 'n', 's', 'i', 's', 't', 'o', 'r', 's', '', '', '', '']
Each non-digit is matched on its own (one time), and every position where there is no non-digit produces an empty string (zero times).
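The three quantifiers above can be compared side by side with a short script. This is just a sketch that runs \D*, \D+ and \D? on the same sample text; the outputs assume Python 3.7 or newer.

```python
import re

txt = "TuringMachine1940Transistors645"

# Zero or more: empty strings appear wherever no non-digit is found.
star = re.findall(r"\D*", txt)
print(star)  # ['TuringMachine', '', '', '', '', 'Transistors', '', '', '', '']

# One or more: only the real runs of non-digits are returned.
plus = re.findall(r"\D+", txt)
print(plus)  # ['TuringMachine', 'Transistors']

# Zero or one: each non-digit is returned individually, plus empty strings.
optional = re.findall(r"\D?", txt)
print(len(optional))  # 32

# Dropping the empty strings from the * result leaves the + result.
print([m for m in star if m] == plus)  # True
```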
{} : Repetition
{3} : Adding this expression to your regular expression will cause the expression before it to be repeated exactly 3 times.
{6,} : This will apply a repetition of 6 or more. For instance, it could be used to catch words with a length of 6 characters or more.
{6,9} : This will apply a repetition of 6 to 9. For instance, it could be used to catch words with a length of 6 characters up to 9 characters (inclusive).
"TuringMachine1940Transistors645"
If you apply \d{3,4} to the text above you’ll get:
['1940', '645']
Runs of 3 or 4 consecutive digits are caught.
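Here is a sketch of the three brace forms on the same sample text. Notice how {3} cuts '1940' down to its first three digits, and the leftover '0' is too short to match on its own.

```python
import re

txt = "TuringMachine1940Transistors645"

# Exactly 3 digits: "1940" only contributes "194".
exactly_three = re.findall(r"\d{3}", txt)
print(exactly_three)  # ['194', '645']

# 4 or more digits: "645" is too short to match.
four_or_more = re.findall(r"\d{4,}", txt)
print(four_or_more)  # ['1940']

# Between 3 and 4 digits: both numbers are matched in full.
three_to_four = re.findall(r"\d{3,4}", txt)
print(three_to_four)  # ['1940', '645']
```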
More Regular Expression Examples
Here are some more regex examples that can help you understand different implementations of regex. We are using very small textual data in these examples but once your regex is solid you can apply it to much bigger datasets such as millions of lines.
Text analysis, text processing and sentiment analysis can be very fruitful fields, and they are utilized in a broad range of domains from Artificial Intelligence to Anthropology to Business Analysis to Financial Analysis to Web Services and even Legal Technology.
Ranges:
[a-z] [0-9] [abcde] [^abcde]
abcde 12345
Short labels:
\s Whitespace
\d Digits
\w Alphanumeric
. Any Character
Repetitions:
* Zero or more repetitions
+ One or more repetitions
? Optional character
{m} & {m,n} Repetitions
Escape characters:
\. Period
\+ Plus
\* Star etc.
Advanced Concepts:
^ Starts with
$ Ends with
(..) Group
(x|y) x or y
MORE REGEX EXAMPLES
Example 1: Regex for digits only
import re
txt = "50 Best Ever French Films - Jan 10 2020"
data = re.findall(r"\d", txt)  # raw string so the backslash reaches the regex engine as-is
print(data)
['5', '0', '1', '0', '2', '0', '2', '0']
+ will make sure that one or more consecutive digits are captured together until the next non-digit character is encountered. The non-digits are skipped, and so on.
r"\d+"
['50', '10', '2020']
Wondering what would happen if you used * instead of +? Since * means “zero or more”, the pattern also matches an empty string at every position where there is no digit, so the result is padded with empty strings:
r"\d*"
['50', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '10', '', '2020', '']
Example 2: Regex with no digits
Let’s try to get the words (only with letters) in the text by trying different approaches.
An attempt with a non-digits regex to demonstrate its use:
Using \D+ will give you everything except digits, grouped into runs of consecutive non-digit characters.
import re
txt = "50 Best Ever French Films - Jan 10 2020"
data = re.findall(r"\D+", txt)
print(data)
[' Best Ever French Films - Jan ', ' ']
And you can achieve similar, but not the same, results with "[A-Z][a-z]+", which will print every capital letter followed by one or more lowercase letters. Take a look:
import re
txt = "50 Best Ever French Films - Jan 10 2020"
data = re.findall("[A-Z][a-z]+", txt)
print(data)
['Best', 'Ever', 'French', 'Films', 'Jan']
"\D+" gives all the non-digits including spaces and special characters, but "[A-Z][a-z]+" only gives the capitalized words made of letters.
Example 3: Regex for letters only (lower and upper case)
And letter ranges:
import re
txt = "50 Best Ever French Films - Jan 10 2020"
data = re.findall("[A-Z][a-z]+", txt)
print(data)
['Best', 'Ever', 'French', 'Films', 'Jan']
\ : Escape Character w/ backslash
Since some of the characters are reserved for special meanings in regular expressions (such as *, +, ?, $ etc.), if you’d like to match those characters as they are in a text you need to use a backslash (escape character):
\? : will match question mark
\+ : will match plus sign
\* : will match star sign
\. : will match dot character
\$ : will match dollar sign
and so on…
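To make the effect of escaping visible, here is a small sketch on a made-up string. It also uses re.escape from the standard library, which builds the escaped pattern from a literal string so you do not have to add the backslashes by hand.

```python
import re

sample = "Price: 5x00 or $5.00?"  # hypothetical sample text

# Unescaped dot: "." matches any character, so "5x00" is caught too.
loose = re.findall(r"5.00", sample)
print(loose)  # ['5x00', '5.00']

# Escaped dollar sign and dot: only the literal text "$5.00" matches.
strict = re.findall(r"\$5\.00", sample)
print(strict)  # ['$5.00']

# re.escape adds the needed backslashes to a literal string for you.
auto = re.findall(re.escape("$5.00"), sample)
print(auto)  # ['$5.00']
```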
Advanced Concepts
A few more special characters are worth knowing:
^ : Starts with
$ : Ends with
() : Grouping
(x|y) : x or y
and so on…
Tip: If you use ^ at the start of brackets, it will mean “except” or “not” rather than “starts with”. For example, [^ABC] matches any character except A, B or C.
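The concepts listed above can be tried out quickly with a sketch like the one below. The variable names are only illustrative, and "Ocean" is a made-up alternative that does not appear in the text, added to show that | picks whichever option is present.

```python
import re

sample = "Antarctica Space Observatory"

# ^ outside brackets: the match must begin at the start of the text.
starts = re.findall(r"^A\w+", sample)
print(starts)  # ['Antarctica']

# $ : the match must end at the end of the text.
ends = re.findall(r"\w+y$", sample)
print(ends)  # ['Observatory']

# | : match either alternative.
either = re.findall(r"Space|Ocean", sample)
print(either)  # ['Space']

# () : with a group, findall returns only the grouped part of each match.
grouped = re.findall(r"(\w+) Observatory", sample)
print(grouped)  # ['Space']

# ^ inside brackets: match anything except capital letters and spaces.
negated = re.findall(r"[^A-Z ]+", sample)
print(negated)  # ['ntarctica', 'pace', 'bservatory']
```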
Example 4: Regex with or logical operator
You can see an implementation of the | (or) operator below. The pattern matches any one of the alternatives separated by |.
import re
txt = "TuringMachine1940Transistors645 ?fivethree"
data = re.findall("19|64|five", txt)
print(data)
['19', '64', 'five']
Example 5: Isolating groups in Regex
You can see an implementation of the escape character with a question mark below. There is also grouping with parentheses, so that the question mark and “five” are matched but only the grouped part is shown in the results.
import re
txt = "TuringMachine1940Transistors645 ?fivethree"
data = re.findall(r"\?five(\S+)", txt)
print(data)
['three']
Example 6: Escaping a character with Regex
You can see an implementation of the escape character with a question mark below. The pattern matches a word starting with “O” that is followed by a literal question mark.
import re
txt = "Antarctica Space Observatory?"
data = re.findall(r"O\w+\?", txt)
print(data)
['Observatory?']
Example 7: ^ character in regex to identify beginning of a string
In this example you can see two use cases of the “^” sign. The first one means the text starts with “A” and the second one means not “a”. The ^ character has different functions inside the brackets vs outside the brackets.
import re
txt = "Antarctica Space Observatory"
data = re.findall("^A[^a]+", txt)
print(data)
['Ant']
Note that the match stops when “a” is encountered.
Wrapping-Up Regex and Looking Ahead
This wraps up our Regex Lesson with Python. We have covered most of the core Regular Expression concepts that you may need when analyzing textual data.
Feel free to bookmark this page, as regex tends to come up only now and then with specific projects and it’s usually hard to recall everything instantly. What matters most is that you know how to use regular expressions, so you can always look things up quickly and build your code.
We have prepared specific regular expression exercises in Python which may be helpful for you to master this concept. You can find the link below. They also come with a regex cheat sheet.
In the next lesson we will be demystifying API connections with Python. APIs are an extremely powerful and common technology that allows convenient data access, usually in JSON format. API connections are also a great way to practice Regular Expressions, since large data in varying formats is often involved and extracting the specifically needed information from that data is where the added value is.
Next Lesson
In the next lesson we will be demystifying API connections with Python. APIs are an extremely powerful and common technology that allows convenient data access. They also often combine different technologies such as the JSON format and offer fantastic opportunities to practice Regular Expressions. Large data often comes in varying shapes and formats, and extracting the specifically needed information from that data is where the added value and insights are, which can be achieved with Regex.
So, good luck! And if you’ve made it this far in your programming journey, a big Congrats! Take a moment to realize how much you’ve been learning and that you are ready to move on to different projects under domains that might be of interest to you. For ideas you can always check out some of our tutorials on Holypython.com as well as this article:
Next Lesson: APIs with Python