Let’s check out some exercises that will help you understand Regular Expressions better.

Exercise 6-a

From the list, keep only the lines that start with a number or a letter after the > sign.


You can use the findall function from Python's re module:
i.e.: re.findall()
\w+ can be a meaningful regular expression in this case, because \w matches letters, digits and the underscore.
data = re.findall(r'>\w+', str)
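As a minimal sketch of this hint, the snippet below runs the pattern on a made-up string. The sample text and the names text and matches are assumptions for illustration, not the exercise data.

```python
import re

# Hypothetical sample: only some lines have a letter or digit right after ">".
text = ">apple\n> pear\n>42 items\n>-dash"

# ">" followed by one or more word characters (letters, digits, underscore).
matches = re.findall(r'>\w+', text)
print(matches)  # ['>apple', '>42']
```

Lines where a space or a dash follows ">" are skipped, because \w does not match those characters.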

Exercise 6-b

Write a regex so that the full email addresses are extracted.
e.g.: mike@protonmail.com


One way to approach this problem is:

1- match everything that’s non-space before the “@” sign

2- match the “@” sign

3- match everything that’s non-space after the “@” sign

This example really shows the versatility of regex because with this format, you will catch the emails regardless of different suffixes (.co.uk, .gov.fr, .co.jp etc.)

Regular Expression for everything except space is:

\S : Non-space characters

By combining + with \S you can match one or more non-space characters in a row
i.e.: \S+

regex = r'\S+@\S+'
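To see the pattern in action, here is a short sketch on an invented sentence. The addresses and the variable names are assumptions for illustration only.

```python
import re

# Hypothetical sample sentence; the addresses are made up.
text = "Contact mike@protonmail.com or sales@example.co.uk for details"

regex = r'\S+@\S+'
emails = re.findall(regex, text)
print(emails)  # ['mike@protonmail.com', 'sales@example.co.uk']
```

Because \S matches any non-space character, punctuation that touches an address (a trailing comma, for example) would be captured along with it.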

Note: The part inside the quotes is the regex itself. But you might be wondering what the r in front is doing. r'text here' marks a raw string in Python, which helps you avoid conflicts such as backslash misinterpretation, for example while typing directory paths.

The name raw string can help you remember and understand the function of r.

It’s good practice to use raw strings for regex patterns; otherwise, if you type your string without the r, Python will try to interpret backslashes as the start of escape sequences before the regex engine sees them.
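The difference can be checked directly. The sketch below uses a made-up Windows-style path and a small word-boundary pattern; both are assumptions chosen only to show what the r prefix changes.

```python
import re

# The same characters typed as a normal string and as a raw string.
normal_path = 'C:\new\table'   # \n becomes a newline, \t becomes a tab
raw_path = r'C:\new\table'     # backslashes stay exactly as typed

print(len(normal_path))  # 10
print(len(raw_path))     # 12

# In a normal string '\b' is a backspace character, so the regex engine
# never receives the word-boundary token; the raw string passes it through.
without_raw = re.findall('\bcat\b', 'a cat')
with_raw = re.findall(r'\bcat\b', 'a cat')
print(without_raw)  # []
print(with_raw)     # ['cat']
```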

Exercise 6-c

This time, write a regex to get only the part of the email before the "@" sign and include the "@" sign.
e.g.: only the mike@ part from mike@protonmail.com


One way to approach this problem is:

1- match everything that’s non-space before the “@” sign

2- match the “@” sign

Regular Expression for everything except space is:

\S : Non-space characters

By combining + with \S you can match one or more non-space characters in a row
i.e.: \S+

regex = r'\S+@'
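A quick check of this pattern, again on an invented sentence (the text and variable names are assumptions):

```python
import re

# Hypothetical sample sentence; the addresses are made up.
text = "Contact mike@protonmail.com or sales@example.co.uk for details"

regex = r'\S+@'
prefixes = re.findall(regex, text)
print(prefixes)  # ['mike@', 'sales@']
```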

Exercise 6-d

This time, write a regex to get only the part of the email before the "@" sign, excluding the "@" sign.
e.g.: only the mike part from mike@protonmail.com


One way to approach this problem is:

1- match everything that’s non-space before the “@” sign

2- match the “@” sign

3- use parentheses in the right place so that @ is searched for but not included in the output

Regular Expression for everything except space is:

\S : Non-space characters

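By combining + with \S you can match one or more non-space characters in a row
i.e.: \S+

You can isolate the result you’d like to have by using parentheses: (\S+)@

When the pattern contains one group, findall returns only the part inside the parentheses, so “@” will be excluded.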

regex = r'(\S+)@'
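The effect of the parentheses can be sketched as follows. The sample sentence and names are assumptions; re.search is used here only to show that the full match still includes the "@" while the group does not.

```python
import re

# Hypothetical sample sentence; the addresses are made up.
text = "Contact mike@protonmail.com or sales@example.co.uk for details"
regex = r'(\S+)@'

# With one capturing group, findall returns only the group's content.
names = re.findall(regex, text)
print(names)  # ['mike', 'sales']

# A match object keeps both the whole match and the captured group.
first = re.search(regex, text)
print(first.group(0))  # mike@
print(first.group(1))  # mike
```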

Exercise 6-e

Find the words with exactly 8 letters using regex.


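You can use \w, which matches a single alphanumeric character (or an underscore).

You can combine \w with {8}, which repeats the preceding \w exactly 8 times.

\w{8} : Alphanumeric characters 8 times

On its own, \w{8} also matches the first 8 characters of a longer word, so wrap it in word boundaries (\b) to keep only the words with exactly 8 characters.

regex = r'\b\w{8}\b'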
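A quick way to see what \w{8} matches is to run it on a sentence that also contains a longer word. The sentence and variable names below are made up, and the second pattern adds word boundaries (\b), which is one possible way to require exactly 8 characters.

```python
import re

# Hypothetical sentence: "elephant" has 8 letters, "surprisingly" has 12.
text = "The elephant was surprisingly fast"

# Without boundaries, the first 8 characters of a longer word also match.
loose = re.findall(r'\w{8}', text)
print(loose)  # ['elephant', 'surprisi']

# \b on both sides requires the match to be a whole word.
exact = re.findall(r'\b\w{8}\b', text)
print(exact)  # ['elephant']
```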

Exercise 6-f

Find the numbers starting with 212.


You can start the pattern with 212 and follow it with \S, which matches non-space characters.

You can combine \S with +, which matches non-space characters one or more times.

\S+ : Non-space characters, 1 or more times.

regex = r'212\S+'
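The sketch below tries the pattern on invented numbers (the sample text and names are assumptions, not the exercise data). It also shows that 212 in the middle of a digit run is matched too, and that a leading \b is one possible way to avoid that.

```python
import re

# Hypothetical numbers: the second one contains 212 in the middle.
text = "Numbers: 2125550147 7182120199 212-555-0100"

loose = re.findall(r'212\S+', text)
print(loose)  # ['2125550147', '2120199', '212-555-0100']

# \b requires 212 to begin at a word boundary.
anchored = re.findall(r'\b212\S+', text)
print(anchored)  # ['2125550147', '212-555-0100']
```

Note that \b also sits between a hyphen and a digit, so a number such as 718-212-0199 would still produce a match at its 212 part.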

Exercise 6-g

You are given stock prices for related financial tickers (symbols representing companies in the stock market).

Find a way to extract the tickers mentioned in the report.
e.g.: TSLA, NFLX ...


You can use the [A-Z] capital letter range, since stock tickers are given in all capital letters.

Now you need to match several capital letters in a row without matching the words that merely start with a capital letter.

Repetition can be a suitable method in this case:

{2,}

will ensure that only runs of 2 or more capital letters are captured.

regex = r'[A-Z]{2,}'
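Here is a small sketch of the pattern on an invented report line. The sentence and variable names are assumptions for illustration.

```python
import re

# Hypothetical report text; company names start with a single capital letter.
text = "Tesla (TSLA) rose 3% while Netflix (NFLX) fell; Apple (AAPL) was flat."

regex = r'[A-Z]{2,}'
tickers = re.findall(regex, text)
print(tickers)  # ['TSLA', 'NFLX', 'AAPL']
```

Words such as Tesla are skipped because only one capital letter appears before the lowercase letters begin.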

Exercise 6-h

Find the HTML tags that are more than 4 letters long.

HTML tags can be found inside <> characters, and closing HTML tags use the same format with a / character after the opening < character. </>

e.g.: <param> </param>


You can start the regex with </ and end it with > .

It makes more sense to search in closing tags to avoid tags with attributes. (Attributes provide additional information and appear in the opening tag, after the tag name. An opening tag can carry several attributes, which makes it more complicated to search in this case.)

Assuming the tag names here consist of lowercase letters only, we can use a letter range with 5 or more repetitions: [a-z]{5,}

To isolate the tag name only, you can use parentheses: ()

regex = r'</([a-z]{5,})>'
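The pattern can be tried on a short, made-up HTML fragment. The markup and variable names below are assumptions, not the exercise data.

```python
import re

# Hypothetical fragment: opening tags may carry attributes, closing tags do not.
html = '<table class="x"><param name="a"></param><li>item</li><strong>b</strong></table>'

regex = r'</([a-z]{5,})>'
long_tags = re.findall(regex, html)
print(long_tags)  # ['param', 'strong', 'table']
```

The closing li tag is skipped because its name has fewer than 5 letters.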

Exercise 6-i

Loop through the list and apply regex to each element so that only items ending with semicolon (;) are matched.


You can use ;$ at the end of your regex to match the items ending with ;

You can use .+ to capture everything before the semicolon.

You can use parentheses to isolate the values before the semicolon.

()

Inside your for loop, you can use try/except so that data[0] does not raise an IndexError when findall returns an empty list.

except IndexError:

will catch only IndexError.

It’s generally considered a safer approach than:

except Exception:

which will catch almost all error types.

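import re

lst = []
for i in str:  # str is the list given in the exercise (the name shadows the built-in str)
    # findall returns an empty list when the item does not end with ;
    data = re.findall(r"(.+);$", i)
    try:
        lst.append(data[0])
    except IndexError:
        # no match for this item, so skip it
        pass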
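For a version that runs on its own, here is a sketch with an invented list. The names items and kept and the sample values are assumptions; it catches IndexError specifically, as recommended above.

```python
import re

# Hypothetical input list; only some items end with a semicolon.
items = ["x = 1;", "y = 2", "print(x);", ";", ""]

kept = []
for item in items:
    # The group captures everything before the final semicolon.
    data = re.findall(r"(.+);$", item)
    try:
        kept.append(data[0])
    except IndexError:
        # findall returned an empty list: no match for this item
        pass

print(kept)  # ['x = 1', 'print(x)']
```

A lone ";" is not kept because .+ needs at least one character before the semicolon. Checking the list with "if data:" before indexing would be an alternative to try/except.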

Regex Cheat Sheet

[0-9]    :     0 to 9

[a-z]     :     a to z

[A-Z]    :     A to Z

abc         :     The literal sequence abc

123         :     The literal sequence 123

.               :     Any character (except a newline by default)
[^a]       :     Not a
[^a-f]   :     Not a to f

* Zero or more repetitions
+ One or more repetitions
?  Zero or one repetition
{m} m Repetitions
{m,} m or more Repetitions
{m,n} m to n Repetitions

\w :  Word class (alphanumeric and underscore)
\d :  Digits
\s :   Space (whitespace)
\W : Non-word class
\D :  Non-digit
\S :   Non-space 

|      :  Or operator
()    :  Capturing group
(()) :  Capturing subgroup
\      :  Escape a special character
^     :  Starts with
$     :  Ends with
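A few of the entries above can be verified with short calls to re.findall. The sample strings and variable names are made up for illustration.

```python
import re

# "?" makes the preceding item optional (zero or one repetition).
optional = re.findall(r'colou?r', 'color colour')
print(optional)  # ['color', 'colour']

# "^" anchors a match to the start of the string, "$" to the end.
start = re.findall(r'^\d+', '42 apples and 7 pears')
end = re.findall(r'\w+$', 'ends with semicolon')
print(start)  # ['42']
print(end)    # ['semicolon']

# "|" chooses between alternatives.
either = re.findall(r'cat|dog', 'cat, dog, bird')
print(either)  # ['cat', 'dog']

# [^a-f] matches any character that is not in the range a to f.
negated = re.findall(r'[^a-f]+', 'abcxyz')
print(negated)  # ['xyz']
```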