TUTORIAL - WEB SCRAPING WITH PYTHON
In this tutorial we’re going to do some very simple web scraping using the requests library and regular expressions (Python’s built-in re module).
The first step is to import both libraries that we’re going to need: requests and re.
import requests
import re
url = "https://en.wikipedia.org/wiki/2018_Winter_Olympics_medal_table"
data = requests.get(url)
data.text
name = re.findall(r'>\S+</a>+', data.text)
MAGIC OF REGEX
data.text now holds the whole page as one long string of HTML. It might be hard to read, but if you take a closer look you will be able to make sense of it. For example, meaningful data such as languages, years and country names sits right after the character ">" and right before the closing tag </a>:
>Latviešu</a>
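We can write a quick and dirty regex accordingly; it is the pattern passed to re.findall above.
If we break this regex down:
'>\S+</a>+'
- >: find this character
- \S+: followed by any non-whitespace characters; + means one or more times
- </a>+: followed by this string. The final + applies only to the last character, >, so it allows extra > characters rather than repeating the whole string, and it could be dropped here.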
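As a small illustration of the breakdown above, here is a sketch that runs the same pattern on a made-up HTML fragment (sample_html and its href values are invented for this example, not taken from the real page). The last line checks what the trailing + actually repeats.

```python
import re

# A made-up fragment shaped like the page's language links (assumption).
sample_html = """
<li><a href="/lv" lang="lv">Latviešu</a></li>
<li><a href="/hu" lang="hu">Magyar</a></li>
"""

# ">" then one or more non-whitespace characters, then "</a" and one or more ">".
pattern = r'>\S+</a>+'

matches = re.findall(pattern, sample_html)
print(matches)

# The trailing + repeats only the final ">", not the whole "</a>" string.
extra = re.findall(pattern, '>x</a>>>')
print(extra)
```

Because \S+ stops at whitespace, the opening tags (which contain spaces between attributes) never make it into a match.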
Now you have much more meaningful data in a list called name.
You can loop through this list and print each element to see what you have:
for i in name:
    print(i)
>Venues
>Marketing
>mascots
>Broadcasters
>medalists
>Controversies
>Paralympics
>IOC
>KOC
>POCOG
>v
>t
>e
>snowboarding
>[1]
>Netherlands
>[2]
>Norway
>[3]
>Germany
>[4]
>[5]
>Canada
>Norway
>[6]
>Hungary
>[7]
>[8]
>[9]
>[10]
>[11]
>References
>edit
>[12]
>Norway
>Germany
>Canada
>Netherlands
>Sweden
>Switzerland
>France
>Austria
>Japan
>Italy
>Belarus
>China
>Slovakia
>Finland
>Poland
>Hungary
>Ukraine
>Australia
>Slovenia
>Belgium
>Spain
>Kazakhstan
>Latvia
>Liechtenstein
>edit
>Curling
>OAR
>meldonium
>[13]
>[14]
>[15]
>[16]
>[17]
>NOR
>edit
>edit
>^
>^
>^
>^
>^
>^
>^
>^
>^
>^
>^
>Archived
>^
>^
>^
>^
>^
>BBC
>^
>v
>t
>e
>Events
>Pyeongchang
>Snowboarding
>Bobsleigh
>Luge
>Skeleton
>Biathlon
>Curling
>men
>women
>men
>women
>v
>t
>e
>1896
>1900
>1904
>1908
>1912
>1920
>1924
>1928
>1932
>1936
>1948
>1952
>1956
>1960
>1964
>1968
>1972
>1976
>1980
>1984
>1988
>1992
>1996
>2000
>2004
>2008
>2012
>2016
>1924
>1928
>1932
>1936
>1948
>1952
>1956
>1960
>1964
>1968
>1972
>1976
>1980
>1984
>1988
>1992
>1994
>1998
>2002
>2006
>2010
>2014
>2018
>https://en.wikipedia.org/w/index.php?title=2018_Winter_Olympics_medal_table&oldid=914956083
>Categories
>Talk
>Contributions
>Article
>Talk
>Read
>Edit
>Contents
>Help
>Asturianu
>Bosanski
>Čeština
>Dansk
>Deutsch
>Español
>فارسی
>Français
>Gaeilge
>한국어
>हिन्दी
>Hrvatski
>Interlingue
>Italiano
>עברית
>Кыргызча
>Latviešu
>Magyar
>Македонски
>Nederlands
>日本語
>Polski
>Português
>Română
>Русский
>Slovenščina
>Suomi
>Svenska
>ไทย
>Türkçe
>Українська
>West-Vlams
>中文
>Disclaimers
>Developers
LETTERS, NUMBERS AND SYMBOLS
We’re still getting lots of garbage in our results. Let’s try a different approach.
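Instead of \S+ (any non-whitespace character), we could try \w+, which matches only word characters: letters, digits and the underscore. In Python 3 this includes non-ASCII letters by default, so names like Čeština still match, while entries containing brackets or hyphens, such as [1] or West-Vlams, drop out.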
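To see the difference in isolation, here is a minimal sketch that applies both patterns to a small made-up fragment (sample_html and its href values are assumptions for illustration only).

```python
import re

# Made-up links covering the cases discussed above (assumption).
sample_html = """
<a href="#cite_note-1">[1]</a>
<a href="/wiki/Norway">Norway</a>
<a href="/cs">Čeština</a>
<a href="/vls">West-Vlams</a>
"""

# \S+ accepts anything that is not whitespace, including brackets and hyphens.
non_whitespace = re.findall(r'>\S+</a>', sample_html)

# \w+ accepts only letters, digits and the underscore.
word_chars = re.findall(r'>\w+</a>', sample_html)

print(non_whitespace)
print(word_chars)
```

The footnote marker and the hyphenated name survive \S+ but not \w+, while the accented name survives both.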
name = re.findall(r'>\w+</a>+', data.text)
print(name)
[‘>Venues</a>’, ‘>Marketing</a>’, ‘>mascots</a>’, ‘>Broadcasters</a>’, ‘>medalists</a>’, ‘>Controversies</a>’, ‘>Paralympics</a>’, ‘>IOC</a>’, ‘>KOC</a>’, ‘>POCOG</a>’, ‘>snowboarding</a>’, ‘>Netherlands</a>’, ‘>Norway</a>’, ‘>Germany</a>’, ‘>Canada</a>’, ‘>Norway</a>’, ‘>Hungary</a>’, ‘>edit</a>’, ‘>Norway</a>’, ‘>Germany</a>’, ‘>Canada</a>’, ‘>Netherlands</a>’, ‘>Sweden</a>’, ‘>Switzerland</a>’, ‘>France</a>’, ‘>Austria</a>’, ‘>Japan</a>’, ‘>Italy</a>’, ‘>Belarus</a>’, ‘>China</a>’, ‘>Slovakia</a>’, ‘>Finland</a>’, ‘>Poland</a>’, ‘>Hungary</a>’, ‘>Ukraine</a>’, ‘>Australia</a>’, ‘>Slovenia</a>’, ‘>Belgium</a>’, ‘>Spain</a>’, ‘>Kazakhstan</a>’, ‘>Latvia</a>’, ‘>Liechtenstein</a>’, ‘>edit</a>’, ‘>Curling</a>’, ‘>OAR</a>’, ‘>meldonium</a>’, ‘>NOR</a>’, ‘>edit</a>’, ‘>edit</a>’, ‘>Archived</a>’, ‘>BBC</a>’, ‘>Events</a>’, ‘>Pyeongchang</a>’, ‘>Snowboarding</a>’, ‘>Bobsleigh</a>’, ‘>Luge</a>’, ‘>Skeleton</a>’, ‘>Biathlon</a>’, ‘>Curling</a>’, ‘>men</a>’, ‘>women</a>’, ‘>men</a>’, ‘>women</a>’, ‘>1896</a>’, ‘>1900</a>’, ‘>1904</a>’, ‘>1908</a>’, ‘>1912</a>’, ‘>1920</a>’, ‘>1924</a>’, ‘>1928</a>’, ‘>1932</a>’, ‘>1936</a>’, ‘>1948</a>’, ‘>1952</a>’, ‘>1956</a>’, ‘>1960</a>’, ‘>1964</a>’, ‘>1968</a>’, ‘>1972</a>’, ‘>1976</a>’, ‘>1980</a>’, ‘>1984</a>’, ‘>1988</a>’, ‘>1992</a>’, ‘>1996</a>’, ‘>2000</a>’, ‘>2004</a>’, ‘>2008</a>’, ‘>2012</a>’, ‘>2016</a>’, ‘>1924</a>’, ‘>1928</a>’, ‘>1932</a>’, ‘>1936</a>’, ‘>1948</a>’, ‘>1952</a>’, ‘>1956</a>’, ‘>1960</a>’, ‘>1964</a>’, ‘>1968</a>’, ‘>1972</a>’, ‘>1976</a>’, ‘>1980</a>’, ‘>1984</a>’, ‘>1988</a>’, ‘>1992</a>’, ‘>1994</a>’, ‘>1998</a>’, ‘>2002</a>’, ‘>2006</a>’, ‘>2010</a>’, ‘>2014</a>’, ‘>2018</a>’, ‘>Categories</a>’, ‘>Talk</a>’, ‘>Contributions</a>’, ‘>Article</a>’, ‘>Talk</a>’, ‘>Read</a>’, ‘>Edit</a>’, ‘>Contents</a>’, ‘>Help</a>’, ‘>Asturianu</a>’, ‘>Bosanski</a>’, ‘>Čeština</a>’, ‘>Dansk</a>’, ‘>Deutsch</a>’, ‘>Español</a>’, ‘>فارسی</a>’, ‘>Français</a>’, ‘>Gaeilge</a>’, ‘>한국어</a>’, ‘>Hrvatski</a>’, 
‘>Interlingue</a>’, ‘>Italiano</a>’, ‘>עברית</a>’, ‘>Кыргызча</a>’, ‘>Latviešu</a>’, ‘>Magyar</a>’, ‘>Македонски</a>’, ‘>Nederlands</a>’, ‘>日本語</a>’, ‘>Polski</a>’, ‘>Português</a>’, ‘>Română</a>’, ‘>Русский</a>’, ‘>Slovenščina</a>’, ‘>Suomi</a>’, ‘>Svenska</a>’, ‘>ไทย</a>’, ‘>Türkçe</a>’, ‘>Українська</a>’, ‘>中文</a>’, ‘>Disclaimers</a>’, ‘>Developers</a>’]
for i in name:
    print(i[1:-4])
Venues
Marketing
mascots
Broadcasters
medalists
Controversies
Paralympics
IOC
KOC
POCOG
snowboarding
Netherlands
Norway
Germany
Canada
Norway
Hungary
edit
Norway
Germany
Canada
Netherlands
Sweden
Switzerland
France
Austria
Japan
Italy
Belarus
China
Slovakia
Finland
Poland
Hungary
Ukraine
Australia
Slovenia
Belgium
Spain
Kazakhstan
Latvia
Liechtenstein
edit
Curling
OAR
meldonium
NOR
edit
edit
Archived
BBC
Events
Pyeongchang
Snowboarding
Bobsleigh
Luge
Skeleton
Biathlon
Curling
men
women
men
women
1896
1900
1904
1908
1912
1920
1924
1928
1932
1936
1948
1952
1956
1960
1964
1968
1972
1976
1980
1984
1988
1992
1996
2000
2004
2008
2012
2016
1924
1928
1932
1936
1948
1952
1956
1960
1964
1968
1972
1976
1980
1984
1988
1992
1994
1998
2002
2006
2010
2014
2018
Categories
Talk
Contributions
Article
Talk
Read
Edit
Contents
Help
Asturianu
Bosanski
Čeština
Dansk
Deutsch
Español
فارسی
Français
Gaeilge
한국어
Hrvatski
Interlingue
Italiano
עברית
Кыргызча
Latviešu
Magyar
Македонски
Nederlands
日本語
Polski
Português
Română
Русский
Slovenščina
Suomi
Svenska
ไทย
Türkçe
Українська
中文
Disclaimers
Developers
Much more beautiful, right?
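The slice i[1:-4] works because every match starts with one character (>) and ends with four (</a>). As an alternative sketch, a capture group lets re.findall return only the text between the two delimiters, so no slicing is needed. The sample string below is made up for the example.

```python
import re

# Made-up fragment with two links (assumption).
sample_html = '<a href="/wiki/Norway">Norway</a> <a href="/wiki/Germany">Germany</a>'

# Slicing approach: match the delimiters too, then cut them off.
sliced = [m[1:-4] for m in re.findall(r'>\w+</a>', sample_html)]

# Capture-group approach: findall returns only what the parentheses matched.
captured = re.findall(r'>(\w+)</a>', sample_html)

print(sliced)
print(captured)
```

Both lists contain the same names; the capture-group version simply avoids hard-coding the lengths of the delimiters.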
It’s a lot of fun to work with requests and regex; both are powerful tools with a lot of potential. Let’s do a couple more tricks: we can show only the numbers, or we can exclude the numbers.
name = re.findall(r'>[0-9]+</a>+', data.text)
for i in name:
    print(i[1:-4])
1896
1900
1904
1908
1912
1920
1924
1928
1932
1936
1948
1952
1956
1960
1964
1968
1972
1976
1980
1984
1988
1992
1996
2000
2004
2008
2012
2016
1924
1928
1932
1936
1948
1952
1956
1960
1964
1968
1972
1976
1980
1984
1988
1992
1994
1998
2002
2006
2010
2014
2018
name = re.findall(r'>[a-zA-Z]+</a>+', data.text)
for i in name:
    print(i[1:-4])
Venues
Marketing
mascots
Broadcasters
medalists
Controversies
Paralympics
IOC
KOC
POCOG
snowboarding
Netherlands
Norway
Germany
Canada
Norway
Hungary
edit
Norway
Germany
Canada
Netherlands
Sweden
Switzerland
France
Austria
Japan
Italy
Belarus
China
Slovakia
Finland
Poland
Hungary
Ukraine
Australia
Slovenia
Belgium
Spain
Kazakhstan
Latvia
Liechtenstein
edit
Curling
OAR
meldonium
NOR
edit
edit
Archived
BBC
Events
Pyeongchang
Snowboarding
Bobsleigh
Luge
Skeleton
Biathlon
Curling
men
women
men
women
Categories
Talk
Contributions
Article
Talk
Read
Edit
Contents
Help
Asturianu
Bosanski
Dansk
Deutsch
Gaeilge
Hrvatski
Interlingue
Italiano
Magyar
Nederlands
Polski
Suomi
Svenska
Disclaimers
Developers
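Note that [a-zA-Z] covers ASCII letters only, which is why names such as Čeština or 日本語 are missing from the list above. If the goal is just to exclude numbers, one possible sketch is the character class [^\W\d_], which means "a word character that is neither a digit nor an underscore". The sample fragment is made up for the example.

```python
import re

# Made-up fragment mixing a year, an ASCII name and non-ASCII names (assumption).
sample_html = """
<a href="/wiki/1896">1896</a>
<a href="/wiki/Norway">Norway</a>
<a href="/cs">Čeština</a>
<a href="/ja">日本語</a>
"""

# ASCII letters only: accented and non-Latin names are dropped along with numbers.
ascii_letters = re.findall(r'>([a-zA-Z]+)</a>', sample_html)

# Word characters minus digits and underscore: numbers go, other scripts stay.
unicode_letters = re.findall(r'>([^\W\d_]+)</a>', sample_html)

print(ascii_letters)
print(unicode_letters)
```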
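How about excluding words that don’t start with a capital letter? The pattern below requires one capital letter followed only by lowercase letters, so all-caps entries such as IOC drop out as well.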
name = re.findall(r'>[A-Z][a-z]+</a>+', data.text)
for i in name:
    print(i[1:-4])
Venues
Marketing
Broadcasters
Controversies
Paralympics
Netherlands
Norway
Germany
Canada
Norway
Hungary
Norway
Germany
Canada
Netherlands
Sweden
Switzerland
France
Austria
Japan
Italy
Belarus
China
Slovakia
Finland
Poland
Hungary
Ukraine
Australia
Slovenia
Belgium
Spain
Kazakhstan
Latvia
Liechtenstein
Curling
Archived
Events
Pyeongchang
Snowboarding
Bobsleigh
Luge
Skeleton
Biathlon
Curling
Categories
Talk
Contributions
Article
Talk
Read
Edit
Contents
Help
Asturianu
Bosanski
Dansk
Deutsch
Gaeilge
Hrvatski
Interlingue
Italiano
Magyar
Nederlands
Polski
Suomi
Svenska
Disclaimers
Developers
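Acronyms such as IOC and BBC are missing from this list because [A-Z][a-z]+ needs a lowercase letter right after the first capital. If you would rather keep them, a hedged variant is [A-Z][a-zA-Z]*, which only constrains the first letter. The sample fragment below is made up for the comparison.

```python
import re

# Made-up fragment with an acronym, two capitalized words and a lowercase word (assumption).
sample_html = """
<a href="/wiki/IOC">IOC</a>
<a href="/wiki/Norway">Norway</a>
<a href="/wiki/men">men</a>
<a href="/wiki/Curling">Curling</a>
"""

# One capital followed by lowercase letters only: acronyms are excluded.
capitalized = re.findall(r'>([A-Z][a-z]+)</a>', sample_html)

# One capital followed by any ASCII letters: acronyms are kept.
starts_with_capital = re.findall(r'>([A-Z][a-zA-Z]*)</a>', sample_html)

print(capitalized)
print(starts_with_capital)
```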
403 FORBIDDEN ERROR
If you get this error, it’s usually because the site rejects GET requests that don’t carry a recognized User-Agent header. Sending a browser-like User-Agent header with the request should solve the issue. Below is an example you can use:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
data = requests.get(url, headers=headers)
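If you want the script to stop with a clear error instead of quietly running regexes over an error page, one option is to wrap the request in a small helper. This is only a sketch: fetch_html is a hypothetical name, the 10-second timeout is an arbitrary choice, and the User-Agent string is simply the one from the example above.

```python
# Browser-like User-Agent header, copied from the example above.
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/50.0.2661.102 Safari/537.36'
}


def fetch_html(url, headers=None, timeout=10):
    """Hypothetical helper: return the page body, or raise on an HTTP error."""
    # Third-party library; imported here so the helper can be defined
    # even on a machine where requests is not installed yet.
    import requests

    response = requests.get(url, headers=headers or HEADERS, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses such as 403 Forbidden.
    response.raise_for_status()
    return response.text
```

With this in place, html = fetch_html(url) either gives you the page text or fails loudly with the status code in the error message.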
footnotes:
[1] The data.text syntax might look slightly foreign to you.
It actually has a very simple explanation: requests.get returns a Response object, in this case named data.
This object has an attribute called .text, which lets us read the body of the response as a string. That’s all.