Files with Python

As soon as you get proficient with basic Python skills, working with files and databases becomes a crucial part of real-world programming tasks.

Thanks to Python’s extremely intuitive and straightforward syntax, working with files becomes a piece of cake.

In this tutorial we will share the basics, tips and finer points of the most useful file operations in Python.

Estimated Time

20 mins

Skill Level

Advanced

Exercises

N/A

Course Provider

Provided by HolyPython.com

Used Where?

  • Exploring data
  • Web parsing
  • Saving data
  • Collecting data
  • Data science
  • Database operations
  • Applications
  • Websites
  • Charts
  • All domains, such as finance, medicine, game development, software, apps, energy, etc.
We will split this Python tutorial into subsections based on the operations needed when working with files:
  • Opening Files in Python
  • Reading Files in Python
  • Writing Files in Python (mostly concerning writing fresh data to a new file or overwriting an existing file)
  • Updating Files in Python (mostly concerning appending new data to an existing file while making sure the existing data is preserved)
  • Tips & Tricks while working with files in Python
  • Cheat Sheet regarding different objects, classes, methods and parameters used when working with files in Python

Each section will provide information and examples about the most common file types used with Python when handling data, such as:

  • Excel Files (xlsx, xlsm, xls)
  • CSV Files (Comma separated values)
  • Text Files (Simple txt files)
  • Json Files (Commonly used data structure)

 

Opening files with Python to read

There are multiple ways to open different file types in Python. This section demonstrates some of the practical solutions:

Excel Files (".xls", ".xlsx", ".xlsm", ".xlsb", ".odf")

Excel files can be opened with Python’s built-in open() function just like any other file, although in practice libraries such as pandas usually handle both the opening and the reading step for you (more on that in the reading section below).

A basic file handle can be created like this:

f = "test.xlsx"
file = open(f)

Although usually not necessary, you might need to specify the encoding format in your code as below.

If you don’t pass an encoding, open() falls back to your platform’s default encoding (for example cp1252 on many Windows setups), so specifying "utf-8" explicitly is a good habit:

f = "test.xlsx"
file = open(f, 'r', encoding='utf-8')

Text Files (.txt)

f = "test.txt"
file = open(f, "r")

print(file)
<_io.TextIOWrapper name='test.txt' mode='r' encoding='cp1252'>

Please note that at this stage, we’re only creating a file handle to work with. You’ll still need to read from this file handle.

So, the code above shows what happens when the file handle is printed: it just returns a TextIOWrapper object.

However, there are some methods that read from a file directly upon opening. Don’t let this confuse you; it just means these methods have the reading step built in.

e.g.: reading ‘.xlsx’ files with pandas’ “read_excel”, or reading JSON with “json.load” (these methods won’t require an extra reading step).

Json Files (.json)

f = "test.json"
file = open(f, "r")

Comma Separated Value Files (.csv)

f = "test.csv"
file = open(f, "r")

File opening modes:

“r” Opens a file for reading only.
“r+” Opens a file for reading and writing (writes start at the beginning and overwrite existing characters).
“rb” Opens a file in binary format for reading only.
“rb+” Opens a file in binary format for reading and writing.
“w” Opens a file for writing only; wipes existing content and creates the file if it doesn’t exist.
“a” Opens a file for writing only; creates it if it doesn’t exist (writes go to the end).
“a+” Opens a file for reading and writing; creates it if it doesn’t exist (writes go to the end).
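
As a quick illustration, here is a minimal sketch of a couple of these modes in action (assuming a test.txt file already exists in the working directory):

# read-only: raises FileNotFoundError if test.txt does not exist
file = open("test.txt", "r")
file.close()

# append: creates test.txt if missing, and writes always go to the end
file = open("test.txt", "a")
file.write("appended line\n")
file.close()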

Closing files in Python

Closing a file can be important after you’re done working with them. So, remember to use .close() method after your operation is finished when necessary.

File will remain opened in Python until you close it. This means if you’d like to separately open the file (let’s say in Windows or MacOS or Linux) you won’t be able to change it and you’ll get errors because the file will already be in use.

Closing Files (.close() method)

You can use the same method for all files as below:

file.close()

Reading files in Python

Once you have created the file handle, you can read data from it in a separate step.

Basically there are 3 very important and useful methods you should know:

  • read()
  • readline()
  • readlines()
read() reads the whole file at once and returns it as a single string.
readline() reads a single line (the first line on the first call; each subsequent call returns the next line).
readlines() reads all the lines and returns a list of strings where each string is a line of data.

Excel Files (".xls", ".xlsx", ".xlsm", ".xlsb", ".odf")

Pandas is really the perfect tool for working with excel files in Python. What we get in Excel is a data frame and since Pandas is the king of data frames, it’s just perfect.

So just import pandas and then the read_excel function can be used as below:

This function makes opening a file handle for Excel files redundant in most cases, since it reads the file directly into a data frame from its name/path.

import pandas as pd

f = "test.xlsx"
df = pd.read_excel(f)

print(df)
                 yeah
0            you know
1  you know how it is
2         this is how

You can also specify the sheet name when you’re dealing with multiple worksheets.

import pandas as pd

f = "test.xlsx"
df = pd.read_excel(f, sheet_name="Sheet2")

print(df)
     Names
0  Jessica
1   Keahni
2  Rebecca
3   Sharon
4     John

Text Files (Read, readline, readlines)

a=file.read()
print(a)

YES

NO

a=file.readline()
print(a)

YES

a=file.readlines()
print(a)

["YES\n", "NO"]
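
Keep in mind that each read advances the file pointer, so calling read() and then readlines() on the same handle would return an empty result the second time. A minimal sketch of working around this with seek(), assuming the same test.txt handle:

file = open("test.txt", "r")
a = file.read()        # consumes the whole file
file.seek(0)           # rewind the pointer to the beginning
b = file.readlines()   # now this sees the full content again
file.close()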

Json Files (json.load and json.loads)

There are numerous ways to read JSON. They may seem redundant, but it’s actually a well-thought-out system in Python. Let’s take a look:

read(), readline(), readlines(): This general way of reading will return a string object.

json.load(): Loads a file as a dictionary directly from the file object.

json.loads(): Loads a dictionary from a string (the “s” in loads stands for string).

file = open(f,"r")
a = file.read()
print(a)
print(type(a))
{"a": 666, "1": 222, "3": 444}
<class 'str'>
import json

file = open(f,"r")
data = json.load(file)
print(data)
print(type(data))
{'a': 666, '1': 222, '3': 444}
<class 'dict'>

Here we are loading the opened file directly into a Python dictionary.

file = open(f,"r")
data = file.read()
data = json.loads(data)
print(data)
print(type(data))
{'a': 666, '1': 222, '3': 444}
<class 'dict'>

Here we read the opened file first, as json.loads needs a string to load into a Python dictionary.

CSV Files (pandas read_csv and csv.reader)

data=pd.read_csv('Desktop/test.csv')
print(data)
       city  order
0  brussels     55
1     tokyo    105
2    almati     70
3    moscow    150
4    london    162

Besides, .csv files are not that different from structured text files, somewhat similar to JSON. So, you can also read them directly in Python:

f = r"C://Users/ABC/Desktop/test.csv"
file = open(f, "r")
data = file.read()
print(data)
city,order
brussels,55
tokyo,105
almati,70
moscow,150
london,162

And then there is the standard csv library that’s very useful as well:

import csv
f = r"C://Users/ABC/Desktop/test.csv"

file = open(f, 'r')
data = csv.reader(file)
print(data)
<_csv.reader object at 0x000001CF4E38F198>

csv.reader returns an iterable csv object so you’ll actually need an extra step to construct data from it:

import csv
f = r"C://Users/ABC/Desktop/test.csv"

file = open(f, 'r')
data = csv.reader(file)

for i in data:
    print(i)
['city', 'order']
['brussels', '55']
['tokyo', '105']
['almati', '70']
['moscow', '150']
['london', '162']

Writing to files in Python

There are multiple ways to write data to files in Python. This section demonstrates some of the practical solutions:

Excel Files (ExcelWriter and to_excel)

When it comes to Excel, the most intuitive way to deal with it is probably Data Frames. In the end Excel itself is a Data Frame system.

So Pandas Data Frames can be used to handle Excel operations and in this case write to Excel.

  • First, we need to create a Pandas DataFrame (obviously don’t forget to import pandas library before that)
  • Second, we can create a writer object to write to Excel.
  • Third, we use the writer object to actually write to Excel
  • Fourth and finally, we can save the writer object which will save the file.

Attention though: if you work with an existing file, this method will completely rewrite it, causing the existing data to vanish. (This is because the file mode in the ExcelWriter class defaults to “w”. More on this below.)

import pandas as pd

df = pd.DataFrame({'city': ['London', 'Aarhus', 'Oslo', 'Rotterdam'],
     'sun': ["no", "no", "no", "no"],
     'trips': ["Kenya", "Tanzania", "Spain", "Bahamas"]})

writer = pd.ExcelWriter('Desktop/test4.xlsx')
df.to_excel(writer, "Plans")
writer.save()

You’ll end up with an Excel file, shown below, with a worksheet named “Plans“:

        city sun     trips
0     London  no     Kenya
1     Aarhus  no  Tanzania
2       Oslo  no     Spain
3  Rotterdam  no   Bahamas

This method can be used to handle writing any data to Excel. If you have single-cell data, well, that’s a data frame too, with dimensions of 1×1. If you only have a column, that’s a data frame too, with dimensions of, say, 20×1.
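
For instance, here is a minimal sketch of writing a single-column data frame (the test5.xlsx file name is just a placeholder):

import pandas as pd

# a single column is still a data frame (3x1 here)
df = pd.DataFrame({'city': ['London', 'Aarhus', 'Oslo']})

writer = pd.ExcelWriter('Desktop/test5.xlsx')
df.to_excel(writer, "Cities", index=False)
writer.close()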

If you check out the help file on ExcelWriter class it offers some interesting parameters to be played with. Particularly, date_format, datetime_format, engine and mode can be quite useful for everyone.

 |  Parameters
| ----------
| path : str
| Path to xls or xlsx file.
| engine : str (optional)
| Engine to use for writing. If None, defaults to
| ``io.excel.<extension>.writer``. NOTE: can only be passed as a keyword
| argument.
| date_format : str, default None
| Format string for dates written into Excel files (e.g. 'YYYY-MM-DD').
| datetime_format : str, default None
| Format string for datetime objects written into Excel files.
| (e.g. 'YYYY-MM-DD HH:MM:SS').
| mode : {'w', 'a'}, default 'w'
| File mode to use (write or append).
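
The date_format and datetime_format parameters aren’t demonstrated elsewhere on this page, so here is a minimal, hedged sketch of how they can be passed (the data and the dates.xlsx file name are just placeholders; how the formats are applied depends on the engine):

import pandas as pd

# a small data frame with a datetime column
df = pd.DataFrame({'when': pd.to_datetime(['2020-01-05', '2020-02-10'])})

writer = pd.ExcelWriter('Desktop/dates.xlsx',
                        date_format='YYYY-MM-DD',
                        datetime_format='YYYY-MM-DD HH:MM:SS')
df.to_excel(writer, "Dates", index=False)
writer.close()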

With the help of mode and engine parameters, code below will add an additional sheet to the Excel file instead of completely rewriting it with fresh data:

import pandas as pd

df = pd.DataFrame({'city': ['London', 'Aarhus', 'Oslo', 'Rotterdam'],
     'sun': ["no", "no", "no", "no"],
     'trips': ["Kenya", "Tanzania", "Spain", "Bahamas"]})

writer = pd.ExcelWriter('Desktop/test4.xlsx', engine= "openpyxl", mode= "a")
df.to_excel(writer, "Plans2")
writer.save()

Furthermore, you can specify some pretty useful parameters in to_excel method.

  • startrow=1,
  • index=False
  • header=None

to name a few. See below:

df.to_excel(writer, "Plans", startrow=1, index=False, header=None)

Another elegant approach here is to use the with ... as statement, which eliminates the need to call the .save() or .close() methods separately:

import pandas as pd

df = pd.DataFrame({'city': ['London', 'Aarhus', 'Oslo', 'Rotterdam']})

with pd.ExcelWriter('Desktop/test4.xlsx') as writer:
    df.to_excel(writer, "Plans2")

Text Files (.write() method)

f = 'Desktop/test.txt'
file = open(f, "r+", encoding='utf-8')

file.write("ALOHA")
file.close()

The following text will be written to the file:

ALOHA

Please note that there is a small but meaningful nuance with the file opening mode.

r+ will cause new data to be written to the start of the file (and possibly overwrite data partially or fully)

a+ on the other hand will open file in append mode and allow new data to be written at the end part of the file.

Json Files (.write() and json.dumps())

json.dumps can be used; the “s” stands for string, as it dumps a Python dictionary to a JSON-formatted string:

import json
f = 'Desktop/test.json'

file = open(f, "a+", encoding="utf-8")
file.write(json.dumps({111: 999}))

file.close()

The following will be written to the file (note that json converts the integer key to a string):

{"111": 999}

Another alternative is json.dump, which writes a Python dictionary directly to a file as JSON.

If you try the same structure with json.dump without changing anything you will get the error below:

TypeError: dump() missing 1 required positional argument: 'fp'

json.dump actually takes the file object as an argument, as follows:

import json
f = 'Desktop/test.json'

file = open(f, "a+", encoding="utf-8")
json.dump({111: 999}, file)

file.close()

The file will again end up containing:

{"111": 999}

CSV Files (csv.writer, writerows() method)

import csv

data=[["paris",555], ["moscow",888]]

f = 'Desktop/test.csv'
file=open(f, "a+", encoding="utf8")

csvwriter = csv.writer(file)
csvwriter.writerows(data)

file.close()

You can avoid an extra blank row in your csv file when writing by adding the lineterminator parameter like this:

csvwriter = csv.writer(file, lineterminator = '\n')

Updating files in Python

This part can be very important to understand, as opening, reading and writing to a file in Python is a slightly different experience than doing those operations through the operating system’s interface. For example:

In Windows:

  • You open a file
  • You change something
  • Save it and close it.

In Python:

  • You open a file
  • You read the file, which gives you a copy of its content in memory
  • You update that copy
  • You write the updated copy back to the file.

Now in Python, if you mess something up between the reading and writing steps, you can end up with an empty file.

This is because when you write to a file in Python (in “w” mode), everything is written from scratch.
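
As a generic sketch of that read-update-write cycle on a plain text file (the file name and the replacement are just illustrative):

f = "Desktop/test.txt"

with open(f, "r") as file:
    data = file.read()                  # read a copy of the current content

data = data.replace("NO", "MAYBE")      # update the copy in memory

with open(f, "w") as file:
    file.write(data)                    # write the updated copy back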

Okay, it’s not that confusing but takes a few examples/exercises to completely internalize. Let’s see:

Updating Excel Files (pd.ExcelWriter)

If you’d like to update an Excel Workbook by adding new sheets to it in Python this can be achieved by the method mentioned in Writing Excel files section (simply using “a” file mode while writing and specifying sheet name.)

However, if you’d like to append new data to an existing Excel worksheet, this can also be achieved with the pandas ExcelWriter class. This time, additionally, it will be helpful to read the data in the existing sheet before using the to_excel method, so the new data can be written below the existing rows rather than on top of them.

You can use this code when you need to update an existing sheet:

Please note the use of the startrow parameter, which, with the help of the length of the reader object, avoids overwriting the existing rows.

import openpyxl
import pandas as pd
from openpyxl import load_workbook

df = pd.DataFrame({'city': ['London', 'Aarhus', 'Oslo', 'Rotterdam']})

writer = pd.ExcelWriter('Desktop/test6.xlsx', engine='openpyxl', mode="a")
writer.book = load_workbook('Desktop/test6.xlsx')
writer.sheets = dict((ws.title, ws) for ws in writer.book.worksheets)

reader = pd.read_excel(r'Desktop/test6.xlsx', 'Plans')
df.to_excel(writer, 'Plans', index=False,header=False,startrow=len(reader)+1)

writer.close()

Updating Text Files (Simply .read(), .write() and .close())

You can directly write to a text file.

Just remember to open it in r+ or a+ modes:

f="Desktop/test.txt"

file = open(f,"a+")
file.write("\nALRIGHT NOW")
file.close()

file = open(f, "r")
data = file.read()
print(data)
YES
NO
ALRIGHT NOW

With r+ mode instead, the new data is written starting at the beginning of the file and overwrites the existing characters rather than being inserted before them:

f="Desktop/test.txt"

file = open(f,"r+")
file.write("\nALRIGHT NOW")
file.close()

file = open(f, "r")
data = file.read()
print(data)

ALRIGHT NOW

(Since "\nALRIGHT NOW" is longer than the original content, the old "YES" and "NO" lines are overwritten completely in this case.)

Updating Json Files (json.dump and json.dumps)

Json is a fantastic file format when you’re dealing with data.

It’s sort of like text but also structured, easy to read, easy to store, easy to work with and efficient.

Oftentimes while you’re building something cool you will need to save data locally so your progress doesn’t disappear or you might need to ship it across the network as well as cache your data.

Json is perfect for that. But also, you might have to do this data saving / caching in multiple steps. That’s where the updating comes in. Let’s see:

import json

f = 'Desktop/test.json'
new_data = {"b": 777}          # illustrative new key/value pair to merge in

with open(f,"r") as f1:
    data=json.load(f1)
    data.update(new_data)

with open(f,"w") as f2:
    f2.write(json.dumps(data))

In first block: 

  • json file is opened (in read mode)
  • json file is read as a dictionary (using load)
  • the data that was read is updated (attention: the data is updated, not the file yet)

In second block: 

  • json file is opened (in write mode)
  • json file gets written (with the updated data)
Alternatively, this can be achieved with a single file handle. Opening the file in r+ mode lets you read the existing data first and then rewind and rewrite it:

# assumes `args` (with .name and .url attributes) comes from argparse elsewhere
with open(f, 'r+') as file:
    feeds = json.load(file)          # read the existing json as a dictionary
    entry = {'name': args.name, 'url': args.url}
    feeds.update(entry)              # merge the new entry into the existing data
    file.seek(0)                     # rewind to the beginning before rewriting
    json.dump(feeds, file)
    file.truncate()                  # drop leftovers in case the new content is shorter

Updating CSV Files (csv.writer and csv.DictWriter)

There are 2 useful writers in the csv library that can be used and they work in a similar manner:

csv.writer: For iterables
csv.DictWriter: For dictionaries

import csv

data=[["Baku",555], ["Sao Paolo",888]]
f = 'Desktop/test.csv'
file = open(f, "a+")    # a+ appends the new rows to the end instead of overwriting from the start

csvwriter = csv.writer(file, lineterminator='\n')
csvwriter.writerows(data)

file.close()
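
For completeness, here is a minimal sketch of the csv.DictWriter variant, assuming the file has the "city" and "order" columns used in the earlier examples:

import csv

rows = [{"city": "Baku", "order": 555}, {"city": "Sao Paolo", "order": 888}]

with open('Desktop/test.csv', 'a+', newline='') as file:
    writer = csv.DictWriter(file, fieldnames=["city", "order"])
    # writer.writeheader()   # only needed when creating the file from scratch
    writer.writerows(rows)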

Python File Tips

Here are some special tips that can be helpful while you’re working with files in the future.

1. Use raw string format to avoid path conflicts

Sometimes a tiny glitch can rob you of hours of fruitful work or study. This probably resonates with all of us, but new programmers especially suffer from the odd glitches of the computer world.

If you’d like to avoid tons of little conflicts and headaches caused by mismatches between OS paths and Python or its libraries, try using the r”” raw string format while typing your file or directory paths in your code.

It’s as simple as adding r in front of the traditional string type data:

a = "c://Users/ABC/Desktop"   # regular string path, you're at binary gods' mercy
b = r"c://Users/ABC/Desktop"   # raw string path, cool as a cucumber

2. Use with open() as for a practical operation

When you use with open() as while working with files, it offers multiple advantages (see the sketch after this list):

  • It gives structure to your code
  • You don’t have to worry about opening and closing the file separately
  • You will avoid conflict with other parts of your program where the same file might be used or referenced
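
A minimal sketch of the pattern (the file name is just a placeholder):

with open("Desktop/test.txt", "r", encoding="utf-8") as file:
    data = file.read()
    print(data)
# no .close() needed: the file is closed automatically when the block ends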

3. Work with backup files in the beginning

Working with files from your terminal or IDE can be a harsh experience in the beginning. Until you really get the hang of it, work with backed up / copied files so you don’t lose important data and have to start from the beginning.

Simply opening a file in “w” mode can cause your file’s content, information or data to completely vanish. And mistakes will happen in the beginning. Just work with files for a while and everything will start getting easier and more intuitive.

4. Use appropriate methods/libraries

To work with JSON there is a whole library for it, to work with Excel, Pandas is usually a no brainer, CSV has its own library and TXT files can be read as strings directly.

Just make a mental note that GUI file operations are not the same experience as working with files between your code lines.

This is the kitchen side of things; you’ll need to be a bit more sophisticated than double clicking a file to open it and clicking the cross to close the window. As long as you make a mental note of the differences and extra steps that may be involved sometimes (such as the .read() method or the .close() method) you will be fine. But if you never take a minute to really understand the subtle differences you will likely keep forgetting the sequence of things and keep coming back. Which is totally fine, it’s part of the process too.

This page will be here whenever you’d like to visit and refresh your knowledge.

5. Dealing with ParserError

When pandas raises a ParserError while reading a file, you can try the error_bad_lines parameter.

Be careful, as it will cause the lines that are not readable or are corrupt to be skipped.

data=pd.read_csv('Desktop/test.csv')
ParserError: Error tokenizing data. C error: Expected 2 fields in line 10, saw 3

data=pd.read_csv('Desktop/test.csv', error_bad_lines=False)

6. Dealing with UnicodeDecodeError

This error usually arises from encoding format confusion. Please note that sometimes corrupt data will also cause it. More often than not you shouldn’t have to deal with any encoding format other than “utf-8” (also written “utf8”), as it has become the de facto global standard for data.

However, this doesn’t mean other encoding formats don’t exist at all and that you’ll never encounter them.

If you get an error with utf-8, you can give latin1 a try, as that sometimes fixes the encoding errors.

data=pd.read_csv('Desktop/test.csv', encoding="utf-8")
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 15-16: invalid continuation byte

data=pd.read_csv('Desktop/test.csv', encoding="latin1")

Here are some common encoding formats:

utf-8 – Global standard (80%+ of the encoding in the wild wild web)

ASCII – Old format for main alphanumerical characters

utf-16 – More efficient with Chinese characters where each character doesn’t have a fixed size

latin1 (ISO-8859-1; the ISO-8859-15 variant adds the Euro sign) – common in Western Europe

ISO-2022-JP – useful for Japanese characters

7. Skipping extra rows when writing CSV files

This problem can be solved by specifying the lineterminator parameter of the csv.writer object.

csvwriter = csv.writer(file, lineterminator = '\n')

8. “a” is for append mode

If you open a file in “a” or “a+” mode and then use the .write() method, it will write to the end of the file; in other words, it will append.
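
For example, a quick sketch (notes.txt is a hypothetical file; it will be created if it doesn’t exist):

with open("notes.txt", "a") as file:    # "a": append, create if missing
    file.write("another line\n")        # always lands at the end of the file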

9. Excel engines for different purposes

When working with Excel files through the pandas library, you may be inclined to think everything is being handled by pandas. But this is not the case. The Python world is usually a co-dependent ecosystem and every library has its specialization.

Long story short, pandas uses Excel engines to handle Excel files and you might have to specify the exact engine to avoid errors in some cases. Openpyxl is a pretty common and strong one so it’s recommended. You can specify it as below with the engine parameter:

writer = pd.ExcelWriter('Desktop/test6.xlsx', mode="a", engine="openpyxl")

For instance, xlsxwriter doesn’t offer append mode while openpyxl does and you’ll need to adjust based on your operational needs.

Python Files Cheat Sheet

Methods

open()

read(), readline(), readlines()

close()

pd.read_excel()

pd.read_csv()

json.load(), json.loads(), json.dump(), json.dumps()

Parameters

encoding="utf-8", encoding="latin1"

error_bad_lines=False, error_bad_lines=True

engine="python", engine="c"

r""

Opening modes: r, w, r+, a, a+

Thank you for visiting this tutorial. Please feel free to share it with your friends and followers if it was useful for you.

For further reading you can check out the Official Python documentation about File Formats here.

Pandas Official Documents about reading and writing to files: here.

Python JSON library here. And,

CSV library here.