I like Google Colaboratory for multiple reasons.
First of all, the code runs on someone else’s machine, so I can do something else on my laptop while the code is running, and the laptop does not overheat ;)
The second reason is, of course, effortless code sharing. Just click the share button, copy the link, and send it to someone else.
There is only one little problem: loading data into Colaboratory. Fortunately, you can store your dataset in Google Drive and import it fairly easily.
Setup
Most of the setup is described in the predefined code snippet that lists files in Google Drive. We can copy and paste this part:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
import os
import pandas as pd
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
What does it do? It imports the libraries that give us access to Google Drive and authorizes the Google Cloud SDK to access the Google Drive of the currently logged-in user. As a result, you can access your files from Python code running in Colaboratory.
Google Drive id
Unfortunately, I could not find a way to open a file using its full path as we usually do. So if I store a file called test.csv in the directory data/test_dataset, I cannot use the path /data/test_dataset/test.csv to access it.
Google Drive uses file and directory ids to identify locations. Hence, to find the id of the directory, I have to open the data/test_dataset directory in my browser and copy the identifier from the URL.
As far as I know, finding the identifier of a file is not as easy. To find it, we must list the files in the directory:
listed = drive.ListFile({'q': "title contains 'test.csv' and '1ANnCDVS281y486EVBqm_MDadxjkelxZM' in parents"}).GetList()
for file in listed:
    print('title {}, id {}'.format(file['title'], file['id']))
The code prints the names and identifiers of the files in the directory whose title contains 'test.csv'. Copy the identifier of the file you want to open. You are going to need it.
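If you would rather not copy identifiers by hand, the same listing call can be repeated for every component of a path. The sketch below is only an illustration: `build_query` and `resolve_path` are hypothetical helper names, not part of PyDrive. It assumes that `drive` is the `GoogleDrive` object created in the setup, that `'root'` can be used as the parent id of the top-level folder, and that taking the first match is acceptable (Google Drive allows several files with the same title in one directory).

```python
def build_query(title, parent_id):
    """Build a Drive search query for an exact title inside one parent directory."""
    # Single quotes inside the title have to be escaped with a backslash.
    safe_title = title.replace("'", "\\'")
    return "title = '{}' and '{}' in parents and trashed = false".format(safe_title, parent_id)


def resolve_path(drive, path, root_id='root'):
    """Walk a slash-separated path and return the id of its last component."""
    current_id = root_id
    # Skip empty components so that '/data/x' and 'data/x' behave the same.
    for name in [part for part in path.split('/') if part]:
        matches = drive.ListFile({'q': build_query(name, current_id)}).GetList()
        if not matches:
            raise FileNotFoundError('{} not found in {}'.format(name, current_id))
        # Assumption: the first match is the one we want.
        current_id = matches[0]['id']
    return current_id
```

With these helpers, `resolve_path(drive, 'data/test_dataset/test.csv')` should return the same identifier that the listing above prints, at the cost of one request per path component.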
Now you have everything you need to load data from Google Drive into Pandas.
Copy data from Google Drive to Colaboratory
First of all, let’s create a local directory to store a copy of the file:
download_path = os.path.expanduser('~/data')
os.makedirs(download_path)
There is one little problem with this code. If you rerun the notebook cell that contains it, the code will fail because the directory already exists. If you want to ignore that error, the code should look like this:
download_path = os.path.expanduser('~/data')
try:
    os.makedirs(download_path)
except FileExistsError:
    pass
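A shorter variant of the same idea, assuming Python 3.2 or newer, is to pass `exist_ok=True` to `os.makedirs`. The call then does nothing when the directory is already there, so the cell can be rerun safely:

```python
import os

download_path = os.path.expanduser('~/data')
# exist_ok=True turns the call into a no-op when the directory already exists.
os.makedirs(download_path, exist_ok=True)
# Running the same line again (for example, by rerunning the cell) is safe.
os.makedirs(download_path, exist_ok=True)
```

Note that `FileExistsError` is still raised if the path exists but is a regular file rather than a directory.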
Now we have the file id and the output directory. We can copy the file from Google Drive:
output_file = os.path.join(download_path, 'test.csv')
temp_file = drive.CreateFile({'id': 'the_file_id'})
temp_file.GetContentFile(output_file)
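Rerunning the notebook downloads the file again every time. If that becomes slow, the download can be wrapped in a small function that skips it when a local copy is already present. This is a minimal sketch: `fetch_from_drive` and its `overwrite` parameter are hypothetical names introduced for illustration, and it assumes `drive` is the `GoogleDrive` object created in the setup.

```python
import os


def fetch_from_drive(drive, file_id, output_file, overwrite=False):
    """Copy a Drive file to output_file unless a local copy already exists."""
    if os.path.exists(output_file) and not overwrite:
        # Reuse the copy downloaded by an earlier run of the cell.
        return output_file
    remote_file = drive.CreateFile({'id': file_id})
    remote_file.GetContentFile(output_file)
    return output_file
```

Calling `fetch_from_drive(drive, 'the_file_id', output_file)` then replaces the last two lines of the snippet above. Pass `overwrite=True` when the file in Google Drive has changed and you need a fresh copy.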
Load the file in Pandas
Now comes the part that looks familiar. Just load the file into a Pandas DataFrame:
data = pd.read_csv(output_file)