How to load data from Google Drive to Pandas running in Google Colaboratory

I like Google Colaboratory for multiple reasons.

First of all, the code runs on someone else’s machine, so I can do something else on my laptop while the code is running, and the laptop does not overheat ;)

The second reason is, of course, effortless code sharing. Just click the share button, copy the link, and send it to someone else.

There is only one little problem: loading data into Colaboratory. Fortunately, you can store your dataset in Google Drive and import it fairly easily.

Setup

Most of the setup is described in the predefined code snippet that lists files in Google Drive. We can copy and paste this part:

!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
import os
import pandas as pd
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

What does it do? It imports the libraries that give us access to Google Drive and allows the Google Cloud SDK to access the Google Drive of the currently logged-in user. As a result, you can access your files from Python code running in Colaboratory.

Google Drive id

Unfortunately, I could not find a way to open a file using its full path as we usually do. So if I store a file called test.csv in the directory data/test_dataset, I cannot use the path /data/test_dataset/test.csv to access it.

Google Drive uses file and directory ids to identify locations. Hence, to find the id of the directory, I have to open the data/test_dataset directory in my browser and copy the identifier from the URL.
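If you prefer not to copy the identifier by hand, it can be pulled out of the URL with a regular expression. This is only a sketch: the function name folder_id_from_url is made up for illustration, the example URL is hypothetical, and it assumes the folder URL contains a /folders/<id> segment, which is how folder links looked when I checked.

```python
import re

def folder_id_from_url(url):
    """Return the id that follows '/folders/' in a Drive folder URL, or None."""
    # Assumption: ids consist of letters, digits, '-' and '_'
    match = re.search(r"/folders/([A-Za-z0-9_-]+)", url)
    return match.group(1) if match else None

# Hypothetical URL built around a folder id
url = "https://drive.google.com/drive/folders/1ANnCDVS281y486EVBqm_MDadxjkelxZM"
folder_id = folder_id_from_url(url)
print(folder_id)
```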

As far as I know, finding the identifier of a file is not as easy. To find it, we must list the files in the directory:

listed = drive.ListFile({'q': "title contains 'test.csv' and '1ANnCDVS281y486EVBqm_MDadxjkelxZM' in parents"}).GetList()
for file in listed:
  print('title {}, id {}'.format(file['title'], file['id']))

The code prints the titles and identifiers of the files in the directory that match the query. Copy the identifier of the file you want to open. You are going to need it.
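Because the query uses "contains", and because Drive allows several files with the same title in one directory, the list may hold more than one entry. A small helper can pick the id only when the match is unambiguous. This is a sketch, not part of PyDrive: single_file_id is a made-up name, and the list below is a stand-in with invented ids for what GetList() returns.

```python
def single_file_id(listed, title):
    """Return the id of the only entry whose title equals `title` exactly."""
    matches = [f for f in listed if f['title'] == title]
    if len(matches) != 1:
        raise ValueError(
            'expected exactly one file titled {!r}, found {}'.format(title, len(matches)))
    return matches[0]['id']

# Stand-in for the result of drive.ListFile(...).GetList(); the ids are made up
example_listed = [
    {'title': 'test.csv', 'id': 'made_up_id_1'},
    {'title': 'test.csv.bak', 'id': 'made_up_id_2'},
]
file_id = single_file_id(example_listed, 'test.csv')
print(file_id)
```

In the notebook, you would pass the real listed variable instead of example_listed.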

Now you have everything you need to load data from Google Drive into Pandas.

Copy data from Google Drive to Colaboratory

First of all, let’s create a local directory to store a copy of the file:

download_path = os.path.expanduser('~/data')
os.makedirs(download_path)

There is one little problem with this code. If you rerun the notebook cell that contains it, the code will fail because the directory already exists. If you want to ignore this error, the code should look like this:

download_path = os.path.expanduser('~/data')
try:
  os.makedirs(download_path)
except FileExistsError:
  pass
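On Python 3.2 or newer, the same effect is available without the try block through the exist_ok argument of os.makedirs. The sketch below uses a temporary directory instead of ~/data (an assumption made only so the example does not touch your home directory).

```python
import os
import tempfile

# A temporary base directory stands in for the home directory here
base = tempfile.mkdtemp()
download_path = os.path.join(base, 'data')

os.makedirs(download_path, exist_ok=True)  # creates the directory
os.makedirs(download_path, exist_ok=True)  # rerun: nothing happens, no exception
```

Note that exist_ok only silences the "already exists" case for directories. If a regular file occupies that path, the call still raises an error.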

Now we have the file id and the output directory. We can copy the file from Google Drive:

output_file = os.path.join(download_path, 'test.csv')
temp_file = drive.CreateFile({'id': 'the_file_id'})
temp_file.GetContentFile(output_file)
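If you download more than one file, the three lines above can be wrapped in a function. This is a minimal sketch: download_from_drive is a name I made up, and it relies only on the CreateFile and GetContentFile calls already used above, with the drive object passed in as a parameter.

```python
import os

def download_from_drive(drive, file_id, download_path, file_name):
    """Copy one Drive file into download_path and return the local path."""
    # Make sure the target directory exists (safe to call repeatedly)
    os.makedirs(download_path, exist_ok=True)
    output_file = os.path.join(download_path, file_name)
    # Same two calls as in the snippet above
    drive_file = drive.CreateFile({'id': file_id})
    drive_file.GetContentFile(output_file)
    return output_file
```

In the notebook, the call would look like output_file = download_from_drive(drive, 'the_file_id', download_path, 'test.csv').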

Load the file in Pandas

Now it is time for something that looks familiar. Just load the file into a Pandas DataFrame:

data = pd.read_csv(output_file)