I started the “What’s cooking?” Kaggle challenge and wanted to do some data analysis. The given data set consists of three columns. Unfortunately, the last one is a list of ingredients.
Table of Contents
data:image/s3,"s3://crabby-images/c6c8e/c6c8e024efee19310278b2dbfba3301b772568ef" alt=""
I wanted to calculate how often an ingredient is used in every cuisine and how many cuisines use the ingredient. I had to split the list in the last column and use its values as rows. Additionally, I had to add the correct cuisine to every row.
Let’s look at an example. If my dataset looks like this:
cuisine_1,id_1,[ingredient_1, ingredient_2, ingredient_3]
cuisine_2,id_2,[ingredient_4, ingredient_5]
Note that my third column is a list of strings! If you have a column with a string that contains multiple values separated by a comma, you have to split those strings into lists of strings first!
I want that output:
cuisine_1,id_1,ingredient_1
cuisine_1,id_1,ingredient_2
cuisine_1,id_1,ingredient_3
cuisine_2,id_2,ingredient_4
cuisine_2,id_2,ingredient_5
In this article, I am going to show you how to do it in two ways. First, I will use the for
loops. Later, I will use only built-in Pandas functions.
The for loop way
My first idea was to iterate over the rows and put them into the structure I want. I wrote some code that was doing the job and worked correctly but did not look like Pandas code. Look at this, I dissected the data frame and rebuilt it:
ingredients = []
cuisines = []
ids = []
for _, row in data.iterrows():
cuisine = row.cuisine
identifier = row.id
for ingredient in row.ingredients:
cuisines.append(cuisine)
ingredients.append(ingredient)
ids.append(identifier)
ingredient_to_cuisine = pd.DataFrame({
"id": ids,
"ingredient": ingredients,
"cuisine": cuisines
})
There must be a better way to do it.
Want to build AI systems that actually work?
Download my expert-crafted GenAI Transformation Guide for Data Teams and discover how to properly measure AI performance, set up guardrails, and continuously improve your AI solutions like the pros.
The better way
Let’s do it step by step. Before each step, I will explain what function I am going to use and why. If you are interested in the full code with no explanation, scroll to the last code snippet.
Firstly, we have to split the ingredients
column (which contains a list of values) into new columns. It is easy to do, and the output preserves the index. The index is important. We are going to need it.
data.ingredients.apply(pd.Series)
data:image/s3,"s3://crabby-images/d3c65/d3c65fc327a400149088879c527007b518a7f84b" alt="For testing, I limited the data set to three rows."
Now we can merge the new columns with the rest of the data set. There is a lot of empty values, but that is fine. We will get rid of them later:
data.ingredients.apply(pd.Series) \
.merge(data, left_index = True, right_index = True)
data:image/s3,"s3://crabby-images/7e1c2/7e1c20292dbe8a07bdd400422f9a93e1ae5dc165" alt=""
First of all, I don’t need the old ingredients
column anymore. So, let’s drop it:
data.ingredients.apply(pd.Series) \
.merge(data, right_index = True, left_index = True) \
.drop(["ingredients"], axis = 1)
data:image/s3,"s3://crabby-images/48f77/48f7744142939c8d368d8ad40d51d1e107d99943" alt=""
Now we can transform the numeric columns into separate rows using the melt
function. Note that, I use the cuisine
and the id
as the identifier variables:
data.ingredients.apply(pd.Series) \
.merge(data, right_index = True, left_index = True) \
.drop(["ingredients"], axis = 1) \
.melt(id_vars = ['cuisine', 'id'], value_name = "ingredient")
data:image/s3,"s3://crabby-images/a15e5/a15e53cb19b4d857896b5c3eedb398bb41985f08" alt=""
It looks like the variable
column contains the ids of the numeric columns. It is useless therefore we can remove that too:
data.ingredients.apply(pd.Series) \
.merge(data, right_index = True, left_index = True) \
.drop(["ingredients"], axis = 1) \
.melt(id_vars = ['cuisine', 'id'], value_name = "ingredient") \
.drop("variable", axis = 1)
data:image/s3,"s3://crabby-images/0871b/0871b4a83ca0388a4eccfc44f642eef2317d6c34" alt=""
Now, it is time to remove the empty values:
data.ingredients.apply(pd.Series) \
.merge(data, right_index = True, left_index = True) \
.drop(["ingredients"], axis = 1) \
.melt(id_vars = ['cuisine', 'id'], value_name = "ingredient") \
.drop("variable", axis = 1) \
.dropna()
data:image/s3,"s3://crabby-images/295b2/295b27e5d1eb2faa1893c32150d33669585cc065" alt=""
Done! Now we have a data set that has the ingredients in separate rows.