---
title: "Visualize common elements of two datasets using NetworkX"
description: "How to use an undirected graph to visualize common elements of two Pandas data frames"
author: "Bartosz Mikulski"
author_bio: "Principal AI Engineer & MLOps Architect. I bridge the gap between \"it works in a notebook\" and \"it works for 200 million users.\""
author_url: https://mikulskibartosz.name
author_linkedin: https://www.linkedin.com/in/mikulskibartosz/
author_github: https://github.com/mikulskibartosz
canonical_url: https://mikulskibartosz.name/visualize-common-elements-of-two-datasets-using-networkx
---

Yet another story from the “What’s cooking?” Kaggle competition. I was looking at other people’s Kaggle kernels and found <a href="https://www.kaggle.com/ash316/what-is-the-rock-cooking-ensembling-network">a very interesting one</a>.

The author noticed that pairs of ingredients (for example salt + pepper, olive oil + vinegar, eggs + bacon) are one of the distinctive characteristics of a cuisine. In the original kernel, the author used NLTK to convert the ingredients to bigrams. This solution has one huge problem: an ingredient like “olive oil” becomes the tuple (“olive”, “oil”). Two words, one ingredient. Not what I wanted.

## From a list of ingredients to bigrams

Our starting point is a dataset which looks like this:

![The input dataset](/images/2018-08-03-visualize-common-elements-of-two-datasets-using-networkx/bigrams.png)

Every row consists of an identifier, the name of the cuisine and a list of ingredients. I want a list of pairs. If the ingredient list has three elements: “eggs, salt, pepper” I want three pairs: (“eggs”, “salt”), (“eggs”, “pepper”), and (“salt”, “pepper”).

```
from itertools import combinations

# Sorting each pair makes ("salt", "eggs") and ("eggs", "salt") the same tuple
train['bigrams'] = train.ingredients.apply(lambda x: [tuple(sorted(pair)) for pair in combinations(x, 2)])
```

![Bigram lists in a separate column](/images/2018-08-03-visualize-common-elements-of-two-datasets-using-networkx/bigram_lists.png)
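
To make the effect of `combinations` and `sorted` concrete, here is a minimal sketch that runs the same expression on a single, made-up ingredient list instead of the whole data frame. The variable names are only for illustration.

```python
from itertools import combinations

ingredients = ["eggs", "salt", "pepper"]

# Every unordered pair of ingredients, with the two names sorted alphabetically
pairs = [tuple(sorted(pair)) for pair in combinations(ingredients, 2)]
print(pairs)
# [('eggs', 'salt'), ('eggs', 'pepper'), ('pepper', 'salt')]

# The same two ingredients listed in a different order give an identical tuple
reordered = [tuple(sorted(pair)) for pair in combinations(["pepper", "salt"], 2)]
print(reordered)
# [('pepper', 'salt')]
```

Because of the sorting, a pair is stored as ('pepper', 'salt') no matter how the recipe ordered the two ingredients, so the same pair coming from different recipes can be counted together later.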

## Visualise common pairs of ingredients

In the next step, I want to find the most popular pairs of ingredients. Then I want to create a graph with edges between each cuisine and its most popular pairs.

First, I have to convert the lists of bigrams to data frame rows, one pair per row:

```
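import pandas as pd

# Spread every list of bigrams over separate columns, join the id and cuisine
# back in, then melt the columns into rows and drop the empty cells
ingredient_to_pairs = train.bigrams.apply(pd.Series) \
    .merge(train, right_index = True, left_index = True) \
    .drop(["ingredients", "bigrams"], axis = 1) \
    .melt(id_vars = ['cuisine', 'id'], value_name = "bigrams") \
    .drop("variable", axis = 1) \
    .dropna()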
```

![Bigram lists converted to rows](/images/2018-08-03-visualize-common-elements-of-two-datasets-using-networkx/bigram_pairs.png)
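
As an aside, the same reshaping can probably be done with `DataFrame.explode`, which is available in pandas 0.25 and later. This is a different technique from the merge-and-melt chain used in this post, and the tiny `train` frame below is made-up data standing in for the competition dataset, so treat it as a sketch.

```python
from itertools import combinations

import pandas as pd

# Made-up stand-in for the competition data
train = pd.DataFrame({
    "id": [1, 2],
    "cuisine": ["italian", "mexican"],
    "ingredients": [["eggs", "salt", "pepper"], ["salt", "lime"]],
})
train["bigrams"] = train.ingredients.apply(
    lambda x: [tuple(sorted(pair)) for pair in combinations(x, 2)]
)

# explode() puts every element of a list in its own row and repeats id and cuisine
ingredient_to_pairs = (
    train.drop(columns="ingredients")
    .explode("bigrams")
    .dropna(subset=["bigrams"])  # recipes with fewer than two ingredients have no pairs
    .reset_index(drop=True)
)
print(ingredient_to_pairs)
```

The result has one row per pair with the columns `id`, `cuisine`, and `bigrams`, and it never builds the wide intermediate frame that `apply(pd.Series)` creates.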

Now I have to count how often each pair occurs in a cuisine, sort the pairs by that count, and select the 25 most popular ones.

```
# After count(), the "id" column holds the number of recipes that contain the pair
mexican = ingredient_to_pairs[ingredient_to_pairs["cuisine"] == "mexican"] \
    .drop(columns = "cuisine") \
    .groupby(["bigrams"]).count().sort_values("id", ascending = False)[:25]
mexican['cuisine'] = 'mexican'
italian = ingredient_to_pairs[ingredient_to_pairs["cuisine"] == "italian"] \
    .drop(columns = "cuisine") \
    .groupby(["bigrams"]).count().sort_values("id", ascending = False)[:25]
italian['cuisine'] = 'italian'
# Stack both top-25 lists and turn the "bigrams" index back into a column
combined = pd.concat([mexican, italian])
combined = combined.reset_index()
```
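
The two nearly identical blocks above could be folded into one helper. The sketch below assumes the same column names (`id`, `cuisine`, `bigrams`); the function name `top_pairs` and the toy data are hypothetical and only there to make the example runnable.

```python
import pandas as pd


def top_pairs(pairs, cuisine, n=25):
    """Return the n most frequent pairs of one cuisine, with the count in 'id'."""
    counts = (
        pairs[pairs["cuisine"] == cuisine]
        .drop(columns="cuisine")
        .groupby("bigrams")
        .count()
        .sort_values("id", ascending=False)
        .head(n)
    )
    counts["cuisine"] = cuisine
    return counts.reset_index()


# Made-up stand-in for ingredient_to_pairs
toy_pairs = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "cuisine": ["mexican", "mexican", "mexican", "italian", "italian"],
    "bigrams": [("lime", "salt"), ("lime", "salt"), ("corn", "salt"),
                ("basil", "garlic"), ("garlic", "salt")],
})

combined_toy = pd.concat(
    [top_pairs(toy_pairs, cuisine, n=2) for cuisine in ["mexican", "italian"]],
    ignore_index=True,
)
print(combined_toy)
```

With a helper like this, comparing a different pair of cuisines only requires changing the list of names.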

Finally, I can generate the graph using NetworkX. I use the circular layout because it makes it trivial to spot the ingredients popular in both cuisines.

```
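import matplotlib.pyplot as plt
import networkx as nx

g = nx.from_pandas_edgelist(combined, source = 'cuisine', target = 'bigrams')
pos = nx.circular_layout(g)
cmap = plt.cm.RdYlGn
colors = list(range(len(g.nodes())))
# node_size needs one value per node (not per row of combined), so sum the counts
# of every pair; the two cuisine nodes get the largest size
counts = {}
for pair, count in zip(combined['bigrams'], combined['id']):
    counts[pair] = counts.get(pair, 0) + count
sizes = [counts.get(node, max(counts.values())) * 4 for node in g.nodes()]
nx.draw_networkx(g, pos, node_size = sizes, edge_color = 'grey', cmap = cmap, node_color = colors, font_size = 15, width = 3)
plt.title("Top 25 Bigrams for Mexican and Italian cuisine", fontsize = 40)
plt.gcf().set_size_inches(60, 60)
plt.show()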
```

![The most popular ingredients in Mexican and Italian cuisine](/images/2018-08-03-visualize-common-elements-of-two-datasets-using-networkx/result.png)
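
The pairs popular in both cuisines can also be read from the graph itself instead of the picture: a pair connected to both cuisine nodes has two edges. Here is a minimal sketch of that idea. The small `combined` frame holds made-up counts, not values from the competition data.

```python
import networkx as nx
import pandas as pd

# Made-up stand-in for the combined frame
combined = pd.DataFrame({
    "bigrams": [("garlic", "salt"), ("lime", "salt"),
                ("garlic", "salt"), ("basil", "garlic")],
    "id": [30, 20, 40, 25],
    "cuisine": ["mexican", "mexican", "italian", "italian"],
})
g = nx.from_pandas_edgelist(combined, source="cuisine", target="bigrams")

cuisines = set(combined["cuisine"])
# A pair linked to both cuisine nodes has degree 2
shared = [node for node, degree in g.degree()
          if node not in cuisines and degree == 2]
print(shared)
# [('garlic', 'salt')]
```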