What ingredients define your Cuisine?

Dan Kopf of Priceonomics does a wonderful job of exploring various facets of the cuisines of the world, or rather those available in the Epicurious dataset. In the article he looks at the most commonly used ingredients across various cuisines, the most distinguishing ingredients of each cuisine and the types of meat used in each.

Inspired by his article, I did some follow-up work to replicate some of his findings, discover some new insights and also use the dataset to build some machine learning models to predict cuisines based on their ingredients. While the accuracy of the models is nothing to write home about (which was never the goal to begin with), it was a fun little exercise to refine my ML chops. What follows is a series (2-3 perhaps) of blog posts describing both the insights from the data and the models.

To start off, Dan (and I) got the data from here. Dan curiously uses data only from Epicurious; I use both csv files (Epicurious and Allrecipes).

For my data analysis, I use the wonderful Python library pandas and the equally remarkable IPython notebook (with a little help from PyCharm for long-running tasks).

If you want to follow along, you can get the IPython notebook from my GitHub repo here.

The Dataset

The dataset is a collection (actually just 2) of csv files that contain the cuisine name in the first column and the ingredients of that dish in the following columns. Since the number of ingredients is not predefined, the number of columns varies from dish to dish.

Pandas does not seem to handle a variable number of columns in a csv very well, so I import each line (cuisine + ingredients) into a single column. Once the data is ingested into a one-column dataframe, I use a couple of lambda functions with apply to split the single column into two columns (cuisine + ingredients). Following is the code to do that.
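A minimal sketch of that import-then-split step, with two made-up rows standing in for the real files. The column names `line`, `cuisine` and `ingredients` are illustrative placeholders, and the tab separator assumes the data itself contains no tabs.

```python
import io

import pandas as pd

# Two made-up rows standing in for the real csv files (hypothetical data).
raw = io.StringIO(
    "chinese,soy_sauce,ginger,garlic\n"
    "italian,olive_oil,tomato\n"
)

# Assumption: the data has no tabs, so a tab separator keeps each
# whole line in a single column instead of splitting on commas.
df = pd.read_csv(raw, sep="\t", header=None, names=["line"])

# The first field is the cuisine, everything after it is an ingredient.
df["cuisine"] = df["line"].apply(lambda s: s.split(",", 1)[0])
df["ingredients"] = df["line"].apply(lambda s: s.split(",")[1:])
df = df.drop(columns="line")

print(df)
```

Keeping the ingredients as a list per row makes it easy to count them later; joining them with spaces gives a string that a text vectorizer can consume.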

After a little cleaning and standardization (china, chinese -> Chinese; italian, italy -> Italian) I head on to data exploration.
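One way such a standardization could look, as a rough sketch: the `canonical` mapping below only covers the two examples mentioned above and is an assumption, not the full list of fixes applied to the real data.

```python
import pandas as pd

# Made-up cuisine labels with inconsistent spellings (hypothetical data).
cuisines = pd.Series(["china", "chinese", "Italian", "italy", "Thai"])

# Hypothetical mapping from lower-cased variants to one canonical name.
canonical = {
    "china": "Chinese",
    "chinese": "Chinese",
    "italian": "Italian",
    "italy": "Italian",
}

# Normalize case and whitespace, map known variants,
# and keep the original label where no mapping exists.
lowered = cuisines.str.strip().str.lower()
cleaned = lowered.map(canonical).fillna(cuisines)

print(cleaned.tolist())
```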


To derive any insights from this text data, the first step is to convert the text data (ingredients) into numerical feature vectors. This is done by first tokenizing the data and then building a matrix with one row per recipe and one column per ingredient. Sklearn has many more simple and advanced examples of working with text here.

Here is the code that uses the CountVectorizer class to vectorize my ingredients list.
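A minimal sketch of the vectorization step on three made-up recipes. It assumes each recipe's ingredients have been joined into one space-separated string; the default token pattern treats underscores as word characters, so `soy_sauce` stays a single token.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up recipes, one space-separated string of ingredients each.
recipes = [
    "soy_sauce ginger garlic",
    "olive_oil tomato garlic",
    "olive_oil basil",
]

# Each distinct ingredient becomes one column of a sparse count matrix.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(recipes)

# vocabulary_ maps each ingredient to its column index.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(vocab)
print(X.toarray())
```

Row i of `X` says which ingredients recipe i contains, which is the numerical form the later counts and models work from.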

Data Exploration

Now that the data janitoring is all done, the data frame is ready for some exploration.

One thing to remember is that most of these recipes came from Epicurious, Gourmet and other magazines which cater to a predominantly North American readership. So it is no wonder that a significant chunk of the recipes have their cuisine listed as North American (American, Canadian, Cajun_Creole, Southern_SoulFood, Southwestern).

One of the first things to look at is the distribution of these cuisines in our collection. A simple groupby in pandas does the deed.

Here is the output we see. Clearly North American cuisines comprise over 75% of the dataset. Latin American is a distant second.
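As a sketch of that groupby on a toy data frame (the five rows and the `cuisine` column name are made up for illustration):

```python
import pandas as pd

# Toy data frame: one row per recipe (hypothetical data).
df = pd.DataFrame(
    {"cuisine": ["American", "American", "American", "Mexican", "Indian"]}
)

# Number of recipes per cuisine, largest first.
counts = df.groupby("cuisine").size().sort_values(ascending=False)

# The same distribution as a percentage of all recipes.
share = (counts / counts.sum() * 100).round(1)

print(counts)
print(share)
```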


Now that we have the caveat out of the way, let's look at some interesting things hidden in the data.

One quick visualization is to build histograms of the number of ingredients that typically go into a particular cuisine. From my experience (of tasting many cuisines, not making any of them) most South Asian cuisines tend to have a large number of ingredients, while most North American, British and West European ones tend to be a little parsimonious. Let's see if the data bears out my hypothesis.

Here is a histogram of the top 25 cuisines listed in my dataset.


These histograms show the count of ingredients on the X-axis and the number of dishes with that many ingredients on the Y-axis. What's interesting is how similar American, Canadian, French and other Western European cuisines are: a large number of dishes with 5-9 ingredients and very few beyond that. If for a minute you exclude staples like oil, salt and meat, there are very few other ingredients in most dishes in these cuisines. Now contrast them with something like Indian, which shows an almost bimodal or trimodal distribution, or Thai, which possibly has more dishes with a greater number of ingredients than with fewer. No wonder Indian food is among the most delicious cuisines, but it is also expensive to make and, surprisingly, not so popular in the US.

Another interesting distribution is that of Scandinavian cuisine, where the number of dishes with more than 10 ingredients falls off much more rapidly than in most other cuisines.

Here are the histograms done at the cuisine group level, which show very similar patterns.


Here is the Python code that does the data munging and generates the nice pictures above.
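A rough sketch of the munging behind such histograms, on made-up recipes. The column names are illustrative, and only the counting is shown; the plotting call is mentioned in the note after the block.

```python
import pandas as pd

# Made-up recipes with their ingredient lists (hypothetical data).
df = pd.DataFrame(
    {
        "cuisine": ["Indian", "Indian", "French", "French", "French"],
        "ingredients": [
            ["cumin", "turmeric", "onion", "ginger", "garlic", "cayenne"],
            ["rice", "cumin", "onion"],
            ["butter", "wheat", "egg"],
            ["butter", "cream", "egg"],
            ["olive_oil", "garlic"],
        ],
    }
)

# Number of ingredients in each dish.
df["n_ingredients"] = df["ingredients"].apply(len)

# Rows: cuisine, columns: ingredient count, values: number of dishes.
hist = (
    df.groupby(["cuisine", "n_ingredients"])
    .size()
    .unstack(fill_value=0)
)

print(hist)
```

With matplotlib installed, something like `hist.T.plot(kind="bar", subplots=True)` should draw one panel per cuisine from this table.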

What's the most common ingredient?

All cuisines are typically based on a few ingredients that are available in plenty in that region. So it's only natural to look for the most common ones in each cuisine. To make this a little interesting I exclude some of the more mundane 🙂 ingredients like salt, garlic and onions. I use the stop_words parameter of the vectorizer to do this.

Here are the top ingredients of each cuisine.
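A minimal sketch of the idea on four made-up recipes. The `mundane` list and the column names are assumptions for illustration; `binary=True` records presence rather than counts, so the per-cuisine mean is the share of dishes containing each ingredient.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Made-up recipes (hypothetical data).
df = pd.DataFrame(
    {
        "cuisine": ["Italian", "Italian", "Chinese", "Chinese"],
        "ingredients": [
            "salt garlic olive_oil tomato",
            "salt olive_oil basil",
            "salt garlic soy_sauce ginger",
            "onion soy_sauce scallion",
        ],
    }
)

# Ingredients treated as stop words, so they never become columns.
mundane = ["salt", "garlic", "onion"]
vectorizer = CountVectorizer(stop_words=mundane, binary=True)
X = vectorizer.fit_transform(df["ingredients"])
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# One row per recipe, labelled with its cuisine.
presence = pd.DataFrame(X.toarray(), columns=terms, index=df["cuisine"])

# Percentage of dishes in each cuisine that contain each ingredient.
pct = presence.groupby(level=0).mean() * 100

# The most common remaining ingredient per cuisine.
top = pct.idxmax(axis=1)

print(top)
```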

Cuisine | Ingredient | % of dishes containing

Do you see any surprises?

What two ingredients always go together

Another interesting characteristic of most cuisines is the set of ingredients that often seem to go together. Such patterns are so common that there are sites that help you find the appropriate flavor combos.

In the text analytics world this is called finding co-occurrences. This could have some value in a prediction context because it gives better insight into the context. In this case, where the text really is a stream of unconnected labels (as seen by the count vectorizer), adding co-occurrences to the feature set might provide additional features that help separate the cuisines better.

The table shows the most commonly occurring pairs of ingredients in each cuisine, ordered (along columns) by their count in decreasing order.
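Counting co-occurrences can be sketched with nothing more than the standard library. The three recipes below are made up, and sorting each ingredient list first is a choice made here so that every pair is always stored in the same alphabetical order.

```python
from collections import Counter
from itertools import combinations

# Made-up recipes for a single cuisine (hypothetical data).
recipes = [
    ["garlic", "olive_oil", "tomato"],
    ["garlic", "olive_oil", "basil"],
    ["olive_oil", "tomato"],
]

pair_counts = Counter()
for ingredients in recipes:
    # Deduplicate and sort so (a, b) and (b, a) count as the same pair.
    for pair in combinations(sorted(set(ingredients)), 2):
        pair_counts[pair] += 1

# The two most frequent pairs, as ((ingredient, ingredient), count).
top_pairs = pair_counts.most_common(2)

print(top_pairs)
```

Running the same loop once per cuisine gives entries in the `((ingredient, ingredient), count)` shape.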

African | ((cumin, olive_oil), 38) | ((garlic, olive_oil), 35) | ((olive_oil, onion), 29) | ((cayenne, garlic), 27) | ((bell_pepper, garlic), 26) | ((coriander, cumin), 25) | ((ginger, onion), 22) | ((cilantro, olive_oil), 21) | ((black_pepper, onion), 21) | ((lemon_juice, olive_oil), 20)
American | ((egg, wheat), 11147) | ((butter, wheat), 10598) | ((milk, wheat), 7147) | ((vanilla, wheat), 6020) | ((garlic, onion), 5573) | ((cream, egg), 3996) | ((onion, pepper), 3927) | ((cane_molasses, wheat), 3691) | ((black_pepper, onion), 3325) | ((vegetable_oil, wheat), 3073)
Asian | ((ginger, soy_sauce), 366) | ((garlic, soy_sauce), 339) | ((scallion, soy_sauce), 291) | ((rice, vinegar), 265) | ((sesame_oil, soy_sauce), 259) | ((soy_sauce, vinegar), 247) | ((cayenne, garlic), 230) | ((vegetable_oil, vinegar), 151) | ((cilantro, ginger), 150) | ((fish, garlic), 131)
Austria | ((egg, wheat), 15) | ((butter, wheat), 14) | ((milk, wheat), 9) | ((vanilla, wheat), 9) | ((almond, egg), 5) | ((cocoa, wheat), 5) | ((bread, wheat), 4) | ((cream, wheat), 3) | ((rum, wheat), 3) | ((hazelnut, wheat), 3)
Bangladesh | ((onion, vegetable_oil), 4) | ((cayenne, onion), 4) | ((turmeric, vegetable_oil), 4) | ((garlic, onion), 3) | ((coriander, onion), 2) | ((cilantro, onion), 2) | ((beef, coriander), 2) | ((potato, vegetable_oil), 2) | ((cardamom, coriander), 2) | ((ginger, onion), 2)
Belgium | ((butter, wheat), 8) | ((cane_molasses, wheat), 6) | ((egg, wheat), 5) | ((milk, wheat), 3) | ((parsley, wheat), 3) | ((black_pepper, butter), 3) | ((onion, wheat), 3) | ((nutmeg, wheat), 3) | ((cinnamon, wheat), 3) | ((almond, butter), 2)
Cajun_Creole | ((cayenne, onion), 67) | ((garlic, onion), 63) | ((bell_pepper, onion), 44) | ((green_bell_pepper, onion), 42) | ((black_pepper, onion), 39) | ((celery, onion), 38) | ((onion, vegetable_oil), 38) | ((butter, onion), 36) | ((olive_oil, onion), 34) | ((chicken_broth, onion), 29)
Canada | ((butter, wheat), 195) | ((egg, wheat), 192) | ((milk, wheat), 135) | ((garlic, onion), 124) | ((black_pepper, onion), 105) | ((vanilla, wheat), 100) | ((onion, tomato), 97) | ((cane_molasses, wheat), 94) | ((cinnamon, wheat), 58) | ((cream, wheat), 57)
Caribbean | ((garlic, onion), 68) | ((onion, tomato), 47) | ((black_pepper, onion), 40) | ((olive_oil, onion), 36) | ((cayenne, onion), 32) | ((green_bell_pepper, onion), 31) | ((cumin, garlic), 30) | ((bell_pepper, onion), 28) | ((chicken, garlic), 27) | ((tomato, vegetable_oil), 24)
Central_SouthAmerican | ((garlic, onion), 100) | ((cayenne, onion), 91) | ((onion, tomato), 81) | ((corn, garlic), 58) | ((cilantro, onion), 54) | ((cumin, garlic), 45) | ((olive_oil, onion), 38) | ((black_pepper, garlic), 36) | ((bell_pepper, garlic), 32) | ((cheese, onion), 30)
Chinese | ((ginger, soy_sauce), 156) | ((garlic, soy_sauce), 145) | ((scallion, soy_sauce), 133) | ((sesame_oil, soy_sauce), 115) | ((soy_sauce, starch), 88) | ((pork, soy_sauce), 79) | ((rice, soy_sauce), 74) | ((pepper, soy_sauce), 60) | ((chicken, soy_sauce), 56) | ((cayenne, soy_sauce), 56)
East-African | ((ginger, onion), 4) | ((garlic, onion), 4) | ((cayenne, onion), 3) | ((onion, pepper), 3) | ((cumin, onion), 3) | ((olive_oil, onion), 2) | ((carrot, onion), 2) | ((beef, onion), 2) | ((cardamom, pepper), 2) | ((black_pepper, onion), 2)
Eastern-Europe | ((egg, wheat), 96) | ((butter, wheat), 77) | ((milk, wheat), 54) | ((garlic, onion), 42) | ((onion, tomato), 40) | ((black_pepper, onion), 39) | ((wheat, yeast), 36) | ((cream, wheat), 36) | ((bell_pepper, onion), 31) | ((vanilla, wheat), 28)
EasternEuropean_Russian | ((butter, wheat), 62) | ((egg, wheat), 54) | ((cream, wheat), 26) | ((black_pepper, onion), 23) | ((onion, potato), 23) | ((garlic, onion), 21) | ((milk, wheat), 20) | ((vanilla, wheat), 20) | ((vegetable_oil, wheat), 17) | ((olive_oil, parsley), 14)
English_Scottish | ((butter, wheat), 106) | ((egg, wheat), 93) | ((cream, wheat), 59) | ((milk, wheat), 58) | ((cane_molasses, wheat), 33) | ((vanilla, wheat), 27) | ((buttermilk, wheat), 23) | ((onion, wheat), 20) | ((milk_fat, wheat), 20) | ((vegetable_oil, wheat), 19)
French | ((butter, wheat), 374) | ((egg, wheat), 338) | ((cream, egg), 210) | ((garlic, olive_oil), 201) | ((milk, wheat), 176) | ((olive_oil, onion), 135) | ((vanilla, wheat), 133) | ((black_pepper, onion), 125) | ((onion, parsley), 125) | ((bay, onion), 93)
German | ((egg, wheat), 150) | ((butter, wheat), 111) | ((milk, wheat), 74) | ((cinnamon, wheat), 50) | ((onion, vinegar), 48) | ((vanilla, wheat), 43) | ((black_pepper, onion), 42) | ((cream, wheat), 34) | ((nutmeg, wheat), 32) | ((bacon, onion), 31)
Greek | ((garlic, olive_oil), 89) | ((olive_oil, onion), 71) | ((feta_cheese, olive_oil), 61) | ((lemon_juice, olive_oil), 60) | ((onion, tomato), 48) | ((bread, olive_oil), 42) | ((black_pepper, olive_oil), 39) | ((oregano, tomato), 36) | ((egg, wheat), 34) | ((butter, wheat), 32)
Indian | ((cumin, turmeric), 261) | ((coriander, cumin), 259) | ((onion, turmeric), 203) | ((cayenne, cumin), 192) | ((garlic, onion), 192) | ((pepper, turmeric), 170) | ((fenugreek, turmeric), 169) | ((ginger, onion), 165) | ((turmeric, vegetable_oil), 157) | ((cilantro, cumin), 115)
Indonesia | ((garlic, vegetable_oil), 6) | ((cayenne, garlic), 5) | ((lemon_juice, soy_sauce), 3) | ((chicken, soy_sauce), 3) | ((soy_sauce, tomato), 3) | ((scallion, soy_sauce), 3) | ((rice, soy_sauce), 3) | ((tomato, vegetable_oil), 3) | ((onion, soy_sauce), 3) | ((egg, tomato), 3)
Iran | ((onion, tomato), 7) | ((olive_oil, onion), 5) | ((black_pepper, onion), 5) | ((chicken, onion), 4) | ((turmeric, vegetable_oil), 4) | ((dill, yogurt), 3) | ((lime_juice, onion), 3) | ((beef, onion), 3) | ((parsley, tomato), 3) | ((cardamom, rose), 3)
Irish | ((butter, wheat), 32) | ((egg, wheat), 27) | ((buttermilk, wheat), 21) | ((cream, egg), 18) | ((cane_molasses, wheat), 13) | ((onion, potato), 12) | ((wheat, whole_grain_wheat_flour), 12) | ((oat, wheat), 12) | ((carrot, onion), 11) | ((vanilla, wheat), 9)
Israel | ((butter, wheat), 4) | ((egg, wheat), 3) | ((sesame_seed, wheat), 2) | ((vegetable_oil, wheat), 2) | ((black_pepper, olive_oil), 2) | ((garlic, parsley), 2) | ((onion, wheat), 2) | ((cumin, olive_oil), 2) | ((olive_oil, turkey), 1) | ((coconut, wheat), 1)
Italian | ((garlic, olive_oil), 1334) | ((olive_oil, tomato), 956) | ((basil, garlic), 787) | ((onion, tomato), 630) | ((macaroni, olive_oil), 629) | ((black_pepper, olive_oil), 581) | ((egg, wheat), 472) | ((parmesan_cheese, tomato), 440) | ((bell_pepper, olive_oil), 409) | ((butter, wheat), 390)
Japanese | ((rice, vinegar), 56) | ((soy_sauce, vegetable_oil), 55) | ((ginger, soy_sauce), 48) | ((scallion, soy_sauce), 46) | ((sake, soy_sauce), 40) | ((garlic, soy_sauce), 37) | ((barley, soybean), 33) | ((vegetable_oil, vinegar), 33) | ((vinegar, wine), 29) | ((egg, vegetable_oil), 29)
Jewish | ((egg, wheat), 134) | ((butter, egg), 78) | ((vegetable_oil, wheat), 60) | ((onion, wheat), 43) | ((garlic, olive_oil), 41) | ((cinnamon, egg), 40) | ((olive_oil, onion), 40) | ((black_pepper, egg), 35) | ((lemon_juice, wheat), 28) | ((vanilla, wheat), 28)
Korean | ((garlic, soy_sauce), 18) | ((sesame_oil, soy_sauce), 17) | ((scallion, soy_sauce), 14) | ((beef, soy_sauce), 12) | ((sesame_seed, soy_sauce), 12) | ((black_pepper, soy_sauce), 11) | ((cayenne, soy_sauce), 9) | ((onion, soy_sauce), 8) | ((soy_sauce, vinegar), 8) | ((ginger, soy_sauce), 7)
Lebanon | ((lemon_juice, olive_oil), 15) | ((garlic, olive_oil), 15) | ((olive_oil, onion), 11) | ((onion, parsley), 9) | ((mint, olive_oil), 8) | ((parsley, tomato), 7) | ((bread, olive_oil), 7) | ((black_pepper, olive_oil), 6) | ((lettuce, tomato), 5) | ((cucumber, tomato), 5)
Malaysia | ((garlic, pepper), 7) | ((coriander, garlic), 7) | ((cumin, garlic), 7) | ((coconut, ginger), 6) | ((fenugreek, pepper), 6) | ((pepper, turmeric), 6) | ((ginger, onion), 5) | ((chicken, garlic), 5) | ((onion, vegetable_oil), 5) | ((olive_oil, pepper), 4)
Mediterranean | ((garlic, olive_oil), 133) | ((olive_oil, onion), 100) | ((onion, tomato), 63) | ((lemon_juice, olive_oil), 58) | ((bell_pepper, olive_oil), 49) | ((olive, olive_oil), 41) | ((parsley, tomato), 38) | ((lemon, olive_oil), 37) | ((black_pepper, olive_oil), 36) | ((fish, olive_oil), 35)
Mexican | ((cayenne, onion), 1354) | ((garlic, onion), 1253) | ((onion, tomato), 1186) | ((cumin, garlic), 623) | ((corn, onion), 552) | ((cilantro, onion), 454) | ((bell_pepper, onion), 413) | ((black_pepper, onion), 400) | ((cheddar_cheese, onion), 394) | ((tomato, vegetable_oil), 381)
MiddleEastern | ((garlic, olive_oil), 83) | ((lemon_juice, olive_oil), 66) | ((olive_oil, onion), 65) | ((cumin, olive_oil), 61) | ((black_pepper, olive_oil), 44) | ((onion, wheat), 39) | ((bread, olive_oil), 38) | ((bell_pepper, olive_oil), 35) | ((mint, olive_oil), 34) | ((cayenne, olive_oil), 33)
Moroccan | ((cumin, olive_oil), 61) | ((olive_oil, onion), 49) | ((garlic, olive_oil), 48) | ((cinnamon, olive_oil), 40) | ((coriander, cumin), 38) | ((bell_pepper, olive_oil), 35) | ((cilantro, olive_oil), 34) | ((cayenne, cumin), 33) | ((lemon_juice, olive_oil), 31) | ((ginger, onion), 31)
Netherlands | ((butter, wheat), 25) | ((egg, wheat), 21) | ((cinnamon, wheat), 13) | ((cane_molasses, wheat), 10) | ((almond, butter), 9) | ((milk, wheat), 9) | ((apple, wheat), 8) | ((lard, wheat), 5) | ((wheat, yeast), 5) | ((nutmeg, wheat), 5)
North-African | ((cumin, onion), 20) | ((garlic, olive_oil), 20) | ((onion, tomato), 19) | ((olive_oil, tomato), 17) | ((cayenne, garlic), 16) | ((coriander, cumin), 14) | ((bell_pepper, olive_oil), 14) | ((chickpea, onion), 12) | ((black_pepper, cumin), 12) | ((carrot, onion), 12)
Pakistan | ((garlic, onion), 13) | ((cayenne, onion), 12) | ((cumin, garlic), 12) | ((onion, turmeric), 11) | ((ginger, onion), 9) | ((chicken, garlic), 8) | ((cilantro, garlic), 8) | ((tomato, turmeric), 8) | ((black_pepper, garlic), 7) | ((turmeric, vegetable_oil), 7)
Philippines | ((garlic, soy_sauce), 15) | ((black_pepper, soy_sauce), 14) | ((onion, vegetable_oil), 13) | ((soy_sauce, vegetable_oil), 11) | ((chicken, soy_sauce), 10) | ((pork, soy_sauce), 9) | ((beef, garlic), 8) | ((egg, wheat), 8) | ((carrot, pork), 7) | ((bay, soy_sauce), 7)
Portugal | ((garlic, onion), 18) | ((bell_pepper, garlic), 14) | ((butter, egg), 14) | ((onion, tomato), 14) | ((olive_oil, onion), 13) | ((egg, wheat), 12) | ((milk, wheat), 10) | ((black_pepper, garlic), 10) | ((wheat, yeast), 8) | ((pork_sausage, potato), 8)
Scandinavian | ((butter, wheat), 120) | ((egg, wheat), 100) | ((milk, wheat), 47) | ((cream, egg), 43) | ((almond, wheat), 34) | ((vanilla, wheat), 28) | ((cinnamon, wheat), 24) | ((wheat, yeast), 22) | ((cane_molasses, wheat), 22) | ((cardamom, wheat), 22)
South-African | ((coriander, fenugreek), 6) | ((fenugreek, pepper), 6) | ((pepper, turmeric), 6) | ((onion, pepper), 6) | ((cumin, fenugreek), 6) | ((bay, onion), 6) | ((beef, onion), 5) | ((egg, milk), 5) | ((black_pepper, onion), 4) | ((turmeric, vinegar), 4)
South-America | ((garlic, onion), 31) | ((onion, pepper), 22) | ((olive_oil, onion), 19) | ((egg, milk), 19) | ((bell_pepper, onion), 17) | ((cumin, garlic), 16) | ((milk, wheat), 15) | ((cayenne, onion), 15) | ((butter, wheat), 13) | ((chicken, onion), 12)
Southern_SoulFood | ((butter, wheat), 125) | ((egg, wheat), 99) | ((corn, egg), 73) | ((milk, wheat), 63) | ((buttermilk, wheat), 55) | ((garlic, onion), 47) | ((cayenne, onion), 46) | ((cream, wheat), 46) | ((black_pepper, onion), 41) | ((onion, vinegar), 38)
Southwestern | ((cayenne, onion), 59) | ((garlic, onion), 51) | ((cilantro, onion), 39) | ((onion, tomato), 39) | ((corn, onion), 29) | ((cumin, garlic), 28) | ((bell_pepper, cayenne), 28) | ((olive_oil, onion), 27) | ((black_pepper, cayenne), 22) | ((lime_juice, onion), 18)
Spain | ((olive_oil, onion), 34) | ((garlic, olive_oil), 29) | ((onion, tomato), 27) | ((bell_pepper, olive_oil), 23) | ((green_bell_pepper, onion), 16) | ((chicken, garlic), 14) | ((pepper, tomato), 14) | ((black_pepper, garlic), 13) | ((cayenne, onion), 12) | ((tomato, vinegar), 11)
Spanish_Portuguese | ((garlic, olive_oil), 141) | ((olive_oil, onion), 98) | ((bell_pepper, olive_oil), 86) | ((onion, tomato), 69) | ((cayenne, garlic), 52) | ((black_pepper, garlic), 44) | ((sherry, vinegar), 39) | ((chicken_broth, garlic), 36) | ((egg, olive_oil), 36) | ((parsley, tomato), 35)
Switzerland | ((egg, wheat), 9) | ((butter, wheat), 8) | ((onion, pepper), 5) | ((milk, wheat), 4) | ((nutmeg, wheat), 4) | ((cheese, pepper), 3) | ((lemon, wheat), 3) | ((cherry, wheat), 3) | ((pepper, wheat), 3) | ((almond, wheat), 2)
Thai | ((fish, garlic), 92) | ((garlic, ginger), 90) | ((cayenne, garlic), 90) | ((coriander, cumin), 83) | ((cilantro, garlic), 78) | ((coconut, coriander), 71) | ((cumin, turmeric), 66) | ((pepper, turmeric), 62) | ((fenugreek, pepper), 61) | ((chicken, garlic), 59)
Turkey | ((garlic, onion), 7) | ((olive_oil, onion), 6) | ((onion, tomato), 6) | ((bell_pepper, garlic), 6) | ((chicken, onion), 5) | ((egg, wheat), 4) | ((beef, tomato), 3) | ((pepper, tomato), 3) | ((black_pepper, onion), 3) | ((butter, wheat), 3)
UK-and-Ireland | ((butter, wheat), 126) | ((egg, wheat), 117) | ((milk, wheat), 77) | ((onion, potato), 44) | ((lard, wheat), 35) | ((cane_molasses, wheat), 35) | ((raisin, wheat), 34) | ((beef, onion), 32) | ((carrot, onion), 31) | ((cinnamon, wheat), 26)
Vietnam | ((fish, garlic), 54) | ((cayenne, fish), 35) | ((garlic, vegetable_oil), 35) | ((cilantro, fish), 32) | ((rice, vegetable_oil), 22) | ((basil, fish), 21) | ((black_pepper, fish), 20) | ((carrot, garlic), 20) | ((mint, rice), 20) | ((lime_juice, rice), 17)
West-African | ((cayenne, onion), 8) | ((onion, tomato), 8) | ((garlic, onion), 6) | ((peanut_butter, tomato), 5) | ((bell_pepper, onion), 5) | ((olive_oil, onion), 4) | ((chicken, onion), 4) | ((tomato, vegetable_oil), 4) | ((chicken_broth, onion), 4) | ((cumin, onion), 4)

So there you go: a little bit of data munging and some interesting insights into the types of ingredients that comprise a cuisine. In a later post, I will explore how this data can be framed as supervised and unsupervised learning problems, with the ingredients constituting the feature vectors and the cuisine or the cuisine group as the labels.

Please feel free to leave any feedback in the comments section.

The Dummy Variable

In statistics and econometrics, a dummy variable is used to represent the presence or absence of some categorical effect. In field experiments, a dummy variable is used to distinguish between groups, for example the control group and the treatment group.

Also a cool name for a blog which will most likely be about data, statistics and economics.

More substantive or trivial posts coming soon (hopefully).