What ingredients define your Cuisine?

Dan Kopf of Priceonomics does a wonderful job of exploring various facets of the cuisines of the world, or rather those available in the Epicurious dataset. In the article he looks at the most commonly used ingredients across various cuisines, the most distinguishing ingredients of each cuisine and the types of meat used in each.

Inspired by his article, I did some follow-up work to replicate some of his findings, discover some new insights and also use the data set to build some machine learning models to predict cuisines based on the ingredients. While the accuracy of the models is nothing to write home about (which was never the goal to begin with), it was a fun little exercise to refine my ML chops. What follows is a series (2-3 perhaps) of blog posts describing both the insights from the data and the models.

To start off, Dan (and I) got the data from here. Curiously, Dan uses data only from Epicurious; I use both CSV files (Epicurious and Allrecipes).

For my data analysis, I use the wonderful Python library pandas and the equally remarkable IPython notebook (with a little help from PyCharm for long-running tasks).

If you want to follow along, you can get the IPython notebook from my GitHub repo here.

The Dataset

The dataset is a collection (actually just two) of CSV files that contain the cuisine name in the first column and the ingredients of that dish in the following columns. Since the number of ingredients is not fixed, the number of columns varies from dish to dish.

Pandas does not seem to handle a variable number of columns in a CSV very well, so I import each line (cuisine + ingredients) into a single column. Once the data is loaded into a one-column dataframe, I use a couple of lambda functions with apply to split the single column into two columns (cuisine + ingredients). Following is the code to do that.
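A minimal sketch of this loading step, assuming a small made-up sample in place of the real CSV files. The column names (line, cuisine, ingredients) are my own choices for illustration, and the lines are read with plain Python before being handed to pandas, which may differ from how the notebook does it.

```python
import io

import pandas as pd

# Hypothetical sample in the layout described above: cuisine first,
# then a variable number of ingredients on the same line.
sample = io.StringIO(
    "Indian,cumin,turmeric,onion\n"
    "italian,tomato,olive_oil\n"
    "chinese,soy_sauce,ginger,garlic,scallion\n"
)

# Keep each line whole in a single column, sidestepping the ragged columns.
raw = pd.DataFrame({"line": [ln.strip() for ln in sample if ln.strip()]})

# Split the single column into a cuisine and a list of ingredients.
recipes = pd.DataFrame({
    "cuisine": raw["line"].apply(lambda s: s.split(",")[0]),
    "ingredients": raw["line"].apply(lambda s: s.split(",")[1:]),
})

print(recipes)
```

For a real file, the StringIO object would be replaced by an open file handle; the rest stays the same.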

After a little cleaning and standardization (china, chinese -> Chinese; italian, italy -> Italian) I head on to Data Exploration.
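One way such a standardization could look, as a sketch: the alias table below is hypothetical and covers only the two examples mentioned above, not the full cleanup done for the real data.

```python
import pandas as pd

cuisines = pd.Series(["china", "chinese", "Chinese", "italy", "italian", "Thai"])

# Hypothetical alias table; the real cleanup would need many more entries.
aliases = {
    "china": "Chinese",
    "chinese": "Chinese",
    "italy": "Italian",
    "italian": "Italian",
}

# Look up the lower-cased label; labels without an alias keep their original value.
standardized = cuisines.str.lower().map(aliases).fillna(cuisines)

print(standardized.tolist())
```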

Vectorization

To derive any insights from this text data, the first step is to convert the text data (ingredients) into numerical feature vectors. This is done by first tokenizing the data and then building a matrix with one row per recipe and one column per ingredient. Sklearn has many more simple and advanced examples of working with text here.

Here is the code that uses the CountVectorizer class to vectorize my ingredients list.
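A minimal sketch of the vectorization, assuming each recipe has already been joined into a space-separated string of ingredient names. The sample recipes and the choice of token_pattern and binary=True are assumptions made for illustration, not necessarily the settings used in the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical recipes, one string of space-separated ingredients each.
docs = [
    "cumin turmeric onion",
    "tomato olive_oil garlic",
    "soy_sauce ginger garlic",
]

# Treat every whitespace-delimited token as one ingredient and record
# presence (0/1) rather than a count.
vectorizer = CountVectorizer(token_pattern=r"\S+", binary=True)
X = vectorizer.fit_transform(docs)

# Column order of the matrix, recovered from the fitted vocabulary.
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(columns)
print(X.toarray())
```

The result is a sparse matrix with one row per recipe and one column per distinct ingredient.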

Data Exploration

Now that the data janitoring is all done, the data frame is ready for some exploration.

One thing to remember is that most of these recipes came from Epicurious, Gourmet and other magazines that cater to a predominantly North American readership. So it is no wonder that a significant chunk of the recipes have their cuisine listed as North American (American, Canadian, Cajun_Creole, Southern_SoulFood, Southwestern).

One of the first things to look at is the distribution of these cuisines in our collection. A simple groupby in pandas does the deed.
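As a sketch of that groupby, assuming a hypothetical frame with a cuisine_group column (the column name and the five sample rows are made up for illustration):

```python
import pandas as pd

# Hypothetical recipes with a cuisine and a broader cuisine group.
recipes = pd.DataFrame({
    "cuisine": ["American", "Canadian", "Mexican", "American", "Indian"],
    "cuisine_group": [
        "NorthAmerican", "NorthAmerican", "LatinAmerican",
        "NorthAmerican", "SouthAsian",
    ],
})

# Number of recipes per cuisine group, largest first.
counts = recipes.groupby("cuisine_group").size().sort_values(ascending=False)

# The same distribution as a share of the whole collection.
share = counts / counts.sum()

print(counts)
print(share)
```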

Here is the output we see. Clearly, North American cuisines make up over 75% of the dataset. Latin American is a distant second.

[Figure: bar chart of recipe counts by cuisine group]

Now that we have the caveat out of the way, let's look at some interesting things hidden in the data.

One quick visualization is to build histograms of the number of ingredients that typically go into a particular cuisine. From my experience (of tasting many cuisines, not making any of them), most South Asian cuisines tend to have a large number of ingredients, while most North American, British and West European ones tend to be a little parsimonious. Let's see if the data bears out my hypothesis.

Here are histograms for the top 25 cuisines in my dataset.

[Figure: histograms of ingredient counts for the top 25 cuisines]

These histograms show the count of ingredients on the X-axis and the number of dishes with that many ingredients on the Y-axis. What’s interesting is how similar American, Canadian, French and other Western European cuisines are: a large number of dishes with 5-9 ingredients and very few beyond that. If for a minute you exclude staples like oil, salt and meat, there are very few other ingredients in most dishes in these cuisines. Now contrast them with something like Indian, which shows an almost bimodal or trimodal distribution, or Thai, which possibly has more dishes with a greater number of ingredients than with fewer. No wonder Indian food is among the most delicious cuisines, but it is also expensive to make and, surprisingly, not so popular in the US.

Another interesting distribution is Scandinavian cuisine, where the number of dishes with more than 10 ingredients falls off much more rapidly than in most other cuisines.

Here are the histograms at the cuisine group level, which show very similar patterns.

[Figure: histograms of ingredient counts by cuisine group]

Here is the Python code that does the data munging and generates the nice pictures above.
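A sketch of the data preparation behind such histograms, assuming a hypothetical recipes frame that holds an ingredient list per row. It only builds the per-cuisine distribution of ingredient counts and leaves the plotting out, so the real notebook code will look different.

```python
import pandas as pd

# Hypothetical recipes: a cuisine label and a list of ingredients per dish.
recipes = pd.DataFrame({
    "cuisine": ["Indian", "Indian", "French", "French"],
    "ingredients": [
        ["cumin", "turmeric", "onion"],
        ["cumin", "ginger", "garlic", "cayenne", "coriander"],
        ["butter", "wheat"],
        ["egg", "cream"],
    ],
})

# Number of ingredients in each dish.
recipes["n_ingredients"] = recipes["ingredients"].apply(len)

# Rows: cuisines; columns: ingredient counts; cells: number of dishes.
hist = pd.crosstab(recipes["cuisine"], recipes["n_ingredients"])

print(hist)
```

Each row of hist holds the bar heights for one cuisine, so plotting the rows as bar charts with matplotlib would give one histogram per cuisine.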

What's the most common ingredient?

All cuisines are typically based on a few ingredients that are available in plenty in that region. So it's only natural to look for the most common ones in each cuisine. To make this a little interesting, I exclude some of the more mundane 🙂 ingredients like salt, garlic and onions. I use the stop words option (stop_words) of the vectorizer to do this.
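A minimal sketch of how this could be done, assuming recipes stored as space-separated ingredient strings. The sample data, the column names and the three stop words are made up for illustration; only the stop_words argument of CountVectorizer comes from the description above.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical recipes, one string of space-separated ingredients each.
recipes = pd.DataFrame({
    "cuisine": ["Indian", "Indian", "Italian", "Italian"],
    "ingredients": [
        "cumin onion salt turmeric",
        "cumin garlic ginger",
        "tomato garlic basil",
        "tomato salt olive_oil",
    ],
})

# Mundane ingredients are dropped by passing them as stop words.
vectorizer = CountVectorizer(
    token_pattern=r"\S+",
    binary=True,
    stop_words=["salt", "garlic", "onion"],
)
X = vectorizer.fit_transform(recipes["ingredients"])
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Fraction of dishes in each cuisine that contain each ingredient.
presence = pd.DataFrame(X.toarray(), columns=columns)
share = presence.groupby(recipes["cuisine"]).mean()

# Most common remaining ingredient per cuisine, with its share.
top = pd.DataFrame({
    "ingredient": share.idxmax(axis=1),
    "share": share.max(axis=1),
})

print(top)
```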

Here is the top ingredient of each cuisine.

Cuisine | Ingredient | Fraction of dishes containing
Vietnam | fish | 0.736842
Indian | cumin | 0.603679
Spanish_Portuguese | bell_pepper | 0.343643
Jewish | egg | 0.593750
French | egg | 0.441456
Central_SouthAmerican | cayenne | 0.518672
Cajun_Creole | cayenne | 0.561644
Thai | fish | 0.529412
Scandinavian | wheat | 0.580000
Greek | lemon_juice | 0.337778
American | egg | 0.405131
African | cumin | 0.426087
MiddleEastern | wheat | 0.379032
EasternEuropean_Russian | egg | 0.506849
Italian | tomato | 0.392317
Irish | wheat | 0.500000
Mexican | cayenne | 0.736953
Chinese | soy_sauce | 0.679775
German | wheat | 0.647059
Mediterranean | tomato | 0.349481
Japanese | soy_sauce | 0.588235
Moroccan | cumin | 0.547445
Southern_SoulFood | wheat | 0.485549
English_Scottish | wheat | 0.622549
Asian | soy_sauce | 0.500000
Southwestern | cayenne | 0.814815
Canada | wheat | 0.395349
Turkey | tomato | 0.437500
Caribbean | vegetable_oil | 0.311475
Bangladesh | cayenne | 1.000000
Israel | wheat | 0.666667
Korean | soy_sauce | 0.781250
Iran | tomato | 0.333333
Eastern-Europe | wheat | 0.531915
South-African | turmeric | 0.437500
UK-and-Ireland | wheat | 0.606383
Belgium | wheat | 0.909091
South-America | egg | 0.349515
Spain | tomato | 0.426667
Netherlands | wheat | 0.937500
Philippines | vegetable_oil | 0.488372
Indonesia | soy_sauce | 0.500000
East-African | ginger | 0.363636
Switzerland | wheat | 0.650000
West-African | peanut_butter | 0.615385
North-African | cumin | 0.483333
Pakistan | cumin | 0.684211
Portugal | egg | 0.360000
Lebanon | lemon_juice | 0.645161
Malaysia | coconut | 0.555556
Austria | egg | 0.809524

Do you see any surprises?

What two ingredients always go together?

Another interesting characteristic of most cuisines is the set of ingredients that often seem to go together. Such patterns are so common that there are sites that help you find the appropriate flavor combos.

In the text analytics world this is called finding co-occurrences. This could have some value in a prediction context because it gives better insight into the context in which an ingredient appears. In this case, where the text really is a stream of unconnected labels (as seen by the count vectorizer), adding co-occurrences to the feature set might provide additional features that help separate the cuisines.
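Counting co-occurrences can be sketched with the standard library alone. The four recipes below are made up, and sorting each pair alphabetically is my assumption so that (cumin, turmeric) and (turmeric, cumin) count as the same combination.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical recipes: (cuisine, ingredients).
recipes = [
    ("Indian", ["cumin", "turmeric", "onion"]),
    ("Indian", ["turmeric", "cumin", "coriander"]),
    ("Italian", ["olive_oil", "garlic", "tomato"]),
    ("Italian", ["garlic", "olive_oil", "basil"]),
]

# One counter of ingredient pairs per cuisine.
pair_counts = defaultdict(Counter)
for cuisine, ingredients in recipes:
    # Sort and de-duplicate so each unordered pair is counted once per dish.
    for pair in combinations(sorted(set(ingredients)), 2):
        pair_counts[cuisine][pair] += 1

# The ten most frequent pairs per cuisine, as ((a, b), count) tuples.
top_pairs = {c: counter.most_common(10) for c, counter in pair_counts.items()}

for cuisine, pairs in top_pairs.items():
    print(cuisine, pairs[:3])
```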

The table shows the most commonly occurring ingredient pairs for each cuisine, ordered along the columns by decreasing count.

Cuisine | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
African | ((cumin, olive_oil), 38) | ((garlic, olive_oil), 35) | ((olive_oil, onion), 29) | ((cayenne, garlic), 27) | ((bell_pepper, garlic), 26) | ((coriander, cumin), 25) | ((ginger, onion), 22) | ((cilantro, olive_oil), 21) | ((black_pepper, onion), 21) | ((lemon_juice, olive_oil), 20)
American | ((egg, wheat), 11147) | ((butter, wheat), 10598) | ((milk, wheat), 7147) | ((vanilla, wheat), 6020) | ((garlic, onion), 5573) | ((cream, egg), 3996) | ((onion, pepper), 3927) | ((cane_molasses, wheat), 3691) | ((black_pepper, onion), 3325) | ((vegetable_oil, wheat), 3073)
Asian | ((ginger, soy_sauce), 366) | ((garlic, soy_sauce), 339) | ((scallion, soy_sauce), 291) | ((rice, vinegar), 265) | ((sesame_oil, soy_sauce), 259) | ((soy_sauce, vinegar), 247) | ((cayenne, garlic), 230) | ((vegetable_oil, vinegar), 151) | ((cilantro, ginger), 150) | ((fish, garlic), 131)
Austria | ((egg, wheat), 15) | ((butter, wheat), 14) | ((milk, wheat), 9) | ((vanilla, wheat), 9) | ((almond, egg), 5) | ((cocoa, wheat), 5) | ((bread, wheat), 4) | ((cream, wheat), 3) | ((rum, wheat), 3) | ((hazelnut, wheat), 3)
Bangladesh | ((onion, vegetable_oil), 4) | ((cayenne, onion), 4) | ((turmeric, vegetable_oil), 4) | ((garlic, onion), 3) | ((coriander, onion), 2) | ((cilantro, onion), 2) | ((beef, coriander), 2) | ((potato, vegetable_oil), 2) | ((cardamom, coriander), 2) | ((ginger, onion), 2)
Belgium | ((butter, wheat), 8) | ((cane_molasses, wheat), 6) | ((egg, wheat), 5) | ((milk, wheat), 3) | ((parsley, wheat), 3) | ((black_pepper, butter), 3) | ((onion, wheat), 3) | ((nutmeg, wheat), 3) | ((cinnamon, wheat), 3) | ((almond, butter), 2)
Cajun_Creole | ((cayenne, onion), 67) | ((garlic, onion), 63) | ((bell_pepper, onion), 44) | ((green_bell_pepper, onion), 42) | ((black_pepper, onion), 39) | ((celery, onion), 38) | ((onion, vegetable_oil), 38) | ((butter, onion), 36) | ((olive_oil, onion), 34) | ((chicken_broth, onion), 29)
Canada | ((butter, wheat), 195) | ((egg, wheat), 192) | ((milk, wheat), 135) | ((garlic, onion), 124) | ((black_pepper, onion), 105) | ((vanilla, wheat), 100) | ((onion, tomato), 97) | ((cane_molasses, wheat), 94) | ((cinnamon, wheat), 58) | ((cream, wheat), 57)
Caribbean | ((garlic, onion), 68) | ((onion, tomato), 47) | ((black_pepper, onion), 40) | ((olive_oil, onion), 36) | ((cayenne, onion), 32) | ((green_bell_pepper, onion), 31) | ((cumin, garlic), 30) | ((bell_pepper, onion), 28) | ((chicken, garlic), 27) | ((tomato, vegetable_oil), 24)
Central_SouthAmerican | ((garlic, onion), 100) | ((cayenne, onion), 91) | ((onion, tomato), 81) | ((corn, garlic), 58) | ((cilantro, onion), 54) | ((cumin, garlic), 45) | ((olive_oil, onion), 38) | ((black_pepper, garlic), 36) | ((bell_pepper, garlic), 32) | ((cheese, onion), 30)
Chinese | ((ginger, soy_sauce), 156) | ((garlic, soy_sauce), 145) | ((scallion, soy_sauce), 133) | ((sesame_oil, soy_sauce), 115) | ((soy_sauce, starch), 88) | ((pork, soy_sauce), 79) | ((rice, soy_sauce), 74) | ((pepper, soy_sauce), 60) | ((chicken, soy_sauce), 56) | ((cayenne, soy_sauce), 56)
East-African | ((ginger, onion), 4) | ((garlic, onion), 4) | ((cayenne, onion), 3) | ((onion, pepper), 3) | ((cumin, onion), 3) | ((olive_oil, onion), 2) | ((carrot, onion), 2) | ((beef, onion), 2) | ((cardamom, pepper), 2) | ((black_pepper, onion), 2)
Eastern-Europe | ((egg, wheat), 96) | ((butter, wheat), 77) | ((milk, wheat), 54) | ((garlic, onion), 42) | ((onion, tomato), 40) | ((black_pepper, onion), 39) | ((wheat, yeast), 36) | ((cream, wheat), 36) | ((bell_pepper, onion), 31) | ((vanilla, wheat), 28)
EasternEuropean_Russian | ((butter, wheat), 62) | ((egg, wheat), 54) | ((cream, wheat), 26) | ((black_pepper, onion), 23) | ((onion, potato), 23) | ((garlic, onion), 21) | ((milk, wheat), 20) | ((vanilla, wheat), 20) | ((vegetable_oil, wheat), 17) | ((olive_oil, parsley), 14)
English_Scottish | ((butter, wheat), 106) | ((egg, wheat), 93) | ((cream, wheat), 59) | ((milk, wheat), 58) | ((cane_molasses, wheat), 33) | ((vanilla, wheat), 27) | ((buttermilk, wheat), 23) | ((onion, wheat), 20) | ((milk_fat, wheat), 20) | ((vegetable_oil, wheat), 19)
French | ((butter, wheat), 374) | ((egg, wheat), 338) | ((cream, egg), 210) | ((garlic, olive_oil), 201) | ((milk, wheat), 176) | ((olive_oil, onion), 135) | ((vanilla, wheat), 133) | ((black_pepper, onion), 125) | ((onion, parsley), 125) | ((bay, onion), 93)
German | ((egg, wheat), 150) | ((butter, wheat), 111) | ((milk, wheat), 74) | ((cinnamon, wheat), 50) | ((onion, vinegar), 48) | ((vanilla, wheat), 43) | ((black_pepper, onion), 42) | ((cream, wheat), 34) | ((nutmeg, wheat), 32) | ((bacon, onion), 31)
Greek | ((garlic, olive_oil), 89) | ((olive_oil, onion), 71) | ((feta_cheese, olive_oil), 61) | ((lemon_juice, olive_oil), 60) | ((onion, tomato), 48) | ((bread, olive_oil), 42) | ((black_pepper, olive_oil), 39) | ((oregano, tomato), 36) | ((egg, wheat), 34) | ((butter, wheat), 32)
Indian | ((cumin, turmeric), 261) | ((coriander, cumin), 259) | ((onion, turmeric), 203) | ((cayenne, cumin), 192) | ((garlic, onion), 192) | ((pepper, turmeric), 170) | ((fenugreek, turmeric), 169) | ((ginger, onion), 165) | ((turmeric, vegetable_oil), 157) | ((cilantro, cumin), 115)
Indonesia | ((garlic, vegetable_oil), 6) | ((cayenne, garlic), 5) | ((lemon_juice, soy_sauce), 3) | ((chicken, soy_sauce), 3) | ((soy_sauce, tomato), 3) | ((scallion, soy_sauce), 3) | ((rice, soy_sauce), 3) | ((tomato, vegetable_oil), 3) | ((onion, soy_sauce), 3) | ((egg, tomato), 3)
Iran | ((onion, tomato), 7) | ((olive_oil, onion), 5) | ((black_pepper, onion), 5) | ((chicken, onion), 4) | ((turmeric, vegetable_oil), 4) | ((dill, yogurt), 3) | ((lime_juice, onion), 3) | ((beef, onion), 3) | ((parsley, tomato), 3) | ((cardamom, rose), 3)
Irish | ((butter, wheat), 32) | ((egg, wheat), 27) | ((buttermilk, wheat), 21) | ((cream, egg), 18) | ((cane_molasses, wheat), 13) | ((onion, potato), 12) | ((wheat, whole_grain_wheat_flour), 12) | ((oat, wheat), 12) | ((carrot, onion), 11) | ((vanilla, wheat), 9)
Israel | ((butter, wheat), 4) | ((egg, wheat), 3) | ((sesame_seed, wheat), 2) | ((vegetable_oil, wheat), 2) | ((black_pepper, olive_oil), 2) | ((garlic, parsley), 2) | ((onion, wheat), 2) | ((cumin, olive_oil), 2) | ((olive_oil, turkey), 1) | ((coconut, wheat), 1)
Italian | ((garlic, olive_oil), 1334) | ((olive_oil, tomato), 956) | ((basil, garlic), 787) | ((onion, tomato), 630) | ((macaroni, olive_oil), 629) | ((black_pepper, olive_oil), 581) | ((egg, wheat), 472) | ((parmesan_cheese, tomato), 440) | ((bell_pepper, olive_oil), 409) | ((butter, wheat), 390)
Japanese | ((rice, vinegar), 56) | ((soy_sauce, vegetable_oil), 55) | ((ginger, soy_sauce), 48) | ((scallion, soy_sauce), 46) | ((sake, soy_sauce), 40) | ((garlic, soy_sauce), 37) | ((barley, soybean), 33) | ((vegetable_oil, vinegar), 33) | ((vinegar, wine), 29) | ((egg, vegetable_oil), 29)
Jewish | ((egg, wheat), 134) | ((butter, egg), 78) | ((vegetable_oil, wheat), 60) | ((onion, wheat), 43) | ((garlic, olive_oil), 41) | ((cinnamon, egg), 40) | ((olive_oil, onion), 40) | ((black_pepper, egg), 35) | ((lemon_juice, wheat), 28) | ((vanilla, wheat), 28)
Korean | ((garlic, soy_sauce), 18) | ((sesame_oil, soy_sauce), 17) | ((scallion, soy_sauce), 14) | ((beef, soy_sauce), 12) | ((sesame_seed, soy_sauce), 12) | ((black_pepper, soy_sauce), 11) | ((cayenne, soy_sauce), 9) | ((onion, soy_sauce), 8) | ((soy_sauce, vinegar), 8) | ((ginger, soy_sauce), 7)
Lebanon | ((lemon_juice, olive_oil), 15) | ((garlic, olive_oil), 15) | ((olive_oil, onion), 11) | ((onion, parsley), 9) | ((mint, olive_oil), 8) | ((parsley, tomato), 7) | ((bread, olive_oil), 7) | ((black_pepper, olive_oil), 6) | ((lettuce, tomato), 5) | ((cucumber, tomato), 5)
Malaysia | ((garlic, pepper), 7) | ((coriander, garlic), 7) | ((cumin, garlic), 7) | ((coconut, ginger), 6) | ((fenugreek, pepper), 6) | ((pepper, turmeric), 6) | ((ginger, onion), 5) | ((chicken, garlic), 5) | ((onion, vegetable_oil), 5) | ((olive_oil, pepper), 4)
Mediterranean | ((garlic, olive_oil), 133) | ((olive_oil, onion), 100) | ((onion, tomato), 63) | ((lemon_juice, olive_oil), 58) | ((bell_pepper, olive_oil), 49) | ((olive, olive_oil), 41) | ((parsley, tomato), 38) | ((lemon, olive_oil), 37) | ((black_pepper, olive_oil), 36) | ((fish, olive_oil), 35)
Mexican | ((cayenne, onion), 1354) | ((garlic, onion), 1253) | ((onion, tomato), 1186) | ((cumin, garlic), 623) | ((corn, onion), 552) | ((cilantro, onion), 454) | ((bell_pepper, onion), 413) | ((black_pepper, onion), 400) | ((cheddar_cheese, onion), 394) | ((tomato, vegetable_oil), 381)
MiddleEastern | ((garlic, olive_oil), 83) | ((lemon_juice, olive_oil), 66) | ((olive_oil, onion), 65) | ((cumin, olive_oil), 61) | ((black_pepper, olive_oil), 44) | ((onion, wheat), 39) | ((bread, olive_oil), 38) | ((bell_pepper, olive_oil), 35) | ((mint, olive_oil), 34) | ((cayenne, olive_oil), 33)
Moroccan | ((cumin, olive_oil), 61) | ((olive_oil, onion), 49) | ((garlic, olive_oil), 48) | ((cinnamon, olive_oil), 40) | ((coriander, cumin), 38) | ((bell_pepper, olive_oil), 35) | ((cilantro, olive_oil), 34) | ((cayenne, cumin), 33) | ((lemon_juice, olive_oil), 31) | ((ginger, onion), 31)
Netherlands | ((butter, wheat), 25) | ((egg, wheat), 21) | ((cinnamon, wheat), 13) | ((cane_molasses, wheat), 10) | ((almond, butter), 9) | ((milk, wheat), 9) | ((apple, wheat), 8) | ((lard, wheat), 5) | ((wheat, yeast), 5) | ((nutmeg, wheat), 5)
North-African | ((cumin, onion), 20) | ((garlic, olive_oil), 20) | ((onion, tomato), 19) | ((olive_oil, tomato), 17) | ((cayenne, garlic), 16) | ((coriander, cumin), 14) | ((bell_pepper, olive_oil), 14) | ((chickpea, onion), 12) | ((black_pepper, cumin), 12) | ((carrot, onion), 12)
Pakistan | ((garlic, onion), 13) | ((cayenne, onion), 12) | ((cumin, garlic), 12) | ((onion, turmeric), 11) | ((ginger, onion), 9) | ((chicken, garlic), 8) | ((cilantro, garlic), 8) | ((tomato, turmeric), 8) | ((black_pepper, garlic), 7) | ((turmeric, vegetable_oil), 7)
Philippines | ((garlic, soy_sauce), 15) | ((black_pepper, soy_sauce), 14) | ((onion, vegetable_oil), 13) | ((soy_sauce, vegetable_oil), 11) | ((chicken, soy_sauce), 10) | ((pork, soy_sauce), 9) | ((beef, garlic), 8) | ((egg, wheat), 8) | ((carrot, pork), 7) | ((bay, soy_sauce), 7)
Portugal | ((garlic, onion), 18) | ((bell_pepper, garlic), 14) | ((butter, egg), 14) | ((onion, tomato), 14) | ((olive_oil, onion), 13) | ((egg, wheat), 12) | ((milk, wheat), 10) | ((black_pepper, garlic), 10) | ((wheat, yeast), 8) | ((pork_sausage, potato), 8)
Scandinavian | ((butter, wheat), 120) | ((egg, wheat), 100) | ((milk, wheat), 47) | ((cream, egg), 43) | ((almond, wheat), 34) | ((vanilla, wheat), 28) | ((cinnamon, wheat), 24) | ((wheat, yeast), 22) | ((cane_molasses, wheat), 22) | ((cardamom, wheat), 22)
South-African | ((coriander, fenugreek), 6) | ((fenugreek, pepper), 6) | ((pepper, turmeric), 6) | ((onion, pepper), 6) | ((cumin, fenugreek), 6) | ((bay, onion), 6) | ((beef, onion), 5) | ((egg, milk), 5) | ((black_pepper, onion), 4) | ((turmeric, vinegar), 4)
South-America | ((garlic, onion), 31) | ((onion, pepper), 22) | ((olive_oil, onion), 19) | ((egg, milk), 19) | ((bell_pepper, onion), 17) | ((cumin, garlic), 16) | ((milk, wheat), 15) | ((cayenne, onion), 15) | ((butter, wheat), 13) | ((chicken, onion), 12)
Southern_SoulFood | ((butter, wheat), 125) | ((egg, wheat), 99) | ((corn, egg), 73) | ((milk, wheat), 63) | ((buttermilk, wheat), 55) | ((garlic, onion), 47) | ((cayenne, onion), 46) | ((cream, wheat), 46) | ((black_pepper, onion), 41) | ((onion, vinegar), 38)
Southwestern | ((cayenne, onion), 59) | ((garlic, onion), 51) | ((cilantro, onion), 39) | ((onion, tomato), 39) | ((corn, onion), 29) | ((cumin, garlic), 28) | ((bell_pepper, cayenne), 28) | ((olive_oil, onion), 27) | ((black_pepper, cayenne), 22) | ((lime_juice, onion), 18)
Spain | ((olive_oil, onion), 34) | ((garlic, olive_oil), 29) | ((onion, tomato), 27) | ((bell_pepper, olive_oil), 23) | ((green_bell_pepper, onion), 16) | ((chicken, garlic), 14) | ((pepper, tomato), 14) | ((black_pepper, garlic), 13) | ((cayenne, onion), 12) | ((tomato, vinegar), 11)
Spanish_Portuguese | ((garlic, olive_oil), 141) | ((olive_oil, onion), 98) | ((bell_pepper, olive_oil), 86) | ((onion, tomato), 69) | ((cayenne, garlic), 52) | ((black_pepper, garlic), 44) | ((sherry, vinegar), 39) | ((chicken_broth, garlic), 36) | ((egg, olive_oil), 36) | ((parsley, tomato), 35)
Switzerland | ((egg, wheat), 9) | ((butter, wheat), 8) | ((onion, pepper), 5) | ((milk, wheat), 4) | ((nutmeg, wheat), 4) | ((cheese, pepper), 3) | ((lemon, wheat), 3) | ((cherry, wheat), 3) | ((pepper, wheat), 3) | ((almond, wheat), 2)
Thai | ((fish, garlic), 92) | ((garlic, ginger), 90) | ((cayenne, garlic), 90) | ((coriander, cumin), 83) | ((cilantro, garlic), 78) | ((coconut, coriander), 71) | ((cumin, turmeric), 66) | ((pepper, turmeric), 62) | ((fenugreek, pepper), 61) | ((chicken, garlic), 59)
Turkey | ((garlic, onion), 7) | ((olive_oil, onion), 6) | ((onion, tomato), 6) | ((bell_pepper, garlic), 6) | ((chicken, onion), 5) | ((egg, wheat), 4) | ((beef, tomato), 3) | ((pepper, tomato), 3) | ((black_pepper, onion), 3) | ((butter, wheat), 3)
UK-and-Ireland | ((butter, wheat), 126) | ((egg, wheat), 117) | ((milk, wheat), 77) | ((onion, potato), 44) | ((lard, wheat), 35) | ((cane_molasses, wheat), 35) | ((raisin, wheat), 34) | ((beef, onion), 32) | ((carrot, onion), 31) | ((cinnamon, wheat), 26)
Vietnam | ((fish, garlic), 54) | ((cayenne, fish), 35) | ((garlic, vegetable_oil), 35) | ((cilantro, fish), 32) | ((rice, vegetable_oil), 22) | ((basil, fish), 21) | ((black_pepper, fish), 20) | ((carrot, garlic), 20) | ((mint, rice), 20) | ((lime_juice, rice), 17)
West-African | ((cayenne, onion), 8) | ((onion, tomato), 8) | ((garlic, onion), 6) | ((peanut_butter, tomato), 5) | ((bell_pepper, onion), 5) | ((olive_oil, onion), 4) | ((chicken, onion), 4) | ((tomato, vegetable_oil), 4) | ((chicken_broth, onion), 4) | ((cumin, onion), 4)

So there you go: a little bit of data munging and some interesting insights into the types of ingredients that make up a cuisine. In a later post, I will explore how this data can be framed as supervised and unsupervised learning problems, with ingredients constituting the feature vectors and the cuisine or the cuisine group as the labels.

Please feel free to leave any feedback in the comments section.