Omphaloskepsis (Silly and Self Indulgent)

Everyone engages in ego searching, i.e. searching for your own name on Google, and so do I. On one such trip I came across my entire search history, which Google had preserved for me since late 2013. What I present here is a little exploratory data analysis of that search history.

What I realized was that simply combing through the timestamps of the searches and breaking them down to the day, hour and weekday level said a lot about my online habits. Adding the search text into the mix pretty much illustrated the key events of my life over the last two years.

Getting Your Google Search History

If you have allowed Google to save your search history and are always logged into your Google account when searching (which you should be, since Google can then personalize the results), you can download your search history from the Google History page. Once you are on the page, navigate to the three dots in the top right corner. From there you can download your search history by clicking the Download Searches link. Google will email you your search history as a zip file containing the JSON files.

Once you have the zip file (and if you want to follow along), unzip it and place the JSON files alongside this notebook.

For some reason I could not get Pandas to parse the JSON files straight into a dataframe, and instead had to write a small script to convert them to CSV. Here is the gist if you want to try it out.
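
The gist itself is not reproduced here. A minimal sketch of such a conversion might look like the following. The JSON layout it expects (a top-level "event" list whose entries carry a "query" with "query_text" and a list of "timestamp_usec" ids) is an assumption about the export format, and the function names are made up for illustration.

```python
import csv
import json
from datetime import datetime, timezone


def extract_rows(history):
    """Yield (timestamp, query_text) pairs from one parsed history file.

    Assumed layout (not confirmed by the post):
    {"event": [{"query": {"id": [{"timestamp_usec": "..."}],
                          "query_text": "..."}}]}
    """
    for event in history.get("event", []):
        query = event.get("query", {})
        text = query.get("query_text", "")
        for stamp in query.get("id", []):
            # Timestamps are assumed to be microseconds since the Unix epoch.
            usec = int(stamp["timestamp_usec"])
            when = datetime.fromtimestamp(usec / 1_000_000, tz=timezone.utc)
            yield when.isoformat(), text


def json_files_to_csv(json_paths, csv_path):
    """Write every search found in json_paths to a single CSV file."""
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["timestamp", "query_text"])
        for path in json_paths:
            with open(path, encoding="utf-8") as handle:
                writer.writerows(extract_rows(json.load(handle)))
```

Calling json_files_to_csv with the list of unzipped JSON files and an output path would then give one CSV with a row per search, which Pandas or Tableau can read directly.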

Though I have used Python/Pandas and Plotly for this kind of analysis before, I used Tableau this time around. Here is a simple graph of the number of searches per day over the last two years.

CompleteTimeSeries

The graph shows an overall increasing trend, which is not surprising but not terribly useful either. I am more interested in understanding my online behavior, i.e. at what time of day I am most active, and on which days and in which months I search more.
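
The charts in this post were built in Tableau, but the same breakdown by month, weekday and hour could be sketched in Pandas roughly as follows. The column names and the four sample rows are hypothetical stand-ins for the converted CSV.

```python
import pandas as pd

# Hypothetical sample; the real data would come from the converted CSV.
searches = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2015-08-03 09:15:00",
        "2015-08-03 23:40:00",
        "2015-08-08 14:05:00",
        "2015-12-11 10:30:00",
    ]),
    "query_text": ["a", "b", "c", "d"],
})

# Derive the grouping keys from the timestamp.
searches["month"] = searches["timestamp"].dt.month_name()
searches["weekday"] = searches["timestamp"].dt.day_name()
searches["hour"] = searches["timestamp"].dt.hour
# dayofweek is 0 for Monday through 6 for Sunday.
searches["day_type"] = searches["timestamp"].dt.dayofweek.map(
    lambda d: "weekend" if d >= 5 else "weekday"
)

by_month = searches.groupby("month").size()
by_weekday = searches.groupby("weekday").size()
by_hour = searches.groupby(["day_type", "hour"]).size()
```

Each of the three resulting series is a plain count of searches per group, which is all the bar and line charts below need.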

Activity By Month

MonthlyGrouping

August and September, followed by December, appear to be my most active months, while May and March are the least active.

Activity By Day

WeekDaySearchCounts

This visualization was a bit of a surprise, as I always thought I was much more active online on weekends than on weekdays. It turns out that is not the case. One big surprise, of course, is Friday. I knew I slow down by Friday but did not expect it to be by that much.

By the Hour of the day

DailyWeekendBrowsingBehavior

This looks straightforward and corresponds to my intuition, though I am a little surprised that my weekend pattern is so similar to my weekday pattern. I was expecting it to be a lot more divergent, i.e. much more subdued online activity throughout the day with peaks only in the evenings or at night. That turns out not to be the case.

Hourly – Over the Years

HourlyByYear1

My data for 2013 (when I started collecting my search results) and 2016 is incomplete. Once we disregard the relatively sparse data from those two years, my search activity has clearly increased over the years, but what is surprising is how consistent my activity pattern has been over the course of the day. My activity around midnight is still close to my daytime average, which is a bit worrisome, since I would expect it to be much lower. Time to get real about getting to bed a little earlier and getting more sleep.

What was I searching for?

Since the Google search history also contains the actual search text, with a little bit of text mining (sklearn and nltk) it’s easy to generate a snapshot of what you were looking for at a given point in life. I chose to do a simple word cloud using the Python wordcloud package. Generating word clouds with that package is fairly straightforward.
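
The word clouds themselves are not reproduced here, but the counting step behind them can be sketched with the standard library alone. The stop-word list and the sample queries below are hypothetical, and this is not the code used for the images in this post.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; nltk ships a much fuller one.
STOP_WORDS = {"a", "an", "the", "in", "to", "of", "for", "how", "and", "on"}


def word_frequencies(queries):
    """Count how often each non-stop word appears across search queries."""
    counts = Counter()
    for query in queries:
        # Lowercase, then pull out runs of letters, digits and apostrophes.
        for word in re.findall(r"[a-z0-9']+", query.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts


# Hypothetical queries standing in for one month of search history.
queries = ["how to write a resume", "resume templates", "visa interview for the US"]
frequencies = word_frequencies(queries)
```

A mapping like this is, as far as I know, what the wordcloud package's generate_from_frequencies method accepts, so the drawing step would be a one-liner on top of it.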

Here are a few snapshots from my life curated from my google search words.

Sept 2013

 

Nov 2013

Jan 2014

Sept 2014

Jan 2015 (frantic job search time)

Feb 2015 (move to the US)

Oct 2015

Padma Shri’s over the Years

Padma Shri (or Padmashree) is the fourth highest civilian award given by the Indian government after the Bharat Ratna, the Padma Vibhushan, and the Padma Bhushan.

The awards are given for excellence in all fields of activities – disciplines such as art, literature and education, sports, medicine, social work, science and engineering, public affairs, civil service and trade and industry.

While there is no reason to believe that all fields should be equally represented, some fields have claimed a rather larger share of the awards than you would expect. Here is a quick graphic of all Padma Shris grouped by category.

Number of Padma Shri Awards

Clearly Arts has claimed more of these awards than any other category. That wasn’t always the case, though. Arts only became the dominant category after 1984; it was rather a mixed bag until then.

PadmaShri Awards over the years by category

Since 1984, however, Arts has been the dominant category every year.

PadmaShri Awards over the years(for selected categories)

Of course, the story here is the secular rise in the overall number of awards over the years. And here is why those sharp peaks are intriguing:

Number of Padmashri awardees over the years

Make of the pattern what you will: almost all peaks occur in election years.

Here is what I think is happening (purely speculative, of course).

Recommendations for these awards are solicited from the states around August, and the winners are declared on or before Republic Day in January. Federal elections happen in May/June of the election year. Given the timing, I suppose it’s the ruling party at the center acquiescing to the demands of the ruling parties in the states, either to win their support in the upcoming elections or simply to repay past favors before its term ends.

Either way, looking at the states these awardees came from and the specific categories (arts-cinema vs arts-dance etc.) for which they were awarded might provide more clues to what is happening. An exercise for some other time.

GIYF

A nice little trick for scraping data from a list or a table on a website is the ‘IMPORTHTML’ function in Google Sheets/Docs. The function takes 3 parameters:

  • url – the url of the page you want to scrape
  • query – ‘list’ or ‘table’ html element that contains the data
  • index – the index, starting at 1, of the list or table on the page.

Here is a simple example

And the result of the import.

A useful trick to have in your back pocket for quickly scraping data without having to resort to Scrapy, Beautiful Soup or other tools.

The Mobile Numbers in India

The story of mobile phones in India is a veritable success story. It’s common knowledge that India is among the fastest growing mobile phone markets in the world, adding more subscribers every month than any other country. India still adds around 5-7 million new subscribers each month, and there is still quite a distance to go before this surge plateaus, at least in the rural market.

Teledensity - Rural India 2001-2014

The urban market, on the other hand, is showing signs of saturation, and the emphasis of the providers is moving away from SIM card sales to more value-added purchases.

Teledensity - Urban India 2001-2014

Anyway, that’s not really the point of this post. What is surprising in the data is the leap the Indian consumer made from no connectivity to a mobile phone, skipping the landline stage completely. The rise in teledensity in India simply reflects the rise of mobile phones.

Teledensity - India 2001-2014
Teledensity - Wired India 2001-2014
Teledensity - Wireless India 2001-2014

And what is even more remarkable is how far behind the public providers have been left in this race. This should put to rest any argument that the private sector cannot provide services in rural India, where the margins are potentially not as high, and hence any argument for the role of government (or its agencies like BSNL) in the marketplace.

Teledensity - Private & Public India 2001-2014

The biggest loser in all of this growth:

Posts and Letters Traffic 2000-2011

Admittedly, the traffic crashed between 2000 and 2002, before the surge in mobile phones began, but whatever little uptick happened around 2003-04 could barely withstand the onslaught of the cellphone revolution. An interesting contrast. Data on the mail is here.

Summer of Crime in Bihar

Does the weather or season affect the amount of crime that happens? The jury is out on that (see here, here and here). Clearly the issue has been studied quite a bit (this paper is from 1952, and this one is more recent), with no conclusive result just yet.

The most commonly seen result (so far) is that precipitation does not have a significant correlation with crime but temperature apparently does. Here is a nice summary of the research on this topic. I doubt such a correlation would hold in an Indian context, where the amount of economic activity depends significantly on how much rainfall a particular region gets in the season.

Anyway, while researching this earlier post, I landed on a crime stats dataset for Bihar that had data split by type of crime per month for the last few years. Once I started digging into the data, initially for murders (since that was the topic of the previous post), a curious observation jumped out from the graphs. For some reason the number of murders always seemed to be higher between May and July, in every year.

Murders in Bihar over the last 5 years

Intrigued, I checked this trend for the other violent crime categories listed by the Bihar Police. Here is what I found.

Cognizable Crimes in Bihar over the last 5 years
Rapes  in Bihar over the last 5 years
Robbery in Bihar over the last 5 years
Theft in Bihar over the last 5 years
Road_Robbery in Bihar over the last 5 years
Riots in Bihar over the last 5 years

Notice those huge bumps in the middle of the year. Only one category of crime (as listed in the dataset) did not show this behavior.

Kidnapping in Bihar over the last 5 years

I am not sure these trends mean anything in a behavioral sense, i.e. I doubt people get extra violent in summer or grow more docile in winter. If anything, these trends might just reflect some underlying economic patterns or hardships, which perhaps manifest as increased crime.

Lying with statistics – # 1

Here is a recent article about the purported rise in murders in Bihar (which of course was promptly and endlessly recirculated on social media).

The report does not mention when the data is from, but dog-whistles that the new government had something to do with it. That’s kinda funny, because the new government was only sworn in in mid-November. So it’s definitely not 2 months yet.

So, scouring for the data, I found this on the Bihar Police website. I assume this is what they were looking at, because the last two months for which data is available (Sept and Oct of 2015) do add up to 578, the number of murders mentioned in the article.

Anyway, when stated without context the number of murders does look significant, and the article does make it seem as if there has been a significant rise in murders (or “Jungle Raaj”, as the article puts it) after the new/old government took over (even though that is clearly not true).

So, I thought I’d put it in context. (Data from here.)

Picture

See the big jump in murders in September and October? Neither do I.

But Bihar is a violent state, everyone knows that. These numbers are probably higher than those for other states (an exercise for another time) and hence sound scary. But I don’t see how they are any different from what they were in previous months.

Just for comparison, look at what they were in 2014.

Picture1

The numbers look much better in 2015 than they did in 2014. But that doesn’t make a great headline, does it?

Digging in further, I found an interesting pattern in the number of murders (gruesome as it sounds) and the specific time of year when they seem to rise. More about it in the next post.

Curious Case of Failing Prediction Markets in India (or Not)

Here is an interesting article that summarizes the failure of election predictions in India, at both the national and the state level.

This report by IDFC Institute says not a single survey since 1996 has been accurate (when defined at the 5% confidence interval level). And the worrying trend is this:

“Accuracy levels of these surveys show no signs of an improving trend since 1996 though sample sizes and number of surveys have increased substantially.”

Now if more data is not improving your accuracy, then one needs to question the sampling procedures these surveys employ.

On a related note, some of the smartest entrepreneurs are already on the task of solving this prediction crisis... in truly Indian style, of course.

Stars Lining up for Indian Astrology startups

All those graduates of the astrology programs have to be employed somewhere.

What ingredients define your Cuisine?

Dan Kopf of Priceonomics does a wonderful job of exploring various facets of the cuisines of the world, or rather those available in the Epicurious dataset. In the article he looks at the most commonly used ingredients across various cuisines, the most distinguishing ingredients of each cuisine and the types of meat used in each.

Inspired by his article, I did some follow-up work to replicate some of his findings, discover some new insights, and use the dataset to build some machine learning models that predict cuisines from ingredients. While the accuracy of the models is nothing to write home about (and was never the goal to begin with), it was a fun little exercise to refine my ML chops. What follows is a series (2-3 perhaps) of blog posts describing both the insights from the data and the models.

To start off, Dan (and I) got the data from here. Dan, curiously, uses data only from Epicurious; I use both csv files (epicurious and all-recipies).

For my data analysis, I use the wonderful Python library Pandas and the equally remarkable IPython notebook (with a little help from PyCharm for long-running tasks).

If you want to follow along, you can get the ipython notebook from my github repo here.

The Dataset

The dataset is a collection (actually just 2) of csv files that contain the cuisine name in the first column and the ingredients of that dish in the following columns. Since the number of ingredients is not predefined, the number of columns varies from dish to dish.

Pandas does not seem to handle a variable number of columns in a csv very well. So I import each line (cuisine + ingredients) into a single column. Once the data is loaded into a one-column dataframe, I use a couple of lambda functions with apply to split the single column into two columns (cuisine + ingredients). Following is the code to do that.
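
The original code is not reproduced here. A minimal sketch of the approach just described might look like this; the three sample rows and the column names raw, cuisine and ingredients are assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical rows in the layout described above: cuisine first, then a
# variable number of ingredients. A real run would open the csv file instead.
raw = io.StringIO(
    "Indian,cumin,turmeric,onion,garlic\n"
    "Italian,tomato,olive_oil,basil\n"
    "chinese,soy_sauce,ginger\n"
)

# Keep every line whole so the ragged column count never reaches a CSV parser.
lines = [line.strip() for line in raw if line.strip()]
recipes = pd.DataFrame({"raw": lines})

# Split on the first comma only: cuisine on the left, ingredients on the right.
recipes["cuisine"] = recipes["raw"].apply(lambda row: row.partition(",")[0])
recipes["ingredients"] = recipes["raw"].apply(lambda row: row.partition(",")[2])
recipes = recipes.drop(columns="raw")
```

Using partition rather than split means a line with no comma at all simply ends up with an empty ingredients string instead of raising an error.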

After a little cleaning and standardization (china, chinese -> Chinese; italian, italy -> Italian) I head on to data exploration.

Vectorization

To derive any insights from this text data, the first step is to convert the text data (ingredients) into numerical feature vectors. This is done by first tokenizing the data and then building a matrix of all ingredients. Sklearn has many more simple and advanced examples of working with text here.

Here is the code that uses the CountVectorizer class to vectorize my ingredients list.
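
The original code is not reproduced here. A minimal sketch of vectorizing comma-separated ingredient strings with CountVectorizer, under the assumption that each whole ingredient should be one token, could be written as follows. The sample dishes and the split_ingredients helper are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical ingredient strings, one comma-separated string per dish.
ingredients = [
    "cumin,turmeric,onion,garlic",
    "tomato,olive_oil,basil,garlic",
    "soy_sauce,ginger,garlic",
]


def split_ingredients(text):
    """Treat each comma-separated ingredient as one token."""
    return [item.strip() for item in text.split(",") if item.strip()]


# binary=True records presence or absence rather than a count per dish.
# token_pattern=None because the custom tokenizer replaces the default regex.
vectorizer = CountVectorizer(
    tokenizer=split_ingredients, token_pattern=None, binary=True
)
matrix = vectorizer.fit_transform(ingredients)  # sparse, dishes x ingredients

# vocabulary_ maps each ingredient to its column index in the matrix.
garlic_column = vectorizer.vocabulary_["garlic"]
```

The custom tokenizer matters here: the default token pattern would also keep names like olive_oil whole, but splitting on commas makes the intent explicit and survives ingredient names containing spaces or hyphens.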

Data Exploration

Now that the data janitoring is all done, the dataframe is ready for some exploration.

One thing to remember is that most of these recipes came from Epicurious, Gourmet and other magazines that cater to a predominantly North American readership. So no wonder a significant chunk of the recipes have their cuisine listed as North American (American, Canadian, Cajun_Creole, Southern_SoulFood, Southwestern).

One of the first things to look at is the distribution of these cuisines in our collection. A simple groupby in pandas does the deed.
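
As a rough illustration of that groupby, assuming a dataframe with a cuisine_group column (a hypothetical name, as are the five sample rows):

```python
import pandas as pd

# Hypothetical rows; "cuisine_group" is an assumed column name.
recipes = pd.DataFrame({
    "cuisine": ["American", "Canada", "Cajun_Creole", "Mexican", "Indian"],
    "cuisine_group": [
        "NorthAmerican", "NorthAmerican", "NorthAmerican",
        "LatinAmerican", "SouthAsian",
    ],
})

# Number of recipes per group, largest first.
counts = recipes.groupby("cuisine_group").size().sort_values(ascending=False)

# The same distribution expressed as a share of the whole collection.
shares = counts / counts.sum()
```

Plotting counts as a bar chart gives the kind of picture shown next.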

Here is the output we see. Clearly North American cuisines comprise over 75% of the dataset. Latin American is a distant second.

cgbarchart

Now that we have the caveat out of the way, let’s look at some interesting things hidden in the data.

One quick visualization is to build histograms of the number of ingredients that typically go into a dish in a particular cuisine. From my experience (of tasting many cuisines, not making any of them), most South Asian cuisines tend to have a large number of ingredients, while most North American, British and West European cuisines tend to be a little parsimonious. Let’s see if the data bears out my hypothesis.

Here are histograms for the top 25 cuisines in my dataset.

histograms

These histograms show the count of ingredients on the X-axis and the number of dishes with that many ingredients on the Y-axis. What’s interesting is how similar American, Canadian, French and other Western European cuisines are: a large number of dishes with 5-9 ingredients and very few beyond that. If for a minute you exclude staples like oil, salt and meat, there are very few other ingredients in most dishes in these cuisines. Now contrast them with something like Indian, which shows an almost bimodal or trimodal distribution, or Thai, which possibly has more dishes with a greater number of ingredients than with fewer. No wonder Indian food is among the most delicious cuisines but is also expensive to make and, surprisingly, not so popular in the US.

Another interesting distribution is that of Scandinavian cuisine, where the number of dishes with more than 10 ingredients falls off much more rapidly than in most other cuisines.

Here are the histograms at the cuisine group level, which show very similar patterns.

histogramcg

Here is the python code that does the data-munging and generates the nice pictures above.
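
The original code is not reproduced here. A minimal sketch of the munging step behind such histograms, assuming a dataframe with cuisine and comma-separated ingredients columns (hypothetical names and data), might be:

```python
import pandas as pd

# Hypothetical dishes; the real dataframe has one row per recipe.
recipes = pd.DataFrame({
    "cuisine": ["Indian", "Indian", "French", "French"],
    "ingredients": [
        "cumin,turmeric,onion,garlic,ginger,cayenne",
        "rice,cumin",
        "butter,wheat,egg",
        "cream,egg,vanilla",
    ],
})

# Number of ingredients in each dish.
recipes["n_ingredients"] = recipes["ingredients"].str.split(",").str.len()

# One row per cuisine, one column per ingredient count: these are the
# bar heights a per-cuisine histogram would draw.
distribution = (
    recipes.groupby(["cuisine", "n_ingredients"]).size().unstack(fill_value=0)
)
```

From there, something like recipes.hist(column="n_ingredients", by="cuisine") should draw one histogram per cuisine if matplotlib is installed.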

What’s the most common Ingredient

All cuisines are typically based on a few ingredients that are generally available in plenty in that region. So it’s only natural to look for the most common ones in each cuisine. To make this a little interesting, I exclude some of the more mundane 🙂 ingredients like salt, garlic and onions. I use the stop words property of the vectorizer to do this.
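
A hedged sketch of how the stop-word exclusion and the per-cuisine ranking could fit together is shown below. The sample dishes, column names and helper function are hypothetical, and this is not necessarily how the table that follows was produced.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical dishes; column names are assumptions.
recipes = pd.DataFrame({
    "cuisine": ["Indian", "Indian", "Italian", "Italian"],
    "ingredients": [
        "salt,cumin,onion,turmeric",
        "salt,cumin,garlic",
        "salt,tomato,garlic,basil",
        "onion,tomato,olive_oil",
    ],
})


def split_ingredients(text):
    """Treat each comma-separated ingredient as one token."""
    return [item.strip() for item in text.split(",") if item.strip()]


# stop_words drops the mundane ingredients before the matrix is built.
vectorizer = CountVectorizer(
    tokenizer=split_ingredients,
    token_pattern=None,
    binary=True,
    stop_words=["salt", "garlic", "onion"],
)
matrix = vectorizer.fit_transform(recipes["ingredients"])

# Column names in the same order as the matrix columns.
names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
presence = pd.DataFrame(matrix.toarray(), columns=names)

# The mean of a 0/1 column within a cuisine is the fraction of that
# cuisine's dishes containing the ingredient.
fractions = presence.groupby(recipes["cuisine"]).mean()
top = pd.DataFrame({
    "ingredient": fractions.idxmax(axis=1),
    "fraction": fractions.max(axis=1),
})
```

Note that the resulting values are fractions between 0 and 1 rather than percentages.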

Here is the top ingredient of each cuisine.

Cuisine | Ingredient | Fraction of dishes containing
Vietnam | fish | 0.736842
Indian | cumin | 0.603679
Spanish_Portuguese | bell_pepper | 0.343643
Jewish | egg | 0.593750
French | egg | 0.441456
Central_SouthAmerican | cayenne | 0.518672
Cajun_Creole | cayenne | 0.561644
Thai | fish | 0.529412
Scandinavian | wheat | 0.580000
Greek | lemon_juice | 0.337778
American | egg | 0.405131
African | cumin | 0.426087
MiddleEastern | wheat | 0.379032
EasternEuropean_Russian | egg | 0.506849
Italian | tomato | 0.392317
Irish | wheat | 0.500000
Mexican | cayenne | 0.736953
Chinese | soy_sauce | 0.679775
German | wheat | 0.647059
Mediterranean | tomato | 0.349481
Japanese | soy_sauce | 0.588235
Moroccan | cumin | 0.547445
Southern_SoulFood | wheat | 0.485549
English_Scottish | wheat | 0.622549
Asian | soy_sauce | 0.500000
Southwestern | cayenne | 0.814815
Canada | wheat | 0.395349
Turkey | tomato | 0.437500
Caribbean | vegetable_oil | 0.311475
Bangladesh | cayenne | 1.000000
Israel | wheat | 0.666667
Korean | soy_sauce | 0.781250
Iran | tomato | 0.333333
Eastern-Europe | wheat | 0.531915
South-African | turmeric | 0.437500
UK-and-Ireland | wheat | 0.606383
Belgium | wheat | 0.909091
South-America | egg | 0.349515
Spain | tomato | 0.426667
Netherlands | wheat | 0.937500
Philippines | vegetable_oil | 0.488372
Indonesia | soy_sauce | 0.500000
East-African | ginger | 0.363636
Switzerland | wheat | 0.650000
West-African | peanut_butter | 0.615385
North-African | cumin | 0.483333
Pakistan | cumin | 0.684211
Portugal | egg | 0.360000
Lebanon | lemon_juice | 0.645161
Malaysia | coconut | 0.555556
Austria | egg | 0.809524

Do you see any surprises?

What two ingredients always go together

Another interesting characteristic of most cuisines is the set of ingredients that often seem to go together. Such patterns are so common that there are sites that help you find the appropriate flavor combos.

In the text analytics world this is called finding co-occurrences. This could have some value in a prediction context because it gives better insight into what the context is. In this case, where the text really is a stream of unconnected labels (as seen by the count vectorizer), adding co-occurrences to the feature set might provide additional features that help separate the classes better.
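
Counting co-occurrences of ingredient pairs needs nothing beyond the standard library. The sketch below uses hypothetical dishes and a made-up function name, and is one straightforward way to do it rather than the exact code behind the numbers in this post.

```python
from collections import Counter
from itertools import combinations


def pair_counts(dishes):
    """Count how many dishes contain each unordered pair of ingredients."""
    counts = Counter()
    for ingredients in dishes:
        # sorted(set(...)) removes duplicates and gives each pair one
        # canonical, alphabetical form such as ("cumin", "turmeric").
        for pair in combinations(sorted(set(ingredients)), 2):
            counts[pair] += 1
    return counts


# Hypothetical dishes for a single cuisine.
indian = [
    ["cumin", "turmeric", "onion"],
    ["cumin", "turmeric", "coriander"],
    ["cumin", "coriander"],
]

# The ten most frequent pairs, as ((ingredient_a, ingredient_b), count) tuples.
top_pairs = pair_counts(indian).most_common(10)
```

Running this once per cuisine and keeping the top ten pairs yields rows of ((a, b), count) tuples.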

The table shows the most commonly occurring pairs of ingredients in each cuisine, ordered (along the columns) by decreasing count.

Cuisine | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
African | (cumin, olive_oil): 38 | (garlic, olive_oil): 35 | (olive_oil, onion): 29 | (cayenne, garlic): 27 | (bell_pepper, garlic): 26 | (coriander, cumin): 25 | (ginger, onion): 22 | (cilantro, olive_oil): 21 | (black_pepper, onion): 21 | (lemon_juice, olive_oil): 20
American | (egg, wheat): 11147 | (butter, wheat): 10598 | (milk, wheat): 7147 | (vanilla, wheat): 6020 | (garlic, onion): 5573 | (cream, egg): 3996 | (onion, pepper): 3927 | (cane_molasses, wheat): 3691 | (black_pepper, onion): 3325 | (vegetable_oil, wheat): 3073
Asian | (ginger, soy_sauce): 366 | (garlic, soy_sauce): 339 | (scallion, soy_sauce): 291 | (rice, vinegar): 265 | (sesame_oil, soy_sauce): 259 | (soy_sauce, vinegar): 247 | (cayenne, garlic): 230 | (vegetable_oil, vinegar): 151 | (cilantro, ginger): 150 | (fish, garlic): 131
Austria | (egg, wheat): 15 | (butter, wheat): 14 | (milk, wheat): 9 | (vanilla, wheat): 9 | (almond, egg): 5 | (cocoa, wheat): 5 | (bread, wheat): 4 | (cream, wheat): 3 | (rum, wheat): 3 | (hazelnut, wheat): 3
Bangladesh | (onion, vegetable_oil): 4 | (cayenne, onion): 4 | (turmeric, vegetable_oil): 4 | (garlic, onion): 3 | (coriander, onion): 2 | (cilantro, onion): 2 | (beef, coriander): 2 | (potato, vegetable_oil): 2 | (cardamom, coriander): 2 | (ginger, onion): 2
Belgium | (butter, wheat): 8 | (cane_molasses, wheat): 6 | (egg, wheat): 5 | (milk, wheat): 3 | (parsley, wheat): 3 | (black_pepper, butter): 3 | (onion, wheat): 3 | (nutmeg, wheat): 3 | (cinnamon, wheat): 3 | (almond, butter): 2
Cajun_Creole | (cayenne, onion): 67 | (garlic, onion): 63 | (bell_pepper, onion): 44 | (green_bell_pepper, onion): 42 | (black_pepper, onion): 39 | (celery, onion): 38 | (onion, vegetable_oil): 38 | (butter, onion): 36 | (olive_oil, onion): 34 | (chicken_broth, onion): 29
Canada | (butter, wheat): 195 | (egg, wheat): 192 | (milk, wheat): 135 | (garlic, onion): 124 | (black_pepper, onion): 105 | (vanilla, wheat): 100 | (onion, tomato): 97 | (cane_molasses, wheat): 94 | (cinnamon, wheat): 58 | (cream, wheat): 57
Caribbean | (garlic, onion): 68 | (onion, tomato): 47 | (black_pepper, onion): 40 | (olive_oil, onion): 36 | (cayenne, onion): 32 | (green_bell_pepper, onion): 31 | (cumin, garlic): 30 | (bell_pepper, onion): 28 | (chicken, garlic): 27 | (tomato, vegetable_oil): 24
Central_SouthAmerican | (garlic, onion): 100 | (cayenne, onion): 91 | (onion, tomato): 81 | (corn, garlic): 58 | (cilantro, onion): 54 | (cumin, garlic): 45 | (olive_oil, onion): 38 | (black_pepper, garlic): 36 | (bell_pepper, garlic): 32 | (cheese, onion): 30
Chinese | (ginger, soy_sauce): 156 | (garlic, soy_sauce): 145 | (scallion, soy_sauce): 133 | (sesame_oil, soy_sauce): 115 | (soy_sauce, starch): 88 | (pork, soy_sauce): 79 | (rice, soy_sauce): 74 | (pepper, soy_sauce): 60 | (chicken, soy_sauce): 56 | (cayenne, soy_sauce): 56
East-African | (ginger, onion): 4 | (garlic, onion): 4 | (cayenne, onion): 3 | (onion, pepper): 3 | (cumin, onion): 3 | (olive_oil, onion): 2 | (carrot, onion): 2 | (beef, onion): 2 | (cardamom, pepper): 2 | (black_pepper, onion): 2
Eastern-Europe | (egg, wheat): 96 | (butter, wheat): 77 | (milk, wheat): 54 | (garlic, onion): 42 | (onion, tomato): 40 | (black_pepper, onion): 39 | (wheat, yeast): 36 | (cream, wheat): 36 | (bell_pepper, onion): 31 | (vanilla, wheat): 28
EasternEuropean_Russian | (butter, wheat): 62 | (egg, wheat): 54 | (cream, wheat): 26 | (black_pepper, onion): 23 | (onion, potato): 23 | (garlic, onion): 21 | (milk, wheat): 20 | (vanilla, wheat): 20 | (vegetable_oil, wheat): 17 | (olive_oil, parsley): 14
English_Scottish | (butter, wheat): 106 | (egg, wheat): 93 | (cream, wheat): 59 | (milk, wheat): 58 | (cane_molasses, wheat): 33 | (vanilla, wheat): 27 | (buttermilk, wheat): 23 | (onion, wheat): 20 | (milk_fat, wheat): 20 | (vegetable_oil, wheat): 19
French | (butter, wheat): 374 | (egg, wheat): 338 | (cream, egg): 210 | (garlic, olive_oil): 201 | (milk, wheat): 176 | (olive_oil, onion): 135 | (vanilla, wheat): 133 | (black_pepper, onion): 125 | (onion, parsley): 125 | (bay, onion): 93
German | (egg, wheat): 150 | (butter, wheat): 111 | (milk, wheat): 74 | (cinnamon, wheat): 50 | (onion, vinegar): 48 | (vanilla, wheat): 43 | (black_pepper, onion): 42 | (cream, wheat): 34 | (nutmeg, wheat): 32 | (bacon, onion): 31
Greek | (garlic, olive_oil): 89 | (olive_oil, onion): 71 | (feta_cheese, olive_oil): 61 | (lemon_juice, olive_oil): 60 | (onion, tomato): 48 | (bread, olive_oil): 42 | (black_pepper, olive_oil): 39 | (oregano, tomato): 36 | (egg, wheat): 34 | (butter, wheat): 32
Indian | (cumin, turmeric): 261 | (coriander, cumin): 259 | (onion, turmeric): 203 | (cayenne, cumin): 192 | (garlic, onion): 192 | (pepper, turmeric): 170 | (fenugreek, turmeric): 169 | (ginger, onion): 165 | (turmeric, vegetable_oil): 157 | (cilantro, cumin): 115
Indonesia | (garlic, vegetable_oil): 6 | (cayenne, garlic): 5 | (lemon_juice, soy_sauce): 3 | (chicken, soy_sauce): 3 | (soy_sauce, tomato): 3 | (scallion, soy_sauce): 3 | (rice, soy_sauce): 3 | (tomato, vegetable_oil): 3 | (onion, soy_sauce): 3 | (egg, tomato): 3
Iran | (onion, tomato): 7 | (olive_oil, onion): 5 | (black_pepper, onion): 5 | (chicken, onion): 4 | (turmeric, vegetable_oil): 4 | (dill, yogurt): 3 | (lime_juice, onion): 3 | (beef, onion): 3 | (parsley, tomato): 3 | (cardamom, rose): 3
Irish | (butter, wheat): 32 | (egg, wheat): 27 | (buttermilk, wheat): 21 | (cream, egg): 18 | (cane_molasses, wheat): 13 | (onion, potato): 12 | (wheat, whole_grain_wheat_flour): 12 | (oat, wheat): 12 | (carrot, onion): 11 | (vanilla, wheat): 9
Israel | (butter, wheat): 4 | (egg, wheat): 3 | (sesame_seed, wheat): 2 | (vegetable_oil, wheat): 2 | (black_pepper, olive_oil): 2 | (garlic, parsley): 2 | (onion, wheat): 2 | (cumin, olive_oil): 2 | (olive_oil, turkey): 1 | (coconut, wheat): 1
Italian | (garlic, olive_oil): 1334 | (olive_oil, tomato): 956 | (basil, garlic): 787 | (onion, tomato): 630 | (macaroni, olive_oil): 629 | (black_pepper, olive_oil): 581 | (egg, wheat): 472 | (parmesan_cheese, tomato): 440 | (bell_pepper, olive_oil): 409 | (butter, wheat): 390
Japanese | (rice, vinegar): 56 | (soy_sauce, vegetable_oil): 55 | (ginger, soy_sauce): 48 | (scallion, soy_sauce): 46 | (sake, soy_sauce): 40 | (garlic, soy_sauce): 37 | (barley, soybean): 33 | (vegetable_oil, vinegar): 33 | (vinegar, wine): 29 | (egg, vegetable_oil): 29
Jewish | (egg, wheat): 134 | (butter, egg): 78 | (vegetable_oil, wheat): 60 | (onion, wheat): 43 | (garlic, olive_oil): 41 | (cinnamon, egg): 40 | (olive_oil, onion): 40 | (black_pepper, egg): 35 | (lemon_juice, wheat): 28 | (vanilla, wheat): 28
Korean | (garlic, soy_sauce): 18 | (sesame_oil, soy_sauce): 17 | (scallion, soy_sauce): 14 | (beef, soy_sauce): 12 | (sesame_seed, soy_sauce): 12 | (black_pepper, soy_sauce): 11 | (cayenne, soy_sauce): 9 | (onion, soy_sauce): 8 | (soy_sauce, vinegar): 8 | (ginger, soy_sauce): 7
Lebanon | (lemon_juice, olive_oil): 15 | (garlic, olive_oil): 15 | (olive_oil, onion): 11 | (onion, parsley): 9 | (mint, olive_oil): 8 | (parsley, tomato): 7 | (bread, olive_oil): 7 | (black_pepper, olive_oil): 6 | (lettuce, tomato): 5 | (cucumber, tomato): 5
Malaysia | (garlic, pepper): 7 | (coriander, garlic): 7 | (cumin, garlic): 7 | (coconut, ginger): 6 | (fenugreek, pepper): 6 | (pepper, turmeric): 6 | (ginger, onion): 5 | (chicken, garlic): 5 | (onion, vegetable_oil): 5 | (olive_oil, pepper): 4
Mediterranean | (garlic, olive_oil): 133 | (olive_oil, onion): 100 | (onion, tomato): 63 | (lemon_juice, olive_oil): 58 | (bell_pepper, olive_oil): 49 | (olive, olive_oil): 41 | (parsley, tomato): 38 | (lemon, olive_oil): 37 | (black_pepper, olive_oil): 36 | (fish, olive_oil): 35
Mexican | (cayenne, onion): 1354 | (garlic, onion): 1253 | (onion, tomato): 1186 | (cumin, garlic): 623 | (corn, onion): 552 | (cilantro, onion): 454 | (bell_pepper, onion): 413 | (black_pepper, onion): 400 | (cheddar_cheese, onion): 394 | (tomato, vegetable_oil): 381
MiddleEastern | (garlic, olive_oil): 83 | (lemon_juice, olive_oil): 66 | (olive_oil, onion): 65 | (cumin, olive_oil): 61 | (black_pepper, olive_oil): 44 | (onion, wheat): 39 | (bread, olive_oil): 38 | (bell_pepper, olive_oil): 35 | (mint, olive_oil): 34 | (cayenne, olive_oil): 33
Moroccan | (cumin, olive_oil): 61 | (olive_oil, onion): 49 | (garlic, olive_oil): 48 | (cinnamon, olive_oil): 40 | (coriander, cumin): 38 | (bell_pepper, olive_oil): 35 | (cilantro, olive_oil): 34 | (cayenne, cumin): 33 | (lemon_juice, olive_oil): 31 | (ginger, onion): 31
Netherlands | (butter, wheat): 25 | (egg, wheat): 21 | (cinnamon, wheat): 13 | (cane_molasses, wheat): 10 | (almond, butter): 9 | (milk, wheat): 9 | (apple, wheat): 8 | (lard, wheat): 5 | (wheat, yeast): 5 | (nutmeg, wheat): 5
North-African | (cumin, onion): 20 | (garlic, olive_oil): 20 | (onion, tomato): 19 | (olive_oil, tomato): 17 | (cayenne, garlic): 16 | (coriander, cumin): 14 | (bell_pepper, olive_oil): 14 | (chickpea, onion): 12 | (black_pepper, cumin): 12 | (carrot, onion): 12
Pakistan | (garlic, onion): 13 | (cayenne, onion): 12 | (cumin, garlic): 12 | (onion, turmeric): 11 | (ginger, onion): 9 | (chicken, garlic): 8 | (cilantro, garlic): 8 | (tomato, turmeric): 8 | (black_pepper, garlic): 7 | (turmeric, vegetable_oil): 7
Philippines | (garlic, soy_sauce): 15 | (black_pepper, soy_sauce): 14 | (onion, vegetable_oil): 13 | (soy_sauce, vegetable_oil): 11 | (chicken, soy_sauce): 10 | (pork, soy_sauce): 9 | (beef, garlic): 8 | (egg, wheat): 8 | (carrot, pork): 7 | (bay, soy_sauce): 7
Portugal | (garlic, onion): 18 | (bell_pepper, garlic): 14 | (butter, egg): 14 | (onion, tomato): 14 | (olive_oil, onion): 13 | (egg, wheat): 12 | (milk, wheat): 10 | (black_pepper, garlic): 10 | (wheat, yeast): 8 | (pork_sausage, potato): 8
Scandinavian | (butter, wheat): 120 | (egg, wheat): 100 | (milk, wheat): 47 | (cream, egg): 43 | (almond, wheat): 34 | (vanilla, wheat): 28 | (cinnamon, wheat): 24 | (wheat, yeast): 22 | (cane_molasses, wheat): 22 | (cardamom, wheat): 22
South-African | (coriander, fenugreek): 6 | (fenugreek, pepper): 6 | (pepper, turmeric): 6 | (onion, pepper): 6 | (cumin, fenugreek): 6 | (bay, onion): 6 | (beef, onion): 5 | (egg, milk): 5 | (black_pepper, onion): 4 | (turmeric, vinegar): 4
South-America | (garlic, onion): 31 | (onion, pepper): 22 | (olive_oil, onion): 19 | (egg, milk): 19 | (bell_pepper, onion): 17 | (cumin, garlic): 16 | (milk, wheat): 15 | (cayenne, onion): 15 | (butter, wheat): 13 | (chicken, onion): 12
Southern_SoulFood | (butter, wheat): 125 | (egg, wheat): 99 | (corn, egg): 73 | (milk, wheat): 63 | (buttermilk, wheat): 55 | (garlic, onion): 47 | (cayenne, onion): 46 | (cream, wheat): 46 | (black_pepper, onion): 41 | (onion, vinegar): 38
Southwestern | (cayenne, onion): 59 | (garlic, onion): 51 | (cilantro, onion): 39 | (onion, tomato): 39 | (corn, onion): 29 | (cumin, garlic): 28 | (bell_pepper, cayenne): 28 | (olive_oil, onion): 27 | (black_pepper, cayenne): 22 | (lime_juice, onion): 18
Spain | (olive_oil, onion): 34 | (garlic, olive_oil): 29 | (onion, tomato): 27 | (bell_pepper, olive_oil): 23 | (green_bell_pepper, onion): 16 | (chicken, garlic): 14 | (pepper, tomato): 14 | (black_pepper, garlic): 13 | (cayenne, onion): 12 | (tomato, vinegar): 11
Spanish_Portuguese | (garlic, olive_oil): 141 | (olive_oil, onion): 98 | (bell_pepper, olive_oil): 86 | (onion, tomato): 69 | (cayenne, garlic): 52 | (black_pepper, garlic): 44 | (sherry, vinegar): 39 | (chicken_broth, garlic): 36 | (egg, olive_oil): 36 | (parsley, tomato): 35
Switzerland | (egg, wheat): 9 | (butter, wheat): 8 | (onion, pepper): 5 | (milk, wheat): 4 | (nutmeg, wheat): 4 | (cheese, pepper): 3 | (lemon, wheat): 3 | (cherry, wheat): 3 | (pepper, wheat): 3 | (almond, wheat): 2
Thai | (fish, garlic): 92 | (garlic, ginger): 90 | (cayenne, garlic): 90 | (coriander, cumin): 83 | (cilantro, garlic): 78 | (coconut, coriander): 71 | (cumin, turmeric): 66 | (pepper, turmeric): 62 | (fenugreek, pepper): 61 | (chicken, garlic): 59
Turkey | (garlic, onion): 7 | (olive_oil, onion): 6 | (onion, tomato): 6 | (bell_pepper, garlic): 6 | (chicken, onion): 5 | (egg, wheat): 4 | (beef, tomato): 3 | (pepper, tomato): 3 | (black_pepper, onion): 3 | (butter, wheat): 3
UK-and-Ireland | (butter, wheat): 126 | (egg, wheat): 117 | (milk, wheat): 77 | (onion, potato): 44 | (lard, wheat): 35 | (cane_molasses, wheat): 35 | (raisin, wheat): 34 | (beef, onion): 32 | (carrot, onion): 31 | (cinnamon, wheat): 26
Vietnam | (fish, garlic): 54 | (cayenne, fish): 35 | (garlic, vegetable_oil): 35 | (cilantro, fish): 32 | (rice, vegetable_oil): 22 | (basil, fish): 21 | (black_pepper, fish): 20 | (carrot, garlic): 20 | (mint, rice): 20 | (lime_juice, rice): 17
West-African | (cayenne, onion): 8 | (onion, tomato): 8 | (garlic, onion): 6 | (peanut_butter, tomato): 5 | (bell_pepper, onion): 5 | (olive_oil, onion): 4 | (chicken, onion): 4 | (tomato, vegetable_oil): 4 | (chicken_broth, onion): 4 | (cumin, onion): 4

So there you go: a little bit of data munging and some interesting insights into the types of ingredients that comprise a cuisine. In a later post, I will explore how this data can be turned into supervised and unsupervised learning problems, with the ingredients constituting the feature vectors and the cuisine or cuisine group as the labels.

Please feel free to leave any feedback in the comments section.

The Dummy Variable

In statistics and econometrics, a dummy variable is used to represent the presence or absence of some categorical effect. In field experiments, a dummy variable is used to distinguish between treatment groups, e.g. the control group and the treatment group.

Also a cool name for a blog which will most likely be about data, statistics and economics.

More substantive or trivial posts coming soon (hopefully).