Tuesday, June 10, 2014

Playing with Yelp's Phoenix Academic Dataset: Hierarchical Clustering

Yelp_Academic_Dataset

Yelp Dataset Challenge

Loading in the datasets

In [2]:
import sys
import re
import os
import shutil
import commands
import pandas as pd
import numpy as np
import json
import matplotlib
import matplotlib.pyplot as plt
from datetime import datetime

import statsmodels.api as sm
from pandas.tools.plotting import autocorrelation_plot
import itertools

from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, dendrogram
import scipy.cluster.hierarchy as sch
In [3]:
%matplotlib inline
In [4]:
filename = "C:\Users\jchao\Desktop\yelp_phoenix_academic_dataset\yelp_academic_dataset_review.json"
f = open(filename,'rU')
review = [json.loads(line) for line in f]
f.close()
review_df = pd.DataFrame(review)
review_colnames = review_df.columns.values.tolist()
print review_colnames
[u'business_id', u'date', u'review_id', u'stars', u'text', u'type', u'user_id', u'votes']

In [4]:
print review_df.head(1)
              business_id        date               review_id  stars  \
0  WIcDFpHEnC3ihNmS7-6-ZA  2011-02-11  0ESSqLfOae77muWTv_zUqA      3   

                                                text    type  \
0  Lately i have been feeling homesick for asian ...  review   

                  user_id                                    votes  
0  r-t7IiTSD0QZdt8lOUCqeQ  {u'funny': 1, u'useful': 1, u'cool': 1}  

[1 rows x 8 columns]

In [5]:
filename1 = "C:\Users\jchao\Desktop\yelp_phoenix_academic_dataset\yelp_academic_dataset_business.json"
f1 = open(filename1,'rU')
biz = [json.loads(line) for line in f1]
f1.close()
biz_df = pd.DataFrame(biz)
biz_colnames = biz_df.columns.values.tolist()
print biz_colnames
[u'attributes', u'business_id', u'categories', u'city', u'full_address', u'hours', u'latitude', u'longitude', u'name', u'neighborhoods', u'open', u'review_count', u'stars', u'state', u'type']

In [6]:
filename3 = "C:\Users\jchao\Desktop\yelp_phoenix_academic_dataset\yelp_academic_dataset_checkin.json"
f3 = open(filename3,'rU')
checkin = [json.loads(line) for line in f3]
f3.close()
checkin_df = pd.DataFrame(checkin)
checkin_colnames = checkin_df.columns.values.tolist()
print checkin_colnames
[u'business_id', u'checkin_info', u'type']

First, convert the dates into day of week.

In [7]:
review_df['date'] = pd.to_datetime(review_df['date'])
print review_df.dtypes
review_df['wkday'] = [datetime.weekday(d) for d in review_df['date']]
review_df.head(1)
business_id            object
date           datetime64[ns]
review_id              object
stars                   int64
text                   object
type                   object
user_id                object
votes                  object
dtype: object

Out[7]:
              business_id        date               review_id  stars  \
0  WIcDFpHEnC3ihNmS7-6-ZA  2011-02-11  0ESSqLfOae77muWTv_zUqA      3   

                                                text    type  \
0  Lately i have been feeling homesick for asian ...  review   

                  user_id                                    votes  wkday  
0  r-t7IiTSD0QZdt8lOUCqeQ  {u'funny': 1, u'useful': 1, u'cool': 1}      4  

[1 rows x 9 columns]

In [8]:
bywkday = review_df.groupby(by='wkday')
bywkday['stars'].mean()
Out[8]:
wkday
0        3.775806
1        3.807053
2        3.807055
3        3.797368
4        3.783918
5        3.765101
6        3.733346
Name: stars, dtype: float64

Looking at the average stars people give by day of week, the differences between days appear to be small.

In [9]:
wkdayct = bywkday.size()
print wkdayct
wkday
0        52798
1        49371
2        49636
3        45595
4        43812
5        44964
6        48846
dtype: int64

0 is Monday and 6 is Sunday. There is a pretty clear pattern that people review a lot more between Sunday and Tuesday than during the rest of the week. I posit this is because there is a lag between people's visit and their review, i.e. checking out restaurants during the weekend and then writing the review at work on Monday (oops!).
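To put a rough number on that skew, here's a quick check off the wkdayct counts above:

# share of reviews landing on Sun-Tue (wkday 6, 0, 1)
total = wkdayct.sum()
sun_tue = wkdayct[[6, 0, 1]].sum()
print(float(sun_tue) / total)  # ~0.45, vs. 3/7 ~= 0.43 if reviews were spread evenly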

In [10]:
review_by_day = pd.DataFrame(wkdayct)
review_by_day['wkday'] = range(7)
review_by_day.columns = ['reviews','wkday']
review_by_day['wkday_name'] = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']

# pass figsize to .plot() directly; a separate plt.figure() would be ignored by pandas
review_by_day.plot(x='wkday_name', y='reviews', kind='bar', figsize=(12,8))

plt.xlabel('Day of Week')
plt.ylabel('Number of Reviews')
plt.title('Number of Reviews by Day of Week for Phoenix, AZ')
Out[10]:
<matplotlib.text.Text at 0x781b3828>
In [11]:
# using the checkins dataset for all businesses, I want to count the checkins by day.
# the syntax used by yelp is "(0-23)-(0-6)" where the first number is the hour and the
# second number is the day of week, so we can just extract the day and sum the counts
checkin_dict = {}
for info in checkin_df['checkin_info']:
    for k, v in info.items():
        match = re.search(r'-(\d)', k)
        day = match.group(1)
        if day not in checkin_dict:
            checkin_dict[day] = v
        else:
            checkin_dict[day] += v
            
sorted_checkin_by_day = [(k, checkin_dict[k]) for k in sorted(checkin_dict.keys())]
         
In [12]:
checkin_by_day = pd.DataFrame(sorted_checkin_by_day)
checkin_by_day.columns = ['wkday','checkins']
checkin_by_day['wkday_name'] = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']

checkin_by_day.plot(x='wkday_name', y='checkins', kind='bar', figsize=(12,8))

plt.xlabel('Day of Week')
plt.ylabel('Number of Check Ins')
plt.title('Number of Check-Ins by Day of Week for Phoenix, AZ')
Out[12]:
<matplotlib.text.Text at 0x782d6208>

And sure enough, we see that there are a lot more checkins during Fri-Sun. So far so good.

I want to use cross-correlation to show the time lag between reviews and checkins. Unfortunately, there wasn't an existing function I liked, so I wrote my own:

In [15]:
def ccf(x, y, lag_max=7):
    """Auto- and cross-correlations of x and y for lags 0..lag_max."""
    lag = abs(lag_max)
    xdemean = x - np.mean(x)
    ydemean = y - np.mean(y)

    # biased (divide-by-n) variances and their geometric mean for normalization
    covx = np.dot(xdemean, xdemean) / len(x)
    covy = np.dot(ydemean, ydemean) / len(y)
    covxy = np.sqrt(covx * covy)

    # lagged products: pair a[i:] with b[:n-i] so each term compares time t+i vs. t
    acfx = np.array([np.dot(xdemean[i:], xdemean[:len(x)-i]) / len(x) / covx
                     for i in range(lag+1)])
    acfy = np.array([np.dot(ydemean[i:], ydemean[:len(y)-i]) / len(y) / covy
                     for i in range(lag+1)])
    acfxy = np.array([np.dot(xdemean[i:], ydemean[:len(y)-i]) / len(x) / covxy
                      for i in range(lag+1)])
    acfyx = np.array([np.dot(ydemean[i:], xdemean[:len(x)-i]) / len(x) / covxy
                      for i in range(lag+1)])

    return np.array([acfx, acfxy, acfyx, acfy])
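Before using ccf, a quick sanity check on made-up numbers: at lag 0 the two acf terms should be exactly 1, and the cross term should match the ordinary Pearson correlation:

# sanity check: lag-0 autocorrelations are 1, lag-0 cross term equals Pearson's r
x = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 5.0])
y = np.array([2.0, 2.0, 4.0, 3.0, 6.0, 5.0, 7.0])
out = ccf(x, y, lag_max=3)
print(out[0][0])  # acf x at lag 0 -> 1.0
print(out[3][0])  # acf y at lag 0 -> 1.0
print(abs(out[1][0] - np.corrcoef(x, y)[0, 1]) < 1e-12)  # True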
In [16]:
checkin_array = np.array(checkin_by_day['checkins'])
review_array = np.array(review_by_day['reviews'])

xyccf = ccf(review_array, checkin_array, 6)
xyccf_df = pd.DataFrame(xyccf.T)
xyccf_df.columns = ['acf x', 'acf x to y', 'acf y to x', 'acf y']
xyccf_df['Lag'] = range(0,7)
In [17]:
# plotting the acf for the 4 series

f, axes = plt.subplots(figsize=(16,12), nrows=2, ncols=2, sharey=True)
titles = ['acf review', 'acf review to checkin', 'acf checkin to review', 'acf checkin']
columns = ['acf x', 'acf x to y', 'acf y to x', 'acf y']
for ax, col, title in zip(axes.ravel(), columns, titles):
    ax.set_xlim(-0.5, len(xyccf_df['Lag']))
    ax.set_ylim(-1, 1)
    ax.vlines(xyccf_df['Lag'], ymin=0, ymax=xyccf_df[col], colors='k', linestyles='solid')
    ax.axhline(0, color='black', lw=2)
    ax.set_xlabel('Lag')
    ax.set_title(title)
    ax.grid(True)

Looking at review to checkin (top right), you can see a positive correlation at lag = 3, i.e. a checkin at time t lines up with a review at time t+3, which supports our theory that it takes a couple of days before people get around to reviewing - so a Friday visit followed by a Monday review is highly probable here. Another interesting thing is that checkin to review peaks at lag 4 or 5, which makes intuitive sense and confirms the cycle of a Monday review, then back to checking in over the weekend.
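Rather than eyeballing the panels, the peak lags can also be read straight off xyccf_df (the row index equals the lag by construction):

# lag at which each cross-correlation series peaks, ignoring lag 0
for col in ['acf x to y', 'acf y to x']:
    peak_lag = xyccf_df[col][1:].idxmax()
    print('%s peaks at lag %d' % (col, peak_lag))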

In [18]:
# next I want to look at reviews by category and see if there is any correlation there.
# since most businesses can belong to multiple categories, first find all the unique tags
cats = []
for i in range(len(biz)):
    for category in biz_df['categories'][i]:
        if category not in cats:
            cats.append(category)
            
cats.sort()
In [20]:
# then I want to count the average reviews per business per weekday
biz_review = biz_df.merge(review_df, on='business_id', how='inner', suffixes=('biz','review'))
biz_review['one'] = 1
br_colnames = biz_review.columns.values.tolist()
biz_review['review_count_by_wkday'] = biz_review.groupby(['wkday', 'business_id'])['one'].transform(np.sum)

bywkday = biz_review.groupby(by='wkday')
avg_review_per_biz_by_day = bywkday['review_count_by_wkday'].agg(np.mean)
In [21]:
print avg_review_per_biz_by_day
wkday
0        22.632979
1        20.913026
2        20.367677
3        18.627591
4        17.447275
5        18.864380
6        22.112312
Name: review_count_by_wkday, dtype: float64

The averages show the same pattern as the aggregate counts.

In [31]:
# counting the reviews for each tag
catrev_byday = {}
for cat in cats:
    catrev_byday[cat] = [0,]*7
    

for i in range(len(biz_review)):
    categories = biz_review['categories'][i]
    day = biz_review['wkday'][i]
    for category in categories:
        catrev_byday[category][day] += 1
        
# keep the more popular tags, dropping tags with fewer than 100 reviews; the median of revsum is about 128
catrev_byday_df = pd.DataFrame(catrev_byday)
revsum = catrev_byday_df.sum(0)
keep = revsum[revsum>100]
print len(keep)
332

In [28]:
catrev_byday_df.head()
Out[28]:
   Accessories  Accountants  Active Life  Acupuncture  Adult  \
0          120           12         1350            9      8   
1          129           19         1259           10      5   
2          113           15         1203           15     11   
3          103           22         1133           12      3   
4          108           17         1033            8      2   

   Adult Education  Adult Entertainment  Advertising  Afghan  African  \
0                2                    9            6      23       33   
1                1                    6            4      26       39   
2                4                    7            6      28       44   
3                2                    4            4      16       40   
4                3                   11            2      20       42   

   Airlines  Airport Shuttles  Airports  Allergists  Amateur Sports Teams  \
0        79                34       201           4                    21   
1        81                19       182          11                    13   
2        63                26       161           5                    21   
3        72                19       177           9                    17   
4        69                26       157           2                    16   

   American (New)  American (Traditional)  Amusement Parks  \
0            5767                    4536               93   
1            5087                    3899               71   
2            5030                    3987               58   
3            4737                    3737               81   
4            4472                    3578               72   

   Anesthesiologists  Animal Shelters   ...  
0                  1                9   ...  
1                  1                7   ...  
2                  0                8   ...  
3                  0               10   ...  
4                  0                8   ...  

[5 rows x 591 columns]

Here you have all the different tags in the columns with their reviews broken down by day of week. Now we transpose it and we are ready for some clustering!
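One note on the distance metric I'm about to use: correlation distance is invariant to scale and offset, so a hugely popular tag and a niche tag with the same weekly shape will land close together. A tiny illustration with made-up numbers:

# correlation distance ignores scale/offset: same weekly shape -> distance ~0
a = np.array([100.0, 90.0, 95.0, 80.0, 75.0, 85.0, 98.0])
b = 10 * a + 500  # same shape, much higher volume
print(pdist(np.vstack([a, b]), metric='correlation'))  # [~0.]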

In [34]:
# hierarchical clustering
toclust = catrev_byday_df.ix[:, keep.keys()].T
# I'm using the correlation distance metric here - the pro is that it demeans and divides
# by the scale, so no normalization is required by the user. it also lets me see the tags
# that are most similar in their pattern of reviews by day; otherwise the clustering
# would be mostly driven by the popular tags
distanceMatrix = pdist(toclust, metric="correlation")
L = linkage(distanceMatrix, method='complete')
cat_cluster = dendrogram(L,
           orientation="left",
           labels=toclust.index,
           color_threshold=0.3,
           leaf_font_size=6)

f = plt.gcf()
f.set_size_inches(24, 30)

There appear to be a few big clusters forming at distances less than 0.3. Since we are using correlation distance with the complete-linkage method of cluster forming, even the two most dissimilar tags within the same cluster are at least 1 - 0.3 = 0.7 correlated! We focus on the few large clusters next.
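We can check that guarantee directly: cut the tree at 0.3 and confirm that the largest pairwise distance inside a cluster stays below 0.3 (fcluster is called again in the next cell):

# complete linkage bounds the max pairwise distance within each cluster
D = squareform(distanceMatrix)  # condensed -> square distance matrix
labels = sch.fcluster(L, 0.3, 'distance')
members = np.where(labels == labels[0])[0]  # tags grouped with the first tag
if len(members) > 1:
    print(D[np.ix_(members, members)].max())  # should be < 0.3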

In [38]:
ind = sch.fcluster(L, 0.3, 'distance')
ind_df = pd.DataFrame(ind)
ind_df.columns = ['groupID']
groupct = ind_df.groupby('groupID').size()
focusgroup = groupct[groupct>10]

toclust['groupID'] = ind
# normalization function: demean each weekday column and divide by its std
def scale(matrix):
    from numpy import mean, std
    return (matrix - mean(matrix, axis=0)) / std(matrix, axis=0)
# same matrix as toclust but normalized (weekday columns only)
clust_norm = scale(toclust.ix[:, :7])
clust_norm['groupID'] = ind
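A quick check that scale behaves the way the group means below assume - every weekday column ends up with mean 0 and std 1, so a group mean like -0.19 reads as "slightly below average review volume for that day":

# every scaled weekday column should have mean ~0 and std ~1 (np.std, ddof=0)
vals = clust_norm[list(range(7))].values
print(np.allclose(vals.mean(axis=0), 0))  # True
print(np.allclose(vals.std(axis=0), 1))   # True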
In [39]:
RealHousewives = toclust[toclust['groupID'] == focusgroup.index[0]]

RHmean = clust_norm[clust_norm['groupID'] == focusgroup.index[0]].groupby('groupID').agg(np.mean)

print RealHousewives
print RHmean
                            0    1    2    3    4    5    6  groupID
Body Shops                 32   39   31   35   28   23   18       34
Dermatologists             15   28   14   25   19    7   11       34
Doctors                   183  200  166  181  166  107   70       34
Education                 106   99   93  105  102   74   70       34
Orthodontists              13   28   18   26   23    6   10       34
Party Supplies             18   24   16   18   16   13    9       34
Pest Control               28   31   27   31   29   17    8       34
Pet Boarding/Pet Sitting   62   73   59   61   61   44   41       34
Pet Services              179  208  181  175  166  151  132       34
Professional Services     103  132  109  126   99   68   61       34
Tanning                    68   86   70   68   68   50   43       34

[11 rows x 8 columns]
                0         1         2         3         4         5         6
groupID                                                                      
34      -0.191756 -0.184798 -0.190867 -0.184984 -0.183466 -0.193442 -0.198332

[1 rows x 7 columns]

This first cluster I'm calling Real Housewives because the tags appear highly related to women, and the group mean across weekdays is pretty evenly spread, suggesting the users in this group are likely housewives who are able to seek out these services on weekdays.

In [41]:
ErrandsAndHealth = toclust[toclust['groupID'] == focusgroup.index[1]]
EHmean = clust_norm[clust_norm['groupID'] == focusgroup.index[1]].groupby('groupID').agg(np.mean)
print ErrandsAndHealth
print EHmean
                                   0    1    2    3    4    5    6  groupID
Arts & Crafts                    145  144  148  136  134  119   93       35
Chiropractors                     49   54   67   61   54   37   21       35
Cosmetic Dentists                 56   78   69   55   49   28   19       35
Dentists                         143  203  197  174  141   85   68       35
Electricians                      16   18   19   17   15   13   11       35
Financial Services               104  109  103   84   94   43   43       35
Fitness & Instruction            423  477  474  422  402  340  332       35
General Dentistry                107  140  140  126   98   51   43       35
Gyms                             260  283  272  248  235  201  188       35
Health & Medical                 621  683  680  643  565  403  306       35
Heating & Air Conditioning/HVAC   46   60   67   58   58   38   26       35
Home Services                    579  570  644  565  515  405  363       35
Juice Bars & Smoothies           186  189  183  179  174  165  141       35
Music & DVDs                      78   73   76   77   69   55   50       35
Optometrists                      55   56   65   65   59   42   36       35
Printing Services                 43   40   49   47   45   32   27       35
Trainers                         174  181  176  147  162  119  109       35

[17 rows x 8 columns]
               0         1         2        3         4         5         6
groupID                                                                    
35      -0.14268 -0.129909 -0.127531 -0.12915 -0.130083 -0.153056 -0.166269

[1 rows x 7 columns]

The 2nd cluster I'm calling Errands and Health because the tags are mostly fitness/health related, plus general errands. Notice the reviews are heavily skewed towards earlier in the week, which suggests many of these activities are performed later in the week. The likely exception is Financial Services, which are typically closed on Sunday and open only a half day on Saturday; their Sat/Sun reviews are most likely from an earlier visit during the week.

In [42]:
ShoppingMall = toclust[toclust['groupID'] == focusgroup.index[2]]
SMmean = clust_norm[clust_norm['groupID'] == focusgroup.index[2]].groupby('groupID').agg(np.mean)

print ShoppingMall
print SMmean
                                   0     1     2     3     4     5     6  \
Active Life                     1350  1259  1203  1133  1033  1043  1183   
Asian Fusion                    1072   990  1016   921   840   884   914   
Chinese                         1790  1663  1661  1572  1484  1542  1624   
Department Stores                378   344   354   327   260   298   341   
Electronics                       89    81    92    78    59    63    78   
Fashion                          964   899   922   856   758   785   851   
Gluten-Free                      494   474   459   432   404   376   436   
Grocery                         1113  1037  1054   953   845   866  1022   
Health Markets                   183   163   172   165   136   145   156   
Massage                          223   189   204   166   151   159   170   
Musical Instruments & Teachers    68    54    60    51    38    44    50   
Professional Sports Teams         18    18    18    14    12     9    15   
Sandwiches                      3435  3408  3429  3134  2947  2895  3122   
Shopping                        3361  3147  3238  3000  2717  2729  2871   

                                groupID  
Active Life                          59  
Asian Fusion                         59  
Chinese                              59  
Department Stores                    59  
Electronics                          59  
Fashion                              59  
Gluten-Free                          59  
Grocery                              59  
Health Markets                       59  
Massage                              59  
Musical Instruments & Teachers       59  
Professional Sports Teams            59  
Sandwiches                           59  
Shopping                             59  

[14 rows x 8 columns]
                0         1         2         3        4        5         6
groupID                                                                    
59       0.246342  0.256018  0.256014  0.259176  0.23212  0.22286  0.217565

[1 rows x 7 columns]

The 3rd cluster I'm calling Shopping Mall because the tags seem associated with things you get done at a shopping mall/strip mall over the weekend. This group also shows low review counts during Fri-Sun and high counts during Mon-Wed.

In [43]:
GuysNightOut = toclust[toclust['groupID'] == focusgroup.index[3]]
GNmean = clust_norm[clust_norm['groupID'] == focusgroup.index[3]].groupby('groupID').agg(np.mean)

print GuysNightOut
print GNmean
                            0      1      2      3      4      5      6  \
American (New)           5767   5087   5030   4737   4472   4907   5588   
American (Traditional)   4536   3899   3987   3737   3578   3745   4351   
Bars                     6600   5743   5871   5400   5058   5118   6052   
Brazilian                  58     48     45     41     41     42     54   
Breweries                 590    496    530    473    479    478    566   
Buffets                   695    640    669    604    592    620    662   
Burgers                  2319   2135   2159   2005   1896   2006   2160   
Cocktail Bars              54     32     40     32     26     23     43   
Irish                     308    260    261    251    250    253    282   
Italian                  3833   3521   3683   3351   3306   3390   3640   
Lounges                  1158   1018    957    861    816    880   1061   
Nightlife                7287   6353   6446   5903   5542   5701   6706   
Pizza                    3734   3558   3640   3331   3332   3387   3591   
Restaurants             36620  33783  34303  31420  30763  31883  34946   
Sports Bars              1109    931   1024    932    845    846   1019   
Steakhouses              1454   1158   1180   1085   1096   1134   1373   
Venues & Event Spaces     295    262    241    215    193    204    298   
Wine Bars                1064    951    972    895    844    842    993   

                        groupID  
American (New)               62  
American (Traditional)       62  
Bars                         62  
Brazilian                    62  
Breweries                    62  
Buffets                      62  
Burgers                      62  
Cocktail Bars                62  
Irish                        62  
Italian                      62  
Lounges                      62  
Nightlife                    62  
Pizza                        62  
Restaurants                  62  
Sports Bars                  62  
Steakhouses                  62  
Venues & Event Spaces        62  
Wine Bars                    62  

[18 rows x 8 columns]
                0         1         2         3         4         5         6
groupID                                                                      
62       1.728998  1.686308  1.690904  1.698129  1.686027  1.685013  1.722027

[1 rows x 7 columns]

The 4th cluster I'm calling Guys Night Out. The tags show a strong affinity towards male interests - bars, breweries, burgers, pizza, sports, etc. These are likely happy hours and after-work hangouts between a couple of guys, with or without their significant others. They don't exhibit the same skew towards early week as the earlier clusters and are mostly evenly spread, which is what led me to think it's general after-work drinking/dinner and catching a game on TV.

In [44]:
DayOutAboutTown = toclust[toclust['groupID'] == focusgroup.index[4]]
DOmean = clust_norm[clust_norm['groupID'] == focusgroup.index[4]].groupby('groupID').agg(np.mean)

print DayOutAboutTown
print DOmean
                              0     1     2     3     4     5     6  groupID
Arts & Entertainment       2068  1733  1602  1468  1437  1642  1915       72
Comedy Clubs                 64    55    45    44    38    52    60       72
Dance Clubs                 185   171   160   149   123   156   177       72
Event Planning & Services  1705  1573  1408  1255  1168  1199  1518       72
Golf                        163   145   123   140   116   122   156       72
Hawaiian                    306   283   251   238   239   245   265       72
Hotels                     1260  1161  1009   895   844   856  1147       72
Hotels & Travel            1810  1683  1465  1372  1268  1277  1605       72
Museums                     217   195   152   132   134   160   182       72
Music Venues                683   591   568   547   482   569   589       72
Parks                       325   294   244   251   209   251   302       72
Stadiums & Arenas           220   186   167   135   116   126   179       72
Tours                        91    82    76    72    59    82    89       72
Women's Clothing            298   289   253   276   242   260   271       72

[14 rows x 8 columns]
                0        1         2         3         4         5         6
groupID                                                                     
72       0.079577  0.06993  0.035456  0.038288  0.029711  0.041712  0.068698

[1 rows x 7 columns]

The 5th cluster I'm calling Day Out About Town. These tags actually look a lot like things a couple would do together: going to a comedy club on Tuesday night, dancing on Friday, going to a game on the weekend. There are also several tourist-related keywords - most likely because tourists aren't restricted to weekends and do a lot of sightseeing during the middle of the week. Looking at the group mean, it looks like these activities take place during Wed.-Fri. and the reviews happen Sun.-Tue.

Conclusion:

Studying Yelp's reviews and checkins by day of the week certainly suggests a 2-3 day lag between when people actually enjoy an experience and when the review about it gets posted. This seems largely related to the work week / weekend behavior of different types of individuals. It would suggest sending out review reminders around 48 hours after a checkin, which might match the natural cycle of consumers getting ready to review and so increase review conversion.

Some other things we could look at are the lat/long of the top 5 clusters to see if there is a pattern. I would imagine the shopping mall cluster will show several dense clusters, day out about town might be somewhat close to downtown locations, and the real housewives cluster's lat/long would be in close proximity to the users' own neighborhoods, or within 25 miles of where they live.

