Yelp Dataset Challenge
loading in datasets
import sys
import re
import os
import shutil
import commands
import pandas as pd
import numpy as np
import json
import matplotlib
import matplotlib.pyplot as plt
from datetime import datetime
import statsmodels.api as sm
from pandas.tools.plotting import autocorrelation_plot
import itertools
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, dendrogram
import scipy.cluster.hierarchy as sch
%matplotlib inline
filename = r"C:\Users\jchao\Desktop\yelp_phoenix_academic_dataset\yelp_academic_dataset_review.json"
f = open(filename,'rU')
review = [json.loads(line) for line in f]
f.close()
review_df = pd.DataFrame(review)
review_colnames = review_df.columns.values.tolist()
print review_colnames
print review_df.head(1)
filename1 = r"C:\Users\jchao\Desktop\yelp_phoenix_academic_dataset\yelp_academic_dataset_business.json"
f1 = open(filename1,'rU')
biz = [json.loads(line) for line in f1]
f1.close()
biz_df = pd.DataFrame(biz)
biz_colnames = biz_df.columns.values.tolist()
print biz_colnames
filename3 = r"C:\Users\jchao\Desktop\yelp_phoenix_academic_dataset\yelp_academic_dataset_checkin.json"
f3 = open(filename3,'rU')
checkin = [json.loads(line) for line in f3]
f3.close()
checkin_df = pd.DataFrame(checkin)
checkin_colnames = checkin_df.columns.values.tolist()
print checkin_colnames
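The three loading cells above repeat the same open/parse/close pattern. A minimal sketch of a shared helper is below, assuming Python 3 and one JSON object per line; `load_json_lines` is a hypothetical name, not something from the original notebook, and the demo records are made up.

```python
import json
import os
import tempfile


def load_json_lines(path):
    """Parse a file holding one JSON object per line into a list of dicts."""
    records = []
    # The with-block closes the file even if a line fails to parse.
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


# Demo on a throwaway file with made-up records.
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", delete=False, encoding="utf-8"
) as tmp:
    tmp.write('{"business_id": "a", "stars": 4}\n')
    tmp.write('{"business_id": "b", "stars": 5}\n')
    demo_path = tmp.name

demo_records = load_json_lines(demo_path)
os.remove(demo_path)
print(demo_records[0]["stars"])
```

The resulting list can be passed to `pd.DataFrame` exactly as the cells above do.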
first convert dates into day of week
review_df['date'] = pd.to_datetime(review_df['date'])
print review_df.dtypes
review_df['wkday'] = [datetime.weekday(d) for d in review_df['date']]
review_df.head(1)
bywkday = review_df.groupby(by='wkday')
bywkday['stars'].mean()
looking at the avg stars people give by day, the difference between days of the week appears to be small
wkdayct = bywkday.size()
print wkdayct
0 is Monday and 6 is Sunday. There is a pretty clear pattern: people review a lot more from Sunday through Tuesday than during the rest of the week. I posit this is because there is a lag between people's visit and their review, i.e. checking out restaurants during the weekend and then reviewing at work on Monday (oops!)
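To make the Monday = 0 convention concrete, here is a small standard-library sketch (Python 3) that maps dates to weekday numbers and counts them. The dates are made up for illustration and `count_by_weekday` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter
from datetime import datetime


def count_by_weekday(date_strings):
    """Count dates per weekday, where 0 is Monday and 6 is Sunday."""
    counts = Counter()
    for text in date_strings:
        day = datetime.strptime(text, "%Y-%m-%d").weekday()
        counts[day] += 1
    # Return a dense list so weekdays with no dates show up as 0.
    return [counts.get(day, 0) for day in range(7)]


# Made-up review dates: 2014-01-06 was a Monday, 2014-01-12 a Sunday.
sample_dates = ["2014-01-06", "2014-01-06", "2014-01-07", "2014-01-12"]
weekday_counts = count_by_weekday(sample_dates)
print(weekday_counts)
```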
review_by_day = pd.DataFrame(wkdayct)
review_by_day['wkday'] = range(7)
review_by_day.columns = ['reviews','wkday']
review_by_day['wkday_name'] = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']
f=plt.figure(figsize=(12,8))
review_by_day.plot(x='wkday_name', y='reviews',kind='bar')
plt.xlabel('Day of Week')
plt.ylabel('Number of Reviews')
plt.title('Number of Reviews by Day of Week for Phoenix, AZ')
# using the checkins dataset for all businesses, i want to count the checkins by day
# the syntax used by yelp is (0-23)-(0-6) where the first number is the hour and the second number
# is the day of week, so in this case we can just take the 2nd number and add up the counts
checkin_dict = {}
for info in checkin_df['checkin_info']:
    for k, v in info.items():
        match = re.search(r'-(\d)', k)
        day = match.group(1)
        if day not in checkin_dict:
            checkin_dict[day] = v
        else:
            checkin_dict[day] += v
sorted_checkin_by_day = [(k, checkin_dict[k]) for k in sorted(checkin_dict.keys())]
checkin_by_day = pd.DataFrame(sorted_checkin_by_day)
checkin_by_day.columns = ['wkday','checkins']
checkin_by_day['wkday_name'] = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']
f=plt.figure(figsize=(12,8))
checkin_by_day.plot(x='wkday_name', y='checkins',kind='bar')
plt.xlabel('Day of Week')
plt.ylabel('Number of Check Ins')
plt.title('Number of Check-Ins by Day of Week for Phoenix, AZ')
and sure enough, we see that there are a lot more checkins during Fri-Sun. So far so good.
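The check-in keys can also be parsed without a regex by splitting on the hyphen. The sketch below (Python 3) assumes each key has the hour-day form described in the comments above; the sample dictionaries are made up, and `checkins_by_day` is a hypothetical helper.

```python
from collections import Counter


def checkins_by_day(checkin_infos):
    """Sum check-in counts per day code from dicts keyed like '9-5' (hour-day)."""
    totals = Counter()
    for info in checkin_infos:
        for key, count in info.items():
            hour, day = key.split("-")
            totals[int(day)] += count
    # Dense list indexed by day code 0..6.
    return [totals.get(day, 0) for day in range(7)]


# Two made-up businesses.
sample_infos = [
    {"9-5": 1, "17-5": 3, "12-0": 2},
    {"23-6": 4, "0-0": 1},
]
day_totals = checkins_by_day(sample_infos)
print(day_totals)
```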
i want to use auto- and cross-correlation to show the time lag between reviews and checkins. unfortunately i couldn't find a function i liked, so i wrote my own:
def ccf(x, y, lag_max=7):
    # auto- and cross-correlations of x and y for lags 0..lag_max
    lag = abs(lag_max)
    xbar = np.mean(x)
    ybar = np.mean(y)
    xdemean = x - xbar
    ydemean = y - ybar
    covx = np.dot(xdemean.T,xdemean) / len(x)
    covy = np.dot(ydemean.T,ydemean) / len(y)
    covxy = np.sqrt(covx*covy)
    # np.roll(v,-i)[:n-i] is v[i:], so each term pairs the first series at t+i with the second at t
    acfx = np.array([np.dot(np.roll(xdemean,-i)[:len(x)-i].T, xdemean[:len(x)-i]) / len(x) /covx
                     for i in range(0,lag+1)])
    acfy = np.array([np.dot(np.roll(ydemean,-i)[:len(y)-i].T, ydemean[:len(y)-i]) / len(y) /covy
                     for i in range(0,lag+1)])
    acfxy = np.array([np.dot(np.roll(xdemean,-i)[:len(x)-i].T, ydemean[:len(y)-i]) / len(x) /covxy
                      for i in range(0,lag+1)])
    acfyx = np.array([np.dot(np.roll(ydemean,-i)[:len(y)-i].T, xdemean[:len(x)-i]) / len(x) /covxy
                      for i in range(0,lag+1)])
    return np.array([acfx, acfxy, acfyx, acfy])
checkin_array = np.array(checkin_by_day['checkins'])
review_array = np.array(review_by_day['reviews'])
xyccf = ccf(review_array, checkin_array, 6)
xyccf_df = pd.DataFrame(xyccf.T)
xyccf_df.columns = ['acf x', 'acf x to y', 'acf y to x', 'acf y']
xyccf_df['Lag'] = range(0,7)
# plotting the acf for the 4 series
f, ((ax1, ax2), (ax3, ax4)) = plt.subplots(figsize=(16,12), nrows=2, ncols=2, sharey=True)
ax1.set_xlim(-0.5,len(xyccf_df['Lag']))
ax1.set_ylim(-1,1)
ax1.vlines(xyccf_df['Lag'], ymin=[0], ymax=xyccf_df['acf x'], colors='k', linestyles='solid')
ax1.axhline(0, color='black', lw=2)
ax1.set_xlabel('Lag')
ax1.set_title('acf review')
ax1.grid(True)
ax2.set_xlim(-0.5,len(xyccf_df['Lag']))
ax2.vlines(xyccf_df['Lag'], ymin=[0], ymax=xyccf_df['acf x to y'], colors='k', linestyles='solid')
ax2.axhline(0, color='black', lw=2)
ax2.set_xlabel('Lag')
ax2.set_title('acf review to checkin')
ax2.grid(True)
ax3.set_xlim(-0.5,len(xyccf_df['Lag']))
ax3.vlines(xyccf_df['Lag'], ymin=[0], ymax=xyccf_df['acf y to x'], colors='k', linestyles='solid')
ax3.axhline(0, color='black', lw=2)
ax3.set_xlabel('Lag')
ax3.set_title('acf checkin to review')
ax3.grid(True)
ax4.set_xlim(-0.5,len(xyccf_df['Lag']))
ax4.vlines(xyccf_df['Lag'], ymin=[0], ymax=xyccf_df['acf y'], colors='k', linestyles='solid')
ax4.axhline(0, color='black', lw=2)
ax4.set_xlabel('Lag')
ax4.set_title('acf checkin')
ax4.grid(True)
Looking at the review-to-checkin plot (top right), you can see a positive correlation at lag = 3, i.e. check-ins at time t compared with reviews at time t+3. This supports our theory that it takes a couple of days before people get around to reviewing, so a Friday visit followed by a Monday review is highly probable here. Another interesting thing is that the checkin-to-review correlation peaks at lag 4 or lag 5, which makes intuitive sense and confirms the cycle: review on Monday, then back to checking in on the weekend.
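As a sanity check on how to read the lags, the sketch below (Python 3, standard library only) applies the same estimator as the 'x to y' series of `ccf` to two made-up series in which the review spike trails the check-in spike by exactly three steps. `cross_corr` is a hypothetical helper written for this illustration.

```python
import math


def cross_corr(x, y, lag):
    """Correlation between x at time t+lag and y at time t (biased estimator)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    x_dev = [v - x_mean for v in x]
    y_dev = [v - y_mean for v in y]
    var_x = sum(d * d for d in x_dev) / n
    var_y = sum(d * d for d in y_dev) / n
    # Only the n - lag overlapping pairs contribute, but we still divide by n.
    cov = sum(x_dev[t + lag] * y_dev[t] for t in range(n - lag)) / n
    return cov / math.sqrt(var_x * var_y)


# Made-up weekly series: check-ins spike at t=2, reviews spike at t=5.
checkins = [0, 0, 1, 0, 0, 0, 0]
reviews = [0, 0, 0, 0, 0, 1, 0]

by_lag = [cross_corr(reviews, checkins, lag) for lag in range(7)]
best_lag = max(range(7), key=lambda lag: by_lag[lag])
print(best_lag)
```

Because the estimator divides by the full series length even when few pairs overlap, values at long lags are damped towards zero, which matters with only seven points per series.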
# next i want to look at reviews by category and see if any correlation there
# since most businesses can belong to multiple categories, first find all the unique tags
cats = []
for i in range(len(biz)):
    for category in biz_df['categories'][i]:
        if category not in cats:
            cats.append(category)
cats.sort()
# then i want to count the average reviews per business per week day
biz_review = biz_df.merge(review_df, on='business_id', how='inner', suffixes=('biz','review'))
biz_review['one'] = 1
br_colnames = biz_review.columns.values.tolist()
biz_review['review_count_by_wkday'] = biz_review.groupby(['wkday', 'business_id'])['one'].transform(np.sum)
bywkday = biz_review.groupby(by='wkday')
avg_review_per_biz_by_day = bywkday['review_count_by_wkday'].agg(np.mean)
print avg_review_per_biz_by_day
averages also show the same pattern as the aggregate
# counting the reviews for each tag
catrev_byday = {}
for cat in cats:
    catrev_byday[cat] = [0,]*7
for i in range(len(biz_review)):
    categories = biz_review['categories'][i]
    day = biz_review['wkday'][i]
    for category in categories:
        catrev_byday[category][day] += 1
# keeping tags that are more popular and dropping tags with 100 reviews or fewer. median of revsum is about 128
catrev_byday_df = pd.DataFrame(catrev_byday)
revsum = catrev_byday_df.sum(0)
keep = revsum[revsum>100]
print len(keep)
catrev_byday_df.head()
here you have all the different tags on the columns and their reviews broken down by the day of week. now we transpose it and we are ready for some clustering!
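The tag-by-weekday table can also be sketched without pandas as a dictionary of seven-element lists. The rows below are made up, and `tag_counts_by_day` is a hypothetical helper (Python 3), not the notebook's code.

```python
from collections import defaultdict


def tag_counts_by_day(rows):
    """rows: iterable of (categories, weekday) pairs, weekday in 0..6."""
    counts = defaultdict(lambda: [0] * 7)
    for categories, weekday in rows:
        # A review counts once for every tag its business carries.
        for tag in categories:
            counts[tag][weekday] += 1
    return dict(counts)


# Made-up reviews: each has the business's tags and the review weekday.
sample_rows = [
    (["Bars", "Pizza"], 0),
    (["Pizza"], 0),
    (["Bars"], 4),
]
tag_table = tag_counts_by_day(sample_rows)
print(tag_table["Pizza"])
```

Each list is one tag's weekday profile, which is the row shape the clustering step needs after the transpose.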
# hierarchical clustering
toclust = catrev_byday_df.ix[:,keep.keys()].T
# i'm using the correlation distance metric here - the pro is that it will demean and divide by the size (norm), therefore no
# normalization is required by the user. also, it will allow me to see the tags that are most similar in the pattern of review by day,
# otherwise the clustering might be mostly driven by the popular tags
distanceMatrix = pdist(toclust, metric="correlation")
L = linkage(distanceMatrix, method='complete')
cat_cluster = dendrogram(L,
                         orientation="left",
                         labels=toclust.index,
                         color_threshold=0.3,
                         leaf_font_size=6)
f = plt.gcf()
f.set_size_inches(24, 30)
There appear to be a few big clusters forming at a distance of less than 0.3. Since we are using correlation distance with the complete linkage method, that means even the 2 least similar tags within the same cluster are at least 1 - 0.3 = 0.7 correlated! Next we focus on the few large ones.
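To illustrate why correlation distance ignores overall volume, the sketch below (Python 3, standard library only) computes 1 minus the Pearson correlation for made-up weekday profiles. A tag with ten times the reviews but the same shape sits at distance 0, while a mirrored shape sits at distance 2. `correlation_distance` is a hypothetical helper; the notebook itself uses SciPy's `pdist` with `metric="correlation"`.

```python
import math


def correlation_distance(u, v):
    """1 - Pearson correlation between two equal-length sequences."""
    n = len(u)
    u_mean = sum(u) / n
    v_mean = sum(v) / n
    u_dev = [a - u_mean for a in u]
    v_dev = [b - v_mean for b in v]
    dot = sum(a * b for a, b in zip(u_dev, v_dev))
    norm = math.sqrt(sum(a * a for a in u_dev) * sum(b * b for b in v_dev))
    return 1.0 - dot / norm


# Made-up review counts, Monday through Sunday.
small_tag = [9, 8, 6, 5, 4, 4, 7]
big_tag = [count * 10 for count in small_tag]  # same shape, 10x volume
mirrored = [max(small_tag) + min(small_tag) - c for c in small_tag]  # flipped shape

same_shape_distance = correlation_distance(small_tag, big_tag)
mirrored_distance = correlation_distance(small_tag, mirrored)
print(round(same_shape_distance, 6), round(mirrored_distance, 6))
```

With complete linkage, a cluster cut at 0.3 requires every pair of its tags to be closer than 0.3 under this distance.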
ind = sch.fcluster(L, 0.3, 'distance')
ind_df = pd.DataFrame(ind)
ind_df.columns = ['groupID']
groupct = ind_df.groupby('groupID').size()
focusgroup = groupct[groupct>10]
toclust['groupID'] = ind
# normalization function
def scale(matrix):
    from numpy import mean, std
    return (matrix - mean(matrix, axis=0)) / std(matrix, axis=0)
# same matrix as toclust but normalized
clust_norm = toclust.ix[:,:7]
clust_norm = scale(clust_norm)
clust_norm['groupID'] = ind
RealHousewives = toclust[toclust['groupID'] == focusgroup.index[0]]
RHmean = clust_norm[clust_norm['groupID'] == focusgroup.index[0]].groupby('groupID').agg(np.mean)
print RealHousewives
print RHmean
This first cluster I'm calling Real Housewives because the tags appear highly related to women. Looking at the group mean across weekdays, reviews are pretty evenly spread, suggesting that the users in this group are likely housewives who are able to seek these services during weekdays.
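The group means printed above come from `scale`, which z-scores each weekday column across tags. A minimal standard-library sketch of that column-wise scaling (Python 3) follows; the matrix is made up and `scale_columns` is a hypothetical stand-in. It uses the population standard deviation, which matches the default of NumPy's `std`.

```python
import statistics


def scale_columns(rows):
    """Z-score each column: subtract the column mean, divide by its population std."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        [(value - means[j]) / stds[j] for j, value in enumerate(row)]
        for row in rows
    ]


# Made-up counts: 3 tags (rows) by 2 weekdays (columns).
sample = [
    [1, 10],
    [2, 20],
    [3, 60],
]
scaled = scale_columns(sample)
print([round(row[0], 4) for row in scaled])
```

After scaling, a positive value means a tag gets more reviews on that weekday than the average tag does, so a flat row of group means reads as an even spread across the week.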
ErrandsAndHealth = toclust[toclust['groupID'] == focusgroup.index[1]]
EHmean = clust_norm[clust_norm['groupID'] == focusgroup.index[1]].groupby('groupID').agg(np.mean)
print ErrandsAndHealth
print EHmean
The 2nd cluster I'm calling Errands and Health because the tags are mostly related to fitness/health, plus general errands. Notice the reviews are heavily skewed towards earlier in the week, which also suggests that many of these activities are performed later in the week. The exception is likely Financial Services, which are typically not open on Sunday and open only half a day on Sat. The reviews for Financial Services on Sat/Sun are most likely from an earlier visit during the week.
ShoppingMall = toclust[toclust['groupID'] == focusgroup.index[2]]
SMmean = clust_norm[clust_norm['groupID'] == focusgroup.index[2]].groupby('groupID').agg(np.mean)
print ShoppingMall
print SMmean
The 3rd cluster I'm calling Shopping Mall because the tags seem associated with things you get done at a shopping mall/strip mall over the weekend. This group also shows low review counts during Fri-Sun and high review counts during Mon-Weds.
GuysNightOut = toclust[toclust['groupID'] == focusgroup.index[3]]
GNmean = clust_norm[clust_norm['groupID'] == focusgroup.index[3]].groupby('groupID').agg(np.mean)
print GuysNightOut
print GNmean
The 4th cluster I'm calling Guys Night Out. The tags show a strong affinity towards male interests - bars, breweries, burgers, pizza, sports etc. These are likely happy hours and after-work hangouts between a couple of guys, with or without their significant others. They don't exhibit the same skew towards early week as the earlier clusters and are mostly pretty evenly spread, which is what led me to think it's general after-work drinking/dinner and catching a game on TV.
DayOutAboutTown = toclust[toclust['groupID'] == focusgroup.index[4]]
DOmean = clust_norm[clust_norm['groupID'] == focusgroup.index[4]].groupby('groupID').agg(np.mean)
print DayOutAboutTown
print DOmean
The 5th cluster I'm calling Day Out about Town. These tags actually look a lot like things a couple would do together: going to a comedy club on Tues. night, dancing on Friday, going to a game on the weekend. There are also several tourist-related keywords, most likely because tourists aren't restricted to weekends and do a lot of sightseeing during the middle of the week. Looking at the group mean, it looks like these activities take place during Weds.-Fri. and reviews happen Sun.-Tues.
conclusion:
Studying yelp's reviews and checkins by day of the week certainly suggests there is a 2-3 day lag between when people actually enjoy an experience and when they post a review about it. This seems largely related to the work week / weekend behavior of different types of individuals. It would seem to suggest sending out review reminders around 48 hours after a checkin, which might match the natural cycle of consumers getting ready to review and so increase review conversion.
Some other stuff we could look at is the lat/long of the top 5 clusters, to see if there is a pattern. I would imagine the shopping mall cluster will show several dense clusters, day about town might be somewhat close to downtown locations, and the real housewives cluster's lat/long will be in close proximity to the users' own neighborhoods, or within 25 miles of where they live.