Previously, we explored a large dataset of roughly 60,000 anonymized OKCupid user profiles from 2012.
You might also want to check out Gender Classification with OKCupid Data, which uses the same dataset.
In this article, we apply K-Means clustering, a machine learning algorithm, to group our OKCupid users into dating pools and narrow down their potential matches.
Clustering is a technique that groups similar data points together. We will use it to group similar people together and recommend whom you should date: the people in your cluster are the ones we will recommend.
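To make the idea concrete before we touch the real data, here is a minimal sketch on made-up toy points (illustrative only, not our OKCupid data):
# K-Means groups the six toy points below into two clusters by proximity.
import numpy as np
from sklearn.cluster import KMeans
toy_points = np.array([[1, 1], [1, 2], [2, 1],    # one tight group
                       [8, 8], [8, 9], [9, 8]])   # another tight group
toy_model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_points)
print(toy_model.labels_)  # e.g. [1 1 1 0 0 0] -- same label = same group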
We start by reviewing the features of our users. Which of these features would you want your date to have in common with you?
expanded_df_backup.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 59811 entries, 0 to 59945
Data columns (total 48 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   age                 59811 non-null  int64
 1   body_type           54538 non-null  object
 2   diet                35481 non-null  object
 3   drinks              56846 non-null  object
 4   drugs               45755 non-null  object
 5   education           53211 non-null  object
 6   essay0              54351 non-null  object
 7   essay1              52275 non-null  object
 8   essay2              50210 non-null  object
 9   essay3              48375 non-null  object
 10  essay4              49318 non-null  object
 11  essay5              49010 non-null  object
 12  essay6              46085 non-null  object
 13  essay7              47404 non-null  object
 14  essay8              40636 non-null  object
 15  essay9              47245 non-null  object
 16  ethnicity           54159 non-null  object
 17  height              59811 non-null  float64
 18  income              11456 non-null  float64
 19  job                 51641 non-null  object
 20  last_online         59811 non-null  object
 21  location            59811 non-null  object
 22  offspring           24334 non-null  object
 23  orientation         59811 non-null  object
 24  pets                39931 non-null  object
 25  religion            39631 non-null  object
 26  sex                 59811 non-null  object
 27  sign                48787 non-null  object
 28  smokes              54320 non-null  object
 29  speaks              59761 non-null  object
 30  status              59811 non-null  object
 31  diet_adherence      35481 non-null  object
 32  diet_type           35481 non-null  object
 33  city                59811 non-null  object
 34  state/country       59811 non-null  object
 35  offspring_want      24334 non-null  object
 36  offspring_attitude  9711 non-null   object
 37  religion_type       39631 non-null  object
 38  religion_attitude   39631 non-null  object
 39  sign_type           48787 non-null  object
 40  sign_attitude       48787 non-null  object
 41  dog_preference      28880 non-null  object
 42  cat_preference      21293 non-null  object
 43  has_dogs            39931 non-null  float64
 44  has_cats            39931 non-null  float64
 45  num_ethnicities     54159 non-null  float64
 46  optional_%unfilled  59811 non-null  float64
 47  num_languages       59811 non-null  int64
dtypes: float64(6), int64(2), object(40)
memory usage: 22.4+ MB
After asking someone which features they would want a date to have in common with them, the chosen features are: drugs, orientation, diet_adherence, diet_type, religion_attitude, dog_preference, cat_preference, has_dogs, and has_cats. Let's isolate that subset of features.
clustering_df = sparse_essay_df.copy()

# Categorical features chosen above; no purely numeric features this time.
cat_selection = ['drugs', 'orientation', 'diet_adherence', 'diet_type',
                 'religion_attitude', 'dog_preference', 'cat_preference',
                 'has_dogs', 'has_cats']
numeric_selection = []

# Expand the feature names into the matching one-hot column names.
feature_selection = feature_selection_to_list(clustering_df, cat_selection, numeric_selection)
clustering_df = clustering_df[feature_selection]
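feature_selection_to_list is a helper defined earlier in the project. For readers jumping in here, a plausible sketch of what it does, assuming the one-hot columns follow a '<feature>_<category>' naming pattern (this is an assumption for illustration, not the project's actual code):
def feature_selection_to_list(df, cat_features, numeric_features):
    # Collect every one-hot column derived from each categorical feature,
    # then append the numeric columns unchanged.
    selected = []
    for feature in cat_features:
        selected += [col for col in df.columns if col.startswith(feature + '_')]
    return selected + list(numeric_features)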
# Convert to a SciPy CSR matrix, a sparse format KMeans can consume efficiently.
clustering_coo = clustering_df.sparse.to_coo()
clustering_csr = clustering_coo.tocsr()
clustering_csr.shape
(59811, 37)
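As an optional sanity check, we can confirm how sparse this matrix is; since each user activates at most nine of the 37 one-hot columns, the density should be well under half:
density = clustering_csr.nnz / (clustering_csr.shape[0] * clustering_csr.shape[1])
print("Non-zero entries: {}, density: {:.1%}".format(clustering_csr.nnz, density))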
Now that we have our subset, let's search for the best number of clusters, k, to use in our model.
# Recommended setting for training KMeans on Windows
# (avoids a known scikit-learn/MKL memory-leak warning).
import os
os.environ["OMP_NUM_THREADS"] = "1"

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

execute = False  # flip to True to re-run the (very slow) elbow search
if execute:
    num_clusters = list(range(1, 200))
    inertias = []
    for i in num_clusters:
        model = KMeans(n_clusters=i)
        model.fit(clustering_csr)
        inertias.append(model.inertia_)

    plt.plot(num_clusters, inertias, '-o')
    plt.xlabel('Number of Clusters (k)')
    plt.ylabel('Inertia')
    plt.show()
The inertia vs. k graph below took 5 hours to produce (hence the execute = False default above, which skips the code snippet). Inertia is a metric that measures how spread out the points of each cluster are around its centroid: it is the sum of squared distances from each point to the centroid of its assigned cluster. A line has been drawn over the tail end of the graph to clearly mark where the curve becomes linear. The point where the curve straightens out is known as the elbow point, and it indicates the number of clusters we should use.
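To make that definition concrete, here is a small sketch (illustrative, not part of the original analysis) that recomputes inertia by hand for any fitted model on dense data:
import numpy as np

def manual_inertia(X, fitted_model):
    # Centroid assigned to each point, then the total squared distance to it.
    assigned_centers = fitted_model.cluster_centers_[fitted_model.labels_]
    return np.sum((X - assigned_centers) ** 2)

# For dense X, manual_inertia(X, fitted_model) matches fitted_model.inertia_
# up to floating-point error.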
k = 100 is where the linearity begins, so let's create the clustering model with that value.
# Fit the final model and assign each of the 59,811 users to a cluster.
model = KMeans(n_clusters=100)
model.fit(clustering_csr)
labels = model.predict(clustering_csr)
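A practical aside: since this fit is expensive, it may be worth caching the trained model to disk. A minimal sketch with joblib (installed alongside scikit-learn); the filename here is just an example:
from joblib import dump, load

dump(model, 'kmeans_k100.joblib')      # save right after the expensive fit
# model = load('kmeans_k100.joblib')   # reload in a later session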
Let’s investigate the uniformity of our cluster distribution.
ser_labels = pd.Series(labels)
ser_labels_props = ser_labels.value_counts()

plt.figure(figsize=(16, 9))
plt.pie(x=ser_labels_props.values, labels=ser_labels_props.index,
        wedgeprops=dict(width=0.10,
                        edgecolor="k",
                        linewidth=0.7))
plt.text(0, 0, 'Cluster Distribution',
         horizontalalignment='center',
         verticalalignment='center',
         fontsize=20)
plt.show()
The users are spread across all of the clusters, though not perfectly evenly.
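If you prefer numbers to a donut chart, a quick way to summarize the spread of cluster sizes:
# Summary statistics of cluster sizes (one row per non-empty cluster).
print(ser_labels_props.describe())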
The code below produces an interface that lets you enter your own information and, when you press Run Interact, outputs the number of the cluster you belong to, the number of people in that cluster, and a random user profile from the same cluster. People in the same cluster are similar to each other.
Note that, as an experiment, null values have been included as their own category, so the model also groups together people who leave similar fields blank. A value of 'nan', 'None', or 'No Prefix' means that the user did not share information for that particular field.
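The encoding that produced these "missing" categories was done earlier in the project; for illustration, pandas can create such a column with dummy_na=True (this sketch is a stand-in, not the project's actual preprocessing code):
import pandas as pd

# Missing values get their own one-hot column, so two users who both left
# 'drugs' blank end up looking similar to the model.
demo = pd.Series(['often', None, 'never'], name='drugs')
print(pd.get_dummies(demo, prefix='drugs', dummy_na=True))
# Columns: drugs_never, drugs_often, drugs_nan
# (values print as 0/1 or True/False depending on your pandas version)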
The actual interface is not available here because it requires an active Python kernel to run. However, you can still see the interface in action via a GIF image preview. If you would like to try the interface yourself, open and run 'Report_stable.ipynb' from this project's GitHub repository.
from ipywidgets import interact_manual
import random

def make_profile(sex, drugs, orientation, diet_adherence, diet_type, religion_attitude,
                 dog_preference, cat_preference, has_dogs, has_cats, show_same_gender):
    user_info = [drugs, orientation, diet_adherence, diet_type, religion_attitude,
                 dog_preference, cat_preference, has_dogs, has_cats]

    # Build a one-row DataFrame with the same one-hot columns as clustering_df.
    user_dict = {}
    for each in clustering_df.columns.to_list():
        user_dict[each] = [0]
    for each in user_info:
        user_dict[each] = [1]
    user_df = pd.DataFrame(data=user_dict, index=['You'])

    # Convert to the same sparse CSR format the model was trained on.
    sparse_user_df = user_df.copy()
    for each in sparse_user_df.columns.to_list():
        sparse_user_df[each] = pd.arrays.SparseArray(sparse_user_df[each].values, dtype='uint8')
    sparse_user_df = sparse_user_df.sparse.to_coo()
    sparse_user_df = sparse_user_df.tocsr()

    # predict returns an array; take its single element for clean printing.
    user_cluster_label = model.predict(sparse_user_df)[0]
    user_cluster = np.where(labels == user_cluster_label)[0]
    print("Your cluster is {}.\nThere are {} users in your cluster.".format(
        user_cluster_label, len(user_cluster)))

    if not show_same_gender:
        user_sex_dict = {'Male': 'm', 'Female': 'f'}
        user_sex = user_sex_dict[sex]
        # Keep sampling until we draw a user of the opposite gender.
        # (Caution: this loops forever if the cluster has no such user.)
        while user_sex == user_sex_dict[sex]:
            rand_index = random.randint(0, len(user_cluster) - 1)
            rand_user = user_cluster[rand_index]
            user_sex = expanded_df_backup.iloc[rand_user]['sex']
    else:
        rand_index = random.randint(0, len(user_cluster) - 1)
        rand_user = user_cluster[rand_index]

    print("Is this person dateable? Press Run again to see a new recommendation.\n")
    show_user_data(expanded_df_backup.iloc[:, 0:31], rand_user)
interact_manual(
    make_profile,
    sex=['Male', 'Female'],
    drugs=feature_selection_to_list(clustering_df, ['drugs'], []),
    orientation=feature_selection_to_list(clustering_df, ['orientation'], []),
    diet_adherence=feature_selection_to_list(clustering_df, ['diet_adherence'], []),
    diet_type=feature_selection_to_list(clustering_df, ['diet_type'], []),
    religion_attitude=feature_selection_to_list(clustering_df, ['religion_attitude'], []),
    dog_preference=feature_selection_to_list(clustering_df, ['dog_preference'], []),
    cat_preference=feature_selection_to_list(clustering_df, ['cat_preference'], []),
    has_dogs=feature_selection_to_list(clustering_df, ['has_dogs'], []),
    has_cats=feature_selection_to_list(clustering_df, ['has_cats'], []),
    show_same_gender=True
)
In this article, we accomplished the following:
- Selected a subset of features that a user might want a date to have in common with them.
- Searched for the best number of clusters, k, using the elbow method.
- Trained a K-Means model with k = 100 and examined the cluster distribution.
- Built an interface that places a new user in a cluster and recommends a profile from it.