Previously, we explored a large dataset of roughly 60,000 anonymized OKCupid user profiles from 2012.
You might also want to check out Gender Classification with OKCupid Data, which uses the same dataset.
In this article, we apply K-Means clustering, a machine learning algorithm, to group our OKCupid users into dating pools and narrow down their potential matches.
Clustering is a technique that groups similar data points together. We will use it to group similar people together and recommend whom you should date: the people in your cluster are the ones we will recommend.
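To make the idea concrete before we touch the real data, here is a minimal sketch on made-up toy points (illustrative only, not our OKCupid data):
# K-Means groups the six toy points below into two clusters by proximity.
import numpy as np
from sklearn.cluster import KMeans
toy_points = np.array([[1, 1], [1, 2], [2, 1],    # one tight group
                       [8, 8], [8, 9], [9, 8]])   # another tight group
toy_model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_points)
print(toy_model.labels_)  # e.g. [1 1 1 0 0 0] -- same label = same group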
We start by reviewing the features of our users. Which of these features would you want your date to have in common with you?
expanded_df_backup.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 59811 entries, 0 to 59945
Data columns (total 48 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   age                 59811 non-null  int64
 1   body_type           54538 non-null  object
 2   diet                35481 non-null  object
 3   drinks              56846 non-null  object
 4   drugs               45755 non-null  object
 5   education           53211 non-null  object
 6   essay0              54351 non-null  object
 7   essay1              52275 non-null  object
 8   essay2              50210 non-null  object
 9   essay3              48375 non-null  object
 10  essay4              49318 non-null  object
 11  essay5              49010 non-null  object
 12  essay6              46085 non-null  object
 13  essay7              47404 non-null  object
 14  essay8              40636 non-null  object
 15  essay9              47245 non-null  object
 16  ethnicity           54159 non-null  object
 17  height              59811 non-null  float64
 18  income              11456 non-null  float64
 19  job                 51641 non-null  object
 20  last_online         59811 non-null  object
 21  location            59811 non-null  object
 22  offspring           24334 non-null  object
 23  orientation         59811 non-null  object
 24  pets                39931 non-null  object
 25  religion            39631 non-null  object
 26  sex                 59811 non-null  object
 27  sign                48787 non-null  object
 28  smokes              54320 non-null  object
 29  speaks              59761 non-null  object
 30  status              59811 non-null  object
 31  diet_adherence      35481 non-null  object
 32  diet_type           35481 non-null  object
 33  city                59811 non-null  object
 34  state/country       59811 non-null  object
 35  offspring_want      24334 non-null  object
 36  offspring_attitude  9711 non-null   object
 37  religion_type       39631 non-null  object
 38  religion_attitude   39631 non-null  object
 39  sign_type           48787 non-null  object
 40  sign_attitude       48787 non-null  object
 41  dog_preference      28880 non-null  object
 42  cat_preference      21293 non-null  object
 43  has_dogs            39931 non-null  float64
 44  has_cats            39931 non-null  float64
 45  num_ethnicities     54159 non-null  float64
 46  optional_%unfilled  59811 non-null  float64
 47  num_languages       59811 non-null  int64
dtypes: float64(6), int64(2), object(40)
memory usage: 22.4+ MB
After asking someone which features they would want a date to have in common with them, the chosen features are: drugs, orientation, diet_adherence, diet_type, religion_attitude, dog_preference, cat_preference, has_dogs, and has_cats. Let's isolate that subset of features.
clustering_df = sparse_essay_df.copy()

# Categorical features chosen above; no purely numeric features this time.
cat_selection = ['drugs', 'orientation', 'diet_adherence', 'diet_type',
                 'religion_attitude', 'dog_preference', 'cat_preference',
                 'has_dogs', 'has_cats']
numeric_selection = []

# Expand the feature names into the matching one-hot column names.
feature_selection = feature_selection_to_list(clustering_df, cat_selection, numeric_selection)
clustering_df = clustering_df[feature_selection]
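feature_selection_to_list is a helper defined earlier in the project. For readers jumping in here, a plausible sketch of what it does, assuming the one-hot columns follow a '<feature>_<category>' naming pattern (this is an assumption for illustration, not the project's actual code):
def feature_selection_to_list(df, cat_features, numeric_features):
    # Collect every one-hot column derived from each categorical feature,
    # then append the numeric columns unchanged.
    selected = []
    for feature in cat_features:
        selected += [col for col in df.columns if col.startswith(feature + '_')]
    return selected + list(numeric_features)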
# Convert to a SciPy CSR matrix, a sparse format KMeans can consume efficiently.
clustering_coo = clustering_df.sparse.to_coo()
clustering_csr = clustering_coo.tocsr()
clustering_csr.shape
(59811, 37)
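As an optional sanity check, we can confirm how sparse this matrix is; since each user activates at most nine of the 37 one-hot columns, the density should be well under half:
density = clustering_csr.nnz / (clustering_csr.shape[0] * clustering_csr.shape[1])
print("Non-zero entries: {}, density: {:.1%}".format(clustering_csr.nnz, density))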
Now that we have our subset, let's search for the best number of clusters, k, to use in our model.
# Recommended setting for training KMeans on Windows
# (avoids a known scikit-learn/MKL memory-leak warning).
import os
os.environ["OMP_NUM_THREADS"] = "1"

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

execute = False  # flip to True to re-run the (very slow) elbow search
if execute:
    num_clusters = list(range(1, 200))
    inertias = []
    for i in num_clusters:
        model = KMeans(n_clusters=i)
        model.fit(clustering_csr)
        inertias.append(model.inertia_)

    plt.plot(num_clusters, inertias, '-o')
    plt.xlabel('Number of Clusters (k)')
    plt.ylabel('Inertia')
    plt.show()
The inertia vs. k graph below took 5 hours to produce (hence the execute = False default above, which skips the code snippet). Inertia is a metric that measures how spread out the points of each cluster are around its centroid: it is the sum of squared distances from each point to the centroid of its assigned cluster. A line has been drawn over the tail end of the graph to clearly mark where the curve becomes linear. The point where the curve straightens out is known as the elbow point, and it indicates the number of clusters we should use.
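To make that definition concrete, here is a small sketch (illustrative, not part of the original analysis) that recomputes inertia by hand for any fitted model on dense data:
import numpy as np

def manual_inertia(X, fitted_model):
    # Centroid assigned to each point, then the total squared distance to it.
    assigned_centers = fitted_model.cluster_centers_[fitted_model.labels_]
    return np.sum((X - assigned_centers) ** 2)

# For dense X, manual_inertia(X, fitted_model) matches fitted_model.inertia_
# up to floating-point error.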
k = 100 is where the linearity begins, so let's create the clustering model with that value.
# Fit the final model and assign each of the 59,811 users to a cluster.
model = KMeans(n_clusters=100)
model.fit(clustering_csr)
labels = model.predict(clustering_csr)
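A practical aside: since this fit is expensive, it may be worth caching the trained model to disk. A minimal sketch with joblib (installed alongside scikit-learn); the filename here is just an example:
from joblib import dump, load

dump(model, 'kmeans_k100.joblib')      # save right after the expensive fit
# model = load('kmeans_k100.joblib')   # reload in a later session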
Let’s investigate the uniformity of our cluster distribution.
ser_labels = pd.Series(labels)
ser_labels_props = ser_labels.value_counts()

plt.figure(figsize=(16, 9))
plt.pie(x=ser_labels_props.values, labels=ser_labels_props.index,
        wedgeprops=dict(width=0.10,
                        edgecolor="k",
                        linewidth=0.7))
plt.text(0, 0, 'Cluster Distribution',
         horizontalalignment='center',
         verticalalignment='center',
         fontsize=20)
plt.show()
The users are spread across all of the clusters, though not perfectly evenly.
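If you prefer numbers to a donut chart, a quick way to summarize the spread of cluster sizes:
# Summary statistics of cluster sizes (one row per non-empty cluster).
print(ser_labels_props.describe())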
The code below produces an interface that lets you enter your own information and, when you press Run Interact, outputs the number of the cluster you belong to, the number of people in that cluster, and a random user profile from the same cluster. People in the same cluster are similar to each other.
Note that, as an experiment, null values have been included as their own category, so the model also groups together people who leave similar fields blank. A value of 'nan', 'None', or 'No Prefix' means that the user did not share information for that particular field.
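The encoding that produced these "missing" categories was done earlier in the project; for illustration, pandas can create such a column with dummy_na=True (this sketch is a stand-in, not the project's actual preprocessing code):
import pandas as pd

# Missing values get their own one-hot column, so two users who both left
# 'drugs' blank end up looking similar to the model.
demo = pd.Series(['often', None, 'never'], name='drugs')
print(pd.get_dummies(demo, prefix='drugs', dummy_na=True))
# Columns: drugs_never, drugs_often, drugs_nan
# (values print as 0/1 or True/False depending on your pandas version)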
The actual interface is not available here because it requires an active Python kernel to run. However, you can still see the interface in action via a GIF image preview. If you would like to try the interface yourself, open and run 'Report_stable.ipynb' from this project's GitHub repository.
from ipywidgets import interact_manual
import random

def make_profile(sex, drugs, orientation, diet_adherence, diet_type, religion_attitude,
                 dog_preference, cat_preference, has_dogs, has_cats, show_same_gender):
    user_info = [drugs, orientation, diet_adherence, diet_type, religion_attitude,
                 dog_preference, cat_preference, has_dogs, has_cats]

    # Build a one-row DataFrame with the same one-hot columns as clustering_df.
    user_dict = {}
    for each in clustering_df.columns.to_list():
        user_dict[each] = [0]
    for each in user_info:
        user_dict[each] = [1]
    user_df = pd.DataFrame(data=user_dict, index=['You'])

    # Convert to the same sparse CSR format the model was trained on.
    sparse_user_df = user_df.copy()
    for each in sparse_user_df.columns.to_list():
        sparse_user_df[each] = pd.arrays.SparseArray(sparse_user_df[each].values, dtype='uint8')
    sparse_user_df = sparse_user_df.sparse.to_coo()
    sparse_user_df = sparse_user_df.tocsr()

    # predict returns an array; take its single element for clean printing.
    user_cluster_label = model.predict(sparse_user_df)[0]
    user_cluster = np.where(labels == user_cluster_label)[0]
    print("Your cluster is {}.\nThere are {} users in your cluster.".format(
        user_cluster_label, len(user_cluster)))

    if not show_same_gender:
        user_sex_dict = {'Male': 'm', 'Female': 'f'}
        user_sex = user_sex_dict[sex]
        # Keep sampling until we draw a user of the opposite gender.
        # (Caution: this loops forever if the cluster has no such user.)
        while user_sex == user_sex_dict[sex]:
            rand_index = random.randint(0, len(user_cluster) - 1)
            rand_user = user_cluster[rand_index]
            user_sex = expanded_df_backup.iloc[rand_user]['sex']
    else:
        rand_index = random.randint(0, len(user_cluster) - 1)
        rand_user = user_cluster[rand_index]

    print("Is this person dateable? Press Run again to see a new recommendation.\n")
    show_user_data(expanded_df_backup.iloc[:, 0:31], rand_user)
interact_manual(
    make_profile,
    sex=['Male', 'Female'],
    drugs=feature_selection_to_list(clustering_df, ['drugs'], []),
    orientation=feature_selection_to_list(clustering_df, ['orientation'], []),
    diet_adherence=feature_selection_to_list(clustering_df, ['diet_adherence'], []),
    diet_type=feature_selection_to_list(clustering_df, ['diet_type'], []),
    religion_attitude=feature_selection_to_list(clustering_df, ['religion_attitude'], []),
    dog_preference=feature_selection_to_list(clustering_df, ['dog_preference'], []),
    cat_preference=feature_selection_to_list(clustering_df, ['cat_preference'], []),
    has_dogs=feature_selection_to_list(clustering_df, ['has_dogs'], []),
    has_cats=feature_selection_to_list(clustering_df, ['has_cats'], []),
    show_same_gender=True
)
In this article, we accomplished the following:
- Selected a subset of features that a user might want a date to have in common with them.
- Searched for the best number of clusters, k, using the elbow method.
- Trained a K-Means model with k = 100 and examined the cluster distribution.
- Built an interface that places a new user in a cluster and recommends a profile from it.