This repository was archived by the owner on Jan 21, 2024. It is now read-only.

Cluster Analysis in Salesforce

iskander-m edited this page Jul 9, 2020 · 3 revisions

Intro

Cluster analysis is the task of grouping similar objects into a number of groups called clusters. It is one of the unsupervised machine learning methods: it is called "unsupervised" because no labelled data or training dataset is known upfront. Cluster analysis is widely used in market research, medicine, biology, finance, computer science and many other fields. There are many clustering algorithms, which can be grouped into several classes, such as hierarchical clustering, centroid-based clustering, density-based clustering and others.

A comparison of the clustering algorithms from scikit-learn:

In this post I will describe how to cluster data with numerical and mixed types, how to visualize the clustering results of a multidimensional dataset on a 2D plot like the above and how clustering algorithms can be implemented on the Salesforce platform.

K-Means

K-Means is the most popular clustering algorithm because of its simplicity and speed. K-Means belongs to the class of centroid-based clustering algorithms. "K" refers to the number of clusters you want your data grouped into; this is the algorithm's only input parameter. The algorithm has the following steps:

  1. Centroid initialization step: Randomly generate or choose k initial centers C = { c1, c2, ... , ck }.

  2. Data assignment step: Assign each data point x to the closest cluster in C by calculating the distance between x and each centroid ci. We can use Euclidean or Manhattan distance as a similarity measure.

  3. Centroid update step: Calculate new centroids by computing the arithmetic mean of all data points assigned to each cluster.

  4. Repeat steps 2 and 3 until C no longer changes.

A random choice of initial centroids can result in a poor overall clustering outcome. The K-Means++ centroid initialization algorithm can be used in step 1 to address this issue by choosing better distributed initial centroids. Although the initial selection in this case takes extra time, the K-Means part itself converges very quickly after this seeding, so the algorithm actually lowers the overall computation time. We will use this algorithm for both K-Means and K-Medoids initialization in the Apex implementation below.
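The steps above, with K-Means++ seeding, can be sketched in Python. This is a minimal one-dimensional illustration, not the Apex implementation from the package; all function and variable names are mine:

```python
import random

def kmeans_pp_init(points, k, rng):
    """K-Means++ seeding: pick each new center with probability
    proportional to the squared distance to the nearest chosen center."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min((p - c) ** 2 for c in centers) for p in points]
        r = rng.uniform(0, sum(d2))
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(p)
                break
    return centers

def kmeans(points, k, seed=0, max_iter=100):
    """Lloyd's algorithm on 1-D numeric points with K-Means++ seeding."""
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    for _ in range(max_iter):
        # step 2: assign each point to the closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[j].append(p)
        # step 3: recompute each centroid as the arithmetic mean
        new_centers = [sum(c) / len(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:   # step 4: stop when C no longer changes
            break
        centers = new_centers
    return centers, clusters
```

For well-separated data such as [1, 2, 3, 100, 101, 102] with k = 2 this converges to the cluster means 2.0 and 101.0 within a couple of iterations.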

Clustering of Mixed Data Types

The K-Means algorithm assumes that our dataset contains only continuous (numeric) data. But sometimes the dataset has data of different types: numeric, categorical or text. In this case we cannot use the Euclidean or Manhattan distance functions to calculate pairwise similarity because they require numeric arguments. We also cannot use the K-Means algorithm because the centroid update step requires numeric data to compute the arithmetic mean. There are several approaches to clustering data with mixed types; one of them is to use the weighted Gower distance as a similarity measure and K-Medoids as a clustering algorithm.

Gower distance

Gower distance (1971) supports variable data types. It is defined as follows:

Sij = ( Σk wijk · Sijk ) / ( Σk wijk )

where:

wijk - the weight for variable k between observations i and j

Sijk - the distance between i and j on variable k

For each variable type a particular distance function is used. For a numeric variable k the distance Sijk is calculated as:

Sijk = |xik - xjk| / rk

where:

rk = max(xk) - min(xk)

xik = the value of variable k for object i

For a categorical variable k the distance Sijk is calculated as:

Sijk = 0 if xik = xjk, otherwise 1

For a text variable k any string similarity function can be used. This article compares popular string similarity algorithms. I will use Levenshtein distance (1965) simply because the Apex String class already has a getLevenshteinDistance method which implements this algorithm.

Sijk in this case is calculated as:

Sijk = lev(xik, xjk) / max(len(xik), len(xjk))

where lev(xik, xjk) is the Levenshtein distance between strings xik and xjk, and len(xk) is the length of the string xk.

Levenshtein distance returns the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. We divide it by the maximum string length to normalize the result into the [0...1] range, as in the numeric and categorical cases.

It is worth noting that the Levenshtein distance is a character edit distance, it doesn't take words' meaning into account. Therefore the words "ball" and "bill" might be in one cluster but "cat" and "kitten" will be in different clusters.

For a long-text variable k, the tf-idf method in combination with cosine similarity will be used to calculate the distance Sijk. It is based on keyword weights of each document (long text value). I will write another article to describe this method and its implementation in Apex.
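Putting the per-type distances together, a weighted Gower distance can be sketched in Python as follows. The Levenshtein helper stands in for Apex's String.getLevenshteinDistance; all names here are illustrative:

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def gower_distance(x, y, kinds, ranges, weights=None):
    """Weighted Gower distance between records x and y.
    kinds[k] is 'num', 'cat' or 'text'; ranges[k] = max - min for numeric k."""
    weights = weights or [1.0] * len(kinds)
    num, den = 0.0, 0.0
    for k, kind in enumerate(kinds):
        if kind == 'num':
            s = abs(x[k] - y[k]) / ranges[k] if ranges[k] else 0.0
        elif kind == 'cat':
            s = 0.0 if x[k] == y[k] else 1.0
        else:  # text: Levenshtein normalized by the longer string
            m = max(len(x[k]), len(y[k]))
            s = levenshtein(x[k], y[k]) / m if m else 0.0
        num += weights[k] * s
        den += weights[k]
    return num / den
```

With equal weights, two records differing by 0.1 on the numeric scale, fully on the category, and by one of four characters on the text field get a distance of (0.1 + 1 + 0.25) / 3 = 0.45.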

K-Medoids

The K-Medoids, or partitioning around medoids (PAM), algorithm is a variation of the K-Means algorithm. Unlike K-Means, K-Medoids chooses cluster centers from existing data points. These center points are called medoids. The algorithm also has one input parameter "K" - the number of clusters we want to group the data into. K-Medoids has the following steps:

  1. Medoid initialization step: this step is exactly the same as in K-Means. We will be using K-Means++ initialization

  2. Data assignment step: Assign each data point to its closest medoid and calculate the cost function. This step is very similar to K-Means, but instead of Euclidean distance we use the Gower distance described above as the pairwise similarity measure. The cost function is calculated as follows:

cost = Σi Σj∈Ci d(i, j)

where d(i,j) = Sij is the Gower distance between medoid i and data point j

  3. Medoid swap step

    • For each medoid mi

      • For each non-medoid data point xj in cluster i

        • Swap medoid mi and data point xj

        • Assign all data points to new medoids

        • Calculate new cost using the above cost function

        • If cost < minCost

          • Remember medoids configuration

          • minCost = cost

        • Restore mi

      • Perform best swap which corresponds to the minimal cost

  4. Repeat steps 2 and 3 until there is no change in medoids
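The assignment and swap steps can be sketched in Python like this, using an arbitrary pairwise distance function in place of the Gower distance (all names are mine):

```python
def assign_and_cost(points, medoids, dist):
    """Assign every point to its closest medoid; return (assignment, total cost)."""
    assign, cost = [], 0.0
    for p in points:
        d = [dist(p, m) for m in medoids]
        i = min(range(len(medoids)), key=d.__getitem__)
        assign.append(i)
        cost += d[i]
    return assign, cost

def pam_swap(points, medoids, dist, max_iter=20):
    """Greedy PAM swap: try swapping each medoid with each non-medoid point
    in its cluster, keep the swap with the lowest total cost, and repeat
    until no swap improves the cost (or the iteration budget runs out)."""
    medoids = list(medoids)
    assign, best_cost = assign_and_cost(points, medoids, dist)
    for _ in range(max_iter):
        best = None
        for i, m in enumerate(medoids):
            for p, a in zip(points, assign):
                if a != i or p == m:
                    continue
                trial = medoids[:i] + [p] + medoids[i + 1:]
                _, cost = assign_and_cost(points, trial, dist)
                if cost < best_cost:
                    best_cost, best = cost, trial
        if best is None:        # no improving swap: medoids are stable
            break
        medoids = best
        assign, _ = assign_and_cost(points, medoids, dist)
    return medoids, best_cost
```

Starting from medoids [1, 100] on the toy dataset [1, 2, 3, 100, 101, 102], two swap rounds settle on medoids 2 and 101 with a total cost of 4.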

CLARA

The runtime complexity of the original K-Medoids PAM algorithm is O(k(n-k)²). It would take an enormous amount of time to cluster thousands of records. Memory and storage are another factor: if the entire dataset cannot be loaded into memory (as in Salesforce Apex with its 12 MB heap limit), the code has to access the data storage very frequently for reads and writes.

There is a variation of the PAM algorithm called CLARA (Clustering LARge Applications, 1986). CLARA applies the PAM algorithm to a subsample of n' < n objects, with the suggested value n' = 40 + 2k. Afterwards, the remaining objects are assigned to their closest medoids. The subsample of n' objects is selected randomly from the original data source.
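The 1986 CLARA paper draws several random subsamples and keeps the medoid set with the lowest cost over the full dataset; a Python sketch of that idea might look like this (pam_fn is any PAM implementation you supply; all names are placeholders):

```python
import random

def clara(points, k, pam_fn, dist, n_samples=5, seed=0):
    """CLARA sketch: run PAM (pam_fn) on random subsamples of size
    40 + 2k, keep the medoid set with the lowest cost over the FULL
    dataset, then assign every object to its closest medoid."""
    rng = random.Random(seed)
    size = min(len(points), 40 + 2 * k)
    best_medoids, best_cost = None, float('inf')
    for _ in range(n_samples):
        sample = rng.sample(points, size)
        medoids = pam_fn(sample, k)
        # evaluate the candidate medoids against the entire dataset
        cost = sum(min(dist(p, m) for m in medoids) for p in points)
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    labels = [min(range(k), key=lambda j: dist(p, best_medoids[j]))
              for p in points]
    return best_medoids, labels, best_cost
```

Because each candidate medoid set is scored on all n points, a bad subsample cannot win just by fitting itself well.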

Number of clusters

Both K-Means and K-Medoids require us to specify the number of clusters as an input parameter. How can we identify the optimal number of clusters for a dataset? Several methods exist for evaluating clustering results. I will use the Silhouette coefficient, which takes into account the average distance to data points in the same cluster and the average distance to elements in the neighbouring clusters. The silhouette ranges from -1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighbouring clusters. To identify the optimal number of clusters we need to run the algorithm with different cluster counts and choose the one with the highest silhouette value.
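The Silhouette coefficient itself is easy to sketch in Python (a minimal version over a labelled dataset; names are mine):

```python
from collections import defaultdict

def silhouette_score(points, labels, dist):
    """Mean silhouette: s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i)
    is the mean distance to i's own cluster and b(i) the mean distance to
    the nearest other cluster. Singleton clusters contribute s(i) = 0."""
    by_label = defaultdict(list)
    for idx, lbl in enumerate(labels):
        by_label[lbl].append(idx)
    total = 0.0
    for i, lbl in enumerate(labels):
        own = [j for j in by_label[lbl] if j != i]
        if not own:
            continue
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        b = min(sum(dist(points[i], points[j]) for j in idxs) / len(idxs)
                for other, idxs in by_label.items() if other != lbl)
        total += (b - a) / max(a, b)
    return total / len(points)
```

Running this for each candidate k and keeping the labelling with the highest score is exactly the model-selection loop described above.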

Refinement

Once we have the initial cluster distribution from the CLARA algorithm we can validate how well each data point was assigned to its cluster compared to a neighbouring cluster. To measure this we will use cost functions similar to those in the Silhouette calculation.

Refinement algorithm has the following steps:

  • For each data point i:

    • Calculate the mean distance a(i) to all data points in the current cluster Ci (excluding the data point i itself)

    • Find the nearest neighbouring cluster Ck and calculate the mean distance b(i) to all data points in that cluster

    • If b(i) < a(i), reassign the data point i to cluster Ck
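A single refinement pass over the steps above can be sketched in Python (same a(i)/b(i) definitions as in the silhouette; names are illustrative):

```python
from collections import defaultdict

def refine(points, labels, dist):
    """One refinement pass: move data point i to the nearest neighbouring
    cluster whenever its mean distance b(i) to that cluster is smaller
    than the mean distance a(i) to its own cluster."""
    by_label = defaultdict(list)
    for idx, lbl in enumerate(labels):
        by_label[lbl].append(idx)
    new_labels = list(labels)
    for i, lbl in enumerate(labels):
        own = [j for j in by_label[lbl] if j != i]
        if not own:
            continue
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # mean distance to each other cluster; keep the nearest one
        b, nearest = min(
            (sum(dist(points[i], points[j]) for j in idxs) / len(idxs), other)
            for other, idxs in by_label.items() if other != lbl)
        if b < a:
            new_labels[i] = nearest
    return new_labels
```

On [1, 2, 3, 100, 101, 102] with point 3 misassigned to the second cluster, one pass moves it back while leaving all correctly placed points alone.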

Implementation in Apex and Lightning

The above clustering algorithms are implemented in many programming languages, such as Python, R, Java and C#. So why Apex? Here are the reasons that make clustering with Apex a possible and attractive solution:

  • Salesforce customers have lots of data stored in the cloud. It would be very convenient and secure if clustering is performed right on the spot without exporting large amounts of data.

  • Despite the absence of some language features (like lambdas), ML libraries, and most importantly very strict governor limits, Apex code runs on a very powerful cloud computing platform.

  • A cluster model can be designed and executed in the Lightning Experience UI with literally a few mouse clicks.

  • Lightning is the native Salesforce UI platform, so the clustering application can be easily integrated into Salesforce. There is no need to create a web application from scratch or think about databases, hosting, authentication and other details. This all comes with the "Salesforce package".

ML algorithms require heavy data processing and computations. Apex developers must always think about governor limits when writing code. The most important limits in our case are Apex CPU time, heap size, the number of SOQL queries per transaction and the number of DML operations. How can we implement a fairly complex K-Medoids algorithm with a 60-second CPU limit and a 12 MB heap size limit (in async mode)? One option is Batch Apex and job chaining. This option doesn't require us to use third-party tools or computing resources (like AWS).

Schema

Before implementing the clustering algorithms we need to design the data structure which will contain our models and clustering results. A clustering model is represented by two objects: ClusterModel__c and ClusterModelField__c. ClusterModel__c contains the model definition, such as the model name, algorithm, source object name, an optional data filter and a SOQL query which will be used to retrieve data from the source object. ClusterModelField__c is a child object which contains definitions for each model field, such as the field name, field type (numeric, categorical, text, output) and weight (used in the Gower distance formula).

Clustering result is represented by three objects: ClusterJob__c, ClusterJobCluster__c and ClusterJobResult__c. ClusterJob__c represents a clustering job or run. A job record will be created each time we run a model. This object contains job start and end times, status, model id, number of clusters and silhouette score. ClusterJobCluster__c contains the list of clusters for a job.

ClusterJobResult__c contains data records copied from the source object and cluster numbers (and cluster ids) for each record.

I decided to copy the data from the source object to ClusterJobResult__c because we need a place to store a cluster number for each record. Since our application should support any standard or custom object it is complicated to store this information in the source object. This will require an additional step to copy data, but at the same time (and in the same loop) we can do some additional initialization steps, such as:

  • Find min and max values for each field. This is required for distance calculation (Gower, Euclidean, Manhattan)

  • Calculate the number of records instead of running an expensive "SELECT COUNT() ..." query

  • Populate a random value to ClusterJobResult__c.Random__c. This is required for CLARA algorithm which sorts the data on this field to select a random subsample of records

  • Serialize source record values to JSON and store them in the Json__c field
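All four initialization steps fit into one pass over the source records. A Python sketch of that single loop (the Random__c/Json__c field names mirror the schema above; everything else is illustrative):

```python
import json
import random

def init_results(records, numeric_fields, seed=0):
    """Single pass over source records: track per-field min/max, count the
    rows, and build ClusterJobResult-like dicts carrying a random sort key
    (used later by CLARA subsampling) and the serialized source values."""
    rng = random.Random(seed)
    mins = {f: float('inf') for f in numeric_fields}
    maxs = {f: float('-inf') for f in numeric_fields}
    results, count = [], 0
    for rec in records:
        count += 1
        for f in numeric_fields:
            mins[f] = min(mins[f], rec[f])
            maxs[f] = max(maxs[f], rec[f])
        results.append({
            'Random__c': rng.random(),   # random subsample sort key
            'Json__c': json.dumps(rec),  # serialized source record
        })
    return results, mins, maxs, count
```

The min/max values feed straight into the rk = max(xk) - min(xk) term of the Gower distance, so no second scan of the data is needed.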

Batch chaining

Both K-Means and K-Medoids have several steps which can be implemented with batch Apex. For K-Medoids (CLARA) the steps are:

  1. Data initialization - populate ClusterJobResult__c, find min and max values

  2. Medoid initialization - retrieve a random subset of records, initialize medoids with K-Means++

  3. Subset data assignment - assign each record in the subset to the closest medoid

  4. Medoid swap

  5. Refinement

  6. Silhouette coefficient calculation

  7. Entire data assignment - assign the rest of records in the dataset to closest medoids

  8. Optional result output step - populate cluster number to a custom field in the source object. The field name must start with 'ClusterNumber' to prevent a possible data loss.

Since these steps are implemented as async batch Apex classes we need some logic which would run the next step after the previous one is finished or go back and repeat a few previous steps. For K-Medoids steps 3 and 4 should repeat until medoids are stable or the maximum number of repeat iterations is reached. This can be done using Apex batch job chaining. From Apex Developer Guide: "You can chain a batch job by calling Database.executeBatch or System.scheduleBatch from the finish method of the current batch class. The new batch job will start after the current batch job finishes".
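The chaining logic can be sketched as a simple driver loop. Python stands in here for the Apex finish-method chaining; the step functions, state keys and indices are all illustrative:

```python
def run_job(steps, state, assign_idx, swap_idx, max_repeats=5):
    """Each entry in `steps` is a function(state) -> state, standing in for
    one batch class. After the swap step, chain back to the assignment
    step while the medoids are still changing and the repeat budget lasts."""
    i, repeats = 0, 0
    while i < len(steps):
        state = steps[i](state)  # "run" the batch, get the updated state
        if (i == swap_idx and state.get('medoids_changed')
                and repeats < max_repeats):
            repeats += 1
            i = assign_idx       # schedule the data-assignment step again
        else:
            i += 1               # schedule the next step in the chain
    return state
```

In the real implementation the jump back is a Database.executeBatch call from the finish method, and `state` is the Database.Stateful payload passed to the next batch class.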

When chaining Batch Apex classes we need to pass and preserve the job state. It contains the current step index, the medoids array, min and max values, and some other cached data and counters. To maintain the state, all batch Apex classes implement the Database.Stateful interface. When scheduling a new batch, the current batch class passes the state to the new batch class. In addition, the state is stored as an attachment on the ClusterJob__c record because it will be used later in visualization. The size of the state must be less than 12 MB (in async mode), which is dictated by the heap size governor limit.

Use case

In our use case we will try to cluster credit card applicants, the sample ML dataset is available for download from Kaggle. First we will create a credit_card_application__c custom object which will contain the data. Salesforce has a handy wizard which can create a custom object from a csv file and import the data from that file.

We will cluster the data based on the following fields:

AMT_INCOME_TOTAL__c (Numeric): Annual income

CNT_CHILDREN__c (Numeric): Number of children

CNT_FAM_MEMBERS__c (Numeric): Number of family members

NAME_EDUCATION_TYPE__c (Text): Lower secondary, Secondary / secondary special, Incomplete higher, Higher education, Academic degree

NAME_HOUSING_TYPE__c (Category): With parents, Co-op apartment, Rented apartment, House / apartment, Municipal apartment, Office apartment

NAME_INCOME_TYPE__c (Category): Working, Commercial associate, Pensioner, State servant, Student

We will create a K-Medoids model using the below SOQL query:

SELECT AMT_INCOME_TOTAL__c, CNT_CHILDREN__c, CNT_FAM_MEMBERS__c, Id, NAME_EDUCATION_TYPE__c, NAME_HOUSING_TYPE__c, NAME_INCOME_TYPE__c, Name FROM credit_card_application__c

After running the model with 5 clusters the silhouette score is 0.4280. Let's run the model with different cluster counts (from 3 to 10) to determine the optimal number of clusters. When the calculations are completed we can build the following report on cluster jobs:

As we can see, the maximum silhouette score of 0.62 was returned for 7 clusters, so 7 is the optimal number of clusters for the credit card applicants dataset. The algorithm reported the following clusters and cluster centers (medoids):

Visualization

To visualize the multidimensional dataset in a lower-dimensional space (2D in our case) we need to somehow reduce the dimensionality of the data. One of the most popular dimensionality reduction techniques is t-distributed stochastic neighbor embedding, or t-SNE. t-SNE models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability. This method also works with mixed data types because it uses a distance function to perform the calculation.

t-SNE calculations are performed in JavaScript on the client side; I will use the tSNEJS library to reduce the dimensionality to 2D and the D3.js library for the visualization. Here is the t-SNE plot for the above use case (7 clusters):

t-SNE visualization of credit card applicants grouped into 7 clusters

t-SNE visualization with d3 force collision prevention (forceCollide)

Conclusion

In this post I described how to cluster a dataset with mixed data types and how to implement clustering algorithms on the Salesforce platform. The above algorithms are implemented in the "Cluster Analysis" managed package. The package is available for installation from the AppExchange, and the source code is published in the GitHub repo.

Some plans for future versions:

  • It would be nice to have other algorithms implemented, like spectral clustering and DBSCAN

  • Salesforce runs Apex batches sequentially. It would be nice to have several batch jobs running in parallel. To achieve this we would need to perform some sort of PK chunking and schedule several batch Apex jobs in parallel

Please let me know if you have any questions or ideas on the methodology and/or the implementation.


Iskander Mukhamedgaliyev
Salesforce
