Descriptive Analytics for Fraud Detection
Recall
Descriptive analytics: aims at finding behaviour deviating from the norm. It is based on the average behaviour of customers over a given time period.
Predictive analytics: assumes the availability of historical data with known frauds. It can only detect known fraud patterns. It is useful to explain the anomalies found by descriptive analytics.
Descriptive analytics is relevant in environments where:
Organizations are just starting to do fraud detection.
There is no labeled historical data set available.
Fraudsters are continuously adapting their strategies.
Clustering
Main unsupervised technique; it is usually the first technique to try on a dataset.
Idea: split the dataset into groups to maximize intragroup similarities and maximize differences between different groups.

We also want to group anomalies into small, sparse clusters, based on similarity. Those are the relevant elements that we want to analyse to find out whether they are fraudulent or not.
From this image we can also infer one of the most important challenges of clustering: it is important to understand whether the plotted groups are really relevant, or whether they carry redundant information that was already known before applying the clustering technique.
Types of clustering
Hierarchical
Agglomerative
Divisive
Nonhierarchical
k-means
SOM
Distance metric
Aim of clustering: group observations based on similarity -> a distance metric is needed to quantify similarity. Depending on the data type we can use either the Euclidean distance or the Minkowski distance.
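A minimal sketch (using numpy; the feature values are invented) of how both distances could be computed between two transactions:

```python
import numpy as np

# Hypothetical feature vectors of two transactions (amount, recency, frequency)
x = np.array([120.0, 3.0, 7.0])
y = np.array([95.0, 10.0, 4.0])

# Euclidean distance: square root of the sum of squared differences
euclidean = np.sqrt(np.sum((x - y) ** 2))

# Minkowski distance of order p (p=2 reduces to Euclidean, p=1 to Manhattan)
def minkowski(a, b, p):
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

print(euclidean, minkowski(x, y, p=1), minkowski(x, y, p=2))
```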
DM applied to categorical (or binary) variables
We can have categorical variables mixed with the continuous ones. When those are binary variables, they can be analysed with two methods:
Simple matching coefficient (SMC): counts the number of identical matches between the variable values (0-0 and 1-1 matches count equally). For example, we take two instances and check how many times the values are equal (device = Samsung, 2FA = true and device = Apple, 2FA = true, #similarities = 1).
Jaccard index: measures the similarity between both claims across the red flags that were raised at least once (0-0 matches are ignored). Both are computed in the sketch below.
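A minimal sketch of both coefficients on binary variables (the flag vectors below are invented for illustration):

```python
import numpy as np

# Two claims described by the same set of binary red-flag indicators
claim_a = np.array([1, 0, 1, 0, 0, 1])
claim_b = np.array([1, 0, 0, 0, 1, 1])

matches       = np.sum(claim_a == claim_b)               # identical values (1-1 and 0-0)
both_raised   = np.sum((claim_a == 1) & (claim_b == 1))  # flags raised in both claims
either_raised = np.sum((claim_a == 1) | (claim_b == 1))  # flags raised at least once

smc = matches / len(claim_a)            # Simple Matching Coefficient
jaccard = both_raised / either_raised   # Jaccard index (ignores 0-0 matches)

print(smc, jaccard)
```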
Example #1

Example #2
Consider a system with:
100 red flags indicators
5 raised on average
Q: Which one do we apply? SMC or Jaccard?
A:
Jaccard, because if we compute the SMC for all pairs of instances we always obtain high similarities (most flags are 0 in both claims), whereas with Jaccard we only count flags that were actually raised, as the sketch below illustrates.
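A rough numerical sketch of this argument (the overlap of 2 shared flags is an assumption chosen for illustration):

```python
n_flags = 100               # total number of red-flag indicators
raised_a, raised_b = 5, 5   # flags raised per claim on average
shared = 2                  # flags raised in both claims (assumed for illustration)

# SMC also counts the many 0-0 matches, so it is always close to 1
zero_zero = n_flags - (raised_a + raised_b - shared)
smc = (shared + zero_zero) / n_flags                 # (2 + 92) / 100 = 0.94

# Jaccard only looks at flags raised in at least one of the two claims
jaccard = shared / (raised_a + raised_b - shared)    # 2 / 8 = 0.25

print(smc, jaccard)
```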
Problem with categorical variables
If we have lots of values to choose from for each category (e.g. device_manufacturer = ['Apple', 'Samsung', 'Huawei', 'Google', ...]) we have two options:
Grouping the values (into 0/1 dummies); for example, IP addresses can be grouped by range.
Exploiting the metrics used for binary variables (checking that the values are exactly the same, or SMC), or encoding values as frequencies.
Mix with continuous variables
Code the categorical variables as 0/1 dummies and use a continuous distance measure (see the sketch below).
Use a weighted combination of distance metrics (less straightforward and less frequently used).
There is no single best approach; it is mostly a matter of trial and error.
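A minimal sketch of the first option, assuming pandas/scipy and invented column names and values:

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Hypothetical transactions mixing continuous and categorical variables
df = pd.DataFrame({
    "amount":   [120.0, 95.0, 2300.0],
    "n_tx_24h": [3, 1, 15],
    "device":   ["Apple", "Samsung", "Apple"],
    "2FA":      [True, False, False],
})

# Code the categorical variables as 0/1 dummies, keep the continuous ones as-is
encoded = pd.get_dummies(df, columns=["device", "2FA"]).astype(float)

# (Optionally standardize the continuous columns first so they do not dominate.)
dist_matrix = squareform(pdist(encoded.values, metric="euclidean"))
print(dist_matrix.round(2))
```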
Hierarchical clustering

Pretty self-explanatory.
Types of distance between clusters

Those are all distance measure options that need to be chosen and set when computing the clustering. Depending on the chosen distance measure you can get totally different results. Brief explanation of each:
Single linkage: to compute the distance between two clusters you take the closest points of the two clusters and compute the distance between them.
Complete linkage: distance between the two most dissimilar points of the two clusters.
Average linkage: average of the distances between all pairs of points of the two clusters.
Centroid method: compute the centroids, which are a sort of mean of each cluster, and then compute the distance between the centroids.
Ward distance: at each step it merges the two clusters whose merge causes the smallest increase in the total within-cluster variance (SSE), so it effectively compares the quality of the clustering solution before and after the merge.
Problem: if you want to compute those distances in practice you will see that it is not an efficient process; even just finding the two most similar elements is expensive to compute. The point-based linkage options are illustrated in the sketch below.
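A minimal numpy sketch of the point-based linkage distances between two small invented clusters:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two small clusters of 2-D points (invented for illustration)
cluster_a = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0]])
cluster_b = np.array([[6.0, 5.0], [7.0, 6.0]])

pairwise = cdist(cluster_a, cluster_b)      # all point-to-point distances

single   = pairwise.min()                   # single linkage: closest pair
complete = pairwise.max()                   # complete linkage: most dissimilar pair
average  = pairwise.mean()                  # average linkage: mean over all pairs
centroid = np.linalg.norm(cluster_a.mean(axis=0) - cluster_b.mean(axis=0))  # centroid method

print(single, complete, average, centroid)
```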
Deciding the number of clusters
Dendrogram: tree-like diagram that records the sequence of merges. The vertical (or horizontal) scale gives the distance between two clusters at the moment they are merged. You then put a threshold on the distance at which two clusters are still considered merged: cutting the dendrogram at the desired level gives the resulting clustering.
Scree plot: plot of the distance at which clusters are merged. The elbow point then indicates the optimal number of clusters.
Graphical methods are important because they make this decision more efficient, which is of crucial importance; both plots are produced in the sketch below.
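A minimal sketch (toy data; Ward linkage and scipy/matplotlib assumed) that produces both a dendrogram and a scree-like plot of the merge distances:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])  # toy data

Z = linkage(X, method="ward")   # hierarchical clustering (Ward linkage assumed)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
dendrogram(Z, ax=ax1)                        # cut at the desired distance threshold
ax1.set_title("Dendrogram")

merge_distances = Z[:, 2][::-1]              # distances of the last merges first
ax2.plot(range(1, 11), merge_distances[:10], marker="o")  # look for the elbow
ax2.set_xlabel("number of clusters")
ax2.set_ylabel("merge distance")
plt.show()
```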

Conclusions
Advantages:
The number of clusters does not need to be specified prior to the analysis.
Disadvantages:
Does not scale: the larger the dataset, the higher the computational complexity.
The interpretation of the clusters is often subjective and depends on the business expert and/or data scientist.
Non-Hierarchical clustering
k-Means
The number of clusters k must be selected from the beginning. This can sometimes be done after an interview with your client/analyst, from whom you may learn how many distinct fraudulent behaviours exist; otherwise you need to try several values by brute force. Let's assume we'll use k clusters.
For each single instance you compute the distance to every centroid and assign it to the closest one. Then you recompute each centroid as the mean of the points assigned to it. You repeat until you reach a maximum number of iterations or the centroids no longer change (see the sketch below).
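A minimal numpy sketch of this loop (toy 2-D data; k, the initialization and the stopping criteria are assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])  # toy data
k, max_iter = 2, 100

centroids = X[rng.choice(len(X), k, replace=False)]       # random initial centroids
for _ in range(max_iter):
    # Assign each instance to the closest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Recompute each centroid as the mean of its assigned points
    # (assumes no cluster ends up empty)
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centroids, centroids):              # centroids no longer change
        break
    centroids = new_centroids

print(centroids)
```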
Example

Suppose we choose the worst possible positions for the centroids. We assign each observation to the closest centroid:

Conclusions
Disadvantages:
k needs to be known
The final result depends on where you put the initial centroids.
Advantages:
Not as computationally complex as hierarchical methods.
Self-organizing maps
Idea: neural network with only two layers (input, output):
Unsupervised learning algorithm that allows users to visualize and cluster high-dimensional data on a low-dimensional grid of neurons
Main strength: they allow you to visualise and cluster high-dimensional data in a low-dimensional space.
First of all we decide the structure of the output:

Then you construct the map by connecting input and output: each input is connected to all neurons in the output layer, and each output neuron i carries a weight vector w_i = [w_1i, ..., w_Ni], where N is the number of input variables. All weights are randomly initialized.
At this point, for each input observation x you compute the distance between x and the weight vector of every output neuron; the neuron whose weights are closest to x is the Best Matching Unit (BMU): c = argmin_i ||x − w_i||.
The weight vector of the BMU and those of its neighbours in the grid are then adapted towards the input observation using the following learning rule:
w_i(t+1) = w_i(t) + h_ci(t) [x(t) − w_i(t)]
Where:
t represents the time index during training.
x(t) is the input observation presented at time t.
h_ci(t) defines the neighbourhood of the BMU c around neuron i, and shrinks over time.
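A minimal numpy sketch of the BMU search and the learning rule above (grid size, learning rate and neighbourhood width are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_vars, grid_w, grid_h = 3, 5, 5             # N input variables, 5x5 output grid
W = rng.random((grid_w * grid_h, n_vars))    # randomly initialized weight vectors
coords = np.array([[i, j] for i in range(grid_w) for j in range(grid_h)])  # grid positions

def train_step(x, W, t, lr=0.5, sigma=1.5):
    # Best matching unit: neuron whose weight vector is closest to the input x
    bmu = np.argmin(np.linalg.norm(W - x, axis=1))
    # Neighbourhood h_ci(t): Gaussian around the BMU on the grid, decaying with t
    grid_dist = np.linalg.norm(coords - coords[bmu], axis=1)
    h = np.exp(-grid_dist ** 2 / (2 * sigma ** 2)) * lr / (1 + t)
    # Learning rule: w_i(t+1) = w_i(t) + h_ci(t) * (x - w_i(t))
    return W + h[:, None] * (x - W)

for t, x in enumerate(rng.random((100, n_vars))):   # toy input observations
    W = train_step(x, W, t)
```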

Semi-supervised clustering (clustering with constraints)
These are methods that directly exploit expert knowledge to cluster the data: they put constraints on how elements of the data may be grouped.
Idea: bias the clustering with expert knowledge such that the clusters can be found quicker and they have the desired properties.
The constraints can be enforced during the clustering: each time an observation is (re-)assigned, the constraints are verified and the (re-)assignment halted in case violations occur.
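A minimal sketch of such a check, where the helper function is hypothetical (not a standard library routine):

```python
# Hypothetical helper: returns True if assigning `obs` to `cluster` keeps
# all constraints satisfied, given the current cluster assignments.
def assignment_allowed(obs, cluster, assignments, must_link, cannot_link):
    for a, b in must_link:                       # pairs that must end up together
        other = b if a == obs else a if b == obs else None
        if other is not None and assignments.get(other) not in (None, cluster):
            return False
    for a, b in cannot_link:                     # pairs that may not be together
        other = b if a == obs else a if b == obs else None
        if other is not None and assignments.get(other) == cluster:
            return False
    return True

# Example: observation 3 may not join the cluster that already contains observation 7
assignments = {7: 1}
print(assignment_allowed(3, 1, assignments, must_link=[], cannot_link=[(3, 7)]))  # False
```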
Types of constraints
Observation-level constraints: set for individual observations.
Cluster-level constraints: defined at the level of the cluster.
Minimum separation or δ-constraints.
ε-constraints.
Other constraints:
Requirement to have balanced clusters, whereby each cluster contains the same amount of observations.
In practice this means that some instances are pre-assigned to clusters that are known through expert knowledge.
One class Support Vector Machine (SVM)
Idea: distinguish the norm from the anomalies by representing the data in a feature space and finding the hyperplane that best separates the norm from the anomalies (a linear separation in that space), as in the sketch below.
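A minimal sketch using scikit-learn's OneClassSVM (the data and the kernel/nu parameters are assumptions):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, (200, 2))         # mostly normal (non-fraud) behaviour

# nu ~ upper bound on the expected fraction of anomalies; kernel choice is an assumption
oc_svm = OneClassSVM(kernel="rbf", nu=0.05).fit(X_train)

X_new = np.array([[0.1, -0.2], [6.0, 6.0]])  # one normal point, one outlier
print(oc_svm.predict(X_new))                  # +1 = norm, -1 = anomaly
```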

Evaluating and Interpreting Clustering Solutions
No trivial exercise
There exists no universal criterion
Possible approach: compare the within-cluster sum of squared errors (SSE) of different clustering solutions and select the one that minimizes it, as in the sketch below.
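A minimal sketch comparing the within-cluster SSE (inertia) of k-means solutions for different numbers of clusters (toy data; scikit-learn assumed):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.5, (40, 2)) for c in (0, 4, 8)])   # toy data, 3 groups

# Within-cluster SSE (inertia) for each candidate number of clusters
for k in range(2, 7):
    sse = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(sse, 1))   # pick the k where the SSE stops dropping sharply (elbow)
```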
Example
Suppose we decide to plot the different features of each cluster with histograms to understand if the clusters we got are meaningful or not.

For example, in the transaction-amount histogram the cluster with high amounts is clearly distinguished from the cluster with low amounts -> good clustering. A sketch of this check follows.
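A minimal sketch of this check on toy labeled data (column names and distributions are invented):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "amount":  np.concatenate([rng.normal(50, 10, 100), rng.normal(400, 50, 100)]),
    "recency": np.concatenate([rng.normal(30, 5, 100), rng.normal(5, 2, 100)]),
    "cluster": [0] * 100 + [1] * 100,        # labels produced by the clustering step
})

# One histogram per feature, split by cluster: well-separated distributions
# indicate that the clusters are meaningful
fig, axes = plt.subplots(1, 2, figsize=(9, 3))
for ax, col in zip(axes, ["amount", "recency"]):
    for c, group in df.groupby("cluster"):
        ax.hist(group[col], bins=20, alpha=0.5, label=f"cluster {c}")
    ax.set_title(col)
    ax.legend()
plt.show()
```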
Conclusion
The final result of a clustering technique is a labeled dataset on which we can train a supervised learning method.