Predictive Analytics for Fraud Detection
The aim is to build an analytical model predicting a target measure of interest. Two types of predictive analytics can be distinguished depending on the measurement level of the target:
Regression
Classification
Regression
Target variable
continuous
Either limited (e.g., $Y \in [0,1]$) or unlimited (e.g., $Y \in [0, +\infty)$).
Classification
Target variable
Categorical: has a limited set of predefined values:
Binary classification: only two classes are considered (e.g., fraud versus no-fraud)
Multiclass classification: the target can belong to more than two classes (e.g., severe fraud, medium fraud, no fraud)
The target fraud indicator is usually hard to obtain and determine: one can never be fully sure that a certain transaction is fraudulent, and the target labels are typically not noise-free.
Linear regression
Technique to model a continuous target variable: $Y = \beta_0 + \beta_1 X_1 + \ldots + \beta_N X_N$.
Y: Target;
X1,...,XN: Explanatory variables;
The β parameters can then be estimated by minimizing a squared error function;
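A minimal sketch of estimating the β parameters by least squares (scikit-learn; the data and coefficient values are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: two explanatory variables X1, X2 and a continuous target Y.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 1.5 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = LinearRegression()          # estimates the betas by minimizing the squared error
model.fit(X, y)
print(model.intercept_, model.coef_)  # roughly beta0=1.5, beta1=2.0, beta2=-0.5
```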
Logistic regression
Logistic Regression Model: combination of the linear regression with a bounding function:
$P(\text{fraud} = \text{yes} \mid \text{Revenue}, \text{Employees}, \text{VATCompliant}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \text{Revenue} + \beta_2 \text{Employees} + \beta_3 \text{VATCompliant})}}$
Outcome: bounded between 0 and 1.
When modeling the binary target directly with linear regression estimated using OLS, two key problems arise:
The errors/target are not normally distributed but follow a Bernoulli distribution with only two values;
There is no guarantee that the target is between 0 and 1, which would be handy since it can then be interpreted as a probability.
Consider now the following bounding function:
$f(z) = \frac{1}{1 + e^{-z}}$
The outcome is always between 0 and 1, so it can be interpreted as a probability.
Combining the linear model with this bounding function yields the logistic regression model shown above.
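A minimal sketch with scikit-learn, assuming a hypothetical company-level dataset with the three predictors above (all names and values are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical company-level data; the fraud labels here are random placeholders.
rng = np.random.default_rng(1)
n = 500
X = np.column_stack([
    rng.lognormal(mean=10, sigma=1, size=n),   # Revenue
    rng.integers(1, 200, size=n),              # Employees
    rng.integers(0, 2, size=n),                # VATCompliant (0/1)
])
y = rng.integers(0, 2, size=n)                 # fraud = yes/no

clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X, y)
# The bounding (logistic) function keeps predicted probabilities between 0 and 1.
print(clf.predict_proba(X[:5])[:, 1])
```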
Variable selection
Aims at reducing the number of variables -> more concise and faster to evaluate.
Both linear and logistic regression have built-in techniques to perform variable selection.
When performing variable selection, statistical significance is not the only metric that matters. Sometimes we also need to account for interpretability, operational efficiency, and regulatory requirements.
Decision trees

At each node we establish a testing condition that splits the observations between branches depending on the value of the input variable.
Top node = root node specifying a testing condition, of which the outcome corresponds to a branch leading up to an internal node.
Terminal nodes = leaf nodes assign fraud labels.
Various algorithms differ in how they implement the key decisions to build a tree:
Splitting decision: Which variable to split on, and at what value (e.g., is the transaction amount > $100,000 or not)?
Stopping decision: When to stop adding nodes to the tree?
Assignment decision: What class (e.g., fraud or no fraud) to assign to a leaf node? -> Look at the majority class within the leaf node to make the decision (winner-take-all learning).
Splitting decision

In order to answer the splitting decision, one must define the concept of impurity or chaos.
Minimal impurity occurs when all customers are either good or bad.
Maximal impurity occurs when one has the same number of good and bad customers.
Decision trees aim at minimizing the impurity in the data.
Impurity metrics
Entropy: $E(S) = -p_G \log_2(p_G) - p_B \log_2(p_B)$ -> Gain: weighted decrease in entropy achieved by a split.
Gini: $\text{Gini}(S) = 2 p_G p_B$
with $p_G$ and $p_B$ the proportions of good and bad (fraudulent) customers in node $S$.
Splitting decision: In order to answer the splitting decision, various candidate splits must be evaluated in terms of their decrease in impurity.
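A small sketch of scoring one candidate split with these impurity metrics (the labels below are made up; 1 = fraud, 0 = no fraud):

```python
import numpy as np

def entropy(y):
    """Entropy of a node, given binary labels (1 = fraud, 0 = no fraud)."""
    p = np.mean(y)
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def gini(y):
    """Gini impurity of a node."""
    p = np.mean(y)
    return 2 * p * (1 - p)

def gain(parent, left, right, impurity=entropy):
    """Weighted decrease in impurity achieved by a candidate split."""
    n = len(parent)
    return (impurity(parent)
            - len(left) / n * impurity(left)
            - len(right) / n * impurity(right))

# Candidate split: transaction amount > $100,000 (left) versus <= $100,000 (right).
parent = np.array([1, 1, 1, 0, 0, 0, 0, 0])
left, right = np.array([1, 1, 1, 0]), np.array([0, 0, 0, 0])
print(gain(parent, left, right))         # entropy-based gain
print(gain(parent, left, right, gini))   # Gini-based gain
```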
Stopping decision: If the tree continues to split, it will end up with one leaf node per observation: overfitting. The tree has become too complex, fits the noise rather than the trend in the data, and will generalize poorly to new unseen data.
How do we avoid overfitting? -> Split the data (see the sketch below) into:
Training sample (70%) -> make splitting decision
Validation sample (30%) -> independent sample to monitor the misclassification error (or any other performance metric)
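A minimal sketch of this training/validation procedure for the stopping decision (scikit-learn, with synthetic class-imbalanced data standing in for a fraud dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a labeled fraud dataset (about 5% positives).
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Grow trees of increasing depth on the training sample and monitor the
# validation misclassification error to decide when to stop splitting.
for depth in range(1, 11):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(depth, 1 - tree.score(X_val, y_val))
```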

Decision Trees Properties
Every tree can also be represented as a rule set: every path from the root node to a leaf node makes up a simple if-then rule (recall: this looks a bit like expert-based systems, which are also based on rules).
Decision trees essentially model decision boundaries orthogonal to the axes -> they split the input space into axis-parallel regions so as to minimize impurity, making it easier to separate fraudulent from non-fraudulent cases.
Example

If Transaction amount > $100,000 And Unemployed = No Then no fraud
If Transaction amount > $100,000 And Unemployed = Yes Then fraud
If Transaction amount $\leq$ $100,000 And Previous fraud = Yes Then fraud
If Transaction amount $\leq$ $100,000 And Previous fraud = No Then no fraud
Regression trees
Same functioning as classification trees; what changes are the impurity metrics:
Mean Squared Error (MSE)
Variance (ANOVA) test and F-statistic
Considerations on decision trees
They can be used for variable selection, since the variables appearing near the top of the tree are the most predictive;
They allow segmenting the dataset.
Advantages:
Decision tree gives a white-box model with a clear explanation: interpretable;
Operationally efficient;
Powerful techniques that allow for more complex decision boundaries than a logistic regression;
Nonparametric: no normality or independence assumptions are needed;
Disadvantages:
They tend to overfit, meaning that resources must be spent to correct this;
Highly dependent on the sample that was used for tree construction: a small variation in the underlying sample might yield a totally different tree.
Neural Networks
Neural networks are mathematical representations inspired by the functioning of the human brain.
In practice they are a generalization of the statistical models seen up to now.
A more advanced way to represent relationships between variables: they can model very complex patterns and decision boundaries in the data.
Processing element or neuron performs two operations:
It takes the inputs and multiplies them with the weights (including the intercept term β0 , called the bias term);
Puts this into a nonlinear transformation function (~logistic regression). This means that logistic (and linear) regression is a neural network with one neuron.
MultiLayer Perceptron (MLP) Neural Network
Input layer
Hidden layer
Works like a feature extractor: combine the inputs into features that are then subsequently offered to the output layer;
Each node has a nonlinear transformation function.
Output layer
Makes the prediction;
Has a linear transformation function.
It is like an ensemble of different regression models. They can be used for both classification and regression.
In the fraud analytics setting, complex patterns rarely occur, so a network with a single hidden layer is usually sufficient.
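A minimal sketch of such an MLP with one hidden layer (scikit-learn; synthetic data standing in for a fraud dataset):

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95], random_state=0)

# One hidden layer with a nonlinear (logistic) activation acting as feature extractor;
# the output layer produces the class prediction.
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(5,), activation="logistic",
                  max_iter=2000, random_state=0),
)
mlp.fit(X, y)
print(mlp.predict_proba(X[:3]))
```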
Neural Network Weight Learning
The optimization (for optimal parameter values) is more complex: Iterative algorithm that optimizes a cost-function:
Continuous target variable -> Mean Squared Error (MSE) cost function
Binary target variable -> Maximum Likelihood cost function
The procedure starts from a set of random weights, which are then iteratively adjusted to the patterns in the data using an optimization algorithm.
Key issue: the objective function is not convex and may be multimodal, which means the error function can have multiple local minima but typically only one global minimum.

If the starting weights are chosen in a suboptimal way, one may get stuck in a local minimum.
Preliminary Training
Try out different starting weights
Start the optimization procedure for a few steps
Continue with the best intermediate solution
Stopping Criterion
The optimization procedure then continues until:
The error function shows no further progress
The weights stop changing substantially
A fixed number of optimization steps (epochs) has been reached.
Hidden Neurons (Weight and) Number
The hidden neurons create a nonlinear relation between input and output, and their number is related to the nonlinearity in the data: the more complex the patterns, the more neurons are needed to model them. Steps to choose the optimal number of neurons (see the sketch after this list):
Split the data into a training, validation, and test set.
Vary the number of hidden neurons
Train a neural network on the training set
Measure the performance on the validation set.
Choose the number of hidden neurons with optimal validation set performance.
Measure the performance on the independent test set
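A minimal sketch of these steps (scikit-learn; the candidate numbers of hidden neurons are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=3000, n_features=10, weights=[0.95], random_state=0)
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

best_size, best_acc = None, -1.0
for size in (2, 5, 10, 20):                      # vary the number of hidden neurons
    net = MLPClassifier(hidden_layer_sizes=(size,), max_iter=2000, random_state=0)
    net.fit(X_train, y_train)                    # train on the training set
    acc = net.score(X_val, y_val)                # measure validation performance
    if acc > best_acc:
        best_size, best_acc = size, acc

final = MLPClassifier(hidden_layer_sizes=(best_size,), max_iter=2000, random_state=0)
final.fit(X_train, y_train)
print(best_size, final.score(X_test, y_test))    # unbiased estimate on the test set
```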
Overfitting Problem
Neural networks can model very complex patterns and decision boundaries in the data -> they can even model the noise in the training data -> too many neurons lead to overfitting. Two strategies to address this (illustrated in the sketch after this list):
Elbow method:
Training set -> estimate the weights.
Validation set -> independent data set used to decide when to stop training and avoid this overfitting.
Weight regularization: keep weights small in absolute sense to avoid fitting the noise in the data -> Add a weight size term to the objective function of the neural network.
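For illustration, both strategies correspond to constructor arguments in scikit-learn's MLPClassifier (a sketch; the parameter values are arbitrary):

```python
from sklearn.neural_network import MLPClassifier

# early_stopping=True holds out part of the training data as a validation set and
# stops training when the validation score stops improving; alpha adds an L2
# penalty that keeps the weights small in absolute value.
net = MLPClassifier(hidden_layer_sizes=(5,),
                    alpha=1e-2,              # weight regularization strength
                    early_stopping=True,     # validation-based stopping
                    validation_fraction=0.3,
                    max_iter=2000,
                    random_state=0)
```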
Opening the Neural Network Black Box
Black-box: relate inputs to outputs in a mathematically complex, nontransparent, and opaque way.
Opening the Neural Network black box:
Variable selection
Rule extraction
Two-stage models
Variable Selection
Select variables that actively contribute to the NN output.
In linear and logistic regression -> inspecting the p-values. (we can simply extract the betas)
In neural networks -> no p-values (there are many weights, and no single weight is interpretable on its own).
Hinton Diagram
Visualizes the weights between the inputs and the hidden neurons as squares such that size of the square is proportional to the size of the weight and color of the square represents the sign of the weight (e.g. black=negative weight and white=positive weight).

For example, in such a diagram the transaction amount may be important for hidden neurons 1 and 2, while income could be removed because its weights are small relative to the others.
This method is useful to check the importance of both hidden neurons (rows) and input variables (columns).
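A rough matplotlib sketch of a Hinton-style diagram for a made-up input-to-hidden weight matrix (all variable names and values are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical input-to-hidden weight matrix: rows = hidden neurons, columns = inputs.
inputs = ["amount", "income", "age", "prev_fraud"]
W = np.random.default_rng(0).normal(size=(3, len(inputs)))

fig, ax = plt.subplots()
ax.set_facecolor("lightgray")
max_w = np.abs(W).max()
for i in range(W.shape[0]):             # hidden neurons (rows)
    for j in range(W.shape[1]):         # input variables (columns)
        w = W[i, j]
        side = np.sqrt(abs(w) / max_w)  # square area proportional to |weight|
        color = "white" if w > 0 else "black"  # color encodes the sign
        ax.add_patch(plt.Rectangle((j - side / 2, i - side / 2), side, side, color=color))
ax.set_xlim(-0.5, W.shape[1] - 0.5); ax.set_ylim(-0.5, W.shape[0] - 0.5)
ax.set_xticks(range(len(inputs))); ax.set_xticklabels(inputs)
ax.set_yticks(range(W.shape[0])); ax.set_yticklabels([f"neuron {i+1}" for i in range(W.shape[0])])
ax.set_aspect("equal")
plt.show()
```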
A practical machine learning problem: data is difficult to obtain and training is costly, so people increasingly provide pre-trained models, whose weights have been tuned on a specific dataset. A possible attack is to embed extra weights or neurons that are never activated during normal training and are triggered only by a specific input, causing the network to misclassify. Inspecting the weights with this type of analysis can help counter such attacks.
Backward Variable Selection
Build a neural network with all N variables.
Remove each variable in turn and reestimate the network. This will give N networks each having N – 1 variables.
Remove the variable whose absence gives the best performing network (e.g., in terms of misclassification error, mean squared error).
Repeat this procedure until the performance decreases significantly.
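A minimal sketch of backward variable selection around a small neural network (scikit-learn, synthetic data; the stopping tolerance is a made-up choice):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1500, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

def score(cols):
    """Validation accuracy of a network re-estimated on a subset of variables."""
    net = MLPClassifier(hidden_layer_sizes=(5,), max_iter=2000, random_state=0)
    net.fit(X_train[:, cols], y_train)
    return net.score(X_val[:, cols], y_val)

active = list(range(X.shape[1]))
best = score(active)
tolerance = 0.01                                  # hypothetical acceptable performance drop
while len(active) > 1:
    # Remove each variable in turn and re-estimate the network (N networks with N-1 variables).
    trials = {v: score([c for c in active if c != v]) for v in active}
    v_drop, acc = max(trials.items(), key=lambda kv: kv[1])
    if acc < best - tolerance:                    # performance decreases significantly: stop
        break
    active.remove(v_drop)                         # drop the variable whose absence hurts least
    best = acc
print("selected variables:", active)
```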

Rule Extraction
Variable selection allows users to see which variables are important and which ones are not, but it does not offer a clear insight into its internal workings. The relationship between the inputs and the output remains nonlinear and complex.
The rule sets must be evaluated in terms of:
Accuracy
Conciseness
Fidelity: measures to what extent the extracted rule set succeeds in mimicking the neural network.
Rule Extraction Procedure: Extract if-then classification rules, mimicking the behavior of the neural network.
Decompositional technique: decompose the network’s internal workings by inspecting weights and/or activation values.
Pedagogical technique: consider the neural network as a black box and use the neural network predictions as input to a white-box analytical technique such as decision trees. It can be used with any underlying algorithm.
Decompositional Rule Extraction

Pedagogical Rule Extraction Techniques

Two-stage Model Setup
Idea: use a NN to predict the errors made by a simple, interpretable model; the final prediction then sums the simple model's output and the NN's correction. Steps:
Estimate an easy-to-understand model first (e.g., linear regression, logistic regression). a. This will give us the interpretability part.
Use a neural network to predict the errors made by the simple model using the same set of predictors. a. performance benefit of using a nonlinear model.
It provides an ideal balance between model interpretability (which comes from the first part) and model performance (which comes from the second part).
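A minimal sketch of such a two-stage setup (logistic regression plus a small neural network trained on its residuals; synthetic data, and working on the probability scale is just one possible choice):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPRegressor

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Stage 1: easy-to-understand model -> interpretability.
lr = LogisticRegression(max_iter=1000).fit(X, y)
p_simple = lr.predict_proba(X)[:, 1]

# Stage 2: a neural network predicts the errors of the simple model
# (here the residual y - p on the probability scale) using the same predictors.
residual = y - p_simple
nn = MLPRegressor(hidden_layer_sizes=(10,), max_iter=3000, random_state=0).fit(X, residual)

# Final score = interpretable prediction + nonlinear correction.
final_score = np.clip(p_simple + nn.predict(X), 0, 1)
print(final_score[:5])
```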
Support Vector Machine
Support Vector Machines (SVMs) deal with the shortcomings of neural networks:
Objective function is non convex (multiple local minima).
Effort is needed to tune the number of hidden neurons.
Idea: from a high-level point of view, an SVM is related to linear programming: the goal is to find a boundary that separates the classes. The problem is that many separating boundaries may exist, while the SVM wants only one -> solution: choose the boundary that separates the classes as much as possible, i.e., with the maximal margin.
Note that it is very similar to logistic regression. SVMs can also be used for regression applications with a continuous target: find a function f(x), as flat as possible, that has at most ϵ deviation from the actual targets.
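A minimal sketch of both uses with scikit-learn (synthetic data; kernel, C, and ϵ values are arbitrary):

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, SVR

# Classification: find the maximum-margin boundary between fraud and no-fraud.
Xc, yc = make_classification(n_samples=1000, n_features=10, random_state=0)
svc = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)).fit(Xc, yc)

# Regression: find a function as flat as possible within an epsilon-tube of the targets.
Xr, yr = make_regression(n_samples=1000, n_features=10, noise=5.0, random_state=0)
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.1)).fit(Xr, yr)

print(svc.score(Xc, yc), svr.score(Xr, yr))
```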
Opening the SVM Black Box: Transparency
SVMs are complex, black-box models, which is a problem in settings where interpretability is important. Like neural networks, SVMs have a universal approximation property; in addition:
They do not require tuning of the number of hidden neurons;
They are characterized by convex optimization.
Variable selection can be performed using the backward variable selection procedure: This will essentially reduce the variables but not provide any additional insight into the workings of the SVM.
Rule extraction approaches: the SVM can be represented as a neural network.
Pedagogical approaches: can easily be combined with SVMs since they consider the underlying model as a black box.
SVM is first used to construct a data set with SVM predictions for each of the observations.
This data set is then given to a decision tree algorithm to build a decision tree.
Additional training set observations can be generated to facilitate the tree construction process.
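A minimal sketch of this pedagogical setup, using the SVM's predictions as the target for a decision tree (scikit-learn, synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=2000, n_features=5, random_state=0)

# 1. The SVM produces a prediction for each observation.
svm = SVC(kernel="rbf").fit(X, y)
y_svm = svm.predict(X)

# 2. A decision tree is trained to mimic the SVM, using its predictions as the target.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y_svm)
print(export_text(tree))                 # the extracted if-then rules
print("fidelity:", tree.score(X, y_svm)) # how well the rules mimic the SVM
```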
Ensemble Methods
Idea: Aims at estimating multiple analytical models instead of using only one.
Decision trees are usually used in this kind of approach. There are three ways we can implement them:
Bagging
Boosting
Random forests
Bagging (Bootstrap aggregating)
Starts by taking B bootstraps from the underlying sample. A bootstrap is a sample with replacement.
Build a classifier for every bootstrap.
Note that for classification, a new observation will be classified by letting all B classifiers vote, while for regression, the prediction is the average of the outcome of the B models.
Key element for bagging: the instability of the analytical technique. If perturbing the data set by means of the bootstrapping procedure can alter the model constructed, then bagging will improve the accuracy. For models that are robust with respect to the underlying data set, bagging will not give much added value.
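A minimal sketch of bagging with scikit-learn (its BaggingClassifier uses decision trees as the default, unstable base learner; data are synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95], random_state=0)

# B bootstrap samples, one decision tree per bootstrap;
# new observations are classified by letting the B trees vote.
bag = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=0)
bag.fit(X, y)
print(bag.predict(X[:5]))
```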
Boosting
Estimate multiple models using a weighted data sample.
Starting from uniform weights
Iteratively re-weight the data according to the classification error: a. misclassified cases get higher weights.
Idea: difficult observations should get more attention.
A popular implementation of this is the Adaptive Boosting (AdaBoost) procedure, which iteratively re-weights the misclassified observations and combines the resulting weak classifiers into a weighted vote.
Problem: risk of fitting the noise. Risk of overfitting to the hard (potentially noisy) examples in the data, which will get higher weights as the algorithm proceeds. This is especially relevant in a fraud detection setting because the target labels are typically quite noisy.
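A minimal sketch with scikit-learn's AdaBoost implementation (synthetic data; keeping the number of estimators modest is one simple way to limit the impact of noisy labels):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95], random_state=0)

# At each iteration, misclassified observations receive a higher weight.
boost = AdaBoostClassifier(n_estimators=50, random_state=0)
boost.fit(X, y)
print(boost.score(X, y))
```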
Random Forests
One of the best approaches in the Fraud Detection Domain.
Steps:
Given a data set with n observations and N inputs.
m = constant chosen beforehand: the number of variables considered at each split (typically much smaller than N).
For t=1,…,T a. Take a bootstrap sample with n observations. b. Build a decision tree whereby for each node of the tree, randomly choose m variables on which to base the splitting decision. c. Split on the best of this subset. d. Fully grow each tree without pruning.
TL;DR: take the dataset, bootstrap it, build each tree on a random subset of candidate split variables, and ensemble the results.
Random forests can be used with both classification trees and regression trees.
Key concepts:
The dissimilarity amongst the base classifiers, which is obtained by adopting a bootstrapping procedure to select the training samples of the individual base classifiers;
The selection of a random subset of attributes at each node;
The strength of the individual base models.
The diversity of the base classifiers creates an ensemble that is superior in performance compared to the single models.
Problem: we now have T black-box models instead of one, so interpretability is lost. How do we assess which variables matter?
A popular procedure to do so is as follows:
Permute the values of the variable under consideration on the validation or test set.
For each tree, calculate the VI value, the difference between the error on the original, unpermuted data, and the error on the permuted data. a. In a regression setting, the error can be the MSE, whereas in a classification setting, the error can be the misclassification rate.
Order all variables according to their VI value. The variable with the highest VI value is the most important.
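A minimal sketch of a random forest plus this permutation-based variable importance procedure (scikit-learn, synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=10, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# T trees, each grown on a bootstrap sample and splitting on a random subset of
# variables per node (max_features).
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)

# Permutation-based VI on held-out data: permute one variable at a time and
# measure the drop in performance relative to the unpermuted data.
vi = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
ranking = vi.importances_mean.argsort()[::-1]
print("variables ordered by importance:", ranking)
```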