Evaluating a Fraud Detection Model
When evaluating predictive models, two key decisions need to be made:
The first decision concerns how to split up the data set.
The second decision concerns which performance metrics to use.
Note: Do not mix the training, validation, and test data. Otherwise we overfit on the data set we used -> performance looks optimal on it but is poor on other data -> the model will not work in a real-world scenario.
Large datasets
How to split up the data set for performance measurement depends on its size. Large data sets -> the data can be split up into:
(70%) Training dataset to build the model
(30%) Test dataset to calculate its performance
Strict separation between training and test sample: no observation that was used for training can be used for testing.
In the case of decision trees or neural networks, the validation sample is a separate sample used during model development (e.g., to make the stopping decision): 40% training, 30% validation, and 30% test sample.
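A minimal sketch of such a split with scikit-learn, assuming the data sits in a hypothetical pandas DataFrame df with a binary fraud column:

```python
from sklearn.model_selection import train_test_split

X, y = df.drop(columns=["fraud"]), df["fraud"]

# Carve off the 30% test sample first; stratify to keep the fraud rate comparable.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Split the remaining 70% into 40% training and 30% validation (0.30 / 0.70 of the rest).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.30 / 0.70, stratify=y_rest, random_state=42
)
```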
Small datasets
Small data sets -> special schemes need to be adopted:
Cross-validation
The data is split into K folds (e.g., 5 or 10). An analytical model is then trained on K-1 training folds and tested on the remaining validation fold.
This is repeated for all possible validation folds, resulting in K performance estimates, which are averaged.
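A sketch of 5-fold cross-validation with scikit-learn; X, y and the logistic regression model are illustrative assumptions:

```python
# 5-fold cross-validation: each fold serves once as the validation fold while
# the model is trained on the remaining K-1 folds; the K AUC estimates are averaged.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

model = LogisticRegression(max_iter=1000)        # illustrative model choice
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(scores.mean(), scores.std())
```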
Leave-one-out cross-validation
Every observation is left out in turn and a model is estimated on the remaining K-1 observations, where K is now the number of observations.
This gives K analytical models in total.
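A sketch under the same assumptions (X, y as above); note that AUC cannot be computed per fold here, since each validation fold contains a single observation:

```python
# Leave-one-out cross-validation: with N observations this trains N models,
# so it is only practical for small data sets.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=LeaveOneOut(), scoring="accuracy")
print(scores.mean())
```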
Model Selection
Cross-validation gives multiple models:
Key question: "What should be the final model that is output from the procedure?"
Let all models collaborate in an ensemble setup by using a (weighted) voting procedure.
Do leave-one-out cross-validation and pick one of the models at random. Since the models differ by at most one observation, they will be quite similar anyway.
Build one final model on all observations but report the performance coming out of the cross-validation procedure as the best independent estimate.
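A sketch of the last option, under the same assumptions as before: report the cross-validated AUC as the independent performance estimate and refit one final model on all observations.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# The performance estimate comes from cross-validation...
cv_auc = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring="roc_auc").mean()
# ...while the model that is kept is trained on all observations.
final_model = LogisticRegression(max_iter=1000).fit(X, y)
print(f"Reported (cross-validated) AUC: {cv_auc:.3f}")
```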
Performance Metrics

Receiver Operating Characteristic (ROC) Curve
The ROC curve is built by changing the cutoff and plotting, for each value, the true positive rate against the false positive rate -> a performance measure that is independent of the cutoff.
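A sketch of how the curve is computed in practice, assuming X_train, y_train, X_test, y_test from the split above and an illustrative classifier:

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]          # predicted fraud probability

# roc_curve sweeps all possible cutoffs and returns one (FPR, TPR) point per cutoff.
fpr, tpr, thresholds = roc_curve(y_test, scores)
plt.plot(fpr, tpr)
plt.xlabel("False positive rate (1 - specificity)")
plt.ylabel("True positive rate (sensitivity)")
plt.title("ROC curve")
plt.show()
```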

Area under the ROC curve (AUC)
AUC provides a simple figure-of-merit for the performance:
The higher the AUC, the better the performance
Bounded between 0 and 1
Can be interpreted as a probability.
It represents the probability that a randomly chosen fraudster gets a higher score than a randomly chosen nonfraudster.
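This interpretation can be checked directly, reusing the hypothetical y_test and scores from the previous sketch: a brute-force comparison of all fraud/nonfraud score pairs gives the same number as roc_auc_score.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

auc = roc_auc_score(y_test, scores)

# Probability that a random fraudster outscores a random nonfraudster
# (ties count for one half), computed over all pairs.
y_arr = np.asarray(y_test)
fraud_scores = scores[y_arr == 1]
nonfraud_scores = scores[y_arr == 0]
wins = fraud_scores[:, None] > nonfraud_scores[None, :]
ties = fraud_scores[:, None] == nonfraud_scores[None, :]
prob = (wins + 0.5 * ties).mean()

print(auc, prob)                                  # the two values coincide
```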
Other Performance Metrics
Interpretability
White-box: Linear and logistic regression, and decision trees.
Black-box: Neural networks, SVMs, and ensemble methods.
Justifiability
Verifies to what extent the modeled relationships are in line with expectations -> e.g., by checking the univariate impact of each variable on the model's output.
Operational efficiency: the ease with which one can implement, use, and monitor the final model, so as to be able to quickly evaluate the fraud model.
Linear and rule-based models are easy to implement.
Notes
Problem: Fraud-detection data sets often have a very skewed target class distribution, with frauds typically making up 1% or less of the observations. This creates problems for the analytical techniques -> they get flooded by the nonfraudulent observations and thus tend toward classifying every observation as nonfraudulent.
Never balance the test set; any balancing of the class distribution is applied to the training data only.
It is recommended to increase the number of fraudulent observations or their weight, such that the analytical techniques can pay better attention to them.
Increase the number (and variability) of frauds by:
Increasing the time horizon for prediction: instead of predicting fraud with a six-month forward-looking time horizon, use a 12-month time horizon.
Sampling every fraud twice (or more): suppose we predict fraud with a one-year forward-looking time horizon using information from a one-year backward-looking time horizon. By shifting the observation point earlier or later, the same fraudulent observation can be sampled twice. The variables collected will be similar but not exactly the same, since they are measured over a different time frame.
Finding the optimal number is subject to a trial-and-error exercise depending on the skewness of the target.
Oversampling vs Undersampling:
OS: Replicate frauds two or more times so as to make the distribution less skewed.
US: Remove a portion of the nonfrauds so as to make the distribution less skewed.
US and OS can also be combined. Undersampling usually results in better classifiers than oversampling. Both should be conducted on the training data only, not on the test data, so that the test set gives an unbiased view of model performance.
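A sketch of both schemes with plain NumPy, assuming X_train, y_train are pandas objects from the split above; the 1:5 fraud/nonfraud target ratio is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(42)
fraud_idx = np.flatnonzero(y_train == 1)
nonfraud_idx = np.flatnonzero(y_train == 0)

# Oversampling: replicate frauds (sampling with replacement) up to a 1:5 ratio.
os_fraud = rng.choice(fraud_idx, size=len(nonfraud_idx) // 5, replace=True)
os_idx = np.concatenate([nonfraud_idx, os_fraud])
X_train_os, y_train_os = X_train.iloc[os_idx], y_train.iloc[os_idx]

# Undersampling: keep all frauds, drop nonfrauds down to the same 1:5 ratio.
us_nonfraud = rng.choice(nonfraud_idx, size=5 * len(fraud_idx), replace=False)
us_idx = np.concatenate([fraud_idx, us_nonfraud])
X_train_us, y_train_us = X_train.iloc[us_idx], y_train.iloc[us_idx]

# The test set (X_test, y_test) keeps its original, skewed distribution.
```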
Cost-sensitive Learning: Assigns higher misclassification costs to the minority class.
Idea: when classifying events, the analyst wastes time every time a false positive is flagged (cost increases), while a false negative is a direct loss.
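A minimal sketch of cost-sensitive learning via class weights in scikit-learn; the 1:50 cost ratio is an illustrative assumption, not a recommendation:

```python
from sklearn.linear_model import LogisticRegression

# Misclassifying a fraud (false negative) is penalised 50x more than a false positive.
cost_sensitive = LogisticRegression(class_weight={0: 1, 1: 50}, max_iter=1000)
cost_sensitive.fit(X_train, y_train)

# class_weight="balanced" instead weights classes inversely to their frequency.
```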