Evaluating a Fraud Detection Model

When evaluating predictive models, two key decisions need to be made:

  1. How to split up the data set.

  2. Which performance metrics to use.

Note: Do not mix training, validation, and testing data. Mixing them -> overfitting on the data set used -> optimal performance on it, poor performance on anything else -> the model will not work in a real-world scenario.

Large datasets

The decision on how to split up the data set for performance measurement depends on its size. Large data sets -> the data can be split up into:

  • (70%) Training dataset to build the model

  • (30%) Test dataset to calculate its performance

Strict separation between training and test sample: no observation that was used for training can be used for testing.

In the case of decision trees or neural networks, the validation sample is a separate sample used during model development (i.e., to make the stopping decision): 40% training, 30% validation, and 30% test sample.
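A minimal sketch of such a 40/30/30 split with scikit-learn; the synthetic data set and its ~1% fraud rate are assumptions for illustration only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a fraud data set (~1% frauds); replace with real data.
X, y = make_classification(n_samples=10_000, n_features=20,
                           weights=[0.99], random_state=42)

# First carve off the 40% training sample, stratified so the fraud rate is preserved.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.4, stratify=y, random_state=42)

# Split the remaining 60% in half -> 30% validation and 30% test.
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42)
```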

Small datasets

Small data sets -> special schemes need to be adopted:

  • Cross-validation

the data is split into K folds (e.g., 5 or 10). An analytical model is then trained on K-1 training folds and tested on the remaining validation fold.

Repeated for all possible validation folds, resulting in K performance estimates, which are averaged (see the sketch after this list).

  • Leave-one-out cross-validation

every observation is left out in turn and a model is estimated on the remaining K-1 observations (here K is the total number of observations).

This gives K analytical models in total.
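A minimal sketch of both schemes with scikit-learn on synthetic data; the logistic regression model, fold count, and class ratio are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneOut, StratifiedKFold

X, y = make_classification(n_samples=500, weights=[0.9], random_state=42)

# K-fold cross-validation: train on K-1 folds, test on the held-out fold, average the K AUCs.
aucs = []
for train_idx, val_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=42).split(X, y):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    aucs.append(roc_auc_score(y[val_idx], model.predict_proba(X[val_idx])[:, 1]))
print("Mean AUC over 5 folds: %.3f" % np.mean(aucs))

# Leave-one-out: one model per observation, each trained on all but one observation.
# The single held-out predictions are pooled and scored once at the end.
loo_scores = []
for train_idx, test_idx in LeaveOneOut().split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    loo_scores.append(model.predict_proba(X[test_idx])[0, 1])
print("Pooled LOOCV AUC: %.3f" % roc_auc_score(y, loo_scores))
```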

Model Selection

Cross-validation gives multiple models:

Key question: what should be the final model that comes out of the procedure?

  • All models collaborate in an ensemble setup by using a (weighted) voting procedure.

  • Do leave-one-out cross-validation and pick one of the models at random. Since the models differ by only one observation, they will be quite similar anyway.

  • Build one final model on all observations but report the performance coming out of the cross-validation procedure as the best independent estimate.
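A short sketch of the third option, again on synthetic data with an arbitrary model: the cross-validated AUC is reported as the performance estimate, while the deployed model is refit on all observations:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5_000, weights=[0.98], random_state=0)
model = LogisticRegression(max_iter=1000)

# Report the cross-validated AUC as the (close to) independent performance estimate ...
cv_auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
print("Reported AUC estimate: %.3f" % cv_auc)

# ... but deploy a single final model trained on all available observations.
final_model = model.fit(X, y)
```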

Performance Metrics

Receiver Operating Characteristic (ROC) Curve

The predicted class depends on the cutoff applied to the model's score. The ROC curve plots the sensitivity (true positive rate) against 1 - specificity (false positive rate) for every possible cutoff -> a performance measure that is independent of the cutoff.

Area under the ROC curve (AUC)

AUC provides a simple figure-of-merit for the performance:

  • The higher the AUC, the better the performance

  • Bounded between 0 and 1

  • Can be interpreted as a probability.

  • It represents the probability that a randomly chosen fraudster gets a higher score than a randomly chosen nonfraudster.
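A short sketch of computing the ROC curve and AUC with scikit-learn, plus a check of the probabilistic interpretation; the scores below are synthetic placeholders:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(42)
# Synthetic scores: fraudsters (y=1) tend to receive higher scores than nonfraudsters (y=0).
y_true = np.r_[np.ones(100), np.zeros(900)].astype(int)
scores = np.r_[rng.normal(0.7, 0.2, 100), rng.normal(0.3, 0.2, 900)]

fpr, tpr, cutoffs = roc_curve(y_true, scores)   # one (FPR, TPR) point per candidate cutoff
auc = roc_auc_score(y_true, scores)

# AUC as a probability: P(score of a random fraudster > score of a random nonfraudster).
fraud_pick = rng.choice(100, 10_000)          # random fraudster indices (0..99)
legit_pick = rng.choice(900, 10_000) + 100    # random nonfraudster indices (100..999)
empirical = np.mean(scores[fraud_pick] > scores[legit_pick])

print("AUC: %.3f, empirical pairwise probability: %.3f" % (auc, empirical))
```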

Other Performance Metrics

  • Interpretability

    • White-box: Linear and logistic regression, and decision trees.

    • Black-box: Neural networks, SVMs, and ensemble methods.

  • Justifiability

    • Verifies to what extent the modeled relationships are in line with expectations -> e.g., by checking the univariate impact of a variable on the model's output.

  • Operational efficiency: ease with which one can implement, use, and monitor the final model

    • The fraud model must be quick to evaluate on new observations.

    • Linear and rule-based models are easy to implement.

Notes

Problem: Fraud-detection data sets often have a very skewed target class distribution; frauds typically make up <= 1% of the observations. This creates problems for the analytical techniques -> they are flooded by nonfraudulent observations and thus tend toward classifying every observation as nonfraudulent.

  • Any rebalancing is applied during training only; never balance the test set.

  • Recommended to increase the number of fraudulent observations or their weight, such that the analytical techniques can pay better attention to them.

  • Increase the number (and variability) of frauds by:

    1. Increasing the time horizon for prediction.

       a. Instead of predicting fraud with a six-month forward-looking time horizon, use a 12-month time horizon.

    2. Sampling every fraud twice (or more).

       a. We predict fraud with a one-year forward-looking time horizon using information from a one-year backward-looking time horizon.

          i. By shifting the observation point earlier or later, the same fraudulent observation can be sampled twice.

          ii. The variables collected will be similar but not exactly the same, since they are measured over a different time frame.

    Finding the optimal number is subject to a trial-and-error exercise depending on the skewness of the target.

  • Oversampling vs Undersampling:

    • OS: Replicate frauds two or more times so as to make the distribution less skewed.

    • US: Remove nonfrauds so as to make the distribution less skewed.

    US and OS can also be combined. Undersampling usually results in better classifiers than oversampling. Both should be conducted on the training data and not on the test data, to give an unbiased view of model performance (a sampling sketch follows at the end of these notes).

  • Cost-sensitive Learning: Assigns higher misclassification costs to the minority class.

    Idea: when classifying events, the analyst wastes time every time a false positive is flagged (the inspection cost increases), while a false negative is a direct fraud loss.
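A minimal sketch of both remedies on synthetic data: random undersampling applied to the training set only, and cost-sensitive learning via scikit-learn's class_weight. The 1:4 sampling ratio and 1:10 cost ratio are assumed values for illustration, not recommendations; libraries such as imbalanced-learn offer ready-made RandomOverSampler/RandomUnderSampler utilities for the same purpose.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic skewed data set (~1% frauds); replace with real data.
X, y = make_classification(n_samples=20_000, weights=[0.99], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

rng = np.random.default_rng(0)
fraud_idx = np.where(y_train == 1)[0]
legit_idx = np.where(y_train == 0)[0]

# Undersampling: keep all frauds, drop nonfrauds at random until the ratio is ~1:4.
# Applied to the training data only; the test set keeps its original skew.
us_idx = np.r_[fraud_idx, rng.choice(legit_idx, size=4 * len(fraud_idx), replace=False)]
model_us = LogisticRegression(max_iter=1000).fit(X_train[us_idx], y_train[us_idx])

# Oversampling would instead replicate frauds with replacement, e.g.:
# os_idx = np.r_[legit_idx, rng.choice(fraud_idx, size=len(legit_idx) // 4, replace=True)]

# Cost-sensitive learning: misclassifying a fraud is assumed to cost 10x a false alarm.
model_cs = LogisticRegression(max_iter=1000, class_weight={0: 1, 1: 10}).fit(X_train, y_train)

for name, m in [("undersampled", model_us), ("cost-sensitive", model_cs)]:
    auc = roc_auc_score(y_test, m.predict_proba(X_test)[:, 1])
    print("%s test AUC: %.3f" % (name, auc))
```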
