Miscellanea, Clarification of unclear concepts
Supervised Learning
Self-organizing maps
Idea: neural network with only two layers (input, output):
Unsupervised learning algorithm that allows users to visualize and cluster high-dimensional data on a low-dimensional grid of neurons
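A minimal NumPy sketch of the idea, under illustrative assumptions (the grid size, learning-rate and neighborhood decay schedules are made up): each output neuron on the low-dimensional grid holds a weight vector in the input space, and training pulls the best-matching unit and its grid neighbors toward each input sample.

```python
# Minimal self-organizing map sketch (illustrative, not reference code from the notes).
import numpy as np

def train_som(data, grid=(10, 10), n_iter=2000, lr0=0.5, sigma0=3.0, seed=0):
    rng = np.random.default_rng(seed)
    rows, cols = grid
    n_features = data.shape[1]
    # Each output neuron holds a weight vector living in the input space.
    weights = rng.random((rows, cols, n_features))
    # Grid coordinates of every neuron, used by the neighborhood function.
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    for t in range(n_iter):
        x = data[rng.integers(len(data))]
        # Best-matching unit: the neuron whose weights are closest to the sample.
        dists = np.linalg.norm(weights - x, axis=-1)
        bmu = np.unravel_index(np.argmin(dists), dists.shape)
        # Learning rate and neighborhood radius decay over time.
        lr = lr0 * np.exp(-t / n_iter)
        sigma = sigma0 * np.exp(-t / n_iter)
        grid_dist = np.linalg.norm(coords - np.array(bmu), axis=-1)
        h = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))[..., None]
        # Pull the BMU and its grid neighbors toward the sample.
        weights += lr * h * (x - weights)
    return weights

# e.g. map 4-dimensional data onto a 10x10 grid of neurons
som_weights = train_som(np.random.rand(500, 4))
```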
Unsupervised Learning
Dealing with N.N. Black-Box Approach
Variable selection: the Hinton diagram visualizes the weight connecting each input variable to each hidden neuron, to determine the relevance of each variable in the model
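A hedged matplotlib sketch of such a Hinton diagram (the weight matrix W below is hypothetical): each square's size encodes the magnitude of a weight between an input variable and a hidden neuron, and its color encodes the sign.

```python
# Hinton-style plot: square area ~ |weight|, white = positive, black = negative.
import numpy as np
import matplotlib.pyplot as plt

def hinton(W, xlabels=None, ylabels=None):
    fig, ax = plt.subplots()
    ax.set_facecolor("gray")
    max_w = np.abs(W).max()
    for (i, j), w in np.ndenumerate(W):
        color = "white" if w > 0 else "black"
        size = np.sqrt(abs(w) / max_w)  # side proportional to sqrt(|w|)
        ax.add_patch(plt.Rectangle((j - size / 2, i - size / 2), size, size,
                                   facecolor=color, edgecolor=color))
    ax.set_xlim(-1, W.shape[1])
    ax.set_ylim(-1, W.shape[0])
    ax.invert_yaxis()
    if xlabels is not None:
        ax.set_xticks(range(W.shape[1]))
        ax.set_xticklabels(xlabels)
    if ylabels is not None:
        ax.set_yticks(range(W.shape[0]))
        ax.set_yticklabels(ylabels)
    return ax

# Hypothetical input-to-hidden weights: 4 input variables x 3 hidden neurons.
W = np.array([[ 0.9, -0.1,  0.4],
              [-0.8,  0.7, -0.2],
              [ 0.05, 0.1,  0.0],
              [ 0.6, -0.5,  0.3]])
hinton(W, xlabels=["h1", "h2", "h3"], ylabels=["x1", "x2", "x3", "x4"])
plt.show()
```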
The relationship between the inputs and the output is nonlinear and complex -> Solution: extract if-then classification rules that mimic the behavior of the neural network. Two ways:
Decompositional technique: decompose the network’s internal workings by inspecting weights and/or activation values.
Pedagogical technique: consider the N.N. as a Black-Box and its predictions as input to a white-box analytical technique such as decision trees. Pro: it can be used with any underlying algorithm.
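A minimal scikit-learn sketch of the pedagogical technique, assuming a synthetic dataset and an MLP as the black box: the decision tree is fit on the network's predictions rather than the true labels, and its branches read as if-then classification rules.

```python
# Pedagogical rule extraction: fit a white-box tree to the black-box model's predictions.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)

# 1. Train the black-box model on the true labels.
nn = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0).fit(X, y)

# 2. Train a decision tree to mimic the network: use the network's predictions as targets.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, nn.predict(X))

# 3. The tree's branches are readable if-then rules approximating the N.N.
print(export_text(tree, feature_names=[f"x{i}" for i in range(X.shape[1])]))
```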
Support Vector Machines
SVMs deal with the shortcomings of neural networks. From Support Vector Machine Algorithm - GeeksforGeeks:
Support Vector Machine (SVM) is a supervised machine learning algorithm used for both classification and regression. Although it can handle regression problems as well, it is best suited for classification. The objective of the SVM algorithm is to find a hyperplane in an N-dimensional space that distinctly classifies the data points. The dimension of the hyperplane depends upon the number of features.
SVMs have a universal approximation property:
They do not require tuning of the number of hidden neurons;
They are characterized by convex optimization.
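To make this concrete, a short, hedged scikit-learn example (synthetic data and default-ish hyperparameters; in practice the kernel and C would be tuned):

```python
# Fit an SVM classifier and evaluate it on a held-out test set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# RBF-kernel SVM: finds the maximum-margin separating hyperplane in the feature space.
svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)
print("test accuracy:", svm.score(X_test, y_test))
```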
Variable selection can be performed using the backward variable selection procedure: This will essentially reduce the variables but not provide any additional insight into the workings of the SVM.
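A hedged sketch of such a backward variable selection wrapper around an SVM; the stopping rule (tolerated drop in cross-validated accuracy) and the data are illustrative assumptions.

```python
# Backward variable selection: repeatedly drop the variable whose removal hurts least.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)
remaining = list(range(X.shape[1]))

def cv_acc(cols):
    return cross_val_score(SVC(kernel="rbf"), X[:, cols], y, cv=5).mean()

score = cv_acc(remaining)
while len(remaining) > 1:
    # Evaluate the model with each remaining variable removed in turn.
    candidates = [(cv_acc([c for c in remaining if c != v]), v) for v in remaining]
    best_score, least_useful = max(candidates)
    if best_score < score - 0.01:  # stop once removing anything costs too much accuracy
        break
    remaining.remove(least_useful)
    score = best_score

print("selected variables:", remaining, "cv accuracy:", round(score, 3))
```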
Ensemble Methods
Bagging
In bagging, we take multiple bootstrap subsets of the training dataset. For each subset, we train a model with the same learning algorithm (e.g., decision tree, logistic regression, etc.) and use it to predict the output for the same set of test data. Once every model has made its prediction, a model-averaging technique is used to obtain the final prediction.
Note that for classification, a new observation is classified by letting all the classifiers (one per subset) vote, while for regression, the prediction is the average of the outcomes of the individual models.
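A small, hedged scikit-learn illustration of bagging (synthetic data, illustrative parameters): BaggingClassifier draws the bootstrap subsets and combines the votes, with a decision tree as its default base learner.

```python
# Bagging: many base learners, each trained on a bootstrap sample, combined by voting.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Default base learner is a decision tree; n_estimators = number of bootstrap subsets.
bag = BaggingClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
print("bagging test accuracy:", bag.score(X_test, y_test))
```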
Boosting
Estimate multiple models using a weighted data sample.
Starting from uniform weights
Iteratively re-weight the data according to the classification error: misclassified cases get higher weights.
Idea: difficult observations should get more attention.
A popular implementation of this is the Adaptive Boosting (AdaBoost) procedure: at each iteration it re-weights the observations so that misclassified cases receive more weight, and the resulting weak learners are combined through a weighted vote.
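A hedged AdaBoost example with scikit-learn (synthetic data, illustrative parameters); the re-weighting of misclassified cases happens inside AdaBoostClassifier, which by default boosts shallow decision trees.

```python
# AdaBoost: weak learners trained sequentially on re-weighted data.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
ada.fit(X_train, y_train)
print("AdaBoost test accuracy:", ada.score(X_test, y_test))
```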
Random forest is a machine learning algorithm which combines the output of multiple decision trees to reach a single result. It handles both classification and regression problems. The random forest algorithm is an extension of the bagging method as it utilizes both bagging and feature randomness to create an uncorrelated forest of decision trees.
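A hedged random forest example (synthetic data, illustrative parameters); max_features is what injects the feature randomness at each split, on top of the bagging of observations.

```python
# Random forest: bagging of trees plus random feature subsets at every split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# max_features controls how many randomly chosen features each split may consider.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)
print("random forest test accuracy:", rf.score(X_test, y_test))
```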
Decision Trees
Decision trees start with a basic question, such as, “Should I surf?” From there, you can ask a series of questions to determine an answer, such as, “Is it a long period swell?” or “Is the wind blowing offshore?”. These questions make up the decision nodes in the tree, acting as a means to split the data. Each question helps an individual to arrive at a final decision, which would be denoted by the leaf node. Observations that fit the criteria will follow the “Yes” branch and those that don’t will follow the alternate path. Decision trees seek to find the best split to subset the data.
While decision trees are common supervised learning algorithms, they can be prone to problems, such as bias and overfitting. However, when multiple decision trees form an ensemble in the random forest algorithm, they predict more accurate results, particularly when the individual trees are uncorrelated with each other.
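A tiny, hedged sketch of the surfing example above; the feature values and labels are made up purely to show how splits become decision nodes and the leaves carry the final answer.

```python
# A toy decision tree for "Should I surf?" with invented data.
from sklearn.tree import DecisionTreeClassifier, export_text

# Columns: swell_period_seconds, wind_is_offshore (1 = yes, 0 = no)
X = [[14, 1], [13, 1], [12, 0], [6, 1], [5, 0], [7, 0], [15, 0], [4, 1]]
y = [1, 1, 1, 0, 0, 0, 1, 0]  # 1 = go surfing, 0 = stay home

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
# Each split below is a decision node; the leaves give the "surf / don't surf" answer.
print(export_text(tree, feature_names=["swell_period_s", "wind_offshore"]))
```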
F.D.M. Evaluation
Large Datasets
The decision of how to split up the data set for performance measurement depends on its size. For large data sets, the data can be split up into:
(70%) Training dataset to build the model
(30%) Test dataset to calculate its performance
Strict separation between training and test sample: no observation that was used for training can be used for testing.
In the case of decision trees or neural networks, the validation sample is a separate sample used during model development (i.e., to make the stopping decision): 40% training, 30% validation, and 30% test sample.
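A hedged sketch of the 40/30/30 split using two successive train_test_split calls (synthetic data; the proportions follow the text):

```python
# Training / validation / test split of roughly 40% / 30% / 30%.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, n_features=10, random_state=0)

# First carve off the 30% test sample, then split the rest into 40%/30% of the original.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=3 / 7, random_state=0)  # 30% of total = 3/7 of the remaining 70%

print(len(X_train), len(X_val), len(X_test))  # roughly 4000 / 3000 / 3000
```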
Small Datasets
Small data sets -> special schemes need to be adopted:
Cross-validation
The data is split into K folds (e.g., 5, 10). An analytical model is then trained on K-1 training folds and tested on the remaining validation fold.
This is repeated for every possible validation fold, resulting in K performance estimates, which are averaged.
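A hedged 5-fold cross-validation example with scikit-learn (synthetic data; the classifier is an arbitrary choice):

```python
# K-fold cross-validation: each fold serves once as the validation fold.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# The model is refit K = 5 times, each time on the other 4 folds.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("fold accuracies:", scores.round(3), "mean:", scores.mean().round(3))
```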
Leave-one-out cross-validation
Every observation is left out in turn and a model is estimated on the remaining N-1 observations.
This gives N analytical models in total, i.e., cross-validation with as many folds as there are observations.
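A hedged leave-one-out example (small synthetic data set; the classifier is an arbitrary choice), producing one model per observation:

```python
# Leave-one-out cross-validation: one model per observation.
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=60, n_features=5, random_state=0)

scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=LeaveOneOut())
print("number of models:", len(scores), "LOOCV accuracy:", scores.mean().round(3))
```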
Investigation Process
Acquisition
Acquisition with the device turned on: if it can be turned off -> turn it off (unplug, do not shut down normally) and perform a normal acquisition. Else:
Disconnect from network
Dump volatile memory
Runtime information acquisition (running processes, network information)
Disk acquisition (can be done with target powered on).
It is important to document every step of the procedure.