Prof. Raul Vicente Zafra, Institute of Computer Science UT
PhD Davit Bzhalava, Karolinska Institutet (Sweden)
PhD Ricardo Vigário, University Nova (Portugal)
Description of the problem
In classical statistics, models are relatively simple, and together with some assumptions about the data, it is possible to say whether a given result is statistically significant. Machine learning models, on the other hand, can have hundreds of millions of weights. Such models can fit almost any training data with 100% accuracy, which changes the rules of the game: a perfect fit to the training data no longer says anything about how well the model generalizes. The standard solution is to evaluate the model on a separate test set: some data points are withheld from the model fitting phase, and once the best model has been found, its quality is evaluated on those withheld points. This method works well, but part of the precious data is spent on testing the model and is never used for training. Researchers have proposed many ways to use data more efficiently. One of the main methods, nested cross-validation, uses data very efficiently, but it makes the model parameters very difficult to interpret, because each outer fold may select a different model.
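The need for a held-out test set can be illustrated with a small sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the synthetic linear data, the polynomial models, and the helper names `mse` and `holdout_demo`. A degree-7 polynomial fitted to 8 training points reproduces them essentially exactly, so only the withheld points say anything about its quality.

```python
import numpy as np


def mse(coeffs, x, y):
    """Mean squared error of a polynomial (np.polyfit coefficients) on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


def holdout_demo(seed=0):
    """Fit a flexible and a simple model, then score both on withheld data."""
    rng = np.random.default_rng(seed)

    # Hypothetical data: a noisy linear relationship.
    x_train = np.linspace(-1.0, 1.0, 8)
    y_train = 2.0 * x_train + 0.5 + rng.normal(0.0, 0.1, x_train.size)
    x_test = rng.uniform(-1.0, 1.0, 50)
    y_test = 2.0 * x_test + 0.5 + rng.normal(0.0, 0.1, x_test.size)

    # Degree 7 has 8 coefficients, enough to pass through all 8 training points.
    flexible = np.polyfit(x_train, y_train, 7)
    simple = np.polyfit(x_train, y_train, 1)

    return {
        "flexible": (mse(flexible, x_train, y_train), mse(flexible, x_test, y_test)),
        "simple": (mse(simple, x_train, y_train), mse(simple, x_test, y_test)),
    }


if __name__ == "__main__":
    for name, (train_err, test_err) in holdout_demo().items():
        print(f"{name}: train MSE = {train_err:.3g}, test MSE = {test_err:.3g}")
```

The training error of the flexible model is close to zero by construction, so it cannot be used to compare the two models; the test error can, at the cost of 50 points that never contribute to the fit.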
Result and benefit
In this thesis, we developed a novel approach to data partitioning that we termed "Cross-validation and cross-testing". First, cross-validation is used on part of the data to determine and lock the model. The locked model is then tested on the separate test set in a novel way: in each testing cycle, part of the test data is also used in the model training phase. This improves the use of machine learning algorithms in cases where we need to interpret the model parameters but not the model weights. For example, it allows us to state that the data has a linear rather than a quadratic relationship, or that the best neural network has 5 hidden layers.
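The two steps can be sketched roughly as follows. This is one possible reading of the description, not the thesis implementation: the polynomial degree stands in for the model parameter being locked, the data are synthetic, and the names `kfold_indices`, `select_degree`, `cross_test` and `demo` are hypothetical. The sketch assumes that in each testing cycle one fold of the test set is held out and the remaining test folds are added to the training data.

```python
import numpy as np


def kfold_indices(n, k, rng):
    """Shuffle the indices 0..n-1 and split them into k nearly equal folds."""
    return np.array_split(rng.permutation(n), k)


def mse(coeffs, x, y):
    """Mean squared error of a polynomial (np.polyfit coefficients) on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


def select_degree(x, y, degrees, k, rng):
    """Step 1: ordinary k-fold cross-validation to choose and lock the degree."""
    folds = kfold_indices(len(x), k, rng)
    scores = {}
    for d in degrees:
        errs = []
        for i in range(k):
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            coeffs = np.polyfit(x[train], y[train], d)
            errs.append(mse(coeffs, x[folds[i]], y[folds[i]]))
        scores[d] = float(np.mean(errs))
    return min(scores, key=scores.get), scores


def cross_test(x_cv, y_cv, x_test, y_test, degree, k, rng):
    """Step 2 (assumed reading): score the locked degree on each held-out test
    fold, refitting the weights on the CV data plus the other test folds."""
    folds = kfold_indices(len(x_test), k, rng)
    errs = []
    for i in range(k):
        rest = np.concatenate([folds[j] for j in range(k) if j != i])
        x_train = np.concatenate([x_cv, x_test[rest]])
        y_train = np.concatenate([y_cv, y_test[rest]])
        coeffs = np.polyfit(x_train, y_train, degree)  # degree stays locked
        errs.append(mse(coeffs, x_test[folds[i]], y_test[folds[i]]))
    return float(np.mean(errs))


def demo(seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, 200)
    y = 2.0 * x + 0.5 + rng.normal(0.0, 0.1, 200)  # hypothetical linear data
    x_cv, y_cv, x_test, y_test = x[:120], y[:120], x[120:], y[120:]
    degree, scores = select_degree(x_cv, y_cv, [1, 2, 3], 5, rng)
    test_mse = cross_test(x_cv, y_cv, x_test, y_test, degree, 5, rng)
    return degree, scores, test_mse


if __name__ == "__main__":
    degree, scores, test_mse = demo()
    print("locked degree:", degree)
    print("cross-validation MSE per degree:", scores)
    print("cross-testing MSE:", test_mse)
```

In this sketch the degree is chosen once and never revisited, so a single statement about the form of the relationship can be reported, while the weights are refitted in every testing cycle and are therefore not meant to be interpreted.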