This proposal is also available as a pre-print (PDF) on OSF.io. If you want to cite this proposal, please cite:
Beel, Joeran, Lukas Wegmeth, and Tobias Vente. 2024. “E-fold Cross-validation: A Computing and Energy-efficient Alternative to K-fold Cross-validation with Adaptive Folds.” OSF Preprints. June 6. doi:10.31219/osf.io/exw3j.
Introduction
K-fold cross-validation is widely regarded as a robust method for model evaluation in machine learning and related fields, including recommender systems. Unlike a simple hold-out split, k-fold cross-validation ensures that each instance in the dataset is used for both training and validation. Furthermore, by repeating the evaluation k times with different subsets, this method typically yields a more reliable estimate of how well a model generalizes to unseen data than a single hold-out split.
The generally accepted downside of k-fold cross-validation is a roughly k times longer evaluation time and, hence, k times more computational power. This, in turn, means k times more energy consumption and eventually k times more CO2 emissions. We agree that repeating experiments multiple times is typically beneficial, and that the additional energy and CO2 emissions are a necessary trade-off.
The main problem with k-fold cross-validation is the rather arbitrary and fixed value of k. Empirical evidence suggests that a k between 5 and 10 is typically ideal. Hence, researchers usually choose a k between 5 and 10 for their experiments. Which exact k researchers choose depends on gut feeling and the availability of computational resources. Notably, k typically remains the same throughout a researcher’s entire experimental pipeline, that is, across all datasets and algorithms.
We argue that an arbitrary fixed k over all experiments is not ideal. On the one hand, whatever k is chosen, there is a risk that k is not large enough to achieve the optimal performance (the smaller k, the more significant the risk). On the other hand, there will be experiments in which k is chosen unnecessarily large. In these cases, optimal performance is achieved, but time, energy, and CO2 are wasted.
We propose the idea of e-fold cross-validation. The core idea is that e is chosen ‘intelligently’ and individually for each experiment and dataset. This contrasts with a static k chosen by gut feeling and past experience of which k is ‘typically’ good. Our goal for e-fold cross-validation is that e is as small as possible, so as not to waste time and energy or create unnecessary CO2 emissions, but large enough to provide (near) optimal performance. How exactly e is chosen is subject to future research. One potential way is presented in the remainder.
Methodology
As a first step, related work must be identified. It falls into two groups: work that identifies the optimal size of k, and work that proposes alternatives to using a fixed k. To strengthen the motivation for e-fold cross-validation, the significance of the problem with k-fold cross-validation should be quantified. To do so, a literature analysis could be conducted. For instance, it might be useful to analyse how many researchers actually use k-fold cross-validation, and with which k. To find out, recent papers, for instance from the ACM Recommender-Systems Conference, could be surveyed.
To further quantify the severity of the potential problem behind k-fold cross-validation, experimental work is needed. Through experiments, it should be identified how often the chosen k is too large, i.e. ‘wasting’ time and energy, and how often k is too small, i.e. not achieving optimal performance.
One option for achieving this is as follows.
1. Identify at least 20 recommender-systems or machine-learning datasets, preferably more. The datasets should vary in their characteristics (size, domain …).
2. Hold out 10% of each dataset for later testing.
3. Split the remaining data into k folds, with k=2…25.
4. Train and evaluate at least 8 algorithms from at least two different software libraries.
5. Repeat steps 2 to 4 at least 10 times, preferably 100 times.
6. Analyse the data:
   - Visualize the results for k=2…25.
   - Identify how often which k was optimal.
   - Estimate the consequences if a sub-optimal k, e.g. between 5 and 10, had been chosen.
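The core of the steps above (hold-out, splitting into k folds, and scoring for every k) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a prescribed implementation: the function names (`hold_out`, `k_folds`, `scores_per_k`), the synthetic ratings, and the toy mean predictor are all hypothetical stand-ins. A real study would plug in actual datasets and algorithms from existing libraries, and repeat the whole procedure with different seeds.

```python
import random
import statistics


def hold_out(data, test_fraction=0.1, seed=0):
    """Shuffle a copy of the data and split off a test set for later use."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def k_folds(data, k):
    """Yield (train, validation) pairs for k roughly equal-sized folds."""
    folds = [data[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, validation


def mean_predictor_mae(train, validation):
    """Toy stand-in for training and evaluating a real algorithm."""
    prediction = statistics.fmean(train)
    return statistics.fmean([abs(y - prediction) for y in validation])


def scores_per_k(data, ks=range(2, 26), score=mean_predictor_mae):
    """Map every k under study to its list of k validation scores."""
    return {k: [score(tr, va) for tr, va in k_folds(data, k)] for k in ks}


rng = random.Random(42)
ratings = [rng.gauss(3.5, 1.0) for _ in range(500)]  # synthetic, not a real dataset
remaining, test = hold_out(ratings)
results = scores_per_k(remaining)
for k in (2, 5, 10, 25):
    print(k, round(statistics.fmean(results[k]), 4))
```

Keeping the per-fold scores for every k, rather than only their mean, makes the later analysis possible: the spread across folds is what indicates whether a given k was larger than necessary.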
Then, propose at least one approach for how e could be ‘intelligently’ selected, possibly individually, for each dataset in an experiment. A rather obvious criterion would be that a certain level of statistical significance (or a certain confidence-interval width) is reached. But likely, there are other options, too.
Finally, conduct additional experiments to identify how effective the proposed method is. This means finding out how often the novel method does (not) achieve the optimal performance and how much time, energy and CO2 emissions the method could save.
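One possible reading of the confidence-interval idea is sketched below. It is an assumption for illustration, not the method this proposal commits to: folds are evaluated one at a time, and evaluation stops as soon as the confidence interval of the mean score is narrower than a tolerance. The function name `e_fold`, its parameters (`max_folds`, `min_folds`, `tolerance`, `confidence`), and the example scores are all hypothetical. The interval uses a normal approximation for simplicity; with so few folds, Student's t would give wider and more cautious intervals.

```python
import itertools
import statistics


def e_fold(fold_scores, max_folds=10, min_folds=3, tolerance=0.05, confidence=0.95):
    """Evaluate folds one at a time and stop once the estimate is precise enough.

    fold_scores is an iterator that computes one validation score per step,
    so folds after the stopping point are never trained or evaluated.
    Returns (e, mean score, confidence-interval half-width).
    """
    # Normal approximation (an assumption); Student's t would be stricter for small e.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    scores = []
    half_width = float("inf")
    for score in itertools.islice(fold_scores, max_folds):
        scores.append(score)
        if len(scores) >= min_folds:
            half_width = z * statistics.stdev(scores) / len(scores) ** 0.5
            if half_width <= tolerance:
                break
    return len(scores), statistics.fmean(scores), half_width


# Made-up fold scores: one stable experiment, one noisy experiment.
stable = iter([0.80, 0.81, 0.79, 0.80, 0.82, 0.80, 0.79, 0.81, 0.80, 0.80])
noisy = iter([0.50, 0.90, 0.60, 1.00, 0.40, 0.80, 0.70, 0.95, 0.45, 0.75])
e_stable, mean_stable, _ = e_fold(stable)
e_noisy, mean_noisy, _ = e_fold(noisy)
print(e_stable, e_noisy)
```

With these made-up scores, the stable experiment stops after 3 folds, while the noisy one uses all 10. Passing the scores as a lazy iterator matters here: the savings in time and energy only materialize if the remaining folds are never trained.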
Notes
Please note that this proposal is intended for Bachelor and Master students who seek a thesis topic. Before beginning the work, it is particularly important to conduct a thorough literature search to find out if methods similar to the one proposed exist already. Be aware, though, that even if similar ideas exist, this topic would still be suitable as a Bachelor’s or Master’s thesis.