I vividly recall how, more than a decade ago, when I was a PhD student, Konstan & Adomavicius warned that “the recommender systems research community […] is facing a crisis where a significant number of research papers lack the rigor and evaluation to be properly judged and, therefore, have little to contribute to collective knowledge” [14]. Similar concerns had already been voiced two years earlier by Ekstrand et al. [12]. Over the following years, many more researchers criticized the evaluation practices in the community [13, 21, 19, 10], myself included [5, 8, 4, 20, 23, 15, 6, 7]. The situation may have improved somewhat in recent years thanks to greater awareness in the community [13], the reproducibility track at the ACM RecSys conference, innovative submission formats like “result-blind reviews” [9] via registered reports at ACM TORS, and several new software libraries, including Elliot [1], RecPack [16], RecBole [25], and LensKit-Auto [22]. Yet the criticism by Konstan & Adomavicius remains as true today as it was a decade ago.

Konstan & Adomavicius proposed that, among other measures, best-practice guidelines on recommender systems research and evaluation might offer a solution to the crisis [14]. In their paper, they also presented results from a small survey indicating that many members of the community would welcome such guidelines. However, to my knowledge, no comprehensive guidelines or checklists have been created specifically for the recommender systems community, or at least none have been widely adopted. Recently, I attempted to develop guidelines for releasing recommender systems research code [3], based on the NeurIPS and ‘Papers with Code’ guidelines [24], but progress has been limited.

I echo the demand by Konstan & Adomavicius [14] for the recommender systems community to establish best-practice guidelines and/or checklists for researchers and reviewers. Such guidelines would facilitate the conduct of ‘good’ research, and they would assist reviewers in conducting thorough reviews. By ‘good’ research I primarily mean reproducible research with a sound methodology. But ‘good’ research also refers to research that others can easily build upon, e.g., because data and code are available; research that is ethical; and research that is sustainable, e.g., because no resources were wasted.

My vision is best-practice guidelines that are not merely a collection of opinions but are instead grounded in empirical evidence. This approach would be analogous to the medical field, where guidelines for practitioners are justified based on empirical research findings. Additionally, these medical guidelines indicate the degree of consensus among experts, allowing medical practitioners to understand how widely accepted each best practice is. In areas with less expert consensus, deviations from the best practice by practitioners would be more acceptable. This model ensures that guidelines are both scientifically robust and flexible.

In my view, best-practice guidelines for recommender systems research and evaluation should include the following components in addition to the best practices themselves:

  1. Justification: A justification for the best practice, ideally based on empirical evidence.
  2. Confidence: An estimate of how sound the evidence is.
  3. Severity: An estimate of the importance of the best practice and the potential consequences of not following it.
  4. Consensus: The degree of agreement within the community or among experts that the proposed best practice is indeed a best practice.

Table 1 illustrates what a best practice may look like, using the example of random seeds. A random seed is an initial value for a pseudo-random number generator, ensuring that the sequence of random numbers it produces is reproducible. This reproducibility is crucial for consistent experiment results, fair comparisons between different algorithms, and reliable debugging. For instance, when splitting a dataset into training and testing sets, using a fixed random seed ensures that the same split is produced each time. This consistency allows researchers to compare the performance of different algorithms on identical data splits, so that any performance differences are due to the algorithms themselves and not to variations in the data splits. Choosing the seed values themselves at random is not a trivial task, and dedicated tools exist for it [11].
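The effect of a fixed seed on a data split can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of any particular library: the function `split_interactions`, the toy `(user_id, item_id)` pairs, and the seed values 42 and 7 are all hypothetical and chosen only for the example.

```python
import random


def split_interactions(interactions, test_fraction=0.2, seed=42):
    """Shuffle with a dedicated, seeded generator and split into train/test.

    Hypothetical helper for illustration; a real pipeline would typically
    use the splitting utilities of its recommender systems library.
    """
    rng = random.Random(seed)  # local generator, global random state untouched
    shuffled = list(interactions)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Made-up toy data: 10 users x 5 items as (user_id, item_id) pairs
interactions = [(u, i) for u in range(10) for i in range(5)]

train_a, test_a = split_interactions(interactions, seed=42)
train_b, test_b = split_interactions(interactions, seed=42)
train_c, test_c = split_interactions(interactions, seed=7)

same_seed_same_split = test_a == test_b        # identical seed, identical split
other_seed_other_split = test_a != test_c      # different seed, different split
```

Two algorithms evaluated on `train_a`/`test_a` see exactly the same data, whereas a run that uses `train_c`/`test_c` differs in the data as well as in the algorithm.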

Creating a preliminary set of guidelines for recommender systems evaluation should be straightforward. Existing communities, particularly in machine learning, already have robust best-practice guidelines and checklists. Notably, NeurIPS [17, 18] and the AutoML conference [2] offer guidelines that could be adapted for recommender system experiments with relatively minor modifications. Initially, these guidelines do not require empirical evidence or consensus surveys. They can be simple and aligned with those used in the machine-learning community. Over time, these guidelines can be tailored more to fit recommender systems research, expanded and substantiated with empirical evidence and broader consensus.

The creation and justification of best practices can likely be undertaken by any motivated researcher with experience in recommender systems research. However, the final selection of these best practices, particularly concerning points 3 (severity) and 4 (consensus), should be conducted by reputable members of the RecSys community. This could be achieved through a Dagstuhl seminar with selected experts or by the steering committee of the ACM Recommender Systems Conference.

In conclusion, establishing well-defined best-practice guidelines, endorsed by the community and enforced by key publication venues such as the ACM Recommender Systems conference and the ACM Transactions on Recommender Systems (TORS) journal, would be a significant step towards resolving the long-standing crisis in the recommender systems research community. For over a decade, the community has struggled with inconsistencies and a lack of rigor in research practices. By adopting and enforcing these guidelines, we can ensure higher research standards, facilitate reproducibility, and contribute more robustly to collective knowledge.

Random Seeds

Best-Practice:
1) Experiments must be repeated (n>=5) with different random seeds each time. This applies to every aspect of an experiment that involves randomness, including splitting data and initialising weights in neural networks.
2) The exact random seeds used for experiments must be reported in the paper or the code.

Justification: In the context of data splitting, Wegmeth et al. [23] showed that when random seeds differed (i.e. the data splits contained different data due to randomness), the performance of the same algorithm with the same hyper-parameters varied by up to 12%. In contrast, repeating and averaging experiments with different random seeds led to a maximum difference of only around 4%. This means that if only a single run had been conducted, the result could be up to 6% above or below the ‘true’ result, possibly more. With repeated experiments, the deviation would have been only ±2% in the worst case. The variance depended on the applied metrics, cut-offs, datasets, and splitting methods (lower variance for cross-fold validation, higher variance for holdout validation). Therefore, repeating experiments with different random seeds ensures that the reported result is closer to the ‘true’ result. Reporting the exact random seeds is also a prerequisite (besides many other factors) for an exact replication of experiments. A researcher who wants to replicate an experiment and who uses the same random seeds as the original researcher will have the same data in the train and validation splits as the original researcher. Knowing the exact random seeds also makes it easier to detect fraudulent behavior such as cherry-picking.

Severity: Medium. If not appropriately conducted, reported results may deviate from the ‘true’ results by several percent.

Confidence: Low (the empirical evidence is based on only one workshop publication [23]).

Consensus: 82% of the ACM RecSys Steering Committee agree with this best practice. PLEASE NOTE: This is an example for illustration purposes. The percentage is made up.

Table 1 Best Practices for Random Seeds (Example)
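
In code, the two rules in Table 1 might look roughly as follows. This is a sketch under stated assumptions: `run_experiment` is a hypothetical stand-in that simulates a metric instead of training a model, and the seed list and all numbers are made up for illustration.

```python
import random
import statistics


def run_experiment(seed):
    """Hypothetical stand-in for one full train/evaluate run.

    A real implementation would pass the seed to every source of randomness
    (data splitting, weight initialisation, ...) and return the measured
    metric. Here the metric is simulated with made-up numbers.
    """
    rng = random.Random(seed)
    return 0.30 + rng.uniform(-0.02, 0.02)


# Rule 1: repeat the experiment with n >= 5 different seeds.
# Rule 2: keep the seeds as an explicit list so they can be reported.
SEEDS = [11, 23, 37, 41, 59]

scores = {seed: run_experiment(seed) for seed in SEEDS}
values = list(scores.values())

mean_score = statistics.mean(values)    # the value to report
std_score = statistics.stdev(values)    # sample standard deviation across seeds
spread = max(values) - min(values)      # worst-case gap between single runs

print(f"seeds: {SEEDS}")
print(f"mean={mean_score:.4f} std={std_score:.4f} range={spread:.4f}")
```

Reporting the mean together with the standard deviation and the seed list lets readers judge how much a single run could have deviated and lets them rerun exactly the same configurations.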


