New paper: “A Comparative Analysis of Offline and Online Evaluations and Discussion of Research Paper Recommender System Evaluation”

Yesterday, we published a pre-print on the shortcomings of current research-paper recommender system evaluations. One of the findings was that results of offline and online experiments sometimes contradict each other. We did a more detailed analysis on this issue and wrote a new paper about it. More specifically, we conducted a comprehensive evaluation of a set of recommendation algorithms using (a) an offline evaluation and (b) an online evaluation. Results of the two evaluation methods were compared to determine whether and when results of the two methods contradicted each other. Subsequently, we discuss differences and validity of evaluation methods focusing on research paper recommender systems. The goal was to identify which of the evaluation methods were most authoritative, or, if some methods are unsuitable in general. By ‘authoritative’, we mean which evaluation method one should trust when results of different methods contradict each other.

Bibliographic data: Beel, J., Langer, S., Genzmehr, M., Gipp, B. and Nürnberger, A. 2013. A Comparative Analysis of Offline and Online Evaluations and Discussion of Research Paper Recommender System Evaluation. Proceedings of the Workshop on Reproducibility and Replication in Recommender Systems Evaluation (RepSys) at the ACM Recommender System Conference (RecSys) (2013), 7–14.

Our current results cast doubt on the meaningfulness of offline evaluations. We showed that offline evaluations could often not predict results of online experiments (measured by click-through rate – CTR) and we identified two possible reasons.

The first reason for the lacking predictive power of offline evaluations is the ignorance of human factors. These factors may strongly influence whether users are satisfied with recommendations, regardless of the recommendation’s relevance. We argue that it probably will never be possible to determine when and how influential human factors are in practice. Thus, it is impossible to determine when offline evaluations have predictive power and when they do not. Assuming that the only purpose of offline evaluations is to predict results in real-world settings, the plausible consequence is to abandon offline evaluations entirely.