Yesterday, we published a pre-print on the shortcomings of current research-paper recommender system evaluations. One of its findings was that the results of offline and online experiments sometimes contradict each other. We analyzed this issue in more detail and wrote a new paper about it. More specifically, we conducted a comprehensive evaluation of a set of recommendation algorithms using (a) an offline evaluation and (b) an online evaluation, and compared the results of the two methods to determine whether and when they contradicted each other. Subsequently, we discussed the differences and validity of evaluation methods, focusing on research paper recommender systems. The goal was to identify which of the evaluation methods is most authoritative, or whether some methods are unsuitable in general. By ‘authoritative’, we mean which evaluation method one should trust when the results of different methods contradict each other.
Bibliographic data: Beel, J., Langer, S., Genzmehr, M., Gipp, B. and Nürnberger, A. 2013. A Comparative Analysis of Offline and Online Evaluations and Discussion of Research Paper Recommender System Evaluation. Proceedings of the Workshop on Reproducibility and Replication in Recommender Systems Evaluation (RepSys) at the ACM Recommender System Conference (RecSys) (2013), 7–14.
Our current results cast doubt on the meaningfulness of offline evaluations. We showed that offline evaluations often could not predict the results of online experiments, measured by click-through rate (CTR), and we identified two possible reasons.
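To make the comparison concrete, here is a minimal sketch of how offline and online results can be compared; the algorithm names and all numbers are made up, not results from our paper. CTR is simply the fraction of displayed recommendations that were clicked, and the ranking of the algorithms by CTR can be checked against their ranking by an offline metric such as precision.

```python
# Minimal sketch: compare how a set of algorithms ranks offline vs. online.
# Algorithm names and all numbers are made up for illustration.

def ctr(clicks: int, impressions: int) -> float:
    """Click-through rate: fraction of delivered recommendations that were clicked."""
    return clicks / impressions if impressions else 0.0

# Hypothetical online logs and hypothetical offline precision scores per algorithm.
online_logs = {
    "algo_A": {"impressions": 10_000, "clicks": 620},
    "algo_B": {"impressions": 10_000, "clicks": 410},
    "algo_C": {"impressions": 10_000, "clicks": 90},
}
offline_precision = {"algo_A": 0.05, "algo_B": 0.11, "algo_C": 0.01}

online_ranking = sorted(
    online_logs,
    key=lambda a: ctr(online_logs[a]["clicks"], online_logs[a]["impressions"]),
    reverse=True,
)
offline_ranking = sorted(offline_precision, key=offline_precision.get, reverse=True)

print("ranking by CTR (online):       ", online_ranking)
print("ranking by precision (offline):", offline_ranking)
print("offline predicted online order:", online_ranking == offline_ranking)
```

In this made-up example the two rankings disagree, which is exactly the kind of contradiction our paper examines.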
The first reason for the lack of predictive power of offline evaluations is that they ignore human factors. These factors may strongly influence whether users are satisfied with recommendations, regardless of the recommendations’ relevance. We argue that it will probably never be possible to determine when and how influential human factors are in practice. Thus, it is impossible to determine when offline evaluations have predictive power and when they do not. Assuming that the only purpose of offline evaluations is to predict results in real-world settings, the plausible consequence is to abandon offline evaluations entirely.
The second reason why user-offline-datasets may not always have predictive power is their incompleteness, which is attributable to users’ insufficient knowledge of the literature or to biases in the citation behavior of some researchers. Our results led to the conclusion that incomplete and biased datasets sometimes have the same negative effects on different algorithms, while in other situations they affect different algorithms differently; this is why offline evaluations could only sometimes predict the results of online evaluations. Since we see no way of knowing when the negative effects of incomplete datasets will be the same for two algorithms, we concluded that user-offline-datasets are not suitable for predicting the performance of recommender systems in practice.
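The effect of such incompleteness is easy to illustrate with a toy example; the sketch below uses made-up papers and algorithms, not data from our study. If the user’s collection is missing some relevant papers, an offline evaluation undercounts hits, and two equally good algorithms can end up with very different offline scores.

```python
# Toy example (hypothetical data): how an incomplete ground truth can
# penalize two equally good algorithms differently in an offline evaluation.

def precision(recommended: list[str], ground_truth: set[str]) -> float:
    """Fraction of recommended papers that appear in the ground truth."""
    hits = sum(1 for paper in recommended if paper in ground_truth)
    return hits / len(recommended)

truly_relevant = {"p1", "p2", "p3", "p4", "p5", "p6"}   # what is actually relevant
known_to_user = {"p1", "p2", "p3"}                      # what the user happened to cite

algo_x = ["p1", "p2", "p4"]   # recommends a relevant paper the user never cited
algo_y = ["p1", "p2", "p3"]   # happens to recommend only papers the user cited

for name, recs in [("algo_x", algo_x), ("algo_y", algo_y)]:
    print(
        name,
        "| true precision:", round(precision(recs, truly_relevant), 2),
        "| offline precision:", round(precision(recs, known_to_user), 2),
    )
# algo_x: true 1.0 vs. offline 0.67; algo_y: true 1.0 vs. offline 1.0.
# The incomplete dataset makes algo_x look worse although both are equally good.
```

Whether such a penalty hits two algorithms equally or unequally cannot be known in advance, which is the core of the argument above.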
However, we also argued that offline evaluations might have some inherent value, and that it may make sense to apply algorithms in real-world systems even if they performed well in offline evaluations but poorly in online evaluations or user studies. The underlying assumption is that the users who contributed to the offline dataset know better than the users receiving recommendations which papers are relevant for certain information needs. Theoretically, this could be the case for datasets compiled by experts, but we argued that expert datasets are overspecialized and not practically feasible, and thus unsuitable for evaluating recommender systems. Evaluations based on user-offline-datasets could have some value for determining which algorithms are best if the algorithms performed consistently in online evaluations and user studies. However, this also means that offline evaluations alone are of little value.
Our study represents only a first step toward deciding whether and when offline evaluations should be used. Future research should clarify with more certainty whether offline evaluations are indeed unsuitable for evaluating research paper recommender systems. We cannot rule out that we missed an important argument, or that there is a way to determine the situations in which offline evaluations do have predictive power. In addition, Docear’s offline dataset might not be considered optimal due to its large number of novice users, and repeating our analysis on other datasets might lead to more favorable results for offline evaluations. It might also make sense to repeat our study with additional offline metrics, such as recall or NDCG, and to additionally conduct a large-scale user study. Finally, it might be argued that CTR is not an ideal evaluation measure and that predicting CTR should not be considered the goal of an offline evaluation: CTR only measures how interesting a title appears to a user. Measuring how often users actually cite the recommended papers may be more appropriate, and offline evaluations likely correlate more strongly with such a measure.
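For readers less familiar with the offline metrics mentioned above, here is a minimal, self-contained sketch of recall and NDCG for a single ranked recommendation list. The ranking and the ground truth are invented for illustration; this is not code from our evaluation.

```python
import math

# Minimal sketch of two offline metrics (recall and NDCG) for one ranked list.
# The example ranking and ground truth are invented for illustration.

def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Share of the relevant papers that appear in the top-k recommendations."""
    hits = sum(1 for paper in ranked[:k] if paper in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Normalized discounted cumulative gain with binary relevance."""
    dcg = sum(1.0 / math.log2(i + 2) for i, paper in enumerate(ranked[:k]) if paper in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0

ranked = ["p7", "p1", "p9", "p3", "p5"]   # recommendations, best first
relevant = {"p1", "p3", "p4"}             # e.g. the papers the user actually cited

print("recall@5:", round(recall_at_k(ranked, relevant, 5), 2))   # 0.67
print("NDCG@5:  ", round(ndcg_at_k(ranked, relevant, 5), 2))     # ~0.5
```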
In addition, it should be investigated to what extent the limitations of offline datasets for research paper recommender systems apply to other domains and to ‘true-offline-datasets’. True-offline-datasets are not relevant for research paper recommender systems, but they are for many other recommender systems: they contain ratings from real users, and we could imagine that they represent a near-perfect ground truth, in which case the results of offline evaluations would not contradict the results of online evaluations. However, there is also doubt about how reliable user ratings are [25].
In summary, the community requires a more thorough investigation of the usefulness of offline evaluations, and more sound empirical evidence, before offline evaluations can be abandoned entirely. In the meantime, we suggest treating the results of offline evaluations with skepticism.
For a more detailed discussion of the usefulness of offline experiments, and for more details on our analysis, check out our pre-print of the paper “A Comparative Analysis of Offline and Online Evaluations and Discussion of Research Paper Recommender System Evaluation”.