New pre-print: “Research Paper Recommender System Evaluation: A Quantitative Literature Survey”

As you might know, Docear has a recommender system for research papers, and we are putting a lot of effort into improving it. Actually, the development of the recommender system is part of my PhD research. When I began my work on the recommender system some years ago, I became quite frustrated: there were so many different approaches for recommending research papers, but I had no clue which one would be most promising for Docear. I read many papers (far more than 100), and although they presented many interesting ideas, the evaluations… well, most of them were poor. Consequently, I simply did not know which approaches to use in Docear.

Meanwhile, we reviewed all these papers more carefully and analyzed how exactly the authors conducted their evaluations. More precisely, we analyzed the papers with respect to the following questions:

To what extent do authors perform user studies, online evaluations, and offline evaluations?
How many participants do user studies have?
Against which baselines are approaches compared?
Do authors provide information about their algorithms’ runtime and computational complexity?
Which metrics are used for algorithm evaluation, and do different metrics provide similar rankings of the algorithms?
Which datasets are used for offline evaluations?
Are results comparable among different evaluations based on different datasets?
How consistent are online and offline evaluations? Do they provide the same, or at least similar, rankings of the evaluated approaches?
Do authors provide sufficient information to re-implement their algorithms or replicate their experiments?
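
To make the questions about ranking consistency a bit more concrete, here is a minimal sketch of how one could check whether two evaluations rank the same approaches similarly. It is not taken from the paper: it uses Kendall's tau (the tau-a variant, with tied pairs counted as neither concordant nor discordant) as one possible agreement measure, and the function name, approach names, and all numbers are hypothetical.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two evaluations of the same approaches.

    Both arguments map an approach name to its score in one evaluation
    (higher is better). Returns a value in [-1, 1]: 1 means identical
    rankings, -1 means exactly reversed rankings.
    """
    approaches = sorted(scores_a)
    if sorted(scores_b) != approaches:
        raise ValueError("both evaluations must cover the same approaches")
    if len(approaches) < 2:
        raise ValueError("need at least two approaches to compare rankings")

    concordant = discordant = 0
    for x, y in combinations(approaches, 2):
        # Positive product: both evaluations order x and y the same way.
        product = (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1

    n_pairs = len(approaches) * (len(approaches) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores for three made-up approaches (not data from the survey).
offline_precision = {"approach_a": 0.30, "approach_b": 0.25, "approach_c": 0.10}
online_click_through = {"approach_a": 0.04, "approach_b": 0.06, "approach_c": 0.05}

tau = kendall_tau(offline_precision, online_click_through)
print(f"Kendall's tau between offline and online ranking: {tau:.2f}")
```

A value near 1 would suggest that the offline and the online evaluation agree on which approaches are better, while a value near 0 or below would suggest that they do not.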

The results are quite frustrating. The review of 176 publications has shown that no consensus exists on how to evaluate and compare research paper recommender approaches. This leads to the unsatisfying situation that, despite the many evaluations, the individual strengths and weaknesses of the proposed approaches remain largely unknown. Out of 89 reviewed approaches, 21% were not evaluated at all. Of the evaluated approaches, 19% were not evaluated against a baseline. Almost all evaluations that did compare against a baseline used trivial baselines. Only 10% of the reviewed approaches were compared against at least one state-of-the-art approach.

In addition, runtime information was provided for only 11% of the approaches, even though this information is crucial for assessing whether an algorithm is practical. In one case, runtimes differed by a factor of 600. Details on the proposed algorithms were often sparse, which makes re-implementation difficult in many cases. Only five approaches (7%) were evaluated using online evaluations. The majority of authors conducted offline evaluations (69%). The most frequent sources for retrieving offline datasets were CiteSeer (29%), ACM (10%), and CiteULike (10%). However, the majority (52%) of evaluations were conducted using other datasets, and even the datasets from CiteSeer, ACM, and CiteULike differed from study to study, since they were all fetched at different times and pruned differently. Because of the different datasets used, individual study outcomes are not comparable. Of the approaches evaluated with a user study (34%), the majority (58%) of these studies had fewer than 16 participants. In addition, user studies sometimes contradicted the results of offline evaluations. These observations call the validity of offline evaluations into question and demand further research.

Given the circumstances, identifying the most promising approaches for recommending research papers is not possible, and neither is replicating most of the evaluations. We consider this a major problem for the advancement of research paper recommender systems. Researchers cannot evaluate their novel approaches against a state-of-the-art baseline because no state-of-the-art baseline exists. Similarly, providers of academic services who wish to implement a recommender system have no chance of knowing which of the 89 approaches they should implement.

If you are interested in more details, read our current pre-print of the paper.

Published by Joeran Beel on 23rd September 2013