Over the past semester, several Bachelor students at the University of Siegen undertook a bold challenge in our Machine Learning Praktikum: they didn’t just learn algorithms—they tried to reproduce published machine learning or recommender systems research. Each student or team selected a recent paper, rebuilt the experimental pipeline, validated (or critiqued) the claims, and—crucially—published their full results on arXiv or OSF.io to maximize transparency and accountability.
The following chapters provide a brief summary of each study. All projects were conducted as part of the course Machine Learning Praktikum, University of Siegen (2025) and are citable as standalone technical reports.
1. Reproducing the AI Scientist: How Close Are We to Automated Research?
Artur Papoyan
This study critically reproduced The AI Scientist by Sakana.ai, a system that claims to generate full scientific papers autonomously. Artur recreated eight complete research cycles and revealed several limitations: 38% of experiments failed due to code errors; novelty detection mislabeled well-known ideas as original; and generated manuscripts contained placeholder images and outdated references. While visually convincing, the papers lacked substance.
Still, the experiment confirmed how astonishingly fast and cheap such agents can be—each paper cost around $9 in API usage and ran in ~3 hours. The work confirms that while automation in science is feasible at the surface level, core scientific reasoning remains elusive.
Citation metadata:
Title: AI Scientist in Practice: Reproducing and Evaluating Autonomous Scientific Discovery
Author: Artur Papoyan
Year: 2025
Venue: Machine Learning Praktikum, University of Siegen
@techreport{papoyan2025ai,
title = {AI Scientist in Practice: Reproducing and Evaluating Autonomous Scientific Discovery},
author = {Artur Papoyan},
year = {2025},
institution = {Machine Learning Praktikum, University of Siegen},
}
2. Rethinking Dataset Selection: A Replication of the APS Framework
Abdelrahman Al-Taslaq
Recommender system benchmarks often use datasets by tradition, not by relevance. This project reproduced the Algorithm Performance Spaces (APS) method, which proposes selecting datasets based on how differently algorithms perform on them.
Abdelrahman implemented APS with 15 algorithms on 17 Kaggle datasets and visualized the results using over 200 APS diagrams. His findings confirm the original insight: algorithm rankings differ only minimally when datasets are too similar. APS-based selection yields more diverse, informative evaluation—and thus more trustworthy conclusions.
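To make the APS idea concrete, here is a minimal sketch of how dataset diversity can be derived from algorithm performance profiles. The dataset names, scores, distance metric, and greedy selection heuristic below are illustrative assumptions, not Abdelrahman's actual implementation.

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical performance table: each row is a dataset, each column an
# algorithm, each cell an nDCG-style score (purely made-up numbers).
datasets = ["ml-100k", "book-crossing", "retail", "anime"]
performance = np.array([
    [0.31, 0.28, 0.35, 0.30],   # profile of the first dataset
    [0.30, 0.27, 0.34, 0.29],   # nearly identical profile -> redundant dataset
    [0.12, 0.40, 0.22, 0.18],
    [0.45, 0.20, 0.33, 0.41],
])

# Rank the algorithms within each dataset, then measure how differently the
# rankings look across datasets (pairwise distances in "performance space").
ranks = performance.argsort(axis=1).argsort(axis=1)
distances = squareform(pdist(ranks, metric="euclidean"))

def select_diverse(dist, k):
    """Greedily pick k datasets whose algorithm rankings differ the most."""
    chosen = [int(dist.sum(axis=1).argmax())]
    while len(chosen) < k:
        remaining = [i for i in range(len(dist)) if i not in chosen]
        chosen.append(max(remaining, key=lambda i: dist[i, chosen].min()))
    return chosen

print([datasets[i] for i in select_diverse(distances, k=2)])

The point the sketch makes is the same as the paper's: datasets whose algorithm rankings are nearly identical (the first two rows) add little evaluative value, and APS-based selection filters them out.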
Citation metadata:
Title: Informed Dataset Selection with APS – Reproduced
Author: Abdelrahman Al-Taslaq
Year: 2025
Venue: Machine Learning Praktikum, University of Siegen
@techreport{altaslaq2025aps,
title = {Informed Dataset Selection with APS – Reproduced},
author = {Abdelrahman Al-Taslaq},
year = {2025},
institution = {Machine Learning Praktikum, University of Siegen},
}
3. Can We Recommend Sustainably? Revisiting “Green Recommender Systems”
Murat Ergün, Frederic Lück-Reuße, Leon Ulrich Pieper
This team reproduced and extended the “Green Recommender Systems” paper, which claims that models can be trained on less data to reduce energy costs without sacrificing performance.
They tested 17 algorithms on five datasets and classified them into high-, medium-, and low-sensitivity groups with respect to downsampling. Simple models like Bias or Popularity performed nearly as well when trained on just 10–30% of the data, while more complex models degraded sharply. Their findings confirm and extend the original: smart pruning enables greener ML without killing accuracy.
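A rough sketch of such a downsampling experiment looks like the following. The train_and_evaluate function, the user column name, and the sensitivity cut-offs are placeholders for illustration; they are not the team's actual pipeline.

import pandas as pd

def downsample_users(train_df, fraction, seed=42):
    """Keep a random fraction of every user's training interactions."""
    return (train_df.groupby("user", group_keys=False)
                    .apply(lambda g: g.sample(frac=fraction, random_state=seed)))

def classify_sensitivity(train_df, test_df, train_and_evaluate,
                         fractions=(0.1, 0.3, 0.5)):
    """Compare full-data accuracy against accuracy on downsampled data and
    bucket the algorithm into low / medium / high sensitivity."""
    baseline = train_and_evaluate(train_df, test_df)   # score on 100% of the data
    drops = {}
    for frac in fractions:
        score = train_and_evaluate(downsample_users(train_df, frac), test_df)
        drops[frac] = (baseline - score) / baseline    # relative accuracy loss
    worst = max(drops.values())
    # Illustrative cut-offs; the team's actual classification may differ.
    label = "low" if worst < 0.05 else "medium" if worst < 0.15 else "high"
    return label, drops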
Citation metadata:
Title: A Reproduction of Green Recommender Systems
Authors: Murat Ergün, Frederic Lück-Reuße, Leon Ulrich Pieper
Year: 2025
Venue: Machine Learning Praktikum, University of Siegen
@techreport{erguen2025green,
title = {A Reproduction of Green Recommender Systems},
author = {Murat Ergün and Frederic Lück-Reuße and Leon Ulrich Pieper},
year = {2025},
institution = {Machine Learning Praktikum, University of Siegen}
}
4. Time Is Relative: Evaluating Recommenders Over Time
Nina Kühn, Hannes Wunderlich
This study revisits the claim that recommender performance metrics change significantly over time—a scenario overlooked by many papers that report only single-number scores.
By reproducing and extending Scheidt & Beel’s time-dependent evaluation, the authors analyzed eight datasets and nine algorithms. They found that algorithm rankings frequently change over the course of a system’s life cycle. The message is clear: time-aware benchmarking must become standard practice.
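The core of a time-dependent evaluation can be sketched as follows. The timestamp column, the evaluate function, and the equal-size chronological windows are assumptions for illustration and do not reflect Scheidt & Beel's original code.

import pandas as pd

def time_dependent_ranking(interactions, algorithms, evaluate, n_windows=10):
    """Train on everything before each chronological window, test on the
    window itself, and return the per-window ranking of the algorithms."""
    interactions = interactions.sort_values("timestamp")
    bounds = [int(len(interactions) * i / n_windows) for i in range(n_windows + 1)]
    rows = []
    for w in range(1, n_windows):
        train = interactions.iloc[:bounds[w]]               # past data only
        test = interactions.iloc[bounds[w]:bounds[w + 1]]   # the current window
        for name, algo in algorithms.items():
            rows.append({"window": w, "algorithm": name,
                         "score": evaluate(algo, train, test)})
    scores = pd.DataFrame(rows).pivot(index="window", columns="algorithm",
                                      values="score")
    # Rank 1 = best algorithm in that window; if the ranks shift from window
    # to window, a single aggregate score hides the real behaviour.
    return scores.rank(axis=1, ascending=False)

If such a rank table shows one algorithm on top in early windows and another later in the system's life cycle, a single averaged score would report a near-tie and miss the story entirely.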
Citation metadata:
Title: A Replication & Reproduction of Time-Dependent Evaluation of Recommender Systems
Authors: Nina Kühn, Hannes Wunderlich
Year: 2025
Venue: Machine Learning Praktikum, University of Siegen
@techreport{kuehn2025timedep,
title = {A Replication \& Reproduction of Time-Dependent Evaluation of Recommender Systems},
author = {Nina Kühn and Hannes Wunderlich},
year = {2025},
institution = {Machine Learning Praktikum, University of Siegen}
}
5. Temporal Dynamics in Practice: Scaling Time-Aware Evaluation
Fiona Nlend, Florian Paesler, Jonas Reising
Building on the same topic of time-dependent evaluation, this second team evaluated recommender performance over time using ten datasets (including Food.com, BeerAdvocate, Amazon Software) and seven algorithms.
They confirmed the volatility of rankings, especially in broad, long-running datasets like Amazon Electronics. nDCG fluctuated more than recall, suggesting metric choice also matters. Their conclusion: temporal robustness is a critical dimension of recommender quality—and deserves more research attention.
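One simple way to quantify that kind of metric volatility is a coefficient of variation across time windows, sketched below under the assumption of a long-format results table with window, algorithm, metric, and value columns (the column names are made up for illustration).

import pandas as pd

def metric_volatility(results: pd.DataFrame) -> pd.Series:
    """Coefficient of variation of each metric across time windows, averaged
    over algorithms; larger values mean less temporal stability."""
    per_algo = (results.groupby(["metric", "algorithm"])["value"]
                       .agg(lambda v: v.std() / v.mean()))
    return per_algo.groupby("metric").mean().sort_values(ascending=False)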
Citation metadata:
Title: Temporal Evaluation of Recommender Algorithms: A Replication Study
Authors: Fiona Nlend, Florian Paesler, Jonas Reising
Year: 2025
Venue: Machine Learning Praktikum, University of Siegen
@techreport{nlend2025temporal,
title = {Temporal Evaluation of Recommender Algorithms: A Replication Study},
author = {Fiona Nlend and Florian Paesler and Jonas Reising},
year = {2025},
institution = {Machine Learning Praktikum, University of Siegen}
}
6. Saving Compute Without Sacrificing Accuracy: e-Fold Cross-Validation
Nick Petker
Can we save energy by skipping parts of cross-validation if the performance estimate has already stabilized? Nick Petker replicated the “e-Fold Cross-Validation” method that does exactly that.
Using 14 algorithms on six datasets, he found that the method stops after 5.88 of 10 folds on average, saving around 40% of the computation. In over 95% of cases, the error remained within the 95% confidence interval of the full CV result. The method is robust but sensitive to parameter tuning: overly aggressive settings can undermine reliability.
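The early-stopping idea can be sketched as follows. The concrete stopping rule here, a tolerance on the running mean plus a patience counter, is a simplified stand-in for the criterion in the original paper, and X and y are assumed to be NumPy arrays.

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def e_fold_cv(model, X, y, score_fn, k=10, tol=0.01, patience=2):
    """Run k-fold CV but stop early once the running mean score stabilizes."""
    scores, stable = [], 0
    splits = KFold(n_splits=k, shuffle=True, random_state=42).split(X)
    for used, (train_idx, test_idx) in enumerate(splits, start=1):
        fitted = clone(model).fit(X[train_idx], y[train_idx])
        scores.append(score_fn(y[test_idx], fitted.predict(X[test_idx])))
        if used >= 2:
            # How much did the running mean move when this fold was added?
            change = abs(np.mean(scores) - np.mean(scores[:-1]))
            stable = stable + 1 if change < tol else 0
        if stable >= patience:     # estimate has settled -> skip remaining folds
            break
    return float(np.mean(scores)), used   # estimated score and folds actually run

The tol and patience knobs illustrate exactly the parameter sensitivity noted above: set them too loosely and the procedure stops before the estimate has genuinely settled.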
Citation metadata:
Title: A Reproduction Study and Evaluation of E-Fold Cross-Validation
Author: Nick Petker
Year: 2025
Venue: Machine Learning Praktikum, University of Siegen
@techreport{petker2025efold,
title = {A Reproduction Study and Evaluation of E-Fold Cross-Validation},
author = {Nick Petker},
year = {2025},
institution = {Machine Learning Praktikum, University of Siegen}
}
7. Why Reproducibility Projects Are a Perfect Fit for a Machine‑Learning Praktikum
There is no better playground for budding machine‑learning engineers than a reproducibility study. First, it flips the usual classroom dynamic: instead of blindly following polished tutorials, students must read between the lines of real papers, reverse‑engineer half‑documented code, and figure out where hidden assumptions lurk. That detective work sharpens critical reading skills far more than any lecture on “how to write related work.”
Second, reproducibility projects touch every layer of the ML stack. Students wrangle data, re‑implement baselines, configure GPUs, tune hyper‑parameters, manage random seeds, track experiments, and—when inevitable discrepancies pop up—debug them systematically. It is end‑to‑end engineering, not siloed homework problems. By semester’s end, teams know Git, CI pipelines, and experiment‑tracking tools because their projects would not run without them.
Third, the assignment forces a healthy scepticism that is essential for scientific maturity. When a replication fails, students have to ask why: is the original claim overstated, did we miss a preprocessing step, or does the result hinge on a lucky random split? Navigating these questions trains statistical thinking better than any canned dataset ever could.
Fourth, publishing every artefact on arXiv or OSF.io turns the course into a live contribution to the research community. Students experience the thrill—and vulnerability—of releasing code for strangers to inspect. They learn about licences, persistent identifiers, and how open science accelerates progress. Their reports may even become the first citations on their future CVs.
Finally, reproducibility work is inherently collaborative. One teammate hunts down obscure library versions, another writes evaluation scripts, a third polishes the narrative. Success is impossible without communication and shared ownership—exactly the soft skills graduates need in industry and academia alike.
In short, a reproducibility study compresses the realities of real‑world machine learning into a single, semester‑long quest: equal parts puzzle, engineering sprint, and scientific dialogue. That is why it remains the crown jewel of our Machine Learning Praktikum—and why our students leave the course not just as model builders, but as responsible, open‑science practitioners.