We are delighted to announce that the 27th Irish Conference on Artificial Intelligence and Cognitive Science (AICS) has accepted all four of our submissions. We are particularly proud that three of the four resulted from the work of Bachelor's and Master's students whom we supervised this year.
The four publications are as follows (pre-prints will follow soon):
Multi-stream Data Analytics for Enhanced Performance Prediction in Fantasy Football
Nicholas Bonello; Joeran Beel; Seamus Lawless; Jeremy Debattista
Fantasy Premier League (FPL) performance predictors tend to base their algorithms purely on historical statistical data. The main problem with this approach is that external factors such as injuries, managerial decisions, and other tournament match statistics can never be factored into the final predictions. In this paper, we present a new method for predicting future player performances by automatically incorporating human feedback into our model. By combining statistical data, such as previous performances and upcoming fixture difficulty ratings, with betting market analysis and the opinions of the general public and experts alike, gathered from social media and web articles, we can improve our understanding of who is likely to perform well in upcoming matches. When tested on the English Premier League 2018/19 season, the model outperformed regular statistical predictors by over 300 points, an average of 11 points per week, ranking within the top 0.5% of players (rank 30,000 out of over 6.5 million players).
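To make the multi-stream idea concrete, here is a minimal sketch of blending a historical-statistics stream with external signals. The feature names, weights, and adjustments below are hypothetical illustrations, not the model from the paper:

```python
# Illustrative sketch only: blend a historical-points stream with two
# external streams (aggregated opinion sentiment and fixture difficulty).
# All weights and scales are made up for illustration.

def predict_points(recent_points, sentiment, fixture_difficulty):
    """Estimate a player's expected FPL points for the next gameweek.

    recent_points: list of points scored in recent gameweeks
    sentiment: aggregated public/expert opinion in [-1, 1]
    fixture_difficulty: FPL fixture difficulty rating (1 easy .. 5 hard)
    """
    baseline = sum(recent_points) / len(recent_points)
    sentiment_adjust = 2.0 * sentiment       # opinion shifts the estimate
    fixture_adjust = 3 - fixture_difficulty  # easy fixtures raise it
    return baseline + sentiment_adjust + fixture_adjust

# A player averaging 6 points, with positive buzz and an easy fixture:
print(predict_points([6, 4, 8], 0.5, 2))  # -> 8.0
```

The point of the sketch is the structure: each stream contributes an adjustment on top of the purely historical baseline, which is exactly what a statistics-only predictor cannot do.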
GIANT: The 1-Billion Annotated Synthetic Bibliographic-Reference-String Dataset for Deep Citation Parsing
Mark Grennan; Martin Schibel; Andrew Collins; Joeran Beel
Extracting and parsing reference strings from research articles is a challenging task. State-of-the-art tools like GROBID apply rather simple machine learning models such as conditional random fields (CRF). Recent research has shown the high potential of deep learning for reference string parsing. The challenge with deep learning, however, is that the training step requires enormous amounts of labelled data, which do not exist for reference string parsing. Creating such a large dataset manually, through human labour, hardly seems feasible. Therefore, we created GIANT, a large dataset of 991,411,100 XML-labelled reference strings. The strings were automatically created from 677,000 entries from CrossRef, 1,500 citation styles in the Citation Style Language, and the citation processor citeproc-js. GIANT can be used to train machine learning models, particularly deep learning models, for citation parsing. While we have not yet tested GIANT for training such models, we hypothesise that the dataset will significantly improve the accuracy of citation parsing. The dataset and the code to create it are freely available at https://github.com/BeelGroup/.
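The core trick behind GIANT is that rendering a bibliographic record through a citation style yields a reference string whose field boundaries are known in advance, so perfect XML labels come for free. A minimal sketch, assuming a single hard-coded APA-like style (GIANT itself uses CrossRef records, 1,500 CSL styles, and citeproc-js):

```python
# Sketch of synthetic labelled-reference generation. The style template
# and the sample record below are hypothetical stand-ins.

def render_labeled(record):
    """Render one bibliographic record as an XML-labelled reference string."""
    return (
        f"<author>{record['author']}</author> "
        f"(<year>{record['year']}</year>). "
        f"<title>{record['title']}</title>. "
        f"<journal>{record['journal']}</journal>."
    )

record = {
    "author": "Grennan, M.",
    "year": 2019,
    "title": "GIANT",
    "journal": "AICS",
}
print(render_labeled(record))
# -> <author>Grennan, M.</author> (<year>2019</year>). <title>GIANT</title>. <journal>AICS</journal>.
```

Strip the tags and you have the raw input string; keep them and you have the token-level labels a parser must learn to predict.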
Predicting the Outcome of Judicial Decisions made by the European Court of Human Rights
Conor O’Sullivan; Joeran Beel
In this study, machine learning models were constructed to predict whether judgements made by the European Court of Human Rights (ECHR) would lead to a violation of an Article in the Convention on Human Rights. The problem is framed as a binary classification task in which a judgement can lead to a “violation” or “non-violation” of a particular Article. Using auto-sklearn, an automated algorithm selection package, models were constructed for 12 Articles in the Convention. To train these models, textual features were obtained from the ECHR judgment documents using n-grams, word embeddings, and paragraph embeddings. Additional ECHR documents were incorporated into the models through the creation of a word embedding (echr2vec) and a doc2vec model. The features obtained using the echr2vec embedding provided the highest cross-validation accuracy for 5 of the Articles. The overall test accuracy across the 12 Articles was 68.83%. As far as we could tell, this is the first estimate of the accuracy of such machine learning models on a realistic test set, and it provides an important benchmark for future work. As a baseline, we used the simple heuristic of always predicting the most common outcome in the past. The heuristic achieved an overall test accuracy of 86.68%, 17.85 percentage points higher than the models. Again, this is seemingly the first study to include such a heuristic for comparison with model results. The higher accuracy achieved by the heuristic highlights the importance of including such a baseline.
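The baseline heuristic described in the abstract is simple enough to show in full: always predict whatever outcome was most frequent in past judgements. A minimal sketch with made-up toy data (the real study uses the ECHR case corpus):

```python
from collections import Counter

def most_common_outcome_baseline(past_outcomes, test_outcomes):
    """Predict the most frequent past label for every test case
    and return (prediction, resulting accuracy)."""
    prediction = Counter(past_outcomes).most_common(1)[0][0]
    correct = sum(1 for y in test_outcomes if y == prediction)
    return prediction, correct / len(test_outcomes)

# Toy example: 80% of past cases were violations, as were 70% of test cases.
past = ["violation"] * 8 + ["non-violation"] * 2
test = ["violation"] * 7 + ["non-violation"] * 3
pred, acc = most_common_outcome_baseline(past, test)
print(pred, acc)  # -> violation 0.7
```

Because ECHR outcomes are heavily skewed towards violations, even this trivial rule sets a high bar, which is exactly why the paper argues such baselines must be reported.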
NaïveRole: Author-Contribution Extraction and Parsing from Biomedical Manuscripts
Dominika Tkaczyk; Andrew Collins; Joeran Beel
Information about the contributions of individual authors to scientific publications is important for assessing authors’ achievements. Some biomedical publications have a short section that describes the authors’ roles and contributions. It is usually written in natural language, and hence author contributions cannot be trivially extracted in a machine-readable format. In this paper, we present 1) a statistical analysis of the roles in author contribution sections, and 2) NaïveRole, a novel approach to extracting structured author roles from such sections. For the first part, we used co-clustering techniques, as well as Open Information Extraction, to semi-automatically discover the popular roles within a corpus of 2,000 contribution sections from PubMed Central. The discovered roles were used to automatically build a training set for NaïveRole, our role extractor, which is based on Naïve Bayes. NaïveRole extracts roles with a micro-averaged precision of 0.68, recall of 0.48, and F1 of 0.57. It is, to the best of our knowledge, the first attempt to automatically extract author roles from research papers. This paper is an extended version of a poster previously published at JCDL 2018.
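To illustrate the classification step, here is a toy multinomial Naïve Bayes classifier that maps contribution phrases to roles. The training phrases and role labels are invented for illustration; the real NaïveRole training set is built from roles discovered in the 2,000 PubMed Central contribution sections:

```python
import math
from collections import Counter, defaultdict

# Hypothetical training data: (contribution phrase, role label).
TRAIN = [
    ("designed the experiments", "experiments"),
    ("performed the experiments", "experiments"),
    ("analyzed the data", "analysis"),
    ("carried out statistical analysis of the data", "analysis"),
    ("wrote the manuscript", "writing"),
    ("drafted and revised the manuscript", "writing"),
]

def train(data):
    """Count word frequencies per role for multinomial Naive Bayes."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in data:
        label_counts[label] += 1
        for w in text.split():
            word_counts[label][w] += 1
            vocab.add(w)
    return word_counts, label_counts, vocab

def predict(text, model):
    """Return the role with the highest log-posterior for the phrase."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.split():
            # Laplace smoothing so unseen words do not zero the posterior
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(TRAIN)
print(predict("analyzed all the data", model))  # -> analysis
```

Even this bare-bones version shows why Naïve Bayes suits the task: role vocabulary ("analyzed", "wrote", "designed") is highly discriminative, so simple word counts carry most of the signal.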