Machine Learning vs. Rules and Out-of-the-Box vs. Retrained: An Evaluation of Open-Source Bibliographic Reference and Citation Parsers

Our paper “Machine Learning vs. Rules and Out-of-the-Box vs. Retrained: An Evaluation of Open-Source Bibliographic Reference and Citation Parsers” got recently accepted and will be presented at Joint Conference on Digital Libraries 2018.

Abstract: Bibliographic reference parsing refers to extracting machine-readable metadata, such as the names of the authors, the title, or journal name, from bibliographic reference strings. Many approaches to this problem have been proposed so far, including regular expressions, knowledge bases and supervised machine learning. Many open source reference parsers based on various algorithms are also available. In this paper, we apply, evaluate and compare ten reference parsing tools in a specific business use case. The tools are Anystyle-Parser, Biblio, CERMINE, Citation, Citation-Parser, GROBID, ParsCit, PDFSSA4MET, Reference Tagger and Science Parse, and we compare them in both their out-of-the-box versions and versions tuned to the project-specific data. According to our evaluation, the best performing out-of-the-box tool is GROBID (F1 0.89), followed by CERMINE (F1 0.83) and ParsCit (F1 0.75). We also found that even though machine learning-based tools and tools based on rules or regular expressions achieve on average similar precision (0.77 for ML-based tools vs. 0.76 for non-ML-based tools), applying machine learning-based tools results in a recall three times higher than in the case of non-ML-based tools (0.66 vs. 0.22). Our study also confirms that tuning the models to the task-specific data results in the increase in the quality. The retrained versions of reference parsers are in all cases better than their out-of-the-box counterparts; for GROBID F1 increased by 3% (0.92 vs. 0.89), for CERMINE by 11% (0.92 vs. 0.83), and for ParsCit by 16% (0.87 vs. 0.75).

The results of the comparison of 10 open-source bibliographic reference parsers

GIANT 2019, Reference Parsing, Deep Citation Parsing, Dataset, Cover

Information Extraction

GIANT: The 1-Billion Annotated Synthetic Bibliographic-Reference-String Dataset for Deep Citation Parsing [pre-print]

This is the pre-print of: Mark Grennan, Martin Schibel, Andrew Collins, and Joeran Beel. “GIANT: The 1-Billion Annotated Synthetic Bibliographic-Reference-String Dataset for Deep Citation Parsing.” In 27th AIAI Irish Conference on Artificial Intelligence and Cognitive Read more…

The workflow of author contributions extraction

Information Extraction

Who Did What? Identifying Author Contributions in Biomedical Publications using Naïve Bayes

Our paper “Who Did What? Identifying Author Contributions in Biomedical Publications using Naïve Bayes” got recently accepted and will be presented at Joint Conference on Digital Libraries 2018. Abstract: Creating scientific publications is a complex Read more…

Docear

Update for Docear’s “Google Scholar Parser” Library to Fetch Metadata for PDF files

Update 2018-07-31: We updated the Dropbox Link Google Scholar recently changed its layout, and as a consequence, Docear couldn’t fetch metadata anymore from Google Scholar for PDF files. Fortunately, one of our users (“Silberzwiebel”) adjusted Docear’s Read more…

Machine Learning vs. Rules and Out-of-the-Box vs. Retrained: An Evaluation of Open-Source Bibliographic Reference and Citation Parsers

Published by dominika_tkaczyk on 25th May 2018

0 Comments

Leave a Reply Cancel reply

GIANT: The 1-Billion Annotated Synthetic Bibliographic-Reference-String Dataset for Deep Citation Parsing [pre-print]

Who Did What? Identifying Author Contributions in Biomedical Publications using Naïve Bayes

Update for Docear’s “Google Scholar Parser” Library to Fetch Metadata for PDF files

Machine Learning vs. Rules and Out-of-the-Box vs. Retrained: An Evaluation of Open-Source Bibliographic Reference and Citation Parsers

Published by dominika_tkaczyk on 25th May 2018

0 Comments

Leave a Reply Cancel reply

Related Posts

GIANT: The 1-Billion Annotated Synthetic Bibliographic-Reference-String Dataset for Deep Citation Parsing [pre-print]

Who Did What? Identifying Author Contributions in Biomedical Publications using Naïve Bayes

Update for Docear’s “Google Scholar Parser” Library to Fetch Metadata for PDF files