ABSTRACT

Citation parsing, particularly with deep neural networks, suffers from a lack of training data as available datasets typically contain only a few thousand training instances. Manually labelling citation strings is very time-consuming, hence synthetically created training data could be a solution. However, as of now, it is unknown if synthetically created reference-strings are suitable to train machine learning algorithms for citation parsing. To find out, we train Grobid, which uses Conditional Random Fields, with a) human-labelled reference strings from ‘real’ bibliographies and b) synthetically created reference strings from the GIANT dataset. We find that both synthetic and organic reference strings are equally suited for training Grobid (F1 = 0.74). We additionally find that retraining Grobid has a notable impact on its performance, for both synthetic and real data (+30% in F1). Having as many types of labelled fields as possible during training also improves effectiveness, even if these fields are not available in the evaluation data (+13.5% F1). We conclude that synthetic data is suitable for training (deep) citation parsing models. We further suggest that in future evaluations of reference parsers both evaluation data similar and dissimilar to the training data should be used for more meaningful evaluations.

Original Publication: https://www.aclweb.org/anthology/2020.wosp-1.4/

Meta Data

BibTeX

@InProceedings{Grennan2020,
  author    = {Grennan, Mark and Beel, Joeran},
  booktitle = {Proceedings of the 8th International Workshop on Mining Scientific Publications},
  title     = {Synthetic vs. Real Reference Strings for Citation Parsing, and the Importance of Re-training and Out-Of-Sample Data for Meaningful Evaluations: Experiments with {GROBID}, {GIANT} and {CORA}},
  pages     = {27--35},
  publisher = {Association for Computational Linguistics},
  url       = {https://www.aclweb.org/anthology/2020.wosp-1.4},
  abstract  = {Citation parsing, particularly with deep neural networks, suffers from a lack of training data as available datasets typically contain only a few thousand training instances. Manually labelling citation strings is very time-consuming, hence synthetically created training data could be a solution. However, as of now, it is unknown if synthetically created reference-strings are suitable to train machine learning algorithms for citation parsing. To find out, we train Grobid, which uses Conditional Random Fields, with a) human-labelled reference strings from {`}real{'} bibliographies and b) synthetically created reference strings from the GIANT dataset. We find that both synthetic and organic reference strings are equally suited for training Grobid (F1 = 0.74). We additionally find that retraining Grobid has a notable impact on its performance, for both synthetic and real data (+30{\%} in F1). Having as many types of labelled fields as possible during training also improves effectiveness, even if these fields are not available in the evaluation data (+13.5{\%} F1). We conclude that synthetic data is suitable for training (deep) citation parsing models. We further suggest that in future evaluations of reference parsers both evaluation data similar and dissimilar to the training data should be used for more meaningful evaluations.},
  address   = {Wuhan, China},
  month     = {05 } # aug,
  year      = {2020},
}

CCS CONCEPTS

• Information Retrieval • Information Extraction • Document Analysis

KEYWORDS

Reference Parsing, Information Extraction, Citation Analysis

ACM Reference format

Mark Grennan and Joeran Beel. 2020. Synthetic vs. Real Reference Strings for Citation Parsing, and the Importance of Re-training and Out-Of-Sample Data for Meaningful Evaluations: Experiments with GROBID, GIANT and CORA. 8th International Workshop on Mining Scientific Publications (WOSP), co-located with the ACM/IEEE Joint Conference on Digital Libraries (JCDL).

1 Introduction[1]

Accurate citation data is needed by publishers, academic search engines, citation and research-paper recommender systems, and others to calculate impact metrics [3, 21], rank search results [5, 6], generate recommendations [4, 11–13, 22, 25], and support other applications, e.g. in the field of bibliometric-enhanced information retrieval [8]. Citation data is typically parsed from unstructured non-machine-readable text, which often originates from bibliographies found in PDF files on the Web (Figure 1). To facilitate the parsing process, a dozen [38] open source tools were developed including ParsCit [10], Grobid [26, 27], and Cermine [35], with Grobid typically being considered the most effective one [38]. There is ongoing research that continuously leads to novel citation-parsing algorithms including deep learning algorithms [1, 7, 30–33, 41] and meta-learned ensembles [39, 40].

Most citation parsing tools apply supervised machine learning [38]. Consequently, labelled data is required to train the algorithms. However, training data is rare compared to other disciplines where datasets may have millions of instances. To the best of our knowledge, existing citation-parsing datasets typically contain a few thousand instances and are domain specific (Table 1). This may be sufficient for traditional machine learning algorithms but not for deep learning, which shows a lot of potential for citation parsing [1, 30–33, 41]. Even for traditional machine learning, existing datasets may not be ideal as they often lack diversity in terms of citation styles.

Figure 1: Illustration of a ‘Bibliography’ with four ‘Reference Strings’, each with a number of ‘Fields’. A reference parser receives a reference string as input, and outputs labelled fields, e.g. <author>C. Lemke</author> … <title>Metalearning: a survey …</title> …

Recently, we published GIANT, a synthetic dataset with nearly 1 billion annotated reference strings [19]. More precisely, the dataset contains 677,000 unique reference strings, each in around 1,500 citation styles (e.g. APA, Harvard, ACM). The dataset was synthetically created. This means that the reference strings are not ‘real’ reference strings extracted from ‘real’ bibliographies. Instead, we downloaded 677,000 references in XML format from CrossRef, and used Citeproc-JS [14] with 1,500 citation styles to convert the 677,000 references into a total of 1 billion annotated citation strings (1,500 * 677,000)[2].
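The idea behind such a synthetic dataset can be pictured with a minimal Python sketch. It is not the Citeproc-JS pipeline used for GIANT: the two style templates, the `render` and `synthesize` helpers, and the sample record (including its placeholder year) are hypothetical stand-ins. They only show how one metadata record yields several differently formatted strings whose field labels are known by construction.

```python
# Toy illustration only: GIANT itself was produced with Citeproc-JS and around
# 1,500 citation styles. The record and both "styles" below are made up.

RECORD = {
    "author": "C. Lemke",
    "title": "Metalearning: a survey",
    "date": "2015",  # placeholder value for illustration
}

# Each hypothetical style is an ordered list of (field, prefix, suffix) triples.
STYLES = {
    "toy-author-date": [("author", "", " "), ("date", "(", "). "), ("title", "", ".")],
    "toy-title-first": [("title", "", ", "), ("author", "", ", "), ("date", "", ".")],
}


def render(record, style, annotate):
    """Render one record in one style, optionally wrapping each field in tags."""
    parts = []
    for field, prefix, suffix in style:
        value = record[field]
        if annotate:
            value = f"<{field}>{value}</{field}>"
        parts.append(prefix + value + suffix)
    return "".join(parts)


def synthesize(record, styles):
    """Return one (plain string, annotated string) pair per style."""
    return [
        (render(record, style, annotate=False), render(record, style, annotate=True))
        for style in styles.values()
    ]


pairs = synthesize(RECORD, STYLES)
for plain, labelled in pairs:
    print(plain)
    print(labelled)
```

Because the labels come from the metadata rather than from a human annotator, the number of training strings grows with the number of records times the number of styles, which is what makes a dataset of this size feasible.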

We wonder how suitable a synthetic dataset like GIANT is to train machine learning models for citation parsing. Therefore, we pursue the following research questions:

How will citation parsing perform when trained on synthetic reference strings, compared to being trained on real reference strings?
To what extent does citation-parsing (based on machine learning) depend on the amount of training data?
How important is re-training a citation parser for the specific data it should be used on? Or, in other words, how does performance vary depending on whether the test data differs from the training data?
Is it important to have many different fields (author, year, …) for training, even if the fields are not available in the final data?

Potentially, synthetic data could lead to higher citation parsing performance, as synthetic datasets may contain more data and more diverse data (more citation styles). Synthetic datasets like GIANT could also advance (deep) citation parsing, which currently suffers from a lack of ‘real’ annotated bibliographies at large scale.

2 Related Work

We are aware of eleven datasets with annotated reference strings, the most popular ones probably being Cora and CiteSeer, and authors also often use variations of PubMed (Table 1). Several datasets are from the same authors, and many datasets include data from other datasets. For instance, the Grobid dataset is based on some data from Cora, PubMed, and others [28]. New data is continuously added to Grobid’s dataset. As such, there is not “the one” Grobid dataset. GIANT is the largest and most diverse dataset in terms of citation styles, but GIANT is, as mentioned, synthetically created.

Cora is one of the most widely used datasets but has potential shortcomings [2, 10, 31]. Cora is homogeneous with citation strings only from Computer Science. It is relatively small and only has labels for “coarse-grained fields” [2]. For example, the author field does not label each author separately. Prasad et al. conclude that a “shortcoming of [citation parsing research] is that the evaluations have been largely limited to the Cora dataset, which is […] unrepresentative of the multilingual, multidisciplinary scholastic reality” [31].

Dataset Name	# Instances	Domain
Cora [29]	1,295	Computer Science
CiteSeer [16]	1,563	Artificial Intelligence
Umass [2]	1,829	STEM
FLUX-CiM CS [20]	300	Computer Science
FLUX-CiM HS [20]	2,000	Health Science
GROBID [26–28]	6,835	Multi-Domain (Cora, arXiv, PubMed…)
PubMed (Central) [9, 17]	Varies	Biomedical
GROTOAP2 (Cermine) [35–37]	6,858	Biomedical & Computer Science
CS-SW [20]	578	Semantic Web Conferences
Venice [33]	40,000	Humanities
GIANT [19]	991 million	Multi-Domain (~1,500 Citation Styles)

Table 1: List of Citation Datasets

3 Methodology

To compare the effectiveness of synthetic vs. real bibliographies, we used Grobid. Grobid is the most effective citation parsing tool [38] and, in our experience, the easiest to use. Grobid uses conditional random fields (CRF) as its machine learning algorithm. Of course, in the long run, it would be good to conduct our experiments with different machine learning algorithms, particularly deep learning algorithms, but for now we concentrate on one tool and algorithm. Given that all major citation-parsing tools, including Grobid, Cermine and ParsCit, use CRF, we consider this sufficient for an initial experiment. Also, we attempted to re-train Neural ParsCit [31] but failed to do so, which indicates that the ease of use of the rather new deep-learning methods is not yet as advanced as that of established citation parsing tools like Grobid.

We trained Grobid, or more precisely its CRF, on two datasets. Train_Grobid denotes a model trained on 70% (5,460 instances) of the dataset that Grobid uses to train its out-of-the-box version. We slightly modified the dataset, i.e. we removed labels for ‘pubPlace’, ‘note’ and ‘institution’ as this information is not contained in GIANT, and hence a model trained on GIANT could not identify these labels[3]. Train_GIANT denotes the model trained on a random sample (5,460 instances) of GIANT’s 991,411,100 labeled reference strings. Our expectation was that both models would perform similarly, or, ideally, Train_GIANT would even outperform Train_Grobid.

To analyze how the amount of training data affects performance, we additionally trained Train_GIANT on 1k, 3k, 5k, 10k, 20k, and 40k instances of GIANT.

We evaluated all models on four datasets. Eval_Grobid comprises the remaining 30% of Grobid’s dataset (2,340 reference strings). Eval_Cora denotes the Cora dataset, which comprises, after some cleaning, 1,148 labelled reference strings from the computer science domain. Eval_GIANT comprises 5,000 random reference strings from GIANT.
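The 70/30 partition of Grobid’s dataset can be sketched as follows. This assumes a simple shuffled split; the text does not state how the partition was drawn, and `split_70_30`, the seed, and the stand-in IDs are illustrative only.

```python
import random


def split_70_30(instances, seed=0):
    """Shuffle a copy of the instances and cut it into 70% train / 30% eval."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


# Stand-in IDs; the reported sizes imply 5,460 + 2,340 = 7,800 reference strings.
reference_ids = range(5460 + 2340)
train, evaluation = split_70_30(reference_ids)
print(len(train), len(evaluation))
```

Fixing the seed keeps the two subsets disjoint and reproducible across runs, so every model is compared on the same held-out strings.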

These three evaluation datasets are potentially not ideal as evaluations are likely biased towards one of the trained models. Evaluating the models on Eval_GIANT likely favors Train_GIANT since the data for both Train_GIANT and Eval_GIANT is highly similar, i.e. it originates from the same dataset. Similarly, evaluating the models on Eval_Grobid likely favors Train_Grobid as Train_Grobid was trained on 70% of the original Grobid dataset and this 70% of the data is highly similar to the remaining 30% that we used for the evaluation. Also, the Cora dataset is somewhat biased, because Grobid’s dataset contains parts of Cora. We therefore created another evaluation dataset.

Eval_WebPDF is our ‘unbiased’ dataset with 300 manually annotated citation strings from PDFs found on the Web. To create Eval_WebPDF, we chose twenty different words from the homepages of some universities[4]. Then, we used each of the twenty words as a search term in Google Scholar. From each of these searches, we downloaded the first four available PDFs. From each PDF, we randomly chose four citation strings. This gave approximately sixteen citation strings for each of the twenty keywords and in total, there were 300 citation strings. We consider this dataset to be a realistic, though relatively small, dataset for citation parsing in the context of a web-based academic search engine or recommender system.

We measure performance of all models with precision, recall, F1 (Micro Average) and F1 (Macro Average) on both field level and token level. We only report ‘F1 Macro Average on field level’ as all metrics led to similar results.
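To make the difference between the two averages concrete, here is a minimal Python sketch of field-level scoring. It is not our evaluation script: the function name, the toy data, and the exact-match rule (a predicted field counts as correct only if its text equals the gold text) are assumptions made for illustration.

```python
from collections import Counter


def field_level_scores(gold, predicted):
    """Return per-field (precision, recall, F1), macro F1 and micro F1.

    gold, predicted: lists of dicts mapping a field label to its text, one dict
    per reference string. Matching is exact string equality (an assumed rule).
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold_fields, pred_fields in zip(gold, predicted):
        for label, text in pred_fields.items():
            if gold_fields.get(label) == text:
                tp[label] += 1
            else:
                fp[label] += 1
        for label, text in gold_fields.items():
            if pred_fields.get(label) != text:
                fn[label] += 1

    def prf(t, p, n):
        precision = t / (t + p) if t + p else 0.0
        recall = t / (t + n) if t + n else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        return precision, recall, f1

    labels = sorted(set(tp) | set(fp) | set(fn))
    per_field = {label: prf(tp[label], fp[label], fn[label]) for label in labels}
    # Macro: average the per-field F1 values, so every field type weighs the same.
    macro_f1 = sum(f1 for _, _, f1 in per_field.values()) / len(per_field) if per_field else 0.0
    # Micro: pool all counts first, so frequent fields dominate.
    micro_f1 = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))[2]
    return per_field, macro_f1, micro_f1


# Made-up example: three fields are always right, one rare field is missed.
gold = [
    {"author": "A. Author", "title": "First title", "date": "2019", "publisher": "Some Press"},
    {"author": "B. Author", "title": "Second title", "date": "2020"},
]
predicted = [
    {"author": "A. Author", "title": "First title", "date": "2019"},
    {"author": "B. Author", "title": "Second title", "date": "2020"},
]
per_field, macro_f1, micro_f1 = field_level_scores(gold, predicted)
print(round(macro_f1, 3), round(micro_f1, 3))
```

In this toy case the single missed ‘publisher’ field lowers the macro average much more than the micro average, because the macro average gives a rare field the same weight as a frequent one.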

All source code, data (including the WebPDF dataset), images, and an Excel sheet with all results (including precision, recall, and token-level results) are available on GitHub: https://github.com/BeelGroup/GIANT-The-1-Billion-Annotated-Synthetic-Bibliographic-Reference-String-Dataset/.

4 Results

The models trained on Grobid (Train_Grobid) and GIANT (Train_GIANT) perform as expected when evaluated on the three ‘biased’ datasets Eval_Grobid, Eval_Cora and Eval_GIANT (Figure 2). When evaluated on Eval_Grobid, Train_Grobid outperforms Train_GIANT by 35% with an F1 of 0.93 vs. 0.69. When evaluated on Eval_GIANT, results are almost exactly the opposite: Train_GIANT outperforms Train_Grobid by 32% with an F1 of 0.91 vs. 0.69. On Eval_Cora, the difference is less strong but still notable: Train_Grobid outperforms Train_GIANT by 19% with an F1 of 0.74 vs. 0.62. This is not surprising as Grobid’s training data includes some Cora data. While these results generally might not be surprising, they imply that both synthetic and real data lead to very similar results and ‘behave’ similarly when used to train models that are evaluated on data that is, or is not, similar to the training data.
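The percentages above are relative differences between two F1 scores, not differences in percentage points. A short Python check (the helper name `relative_gain` is ours, introduced for illustration) reproduces them from the quoted F1 values:

```python
def relative_gain(better, worse):
    """Relative improvement of `better` over `worse`, in percent."""
    return (better - worse) / worse * 100


# (evaluation set, higher F1, lower F1) as quoted in the text.
comparisons = [
    ("Eval_Grobid", 0.93, 0.69),
    ("Eval_GIANT", 0.91, 0.69),
    ("Eval_Cora", 0.74, 0.62),
]
gains = {name: round(relative_gain(hi, lo)) for name, hi, lo in comparisons}
print(gains)
```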

Also interesting is the evaluation on the WebPDF dataset. The model trained on synthetic data (Train_GIANT) and the model trained on real data (Train_Grobid) perform alike with an F1 of 0.74 each (Figure 2)[5]. In other words, synthetic and human-labelled data perform equally well for training our machine learning models.

Figure 2: F1 of the two models (Train_Grobid and Train_GIANT) on the four evaluation datasets.

Looking at the data in more detail reveals that some fields are easier to parse than others (Figure 3). For instance, the ‘date’ field (i.e. year of publication) has a consistently high F1 across all models and evaluation datasets (min=0.86; max=1.0). The ‘author’ field also has a high F1 throughout all experiments (min=0.75; max=0.99). In contrast, parsing ‘booktitle’ and ‘publisher’ seems to strongly benefit from training based on samples similar to the evaluation dataset. When the evaluation data is similar to the training data (e.g. Train_GIANT on Eval_GIANT or Train_Grobid on Eval_Grobid), F1 is relatively high (typically above 0.7). If the evaluation data is different (e.g. Train_GIANT on Eval_Grobid), F1 is low (0.15 and 0.16 for Train_Grobid and Train_GIANT respectively on Eval_WebPDF). The difference in F1 for parsing the book-title is around factor 6.5, with an F1 of 0.97 (Train_Grobid) and 0.15 (Train_GIANT) respectively when evaluated on Eval_Grobid.

Figure 3: F1 for different fields (title, author, …), evaluation dataset and training data.

Similarly, F1 for parsing the book-title on Eval_GIANT differs by around factor 3 with an F1 of 0.75 (Train_GIANT) and 0.27 (Train_Grobid) respectively. While it is well known, and quite intuitive, that some fields are harder to parse than others, we are the first to show that the fields also differ in how strongly their accuracy depends on whether the model was trained on data similar to the evaluation data.

Figure 4: Performance (F1) of Train_GIANT on the four evaluation datasets, by the number of training instances.

In a side experiment, we trained a new model Train_Grobid+ with additional labels for institution, note and pubPlace (those we removed for the other experiments). Train_Grobid+ outperformed Train_Grobid notably with an F1 of 0.84 vs. 0.74 (+13.5%) when evaluated on Eval_WebPDF. This indicates that the more fields are available for training, the better the parsing of all fields becomes, even if the additional fields are not in the evaluation data. This finding seems plausible to us and confirms statements by Anzaroot and McCallum [2] but, to the best of our knowledge, we are the first to quantify the benefit. It is worth noting that citation parsers do not always use the same fields (Table 2). For instance, Cermine extracts relatively few fields, but is one of the few tools that extract the DOI field.

Our assumption that more training data would generally lead to better parsing performance – and hence GIANT could be useful for training standard machine learning algorithms – was not confirmed. Increasing training data from 1,000 to 10,000 instances improved F1 by 6% on average over the four evaluation datasets (Figure 4). More precisely, increasing data from 1,000 to 3,000 instances improved F1, on average, by 2.4%; increasing from 3,000 to 5,000 instances improved F1 by another 2%; increasing further to 10,000 instances improved F1 by another 1.6%. However, increasing to 20,000 or 40,000 instances leads to no notable improvement, and in some cases even to a decline in F1 (Figure 4).
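As a quick check of how the stepwise gains add up to the overall figure, the following Python lines accumulate the three quoted increments. Treating the increments as additive is our reading of the numbers above; the 20k and 40k steps are left out because no figures are quoted for them.

```python
from itertools import accumulate

# Average F1 gains (in %) per increase in training size, as quoted in the text.
steps = [("1k -> 3k", 2.4), ("3k -> 5k", 2.0), ("5k -> 10k", 1.6)]

# Running total of the gain relative to the 1,000-instance model.
cumulative = [round(total, 1) for total in accumulate(gain for _, gain in steps)]
print(cumulative)
```

Each step contributes less than the one before it, which matches the flattening curve described for Figure 4.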

5 SUMMARY & DISCUSSION

In summary, both models – the one trained on synthetic data (GIANT) and the one trained on ‘real’ reference strings annotated by humans (Grobid) – performed very similarly. On the main evaluation dataset (WebPDF) both models achieved an F1 of 0.74. Similarly, if a model was evaluated on data different from its training data, F1 was between 0.6 and 0.7. If a model was evaluated on data similar to the training data, F1 was above 0.9 (+30%). F1 only increased up to a training size of around 10,000 instances (+6% compared to 1,000 instances). Additional fields (e.g. pubPlace) in the training data increased F1 notably (+13.5%), even if these additional fields were not in the evaluation data.

These results lead us to the following conclusions. First, there seems to be little benefit in using synthetic data (i.e. GIANT) for training traditional machine learning models (i.e. conditional random fields). The existing datasets with a few thousand training instances seem sufficient.

Citation Parser	Approach	Extracted Fields
Biblio	Regular Expressions	author, date, editor, genre, issue, pages, publisher, title, volume, year
BibPro	Template Matching	author, title, venue, volume, issue, page, date, journal, booktitle, techReport
CERMINE	Machine Learning (CRF)	author, issue, pages, title, volume, year, DOI, ISSN
GROBID	Machine Learning (CRF)	authors, booktitle, date, editor, issue, journal, location, note, pages, publisher, title, volume, web, institution
ParsCit	Machine Learning (CRF)	author, booktitle, date, editor, institution, journal, location, note, pages, publisher, tech, title, volume
Neural ParsCit	Deep Learning	author, booktitle, date, editor, institution, journal, location, note, pages, publisher, tech, title, volume

Table 2: The approach and extracted fields of six popular open-source citation parsing tools

Second, citation parsers should, if possible, be (re)trained on data that is similar to the data that should actually be parsed. Such a re-training increased performance by around 30% in our experiments. This finding may also explain why researchers often report excellent performance of their tools and approaches with e.g. F1 scores of over 0.9. These researchers typically evaluate their models on data highly similar to the training data. This might be considered a realistic scenario for those cases when re-training is possible. However, reporting such results creates unrealistic expectations for scenarios without the option to re-train, i.e. for users who just want to use a citation parser like Grobid out-of-the-box. Therefore, we propose that future evaluations of citation parsing algorithms should be conducted on at least two datasets: one dataset that is similar to the training dataset, and one out-of-sample dataset that differs from the training data.

Third, citation parsers should be trained with as many labelled field types as possible, even if these fields will not be in the data that should be parsed. Such a fine-grained training improved F1 by 13.5% in our experiments.

Fourth, having ten times as much training data (10,000 vs. 1,000) improved the parsing performance by 6%, without notable improvements beyond 10,000 instances. Annotating a few thousand instances should be feasible for many scenarios. Hence, businesses and organizations that want the maximum accuracy should annotate their own data for training as this likely will lead to large increases in accuracy (+30%, see conclusion 2).

Fifth, given how similarly synthetic and traditionally annotated data perform, synthetic data is likely suitable to train deep neural networks for citation parsing. This, of course, has yet to be shown empirically. However, if our assumption holds true, deep citation parsers could greatly benefit from synthetic data like GIANT.
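The two-dataset reporting proposed in the second conclusion could look like the following Python sketch. The `report` helper and its output keys are hypothetical; the F1 values are the ones reported in Section 4, with Eval_WebPDF serving as the out-of-sample dataset for both models.

```python
def report(model_name, f1_by_dataset, in_sample, out_of_sample):
    """Summarize in-sample vs. out-of-sample F1 for one model."""
    f1_in = f1_by_dataset[in_sample]
    f1_out = f1_by_dataset[out_of_sample]
    return {
        "model": model_name,
        "in_sample_f1": f1_in,
        "out_of_sample_f1": f1_out,
        "drop": round(f1_in - f1_out, 2),  # absolute F1 difference
    }


# F1 values reported in Section 4.
train_grobid = {"Eval_Grobid": 0.93, "Eval_WebPDF": 0.74}
train_giant = {"Eval_GIANT": 0.91, "Eval_WebPDF": 0.74}

summaries = [
    report("Train_Grobid", train_grobid, "Eval_Grobid", "Eval_WebPDF"),
    report("Train_GIANT", train_giant, "Eval_GIANT", "Eval_WebPDF"),
]
for summary in summaries:
    print(summary)
```

Reporting both numbers side by side shows the performance a user can expect with and without re-training, instead of only the more favorable in-sample score.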

For the future, we see the need to extend our experiments to different machine learning algorithms and datasets (e.g. unarXive [34] or CORE [23]). It would also be interesting to analyze if and to what extent synthetic data could improve related disciplines. This may include citation-string matching, i.e. analyzing whether two different reference strings refer to the same document [15], or the extraction of mathematical formulae [18] or titles [24] from scientific articles.

6 Acknowledgements

We are grateful for the support received from Martin Schibel, Andrew Collins and Dominika Tkaczyk in creating the GIANT dataset [19]. We would also like to acknowledge that this research was partly conducted with the financial support of the ADAPT SFI Research Centre at Trinity College Dublin. The ADAPT SFI Centre for Digital Media Technology is funded by Science Foundation Ireland through the SFI Research Centres Programme and is co-funded under the European Regional Development Fund (ERDF) through Grant #13/RC/2106.

REFERENCES

[1] An, D. et al. 2017. Citation Metadata Extraction via Deep Neural Network-based Segment Sequence Labeling. Proceedings of the 2017 ACM on Conference on Information and Knowledge Management (2017), 1967–1970.

[2] Anzaroot, S. and McCallum, A. 2013. A new dataset for fine-grained citation field extraction. ICML Workshop on Peer Reviewing and Publishing Models. (2013).

[3] Bakkalbasi, N. et al. 2006. Three options for citation tracking: Google Scholar, Scopus and Web of Science. Biomedical Digital Libraries. 3, (2006).

[4] Beel, J. et al. 2016. Research Paper Recommender Systems: A Literature Survey. International Journal on Digital Libraries. 4 (2016), 305–338.

[5] Beel, J. and Gipp, B. 2009. Google Scholar’s Ranking Algorithm: An Introductory Overview. Proceedings of the 12th International Conference on Scientometrics and Informetrics (ISSI’09) (Rio de Janeiro (Brazil), 2009), 230–241.

[6] Beel, J. and Gipp, B. 2009. Google Scholar’s Ranking Algorithm: The Impact of Citation Counts (An Empirical Study). Proceedings of the 3rd IEEE International Conference on Research Challenges in Information Science (RCIS’09) (Fez (Morocco), 2009), 439–446.

[7] Bhardwaj, A. et al. 2017. DeepBIBX: Deep Learning for Image Based Bibliographic Data Extraction. International Conference on Neural Information Processing (2017), 286–293.

[8] Cabanac, G. et al. 2020. Bibliometric-enhanced Information Retrieval (BIR) 10th Anniversary Workshop Edition. arXiv preprint arXiv:2001.10336. (2020).

[9] Canese, K. and Weis, S. 2013. PubMed: the bibliographic database. The NCBI Handbook [Internet]. 2nd edition. National Center for Biotechnology Information (US).

[10] Councill, I.G. et al. 2008. ParsCit: An open-source CRF reference string parsing package. Proceedings of LREC (2008), 661–667.

[11] Eto, M. 2019. Extended co-citation search: Graph-based document retrieval on a co-citation network containing citation context information. Information Processing & Management. 56, 6 (2019), 102046.

[12] Färber, M. et al. 2018. CITEWERTs: A System Combining Cite-Worthiness with Citation Recommendation. European Conference on Information Retrieval (2018), 815–819.

[13] Färber, M. and Jatowt, A. 2020. Citation Recommendation: Approaches and Datasets. arXiv preprint arXiv:2002.06961. (2020).

[14] Frank G. Bennett, J. 2011. The citeproc-js Citation Processor.

[15] Ghavimi, B. et al. 2019. An Evaluation of the Effect of Reference Strings and Segmentation on Citation Matching. International Conference on Theory and Practice of Digital Libraries (2019), 365–369.

[16] Giles, C.L. et al. 1998. CiteSeer: An automatic citation indexing system. Proceedings of the 3rd ACM conference on Digital libraries (1998), 89–98.

[17] Gollner, K. and Canese, K. 2017. PubMed: Redesigning citation data management. Journal Article Tag Suite Conference (JATS-Con) Proceedings 2017 [Internet] (2017).

[18] Greiner-Petter, A. et al. 2020. Discovering Mathematical Objects of Interest–A Study of Mathematical Notations. arXiv preprint arXiv:2002.02712. (2020).

[19] Grennan, M. et al. 2019. GIANT: The 1-Billion Annotated Synthetic Bibliographic-Reference-String Dataset for Deep Citation Parsing. 27th AIAI Irish Conference on Artificial Intelligence and Cognitive Science (2019), 101–112.

[20] Groza, T. et al. 2012. Reference information extraction and processing using random conditional fields. Information Technology and Libraries. 31, 2 (2012), 6–20.

[21] Jacso, P. 2008. Testing the calculation of a realistic h-index in Google Scholar, Scopus, and Web of Science for FW Lancaster. Library Trends. 56, 4 (2008), 784–815.

[22] Jia, H. and Saule, E. 2018. Graph Embedding for Citation Recommendation. arXiv preprint arXiv:1812.03835. (2018).

[23] Knoth, P. and Zdrahal, Z. 2012. CORE: three access levels to underpin open access. D-Lib Magazine. 18, 11/12 (2012).

[24] Lipinski, M. et al. 2013. Evaluation of Header Metadata Extraction Approaches and Tools for Scientific PDF Documents. Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries (JCDL’13) (2013), 385–386.

[25] Livne, A. et al. 2014. CiteSight: supporting contextual citation recommendation using differential search. Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval. (2014), 807–816.

[26] Lopez, P. 2013. GROBID, GitHub Repository. https://github.com/kermitt2/grobid/. (2013).

[27] Lopez, P. 2009. GROBID: Combining automatic bibliographic data recognition and term extraction for scholarship publications. International conference on theory and practice of digital libraries (2009), 473–474.

[28] Lopez, P. 2020. Training data query #535. GitHub https://github.com/kermitt2/grobid/issues/535. (2020).

[29] McCallum, A. 2017. Cora dataset. https://people.cs.umass.edu/~mccallum/data.html. (2017).

[30] Nasar, Z. et al. 2018. Information extraction from scientific articles: a survey. Scientometrics. 117, 3 (2018), 1931–1990.

[31] Prasad, A. et al. 2018. Neural ParsCit: a deep learning-based reference string parser. International Journal on Digital Libraries. 19, 4 (Nov. 2018), 323–337.

[32] Rizvi, S.T.R. et al. 2019. DeepBiRD: An Automatic Bibliographic Reference Detection Approach. arXiv preprint arXiv:1912.07266. (2019).

[33] Rodrigues Alves, D. et al. 2018. Deep reference mining from scholarly literature in the arts and humanities. Frontiers in Research Metrics and Analytics. 3, (2018), 21.

[34] Saier, T. and Färber, M. 2020. unarXive: a large scholarly data set with publications’ full-text, annotated in-text citations, and links to metadata. Scientometrics. (2020), 1–24.

[35] Tkaczyk, D. et al. 2015. CERMINE: automatic extraction of structured metadata from scientific literature. International Journal on Document Analysis and Recognition (IJDAR). 18, 4 (2015), 317–335.

[36] Tkaczyk, D. et al. 2014. GROTOAP2-the methodology of creating a large ground truth dataset of scientific articles. D-Lib Magazine. 20, 11/12 (2014).

[37] Tkaczyk, D. 2015. GROTOAP-citations. CEON RepOD. https://repod.pon.edu.pl/dataset/grotoap-citations. (2015).

[38] Tkaczyk, D. et al. 2018. Machine Learning vs. Rules and Out-of-the-Box vs. Retrained: An Evaluation of Open-Source Bibliographic Reference and Citation Parsers. Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries (Fort Worth, Texas, USA, 2018), 99–108.

[39] Tkaczyk, D. et al. 2018. ParsRec: A Meta-Learning Recommender System for Bibliographic Reference Parsing Tools. Proceedings of the 12th ACM Conference on Recommender Systems (RecSys) (2018).

[40] Tkaczyk, D. et al. 2018. ParsRec: A Novel Meta-Learning Approach to Recommending Bibliographic Reference Parsers. Proceedings of the 26th Irish Conference on Artificial Intelligence and Cognitive Science (AICS) (2018), 162–173.

[41] Zhang, Y. 2018. Towards highly accurate publication information extraction from academic homepages. (2018).

Footnotes

[1] The work presented in this manuscript is based on Mark Grennan’s Master thesis “1 Billion Citation Dataset and Deep Learning Citation Extraction” at Trinity College Dublin, Ireland, 2018/2019

[2] We use the terms ‘citation parsing’, ‘reference parsing’, and ‘reference-string parsing’ interchangeably.

[3] This is a shortcoming of GIANT. However, the purpose of our current work is to generally compare ‘real’ vs synthetic data. Hence, both datasets should be as similar as possible in terms of available fields to make a fair comparison. Therefore, we removed all fields that were not present in both datasets.

[4] The words were: bone, recommender systems, running, war, crop, monetary, migration, imprisonment, hubble, obstetrics, photonics, carbon, cellulose, evolutionary, revolutionary, paleobiology, penal, leadership, soil, musicology.

[5] All results are based on the Macro Average F1. Looking at the Micro Average F1 shows a slightly better performance for Train_Grobid than for Train_GIANT (0.82 vs. 0.80), but the difference is neither large nor statistically significant at the 0.05 level.

Synthetic vs. Real Reference Strings for Citation Parsing, and the Importance of Re-training and Out-Of-Sample Data for Meaningful Evaluations: Experiments with GROBID, GIANT and CORA [pre-print]

Published by Joeran Beel on 8th September 2020
