Ian will be joining us for six months, starting today. Ian is currently enrolled as a master’s student in Computer and Information Science at the University of Konstanz in southern Germany. Ian’s research interests include machine learning, natural language processing, information retrieval and recommender systems.
Ian is currently working on a web-based tool that facilitates and recommends annotations for LaTeX formulae in STEM documents, and will continue this work while visiting us. The project addresses the retrieval of data, especially mathematical concepts, from STEM (science, technology, engineering and mathematics) documents. Most current information retrieval approaches do not consider mathematical formulae, even though formulae are very common in documents within STEM fields.
Retrieving the mathematical concepts encoded in formulae is important for the analysis of STEM documents, since formulae contain a lot of relevant information that may not be found in the surrounding text. Formula concept recognition is the process of classifying a formula as an instance of a certain concept (e.g. the formula E=mc^2 is associated with the concept of energy-mass equivalence). Machine learning has repeatedly proven useful for classification tasks. However, training machine learning methods requires very large amounts of labelled data. Currently, no sufficiently large labelled dataset of mathematical formulae annotated with their semantics is available for training machine learning models.
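To make the classification framing concrete, here is a minimal Python sketch of a labelled formula record and a naive lookup baseline. It is not a machine learning model and not how Ian's tool works: the names `AnnotatedFormula`, `normalize` and `classify` are made up for illustration, and the QID value is a placeholder rather than a real Wikidata identifier.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AnnotatedFormula:
    latex: str    # formula source as written by the author
    concept: str  # human-readable concept label
    qid: str      # Wikidata QID; the value used below is a placeholder


def normalize(latex: str) -> str:
    """Drop spaces and braces so trivially different spellings compare equal."""
    return "".join(ch for ch in latex if ch not in " {}")


def classify(latex: str, labelled: List[AnnotatedFormula]) -> Optional[str]:
    """Return the concept of an exactly matching labelled formula, if any."""
    key = normalize(latex)
    for example in labelled:
        if normalize(example.latex) == key:
            return example.concept
    return None  # unseen formula: nothing to say without more labelled data


# Single toy example taken from the text; "Q000000" is a placeholder QID.
labelled = [AnnotatedFormula("E=mc^2", "energy-mass equivalence", "Q000000")]

print(classify("E = m c^{2}", labelled))
print(classify("F = m a", labelled))
```

The second call finds no match, which illustrates the problem described above: without a large labelled dataset, neither a lookup like this nor a trained model can say anything about most formulae.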
The objective of Ian’s project is to implement an annotation recommendation tool, AnnoMathTeX, that authors of scientific documents can use to annotate the mathematical formulae in the document they are writing with the intended formula concept. The author should be able to add the Wikidata QID (a unique identifier) directly to the formula. Initially, the tool will provide recommendations based on a large-scale analysis of the NTCIR-12 arXiv dataset (http://research.nii.ac.jp/ntcir/ntcir-12/).
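One simple way to picture corpus-based recommendations is a frequency ranking: count how often each candidate name was observed for a symbol, then suggest the most frequent ones. The Python sketch below assumes exactly that. The functions `build_index` and `recommend` are hypothetical, the observations are invented toy data rather than anything extracted from the NTCIR-12 dataset, and the actual ranking used by the tool may differ.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def build_index(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count how often each candidate name was observed for each symbol."""
    index: Dict[str, Counter] = defaultdict(Counter)
    for symbol, name in pairs:
        index[symbol][name] += 1
    return index


def recommend(index: Dict[str, Counter], symbol: str, k: int = 3) -> List[str]:
    """Return up to k candidate names for a symbol, most frequent first."""
    counts = index.get(symbol, Counter())  # .get avoids creating new entries
    return [name for name, _ in counts.most_common(k)]


# Invented toy observations (symbol, candidate name); not real corpus data.
observations = [
    ("E", "energy"),
    ("E", "energy"),
    ("E", "electric field"),
    ("m", "mass"),
]

index = build_index(observations)
print(recommend(index, "E"))
print(recommend(index, "x"))
```

In a real setting, the author would pick one of the suggestions and the chosen concept would be stored together with its Wikidata QID.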
If the tool becomes available and is used, it will produce a dataset of mathematical formulae annotated with their corresponding formula concepts and their identifiers. This would be the first step towards enabling machine learning to automatically determine the concept of a formula. A number of different fields could benefit from such a dataset. It would enable the creation of tools that help users read scientific documents (or web pages) containing mathematical content. Furthermore, search engines could exploit such a dataset to improve the search for documents that contain mathematical formulae.