pdf metadata extraction

ParsRec: A Novel Meta-Learning Approach to Recommending Bibliographic Reference Parsers

Our manuscript “ParsRec: A Novel Meta-Learning Approach to Recommending Bibliographic Reference Parsers” was accepted for publication at the 26th Irish Conference on Artificial Intelligence and Cognitive Science (AICS). It is an extended version of our recently presented poster “ParsRec: Meta-Learning Recommendations for Bibliographic Reference Parsing” at the ACM RecSys conference. The Read more…

By Joeran Beel, 8 years ago

Docear

Update for Docear’s “Google Scholar Parser” Library to Fetch Metadata for PDF files

Update 2018-07-31: We updated the Dropbox link. Google Scholar recently changed its layout, and as a consequence, Docear could no longer fetch metadata from Google Scholar for PDF files. Fortunately, one of our users (“Silberzwiebel”) adjusted Docear’s Google Scholar Parser, and now everything works as usual. However, we have not yet integrated Read more…

By Joeran Beel, 9 years ago (5th October 2017)

Information Extraction

Docear 1.1 stable released with strongly improved PDF metadata extraction

Finally, after releasing the alpha and beta, today we release Docear 1.1 stable. If you have already tried one of the previous versions, there is not much news. Otherwise, read on.

Thanks to all the generous donors, our student Christoph could work on an improved PDF metadata retrieval for Docear. The new Docear 1.1 is able to extract the title of a PDF and fetch metadata from Google Scholar for that title. To do so, select a PDF in your mind-map and choose “Create or Update reference”, …

… and the following new dialog appears. The dialog shows the file name of your PDF file and the extracted title. In the background, the extracted title is sent to Google Scholar, and metadata for the first two search results are shown in the dialog. If the title was extracted incorrectly, you can manually correct it. You may also choose to use the PDF’s file name for the search. For instance, if you have already named your PDF according to its title, select the radio button with the file name, and the file name is sent as the search query to Google Scholar (you may also manually correct the file name before it’s sent to Google Scholar). Of course, all other options you already know are still available, such as creating a blank entry or importing the XMP data of PDFs. Btw. Docear remembers your choice, i.e. when you select to create a blank entry, that option will be pre-selected when you open the dialog the next time. It might happen that your IP is blocked by Google Scholar when you use the service too frequently. If this happens, a captcha should appear, and after solving it, you should be able to proceed. We have not yet tested this thoroughly. Please let us know your experiences.
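The choice between the extracted title and the file name can be sketched in a few lines of Python. This is only an illustration, not Docear's actual code: the function name `build_search_query`, its parameters, and the clean-up of the file name (dropping the extension, turning underscores and hyphens into spaces) are all assumptions made for this example.

```python
from pathlib import Path


def build_search_query(extracted_title: str, pdf_path: str, use_file_name: bool = False) -> str:
    """Return the text that would be sent as the search query.

    Hypothetical helper: mirrors the radio-button choice in the dialog
    (extracted title vs. file name). The file-name clean-up is an assumption.
    """
    if use_file_name:
        # Use the file name without its extension, e.g. "My_Paper.pdf" -> "My_Paper"
        query = Path(pdf_path).stem.replace("_", " ").replace("-", " ")
    else:
        query = extracted_title
    # Collapse repeated whitespace so the query is a single clean line
    return " ".join(query.split())


title_query = build_search_query("  Research Paper   Recommender Systems ", "paper1.pdf")
file_query = build_search_query("", "library/Research_Paper_Recommender_Systems.pdf", use_file_name=True)
print(title_query)
print(file_query)
```

In both cases the resulting string could still be edited by hand before it is sent, as described above.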

The precision of our metadata tool depends on two factors: A) the precision of the title extraction and B) the coverage of Google Scholar. According to a recent experiment, the title extraction accuracy of our tool is around 70%. However, the final result very much depends on the format of your research articles. In my research field (i.e. recommender systems), I would say that our tool extracts the title correctly for about 90% of the articles in my personal library. In addition, almost all articles that are relevant for my research are indexed by Google Scholar (I would estimate more than 90%). This means that for around 80% of my PDFs the correct metadata is retrieved fully automatically. If I provide the title manually, the metadata can be retrieved for even more than 90%. Please let us know your experience (and your research field). (more…)
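The 80% figure follows from multiplying the two factors. Here is a minimal sketch of that arithmetic, assuming title extraction and Google Scholar coverage are independent of each other; the function name `retrieval_rate` is made up for this example, and the input values are the rough estimates given above.

```python
def retrieval_rate(title_extraction_rate: float, scholar_coverage: float) -> float:
    """Estimated share of PDFs with correct metadata.

    Assumes the two factors are independent, so they can simply be multiplied.
    """
    return title_extraction_rate * scholar_coverage


# Fully automatic: about 90% correct titles, about 90% coverage
automatic = retrieval_rate(0.90, 0.90)
# Title provided manually: title is always correct, coverage stays at about 90%
manual = retrieval_rate(1.00, 0.90)

print(f"automatic: {automatic:.0%}")
print(f"manual: {manual:.0%}")
```

With these estimates the automatic case comes out at roughly 81%, and the manual case is limited only by the coverage of Google Scholar.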

By Joeran Beel, 12 years ago

Docear

Docear 1.1 Beta Released: New PDF Metadata Extraction, Better Zotero and Mendeley BibTeX support, and Bug Fixes

If you have tested the Preview of Docear 1.1 you may already know about some of Docear’s new features. With your feedback and the mind maps, log files and BibTeX files you shared with us, these features have matured. We are proud to introduce the first (and hopefully only) Beta release of Docear 1.1.

The new key features of Docear 1.1

Improved metadata retrieval

Thanks to your donations, our student Christoph greatly enhanced Docear’s PDF metadata retrieval. For us, it works really well, and with Docear 1.1 Beta the last bugs have been fixed. Btw. if you like what Christoph did, and if you are using LibreOffice or OpenOffice, please also read our call for donation to develop an add-on for these two word processors.

Improved support for Zotero / Mendeley BibTeX files

(more…)

By Joeran Beel, 12 years ago

Docear

Preview of Docear 1.1 with PDF Metadata Retrieval from Google Scholar

Thanks to all the generous donors, our student Christoph could work on an improved PDF metadata retrieval for Docear, and today it’s time to present the first preview. The new Docear 1.1 (preview) is able to extract the title of a PDF and fetch appropriate metadata from Google Scholar. Whenever you select a PDF in your mind-map and choose “Create or Update reference”, the following new dialog appears.

The dialog shows the file name of your PDF file and the extracted title. In the background, the extracted title is sent to Google Scholar, and metadata for the first three search results are shown in the dialog. If the title was extracted incorrectly, you can manually correct it. You may also choose to use the PDF’s file name for the search. For instance, if you have already named your PDF according to its title, select the radio button with the file name, and the file name is sent as the search query to Google Scholar (you may also manually correct the file name before it’s sent to Google Scholar). Of course, all other options you already know are still available, such as creating a blank entry or importing the XMP data of PDFs. Btw. Docear remembers your choice, i.e. when you select to create a blank entry, that option will be pre-selected when you open the dialog the next time. It might happen that your IP is blocked by Google Scholar when you use the service too frequently. If this happens, a captcha should appear, and after solving it, you should be able to proceed. We have not yet tested this thoroughly. Please let us know your experiences.

The precision of our metadata tool depends on two factors: A) the precision of the title extraction and B) the coverage of Google Scholar. According to a recent experiment, the title extraction accuracy of our tool is around 70%. However, the final result very much depends on the format of your research articles. In my research field (i.e. recommender systems), I would say that our tool extracts the title correctly for about 90% of the articles in my personal library. In addition, almost all articles that are relevant for my research are indexed by Google Scholar (I would estimate more than 90%). This means that for around 80% of my PDFs (roughly 90% × 90%) the correct metadata is retrieved fully automatically. If I provide the title manually, the metadata can be retrieved for even more than 90%. Please let us know your experience (and your research field). (more…)

By Joeran Beel, 12 years ago

Docear

Call for donation was successful: 1800 Euros donated to improve Docear’s PDF metadata retrieval function

One month ago, we started a call for donation and asked our users for money so we could pay our student Christoph to improve Docear’s PDF metadata retrieval. We asked for 1800 Euros (~2500 US$) and today we achieved our goal. We would like to thank all donors who Read more…

By Joeran Beel, 12 years ago

Call for donation

Call for Donation: (Automatic) PDF Metadata Extraction and Renaming

Done! We’ve got all the money we need, thank you very much!!!!!!!! Read on here…

One of Docear’s biggest disadvantages, compared to other reference managers, is its rather poor PDF metadata extraction capability. As such, it is no surprise that the second most popular feature request is to add decent PDF metadata extraction and file renaming to Docear. However, adding such a function is a lot of work, and we currently do not really have the manpower for this. Fortunately, one of our best students – i.e. Christoph, who already did a lot of work for us – wants a paid job for his semester break. If we could pay him 1,800 Euros, he would love to implement the PDF metadata extraction during that break, and we have no doubt that he is capable of doing it. The problem is, we don’t have the funds to pay him.

Therefore, we would like to start a call for donation: If you want decent PDF metadata extraction in Docear, please donate before February 28, 2014. We need 1,800 Euros to pay Christoph for four weeks, almost full-time, starting at the end of February.


(more…)

By Joeran Beel, 12 years ago