A while ago we started crawling the Web for academic PDFs to index them and use them in Docear’s research paper recommender system. Since then, we have collected quite a few PDFs. Unfortunately, in the foreseeable future our servers’ disks will be full, and the load on our servers is already too high (that’s why you sometimes won’t get recommendations in Docear – our servers are simply too busy).

Since our budget is tight and we don’t want to spend too much time on server administration either, we are asking for your help: do you have a server that you could spare? What we need is the following:

  • Storage space for backing up the indexed PDFs.
    Right now we delete each PDF once it is indexed. However, we would like to keep the PDFs in case we need to re-index them. Therefore, we need a server with lots of disk space (10 TB or more), and there will be about 20-30 GB of traffic a day to back up the PDFs to the server. Otherwise, there are no special requirements for this server.
  • PDF-Caching Server for making the PDFs publicly available
    We would love to cache the PDFs because many of them are deleted from the Web after a while. That means we would like to enable our users to download the PDFs from our/your server. So, the requirements for this server would be the same as for the backup server (10 TB+ storage), plus the option to run an Apache and maybe a Tomcat, too. In addition, there would be more traffic (the 20-30 GB of uploads plus the downloads from the users).
  • PDF Indexing Server to download and index PDF files
    Right now, the bottleneck of the entire system is downloading and processing the PDFs (this takes a few seconds per PDF). What we need is a really powerful server, especially in terms of CPU power. The server would have to download PDFs 24/7 (URLs to the PDFs would be delivered by Docear’s main server), process them (convert them to text, extract the title, extract the references), add each PDF to our Lucene index, and back the PDF up to the backup server (see the sketch after this list). Accordingly, storage requirements are rather low (a few gigabytes should be enough; an SSD would be awesome), but there will be about 40-60 GB of traffic a day, and the CPU load will be close to 1 most of the time.
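
To give an idea of what the per-PDF processing involves, here is a minimal sketch of such a pipeline in Java. The library choices (Apache PDFBox for text extraction, Lucene’s FSDirectory for the index) and all paths, class names, and fields are illustrative assumptions, not Docear’s actual implementation; title and reference extraction are omitted.

    // Sketch of the per-PDF pipeline: download, extract text, index with Lucene,
    // and copy the file to a backup location. All names and paths are hypothetical.
    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.*;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.*;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    public class PdfIndexingSketch {

        public static void main(String[] args) throws Exception {
            String pdfUrl = args[0];                        // URL delivered by the main server
            Path localCopy = Files.createTempFile("crawl-", ".pdf");
            Path backupDir = Paths.get("/mnt/backup/pdfs"); // hypothetical backup mount

            // 1. Download the PDF.
            try (InputStream in = new URL(pdfUrl).openStream()) {
                Files.copy(in, localCopy, StandardCopyOption.REPLACE_EXISTING);
            }

            // 2. Convert the PDF to plain text (the CPU-heavy step).
            String fullText;
            try (PDDocument pdf = PDDocument.load(localCopy.toFile())) {
                fullText = new PDFTextStripper().getText(pdf);
            }

            // 3. Add the document to the Lucene index. Title and reference extraction
            //    are omitted; in practice they would add further fields, and a single
            //    long-lived IndexWriter would be reused instead of opening one per PDF.
            try (IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("/srv/lucene-index")),
                    new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                doc.add(new StringField("url", pdfUrl, Field.Store.YES));
                doc.add(new TextField("fulltext", fullText, Field.Store.NO));
                writer.addDocument(doc);
            }

            // 4. Back up the original PDF file to the backup storage.
            Files.createDirectories(backupDir);
            Files.copy(localCopy, backupDir.resolve(localCopy.getFileName()),
                    StandardCopyOption.REPLACE_EXISTING);
        }
    }

In practice many PDFs would be processed concurrently, which is why CPU power matters far more than disk space for this server.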

Since we may use our university’s facilities, we would be able to host the server in our university’s data center. However, if you hosted the server yourself and just provided us with the login data, that would be even better :-).

If you think you could help, please send us an email at info@docear.org and let us know what we could give you in return (e.g. your logo on our website).


Joeran Beel

Please visit https://isg.beel.org/people/joeran-beel/ for more details about me.

10 Comments

pete shaw · 23rd November 2012 at 15:57

Hey Joeran,

As far as the server space goes, have you considered simply hosting the PDF storage/processing on a cloud-based solution such as ACS or Amazon Glacier? One approach would be to simply ask for an academic discount on the cloud services, and then to look for financial support to pay for these services as a separate step. So I would be willing to help set up the services, basically providing the server and storage, but I wouldn’t be able to actually pay for it (sorry); you would still need a sponsor to help with that.

Regards
-Pete

    Joeran [Docear] · 23rd November 2012 at 16:06

    Hi Pete, thank you very much for your offer, but cloud storage is too expensive just for storing some PDF files. 1 GB of storage costs about 1 US cent a month, that is 10 US$ per TB per month. So storing 10 TB for a year would cost about 1,200 US$, not including the traffic costs, and we also couldn’t offer the PDFs to our users for download. I guess we just have to wait for a generous sponsor or limit the capabilities of our library.
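
    Spelled out, that estimate (at the quoted rate of roughly 1 US cent per GB per month) works out as:

        1 TB  = 1,000 GB × $0.01 per GB per month ≈ $10 per month
        10 TB ≈ $100 per month
        12 months × $100 per month ≈ $1,200 per year, storage only, before traffic costs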

ababakar · 5th November 2012 at 20:49

What about a team-up with ResearchGate?
As far as I know they are also located in Germany – and they are also building up a database.

Maybe you can take a look (www.researchgate.net)

cheers

Alessandro · 29th October 2012 at 02:48

I’m sort of stating the obvious here, but what about partnering with Google? They surely have drive space to spare, and they might be happy to give you some in exchange for publicity (e.g. their logo on your website). Just a thought!

    Joeran [Docear] · 29th October 2012 at 10:16

    I believe before Google would have an interest in sponsoring us, we would need about 100,000 times as many users as we have right now 😉

      Dennis Groves · 16th November 2012 at 05:38

      Hello Joeran,

      I know for a fact that Google offers many things to non-profits (http://www.google.com/nonprofits/) and to open-source projects (http://code.google.com/opensource/). OWASP has benefited significantly from our relationship with Google. I realise this is about open data; however, I recommend checking out those two links – there are some really valuable things there that you can still take advantage of.

      Google has this for open data
      http://opendatakit.org/about/deployments/

      and the Open Knowledge Foundation has this
      http://thedatahub.org/

      I am not certain that either is exactly what you are looking for, but perhaps with some creativity they can reach the same ends. I always thought it would be best if academic bibliographic information were part of a globally distributed ‘disk’ that each user participated in by running the bibliographic software – just like torrents, only for a good thing!

      Dennis

BobKnocker · 25th October 2012 at 07:31

I don’t know how your recommender system works, but you could introduce Google Drive integration with the option to allow users to include their PDFs in the index. That way their computers could do the indexing, like with Zotero, or you could use Drive’s integrated indexing and forward the data to a central hub. Also, integrating Drive would enable live collaboration by the very design of their product. It really is something quite extraordinary.

I would really look at using Google Cloud Storage. It’s insanely cheap, there is no server administration, and you gain access to easy-to-use APIs that are right up your alley.

I would dedicate it all to seeing this project through, because to me Docear represents a component in the next phase of comprehension. Honestly, I feel the functionality needs to be abstracted out into well-defined HTML+JS components so that they can be integrated into general computing, much as I feel Mozilla Popcorn will be. The ability to extract highlights and notes from PDFs is one component; another would be the mind-mapping functionality, and yet another the ability to export all of this to a paper outline with a bibliography. By separating out these well-defined functionalities, new levels in your creation’s evolution can be realized, such as the ability to use any piece of media rather than just a PDF, and integration with new touch and gesture technologies.

The ability to monetize your work would come from consulting businesses on new forms of productivity that have been vetted by their effects on education, not to mention massive grants to integrate this tech into the education systems of various countries. Eww… I sound like my grandfather.

    Joeran [Docear] · 25th October 2012 at 08:25

    Thank you very much for all your ideas. We are indeed looking into cloud hosting, but more for the collaborative part (working together on mind maps, etc.), not for the recommender system. But it really might be a good idea to “outsource” the PDF processing (extracting titles, text, …) to our users. We will think about this!
