List of 6513 stop-words for 17 languages (English, German, French, Italian, and many others)

To optimize Docear’s research paper recommender system, I was looking for an extensive stop word list, that is, a list of words that are ignored when analyzing your mind maps and research papers (for instance ‘the’, ‘and’, ‘or’, …). It’s easy to find individual lists for individual languages, but I couldn’t find a single extensive list covering several languages. So I created one based on the stop lists from

  • http://dev.mysql.com/doc/refman/5.5/en/fulltext-stopwords.html
  • http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a11-smart-stop-list/english.stop
  • http://members.unine.ch/jacques.savoy/clef/
  • http://norm.al/2009/04/14/list-of-english-stop-words/
  • http://snowball.tartarus.org/algorithms/english/stop.txt
  • http://solariz.de/649/deutsche-stopwords.htm
  • http://www.lextek.com/manuals/onix/
  • http://www.ranks.nl/resources/stopwords.html
  • http://www.textfixer.com/resources/common-english-words.php
  • http://www.translatum.gr/forum/index.php?topic=2476.0

In case anyone else needs such a stop word list, here it is: 6513 stop words for 17 languages, including English, French, German, Catalan, Czech, Danish, Dutch, Finnish, Norwegian, Polish, Portuguese, Romanian, Spanish, Swedish, and Turkish. I believe that some words have an encoding problem. If you discover an error, please let me know and I will correct it. Also, I wouldn’t be surprised to learn that a stop word from one language is an important word in another language. If you discover some words in the list that should not be ignored by our research paper recommender system… please let us know 🙂
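
To make the cross-language concern concrete, here is a minimal Python sketch. It is not Docear’s actual code: the `STOP_WORDS` dictionary, its tiny sample entries, and the `remove_stop_words` helper are all made up for illustration. It shows how a merged multi-language list can drop a meaningful word, using ‘die’, which is an article in German but a content word in English.

```python
import re

# Hypothetical, tiny per-language samples for illustration only;
# the real list contains thousands of entries.
STOP_WORDS = {
    "en": {"the", "and", "or", "of", "a"},
    "de": {"der", "die", "das", "und", "oder"},
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def remove_stop_words(text, languages):
    """Drop every token found in the stop lists of the given languages."""
    stop = set().union(*(STOP_WORDS[lang] for lang in languages))
    return [token for token in tokenize(text) if token not in stop]


english_text = "The cells die and the tissue degrades"

# Using only the English list keeps the content word 'die'.
print(remove_stop_words(english_text, ["en"]))

# Merging English and German lists removes 'die' as well,
# because it is a German article.
print(remove_stop_words(english_text, ["en", "de"]))
```

If the language of a document is known, filtering with only that language’s words avoids this kind of collision; the merged list is the simpler option when the language is unknown.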
