Multilingual entity linking


I created a more efficient and more accurate version of the multilingual entity linking system DBpedia Spotlight.

Raw data

Download the raw data here

We provide the raw data that is used to create entity extraction models in various languages. This data is the result of running pignlproc on the latest Wikipedia dumps.

If you use this data in your research, please cite the following paper:

  title = {Improving Efficiency and Accuracy in Multilingual Entity Extraction},
  author = {Joachim Daiber and Max Jakob and Chris Hokamp and Pablo N. Mendes},
  year = {2013},
  booktitle = {Proceedings of the 9th International Conference on Semantic Systems (I-Semantics)}


Data format

  • URIs are encoded in DBpedia format, but redirects are not yet resolved. To resolve them, you can use the class WikipediaToDBpediaClosure, as used in CreateSpotlightModel.scala:
  import java.io.{File, FileInputStream}

  // Build the closure from the redirects and disambiguations files in the raw data folder.
  val wikipediaToDBpediaClosure = new WikipediaToDBpediaClosure(
    new FileInputStream(new File(rawDataFolder, "redirects.nt")),
    new FileInputStream(new File(rawDataFolder, "disambiguations.nt"))
  )
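To illustrate what resolving redirects means here, the following is a minimal sketch that follows a chain of redirects until it reaches a URI with no further redirect. It is not the implementation of WikipediaToDBpediaClosure: the object name `RedirectSketch`, the method `resolve`, and the in-memory `Map` of redirects are all assumptions made for this example.

```scala
// Hypothetical sketch; NOT the WikipediaToDBpediaClosure implementation.
object RedirectSketch {
  // Follows redirects transitively. `seen` guards against redirect cycles:
  // if the next target was already visited, the current URI is returned.
  @scala.annotation.tailrec
  def resolve(uri: String,
              redirects: Map[String, String],
              seen: Set[String] = Set.empty): String =
    redirects.get(uri) match {
      case Some(target) if target != uri && !seen.contains(target) =>
        resolve(target, redirects, seen + uri)
      case _ => uri
    }
}
```

With the made-up redirects `Map("A" -> "B", "B" -> "C")`, `RedirectSketch.resolve("A", redirects)` ends at `"C"`, while a URI without a redirect is returned unchanged.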


DBpedia URI                             Count
...                                     21
...                                     7
...                                     20


Surface form         Count annotated    Count total
Berlin               49068              105915
Berloz               2                  6
9z                   -1                 1
  • If no total string occurrence count is available, the third column may be empty or -1.
  • If the annotated count is -1, the surface form has not been observed with any DBpedia resource. Lines with an annotated count of -1 exist only to report the total count of the lowercase representation of a surface form.
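The two rules above can be sketched as a small parser that maps an empty field or -1 to "not available". This is only an illustration: it assumes the columns are tab-separated (the separator is not stated here), and the names `SurfaceFormCounts`, `SfCount`, `parse`, and `annotationProbability` are hypothetical, not part of DBpedia Spotlight.

```scala
// Hypothetical sketch of reading one line of the surface form counts.
object SurfaceFormCounts {
  final case class SfCount(surfaceForm: String,
                           annotated: Option[Long],
                           total: Option[Long])

  // An empty field, a non-numeric field, or -1 all become None.
  private def count(field: String): Option[Long] =
    scala.util.Try(field.trim.toLong).toOption.filter(_ >= 0)

  // ASSUMPTION: columns are separated by tabs.
  def parse(line: String): Option[SfCount] =
    line.split("\t", -1) match {
      case Array(sf, annotated)        => Some(SfCount(sf, count(annotated), None))
      case Array(sf, annotated, total) => Some(SfCount(sf, count(annotated), count(total)))
      case _                           => None
    }

  // Share of all occurrences of the string that were annotated,
  // defined only when both counts are available and total > 0.
  def annotationProbability(c: SfCount): Option[Double] =
    for {
      a <- c.annotated
      t <- c.total
      if t > 0
    } yield a.toDouble / t
}
```

For the Berlin row, `annotationProbability` divides 49068 by 105915. For the 9z row it yields `None`, because the annotated count is -1.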


Surface form     DBpedia URI                                         Count
Berlin           ...                                                 2
Berlin           ...                                                 9
Berlin           ...                                                 1
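One way such pair counts might be used is to turn them into a relative frequency for each candidate URI of a surface form. The sketch below is an assumption about usage, not a description of how DBpedia Spotlight computes its scores. The names `PairCounts` and `candidateShares` are hypothetical, and the rows are passed in as already-parsed triples.

```scala
// Hypothetical sketch: per surface form, each URI's share of the pair counts.
object PairCounts {
  // Input rows are (surface form, URI, count) triples.
  def candidateShares(rows: Seq[(String, String, Int)]): Map[String, Map[String, Double]] =
    rows.groupBy(_._1).map { case (surfaceForm, group) =>
      val total = group.map(_._3).sum.toDouble
      surfaceForm -> group.map { case (_, uri, n) => uri -> n / total }.toMap
    }
}
```

With three rows for "Berlin" carrying counts 2, 9, and 1, the shares are 2/12, 9/12, and 1/12 respectively.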


Wikipedia URI                   Stemmed token counts
...!                            {(renam,76),(intel,14),...,(plai,2),(auf,2)}
  • All tokens are stemmed
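The token count column can be read into a map with a short regular expression. This is a sketch under the assumption that tokens contain no commas or parentheses. The names `TokenCounts` and `parse` are hypothetical.

```scala
// Hypothetical sketch: parse "{(token,count),(token,count),...}" into a Map.
object TokenCounts {
  // ASSUMPTION: a token contains no comma and no parenthesis.
  private val Pair = """\(([^,()]+),(\d+)\)""".r

  // Collects every "(token,count)" pair; anything else in the field,
  // such as a literal "..." elision, is skipped.
  def parse(field: String): Map[String, Int] =
    Pair.findAllMatchIn(field).map(m => m.group(1) -> m.group(2).toInt).toMap
}
```

Applied to the example row above, this yields the four visible pairs (renam, intel, plai, auf) with their counts.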