Publications

For a full list of publications, please see Google Scholar.

Machine Translation

My current work focuses on linguistically informed statistical models for MT.

  • Universal Reordering via Linguistic Typology
    Daiber, J., Stanojević, M. and Sima'an, K. In COLING, 2016.

    In this paper we explore the novel idea of building a single universal reordering model from English to a large number of target languages. To build this model we exploit typological features of word order for a large number of target languages together with source (English) syntactic features and we train this model on a single combined parallel corpus representing all (22) involved language pairs. We contribute experimental evidence for the usefulness of linguistically defined typological features for building such a model. When the universal reordering model is used for preordering followed by monotone translation (no reordering inside the decoder), our experiments show that this pipeline gives comparable or improved translation performance with a phrase-based baseline for a large number of language pairs (12 out of 22) from diverse language families.

    @InProceedings{daiber-stanojevic-simaan:2016:COLING,
      author    = {Daiber, Joachim  and  Stanojevi\'{c}, Milo\v{s}  and  Sima'an, Khalil},
      title     = {Universal Reordering via Linguistic Typology},
      booktitle = {Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers},
      month     = {December},
      year      = {2016},
      address   = {Osaka, Japan},
      publisher = {The COLING 2016 Organizing Committee},
      pages     = {3167--3176},
      abstract  = {In this paper we explore the novel idea of building a single universal
      reordering model from English to a large number of target languages. To build
      this model we exploit typological features of word order for a large number of
      target languages together with source (English) syntactic features and we train
      this model on a single combined parallel corpus representing all (22) involved
      language pairs. We contribute experimental evidence for the usefulness of
      linguistically defined typological features for building such a model. When the
      universal reordering model is used for preordering followed by monotone
      translation (no reordering inside the decoder), our experiments show that this
      pipeline gives comparable or improved translation performance with a
      phrase-based baseline for a large number of language pairs (12 out of 22) from
      diverse language families.},
      url       = {http://aclweb.org/anthology/C16-1298}
    }
    
    Poster
  • Examining the Relationship between Preordering and Word Order Freedom in Machine Translation
    Daiber, J., Stanojević, M., Aziz, W. and Sima'an, K. In Conference on Machine Translation (WMT), 2016.

    We study the relationship between word order freedom and preordering in statistical machine translation. To assess word order freedom, we first introduce a novel entropy measure which quantifies how difficult it is to predict word order given a source sentence and its syntactic analysis. We then address preordering for two target languages at the far ends of the word order freedom spectrum, German and Japanese, and argue that for languages with more word order freedom, attempting to predict a unique word order given source clues only is less justified. Subsequently, we examine lattices of n-best word order predictions as a unified representation for languages from across this broad spectrum and present an effective solution to a resulting technical issue, namely how to select a suitable source word order from the lattice during training. Our experiments show that lattices are crucial for enabling empirical performance improvements for languages with freer word order (English–German) and can provide additional improvements for fixed word order languages (English–Japanese).

    @InProceedings{daiber-EtAl:2016:WMT,
      author    = {Daiber, Joachim  and  Stanojevi\'{c}, Milo\v{s}  and  Aziz, Wilker  and  Sima'an, Khalil},
      title     = {Examining the Relationship between Preordering and Word Order Freedom in Machine Translation},
      booktitle = {Proceedings of the First Conference on Machine Translation, Volume 1: Research Papers},
      month     = {August},
      year      = {2016},
      address   = {Berlin, Germany},
      publisher = {Association for Computational Linguistics},
      pages     = {118--130},
      url       = {http://www.aclweb.org/anthology/W/W16/W16-2213}
    }
    
    Talk
  • Machine Translation with Source-Predicted Target Morphology
    Daiber, J. and Sima'an, K. In MT Summit, 2015.

    We propose a novel pipeline for translation into morphologically rich languages which consists of two steps: initially, the source string is enriched with target morphological features and then fed into a translation model which takes care of reordering and lexical choice that matches the provided morphological features. As a proof of concept we first show improved translation performance for a phrase-based model translating source strings enriched with morphological features projected through the word alignments from target words to source words. Given this potential, we present a model for predicting target morphological features on the source string and its predicate-argument structure, and tackle two major technical challenges: (1) How to fit the morphological feature set to training data? and (2) How to integrate the morphology into the back-end phrase-based model such that it can also be trained on projected (rather than predicted) features for a more efficient pipeline? For the first challenge we present a latent variable model, and show that it learns a feature set with quality comparable to a manually selected set for German. And for the second challenge we present results showing that it is possible to bridge the gap between a model trained on a predicted and another model trained on a projected morphologically enriched parallel corpus. Finally we exhibit final translation results showing promising improvement over the baseline phrase-based system.

    @inproceedings{daiber2015machine,
      title={Machine Translation with Source-Predicted Target Morphology},
      author={Daiber, Joachim and Sima'an, Khalil},
      booktitle={Proceedings of the 15th Machine Translation Summit (MT Summit 2015)},
      pages={283--296},
      address={Miami, USA},
      year={2015}
    }
    
    Poster

    Talk
  • Splitting Compounds by Semantic Analogy
    Daiber, J., Quiroz, L., Wechsler, R. and Frank, S. In Deep Machine Translation Workshop, 2015.

    Compounding is a highly productive word-formation process in some languages that is often problematic for natural language processing applications. In this paper, we investigate whether distributional semantics in the form of word embeddings can enable a deeper, i.e., more knowledge-rich, processing of compounds than the standard string-based methods. We present an unsupervised approach that exploits regularities in the semantic vector space (based on analogies such as “bookshop is to shop as bookshelf is to shelf”) to produce compound analyses of high quality. A subsequent compound splitting algorithm based on these analyses is highly effective, particularly for ambiguous compounds. German to English machine translation experiments show that this semantic analogy-based compound splitter leads to better translations than a commonly used frequency-based method.

    @InProceedings{W15-5703,
      author = 	"Daiber, Joachim and Quiroz, Lautaro and Wechsler, Roger and Frank, Stella",
      title = 	"Splitting Compounds by Semantic Analogy",
      booktitle = 	"Proceedings of the 1st Deep Machine Translation Workshop",
      year = 	"2015",
      publisher = 	"{\'U}FAL MFF UK",
      pages = 	"20--28",
      address = 	"Praha, Czechia",
      url = 	"http://aclweb.org/anthology/W15-5703"
    }
    
    Talk Code
  • Delimiting Morphosyntactic Search Space with Source-Side Reordering Models
    Daiber, J. and Sima'an, K. In Deep Machine Translation Workshop, 2015.

    @InProceedings{W15-5704,
      author = 	"Daiber, Joachim and Sima'an, Khalil",
      title = 	"Delimiting Morphosyntactic Search Space with Source-Side Reordering Models",
      booktitle = 	"Proceedings of the 1st Deep Machine Translation Workshop",
      year = 	"2015",
      publisher = 	"{\'U}FAL MFF UK",
      pages = 	"29--38",
      address = 	"Praha, Czechia",
      url = 	"http://aclweb.org/anthology/W15-5704"
    }
    
    Talk
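
The analogy quoted in the "Splitting Compounds by Semantic Analogy" abstract above ("bookshop is to shop as bookshelf is to shelf") can be illustrated with a small sketch. This is not the method from the paper. The toy vectors, the function names `analogy_score` and `best_head`, and the cosine-of-offsets scoring are all assumptions made only to show the general idea that a good split of a compound should produce a vector offset similar to the offset of a known compound and its head.

```python
import math

# Toy 3-dimensional "embeddings", invented purely for illustration.
# Real word embeddings would be learned from a corpus.
TOY_VECTORS = {
    "shop":      (1.0, 0.0, 0.0),
    "bookshop":  (1.0, 1.0, 0.0),
    "shelf":     (0.0, 0.0, 1.0),
    "bookshelf": (0.0, 1.0, 1.0),
    "elf":       (0.5, -1.0, 0.2),
}


def subtract(a, b):
    """Component-wise difference of two vectors."""
    return tuple(x - y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def analogy_score(compound, head, known_compound, known_head, vectors=TOY_VECTORS):
    """Compare the offset (compound - head) with the offset of a known pair.

    Hypothetical scoring function: a high score means
    "compound is to head as known_compound is to known_head".
    """
    offset = subtract(vectors[compound], vectors[head])
    prototype = subtract(vectors[known_compound], vectors[known_head])
    return cosine(offset, prototype)


def best_head(compound, candidate_heads, known_compound, known_head, vectors=TOY_VECTORS):
    """Pick the candidate head whose offset best matches the known pair."""
    return max(
        candidate_heads,
        key=lambda head: analogy_score(compound, head, known_compound, known_head, vectors),
    )


if __name__ == "__main__":
    # "elf" and "shelf" are both string suffixes of "bookshelf";
    # the analogy with bookshop/shop is used to choose between them.
    for head in ("elf", "shelf"):
        print(head, round(analogy_score("bookshelf", head, "bookshop", "shop"), 3))
    print(best_head("bookshelf", ["elf", "shelf"], "bookshop", "shop"))
```

In this toy setup both "elf" and "shelf" are plausible suffixes on the string level, which is the kind of ambiguity the abstract mentions. The offset comparison prefers "shelf" only because the invented vectors were chosen that way.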

Entity Extraction

I have worked on fast and accurate multilingual models for entity linking.

  • Improving Efficiency and Accuracy in Multilingual Entity Extraction
    Daiber, J., Jakob, M., Hokamp, C. and Mendes, P.N. In International Conference on Semantic Systems, 2013.

    There has recently been an increased interest in named entity recognition and disambiguation systems at major conferences such as WWW, SIGIR, ACL, KDD, etc. However, most work has focused on algorithms and evaluations, leaving little space for implementation details. In this paper, we discuss some implementation and data processing challenges we encountered while developing a new multilingual version of DBpedia Spotlight that is faster, more accurate and easier to configure. We compare our solution to the previous system, considering time performance, space requirements and accuracy in the context of the Dutch and English languages. Additionally, we report results for 7 additional languages among the largest Wikipedias. Finally, we present challenges and experiences to foment the discussion with other developers interested in recognition and disambiguation of entities in natural language text.

    @inproceedings{Daiber:2013:IEA:2506182.2506198,
     author = {Daiber, Joachim and Jakob, Max and Hokamp, Chris and Mendes, Pablo N.},
     title = {Improving Efficiency and Accuracy in Multilingual Entity Extraction},
     booktitle = {Proceedings of the 9th International Conference on Semantic Systems},
     series = {I-SEMANTICS '13},
     year = {2013},
     isbn = {978-1-4503-1972-0},
     location = {Graz, Austria},
     pages = {121--124},
     numpages = {4},
     url = {http://doi.acm.org/10.1145/2506182.2506198},
     doi = {10.1145/2506182.2506198},
     acmid = {2506198},
     publisher = {ACM},
     address = {New York, NY, USA},
     keywords = {entity linking, information extraction, named entity recognition},
    }
    
    Code Data Demo

Dependency Parsing

I am also interested in robust models of dependency parsing.

  • The Denoised Web Treebank: Evaluating Dependency Parsing under Noisy Input Conditions
    Daiber, J. and van der Goot, R. In LREC, 2016.

    We introduce the Denoised Web Treebank: a treebank including a normalization layer and a corresponding evaluation metric for dependency parsing of noisy text, such as Tweets. This benchmark enables the evaluation of parser robustness as well as text normalization methods, including normalization as machine translation and unsupervised lexical normalization, directly on syntactic trees. Experiments show that text normalization together with a combination of domain-specific and generic part-of-speech taggers can lead to a significant improvement in parsing accuracy on this test set.

    @InProceedings{DAIBER16.86,
      author = {Joachim Daiber and Rob van der Goot},
      title = {The Denoised Web Treebank: Evaluating Dependency Parsing under Noisy Input Conditions},
      booktitle = {Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)},
      year = {2016},
      month = {May},
      date = {23-28},
      location = {Portorož, Slovenia},
      editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Marko Grobelnik and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
      publisher = {European Language Resources Association (ELRA)},
      address = {Paris, France},
      isbn = {978-2-9517408-9-1},
      language = {english}
    }
    
    Poster Code Data