# The Denoised Web Treebank

We introduce the Denoised Web Treebank: a treebank with a normalization layer, together with a corresponding evaluation metric, for dependency parsing of noisy text such as tweets. This benchmark makes it possible to evaluate text normalization methods, including normalization as machine translation and unsupervised lexical normalization, directly on syntactic trees.

## Source code

All source code used in this paper is available under the Apache 2.0 license on GitHub: jodaiber/ithaka.

• dev/test.ids: IDs of the tweets from which the sentences were taken.
• dev/test.tokens: The original noisy tokens.
• dev/test.normalized: The manually normalized tokens including the alignments.
• dev/test.conll: Manually annotated dependency trees for each sentence (CoNLL-X format).
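
As a rough illustration of how the dependency trees could be loaded, the sketch below reads CoNLL-X formatted lines, assuming the standard layout of ten tab-separated columns per token (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) with a blank line between sentences. The names `Token` and `read_conllx` and the two sample sentences (including their relation labels) are hypothetical and only serve the example; they are not part of the released code or data.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    index: int   # ID column: 1-based position in the sentence
    form: str    # FORM column: the surface token
    head: int    # HEAD column: index of the head token, 0 for the root
    deprel: str  # DEPREL column: dependency relation to the head


def read_conllx(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of Token per sentence from CoNLL-X formatted lines."""
    sentence: List[Token] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        if len(cols) < 8:
            raise ValueError(f"expected at least 8 columns, got {len(cols)}")
        sentence.append(Token(int(cols[0]), cols[1], int(cols[6]), cols[7]))
    # Emit the last sentence if the input does not end with a blank line.
    if sentence:
        yield sentence


# Made-up sample in CoNLL-X layout (not taken from the treebank).
SAMPLE = (
    "1\tu\t_\t_\t_\t_\t2\tnsubj\t_\t_\n"
    "2\trock\t_\t_\t_\t_\t0\troot\t_\t_\n"
    "\n"
    "1\tok\t_\t_\t_\t_\t0\troot\t_\t_\n"
)

sentences = list(read_conllx(SAMPLE.splitlines()))
for sent in sentences:
    print([(tok.form, tok.head, tok.deprel) for tok in sent])
```

To read an actual file, the same function can be given an open file handle instead of the sample lines, for example `read_conllx(open(path, encoding="utf-8"))` with `path` pointing at one of the `.conll` files.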