EM 2nd period report - Charles University in Prague
Activity report - CU
Within WP2, we participated in ACL WMT 08 with a further improved version of our
factored English-to-Czech translation, achieving results comparable to a
commercial MT system.
Within WP3, we implemented a tree-to-tree transfer system and a heuristic
extraction of treelet pairs from a parallel treebank, and finished two
implementations (Java and Perl) of a tree-to-tree aligner based on the STSG
model as previously designed. The Perl implementation is a complete rewrite of
the aligner, more concise and with slightly lower memory requirements. Due to
natural and random structural divergence between Czech and English sentences,
both the heuristic extraction and the STSG-based aligner produce treelet
translation dictionaries with insufficient coverage. Moreover, the STSG-based
aligner has extreme memory requirements, and we are examining several pruning
methods to scale it up. The transfer step at the tectogrammatical layer
currently faces a combinatorial explosion of possible output attribute values
and we are re-designing the search to overcome the problem.
Additionally, we evaluated the correlation of various automatic MT metrics with
human judgements for English-to-Czech translation. The results agree with
observations for English as the target language: BLEU correlates moderately
well, but there are better-performing metrics.
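
As an illustration only, the correlation between an automatic metric and human
judgements can be computed as sketched below. This is a minimal sketch, not the
evaluation code used for this report: the function names are hypothetical, and
the choice of Pearson and Spearman coefficients is an assumption, since the
report does not state which coefficient was used.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks of the values; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))
```

In such a setup, one list would hold the metric score of each MT system and the
other the corresponding human score; a coefficient closer to 1 would indicate
that the metric ranks systems more like the human judges do.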
Translation revision, preprocessing and annotation of the parallel Czech-English
treebank continued both for Czech and for English.
Broadening our scope, we also collected a small Czech-English-Russian
pairwise-aligned corpus.