This document (http://pi7.fernuni-hagen.de/GIRT/tagged_GIRT.html) describes the Part-of-Speech (PoS) tagged files for the German texts in the GIRT-4 collection (i.e. GIRT4-DE) developed by Michael Kluck. The tagged files are produced by IICS at the FernUniversität in Hagen and are distributed on the web site that distributes GIRT-4. See the CLEF Domain-Specific Track for details on this collection. We provide the tagged files without any guarantees. We hope to foster research in information retrieval that uses deeper methods from computational linguistics. If you have plans or ideas to improve or extend the files, please contact us at IICS, FernUniversität in Hagen.
The PoS tagger is a combination of two taggers (namely MET and T3) provided by Ingo Schröder's ACOPOST package. We combined them using TiMBL 5.1, which yields a sequential tagger architecture. The training material is the NEGRA corpus (version 2 plus our corrections). The PoS tag set is STTS, with one simplification that is also found elsewhere, e.g. in the TIGER corpus: the tag PIDAT is mapped to the tag PIAT.
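When the tagger output is compared with a corpus that uses the unmodified STTS tag set, the reference tags need the same simplification. The following is a minimal sketch in Python; the names `TAG_SIMPLIFICATION` and `simplify_stts_tag` are hypothetical and not part of ACOPOST or TiMBL.

```python
# Hypothetical helper: applies the single tag-set simplification described
# above (PIDAT is mapped to PIAT); all other tags are left unchanged.
TAG_SIMPLIFICATION = {"PIDAT": "PIAT"}


def simplify_stts_tag(tag):
    """Return the simplified tag for an STTS tag."""
    return TAG_SIMPLIFICATION.get(tag, tag)


# Example: reference tags from a corpus that still distinguishes PIDAT.
reference_tags = ["PIDAT", "ART", "NN"]
simplified = [simplify_stts_tag(tag) for tag in reference_tags]
```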
Each file in the directory named ces+pos corresponds to one entry in the original SGML files for GIRT. Files are encoded according to the Corpus Encoding Standard (CES), but without any header information. Words (w element) and sentences (s element) are automatically marked up with the NLP technology at IICS, FernUniversität in Hagen. The Part-of-Speech tag is added as the value of the pos attribute on <w> elements.
The first sentence of a tagged text is from the TITLE-DE element in GIRT. The remaining sentences are from the ABSTRACT-DE element in GIRT.
All steps were performed automatically, without any manual intervention. On newspaper corpora, we achieve a word tagging accuracy of 96.6% (see below); on GIRT texts, a somewhat lower accuracy must be expected because of the domain-specific nature of the texts, the mixed quality of the abstracts, and the sometimes problematic encoding.
Here is the beginning of one sample file (GIRT-DE19900054.ces.gz) from the collection of tagged GIRT4-DE files:
<s id="p1.s1">
  <w offset="0" pos="ART">Der</w>
  <w offset="4" pos="NN">Einfluß</w>
  <w offset="12" pos="ART">des</w>
  <w offset="16" pos="NN">EG-Binnenmarktes</w>
  <w offset="33" pos="CARD">1992</w>
  <w offset="38" pos="APPR">auf</w>
  <w offset="42" pos="ART">die</w>
  <w offset="46" pos="NN">Schadstoffbelastung</w>
  <w offset="66" pos="ART">der</w>
  <w offset="70" pos="NN">Fließgewässer</w>
  <w offset="84" pos="APPR">in</w>
  <w offset="87" pos="ART">der</w>
  <w offset="91" pos="NN">Bundesrepublik</w>
  <w offset="106" pos="NE">Deutschland</w>
  <w offset="117" pos="$.">.</w>
</s>
<s id="p1.s2">
  <w offset="119" pos="NN">Ziel</w>
  <w offset="124" pos="ART">der</w>
  <w offset="128" pos="NN">Untersuchung</w>
  <w offset="141" pos="VAFIN">ist</w>
  <w offset="145" pos="ART">eine</w>
  <w offset="150" pos="NN">Abschätzung</w>
  <w offset="162" pos="ART">des</w>
  <w offset="166" pos="NN">Einflusses</w>
  <w offset="177" pos="ART">der</w>
  <w offset="181" pos="NN">Realisierung</w>
  <w offset="194" pos="ART">des</w>
  <w offset="198" pos="NN">EG-Binnenmarktes</w>
  <w offset="215" pos="APPR">auf</w>
  <w offset="219" pos="ART">die</w>
  <w offset="223" pos="NN">Schadstoffbelastung</w>
  <w offset="243" pos="ART">der</w>
  <w offset="247" pos="NN">Fließgewässer</w>
  <w offset="261" pos="APPR">in</w>
  <w offset="264" pos="ART">der</w>
  <w offset="268" pos="NN">Bundesrepublik</w>
  <w offset="282" pos="$.">.</w>
</s>
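Markup of this kind can be read with a few lines of Python. The sketch below rests on several assumptions that the description does not confirm: that a file is a bare sequence of s elements without a root element or XML declaration (hence the synthetic `doc` wrapper), that the markup is otherwise well-formed XML, and that the text is encoded as ISO-8859-1. The function names are hypothetical.

```python
import gzip
import xml.etree.ElementTree as ET
from collections import Counter


def parse_tagged_markup(markup):
    """Return a list of sentences; each is a list of (offset, word, pos)."""
    # Assumption: the content is a bare sequence of <s> elements, so a
    # synthetic root element is added to obtain a single XML document.
    root = ET.fromstring("<doc>" + markup + "</doc>")
    sentences = []
    for s in root.iter("s"):
        words = [(int(w.get("offset")), w.text or "", w.get("pos"))
                 for w in s.iter("w")]
        sentences.append(words)
    return sentences


def read_tagged_file(path, encoding="iso-8859-1"):
    """Read one gzip-compressed tagged file (the encoding is an assumption)."""
    with gzip.open(path, "rt", encoding=encoding) as handle:
        return parse_tagged_markup(handle.read())


# A shortened excerpt of the sample above.
sample = (
    '<s id="p1.s1"> <w offset="0" pos="ART">Der</w> '
    '<w offset="4" pos="NN">Einfluß</w> <w offset="12" pos="ART">des</w> </s> '
    '<s id="p1.s2"> <w offset="119" pos="NN">Ziel</w> '
    '<w offset="124" pos="ART">der</w> </s>'
)
sentences = parse_tagged_markup(sample)

# The first sentence comes from TITLE-DE, the rest from ABSTRACT-DE.
title, abstract = sentences[0], sentences[1:]

# Tag frequencies over all words of the parsed text.
tag_counts = Counter(pos for sentence in sentences for _, _, pos in sentence)
```

For a real file, `read_tagged_file("GIRT-DE19900054.ces.gz")` would return the same structure, provided the assumptions above hold.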
Below are some of our correctness results on the NEGRA corpus version 2 (plus our corrections; training data: folds 1 to 9, test data: fold 0). The tagger listed in the last table row was used for annotating GIRT.
Tagger | Sentence correctness | Word correctness | Known-word correctness | Unknown-word correctness |
acopost-T3 | 55.922% | 96.074% | 97.775% | 84.570% |
acopost-MET | 55.680% | 96.100% | 97.237% | 88.412% |
acopost-et | 46.845% | 94.649% | 96.585% | 81.561% |
combi-tagger (MET+T3, 5 words) | 58.689% | 96.553% | 97.814% | 88.026% |
combi-tagger (MET+T3, 7 words) | 58.883% | 96.551% | 97.802% | 88.087% |
combi-tagger (MET+T3, 5 words, extended lexicon) | 58.544% | 96.577% | 97.959% | 87.233% |
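To put the differences in word correctness into perspective, one can compute the relative error reduction of the combined tagger over a single tagger. This is a derived figure that is not reported in the table itself; the sketch below uses a hypothetical function name and takes the word correctness values from the rows for acopost-MET and for the combi-tagger with extended lexicon.

```python
def relative_error_reduction(baseline_accuracy, improved_accuracy):
    """Fraction of the baseline's errors that the improved tagger removes.

    Both accuracies are given in percent.
    """
    baseline_error = 100.0 - baseline_accuracy
    improved_error = 100.0 - improved_accuracy
    return (baseline_error - improved_error) / baseline_error


# Word correctness values copied from the table above.
met_word_correctness = 96.100
combi_extended_word_correctness = 96.577

reduction = relative_error_reduction(met_word_correctness,
                                     combi_extended_word_correctness)
print(f"Relative error reduction: {reduction:.1%}")
```

Under this measure, the combined tagger removes roughly one eighth of the word errors made by acopost-MET alone.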
53 PoS tags occur in the output of the tagger. Only the STTS tags UNKNOWN and PIDAT are missing (see above). The following absolute tag frequencies have been calculated:
6 VMPP
8 PPOSS
15 VAIMP
109 ITJ
351 PTKANT
1388 APPO
4378 PTKA
5036 VVIMP
5925 VMINF
8509 APZR
9444 VAPP
9674 PRELAT
12306 PWS
17366 KOUI
21257 PWAT
26605 XY
30258 VVIZU
34882 PDS
39327 KOKOM
43229 FM
51508 PWAV
52189 PIS
59926 PTKNEG
68023 VAINF
86186 PIAT
86941 PTKVZ
92118 TRUNC
94300 PTKZU
95292 VMFIN
110058 PPOSAT
112111 PROAV
117325 PDAT
130693 PRF
130755 KOUS
131136 PRELS
164856 PPER
206867 VVINF
375819 CARD
403869 VVPP
408023 APPRART
426367 ADJD
501120 VAFIN
511278 ADV
585733 NE
617778 VVFIN
875986 $,
946400 KON
1175476 $(
1497258 $.
1929173 APPR
2023299 ADJA
2862394 ART
5881468 NN

Sven Hartrumpf