This document (http://pi7.fernuni-hagen.de/GIRT/tagged_GIRT.html) describes the Part-of-Speech (PoS) tagged files for the German texts in the GIRT-4 collection (i.e. GIRT4-DE) developed by Michael Kluck. The tagged files are produced by IICS at the FernUniversität in Hagen and are distributed on the same web site as GIRT-4. See the CLEF Domain-Specific Track for details on this collection. We provide the tagged files without any guarantees. We hope to foster research in information retrieval that uses deeper methods from computational linguistics. If you have plans or ideas to improve or extend the files, please contact us at IICS, FernUniversität in Hagen.
The PoS tagger is a combination of two taggers (namely MET and T3) provided by Ingo Schröder's ACOPOST package. We combined them using TiMBL 5.1, which yields a sequential tagger architecture. The training material is the NEGRA corpus (version 2 plus our corrections). The PoS tag set is STTS except for one simplification also found elsewhere, e.g. in the TIGER corpus: the tag PIDAT is mapped to the tag PIAT.
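The text does not say which features the combining step uses. As a rough illustration only, the sketch below builds one instance per token from the tags proposed by two base taggers within a fixed window, which is one common way to train a second-stage classifier. The function name `window_instances`, the padding symbol, and the feature layout are all hypothetical and are not taken from the actual system.

```python
PAD = "_"  # hypothetical padding symbol for positions outside the sentence


def window_instances(met_tags, t3_tags, gold_tags=None, width=5):
    """Build one (features, label) pair per token.

    Assumption: features are the tags proposed by both base taggers
    for every position in a window of `width` tokens centred on the
    current token. The real feature set is not documented here.
    """
    if len(met_tags) != len(t3_tags):
        raise ValueError("both taggers must tag the same tokens")
    half = width // 2
    instances = []
    for i in range(len(met_tags)):
        feats = []
        for j in range(i - half, i + half + 1):
            inside = 0 <= j < len(met_tags)
            feats.append(met_tags[j] if inside else PAD)
            feats.append(t3_tags[j] if inside else PAD)
        # The label is the gold tag during training, None when tagging.
        label = gold_tags[i] if gold_tags is not None else None
        instances.append((tuple(feats), label))
    return instances
```

Instances of this shape could then be handed to any classifier; how they would be written out for TiMBL is left open here.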
Each file in the directory named ces+pos corresponds to one entry in the original SGML files for GIRT. Files are encoded according to the Corpus Encoding Standard (CES), but without any header information. Words (w element) and sentences (s element) are automatically marked up with the NLP technology at IICS, FernUniversität in Hagen. The Part-of-Speech tag is added as the value of the pos attribute on <w> elements.
The first sentence of a tagged text is from the TITLE-DE element in GIRT. The remaining sentences are from the ABSTRACT-DE element in GIRT.
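A minimal sketch of how such a file might be read. Because the files carry no header and their outermost element is not described here, the sketch uses regular expressions instead of an XML parser. It assumes the attribute order seen in the sample file (`offset`, then `pos`). The function names and the `encoding` parameter are hypothetical, and the actual character encoding of the files must be supplied by the caller.

```python
import gzip
import html
import re

# Assumption: attributes appear in the order id / offset, pos, as in the sample.
SENT_RE = re.compile(r'<s\s+id="([^"]*)"\s*>(.*?)</s>', re.DOTALL)
WORD_RE = re.compile(r'<w\s+offset="(\d+)"\s+pos="([^"]*)"\s*>(.*?)</w>', re.DOTALL)


def parse_ces(text):
    """Return a list of sentences, each a list of (offset, pos, word)."""
    sentences = []
    for _sent_id, body in SENT_RE.findall(text):
        words = [(int(off), pos, html.unescape(word))
                 for off, pos, word in WORD_RE.findall(body)]
        sentences.append(words)
    return sentences


def split_title_abstract(sentences):
    """First sentence is the title (TITLE-DE), the rest is the abstract."""
    title = sentences[0] if sentences else []
    return title, sentences[1:]


def read_ces_gz(path, encoding):
    """Read one gzip-compressed file; `encoding` must be chosen by the caller."""
    with gzip.open(path, "rt", encoding=encoding) as handle:
        return parse_ces(handle.read())
```

For example, `split_title_abstract(read_ces_gz(path, encoding))` would give the title sentence and the list of abstract sentences for one document.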
All steps were performed automatically, without any manual intervention. On newspaper corpora, we achieve a word tag accuracy of 96.6% (see below); on GIRT texts, a somewhat lower accuracy must be expected because of the domain-specific nature of the texts, the mixed quality of the abstracts, and the sometimes problematic encoding.
Here is the beginning of one sample file (GIRT-DE19900054.ces.gz) from the collection of tagged GIRT4-DE files:
<s id="p1.s1">
  <w offset="0" pos="ART">Der</w>
  <w offset="4" pos="NN">Einfluß</w>
  <w offset="12" pos="ART">des</w>
  <w offset="16" pos="NN">EG-Binnenmarktes</w>
  <w offset="33" pos="CARD">1992</w>
  <w offset="38" pos="APPR">auf</w>
  <w offset="42" pos="ART">die</w>
  <w offset="46" pos="NN">Schadstoffbelastung</w>
  <w offset="66" pos="ART">der</w>
  <w offset="70" pos="NN">Fließgewässer</w>
  <w offset="84" pos="APPR">in</w>
  <w offset="87" pos="ART">der</w>
  <w offset="91" pos="NN">Bundesrepublik</w>
  <w offset="106" pos="NE">Deutschland</w>
  <w offset="117" pos="$.">.</w>
</s>
<s id="p1.s2">
  <w offset="119" pos="NN">Ziel</w>
  <w offset="124" pos="ART">der</w>
  <w offset="128" pos="NN">Untersuchung</w>
  <w offset="141" pos="VAFIN">ist</w>
  <w offset="145" pos="ART">eine</w>
  <w offset="150" pos="NN">Abschätzung</w>
  <w offset="162" pos="ART">des</w>
  <w offset="166" pos="NN">Einflusses</w>
  <w offset="177" pos="ART">der</w>
  <w offset="181" pos="NN">Realisierung</w>
  <w offset="194" pos="ART">des</w>
  <w offset="198" pos="NN">EG-Binnenmarktes</w>
  <w offset="215" pos="APPR">auf</w>
  <w offset="219" pos="ART">die</w>
  <w offset="223" pos="NN">Schadstoffbelastung</w>
  <w offset="243" pos="ART">der</w>
  <w offset="247" pos="NN">Fließgewässer</w>
  <w offset="261" pos="APPR">in</w>
  <w offset="264" pos="ART">der</w>
  <w offset="268" pos="NN">Bundesrepublik</w>
  <w offset="282" pos="$.">.</w>
</s>
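In the sample, each `offset` value is consistent with a character position in the running text: a word of length n at offset k is followed by a word at k + n + 1 when a space separates them, and at k + n when none does (as for the final period). Under that assumption, which is inferred from the sample and not stated explicitly, the surface text can be approximately rebuilt from the offsets. The helper name `rebuild_text` is hypothetical.

```python
def rebuild_text(words):
    """Rebuild running text from (offset, token) pairs.

    Assumption: offsets count characters, and any gap between two
    tokens was whitespace, which is rendered here as plain spaces.
    Offsets are treated relative to the first token given.
    """
    pieces = []
    position = None
    for offset, token in sorted(words):
        if position is None:
            position = offset  # start at the first token
        pieces.append(" " * max(0, offset - position))
        pieces.append(token)
        position = offset + len(token)
    return "".join(pieces)
```

Original line breaks or tabs, if there were any, cannot be recovered this way.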
Below are some of our correctness results on the NEGRA corpus version 2 (plus our corrections; training data: folds 1 to 9, test data: fold 0). The tagger listed in the last table row was used for annotating GIRT.
| Tagger | Sentence correctness | Word correctness | Known-word correctness | Unknown-word correctness |
|---|---|---|---|---|
| acopost-T3 | 55.922% | 96.074% | 97.775% | 84.570% |
| acopost-MET | 55.680% | 96.100% | 97.237% | 88.412% |
| acopost-et | 46.845% | 94.649% | 96.585% | 81.561% |
| combi-tagger (MET+T3, 5 words) | 58.689% | 96.553% | 97.814% | 88.026% |
| combi-tagger (MET+T3, 7 words) | 58.883% | 96.551% | 97.802% | 88.087% |
| combi-tagger (MET+T3, 5 words, extended lexicon) | 58.544% | 96.577% | 97.959% | 87.233% |
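The four measures in the table are not defined in this document. The sketch below shows how they are commonly computed, under the following assumptions: a sentence counts as correct only if every tag in it is correct, and a word counts as known if it occurred in the training data. The function name `evaluate` and its argument layout are hypothetical.

```python
def evaluate(gold_sents, pred_sents, known_words):
    """Return percentages for word, sentence, known and unknown accuracy.

    gold_sents / pred_sents: parallel lists of sentences, each a list
    of (word, tag) pairs. known_words: set of words seen in training
    (an assumed definition of "known").
    """
    counts = {"word": [0, 0], "sentence": [0, 0],
              "known": [0, 0], "unknown": [0, 0]}
    for gold, pred in zip(gold_sents, pred_sents):
        sentence_ok = True
        for (word, gold_tag), (_word, pred_tag) in zip(gold, pred):
            ok = gold_tag == pred_tag
            sentence_ok = sentence_ok and ok
            bucket = "known" if word in known_words else "unknown"
            for key in ("word", bucket):
                counts[key][0] += ok  # correct
                counts[key][1] += 1   # total
        counts["sentence"][0] += sentence_ok
        counts["sentence"][1] += 1
    return {key: (100.0 * correct / total if total else float("nan"))
            for key, (correct, total) in counts.items()}
```

With ten folds, this would be run once with fold 0 as test data and the words of folds 1 to 9 as `known_words`.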
53 PoS tags occur in the output of the tagger. Only the STTS tags UNKNOWN and PIDAT are missing (see above). The following absolute tag frequencies have been calculated (listed in ascending order, count followed by tag):
6 VMPP
8 PPOSS
15 VAIMP
109 ITJ
351 PTKANT
1388 APPO
4378 PTKA
5036 VVIMP
5925 VMINF
8509 APZR
9444 VAPP
9674 PRELAT
12306 PWS
17366 KOUI
21257 PWAT
26605 XY
30258 VVIZU
34882 PDS
39327 KOKOM
43229 FM
51508 PWAV
52189 PIS
59926 PTKNEG
68023 VAINF
86186 PIAT
86941 PTKVZ
92118 TRUNC
94300 PTKZU
95292 VMFIN
110058 PPOSAT
112111 PROAV
117325 PDAT
130693 PRF
130755 KOUS
131136 PRELS
164856 PPER
206867 VVINF
375819 CARD
403869 VVPP
408023 APPRART
426367 ADJD
501120 VAFIN
511278 ADV
585733 NE
617778 VVFIN
875986 $,
946400 KON
1175476 $(
1497258 $.
1929173 APPR
2023299 ADJA
2862394 ART
5881468 NN
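A list in this format can be produced from any stream of tags. This is a minimal sketch with hypothetical function names; how the tags are extracted from the files is left to the caller.

```python
from collections import Counter


def count_tags(tags):
    """Count how often each PoS tag occurs in an iterable of tags."""
    return Counter(tags)


def format_ascending(counts):
    """Return 'count tag' lines sorted by ascending count, then by tag."""
    ordered = sorted(counts.items(), key=lambda item: (item[1], item[0]))
    return [f"{count} {tag}" for tag, count in ordered]
```

For example, `print("\n".join(format_ascending(count_tags(all_tags))))` would print one line per tag, where `all_tags` stands for the `pos` values collected from the tagged files.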
Sven Hartrumpf