Part-of-Speech tagged GIRT4-DE

Sven Hartrumpf, 2004-10-18, last change: 2005-02-28

Introduction

This document (http://pi7.fernuni-hagen.de/GIRT/tagged_GIRT.html) describes the Part-of-Speech (PoS) tagged files for the German texts in the GIRT-4 collection (i.e. GIRT4-DE) developed by Michael Kluck. The tagged files are produced by IICS at the FernUniversität in Hagen and are distributed on the web site that distributes GIRT-4. See the CLEF Domain-Specific Track for details on this collection. We provide the tagged files without any guarantees. We hope to foster research in information retrieval that uses deeper methods from computational linguistics. If you have plans or ideas to improve or extend the files, please contact us at IICS, FernUniversität in Hagen.

Tagger

The PoS tagger is a combination of two taggers (namely MET and T3) provided by Ingo Schröder's ACOPOST package. We have combined them by means of TiMBL 5.1 so that a sequential tagger architecture results. The training material is the NEGRA corpus (version 2 plus our corrections). The PoS tag set is STTS except for one simplification also found elsewhere, e.g. in the TIGER corpus: the tag PIDAT is mapped to the tag PIAT.
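
The tag set simplification amounts to a one-entry mapping applied to every tag. The following is a minimal sketch of that idea, not the code used at IICS; the names `TAG_SIMPLIFICATION` and `simplify_tag` are hypothetical.

```python
# Hypothetical helper illustrating the STTS simplification described above:
# PIDAT is mapped to PIAT, all other tags are left unchanged.
TAG_SIMPLIFICATION = {"PIDAT": "PIAT"}


def simplify_tag(tag: str) -> str:
    """Return the simplified tag for a given STTS tag."""
    return TAG_SIMPLIFICATION.get(tag, tag)


print([simplify_tag(t) for t in ["PIDAT", "PIAT", "NN"]])
```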

Tagged Files

Each file in the directory named ces+pos corresponds to one entry in the original SGML files for GIRT. Files are encoded according to the Corpus Encoding Standard (CES), but without any header information. Words (w element) and sentences (s element) are automatically marked up with the NLP technology at IICS, FernUniversität in Hagen. The Part-of-Speech tag is added as the value of the pos attribute on <w> elements.

The first sentence of a tagged text is from the TITLE-DE element in GIRT. The remaining sentences are from the ABSTRACT-DE element in GIRT.
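
As an illustration of how the markup can be consumed, the sketch below extracts sentences and tagged words and then separates the title sentence from the abstract sentences. It rests on assumptions: the attribute order (`offset` before `pos`) is taken from the sample file, entity decoding with `html.unescape` is a guess at how special characters are escaped, and the function names are hypothetical. A regular expression is used instead of an XML parser because the files carry no header and may lack a single root element.

```python
import html
import re

# Patterns for the markup as it appears in the sample file; real files may
# contain variations that this sketch does not handle.
SENTENCE_RE = re.compile(r'<s id="([^"]*)">(.*?)</s>', re.DOTALL)
WORD_RE = re.compile(r'<w offset="(\d+)" pos="([^"]*)">(.*?)</w>', re.DOTALL)


def parse_sentences(text):
    """Return a list of (sentence_id, [(offset, pos, word), ...]) tuples."""
    sentences = []
    for sentence_id, body in SENTENCE_RE.findall(text):
        words = [(int(offset), pos, html.unescape(word))
                 for offset, pos, word in WORD_RE.findall(body)]
        sentences.append((sentence_id, words))
    return sentences


def split_title_abstract(sentences):
    """The first sentence is the title (TITLE-DE); the rest is the abstract."""
    return sentences[:1], sentences[1:]


# Abbreviated excerpt of the sample file, for demonstration only.
sample = """<s id="p1.s1">
<w offset="0" pos="ART">Der</w>
<w offset="4" pos="NN">Einfluß</w>
</s>
<s id="p1.s2">
<w offset="119" pos="NN">Ziel</w>
<w offset="124" pos="ART">der</w>
</s>"""

title, abstract = split_title_abstract(parse_sentences(sample))
print(title)
print(abstract)
```

For a compressed file, the text could be read with `gzip.open(path, "rt", encoding=...)`; the character encoding of the files is not stated in this document and has to be determined first.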

All steps were carried out automatically, without any manual intervention. On newspaper corpora, we achieve a word tag accuracy of 96.6% (see Tagger Results below); on GIRT texts, a somewhat lower accuracy must be expected because of the domain-specific language, the varying quality of the abstracts, and the encoding, which is sometimes problematic.

Sample File

Here is the beginning of one sample file (GIRT-DE19900054.ces.gz) from the collection of tagged GIRT4-DE files:

<s id="p1.s1">
<w offset="0" pos="ART">Der</w>
<w offset="4" pos="NN">Einfluß</w>
<w offset="12" pos="ART">des</w>
<w offset="16" pos="NN">EG-Binnenmarktes</w>
<w offset="33" pos="CARD">1992</w>
<w offset="38" pos="APPR">auf</w>
<w offset="42" pos="ART">die</w>
<w offset="46" pos="NN">Schadstoffbelastung</w>
<w offset="66" pos="ART">der</w>
<w offset="70" pos="NN">Fließgewässer</w>
<w offset="84" pos="APPR">in</w>
<w offset="87" pos="ART">der</w>
<w offset="91" pos="NN">Bundesrepublik</w>
<w offset="106" pos="NE">Deutschland</w>
<w offset="117" pos="$.">.</w>
</s>
<s id="p1.s2">
<w offset="119" pos="NN">Ziel</w>
<w offset="124" pos="ART">der</w>
<w offset="128" pos="NN">Untersuchung</w>
<w offset="141" pos="VAFIN">ist</w>
<w offset="145" pos="ART">eine</w>
<w offset="150" pos="NN">Abschätzung</w>
<w offset="162" pos="ART">des</w>
<w offset="166" pos="NN">Einflusses</w>
<w offset="177" pos="ART">der</w>
<w offset="181" pos="NN">Realisierung</w>
<w offset="194" pos="ART">des</w>
<w offset="198" pos="NN">EG-Binnenmarktes</w>
<w offset="215" pos="APPR">auf</w>
<w offset="219" pos="ART">die</w>
<w offset="223" pos="NN">Schadstoffbelastung</w>
<w offset="243" pos="ART">der</w>
<w offset="247" pos="NN">Fließgewässer</w>
<w offset="261" pos="APPR">in</w>
<w offset="264" pos="ART">der</w>
<w offset="268" pos="NN">Bundesrepublik</w>
<w offset="282" pos="$.">.</w>
</s>
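
In the sample above, each `offset` value equals the previous offset plus the previous word's length plus any separating character, which suggests that offsets are character positions in the running text (title followed by abstract). Under that assumption, the surface text can be approximated by placing each word at its offset. This is a sketch only: the function name `rebuild_text` is hypothetical, and gaps are filled with spaces although the original separators are not recorded in the tagged file.

```python
def rebuild_text(words, start=0):
    """Place each word at its character offset, padding gaps with spaces.

    `words` is a list of (offset, word) pairs in document order; `start` is
    the offset at which the returned string begins.
    """
    pieces = []
    position = start
    for offset, word in words:
        pieces.append(" " * (offset - position))  # gap before this word
        pieces.append(word)
        position = offset + len(word)
    return "".join(pieces)


# First six words of the sample file with their offsets.
words = [(0, "Der"), (4, "Einfluß"), (12, "des"), (16, "EG-Binnenmarktes"),
         (33, "1992"), (38, "auf")]
print(rebuild_text(words))  # Der Einfluß des EG-Binnenmarktes 1992 auf
```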

Tagger Results

Below are some of our correctness results on the NEGRA corpus version 2 (plus our corrections; training data: folds 1 to 9, test data: fold 0). The tagger listed in the last table row was used for annotating GIRT.

Tagger                                            Sentence Cor.  Word Cor.  Known Word Cor.  Unknown Word Cor.
acopost-T3                                        55.922%        96.074%    97.775%          84.570%
acopost-MET                                       55.680%        96.100%    97.237%          88.412%
acopost-et                                        46.845%        94.649%    96.585%          81.561%
combi-tagger (MET+T3, 5 words)                    58.689%        96.553%    97.814%          88.026%
combi-tagger (MET+T3, 7 words)                    58.883%        96.551%    97.802%          88.087%
combi-tagger (MET+T3, 5 words, extended lexicon)  58.544%        96.577%    97.959%          87.233%
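
The three word-level columns can be related to each other if one assumes that the overall word accuracy is the token-weighted average of the known-word and unknown-word accuracies. This assumption is not stated in the source, and the function name below is hypothetical. Under it, the share of unknown words in the test fold can be estimated from any table row:

```python
def unknown_word_share(word_acc, known_acc, unknown_acc):
    """Solve word_acc = (1 - u) * known_acc + u * unknown_acc for u."""
    return (known_acc - word_acc) / (known_acc - unknown_acc)


# (word, known word, unknown word) accuracies in percent, from the table.
rows = {
    "acopost-T3": (96.074, 97.775, 84.570),
    "acopost-MET": (96.100, 97.237, 88.412),
    "combi-tagger (MET+T3, 5 words, extended lexicon)": (96.577, 97.959, 87.233),
}
for name, accuracies in rows.items():
    print(f"{name}: {unknown_word_share(*accuracies):.4f}")
```

All three rows give a value close to 0.129, so under this assumption roughly 13% of the test tokens count as unknown words.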

53 PoS tags occur in the output of the tagger. Only the STTS tags UNKNOWN and PIDAT are missing (see above). The following absolute tag frequencies have been calculated:

      6 VMPP
      8 PPOSS
     15 VAIMP
    109 ITJ
    351 PTKANT
   1388 APPO
   4378 PTKA
   5036 VVIMP
   5925 VMINF
   8509 APZR
   9444 VAPP
   9674 PRELAT
  12306 PWS
  17366 KOUI
  21257 PWAT
  26605 XY
  30258 VVIZU
  34882 PDS
  39327 KOKOM
  43229 FM
  51508 PWAV
  52189 PIS
  59926 PTKNEG
  68023 VAINF
  86186 PIAT
  86941 PTKVZ
  92118 TRUNC
  94300 PTKZU
  95292 VMFIN
 110058 PPOSAT
 112111 PROAV
 117325 PDAT
 130693 PRF
 130755 KOUS
 131136 PRELS
 164856 PPER
 206867 VVINF
 375819 CARD
 403869 VVPP
 408023 APPRART
 426367 ADJD
 501120 VAFIN
 511278 ADV
 585733 NE
 617778 VVFIN
 875986 $,
 946400 KON
1175476 $(
1497258 $.
1929173 APPR
2023299 ADJA
2862394 ART
5881468 NN
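
Absolute counts in this format are easy to turn into relative frequencies. The sketch below parses lines of the form `<count> <tag>` and computes each tag's share; the function names are hypothetical, and the excerpt contains only the four most frequent tags, so the printed shares are relative to the excerpt and not to the whole collection.

```python
def parse_frequencies(listing):
    """Parse lines of the form '<count> <tag>' into a {tag: count} dict."""
    counts = {}
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0].isdigit():
            counts[fields[1]] = int(fields[0])
    return counts


def relative_frequencies(counts):
    """Return {tag: share of all counted tokens}."""
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.items()}


# Excerpt: the four most frequent tags from the list above.
excerpt = """
1929173 APPR
2023299 ADJA
2862394 ART
5881468 NN
"""
counts = parse_frequencies(excerpt)
for tag, share in relative_frequencies(counts).items():
    print(f"{tag}: {share:.3f}")
```

Passing the complete list instead of the excerpt gives the shares for the whole tagged collection.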

Sven Hartrumpf
Intelligent Information and Communication Systems (IICS)
FernUniversität in Hagen
58084 Hagen - Germany
http://pi7.fernuni-hagen.de
