DARIAH-DKPro-Wrapper v0.4.7

This is a short user guide for the current version v0.4.7 of the DARIAH-DKPro-Wrapper.

System Requirements

To run the pipeline properly, a system equipped with and able to handle at least 4 GB RAM is recommended. The following operating systems have been tested:

Systems:

macOS (10.10 - 10.13)
Linux (Ubuntu 14.04 - 17.10)
Windows 7 - 10

Furthermore, the pipeline depends on an internet connection when running to download the models for the current configuration. It does not work off line!

The pipeline requires Java 1.8 or higher. You can download Java from the Oracle website. You can check your current Java version by running java -version in your command line. You should use the 64 bit version of Java in order to be able to allocate ≥ 4 GB of RAM.

Running the Pipeline

After downloading and unzipping the files, execute in your command line the following code:

java -Xmx4g -jar ddw-0.4.7.jar -input file.txt -output folder

If you do not specify the -language parameter, the pipeline is prepared to analyze English input. Support for the following languages are included in the current version of the DARIAH-DKPro-Wrapper: German (de), English (en), Spanish (es), and French (fr). If you want to work with Bulgarian (bg), Danish (da), Estonian (et), Finnish (fi), Galician (gl), Latin (la), Mongolian (mn), Polish (pl), Russian (ru), Slovakian (sk) or Swahili (sw) input, you have to install TreeTagger first. To run the pipeline for German, execute the following command:

java -Xmx4g -jar  ddw-0.4.7.jar -language de -input file.txt -output folder

Run the Full Pipeline

By default, the pipeline runs in a light mode, the memory and time intensive components for parsing and semantic role labeling are disabled.

If you like to use them, feel free to enable them in the default.properties or create a new .properties-File and pass the path to this file via the config-parameter.

Program Parameters

Run java -jar ddw-0.4.7.jar -help to get an overview of the possible command line arguments:

 -config <path>     Config file
 -help              print this message
 -input <path>      Input path
 -language <lang>   Language code for input file (default: en)
 -output <path>     Output path
 -reader <reader>   Either text (default) or xml
 -resume            Already processed files will be skipped

The pipeline supports a resume function. By adding the -resume argument to the exection of the pipeline, all files that were previously processed and have an according .csv-file in the output folder will be skipped.

File Reader

You can process either single files or also all files inside a directory. Patterns can be used to select specific files that should be processed.

Text Reader & XML Reader

The DARIAH-DKPro-Wrapper implements two base readers, one text reader and one XML-file reader. You can specify the reader that should be used with the -reader parameter. By default, the text reader is used. To use the XML reader, run the pipeline in the following way:

java -Xmx4g -jar  ddw-0.4.7.jar -reader xml -input file.xml -output folder

The XML reader skips XML tags and processes only text which is inside the XML tags. The XPath to each tag is conserved and stored in the column SectionId in the ouput format.

Reading Directories

You can also specify for the -input argument a directory instead of a file. If you run the pipeline in the following way:

java -Xmx4g -jar  ddw-0.4.7.jar -input folder/With/Files/ -output folder

the pipeline will process all files with a .txt extension for the Text-reader. For the XML-reader, it will process all files with a .xml extension.

You can speficy also patterns to read in only certain files or files with certain extension. For example to read in only .tei with the XML reader, you must start the pipeline in the following way:

java -Xmx4g -jar  ddw-0.4.7.jar -reader xml -input "folder/With/Files/*.tei" -output folder

Note: If you use patterns (i.e. paths containing an *), you must set it into quotes to prevent shell globbing.

To read all files in all subfolders, you can use a pattern like this:

java -Xmx4g -jar  ddw-0.4.7.jar -input "folder/With/Subfolders/\**/*.txt" -output folder

This will read in all .txt files in all subfolders. Note that the subfolder path will not be maintained in the output folder.

Write Your Own Config Files

The pipeline can be configurated via properties-files that are stored in the configs folder. In this folder you find a default.properties, the most basic configuration file. For the different supported languages, you can find further properties-files, for example default_de.properties for German, default_en.properties for English and so on.

If you like to write your own config file, just create your own .properties file. You can run the pipeline with your .properties-file by setting the command argument.

java -Xmx4g -jar ddw-0.4.7.jar -config /path/to/my/config/myconfigfile.properties -input file.txt -output folder

In case you store your myconfigfile.properties in the configs folder, you can run the pipeline via:

java -Xmx4g -jar ddw-0.4.7.jar -config myconfigfile.properties -input file.txt -output folder

You can split your config file into different parts and pass them all to the pipeline by seperating the paths using comma or semicolons. The pipeline examines all passed config files and derives the final configuration from all files. The config-file passed as last arguments has the highest priority, i.e. it can overwrite the values for all previous config files:

java -Xmx4g -jar ddw-0.4.7.jar -config myfile1.properties,myconfig2.properties,myfile3.properties -input file.txt -output folder

Note: The system always uses the default.properties and default_[langcode].properties as basic configuration files. All further config files are added on top of these files.

In case you like to use the full-version and also want to change the POS-tagger, you can run the pipeline in the following way:

java -Xmx4g -jar ddw-0.4.7.jar -config myFullVersion.properties,myPOSTagger.properties -input file.txt -output folder

In myPOSTagger.properties you just add the configuration for the different POS-tagger.

Note: The properties-files must use the ISO-8859-1 encoding. If you like to include UTF-8 characters, you must encode them using \u[HEXCode].

Understanding the Argument Parameter

Most components can be equipped with arguments so specifcy for example the model that should be used. Arguments are passed to the pipeline in a 3 tuple format. In the default.properties you can find the following line:

constituencyParserArguments = writeDependency,boolean,false

Here we specify the argument writeDependency with the boolean value false. As type you can use boolean, integer, and string.

Using TreeTagger

Due to copyright issues, TreeTagger cannot directly be accessed from the DKPro repository. Instead, you have first to download and to install TreeTagger to able to use it with DKPro.

Installation

Go to the TreeTagger website
From the download section, download the correct tagger package, i.e. PC-Linux, OS X or Windows
1. Extract the .tar.gz and .zip archive, respectively
2. Create a new directory tree-tagger containing two folders bin and lib on your hard drive, e.g. C:/tree-tagger/bin and C:/tree-tagger/lib
3. Copy the tree-tagger/bin/tree-tagger file from the previously downloaded archive to your recently created directory tree-tagger into the folder bin
From the parameter file section, download the correct model. For the example below download Latin parameter file (latin-par-linux-3.2-utf8.bin.gz)
1. Unzip the file (e.g. gunzip latin-par-linux-3.2-utf8.bin.gz or alternatively use a program like 7zip or WinRar)
2. Copy the extracted file latin.par into the folder lib in your created directory tree-tagger

Configuration

After downloading the correct executable and correct model, we must configure our pipeline in order to be able to use TreeTagger. You can find an example configuration in the configs folder treetagger-example.properties:

posTagger =  de.tudarmstadt.ukp.dkpro.core.treetagger.TreeTaggerPosTagger
posTaggerArguments = executablePath,string,C:/tree-tagger/bin/tree-tagger.exe,\
	modelLocation,string,C:/tree-tagger/lib/latin.par,\
	modelEncoding,string,utf-8

# Treetagger adds lemmas, no need for an additional lemmatizer
useLemmatizer = false

Change the paths for the parameter executablePath and modelLocation to the correct paths on your machine. Beware these values are case sensitive even on Windows – when unsure, copy & paste the paths from Explorer.

You can then use TreeTagger in your pipeline using the -config argument:

java -Xmx4g -jar ddw-0.4.7.jar -config treetagger-example.properties -language la -input file.txt -output folder

Check the output of the pipeline that TreeTagger is used. The output of your pipeline should look something like this:

POS-Tagger: true
POS-Tagger: class de.tudarmstadt.ukp.dkpro.core.treetagger.TreeTaggerPosTagger
POS-Tagger: executablePath, C:/tree-tagger/bin/tree-tagger.exe, modelLocation, C:/tree-tagger/lib/latin.par, modelEncoding, utf-8

Output files

The output files are UTF-8 encoded tab separated values that have a field heading line and don’t use quoting. Each line represents one token. The fields do not contain whitespace, but they may contain " or ' characters if these are in the text, so if you have trouble importing the output files, check your CSV reader’s settings. See the Tutorial for a usage example.

The wrapper’s output format is described in Fotis Jannidis, Stefan Pernes, Steffen Pielström, Isabella Reger, Nils Reimers, Thorsten Vitt: "DARIAH-DKPro-Wrapper Output Format (DOF) Specification". DARIAH-DE Working Papers Nr. 20. Göttingen: DARIAH-DE, 2016. URN: urn:nbn:de:gbv:7-dariah-2016-6-2.

Logging and reporting errors

The pipeline will only display terse status and error information on the screen in order to not overload users with useless information. Detailed information will be written to a log file, ddw.log — when you report bugs, please always provide that log file. The log file contains status information that is written to the screen, but also output that otherwise would be written to the screen by other components, together with source information and timestamps. Existing files will be appended to.

Experts might want to fine-tune what is displayed and what is logged — you can do so by providing your own log4j2 configuration file. To do so, download and modify our default log4j2 configuration file and run the pipeline using:

java -Dlog4j.configurationFile=your-log4j2.xml -Xmx4g ddw-0.4.7.jar [options]

See the log4j2 manual for more information.