This is a short user guide for the current version v0.4.7 of the DARIAH-DKPro-Wrapper.
System Requirements
To run the pipeline properly, a system with at least 4 GB of RAM is recommended. The following operating systems have been tested:
- macOS (10.10 - 10.13)
- Linux (Ubuntu 14.04 - 17.10)
- Windows 7 - 10
Furthermore, the pipeline needs an internet connection at run time to download the models for the current configuration. It does not work offline!
The pipeline requires Java 1.8 or higher. You can download Java from the Oracle website. You can check your current Java version by running java -version in your command line. You should use the 64-bit version of Java in order to be able to allocate ≥ 4 GB of RAM.
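The version check can also be scripted. The following is a minimal sketch, not part of the DARIAH-DKPro-Wrapper itself: the function names are hypothetical, and it assumes the usual output of java -version, where "1.8.0_151" denotes Java 8 and "11.0.2" denotes Java 11.

```python
import re
import subprocess


def java_major_version(version_output: str) -> int:
    """Extract the major Java version from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no version string found in the output")
    major = int(match.group(1))
    # Old numbering scheme: "1.8.0_151" means Java 8.
    # New numbering scheme: "11.0.2" means Java 11.
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def installed_java_is_supported(minimum: int = 8) -> bool:
    """Run `java -version` and compare the result against the required minimum.

    `java -version` writes to stderr, so both streams are inspected.
    """
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return java_major_version(result.stderr + result.stdout) >= minimum
```

Calling installed_java_is_supported() requires java to be on your PATH; it does not check whether the installation is 64-bit.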
Running the Pipeline
After downloading and unzipping the files, run the following command in your command line:
java -Xmx4g -jar ddw-0.4.7.jar -input file.txt -output folder
If you do not specify the -language parameter, the pipeline is prepared to analyze English input. Support for the following languages is included in the current version of the DARIAH-DKPro-Wrapper: German (de), English (en), Spanish (es), and French (fr). If you want to work with Bulgarian (bg), Danish (da), Estonian (et), Finnish (fi), Galician (gl), Latin (la), Mongolian (mn), Polish (pl), Russian (ru), Slovakian (sk) or Swahili (sw) input, you have to install TreeTagger first. To run the pipeline for German, execute the following command:
java -Xmx4g -jar ddw-0.4.7.jar -language de -input file.txt -output folder
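If you call the pipeline from a script, the language rules above can be checked before Java is started. This is only a sketch under stated assumptions: the helper names needs_treetagger and build_command are hypothetical, the language sets are copied from the paragraph above, and only the flags shown in this guide are used.

```python
# Language codes listed in this guide as supported out of the box.
BUILT_IN = {"de", "en", "es", "fr"}
# Language codes listed in this guide as requiring a TreeTagger installation.
TREETAGGER_ONLY = {"bg", "da", "et", "fi", "gl", "la", "mn", "pl", "ru", "sk", "sw"}


def needs_treetagger(language: str) -> bool:
    """Return True if the language only works after installing TreeTagger."""
    if language not in BUILT_IN | TREETAGGER_ONLY:
        raise ValueError(f"language code not mentioned in the guide: {language}")
    return language in TREETAGGER_ONLY


def build_command(input_path, output_path, language="en",
                  jar="ddw-0.4.7.jar", heap="4g"):
    """Assemble the argument list for the pipeline call shown in this guide."""
    needs_treetagger(language)  # raises for language codes the guide does not list
    command = ["java", f"-Xmx{heap}", "-jar", jar]
    if language != "en":  # English is the default, so the flag can be omitted
        command += ["-language", language]
    command += ["-input", input_path, "-output", output_path]
    return command
```

The returned list can be passed to subprocess.run. For a language where needs_treetagger returns True, the call will only succeed once TreeTagger has been installed and configured.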
Run the Full Pipeline
By default, the pipeline runs in a light mode: the memory- and time-intensive components for parsing and semantic role labeling are disabled.
If you would like to use them, enable them in the default.properties file, or create a new .properties file and pass its path via the -config parameter.
Program Parameters
Run java -jar ddw-0.4.7.jar -help to get an overview of the possible command line arguments:
    -config <path>      Config file
    -help               print this message
    -input <path>       Input path
    -language <lang>    Language code for input file (default: en)
    -output <path>      Output path
    -reader <reader>    Either text (default) or xml
    -resume             Already processed files will be skipped
The pipeline supports a resume function. If you add the -resume argument when executing the pipeline, all files that were previously processed and have a corresponding .csv file in the output folder will be skipped.
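To estimate in advance which files a resumed run would still process, the rule above can be sketched as follows. This is an illustration, not the pipeline's actual code: the function name is hypothetical, and the output file name (input file name plus .csv) is an assumption that you should check against your own output folder.

```python
from pathlib import Path


def files_to_process(input_files, output_dir, resume=False):
    """Return the input files that would still be processed.

    Assumption: the output for "a.txt" is written as "a.txt.csv" in output_dir.
    """
    output_dir = Path(output_dir)
    pending = []
    for input_file in map(Path, input_files):
        expected_output = output_dir / (input_file.name + ".csv")
        if resume and expected_output.exists():
            continue  # already processed in an earlier run, so skip it
        pending.append(input_file)
    return pending
```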
File Reader
You can process either a single file or all files inside a directory. Patterns can be used to select specific files that should be processed.
Text Reader & XML Reader
The DARIAH-DKPro-Wrapper implements two base readers: a text reader and an XML reader. You can specify the reader that should be used with the -reader parameter. By default, the text reader is used. To use the XML reader, run the pipeline in the following way:
java -Xmx4g -jar ddw-0.4.7.jar -reader xml -input file.xml -output folder
The XML reader skips XML tags and processes only the text inside them. The XPath to each tag is preserved and stored in the SectionId column of the output format.
Reading Directories
You can also pass a directory instead of a file to the -input argument. If you run the pipeline in the following way:
java -Xmx4g -jar ddw-0.4.7.jar -input folder/With/Files/ -output folder
the pipeline will process all files with a .txt extension when the text reader is used. With the XML reader, it will process all files with a .xml extension.
You can also specify patterns to read in only certain files or files with a certain extension. For example, to read in only .tei files with the XML reader, you must start the pipeline in the following way:
java -Xmx4g -jar ddw-0.4.7.jar -reader xml -input "folder/With/Files/*.tei" -output folder
Note: If you use patterns (i.e. paths containing an *), you must enclose them in quotes to prevent shell globbing.
To read all files in all subfolders, you can use a pattern like this:
java -Xmx4g -jar ddw-0.4.7.jar -input "folder/With/Subfolders/**/*.txt" -output folder
This will read in all .txt files in all subfolders. Note that the subfolder path will not be maintained in the output folder.
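Because the subfolder path is not kept, two input files with the same name in different subfolders may end up competing for the same output name. The sketch below previews what a recursive pattern selects and flags such name clashes. It uses Python's own glob rules, which are assumed (not confirmed) to behave like the pipeline's pattern matching; both function names are hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def preview_matches(base_dir, pattern):
    """List the files below base_dir that match a glob pattern such as '**/*.txt'."""
    return sorted(p for p in Path(base_dir).glob(pattern) if p.is_file())


def colliding_names(paths):
    """Group paths by bare file name and keep only names that occur more than once."""
    by_name = defaultdict(list)
    for path in paths:
        by_name[path.name].append(path)
    return {name: group for name, group in by_name.items() if len(group) > 1}
```

For example, colliding_names(preview_matches("folder/With/Subfolders", "**/*.txt")) returns an empty dictionary when every file name is unique.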
Write Your Own Config Files
The pipeline can be configured via properties files that are stored in the configs folder. In this folder you will find default.properties, the most basic configuration file. For the different supported languages, you can find further properties files, for example default_de.properties for German, default_en.properties for English and so on.
If you would like to write your own config file, just create your own .properties file. You can run the pipeline with your .properties file by setting the -config argument:
java -Xmx4g -jar ddw-0.4.7.jar -config /path/to/my/config/myconfigfile.properties -input file.txt -output folder
In case you store your myconfigfile.properties in the configs folder, you can run the pipeline via:
java -Xmx4g -jar ddw-0.4.7.jar -config myconfigfile.properties -input file.txt -output folder
You can split your config file into different parts and pass them all to the pipeline by separating the paths with commas or semicolons. The pipeline examines all passed config files and derives the final configuration from all of them. The config file passed last has the highest priority, i.e. it can overwrite the values from all previous config files:
java -Xmx4g -jar ddw-0.4.7.jar -config myfile1.properties,myconfig2.properties,myfile3.properties -input file.txt -output folder
Note: The system always uses the default.properties and default_[langcode].properties as basic configuration files. All further config files are added on top of these files.
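The layering rule (later files win) can be illustrated with a small sketch. It is not the pipeline's implementation: the parser below is deliberately simplified (no line continuations, no escapes), the function names are hypothetical, and the values TaggerA and TaggerB are placeholders.

```python
def parse_properties(text):
    """Parse simple 'key = value' lines; blank lines and '#' comments are ignored.

    Simplification: line continuations with a trailing backslash are not handled.
    """
    properties = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        properties[key.strip()] = value.strip()
    return properties


def merge_configs(*layers):
    """Combine config layers in order; a later layer overwrites earlier values."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged


# Placeholder contents standing in for default.properties and a custom file.
defaults = parse_properties("posTagger = TaggerA\nuseLemmatizer = true")
custom = parse_properties("# my overrides\nposTagger = TaggerB")
final = merge_configs(defaults, custom)
```

Here final keeps useLemmatizer from the first layer and takes posTagger from the second, mirroring how a file passed later on the command line overrides earlier ones.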
If you want to use the full version and also change the POS tagger, you can run the pipeline in the following way:
java -Xmx4g -jar ddw-0.4.7.jar -config myFullVersion.properties,myPOSTagger.properties -input file.txt -output folder
In myPOSTagger.properties you just add the configuration for the different POS tagger.
Note: The properties files must use the ISO-8859-1 encoding. If you want to include characters outside ISO-8859-1, you must encode them using \u[HEXCode].
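Writing the \u escapes by hand is error-prone, so a small helper can produce them. The sketch below is an illustration with a hypothetical function name. It escapes every non-ASCII character, which is stricter than necessary but safe regardless of how your editor saves the file, and it emits one escape per UTF-16 code unit, the form Java properties files expect.

```python
def escape_for_properties(text: str) -> str:
    """Replace every non-ASCII character with Java-style \\uXXXX escapes."""
    pieces = []
    for char in text:
        if ord(char) < 128:
            pieces.append(char)
            continue
        # Characters beyond U+FFFF become two UTF-16 code units (a surrogate pair),
        # each written as its own \uXXXX escape.
        units = char.encode("utf-16-be")
        for i in range(0, len(units), 2):
            code_unit = int.from_bytes(units[i:i + 2], "big")
            pieces.append("\\u{:04X}".format(code_unit))
    return "".join(pieces)
```

For instance, escape_for_properties("Müller") yields M\u00FCller, which can be pasted into a .properties file as is.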
Understanding the Argument Parameter
Most components can be given arguments, for example to specify the model that should be used. Arguments are passed to the pipeline in a 3-tuple format (name, type, value). In the default.properties file you can find the following line:
constituencyParserArguments = writeDependency,boolean,false
Here we specify the argument writeDependency with the boolean value false. The supported types are boolean, integer, and string.
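To make the 3-tuple format concrete, here is a sketch of how such a value could be read into typed arguments. It is not the pipeline's own parser: the function name is hypothetical, exampleCount is a made-up argument used only for illustration, and values that themselves contain commas are not handled.

```python
def parse_arguments(spec: str) -> dict:
    """Turn 'name,type,value[,name,type,value...]' into a dict of typed values."""
    converters = {
        "boolean": lambda raw: raw.lower() == "true",
        "integer": int,
        "string": str,
    }
    parts = [part.strip() for part in spec.split(",")]
    if len(parts) % 3 != 0:
        raise ValueError("arguments must come in groups of name, type, value")
    arguments = {}
    for i in range(0, len(parts), 3):
        name, type_name, raw_value = parts[i:i + 3]
        if type_name not in converters:
            raise ValueError(f"unsupported type: {type_name}")
        arguments[name] = converters[type_name](raw_value)
    return arguments


# The line from default.properties quoted above:
parsed = parse_arguments("writeDependency,boolean,false")
# Several arguments are written as consecutive triples (exampleCount is hypothetical):
several = parse_arguments("writeDependency,boolean,false,exampleCount,integer,3")
```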
Using TreeTagger
Due to copyright issues, TreeTagger cannot be accessed directly from the DKPro repository. Instead, you first have to download and install TreeTagger to be able to use it with DKPro.
Installation
- Go to the TreeTagger website
- From the download section, download the correct tagger package, i.e. PC-Linux, OS X or Windows
- Extract the .tar.gz or .zip archive, respectively
- Create a new directory tree-tagger containing two folders bin and lib on your hard drive, e.g. C:/tree-tagger/bin and C:/tree-tagger/lib
- Copy the tree-tagger/bin/tree-tagger file from the previously downloaded archive into the bin folder of your newly created tree-tagger directory
- From the parameter file section, download the correct model. For the example below, download the Latin parameter file (latin-par-linux-3.2-utf8.bin.gz)
- Unzip the file (e.g. gunzip latin-par-linux-3.2-utf8.bin.gz, or alternatively use a program like 7zip or WinRar)
- Copy the extracted file latin.par into the lib folder of your tree-tagger directory
Configuration
After downloading the correct executable and the correct model, we must configure the pipeline in order to be able to use TreeTagger. You can find an example configuration, treetagger-example.properties, in the configs folder:
posTagger = de.tudarmstadt.ukp.dkpro.core.treetagger.TreeTaggerPosTagger
posTaggerArguments = executablePath,string,C:/tree-tagger/bin/tree-tagger.exe,\
    modelLocation,string,C:/tree-tagger/lib/latin.par,\
    modelEncoding,string,utf-8
# Treetagger adds lemmas, no need for an additional lemmatizer
useLemmatizer = false
Change the paths for the parameters executablePath and modelLocation to the correct paths on your machine. Beware: these values are case sensitive even on Windows. When unsure, copy and paste the paths from Explorer.
You can then use TreeTagger in your pipeline using the -config argument:
java -Xmx4g -jar ddw-0.4.7.jar -config treetagger-example.properties -language la -input file.txt -output folder
Check the pipeline output to confirm that TreeTagger is used. The output of your pipeline should look something like this:
POS-Tagger: true
POS-Tagger: class de.tudarmstadt.ukp.dkpro.core.treetagger.TreeTaggerPosTagger
POS-Tagger: executablePath, C:/tree-tagger/bin/tree-tagger.exe, modelLocation, C:/tree-tagger/lib/latin.par, modelEncoding, utf-8
Output files
The output files are UTF-8 encoded, tab-separated files that start with a header line and do not use quoting. Each subsequent line represents one token. The fields do not contain whitespace, but they may contain " or ' characters if these occur in the text, so if you have trouble importing the output files, check your CSV reader’s settings. See the Tutorial for a usage example.
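As an illustration of suitable reader settings, the sketch below loads an output file with Python's standard csv module, using a tab delimiter and quoting switched off so that " and ' in tokens are kept as literal text. The function name is hypothetical, and apart from SectionId any column name you use should be taken from the header line of your own output files.

```python
import csv


def read_tokens(handle):
    """Read pipeline output rows as dictionaries keyed by the header line.

    quoting=csv.QUOTE_NONE makes the reader treat quote characters as ordinary
    text, matching the unquoted format described above.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


# Usage sketch (the path is a placeholder for one of your output files):
# with open("folder/file.txt.csv", encoding="utf-8", newline="") as handle:
#     rows = read_tokens(handle)
#     section_ids = {row["SectionId"] for row in rows}
```

With the default quoting settings, a token that starts with " would be read as the beginning of a quoted field and could swallow the following columns or lines, which is the typical import problem mentioned above.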
Logging and reporting errors
The pipeline displays only terse status and error information on the screen in order not to overload users with unnecessary detail. Detailed information is written to a log file, ddw.log. When you report bugs, please always provide that log file. The log file contains the status information that is written to the screen, as well as output that other components would otherwise write to the screen, together with source information and timestamps. If the log file already exists, new entries are appended to it.
Experts might want to fine-tune what is displayed and what is logged. You can do so by providing your own log4j2 configuration file: download and modify our default log4j2 configuration file and run the pipeline using:
java -Dlog4j.configurationFile=your-log4j2.xml -Xmx4g -jar ddw-0.4.7.jar [options]
See the log4j2 manual for more information.