Corpora Processing Software

Description of processing

Processing is divided into several steps. For each step, scripts were created that simplify the work through command-line arguments and allow the processing to run either on a single machine or in parallel on multiple machines.

All program and script sources are in the repository corpora_processing_sw; any missing libraries are in /mnt/minerva1/nlp/projects/corpproc (accessible from the KNOT servers; those without access can request it from us by email).

1. Distribution of programs or data for processing

The script reallocates data (files) between servers. Data are divided according to size so that all servers hold a similar amount of data. The parameter -a switches off reallocation and copies each file to all servers, which is particularly suitable for distributing the processing programs.

 ./processing_steps/1/distribute.py

Use of the script:

 ./distribute.py [-i INPUT_DIRECTORY] -o OUTPUT_DIRECTORY -s SERVER_LIST [-a] [-e ERRORS_LOG_FILE]
 -i   --input     input directory/file for distribution (if not set, file names separated by '\n' are expected on stdin)
 -o   --output    output directory on the target server; if the directory does not exist, the script tries to create it
 -a   --all       each file will be copied to all servers (suitable for scripts/programs)
 -s   --servers   file containing the list of servers, one hostname per line; if a line contains a tab, it is treated as a separator
                  and the hostname is the text before it (this keeps the file compatible with scripts that also specify the number of
                  threads for a particular machine, using the format HOSTNAME \t THREADS on each line; a sample file is shown below)
 -e   --errors    if set, errors are logged to this file with the current date and time
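
A servers file might look like this (the hostnames and thread counts are only illustrative):

 knot01.fit.vutbr.cz	8
 knot02.fit.vutbr.cz	8
 knot03.fit.vutbr.cz	4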

Examples:

 ./distribute.py -i ~/data_to_distribution/ -o /mnt/data/project/data/ -s ~/servers
 # All files from the directory ~/data_to_distribution/ are distributed among the servers listed in the file ~/servers.
 # The output directory is /mnt/data/project/data/. If the directory does not exist on some machine, the script tries to create it.
 ./distribute.py -i ~/NLP-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar -o /mnt/data/commoncrawl/software/ -s ~/servers -a
 # Copies the NLP-slave program to the servers listed in the file ~/servers.
 # The output directory is /mnt/data/commoncrawl/software/. If the directory does not exist on some machine, the script tries to create it.

1. a) Downloading the dump of Wikipedia

Plain text is extracted from Wikipedia with the WikiExtractor program (https://github.com/bwbaugh/wikipedia-extractor). Launching:

cd /mnt/minerva1/nlp/corpora_datasets/monolingual/english/wikipedia/
./download_wikipedia_and_extract_html.sh 20151002

The script expects a file hosts.txt in the working directory, containing the list of servers (one per line) on which it is supposed to run.

/mnt/minerva1/nlp/corpora_datasets/monolingual/english/wikipedia/tools/WikiExtractor.py

The output is a collection of files of approximately 100 MB each, distributed across the servers, stored in

/mnt/data/wikipedia/enwiki-.../html_from_xml/enwiki=...

1. b) Downloading CommonCrawl

To download the WARC files you need to know the exact CommonCrawl crawl specification, e.g. "2015-18":

./processing_steps/1b/download_commoncrawl/dl_warc.sh
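
For example, for the crawl specification mentioned above:

 ./processing_steps/1b/download_commoncrawl/dl_warc.sh 2015-18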

The downloaded files are stored in

 /mnt/data/commoncrawl/CC-Commoncrawl_specification/warc/

Support files are in

 /mnt/minerva1/nlp-2/download_commoncrawl/CC-Commoncrawl_specification/download/

Then it is possible to calculate URI statistics:

 ./processing_steps/1b/uri_stats.sh

The result is saved in

 /mnt/minerva1/nlp-2/download_commoncrawl/CC-Commoncrawl_specification/uri/

On individual machines the data are in

 /mnt/data/commoncrawl/CC-Commoncrawl_specification/uri/

1. c) Downloading web pages from RSS feeds

To get URLs from the given RSS sources, use:

 ./processing_steps/1c/collect.py

Use of the script:

 ./collect.py [-i INPUT_FILE] [-o OUTPUT_FILE [-a]] [-d DIRECTORY|-] [-e ERRORS_LOG_FILE]
 
 -i   --input     input file containing RSS URLs separated by '\n' (if not set, input is expected on stdin)
 -o   --output    output file for saving the parsed URLs (if not set, they are printed to stdout)
 -a   --append    the output file is opened in append mode
 -d   --dedup     deduplication of the obtained links (by matched URL); optionally give a directory with files listing already collected URLs, which are then included in the deduplication; with -a the output file is also included
 -e   --errors    if set, errors are logged to this file with the current date and time

Examples:

 ./collect.py -i rss -o articles -a
 # Appends URLs from the RSS sources in the file rss to the file articles

To download web pages according to a given list of URLs and save them into a WARC archive, use:

 ./processing_steps/1c/download.py

Use of the script:

 ./download.py [-i INPUT_FILE] -o OUTPUT_FILE [-e ERRORS_LOG_FILE]
 
 -i   --input     input file containing URLs (if not set, input is expected on stdin)
 -o   --output    output file for the WARC archive
 -r   --requsts   limits the number of requests per minute to one domain (default is 10)
 -e   --errors    if set, errors are logged to this file with the current date and time

The script downloads pages evenly across domains rather than in the order of the input file, to avoid what could look like an attack. It is also possible to limit the number of requests per domain per minute. When the limit is exhausted on all domains, downloading pauses until the limit is replenished; every 6 seconds (1/10 of a minute) 1/10 of the total limit is restored.

If any error occurs during downloading (i.e. any response code other than 200), the whole domain is excluded from further downloading.
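
A minimal Python sketch of the per-domain limiting described above (illustrative only, not the actual implementation of download.py; the class and method names are made up):

 import time
 from collections import defaultdict

 class DomainRateLimiter:
     """At most `limit` requests per domain per minute, replenished in tenths every 6 seconds."""

     def __init__(self, limit=10):
         self.limit = limit
         self.tokens = defaultdict(lambda: float(limit))  # remaining requests per domain
         self.last = time.time()

     def _replenish(self):
         # every 6 seconds (1/10 of a minute) give back 1/10 of the total limit, up to the maximum
         steps = int((time.time() - self.last) / 6)
         if steps:
             self.last += steps * 6
             for domain in self.tokens:
                 self.tokens[domain] = min(self.limit, self.tokens[domain] + steps * self.limit / 10.0)

     def allow(self, domain):
         # returns True if a request to this domain may be sent now
         self._replenish()
         if self.tokens[domain] >= 1:
             self.tokens[domain] -= 1
             return True
         return False  # the caller picks another domain or waits until the limit is replenished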

Example:

 ./download.py -i articles -o today.warc.gz
 # downloads the pages whose URLs are listed in the file articles into the archive today.warc.gz

2. Verticalization

Vertical file format description

The input of the verticalization program is a warc.gz file. The program unpacks the individual pages (documents) from it, strips the HTML, filters out non-English articles and performs tokenization. It can also process a single web page in .html and Wikipedia in preprocessed HTML. The output is saved with the extension .vert into the target directory.
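
For illustration, a vertical file has one token per line, with documents, paragraphs and sentences delimited by tags; the later steps (tagging, parsing, SEC) add further tab-separated columns (see the Data format section). The snippet below is only a generic sketch, not the exact output of this verticalizer:

 <doc id="..." uri="http://example.org/article">
 <p>
 <s>
 This
 is
 a
 sentence
 .
 </s>
 </p>
 </doc>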

New verticalizer:

 ./processing_steps/2/vertikalizator/main.py

Compile with command:

 make

In case of error try to run:

 make configure

Use of the script:

 ./main.py [-i INPUT_FILE] [-o OUTPUT_FILE] [-n] [-d] [-w] [-s STOPWORDS_LIST] [-l LOG_FILE]
 
 -i   --input        input file for verticalization (if not set, a WARC is expected on stdin)
 -o   --output       output file (if not set, stdout is used)
 -n   --nolangdetect turns off language detection (speeds processing up with minimal difference in the output)
 -d   --debug        prints debugging information
 -w   --wiki         switches to the wiki input format
 -s   --stopwords    file with the list of stop words
 -l   --log          file for storing the log (debugging information); usable only together with -d.
                     Use the value STDOUT for output to stdout and STDERR for output to stderr.
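
An example invocation might look like this (the paths are illustrative):

 ./main.py -i /mnt/data/commoncrawl/CC-2015-18/warc/file-warc.gz -o /mnt/data/commoncrawl/CC-2015-18/vert/file.vert -s ./processing_steps/2/vertikalizator/stoplists/English.txt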

DO NOT USE: old verticalizer

For verticalization, the NLP-slave program (dependent on a local langdetect) should not be used; its source code is in the repository:

 ./processing_steps/2/NLP-slave

This program expects a warc.gz file as input (first parameter), from which it gradually extracts individual records, strips the HTML, filters out non-English articles and performs tokenization. It can also process a single web page in .html and Wikipedia in preprocessed plain text or HTML. The result is saved with the extension .vert into the destination directory.

Compile with command:

 mvn clean compile assembly:single

Usage of old verticalizer:

 java -jar package [opts] input_file.html output_dir file_URI [langdetect_profiles]
 java -jar package [opts] input_file.txt output_dir [langdetect_profiles]
 java -jar package [opts] input_file-warc.gz output_dir [langdetect_profiles]
 java -jar package [opts] input_file output_dir [langdetect_profiles]

Script for launching in parallel on multiple machines:

 ./processing_steps/2/verticalize.py

Usage:

 ./verticalize.py -i INPUT_DIRECTORY -o OUTPUT_DIRECTORY [-s SERVERS_FILE] [-b BINARY] [-t THREADS] [-l] [-d] [-w STOPWORDS_LIST] [-e ERRORS_LOG]
 
 -i   --input     input directory with files to be verticalized
 -o   --output    output directory; if it does not exist, the script attempts to create it
 -s   --servers   file with the list of servers in the format HOSTNAME '\t' THREADS '\n' (one line per machine)
 -e   --errors    if set, errors are logged to this file with the current date and time
 -b   --binary    argument specifying the path to the verticalizer (default is ./processing_steps/2/vertikalizator/main.py)
 -t   --threads   sets the number of threads (default is 6; thread counts given in the server list file take priority)
 -l   --no-lang   turns off language detection
 -d   --debug     the verticalizer prints debugging information
 -w   --stopwords file with the list of stop words (default is ./processing_steps/2/vertikalizator/stoplists/English.txt)

Example:

 ./verticalize.py -i /mnt/data/commoncrawl/CC-2015-18/warc/ -o /mnt/data/commoncrawl/CC-2015-18/vert/ -s ~/servers -b /mnt/data/commoncrawl/software/NLP-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar
 # Data from the directory /mnt/data/commoncrawl/CC-2015-18/warc/ are verticalized into the directory /mnt/data/commoncrawl/CC-2015-18/vert/
 # Verticalization runs on all servers listed in the file ~/servers
 # It uses the program /mnt/data/commoncrawl/software/NLP-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar, which has to be present on all machines

3. Deduplication

Deduplication uses the programs dedup and server. Both can be compiled via the Makefile in the directory:

 processing_steps/3/dedup/
 in repository corpproc_dedup

Parameters for launching these programs are:

./server [-i=INPUT_FILE] [-o=OUTPUT_FILE [-d]] -w=WORKERS_COUNT [-s=STRUCT_SIZE] [-p=PORT] [-j=JOURNAL_FILE]

-i   --input     input file
-o   --output    output file
-w   --workers   number of workers
-p   --port      port of the server (default 1234)
-s   --size      changes the size of the structure for storing hashes (default size is 300,000,000)
-d   --debug     along with the output file, a file with debugging dumps is generated
-j   --journal   recovers hashes from the journal file (use together with -r on the worker)
-k   --keep      keeps the journal file after the hashes have been successfully saved to the output file

The server runs until it is killed. On SIGHUP or SIGTERM it saves the hashes into the output file (if a path was given).
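
For example, to stop a running server so that it saves its hashes (assuming you know its PID):

 kill -s SIGTERM PID_OF_THE_SERVER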


./dedup -i=INPUT_DIR -o=OUTPUT_DIR -s=SERVERS_STRING [-p=PORT] [-t=THREADS] [-n] [-d] [-wl] [-uf=FILTER_FILE]

-i   --input       input folder with files for deduplication
-o   --output      output folder; if it does not exist, it is created if possible
-s   --servers     list of servers separated by spaces
-p   --port        port of the server (default 1234)
-t   --threads     sets the number of threads (default 6)
-n   --near        uses the "nearDedup" algorithm
-d   --debug       for every output .dedup file, a .dedup.debug file containing debugging logs is generated
-wl  --wikilinks   deduplication of the WikiLinks format
-uf  --usefilter   input file with strings (one per line) used to filter out useless documents by name/URL
-dr  --dropped     for each output .dedup file, a .dedup.dropped file containing the removed duplicates is generated
-f   --feedback    records in the .dedup.dropped file include a reference to the record responsible for their elimination (more below)
-dd  --droppeddoc  for each output *.dedup file, a .dedup.dd file is created containing the list of URL addresses
                         of completely eliminated documents
-r   --restore     continues deduplication after a system crash: already processed files are skipped and unfinished ones are redone
                   (use together with -j on the server)
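
A direct invocation without the wrapper scripts might look like this (the paths and hostnames are only illustrative; normally the server.py and deduplicate.py scripts described below are used instead):

 # on every machine holding a part of the hash space:
 ./server -o=/mnt/data/commoncrawl/CC-2015-18/hashes/hashes.bin -w=2 -p=1234
 # on every worker machine:
 ./dedup -i=/mnt/data/commoncrawl/CC-2015-18/vert/ -o=/mnt/data/commoncrawl/CC-2015-18/dedup/ -s="knot01.fit.vutbr.cz knot02.fit.vutbr.cz" -p=1234 -t=6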

For easier launching there are the scripts server.py and deduplicate.py (described below), which allow running deduplication in parallel on multiple machines. The programs have to be distributed to all machines beforehand and must be in the same location on each of them, e.g. /mnt/data/bin/dedup

Launching servers for deduplication: with the argument start, the script first launches screens and then starts the servers inside them. With the argument stop, the script closes the screens. If neither start, stop nor restart is given, the script checks whether the screens are running and then whether the servers are.

 ./processing_steps/3/server.py

Use of the script:

./server.py [start|stop|restart] [-i INPUT_FILE] [-o OUTPUT_FILE [-d]] [-a] -s SERVERS_FILE [-t THREADS] [-p PORT]
[-e ERRORS_LOG_FILE] [-b BINARY] [-d] [-j] [-k] [-r SPARSEHASH_SIZE] 

-i   --input       input file 
-o   --output      output file
-s   --servers     file with the list of servers, one hostname per line; if a line contains a tab, it is treated as a separator and
                         the hostname is the text before it (this keeps the file compatible with scripts that also specify the number
                         of threads per machine using the format HOSTNAME \t THREADS on each line)
-t   --threads     number of worker threads (default is 384)
-p   --port        port of the server (default 1234)
-r   --resize      changes the size of the structure for storing hashes (default size is 300000000)
-e   --errors      if set, errors are logged to this file with the current date and time
-b   --binary      argument specifying the path to the deduplication server (default is /mnt/data/commoncrawl/corpproc/bin/server)
-d   --debug       along with the output file, a file with debugging dumps is generated
-j   --journal     recovers hashes from the journal file (use together with -r on the client)
-k   --keep        keeps the journal file after the hashes have been successfully saved to the output file
-a   --append      output file is the same as input file

Note on -j, --journal: to recover hashes from the journal it is necessary to give as input file the path that was used as the output file before the server crashed. The script checks whether the file "input_file".backup exists. If the input file does not exist on some servers, it will be created.


For example:

./server.py start -s ~/servers
# Launches screens and starts the servers in them on the machines specified by the file ~/servers
# The servers then wait for workers to connect
./server.py -s ~/servers
# Checks whether screens and servers are running on the machines specified by the file ~/servers
./server.py stop -s ~/servers
# Stops the servers and closes the screens on the machines specified by the file ~/servers

Launching workers for deduplication: the servers must be launched beforehand, with corresponding -s and -p parameters.

 ./processing_steps/3/deduplicate.py

Use of the script:

./deduplicate.py -i INPUT_DIRECTORY -o OUTPUT_DIRECTORY -w WORKERS_FILE -s SERVERS_FILE [-p PORT]
[-e ERRORS_LOG_FILE] [-b BINARY] [-n] [-d] [-wl] [-uf FILTER_FILE]

-i   --input     input folder with files for deduplication
-o   --output    output folder; if it does not exist, the script attempts to create it
-s   --servers   file with the list of servers, one hostname per line; if a line contains a tab, it is treated as a separator and the
                       hostname is the text before it (this keeps the file compatible with scripts that also specify the number of
                       threads per machine using the format HOSTNAME \t THREADS on each line)
-w   --workers   file with the list of workers in the format HOSTNAME '\t' THREADS '\n' (one line per machine)
                   (beware: replacing the tabs with spaces will not work)
-p   --port      port of the server (default 1234)
-e   --errors    if set, errors are logged to this file with the current date and time
-b   --binary    argument specifying the path to the deduplication program (default is /mnt/data/commoncrawl/corpproc/bin/dedup)
-t   --threads   number of threads (default is 6; thread counts given in the server list file take priority)
-n   --near      uses the "nearDedup" algorithm
-d   --debug     along with the output .dedup file, a .dedup.dropped file containing the removed duplicates and a .dedup.debug file containing debugging dumps are generated
-wl  --wikilinks deduplication of the WikiLinks format
-uf  --usefilter input file with strings (one per line) used to filter out useless documents by name/URL
-dr  --dropped   for each output .dedup file, a .dedup.dropped file containing the removed duplicates is generated
-f   --feedback  records in the .dedup.dropped file include a reference to the record responsible for their elimination (see below)
-dd  --droppeddoc a file "droppedDocs.dd" containing the list of completely excluded documents is created in the output directory
-r   --restore   continues deduplication after a system crash: already processed files are skipped and unfinished ones are redone
                 (use together with -j on the server)

Deduplication of the WikiLinks format computes a hash over the concatenation of columns 2, 3, 5 and 6; all of these columns have to match for a row to be evaluated as a duplicate. In nearDedup, N-gram hashes are computed over the concatenation of columns 5, 3 and 6 (in this order), and in addition a hash of column 2 is computed; a row is considered a duplicate only if both of these lookups find a match.
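
As a sketch only (assuming tab-separated columns and 1-based column numbering; this is not the actual dedup source code), the plain WikiLinks hashing could look like:

 import hashlib

 def wikilinks_key(line):
     cols = line.rstrip('\n').split('\t')
     # hash of the concatenation of columns 2, 3, 5 and 6 (1-based)
     key = ''.join(cols[i - 1] for i in (2, 3, 5, 6))
     return hashlib.sha1(key.encode('utf-8')).hexdigest()

 def is_duplicate(line, seen_keys):
     # a row is a duplicate when a row with the same key has already been seen
     key = wikilinks_key(line)
     if key in seen_keys:
         return True
     seen_keys.add(key)
     return False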

For example:

 ./deduplicate.py -i /mnt/data/commoncrawl/CC-2015-18/vert/ -o /mnt/data/commoncrawl/CC-2015-18/dedup/ -w ~/workers -s ~/servers
 # Data from the directory /mnt/data/commoncrawl/CC-2015-18/vert/ are deduplicated into the directory /mnt/data/commoncrawl/CC-2015-18/dedup/
 # Deduplication runs on the machines specified in the file ~/workers
 # It is expected that the servers are already running on the machines specified by the file ~/servers

3.1. Deduplication for Salomon

This is the only step that differs on Salomon, because it involves a distributed computation. In the standard setup the communication goes through sockets over standard TCP/IP; on Salomon this is not possible (InfiniBand), so the MPI library is used instead. The source code is in

 /mnt/minerva1/nlp/projects/corpproc/corpora_processing_sw/processing_steps/salomon/mpidedup

It is highly recommended to use the module OpenMPI/1.8.8-GNU-4.9.3-2.25, due to compatibility issues with some other versions.

Compilation

 module load OpenMPI/1.8.8-GNU-4.9.3-2.25
 make

Parameters:

 -h --hash - relative number of servers holding the individual subspaces of the total hashing space
 -w --work - relative number of worker servers which do the actual deduplication
 -i --input - input directory with files to be deduplicated
 -o --output - output directory
 -l --load - optional directory from which existing hashes are loaded
    (for incremental deduplication; it must be launched with the same layout
    (infrastructure, number of workers and of hash-space servers) as when the hashes were saved)
 -s --store - optional directory to save the hashes to
 -r --resize - optional parameter to change the size of the structure holding the hashes
 -d --debug - turns on debugging mode (generates logs and files with the deleted strings)
 -n --near - switches to the nearDedup algorithm
 -j --journal - takes processed files and journals into consideration (unsaved hashes and unfinished files);
              attempts to recover after a crash and continue the deduplication

Launching

bash start.sh dedup 4 qexp

Launch options are set in v5/start.sh; the restore mode is configured in v5/dedup.sh.

Launching on the KNOT servers:

mpiexec dedup -h H -w W -i ~/vert/ -o ~/dedup/ -s ~/hash/

How does the restore mode work?

4. Tagging

Tagging is performed by the TT-slave program (dependent on /opt/TreeTagger), which can be found in:

 ./processing_steps/4/TT-slave

Compile with command:

 mvn clean compile assembly:single

Usage:

 java -jar package [opts] input_file output_dir [treetagger.home]

Script for parallel execution on multiple servers:

 ./processing_steps/4/tag.py

Usage:

 ./tag.py -i INPUT_DIRECTORY -o OUTPUT_DIRECTORY [-s SERVERS_FILE] [-e ERRORS_LOG_FILE] [-b BINARY] [-t THREADS] [-d] [-u]
 
 -i   --input     input directory with files to be tagged
 -o   --output    output folder; if it does not exist, the script attempts to create it
 -s   --servers   file containing the list of servers, in the format HOSTNAME '\t' THREADS '\n' (one line per machine)
 -e   --errors    if set, errors are logged to this file with the current date and time
 -b   --binary    argument specifying the path to the tagging .jar program (default is ./processing_steps/4/TT-slave/target/TT-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar)
 -t   --threads   sets the number of threads (default is 6; thread counts given in the server list file take priority)
 -d   --debug     prints debugging output
 -u   --uri       turns on removal of URIs from links

Example:

 ./tag.py -i /mnt/data/commoncrawl/CC-2015-18/dedup/ -o /mnt/data/commoncrawl/CC-2015-18/dedup/tagged/ -s ~/servers
 # Tags files from the directory /mnt/data/commoncrawl/CC-2015-18/dedup/ and saves them to
 # the directory /mnt/data/commoncrawl/CC-2015-18/dedup/tagged/ on the machines determined by the file ~/servers

5. Parsing

Parsing is done by a modified MDParser, which can be found in:

 ./processing_steps/5/MDP-package/MDP-1.0

Compile by command:

 ant make-mdp

Usage:

 java -jar package [opts] input_file output_dir [path_to_props]

Important files in case something needs to be changed:

 ./processing_steps/5/MDP-package/MDP-1.0/src/de/dfki/lt/mdparser/test/MDParser.java
 ./processing_steps/5/MDP-package/MDP-1.0/src/de/dfki/lt/mdparser/outputformat/ConllOutput.java

Script for parallel execution on multiple servers:

 ./processing_steps/5/parse.py

Usage:

 ./parse.py -i INPUT_DIRECTORY -o OUTPUT_DIRECTORY [-s SERVERS_FILE] [-e ERRORS_LOG_FILE] [-b BINARY] [-t THREADS] [-u] [-p XML_FILE]
 
 -i  --input    input directory with files to be parsed
 -o  --output   output directory; if it does not exist, the script will attempt to create it
 -s  --servers  file with the list of servers, in the format HOSTNAME '\t' THREADS '\n' (one line per machine)
 -e  --errors   if set, errors are logged to this file with the current date and time
 -b  --binary   argument specifying the path to the parsing .jar program (default is ./processing_steps/5/MDP-package/MDP-1.0/build/jar/mdp.jar)
 -t  --threads  sets the number of threads (default is 6; thread counts given in the server list file take priority)
 -u  --uri      turns on removal of URIs from links
 -p  --props    path to the .xml file with program parameters (default is ./processing_steps/5/MDP-package/MDP-1.0/resources/props/propsKNOT.xml)

Example:

 ./parse.py -i /mnt/data/commoncrawl/CC-2015-18/tagged/ -o /mnt/data/commoncrawl/CC-2015-18/parsed/ -s ~/servers
 # Parses files from the directory /mnt/data/commoncrawl/CC-2015-18/tagged/ into
 # the directory /mnt/data/commoncrawl/CC-2015-18/parsed/ on the servers determined by the file ~/servers
 ./parse.py -i /mnt/data/commoncrawl/CC-2015-18/tagged/ -o /mnt/data/commoncrawl/CC-2015-18/parsed/ -s ~/servers \
  -p ./processing_steps/5/MDP-package/MDP-1.0/resources/props/propsKNOT.xml
 # Parses files from the directory /mnt/data/commoncrawl/CC-2015-18/tagged/ into
 # the directory /mnt/data/commoncrawl/CC-2015-18/parsed/ on the machines determined by the file ~/servers
 # and also passes the config file propsKNOT.xml to MDParser (it must be available on all machines it runs on)

6. SEC (NER)

SEC (see SEC) is used for named entity recognition. The client sec.py can be run in parallel on multiple servers using:

 ./processing_steps/6/ner.py

Usage:

 ./ner.py -i INPUT_DIRECTORY -o OUTPUT_DIRECTORY [-s SERVERS_FILE] [-e ERRORS_LOG_FILE] [-b BINARY]

 -i   --input     input directory 
 -o   --output    output directory
 -s   --servers   file with the list of servers, one hostname per line; if a line contains a tab, it is treated as a separator and the hostname is the text before it (this keeps the file compatible with scripts that also specify the number of threads per machine using the format HOSTNAME \t THREADS on each line)
 -e   --errors    if set, errors are logged to this file with the current date and time
 -b   --binary    path to the SEC client (default is /var/secapi/SEC_API/sec.py)

Example:

 ./ner.py -i /mnt/data/commoncrawl/CC-2015-18/parsed/ -o /mnt/data/commoncrawl/CC-2015-18/secresult/ -s ~/servers
 # Processes files from the directory /mnt/data/commoncrawl/CC-2015-18/parsed/ into
 # the directory /mnt/data/commoncrawl/CC-2015-18/secresult/ on the machines selected by the file ~/servers

7. MG4J indexation

Introduction to MG4J: http://www.dis.uniroma1.it/~fazzone/mg4j-intro.pdf

Updated documentation:

 1- JAVA_HOME=/usr/lib/jvm/java-8-oracle/ mvn package
 2- JAVA_HOME=/usr/lib/jvm/java-8-oracle/ /usr/lib/jvm/java-8-oracle/bin/java -jar corpproc-1.0-SNAPSHOT-jar-with-dependencies.jar
 3- JAVA_HOME=/usr/lib/jvm/java-8-oracle/ /usr/lib/jvm/java-8-oracle/bin/java -jar corpproc-1.0-SNAPSHOT-jar-with-dependencies.jar -i
 4- JAVA_HOME=/usr/lib/jvm/java-8-oracle/ /usr/lib/jvm/java-8-oracle/bin/java -jar corpproc-1.0-SNAPSHOT-jar-with-dependencies.jar index -o /mnt/data/indexes/wikipedia/enwiki-20150901/final_new /mnt/data/indexes/wikipedia/enwiki-20150901/collPart007/


The source files of the program for semantic indexing can be found in the directory:

 ./processing_steps/7/mg4j-big-5.2.1

Compile with the command below (if you get a 'missing definitions' error, more information about it can be found at Mg4j_improvements):

 ant ivy-setupjars && ant
 mkdir /mnt/data/indexes/CC-2015-18/collPart001
 mkdir /mnt/data/indexes/CC-2015-18/final
 find /mnt/data/indexes/CC-2015-18/collPart001 -type f | java -cp $(echo ./processing_steps/7/mg4j/*.jar | tr ' ' ':') it.unimi.di.big.mg4j.document.CustomDocumentCollection /mnt/data/indexes/CC-2015-18/final/collPart001.collection
 java -cp $(echo ./processing_steps/7/mg4j/*.jar | tr ' ' ':') it.unimi.di.big.mg4j.tool.IndexBuilder -S /mnt/data/indexes/CC-2015-18/final/collPart001.collection /mnt/data/indexes/CC-2015-18/final/collPart001

Script for parallel execution on multiple servers: the script starts the indexation by creating 6 shards, filling them, building a collection from them, and then starting the indexation of that collection. To do this, you need to specify the argument start. Without an argument, the status of the individual screens and programs is printed. The argument stop terminates the screens.

 ./processing_steps/7/index.py

Use of the script:

 ./index.py [start|stop] -i INPUT_DIRECTORY -o OUTPUT_DIRECTORY [-s SERVERS_FILE] [-e ERRORS_LOG_FILE] [-b BINARY]
 
 -i   --input     input directory
 -o   --output    output directory
 -s   --servers   file containing the list of servers, one hostname per line; if a line contains a tab, it is treated as a separator
                  and the hostname is the text before it (this keeps the file compatible with scripts that also specify the number
                  of threads per machine using the format HOSTNAME \t THREADS on each line)
 -e   --errors    if set, errors are logged to this file with the current date and time
 -b   --binary    path to directory containing .jar files (default is /mnt/data/wikipedia/scripts/mg4j/)

Examples:

 ./index.py -s ~/servers
 # Displays the status of the screens and programs on all servers from the file ~/servers
 ./index.py -i /mnt/data/commoncrawl/CC-2015-18/secresult/ -o /mnt/data/indexes/CC-2015-18/ -s ~/servers start
 # Runs the indexation on the selected servers
 ./index.py -s ~/servers stop
 # Terminates the screens, and thereby the processes, on all servers from the file ~/servers


Columns: https://docs.google.com/spreadsheets/d/1S4sJ00akQqFTEKyGaVaC3XsCYDHh1xhaLtk58Di68Kk/edit?usp=sharing

Note: it is not clear whether the special XML characters in the columns should be escaped. The current implementation of SEC does not escape them (in fact, any existing escaping is removed). tagMG4JMultiproc.py does not escape them either; on the other hand, it keeps the escaping present in the input (only partially, probably just in some columns).

If columns are changed, you have to change:

Daemon replying to requests (Servers running over indexes)

It is launched as follows:

 cd /mnt/data/indexes/CC-2015-18/final
 java -cp $(echo ./processing_steps/7/mg4j/*.jar | tr ' ' ':') it.unimi.di.big.mg4j.query.HttpJsonServer -p 12000 /mnt/data/indexes/CC-2015-18/final/collPart001.collection

Script for parallel execution on multiple servers:

Launches the daemons replying to requests. They are started inside screens and the parameter start must be given. Without a parameter, the script prints the state of the screens and daemons. The parameter stop terminates the screens.

 ./processing_steps/7/daemon.py

Use of the script:

 ./daemon.py [start|stop|restart] -i INPUT_DIRECTORY [-p PORT] [-s SERVERS_FILE] [-e ERRORS_LOG_FILE] [-b BINARY]
 
 -i   --input     input directory (automatically appends /final/ to the end of the path)
 -p   --port      port (default is 12000)
 -s   --servers   file containing the list of servers, one hostname per line; if a line contains a tab, it is treated as a separator
                  and the hostname is the text before it (this keeps the file compatible with scripts that also specify the number
                  of threads per machine using the format HOSTNAME \t THREADS on each line)
 -e   --errors    if set, errors are logged to this file with the current date and time
 -b   --binary    path to directory containing .jar files (default is /mnt/data/wikipedia/scripts/mg4j/)

Examples:

 ./daemon.py -i /mnt/data/indexes/CC-2015-18/final -s ~/servers start
 # Starts screens running the daemon over the collection in the directory /mnt/data/indexes/CC-2015-18/final
 # on the machines listed in the file ~/servers
 ./daemon.py -s ~/servers
 # Displays the status of running screens and daemons on all machines listed in the file ~/servers
 ./daemon.py -s ~/servers stop
 # Terminates the screens on all machines listed in the file ~/servers

8. a) Commandline query tool

The source code of the program is in the directory:

 ./processing_steps/8a/mg4jquery

Compilation:

 JAVA_HOME=/usr/lib/jvm/java-8-oracle/; mvn clean compile assembly:single

Examples of launching:

 JAVA_HOME=/usr/lib/jvm/java-8-oracle/; java -jar mg4jquery-0.0.1-SNAPSHOT-jar-with-dependencies.jar -h ../../servers.txt -m ../src/main/java/mg4jquery/mapping.xml -s ../src/main/java/mg4jquery/config.xml  -q "\"was killed\""
 JAVA_HOME=/usr/lib/jvm/java-8-oracle/; java -jar mg4jquery-0.0.1-SNAPSHOT-jar-with-dependencies.jar -h ../../servers.txt -m ../src/main/java/mg4jquery/mapping.xml -s ../src/main/java/mg4jquery/config.xml  -q "1:nertag:person < 2:nertag:person" -c "1.nerid != 2.nerid"

For example, the second query returns documents containing at least two different persons. The servers file expects one server address with port per line, for example:

 knot01.fit.vutbr.cz:12000

8. b) Web GUI

Launching:

Artifact installation:

 mvn install:install-file -Dfile=./processing_steps/8b/maven_deps/jars/vaadin-html5-widgets-1.0.jar -DgroupId=de.akquinet.engineering.vaadin -DartifactId=vaadin-html5-widgets -Dversion=1.0 -Dpackaging=jar
 mvn install:install-file -Dfile=./processing_steps/8b/maven_deps/jars/MyComponent-1.0-SNAPSHOT.jar -DgroupId=cz.vutbr.fit -DartifactId=MyComponent -Dversion=1.0-SNAPSHOT -Dpackaging=jar

Preparation:

 cd ./processing_steps/8b/mg4j-gui
 mvn install

Launching on port 8086:

 mvn jetty:run -Djetty.port=8086

The subdirectory maven_deps/src contains the sources of our own GWT components for displaying dynamic tooltips. To successfully retrieve query results from the servers, the server addresses must be set in the options (in the usual format server_domain_name:port). You can also set the number of results per page, the behaviour of the info windows (dynamic: the window is hidden as soon as the mouse leaves the entity; static: it stays visible until the user moves the cursor over another entity or otherwise changes the application state) and the display type (the default is corpus-based, but you can switch to document-based).

Querying

Querying is quite similar to MG4J; the semantic index automatically remaps queries to the same index, so there is no need to write, for example:

 "(nertag:person{{nertag-> token}}) killed"

This query is enough:

 "nertag:person killed"

Compared to MG4J there is an extension, "global constraints", which allows a token to be labelled and a post-filter to be applied to it.

 1:nertag:person < 2:nertag:person
 1.fof != 2.fof AND 1.nerid = 2.nerid

For example, this returns documents where the same person appears in different forms (often a name and a coreference). To restrict a query to occurrences within the same sentence, you can use the difference operator.

 nertag:person < nertag:person - _SENT_
 nertag:person < nertag:person - _PAR_

These queries search for two persons within one sentence (or paragraph, respectively). The difference means that only the parts of the text not containing the given token are considered.

Indexes

There are a lot of indexes which can be queried:

 position
 token
 tag
 lemma
 parpos
 function
 parword
 parlemma
 paroffset
 link
 length
 docuri
 lower
 nerid
 nertag
 person.name
 person.gender
 person.birthplace
 person.birthdate
 person.deathplace
 person.deathdate
 person.profession
 person.nationality
 artist.name
 artist.gender
 artist.birthplace
 artist.birthdate
 artist.deathplace
 artist.deathdate
 artist.role
 artist.nationality
 location.name
 location.country
 artwork.name
 artwork.form
 artwork.datebegun
 artwork.datecompleted
 artwork.movement
 artwork.genre
 artwork.author
 event.name
 event.startdate
 event.enddate
 event.location
 museum.name
 museum.type
 museum.estabilished
 museum.director
 museum.location
 family.name
 family.role
 family.nationality
 family.members
 group.name
 group.role
 group.nationality
 nationality.name
 nationality.country
 date.year
 date.month
 date.day
 interval.fromyear
 interval.frommonth
 interval.fromday
 interval.toyear
 interval.tomonth
 interval.today
 form.name
 medium.name
 mythology.name
 movement.name
 genre.name
 nertype
 nerlength

Warning: every query term starting with a number or containing the character "*" must be surrounded by parentheses, otherwise it causes a syntax error.

 nertag:event ^ event.startdate: (19*)

Queries on the attributes of a named entity should be combined with a query on the nertag index (the indexes overlap each other, so this saves space). Global constraints can additionally use the index "fof", which is an abbreviation of "full occurrence from".
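
For instance, by analogy with the example above, an attribute query is combined with the nertag index like this (the attribute value is only illustrative):

 nertag:person ^ person.nationality:(Dutch)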

File redistribution

The script redistributes data (files) among the given servers in order to use the capacity of the disk arrays evenly.

 ./processing_steps/1/redistribute.py

Usage:

 ./redistribute.py -i INPUT_DIRECTORY [-o OUTPUT_DIRECTORY [-d DISTRIB_DIRECTORY]] -s SERVER_LIST [-p RELATED_PATHS] [-x EXTENSION] [-r] [-m] [-e ERRORS_LOG_FILE]

The generated redistribution scripts then have to be launched via parallel-ssh and removed afterwards.

Example:

 python /mnt/minerva1/nlp/projects/corpproc/corpora_processing_sw/processing_steps/1/redistribute.py -i /mnt/data/commoncrawl/CC-2015-14/warc -s /mnt/minerva1/nlp/projects/corpproc/corpora_processing_sw/processing_steps/servers.txt -m -o /home/idytrych/redistributionScripts -d /mnt/data/commoncrawl/software/redistributionScripts -x "-warc.gz" -p /home/idytrych/CC-2015-14-rel.txt >moves.txt

where CC-2015-14-rel.txt contains:

 /mnt/data/commoncrawl/CC-2015-14/uri	-warc.domain
 /mnt/data/commoncrawl/CC-2015-14/uri	-warc.domain.srt
 /mnt/data/commoncrawl/CC-2015-14/uri	-warc.netloc
 ...

Afterwards the scripts are launched in a screen:

 parallel-ssh -h servery_b9_idytrych.txt -t 0 -A -i "bash /mnt/data/commoncrawl/software/redistributionScripts/\$HOSTNAME.sh"

and then removed:

 parallel-ssh -h servery_b9_idytrych.txt -t 0 -A -i "rm /mnt/data/commoncrawl/software/redistributionScripts/\$HOSTNAME.sh"


Note: no script is generated for servers onto which nothing is to be moved; an error is printed for them.

Where can you test it?

A search engine is running on the server athena1; you can try, for example: [http://athena1.fit.vutbr.cz:8088/#1:nertag:%28artist%20OR%20person%29%20%3C%20lemma:%28inspire%20OR%20influence%20OR%20admiration%20OR%20tutelage%20OR%20tribute%20OR%20homage%29%20%3C%202:nertag:%28artist%20OR%20person%29%20-%20_SENT_;1.nerid%20!=%202.nerid]

Indexes for this search engine are running on almost all servers. For Wikipedia you can restart the daemons like this (only the person who launched the daemons can actually restart them):

 python daemon.py restart -i /mnt/data/indexes/wikipedia/enwiki-20150901/ -s /mnt/minerva1/nlp/projects/corpproc/serverLists/serversIW.txt -b /mnt/data/wikipedia/software/mg4j

For CC (in the order of the incremental deduplication):

 python daemon.py restart -i /mnt/data/indexes/CC-2015-32/ -s /mnt/minerva1/nlp/projects/corpproc/serverLists/serversIC32.txt -b /mnt/data/commoncrawl/software/mg4j -p 12001
 python daemon.py restart -i /mnt/data/indexes/CC-2015-35/ -s /mnt/minerva1/nlp/projects/corpproc/serverLists/serversIC35.txt -b /mnt/data/commoncrawl/software/mg4j -p 12002
 python daemon.py restart -i /mnt/data/indexes/CC-2015-40/ -s /mnt/minerva1/nlp/projects/corpproc/serverLists/serversIC40.txt -b /mnt/data/commoncrawl/software/mg4j -p 12003
 python daemon.py restart -i /mnt/data/indexes/CC-2015-27/ -s /mnt/minerva1/nlp/projects/corpproc/serverLists/serversIC27.txt -b /mnt/data/commoncrawl/software/mg4j -p 12004

Get it working on Salomon

  1. Clone the repository with the SEC API (minerva1.fit.vutbr.cz:/mnt/minerva1/nlp/repositories/decipher/secapi) to home on Salomon
  2. Download the KB (in the directory secapi/NER run ./downloadKB.sh)
  3. Copy to home on Salomon:
  4. Search all files in home (and in subdirectories) for the string "idytrych" (or possibly "smrz") and adjust the absolute paths
  5. Build mpidedup (cd mpidedup; make)
  6. Create the working directory with the script createWorkDirs_single.sh
  7. The file warcDownloadServers.cfg contains the list of nodes on which CommonCrawl will be downloaded; it is advisable to check that it is up to date.
  8. To download CommonCrawl it is necessary to obtain an .s3cfg file and put it into home.

For version 4 of the scripts:

  1. ...
  2. Build mpidedup (cd mpidedup; make)
  3. Create the working directory with the script createWorkDirs_single.sh
  4. The file warcDownloadServers.cfg contains the list of nodes on which CommonCrawl will be downloaded; it is advisable to check that it is up to date.
  5. To download CommonCrawl it is necessary to obtain an .s3cfg file and put it into home.

Launching on our servers

Processing of Wikipedia

Complete sequence for launching (not tested yet):

 (the dump is labeled RRRRMMDD)
 cd 1a
 ./download_wikipedia_and_extract_html.sh RRRRMMDD
 cd ..
 
 python ./1/distribute.py -i ./2/NLP-slave/target/NLP-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar -o /mnt/data/wikipedia/software/ -s servers.txt -a
 python ./2/verticalize.py -i /mnt/data/wikipedia/enwiki-RRRRMMDD/html_from_xml/AA/ -o /mnt/data/wikipedia/enwiki-RRRRMMDD/vert/ -s servers.txt -b /mnt/data/wikipedia/software/NLP-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar
 python ./1/distribute.py -i ./4/TT-slave/target/TT-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar -o /mnt/data/wikipedia/software/ -s servers.txt -a
 python ./4/tag.py -i /mnt/data/wikipedia/enwiki-RRRRMMDD/vert/ -o /mnt/data/wikipedia/enwiki-RRRRMMDD/tagged/ -s servers.txt -b /mnt/data/wikipedia/software/TT-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar
 python ./1/distribute.py -i ./5/MDP-package/MDP-1.0/build/jar/mdp.jar -o /mnt/data/wikipedia/software/ -s servers.txt -a
 python ./5/parse.py -i /mnt/data/wikipedia/enwiki-RRRRMMDD/tagged/ -o /mnt/data/wikipedia/enwiki-RRRRMMDD/parsed/ -s servers.txt -b /mnt/data/wikipedia/software/mdp.jar
 python ./6/ner.py -i /mnt/data/wikipedia/enwiki-RRRRMMDD/parsed/ -o /mnt/data/wikipedia/enwiki-RRRRMMDD/secresult/ -s servers.txt
 python ./1/distribute.py -i ./7/mg4j/ -o /mnt/data/wikipedia/software/mg4j/ -s servers.txt -a
 python ./7/index.py -i /mnt/data/wikipedia/enwiki-RRRRMMDD/secresult/ -o /mnt/data/indexes/enwiki-RRRRMMDD/ -s servers.txt start

Launching the daemons that answer queries:

 python ./daemon.py -i /mnt/data/indexes/enwiki-RRRRMMDD/final -s servers.txt -b /mnt/data/wikipedia/software/mg4j/ start

Processing of CommonCrawl

Complete sequence for launching (not tested yet):

 (the CC crawl is labeled RRRR-MM)
 cd processing_steps/1b/download_commoncrawl
 ./dl_warc.sh RRRR-MM
 cd ../..
 python ./1/distribute.py -i ./2/NLP-slave/target/NLP-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar -o /mnt/data/commoncrawl/software/ -s servers.txt -a
 python ./2/verticalize.py -i /mnt/data/commoncrawl/CC-RRRR-MM/warc/ -o /mnt/data/commoncrawl/CC-RRRR-MM/vert/ -s servers.txt -b /mnt/data/commoncrawl/software/NLP-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar
 python ./1/distribute.py -i ./3/dedup/server -o /mnt/data/commoncrawl/software/dedup/ -s servers.txt -a
 python ./1/distribute.py -i ./3/dedup/dedup -o /mnt/data/commoncrawl/software/dedup/ -s servers.txt -a
 cd 3
 parallel-ssh -h servers_only.txt -t 0 -i "mkdir /mnt/data/commoncrawl/CC-RRRR-MM/hashes/"
 (to load hashes from a previous processing run, -i /mnt/data/commoncrawl/CC-RRRR-MM/hashes/ has to be added to the following command)
 python ./server.py start -s servers.txt -w workers.txt -o /mnt/data/commoncrawl/CC-RRRR-MM/hashes/ -b /mnt/data/commoncrawl/software/dedup/server
 python ./deduplicate.py -i /mnt/data/commoncrawl/CC-RRRR-MM/vert/ -o /mnt/data/commoncrawl/CC-RRRR-MM/dedup/ -w workers.txt -s servers.txt -b /mnt/data/commoncrawl/software/dedup/dedup
 python ./server.py stop -s servers.txt -w workers.txt -b /mnt/data/commoncrawl/software/dedup/server
 cd ..
 python ./1/distribute.py -i ./4/TT-slave/target/TT-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar -o /mnt/data/commoncrawl/software/ -s servers.txt -a
 python ./4/tag.py -i /mnt/data/commoncrawl/CC-RRRR-MM/dedup/ -o /mnt/data/commoncrawl/CC-RRRR-MM/dedup/tagged/ -s servers.txt -b /mnt/data/commoncrawl/software/TT-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar
 python ./1/distribute.py -i ./5/MDP-package/MDP-1.0/build/jar/mdp.jar -o /mnt/data/commoncrawl/software/ -s servers.txt -a
 python ./5/parse.py -i /mnt/data/commoncrawl/CC-RRRR-MM/tagged/ -o /mnt/data/commoncrawl/CC-RRRR-MM/parsed/ -s servers.txt -b /mnt/data/commoncrawl/software/mdp.jar
 python ./6/ner.py -i /mnt/data/commoncrawl/CC-RRRR-MM/parsed/ -o /mnt/data/commoncrawl/CC-RRRR-MM/secresult/ -s servers.txt
 python ./1/distribute.py -i ./7/mg4j/ -o /mnt/data/commoncrawl/software/mg4j/ -s servers.txt -a
 python ./7/index.py -i /mnt/data/commoncrawl/CC-RRRR-MM/secresult/ -o /mnt/data/indexes/CC-RRRR-MM/ -s servers.txt start

Launching the daemons that answer queries:

 python ./daemon.py -i /mnt/data/indexes/CC-RRRR-MM/final -s servers.txt -b /mnt/data/commoncrawl/software/mg4j/ start

Launching on Salomon

 ls -1 /scratch/work/user/idytrych/warc | sed 's/\-warc.gz//' > ~/namelist
 or
 ls -1 /scratch/work/user/idytrych/wikitexts/ > ~/namelist
 seq 14 > numtasks
 bash createCollectionsList_single.sh 192
 #PBS -q qexp
 (replace, for example, by)
 #PBS -q qprod
 #PBS -A IT4I-9-16

and add

 #PBS -l walltime=48:00:00
 bash prepare_dl_warc_single.sh 2015-3
 Download CC (manually on login nodes):
   /scratch/work/user/idytrych/CC-2015-32/download/dload_login1_all.sh
   /scratch/work/user/idytrych/CC-2015-32/download/dload_login2_all.sh
   /scratch/work/user/idytrych/CC-2015-32/download/dload_login3_all.sh
   /scratch/work/user/idytrych/CC-2015-32/download/dload_login4_all.sh
 ls /scratch/work/user/idytrych/CC-2015-32/warc/ | sed 's/\-warc.*//g' > ~/namelist
 mv /scratch/work/user/idytrych/CC-2015-32/warc /scratch/work/user/idytrych/
 wc -l namelist
 (the printed number of rows in namelist is NNN; it will be needed below)
 (compute NU = NNN / MU, where MU is the maximum number of nodes that can be used in parallel according to Salomon's documentation)
 seq NU >numtasks
 qsub -N vert -J 1-NNN:NU vert.sh
 qsub dedup.sh
 qsub -N tag -J 1-NNN:NU tag.sh
 qsub -N parse -J 1-NNN:NU parse.sh
 secapi/SEC_API/salomon/v3/start.sh 3
 (M is the number of collections; there should be about 6 per destination server)
 bash createCollectionsList_single.sh M
 (it is assumed that there are enough nodes to process all collections in parallel, both with 24 and with 8 processes)
 seq 24 >numtasks
 qsub -N createShards -J 1-M:24 createShards.sh
 qsub -N populateShards -J 1-M:24 populateShards.sh
 qsub -N makeCollections -J 1-M:24 makeCollections.sh
 seq 8 >numtasks
 qsub -N makeIndexes -J 1-M:8 makeIndexes.sh
 (NP is the number of parts into which Wikipedia should be divided, i.e. the number of destination servers)
 bash createWikiParts_single.sh NP
 (Not tested part: )
   bash download_and_split_wikipedia_single.sh 20150805
   qsub -N extract_wikipedia extract_wikipedia_html.sh
 ls -1 /scratch/work/user/idytrych/wikitexts/ > ~/namelist
 wc -l namelist
 (the printed number of rows in namelist is NNN; it will be needed below)
 (compute NU = MAX(14, NNN / MU), where MU is the maximum number of nodes that can be used in parallel according to Salomon's documentation; it makes no sense to use fewer than 14 per node)
 qsub -N vertWiki -J 1-NNN:NU vertWiki.sh
 qsub -N tagWiki -J 1-NNN:NU tag.sh
 qsub -N parseWiki -J 1-NNN:NU parseWiki.s
 (compute NUL = MAX(3, NNN / MUL), where MUL is the maximum number of nodes that can be used in parallel in qlong according to Salomon's documentation; it makes no sense to use fewer than 3 per node)
 secapi/SEC_API/salomon/v3/start.sh NUL
 (M is the number of collections; there should be one per destination server)
 bash createCollectionsList_single.sh M
 (it is assumed that there is enough nodes to process all collections in parallel in case of 24 as well as in case of 8 processes)
 seq 24 >numtasks
 qsub -N createShards -J 1-M:24 createShards.sh
 qsub -N populateShards -J 1-M:24 populateShards.sh
 qsub -N makeCollections -J 1-M:24 makeCollections.sh
 seq 8 >numtasks
 qsub -N makeIndexes -J 1-M:8 makeIndexes.sh

Launching on Salomon with new scripts

 ls -1 /scratch/work/user/idytrych/warc | sed 's/\-warc.gz//' > ~/namelist
 or
 ls -1 /scratch/work/user/idytrych/wikitexts/ > ~/namelist
 bash prepare_dl_warc_single.sh 2015-32
 Download CC (manually on the login nodes):
   /scratch/work/user/idytrych/CC-2015-32/download/dload_login1_all.sh
   /scratch/work/user/idytrych/CC-2015-32/download/dload_login2_all.sh
   /scratch/work/user/idytrych/CC-2015-32/download/dload_login3_all.sh
   /scratch/work/user/idytrych/CC-2015-32/download/dload_login4_all.sh
 ls /scratch/work/user/idytrych/CC-2015-32/warc/ | sed 's/\-warc.*//g' > ~/namelist
 mv /scratch/work/user/idytrych/CC-2015-32/warc /scratch/work/user/idytrych/
 bash start.sh vert 10 qprod
 (can be repeated N times to add more nodes; instead of 10 it is possible to use 20, but overall it is not advisable to use more than 70 nodes)
 bash start.sh dedup 4 qprod
 (nodes cannot be added)
 bash start.sh tag 10 qprod
 (can be repeated N times to add more nodes; instead of 10 it is possible to use 20, but overall it is not advisable to use more than 40 nodes)
 bash start.sh parse 10 qprod
 (can be repeated N times to add more nodes; instead of 10 it is possible to use 25, but overall it is not advisable to use more than 90 nodes)
 bash start.sh sec 1 qprod
 (can be repeated N times to add more nodes; instead of 1 it is possible to use 50, but the first time there is a pause of about 10 minutes after the first node for the build, and overall it is not advisable to use more than 100 nodes)
 mkdir /scratch/work/user/idytrych/CC-2015-32/mg4j_index
 mkdir /scratch/work/user/idytrych/CC-2015-32/mg4j_index/final
 (MMM is the desired number of collections; number of destination servers * 6 might be appropriate)
 bash startIndexing.sh cList MMM
 bash startIndexing.sh cShards qprod /scratch/work/user/idytrych/CC-2015-32/mg4j_index
 bash startIndexing.sh pShards qprod /scratch/work/user/idytrych/CC-2015-32/mg4j_index
 bash startIndexing.sh colls qprod /scratch/work/user/idytrych/CC-2015-32/mg4j_index
 bash startIndexing.sh indexes qprod /scratch/work/user/idytrych/CC-2015-32/mg4j_index
 ...
 bash start.sh vert 1:f qprod
 bash start.sh vert 50 qprod
 (the first node is launched separately and it must be guaranteed to run for the whole duration of the processing; alternatively use qlong)
 (further nodes can be added at any rate and in any quantity, although it does not make much sense to use more than 150; the discs cannot handle it)
 bash start.sh dedup 4 qprod
 bash start.sh tag 1:f qprod
 bash start.sh tag 50 qprod
 bash start.sh parse 1:f qprod
 bash start.sh parse 50 qprod
 bash start.sh sec 1:f qlong
 bash start.sh sec 20 qprod
 bash startCheckSec.sh qprod namelist 2 1 /scratch/work/user/idytrych/parsed /scratch/work/user/idytrych/secsgeresult
 bash startCheckSec.sh qprod namelist 2 2 /scratch/work/user/idytrych/parsed /scratch/work/user/idytrych/secsgeresult
 (2 1 means the number of parts and which part to launch; with multiple parts the result is obtained faster; the number of parts is passed to the following command via -n)
 python remove.py -n 2 -s /scratch/work/user/idytrych/secsgeresult >rm.sh
 sed "s/secsgeresult/sec_finished/;s/.vert.dedup.parsed.tagged.mg4j//" rm.sh >rmf.sh
 bash rm.sh
 bash rmf.sh
 rm rm.sh
 rm rmf.sh
 bash create_restlist.sh namelist /scratch/work/user/idytrych/secsgeresult >restlist
 rm /scratch/work/user/idytrych/counter/*
 bash start.sh sec 1:f qlong restlist
 bash start.sh sec 20 qprod restlist
 bash startCheckIndexes.sh /home/idytrych/collectionlist /scratch/work/user/idytrych/CC-2015-32/mg4j_index "/home/idytrych/check_i.txt"
 (each line in check_i.txt should show the correct number of index files, and at the same time there should not be any exception in any error log)

Data format

Manatee

A good example of the Manatee format can be downloaded here:
http://nlp.fi.muni.cz/trac/noske/wiki/Downloads
Specifically, the SUSANNE corpus.
It differs from ours in that it has only 4 columns (we have 27). All tags beginning with < keep this format; nothing is transformed as in the case of MG4J.
As for the necessary changes, it only means not adding GLUE as a token option and not generating things such as %%#DOC PAGE PAR SEN. In Manatee an underscore is used for an empty annotation (in MG4J it is 0). In addition, a Manatee configuration file, which defines the tags and the path to the vertical file, has to be created so that the corpus can be indexed with the encodevert program.
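
A minimal sketch of such a configuration (registry) file, assuming the usual Manatee registry syntax and illustrative paths and attribute names:

 NAME      "CC-2015-18"
 PATH      /mnt/data/manatee/CC-2015-18/
 VERTICAL  /mnt/data/manatee/CC-2015-18.vert
 ENCODING  utf-8
 ATTRIBUTE word
 ATTRIBUTE lemma
 ATTRIBUTE tag
 STRUCTURE doc {
     ATTRIBUTE uri
 }
 STRUCTURE p
 STRUCTURE s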

Elasticsearch

The Elasticsearch format used for semantic annotations looks as follows:

 Word[anotation1;anotation2...] and[anotation1;anotation2...] other[...;anotation26;anotation27] word[...;anotation26;anotation27]

The form of the annotations may be arbitrary; however, only alphanumeric characters and the underscore are allowed.
At the moment the following format is used: each annotation has the form typeOfAnnotation_value

The annotation types are:
position token tag lemma parpos function parword parlemma paroffset link length docuri lower nerid nertag param0 param1 param2 param3 param4 param5 param6 param7 param8 param9 nertype nerlength

The actual annotated text looks as follows:

 Word[position_1;token_Word...] 

For semantic querying, typical Lucene queries are used (see the testing query in the project directory).
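
A simple query over these annotations might then look like this (purely illustrative; the exact field and term forms depend on the Elasticsearch mapping and analyzer):

 token_killed AND nertag_person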