Processing is divided into several steps. A script was created for each step to simplify the work; the scripts take arguments and allow the processing to run on a single machine or in parallel on multiple machines.
All source code of the programs and scripts is in the repository corpora_processing_sw, and any missing libraries are in /mnt/minerva1/nlp/projects/corpproc. Anywhere else you might find old, non-functional versions!
The script reallocates data (files) between servers. Files are divided according to their size so that all servers hold a similar amount of data (a sketch of the idea follows the examples below). The parameter -a switches reallocation off and copies each file to all servers, which is particularly suitable for distributing processing programs.
./processing_steps/1/distribute.py
Usage:
./distribute.py [-i INPUT_DIRECTORY] -o OUTPUT_DIRECTORY -s SERVER_LIST [-a] [-e ERRORS_LOG_FILE]
-i --input input directory/file for distribution (if not set, file names are expected on stdin, separated by '\n')
-o --output output directory on the target server; if the directory doesn't exist, the script tries to create it
-a --all each file will be copied to all servers (suitable for scripts/programs)
-s --servers file containing the list of servers, one hostname per line; if a line contains a tabulator, it is treated as a separator and the text before it is used as the hostname (this keeps the file compatible with scripts in which the number of threads can be specified per machine, using the format HOSTNAME \t THREADS on each line)
-e --errors if set, errors are logged to this file with the current date and time
Examples:
./distribute.py -i ~/data_to_distribution/ -o /mnt/data/project/data/ -s ~/servers
# All files from the directory ~/data_to_distribution/ are reallocated between the servers listed in the file ~/servers.
# The output directory is /mnt/data/project/data/. If the data directory doesn't exist on some machine, the script tries to create it.
./distribute.py -i ~/NLP-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar -o /mnt/data/commoncrawl/software/ -s ~/servers -a
# Copies the NLP-slave program to the servers listed in the file ~/servers.
# The output directory is /mnt/data/commoncrawl/software/. If the directory "software" doesn't exist on a machine, the script creates it.
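The size balancing itself boils down to greedy assignment by file size. A minimal sketch of the idea (illustrative only, not the actual distribute.py implementation):

import os

def balance_files(files, servers):
    # Greedy size balancing: always give the next (largest) file to the
    # currently least-loaded server, so all servers end up with a similar
    # total amount of data.
    loads = {server: 0 for server in servers}
    plan = {server: [] for server in servers}
    for path in sorted(files, key=os.path.getsize, reverse=True):
        target = min(loads, key=loads.get)
        plan[target].append(path)
        loads[target] += os.path.getsize(path)
    return plan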
There is a program for extracting plain text from Wikipedia called WikiExtractor.
Launching:
cd /mnt/minerva1/nlp/corpora_datasets/monolingual/english/wikipedia/
./download_wikipedia_and_extract_html.sh 20151002
The script expects a file hosts.txt in the working directory containing the list of servers (one per line) on which it is supposed to run.
/mnt/minerva1/nlp/corpora_datasets/monolingual/english/wikipedia/tools/WikiExtractor.py
The output is a collection of files of approx. 100 MB each, distributed across the servers in /mnt/data/wikipedia/enwiki-.../html_from_xml/enwiki=...
To download WARC files you need to know the exact CommonCrawl crawl specification, e.g. "2015-18".
./processing_steps/1b/download_commoncrawl/dl_warc.sh
downloaded files are located in:
/mnt/data/commoncrawl/CC-Commoncrawl_specification/warc/
support files are located in:
/mnt/minerva1/nlp-2/download_commoncrawl/CC-Commoncrawl_specification/download/
Afterwards, URI statistics can be calculated using:
./processing_steps/1b/uri_stats.sh
result is saved in:
/mnt/minerva1/nlp-2/download_commoncrawl/CC-Commoncrawl_specification/uri/
data on individual machines are located in:
/mnt/data/commoncrawl/CC-Commoncrawl_specification/uri/
To collect URLs from given RSS sources, use:
./processing_steps/1c/collect.py
Usage:
./collect.py [-i INPUT_FILE] [-o OUTPUT_FILE [-a]] [-d DIRECTORY|-] [-e ERRORS_LOG_FILE]
-i --input input file containing RSS URLs separated by '\n' (if not set, they are expected on stdin)
-o --output output file for saving the parsed URLs (if not set, they are printed to stdout)
-a --append the output file is opened in append mode
-d --dedup deduplication of the obtained links (according to the matched URL); you can optionally enter a folder with files containing lists of already collected URLs, which are then included in the deduplication; in mode -a the output file is also included
-e --errors if set, errors are logged to this file with the current date and time
Examples:
./collect.py -i rss -o articles -a
# Appends the URLs found in the RSS sources listed in the file rss to the file articles
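Conceptually, collecting the article URLs from one RSS feed is just parsing the feed and reading the entry links. A rough sketch using the feedparser library (an assumption - collect.py may be implemented differently):

import feedparser

def links_from_rss(feed_url):
    # Parse the RSS feed and return the URLs of its entries.
    feed = feedparser.parse(feed_url)
    return [entry.link for entry in feed.entries if hasattr(entry, "link")]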
To download websites according to a given list of URLs and save them to a WARC archive, use:
./processing_steps/1c/download.py
Usage:
./download.py [-i INPUT_FILE] -o OUTPUT_FILE [-r REQUESTS] [-e ERRORS_LOG_FILE]
-i --input input file containing URLs (if not set, they are expected on stdin)
-o --output output file for saving the WARC archive
-r --requests limit on the number of requests per minute to one domain (default is 10)
-e --errors if set, errors are logged to this file with the current date and time
The script downloads sites evenly across domains, not in the order of the input file, to avoid what could look like an attack. A limit on requests per domain per minute can also be set. When the limit is reached for all domains, downloading is paused until the limit is restored. The limit is restored every 6 seconds (1/10 of a minute) by 1/10 of the total limit.
If an error occurs during the download (any response code other than 200), the whole domain is excluded.
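The per-domain limiting can be pictured as a budget of requests per domain that is topped up in tenths, every 6 seconds by 1/10 of the per-minute limit. A simplified sketch of that idea (illustrative only, not the actual download.py code):

from urllib.parse import urlparse

LIMIT = 10          # requests per domain per minute (the -r option)
budget = {}         # domain -> remaining requests in the current window

def may_download(url):
    # Returns True and charges the domain's budget, or False if the domain
    # has reached its limit and the URL should be retried later.
    domain = urlparse(url).netloc
    if budget.setdefault(domain, LIMIT) < 1:
        return False
    budget[domain] -= 1
    return True

def restore_budgets():
    # Called every 6 seconds: every domain regains 1/10 of the total limit.
    for domain in budget:
        budget[domain] = min(LIMIT, budget[domain] + LIMIT / 10.0)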
Example:
./download.py -i articles -o today.warc.gz
# downloads the sites whose URLs are listed in the file articles into the archive today.warc.gz
Vertical file format description
The input for the verticalization program is a warc.gz file. From this file the program unpacks individual records (documents), strips HTML, filters out non-English articles and performs tokenization. It can also process a single web page in .html and Wikipedia in preprocessed HTML. The output is saved with the extension .vert into the target folder.
New verticalizer:
./processing_steps/2/vertikalizator/main.py
Compile using the following command:
make
In case of error during compiling justextcpp (aclocal.m4), run the following commands in justextcpp/htmlcxx:
touch aclocal.m4 configure Makefile.in
./configure
Salomon may require loading modules (some of the following modules and their dependencies):
module load OpenMPI/1.8.8-GNU-4.9.3-2.25
module load Autoconf/2.69
module load Automake/1.15
module load Autotools/20150215
module load Python/2.7.9
module load GCC/4.9.3-binutils-2.25
If LZMA-compressed WARC output is required, backports.lzma module has to be installed:
pip install --user backports.lzma
Installation of this module can fail during compilation, which means the lzma library is not installed on the server. In that case run:
make lzma
The liblzma-dev package needs to be installed; use the command:
apt-get install -y liblzma-dev
Usage:
./main.py [-h] [-i INPUT] [-o OUTPUT] [-n] [-t INPUTTYPE] -s STOPWORDS [-l LOG] [-a WARCOUTPUT] [-d] [-m MAP]
-i --input input file for verticalization (if not set, WARC is expected on stdin)
-o --output output file (if not set, stdout is used)
-n --nolangdetect turns off language detection (speeds processing up with minimal difference in output)
-t --inputtype determines the expected input type, see Input types
-s --stopwords file with a list of stop words, required for boilerplate removal using the Justext algorithm
-l --log log file containing debugging information; STDOUT or STDERR can be given to print the debugging information to the respective output; if not present, no log is stored or displayed
-a --warcoutput path to the output WARC file; this file will contain the HTTP response records from the input whose contents were not completely removed by the Justext algorithm or language detection; no WARC output is generated if this parameter is not set; the parameter has no effect if the input is a single HTML file or a Wikipedia archive; it does not cancel the standard vertical output, so verticalization produces two outputs at the same time; compression is used if the filename ends with .gz (GZip) or .xz (LZMA)
-d --dedup the vertical output will be deduplicated; the file dedup_servers.txt has to be modified to contain the hostnames of the servers where the deduplication server processes should run; deduplication servers keep running even after the verticalization process ends, so they can be used by other verticalization processes, and have to be shut down manually
-m --map configuration map for deduplication (currently only in the hash redistribution branch)
Stdin input:
The verticalizer is faster with stdin input than with an input file. For example:
xzcat file | python main.py [-o OUTPUT] [-n] -s STOPWORDS [-l LOG] [-a WARCOUTPUT] [-d]
Warcreader is now located in the verticalizator folder.
Old warcreader
When upgrading to a newer version of the verticalizer it might be necessary to reinstall the warcreader package, which is developed together with the verticalizer but published on the Python Package Index as a standalone library. The package is installed into the current user's home directory by the make command and cannot be removed or upgraded by the pip utility. It has to be removed manually using:
rm -r ~/.local/lib/pythonX.X/site-packages/warcreader*
where X.X is the Python version, usually 2.7. Then you can install the new version of warcreader using pip.
The verticalizer can process several types of input. The expected input type must be specified with the -t/--inputtype argument. Possible values are:
warc - the verticalizer expects a WARC archive; the archive can be gzip-compressed, but in that case the .gz suffix is required in the input file name
wiki - the verticalizer expects a Wikipedia pages archive in the internal research group format
html - the verticalizer expects the HTML source code of a single web page
universal - the verticalizer expects input in the Universal Verticalization Format; the file can be gzip-compressed, but in that case the .gz suffix is required in the input file name
DO NOT USE: old verticalizer
Originally, NLP-slave (dependent on a local langdetect) was used for verticalization; its source code is in the repository:
./processing_steps/2/NLP-slave
This program expects a warc.gz file as input (first parameter), from which it gradually extracts individual records, strips HTML, filters out non-English articles and performs tokenization. It can also process a single web page in .html and Wikipedia in preprocessed plaintext or HTML. The result is saved with the extension .vert into the destination directory.
Compile using this command:
mvn clean compile assembly:single
Usage of old verticalizer:
java -jar package [opts] input_file.html output_dir file_URI [langdetect_profiles]
java -jar package [opts] input_file.txt output_dir [langdetect_profiles]
java -jar package [opts] input_file-warc.gz output_dir [langdetect_profiles]
java -jar package [opts] input_file output_dir [langdetect_profiles]
package - binary file of the program, e.g. ./processing_steps/2/NLP-slave/target/NLP-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar
opts - switches:
-d - turns on logging of debugging information
-z - sets the document URI in links to zero
-o - if set, there will be no document URI in links (the column disappears)
input_file.html - input file in uncompressed HTML (extension .html)
input_file-warc.gz - input file in compressed WARC format (name ends with warc and extension is .gz)
input_file.txt - input file from Wikipedia in preprocessed plaintext (extension .txt)
input_file - input file from Wikipedia in preprocessed HTML (extension is not .html, .txt, nor warc.gz)
output_dir - path to the output directory without a trailing slash
file_URI - URI of the file (it cannot be obtained directly from the file, because it is not WARC)
langdetect_profiles - optional parameter with the path to langdetect profiles; default is /usr/share/langdetect/langdetect-03-03-2014/profiles

Script for launching in parallel on multiple machines:
./processing_steps/2/verticalize.py
Usage:
./verticalize.py -i INPUT_DIRECTORY -o OUTPUT_DIRECTORY [-s SERVERS_FILE] [-b BINARY] [-t THREADS] [-l] [-d] [-w STOPWORDS_LIST] [-e ERRORS_LOG]
-i --input input directory containing files to verticalize
-o --output output directory; if it does not exist, the script attempts to create it
-s --servers file with the list of servers in the format HOSTNAME '\t' THREADS '\n' (one line per machine)
-e --errors if set, errors are logged to this file with the current date and time
-b --binary path to the verticalizer (default is ./processing_steps/2/vertikalizator/main.py)
-t --threads sets the number of threads (default is 6; thread counts given in the server list file have higher priority)
-l --no-lang turns off language detection
-d --debug the verticalizer prints debugging information
-w --stopwords file with a list of stop words (default is ./processing_steps/2/vertikalizator/stoplists/English.txt)
Example:
./verticalize.py -i /mnt/data/commoncrawl/CC-2015-18/warc/ -o /mnt/data/commoncrawl/CC-2015-18/vert/ -s ~/servers -b /mnt/data/commoncrawl/software/NLP-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar
# Data from the directory /mnt/data/commoncrawl/CC-2015-18/warc/ are verticalized into the directory /mnt/data/commoncrawl/CC-2015-18/vert/
# Verticalization is executed on all servers listed in the file ~/servers
# It uses the program /mnt/data/commoncrawl/software/NLP-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar, which has to be present on all machines
All profiling scripts can be found in the verticalizer folder profiling.
files_compare.py
Script for comparing two files. It takes two arguments, both paths to the files we want to compare.
Run example:
python files_compare.py testregularpuvodni.txt testregular11.vert
profiling.py
Script to simplify profiling. All variants to profile are uploaded to the folder. If there are multiple files of the same type, e.g. tokenizer.py, add "-" and a distinguishing suffix before the extension.
Example:
tokenizer-2.py or tokenizer-slouceniVyrazu.py
First, the script runs the original version of the verticalizer to get a baseline profiling result. It then takes the individual files, overwrites the tested file and runs verticalization. When verticalization ends, the original file is taken from the reset folder and written back into the verticalizer. (This is a precaution for the case when verticalize.py is tested first and then tokenizer.py, so that the edited verticalize.py is not used.)
Requirements: create a folder reset and copy the original files there. Then create a folder profiling, where the results will be saved. The script and both folders must be located in the verticalizator folder.
The script requires two arguments: the path to the input file and the file type.
Run example:
python profiling.py /mnt/data/commoncrawl/CC-2016-40/warc/1474738659833.43_20160924173739-00094-warc.xz warc
profiling_results.py
Script used to display the results. It requires three arguments: the folder with results, the sorting criterion and the number of items to display for each file.
Run example:
python profiling_result.py profiling tottime 20
Most important sorting options:
ncalls - the number of calls
tottime - the total time spent in the given function (excluding time spent in calls to sub-functions)
cumtime - the cumulative time spent in this function and all subfunctions (from invocation till exit); this figure is accurate even for recursive functions

Detailed documentation of deduplication can be found here.
The programs dedup and server are used for deduplication. Both can be compiled via the Makefile in the folder:
processing_steps/3/dedup/ in repository corpproc_dedup
Launch parameters:
./server -m=~/hashmap.conf [-i=INPUT_FILE] [-o=OUTPUT_FILE [-d]] [-s=STRUCT_SIZE] [-p=PORT] [-j=JOURNAL_FILE] [-k] [-a] [-P=RPORT]
-i --input input file
-o --output output file
-h --help lists the launch arguments and their use
-p --port server port (default 1234)
-s --size changes the size of the structure for saving hashes (default size is 300,000,000)
-d --debug a file containing debugging dumps is generated alongside the output file
-j --journal recovery of hashes from a journal file (use together with -j at the client)
-k --keep archives the journal file after successfully saving hashes to the output file
-m --map hash distribution map
-a --altered the distribution map has changed, enable hash migration
-P --rport port for hash migration (default -p + 1)
The server runs until it is "killed". Hashes are saved to the output file (if specified) in reaction to the signals SIGHUP and SIGTERM.
./dedup -i=INPUT_DIR -o=OUTPUT_DIR -s=SERVERS_STRING [-p=PORT] [-t=THREADS] [-n] [-d] [-wl] [-uf=FILTER_FILE]
-i --input input folder with files for deduplication
-o --output output folder; if it does not exist, the program attempts to create it
-p --port server port (default 1234)
-t --threads sets the number of threads (default 6)
-n --near uses the "nearDedup" algorithm
-d --debug for every output file .dedup, a file .dedup.debug containing debugging logs is generated
-wl --wikilinks deduplication of the WikiLinks format
-dr --dropped for each output file .dedup, a file .dedup.dropped containing the deleted duplicates is generated
-f --feedback records in the .dedup.dropped file contain a reference to the records responsible for their elimination (more below)
-dd --droppeddoc for each output file *.dedup, a .dedup.dd file containing the list of URL addresses of completely eliminated documents is created
-j --journal continue deduplication after a system crash - already processed files are skipped, unfinished ones are processed (use together with -j on the server)
-h --help prints help
-m --map hash distribution map
For easier launching there are the scripts server.py and deduplicate.py (described below), which allow running deduplication in parallel on multiple machines. The programs have to be pre-distributed to all machines in use and must be in the same location, e.g. /mnt/data/bin/dedup.
Launching servers for deduplication
With the argument start, the script first launches screens and then the servers inside them. With the argument stop, the script closes the screens. If neither start, stop nor restart is given, the script first checks whether the screens are running and then the servers.
./processing_steps/3/server.py
Usage:
./server.py [start|stop|restart|migrate] -m MAP_FILE [-i INPUT_FILE] [-o OUTPUT_FILE [-d]] [-a] [-t THREADS] [-p PORT] [-e ERRORS_LOG_FILE] [-b BINARY] [-d] [-j] [-k] [-r SPARSEHASH_SIZE] [-P RPORT]
start start servers
stop stop servers, wait for hash serialization
restart restart servers
migrate migrate hashes and exit
-i --input input file
-o --output output file
-t --threads number of threads containing workers (default is 384)
-p --port server port (default 1234)
-r --resize changes the size of the structure for saving hashes (default size is 300000000)
-e --errors if set, errors are logged to this file with the current date and time
-b --binary path to the deduplication server (default is /mnt/data/commoncrawl/corpproc/bin/server)
-d --debug a file containing debugging dumps is generated alongside the output file
-j --journal recover hashes from a journal file (suggested to use together with -j at the client)
-k --keep archives the journal file after successfully saving hashes to the output file
-v --valgrind runs the server in valgrind (emergency debugging)
-P --rport port for the hash migration service (default is -p + 1)
-E --excluded list of excluded servers for hash migration (old -> new)
-a --altered enables hash migration, see -a at the server
-m --map hash distribution map
Note: -j, --journal requires the path to the input file from which hashes should be recovered, i.e. the file that was specified as the output file before the server crash. The script verifies whether the file "input_file".backup exists. If the input file does not exist on a server, it will be created.
For example:
./server.py start -m ~/hashmap
# Launches screens on the machines listed in the file ~/hashmap and then launches servers in them
# The servers wait for workers to connect
./server.py -m ~/hashmap
# Tests whether screens and servers are running on the machines listed in the file ~/hashmap
./server.py stop -m ~/hashmap -E ~/excluded.list
# Closes screens and servers on the machines listed in the files ~/hashmap and ~/excluded.list
./server.py migrate -m ~/hashmap -E ~/excluded.list -i ~/input.hash -o ~/output.hash
# Runs hash migration according to the distribution map; when migration ends, the hashes are saved and the servers are stopped
Launching workers for deduplication
The workers must be launched with the same -m and -p parameters as the servers.
./processing_steps/3/deduplicate.py
Usage:
./deduplicate.py -i INPUT_DIRECTORY -o OUTPUT_DIRECTORY -w WORKERS_FILE -m MAP_FILE [-p PORT] [-e ERRORS_LOG_FILE] [-b BINARY] [-n] [-d] [-wl]
-i --input input folder with files for deduplication
-o --output output folder; if it does not exist, the script attempts to create it
-w --workers file with the list of workers in the format HOSTNAME '\t' THREADS '\n' (one line per machine) (beware - replacing tabulators with spaces will not work)
-p --port server port (default 1234)
-e --errors if set, errors are logged to this file with the current date and time
-b --binary path to the deduplication program (default is /mnt/data/commoncrawl/corpproc/bin/dedup)
-t --threads number of threads (default is 6; thread counts given in the worker list file have higher priority)
-n --near uses the "nearDedup" algorithm
-d --debug alongside each output file .dedup, a .dedup.dropped file containing the removed duplicates and a .dedup.debug file containing debugging dumps are generated
-wl --wikilinks deduplication of the WikiLinks format
-dr --dropped for each output file .dedup, a .dedup.dropped file containing the deleted duplicates is generated
-f --feedback records in the .dedup.dropped file contain a reference to the records responsible for their elimination (see below)
-dd --droppeddoc a file "droppedDocs.dd" containing the list of completely excluded documents is created in the output directory
-j --journal continue deduplication after a system crash - already processed files are skipped, unfinished ones are processed (use together with -j on the server)
-m --map hash distribution map
Deduplication of the WikiLinks format computes a hash over the concatenation of columns 2, 3, 5 and 6; all of these columns have to match for a row to be evaluated as a duplicate. In nearDedup, N-gram hashes are computed over the concatenation of columns 5, 3 and 6 (in this order), and then a hash of column 2 is computed; a row is considered a duplicate only if both of these match.
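A conceptual Python sketch of the column hashing described above (illustrative only - the real dedup binary is C++ and the hash function used here is just an assumption):

import hashlib

def wikilinks_hash(cols):
    # Exact WikiLinks dedup: one hash over columns 2, 3, 5 and 6 (1-based).
    key = "".join(cols[i - 1] for i in (2, 3, 5, 6))
    return hashlib.md5(key.encode("utf-8")).hexdigest()

def wikilinks_near_hashes(cols):
    # nearDedup: a hash over columns 5, 3, 6 (in this order) plus a hash of
    # column 2; a row counts as a duplicate only when both were seen before.
    ngram_key = "".join(cols[i - 1] for i in (5, 3, 6))
    return (hashlib.md5(ngram_key.encode("utf-8")).hexdigest(),
            hashlib.md5(cols[1].encode("utf-8")).hexdigest())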
For example:
./deduplicate.py -i /mnt/data/commoncrawl/CC-2015-18/vert/ -o /mnt/data/commoncrawl/CC-2015-18/dedup/ -w ~/workers -s ~/servers
# Data from the folder /mnt/data/commoncrawl/CC-2015-18/vert/ are deduplicated into the folder /mnt/data/commoncrawl/CC-2015-18/dedup/
# Deduplication runs on the machines listed in the file ~/workers
# Running servers on the machines listed in the file ~/servers are expected
This step only differs on Salomon because it deals with distributed computation. In standard execution the communication goes through sockets via standard TCP/IP. On Salomon this is not possible (InfiniBand), which is why the MPI library is used. Source codes are in:
/mnt/minerva1/nlp/projects/corpproc/corpora_processing_sw/processing_steps/salomon/mpidedup
It is highly recommended to use the module OpenMPI/1.8.8-GNU-4.9.3-2.25 due to compatibility issues with some other versions.
Compilation:
module load OpenMPI/1.8.8-GNU-4.9.3-2.25
make
Launching on Salomon
Deduplication is easiest to run using the v5/start.sh script; the launch parameters are set in the script. It is recommended to use 4 nodes:
bash start.sh dedup 4 qexp
Parameters:
-h --hash - relative number of servers holding particular subspaces of the total hashing space
-w --work - relative number of worker servers executing the deduplication itself
-i --input - input directory with files for deduplication
-o --output - output directory
-l --load - optional directory used to load existing hashes (for incremental deduplication - it obviously needs to be launched the same way (infrastructure, number of workers and servers with hashing space) as when the hashes were saved)
-s --store - optional directory to save hashes
-r --resize - optional parameter to change the size of the structure for saving hashes
-d --debug - turns on debugging mode (generates logs)
-p --dropped - generates a *.dropped file for each input vertical, containing the removed paragraphs
-c --droppeddoc - generates a *.dd file for each input vertical, containing the list of processed documents
-n --near - switches to the nearDedup algorithm
-j --journal - takes processed files and journals into consideration (unsaved hashes and unfinished files) - attempts to recover after a crash and continue the deduplication
-m --map - loads the distribution map from the KNOT servers and uses it
What to do when a crash occurs?
Examine what caused the crash, check the dedup.eXXXXX and dedup.oXXXXX files and run the process again.
When launching via v5/start.sh, which calls v5/dedup.sh after the environment parameters are set, the -j parameter is pre-set to ensure that the deduplication continues regardless of whether it ended successfully. Keep in mind that -j causes already processed verticals to be skipped; to start the process over from scratch, delete the contents of the output folder with deduplicated verticals. Recovery mode can be turned off in the v5/dedup.sh file.
How does recovery mode work?
Launching on KNOT servers:
mpiexec dedup -h H -w W -i ~/vert/ -o ~/dedup/ -s ~/hash/
More information can be found here.
Hashes distributed according to the hash map on the KNOT servers can be imported, used on Salomon and exported back. Use the scripts in processing_steps/salomon/migration.
Hash import:
python3 importhashes.py [-h] [-s SERVERS] [-m MAP] FILE TARGET
FILE path to the hash map on the KNOT servers, for example: /tmp/dedup/server.hashes
TARGET target folder on Salomon, for example: /scratch/work/user/$(whoami)/deduphashes
-h --help prints help
-s SERVERS path to a file containing the list of servers for import (1 hostname per line), for example: athena5
-m MAP path to the distribution map; an alternative to -s, the list of servers for import is extracted from the distribution map
Hash export:
python3 exporthashes.py [-h] SOURCE TARGET
SOURCE source folder, for example: /scratch/work/user/$(whoami)/deduphashes
TARGET target folder on the KNOT servers, for example: tmp/dedup/dedup.hash
Use of the imported hashes on Salomon
Set the MAP_PATH parameter when launching the process, for example in the file v5/start.sh:
MAP_PATH="/scratch/work/user/${LOGIN}/hashmap.conf"
Deduplication changes the number of hashholders based on the number of servers in the distribution map. If the distribution map contains 46 KNOT servers, then at least 47 cores are required (46 hashholders and 1 worker), optimally at least double that. The parameters NUM_HASHES and NUM_WORKERS are ignored in this case.
Tagging is executed via the program TT-slave (dependent on /opt/TreeTagger), which can be found in:
./processing_steps/4/TT-slave
Compile using the following command:
mvn clean compile assembly:single
Usage:
java -jar package [opts] input_file output_dir [treetagger.home]
package - binary file of the program, e.g. ./processing_steps/4/TT-slave/target/TT-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar
opts - switches:
-d - turns on listing of debugging information
-o - turns on deletion of the URI from links
input_file - input file in vertical format (extension .vert)
output_dir - path to the output directory without a trailing slash
treetagger.home - path to the TreeTagger installation directory, default is /opt/TreeTagger

Script for parallel execution on multiple servers:
./processing_steps/4/tag.py
Usage:
./tag.py -i INPUT_DIRECTORY -o OUTPUT_DIRECTORY [-s SERVERS_FILE] [-e ERRORS_LOG_FILE] [-b BINARY] [-t THREADS] [-d] [-u]
-i --input input directory with files for tagging
-o --output output folder; if it does not exist, the script attempts to create it
-s --servers file containing the list of servers in the format HOSTNAME '\t' THREADS '\n' (one line per machine)
-e --errors if set, errors are logged to this file with the current date and time
-b --binary path to the tagging .jar program (default is ./processing_steps/4/TT-slave/target/TT-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar)
-t --threads sets the number of threads (default is 6; thread counts given in the server list file have higher priority)
-d --debug debugging listing
-u --uri turns on deletion of the URI from links
Example:
./tag.py -i /mnt/data/commoncrawl/CC-2015-18/dedup/ -o /mnt/data/commoncrawl/CC-2015-18/dedup/tagged/ -s ~/servers
# tags files from the folder /mnt/data/commoncrawl/CC-2015-18/dedup/ and saves them to
# the directory /mnt/data/commoncrawl/CC-2015-18/dedup/tagged/ on the machines listed in the file ~/servers
Parsing is done by a modified MDParser, which can be found in:
./processing_steps/5/MDP-package/MDP-1.0
Compile by command:
ant make-mdp
Usage:
java -jar package [opts] input_file output_dir [path_to_props]
package - binary file of the program, e.g. ./processing_steps/5/MDP-package/MDP-1.0/build/jar/mdp.jar
opts - switches:
-o - turns on deletion of unnecessary document URIs from links
input_file - tagged input file (extension .tagged)
output_dir - path to the output directory without a trailing slash
path_to_props - path to the .xml file with program parameters (if omitted, the parameters from the file ./processing_steps/5/MDP-package/MDP-1.0/resources/props/props.xml are used; on the knot and athena servers the appropriate file is ./processing_steps/5/MDP-package/MDP-1.0/resources/props/propsKNOT.xml)

Important files, if something needs to be changed:
./processing_steps/5/MDP-package/MDP-1.0/src/de/dfki/lt/mdparser/test/MDParser.java ./processing_steps/5/MDP-package/MDP-1.0/src/de/dfki/lt/mdparser/outputformat/ConllOutput.java
Script for parallel execution on multiple servers:
./processing_steps/5/parse.py
Usage:
./parse.py -i INPUT_DIRECTORY -o OUTPUT_DIRECTORY [-s SERVERS_FILE] [-e ERRORS_LOG_FILE] [-b BINARY] [-t THREADS] [-u] [-p XML_FILE]
-i --input input directory with files for parsing
-o --output output directory; if it does not exist, the script attempts to create it
-s --servers file with the list of servers in the format HOSTNAME '\t' THREADS '\n' (one line per machine)
-e --errors if set, errors are logged to this file with the current date and time
-b --binary path to the parsing .jar program (default is ./processing_steps/5/MDP-package/MDP-1.0/build/jar/mdp.jar)
-t --threads sets the number of threads (default is 6; thread counts given in the server list file have higher priority)
-u --uri turns on deletion of the URI from links
-p --props path to the .xml file with program parameters (default is ./processing_steps/5/MDP-package/MDP-1.0/resources/props/propsKNOT.xml)
Examples:
./parse.py -i /mnt/data/commoncrawl/CC-2015-18/tagged/ -o /mnt/data/commoncrawl/CC-2015-18/parsed/ -s ~/servers
# parses files from the folder /mnt/data/commoncrawl/CC-2015-18/tagged/ into
# the folder /mnt/data/commoncrawl/CC-2015-18/parsed/ on the servers listed in the file ~/servers
./parse.py -i /mnt/data/commoncrawl/CC-2015-18/tagged/ -o /mnt/data/commoncrawl/CC-2015-18/parsed/ -s ~/servers \
-p ./processing_steps/5/MDP-package/MDP-1.0/resources/props/propsKNOT.xml
# parses files from the folder /mnt/data/commoncrawl/CC-2015-18/tagged/ into
# the folder /mnt/data/commoncrawl/CC-2015-18/parsed/ on the machines listed in the file ~/servers
# and hands the config file propsKNOT.xml to MDParser (it must be available on all machines it runs on)
SEC (see SEC) is used for recognition of named entities. The client sec.py can be run in parallel on multiple servers using:
./processing_steps/6/ner.py
Usage:
./ner.py -i INPUT_DIRECTORY -o OUTPUT_DIRECTORY [-s SERVERS_FILE] [-e ERRORS_LOG_FILE] [-b BINARY]
-i --input input directory
-o --output output directory
-s --servers file with the list of servers, one hostname per line; if a line contains a tabulator, it is treated as a separator and the text before it is used as the hostname (this keeps the file compatible with scripts in which the number of threads can be specified per machine, using the format HOSTNAME \t THREADS on each line)
-e --errors if set, errors are logged to this file with the current date and time
-b --binary changes the path to the SEC client (default is /var/secapi/SEC_API/sec.py)
Example:
./ner.py -i /mnt/data/commoncrawl/CC-2015-18/parsed/ -o /mnt/data/commoncrawl/CC-2015-18/secresult/ -s ~/servers
# processes files from the folder /mnt/data/commoncrawl/CC-2015-18/parsed/ into
# the folder /mnt/data/commoncrawl/CC-2015-18/secresult/ on the machines listed in the file ~/servers
Example SEC configuration mapping an input vertical to an output vertical, annotating the vertical in the MG4J format:
{ "annotate_vertical": { "annotation_format": "mg4j", "vert_in_cols": [ "position", "token", "postag", "lemma", "parabspos", "function", "partoken", "parpostag", "parlemma", "parrelpos", "link", "length" ], "vert_out_cols": [ "position", "token", "postag", "lemma", "parabspos", "function", "partoken", "parpostag", "parlemma", "parrelpos", "link", "length" ] }
Introduction to MG4J: http://www.dis.uniroma1.it/~fazzone/mg4j-intro.pdf
Source files of a program designed for semantic indexing are located in directory:
./processing_steps/7/corpproc
Compile using this command:
mvn package
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
mkdir /mnt/data/indexes/CC-2015-18/collPart001
(an alternative configuration file can be passed with -c <CONFIG_FILE>):
java -jar processing_steps/7/corpproc/target/corpproc-1.0-SNAPSHOT-jar-with-dependencies.jar -i
java -jar processing_steps/7/corpproc/target/corpproc-1.0-SNAPSHOT-jar-with-dependencies.jar index /mnt/data/indexes/CC-2015-18/collPart001 /mnt/data/indexes/CC-2015-18/final
Script for parallel execution on multiple servers
The script starts the indexation: it creates 6 shards, fills them, creates a collection from them and then starts indexation on the collection. To do this, the argument start has to be specified. If it is not set, the status of the individual screens and programs is printed. The argument stop terminates the screens.
./processing_steps/7/index.py
Use of the script:
./index.py [start|stop] -i INPUT_DIRECTORY -o OUTPUT_DIRECTORY [-s SERVERS_FILE] [-e ERRORS_LOG_FILE] [-b BINARY]
-i --input input directory
-o --output output directory
-s --servers file containing the list of servers, one hostname per line; if a line contains a tabulator, it is treated as a separator and the text before it is used as the hostname (this keeps the file compatible with scripts in which the number of threads can be specified per machine, using the format HOSTNAME \t THREADS on each line)
-e --errors if set, errors are logged to this file with the current date and time
-b --binary path to the directory containing the .jar files (default is /mnt/data/wikipedia/scripts/mg4j/)
Examples:
./index.py -s ~/servers
# displays the status of the startup screens and the program on all servers from the file ~/servers
./index.py -i /mnt/data/commoncrawl/CC-2015-18/secresult/ -o /mnt/data/indexes/CC-2015-18/ -s ~/servers start
# runs indexation on the selected servers
./index.py -s ~/servers stop
# ends the screens, and thereby the processes, on all servers from the file ~/servers
The columns can be found here:
https://docs.google.com/spreadsheets/d/1S4sJ00akQqFTEKyGaVaC3XsCYDHh1xhaLtk58Di68Kk/edit#gid=0
Note: it is not clear whether the special XML characters in the columns should be escaped. The current implementation of SEC does not escape them (and any existing escaping is removed). tagMG4JMultiproc.py does not escape them either; on the other hand, it leaves escaping in the input (only partially - probably just in some columns).
If columns are changed, the following things have to be changed:
The daemon requires a .collection file in the folder with the index; the new indexer creates it automatically. It is, however, possible to use indexes created by the old indexer: the older .collection file is automatically converted to the new format. To move indexes or data files to a new location, simply change the path to the data files in the .collection file (JSON).
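Because the .collection file is JSON, relocating the data files can be scripted; a hedged sketch (the key name "files" is only a guess at the .collection structure):

import json

def retarget_collection(collection_file, old_prefix, new_prefix):
    # Rewrite the data-file paths stored in a .collection file after the
    # data has been moved; the "files" key is hypothetical.
    with open(collection_file) as f:
        coll = json.load(f)
    coll["files"] = [p.replace(old_prefix, new_prefix, 1) for p in coll["files"]]
    with open(collection_file, "w") as f:
        json.dump(coll, f, indent=2)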
Run using these commands:
cd /mnt/data/indexes/CC-2015-18/final
java -jar processing_steps/7/mg4j/corpproc-1.0-SNAPSHOT-jar-with-dependencies.jar serve /mnt/data/indexes/CC-2015-18/final
Script for parallel execution on multiple servers
Launches daemons replying to requests. They are launched in screens and the parameter start must be set, otherwise the script only prints the status of the screens and daemons. The parameter stop terminates the screens:
./processing_steps/7/daemon.py
Usage:
./daemon.py [start|stop|restart] -i INPUT_DIRECTORY [-s SERVERS_FILE] [-e ERRORS_LOG_FILE] [-b BINARY]
-i --input input directory (automatically adds /final/ to the end of the path)
-p --port port (default is 12000)
-s --servers file containing the list of servers, one hostname per line; if a line contains a tabulator, it is treated as a separator and the text before it is used as the hostname (this keeps the file compatible with scripts in which the number of threads can be specified per machine, using the format HOSTNAME \t THREADS on each line)
-e --errors if set, errors are logged to this file with the current date and time
-b --binary path to the directory containing the .jar files (default is /mnt/data/wikipedia/scripts/mg4j/)
Examples:
./daemon.py -i /mnt/data/indexes/CC-2015-18/final -s ~/servers start
# Runs screens, and in them daemons over the collection in the directory /mnt/data/indexes/CC-2015-18/final,
# on the machines listed in the file ~/servers
./daemon.py -s ~/servers
# displays the status of running screens and daemons on all machines listed in the file ~/servers
./daemon.py -s ~/servers stop
# terminates the screens on all machines listed in the file ~/servers
The source code of the program is in the directory:
./processing_steps/8a/mg4jquery
Compilation:
JAVA_HOME=/usr/lib/jvm/java-8-oracle/; mvn clean compile assembly:single
Launch examples:
JAVA_HOME=/usr/lib/jvm/java-8-oracle/; java -jar mg4jquery-0.0.1-SNAPSHOT-jar-with-dependencies.jar -h ../../servers.txt -m ../src/main/java/mg4jquery/mapping.xml -s ../src/main/java/mg4jquery/config.xml -q "\"was killed\""
JAVA_HOME=/usr/lib/jvm/java-8-oracle/; java -jar mg4jquery-0.0.1-SNAPSHOT-jar-with-dependencies.jar -h ../../servers.txt -m ../src/main/java/mg4jquery/mapping.xml -s ../src/main/java/mg4jquery/config.xml -q "1:nertag:person < 2:nertag:person" -c "1.nerid != 2.nerid"
For example, the second query returns documents that contain at least two different people. The servers file expects the address of a server with a port on each line, for example:
knot01.fit.vutbr.cz:12000
Preparation script:
./prepare_webGUI.py
Usage:
python ./prepare_webGUI.py -po PORT -pa PATH -n NAME [-a] [-e ERRORS_LOG_FILE]
-po --port - port where the database will be launched
-pa --path - path to the directory where the database will be created
-n --name - name of the database
-a --artifacts - installs the required artifacts
-e --errors - error log file; logs are saved with time and date, if not specified the logs are printed to STDERR

Example:
python ./prepare_webGUI.py -po 9094 -pa ./src/main/webapp/WEB-INF/ -n users -a # installs required artifacts, compiles source files and sets a database "mem:users" up in directory ./src/main/webapp/WEB-INF/ on port 9094
Running on port 8086:
mvn jetty:run -Djetty.port=8086
Termination:
./stop_webGUI.py
Usage:
python ./stop_webGUI.py #necessary to launch every time you terminate the web GUI
The subdirectory maven_deps/src contains the sources of custom GWT components for displaying dynamic tooltips. For successful retrieval of query results from the server, it is necessary to set the address options of the servers (in the usual format domain_name_of_server:port). You can set the number of results per page, the behaviour of info windows (dynamic - shown while hovering over an entity with the cursor; static - it stays visible until the user moves the cursor over another entity or otherwise changes the state of the application) and the display type (the default is corpus based, but you can switch to document based).
It is similar to MG4J; the semantic index automatically remaps queries to the same index, so there is no need to write, for example:
"(nertag:person{{nertag-> token}}) killed"
This request is enough:
"nertag:person killed"
Compared to MG4J there is an extension, "global constraints", which allows tagging a token and post-filtering on it.
1:nertag:person < 2:nertag:person 1.fof != 2.fof AND 1.nerid = 2.nerid
For example, this returns documents containing the same person in different forms (often a name and a coreference). When querying occurrences within the same sentence, you can use the difference operator.
nertag:person < nertag:person - _SENT_ nertag:person < nertag:person - _PAR_
These queries search for two persons within one sentence (or paragraph, respectively); the difference operator keeps only the parts of the text where the given token (here the sentence or paragraph boundary) does not occur.
There are a lot of indexes which can be queried:
position token tag lemma parpos function parword parlemma paroffset link length docuri lower nerid nertag person.name person.gender person.birthplace person.birthdate person.deathplace person.deathdate person.profession person.nationality artist.name artist.gender artist.birthplace artist.birthdate artist.deathplace artist.deathdate artist.role artist.nationality location.name location.country artwork.name artwork.form artwork.datebegun artwork.datecompleted artwork.movement artwork.genre artwork.author event.name event.startdate event.enddate event.location museum.name museum.type museum.estabilished museum.director museum.location family.name family.role family.nationality family.members group.name group.role group.nationality nationality.name nationality.country date.year date.month date.day interval.fromyear interval.frommonth interval.fromday interval.toyear interval.tomonth interval.today form.name medium.name mythology.name movement.name genre.name nertype nerlength
Warning: every query starting with a number or containing the character "*" must be surrounded by parentheses, otherwise it causes an error:
nertag:event ^ event.startdate: (19*)
Queries on attributes of a named entity should be combined with a query on the index nertag (the indexes overlap, so this saves space). In "global constraints" you can additionally use the index "fof", which is an abbreviation of "full occurrence from".
The script redistributes data (files) among the given servers in order to use the capacity of the disk arrays evenly.
Script location:
./processing_steps/1/redistribute.py
Usage:
./redistribute.py -i INPUT_DIRECTORY [-o OUTPUT_DIRECTORY [-d DISTRIB_DIRECTORY]] -s SERVER_LIST [-p RELATED_PATHS] [-x EXTENSION] [-r] [-m] [-e ERRORS_LOG_FILE]
-i --input - input directory for distribution
-o --output - output directory for the generated scripts
-d --distrib - output directory on all servers; the generated scripts are distributed to those servers
-s --servers - file containing the list of servers, one hostname per line
-p --paths - file with paths and extensions of related files in the format PATH \t EXTENSION
-x --extension - extension in INPUT_DIRECTORY (no extension by default)
-m --moves - print the movements on stdout
-e --errors - if set, errors are logged to this file with the current date and time

The final redistribution scripts have to be launched via parallel ssh and removed afterwards.
Example:
python /mnt/minerva1/nlp/projects/corpproc/corpora_processing_sw/processing_steps/1/redistribute.py -i /mnt/data/commoncrawl/CC-2015-14/warc -s /mnt/minerva1/nlp/projects/corpproc/corpora_processing_sw/processing_steps/servers.txt -m -o /home/idytrych/redistributionScripts -d /mnt/data/commoncrawl/software/redistributionScripts -x "-warc.gz" -p /home/idytrych/CC-2015-14-rel.txt >moves.txt
where CC-2015-14-rel.txt contains:
/mnt/data/commoncrawl/CC-2015-14/uri -warc.domain /mnt/data/commoncrawl/CC-2015-14/uri -warc.domain.srt /mnt/data/commoncrawl/CC-2015-14/uri -warc.netloc ...
Afterwards the scripts are launched in screen:
parallel-ssh -h servery_b9_idytrych.txt -t 0 -A -i "bash /mnt/data/commoncrawl/software/redistributionScripts/\$HOSTNAME.sh"
and removed:
parallel-ssh -h servery_b9_idytrych.txt -t 0 -A -i "rm /mnt/data/commoncrawl/software/redistributionScripts/\$HOSTNAME.sh"
Servers to which nothing will be moved have no script; an error is printed for them.
A running search engine is located on the server athena1; try it as an example.
Indexes are running on almost all servers (for this search engine). For Wikipedia you can restart the daemons like this (only the person who launched the daemons can actually restart them):
python daemon.py restart -i /mnt/data/indexes/wikipedia/enwiki-20150901/ -s /mnt/minerva1/nlp/projects/corpproc/serverLists/serversIW.txt -b /mnt/data/wikipedia/software/mg4j
For CC (in the order of incremental deduplication):
python daemon.py restart -i /mnt/data/indexes/CC-2015-32/ -s /mnt/minerva1/nlp/projects/corpproc/serverLists/serversIC32.txt -b /mnt/data/commoncrawl/software/mg4j -p 12001
python daemon.py restart -i /mnt/data/indexes/CC-2015-35/ -s /mnt/minerva1/nlp/projects/corpproc/serverLists/serversIC35.txt -b /mnt/data/commoncrawl/software/mg4j -p 12002
python daemon.py restart -i /mnt/data/indexes/CC-2015-40/ -s /mnt/minerva1/nlp/projects/corpproc/serverLists/serversIC40.txt -b /mnt/data/commoncrawl/software/mg4j -p 12003
python daemon.py restart -i /mnt/data/indexes/CC-2015-27/ -s /mnt/minerva1/nlp/projects/corpproc/serverLists/serversIC27.txt -b /mnt/data/commoncrawl/software/mg4j -p 12004
Indexes created on Salomon require symlinks - they contain paths such as /scratch/work/user/idytrych/CC-2015-32/mg4j_index.
The script repacks an input warc.gz file according to the respective vertical. Only records that are present in the input vertical are extracted from the original warc.gz file, and the result is saved to a warc.xz file. The script can be executed for a single warc.gz file and a specific vertical, or for a folder containing input warc.gz files and a folder containing verticals. No other combinations are allowed.
Script location:
./processing_steps/1/warc_from_vert/warc_from_vert.py
Launching:
-i --input input warc.gz file or folder
-v --vertical input vertical or folder
-o --output output folder where the results are saved
The script packs preprocessed Wikipedia into a warc.gz or warc.xz file. The input can be a single file or a folder containing the files to be processed.
Script location:
./processing_steps/1/pack_wikipedia/pack_wiki.py
Launching:
-i --input input file or folder
-o --output output folder
-d --date date of the wiki download, format: DD-MM-YYYY
-f --format output file format, gz or xz
The script network_monitor.py allows monitoring of the current CESNET link load. It can be launched as a standalone script or used as a Python library.
When used as a library, it offers the NetMonitor class, which can be given the name of the network to monitor. An instance of this class provides the get_status function, which returns a result containing the incoming and outgoing bits per second and the load in percent (in JSON).
When launched as a standalone script in a terminal, it prints the incoming/outgoing load in percent, or alternatively the worse of the two, according to the specified parameters.
The values are refreshed every minute; a more frequent refresh rate is pointless.
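Library use might look roughly like this (the constructor argument and the exact shape of the returned result are assumptions based on the description above):

from network_monitor import NetMonitor

monitor = NetMonitor("Telia")   # name of the link to monitor (assumed signature)
status = monitor.get_status()   # JSON with incoming/outgoing bits per second and load in %
print(status)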
Script location:
./processing_steps/1/network_monitor.py
Launching:
-i --datain displays the incoming load in percent
-o --dataout displays the outgoing load in percent
-w --worst displays the worse of the two loads in percent
-l --link link to be monitored; if not specified, the Telia link is monitored
Used for automated downloading of large amounts of CC data while keeping the network load in a certain range. The script network_monitor.py mentioned above is used to monitor the status of the network. The script is given a lower network load limit in percent and keeps the load between the specified value and a value 10 % higher. The input file is warcs.lst, or a part of it when downloading on multiple servers. The maximum number of downloading processes, in the range 1 to 15, can be specified via a parameter (see below). It is advised to use the script on fewer servers with more processes.
It was tested on Salomon when downloading CC-2016-44: the script ran on 4 servers with a maximum of 12 processes and the lower network load limit set to 55 %, keeping the network load between 55-65 %.
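The load-keeping behaviour can be thought of as a simple feedback loop around the monitor: add a download process while the load is below the lower limit, remove one when it rises more than about 10 % above it. A simplified sketch (illustrative only; the worker-pool methods and the status key used here are hypothetical):

import time

def keep_load_in_range(monitor, low_limit, pool, max_procs=12):
    # pool is assumed to expose active(), pending(), spawn() and stop_one().
    while pool.pending() or pool.active():
        load = monitor.get_status()["worst"]      # current load in %, key name assumed
        if load < low_limit and pool.active() < max_procs:
            pool.spawn()                          # load too low: add a downloader
        elif load > low_limit + 10 and pool.active() > 1:
            pool.stop_one()                       # load too high: remove a downloader
        time.sleep(60)                            # the values refresh once a minute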
Script location:
./processing_steps/1/cc_auto_download.py
Launching:
-i --input_file path to the input file warcs.lst or a part of it
-o --output_dir path to the output folder
-p --process number of downloading processes (1-15)
-l --limit lower network load limit in percent (10-80)
A program that extracts Wikipedia pages from a file in the ZIM format. The output is in the preprocessed Wikipedia format; it is saved to the output folder, split into multiple files of approx. 100 MB.
Program location:
/mnt/minerva1/nlp/corpora_datasets/monolingual/english/wikipedia/zim_data/zimlib-1.2
Compile using the make command; compilation creates the executable program zimdump, which can be found here:
/mnt/minerva1/nlp/corpora_datasets/monolingual/english/wikipedia/zim_data/zimlib-1.2/src/tools
Launch example:
./zimdump -a ~/output -J cs wikipedia_en_2016_05.zim
Parameters:
-a - output folder
-j - optional, specifies the language of the input file, used to generate correct URIs (default is en)

For the 4th version of the scripts:
Complete sequence for launching (not tested yet):
(the dump is labeled RRRRMMDD)
cd 1a
./download_wikipedia_and_extract_html.sh RRRRMMDD
cd ..
python ./1/distribute.py -i ./2/NLP-slave/target/NLP-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar -o /mnt/data/wikipedia/software/ -s servers.txt -a
python ./2/verticalize.py -i /mnt/data/wikipedia/enwiki-RRRRMMDD/html_from_xml/AA/ -o /mnt/data/wikipedia/enwiki-RRRRMMDD/vert/ -s servers.txt -b /mnt/data/wikipedia/software/NLP-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar
python ./1/distribute.py -i ./4/TT-slave/target/TT-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar -o /mnt/data/wikipedia/software/ -s servers.txt -a
python ./4/tag.py -i /mnt/data/wikipedia/enwiki-RRRRMMDD/vert/ -o /mnt/data/wikipedia/enwiki-RRRRMMDD/tagged/ -s servers.txt -b /mnt/data/wikipedia/software/TT-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar
python ./1/distribute.py -i ./5/MDP-package/MDP-1.0/build/jar/mdp.jar -o /mnt/data/wikipedia/software/ -s servers.txt -a
python ./5/parse.py -i /mnt/data/wikipedia/enwiki-RRRRMMDD/tagged/ -o /mnt/data/wikipedia/enwiki-RRRRMMDD/parsed/ -s servers.txt -b /mnt/data/wikipedia/software/mdp.jar
python ./6/ner.py -i /mnt/data/wikipedia/enwiki-RRRRMMDD/parsed/ -o /mnt/data/wikipedia/enwiki-RRRRMMDD/secresult/ -s servers.txt
python ./1/distribute.py -i ./7/mg4j/ -o /mnt/data/wikipedia/software/mg4j/ -s servers.txt -a
python ./7/index.py -i /mnt/data/wikipedia/enwiki-RRRRMMDD/secresult/ -o /mnt/data/indexes/enwiki-RRRRMMDD/ -s servers.txt start
Launch of daemons corresponding to queries:
python ./daemon.py -i /mnt/data/indexes/enwiki-RRRRMMDD/final -s servers.txt -b /mnt/data/wikipedia/software/mg4j/ start
Complete sequence for launching (not tested yet):
(the CC crawl is labeled RRRR-MM)
cd processing_steps/1b/download_commoncrawl
./dl_warc.sh RRRR-MM
cd ../..
python ./1/distribute.py -i ./2/NLP-slave/target/NLP-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar -o /mnt/data/commoncrawl/software/ -s servers.txt -a
python ./2/verticalize.py -i /mnt/data/commoncrawl/CC-RRRR-MM/warc/ -o /mnt/data/commoncrawl/CC-RRRR-MM/vert/ -s servers.txt -b /mnt/data/commoncrawl/software/NLP-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar
python ./1/distribute.py -i ./3/dedup/server -o /mnt/data/commoncrawl/software/dedup/ -s servers.txt -a
python ./1/distribute.py -i ./3/dedup/dedup -o /mnt/data/commoncrawl/software/dedup/ -s servers.txt -a
cd 3
parallel-ssh -h servers_only.txt -t 0 -i "mkdir /mnt/data/commoncrawl/CC-RRRR-MM/hashes/"
(to load hashes from a previous processing, use the parameter -i /mnt/data/commoncrawl/CC-RRRR-MM/hashes/)
python ./server.py start -s servers.txt -w workers.txt -o /mnt/data/commoncrawl/CC-RRRR-MM/hashes/ -b /mnt/data/commoncrawl/software/dedup/server
python ./deduplicate.py -i /mnt/data/commoncrawl/CC-RRRR-MM/vert/ -o /mnt/data/commoncrawl/CC-RRRR-MM/dedup/ -w workers.txt -s servers.txt -b /mnt/data/commoncrawl/software/dedup/dedup
python ./server.py stop -s servers.txt -w workers.txt -b /mnt/data/commoncrawl/software/dedup/server
cd ..
python ./1/distribute.py -i ./4/TT-slave/target/TT-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar -o /mnt/data/commoncrawl/software/ -s servers.txt -a
python ./4/tag.py -i /mnt/data/commoncrawl/CC-RRRR-MM/dedup/ -o /mnt/data/commoncrawl/CC-RRRR-MM/dedup/tagged/ -s servers.txt -b /mnt/data/commoncrawl/software/TT-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar
python ./1/distribute.py -i ./5/MDP-package/MDP-1.0/build/jar/mdp.jar -o /mnt/data/commoncrawl/software/ -s servers.txt -a
python ./5/parse.py -i /mnt/data/commoncrawl/CC-RRRR-MM/tagged/ -o /mnt/data/commoncrawl/CC-RRRR-MM/parsed/ -s servers.txt -b /mnt/data/commoncrawl/software/mdp.jar
python ./6/ner.py -i /mnt/data/commoncrawl/CC-RRRR-MM/parsed/ -o /mnt/data/commoncrawl/CC-RRRR-MM/secresult/ -s servers.txt
python ./1/distribute.py -i ./7/mg4j/ -o /mnt/data/commoncrawl/software/mg4j/ -s servers.txt -a
python ./7/index.py -i /mnt/data/commoncrawl/CC-RRRR-MM/secresult/ -o /mnt/data/indexes/CC-RRRR-MM/ -s servers.txt start
Launch of daemons corresponding to queries:
python ./daemon.py -i /mnt/data/indexes/CC-RRRR-MM/final -s servers.txt -b /mnt/data/commoncrawl/software/mg4j/ start
Run_local
Scripts used to launch processing on the KNOT servers in a uniform way. Processing is launched using the run.py script, which launches the required step according to the configuration file config.ini; this file has to be edited before processing. Set the path to the folder to be processed in the root_data parameter in the [shared] section. The other variables point to folders with scripts etc. and do not need to be changed. It is advised to copy this file, for example to your home directory, so the original remains unchanged.
On every server where the processing happens, a screen named USERNAME-proc-STEP_NUMBER is created. All screens can be terminated by running the script with the parameter -a kill.
Script location:
./processing_steps/run_local/run.py
Launching:
-c --config required parameter, specifies the path to the config file
-p --proc required parameter, specifies the processing step; options: vert, tag, pars, sec, index, index_daemon, shards or their numerical values (2, 4, 5, 6, 7, 8, 9); the value test can be used to test steps and has no numerical equivalent; when using it, the step must also be given in the -e parameter
-a --action optional, specifies the action to be executed; options: start, kill, check, progress, eval (default is start)
-s --servers optional, specifies the path to a file containing the list of servers used for processing; if not set, the processing is executed only locally
-t --threads optional, specifies the number of threads for processing
-e --examine required if the value of parameter -p is set to test
-l --logging optional, turns on logging of test results to files in the log folder (test_* and main_* files)
-h --help prints help
Settings are in the [vert] section of the config file. The variable exe_path specifies the path to the verticalizer script. The variables input_dir, output_dir and log_path specify the paths to the input, output and log folders. The variable no_log turns logging on and off. The variable Stoplist_path specifies the path to the file containing the list of stop words.
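An illustrative fragment of such a config.ini (the variable names follow the description above; the paths are placeholders):

[shared]
root_data = /mnt/data/commoncrawl/CC-2015-18

[vert]
exe_path = ./processing_steps/2/vertikalizator/main.py
input_dir = /mnt/data/commoncrawl/CC-2015-18/warc/
output_dir = /mnt/data/commoncrawl/CC-2015-18/vert/
log_path = /mnt/data/commoncrawl/CC-2015-18/log/vert/
no_log = False
Stoplist_path = ./processing_steps/2/vertikalizator/stoplists/English.txt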
Launch example:
./processing_steps/run_local/run.py -c ~/config.ini -s ~/servers.txt -p vert
Servers are launched sequentially, as well as workers, on the port specified in the config file. The variables in the [dedup] section need to be set. The folder containing run.py must also contain the script dedup_handler.py, which is responsible for launching the servers and workers.
Variables:
exe_path path to the folder containing the files server.py and deduplicate.py
bin_path path to the folder containing the executable files
input_path path to the folder containing the verticals
output_path path to the output folder
map_file path to the hashmap.conf file
log_path path to the folder containing the logs
progress_tester path to the file dedup_check.py
hash_path path to the folder with hashes
port port where the servers and workers run
dropped True or False, corresponds to -dr --dropped
droppeddoc True or False, corresponds to -dd --droppeddoc
debug True or False, corresponds to -d --debug
neardedup True or False, corresponds to -n --near
Launch example:
python3 ./processing_steps/run_local/run.py -c ~/config.ini -s ~/servers.txt -p dedup
Deduplication process verification:
python3 ./processing_steps/run_local/run.py -c ~/config.ini -s ~/servers.txt -p dedup -a progress
Settings are in the [tag] section of the config file. The variables exe_path, input_path, output_path and log_path have the same meaning as in verticalization. Additional variables are remove_uri, which turns deletion of URIs from links on and off, and ttagger_path, which specifies the path to the TreeTagger installation directory.
Launch example:
./processing_steps/run_local/run.py -c ~/config.ini -s ~/servers.txt -p tag
Settings are in the [pars] section of the config file. The variables exe_path, input_path, output_path and log_path have the same meaning as in verticalization and tagging. The variable config_path specifies the path to the config file for parsing.
Launch example:
./processing_steps/run_local/run.py -c ~/config.ini -s ~/servers.txt -p pars
Settings are in the [sec] section of the config file. It contains the same variables as verticalization, tagging and parsing. One new variable, config_path, specifies the path to the file containing the SEC queries in JSON.
Launch example:
./processing_steps/run_local/run.py -c ~/config.ini -s ~/servers.txt -p sec
Script to prepare for indexing. The required number of collections is created in the output folder of each server, and the files in MG4J format (SEC output) are distributed equally among these collections. The [shards] section of the config file contains variables for the input and output folders as well as the number of required collections (based on the number of CPU cores). An output file is not required.
Launch example:
./processing_steps/run_local/run.py -c ~/config.ini -s ~/servers.txt -p shards
Runs the collection indexing. The input folder contains all collections; the output is saved to a single folder (final). The variables can be changed in the [index] section of the config file.
Launch example:
./processing_steps/run_local/run.py -c ~/config.ini -s ~/servers.txt -p index
Settings are in the [index_daemon] section of the config file. The variable exe_path specifies the path to the indexer, input_path specifies the path to the final folder on which the daemon runs, and log_path specifies the path to the log folder. If log_path is set to "/", no logs are saved. The variable port_number specifies the port number on which the daemon runs, and config_path specifies the path to the config file for the indexer.
Launch example:
./processing_steps/run_local/run.py -c ~/config.ini -s ~/servers.txt -p index_daemon
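For orientation, a hedged sketch of the [index_daemon] section; all values are hypothetical examples (remember that setting log_path to "/" disables logging):
[index_daemon]
; hypothetical example values
exe_path = /mnt/data/project/software/indexer
input_path = /mnt/data/project/data/final
log_path = /mnt/data/project/logs/index_daemon
port_number = 9090
config_path = /mnt/data/project/software/indexer.conf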
Scripts that provide options for checking the output files and their content after each step of processing. Links from the input, output and log files are loaded and their occurrences are compared across the files. If a mismatch is detected, an error is printed to STDOUT or to a file. The newest version can be found on the "test_scripts_xgrigo02" branch in the folder processing_steps/step_check/.
Usage:
python3 step_check.py -i IN_PATH -o OUT_PATH -l LOGS_PATH -t STEP -s [-c PATH_TO_CONFIG]
Parameters:
-c PATH, --config PATH Path to the config file. If a config path is specified, the arguments -i, -o and -l are ignored and their values are obtained from the config.
-t STEP, --target STEP Step for the script to test. Possible STEP values: '2' or 'vert', '3' or 'dedup', '4' or 'tag', '5' or 'pars', '6' or 'sec', '7' or 'index'. The default value is 'vert'.
-i IN_PATH, --input_path Path with input files
-o OUT_PATH, --output_path Path with output files
-l LOG_PATH, --log_path Path with log files to check. The test output is also saved there if the -s option is specified.
-s, --save_out Enable writing the output to a log file
-h, --help Prints this message
--inline Enables inline output
You can either launch it with the parameters -i, -o and -l, or set the path to the config file, in which case all the required settings are read from there and the arguments -i, -o and -l are ignored.
When the parameter -s is specified, the output of the scripts is saved to files with the suffixes .test_out or .main_out. When the testing of a pair of folders has finished, another file with the suffix .tested is created. This indicates that the scripts finished without an error.
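For example, assuming the .tested marker files are created in the log folder, the folder pairs that already finished testing can be listed like this (the path is illustrative):
find ./logfiles/verticalization -name '*.tested'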
To run it using the run_local script, you need to set exe_path in the [test] section of the config file. This file can be found in the corpora_processing_sw/processing_steps/run_local/ folder.
Launch example using run.py:
python3 run.py -a start -p test -e vert -l -c ~/config.ini -s ~/servers.txt
Manual launch example:
python3 step_check.py -i ./warc_path -o ./vert_path -l ./logfiles/verticalization -t vert -s
alternatively:
python3 step_check.py -c ./config.ini -t vert
ls -1 /scratch/work/user/idytrych/warc | sed 's/\-warc.gz//' > ~/namelist
or
ls -1 /scratch/work/user/idytrych/wikitexts/ > ~/namelist
seq 14 > numtasks
bash createCollectionsList_single.sh 192
In the job scripts, replace #PBS -q qexp, for example, by #PBS -q qprod together with #PBS -A IT4I-9-16, and add the following: #PBS -l walltime=48:00:00
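After this change, the header of a job script would contain, for example, the following lines:
#PBS -q qprod
#PBS -A IT4I-9-16
#PBS -l walltime=48:00:00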
bash prepare_dl_warc_single.sh 2015-32
Download CC (manually on the login nodes):
/scratch/work/user/idytrych/CC-2015-32/download/dload_login1_all.sh
/scratch/work/user/idytrych/CC-2015-32/download/dload_login2_all.sh
/scratch/work/user/idytrych/CC-2015-32/download/dload_login3_all.sh
/scratch/work/user/idytrych/CC-2015-32/download/dload_login4_all.sh
ls /scratch/work/user/idytrych/CC-2015-32/warc/ | sed 's/\-warc.*//g' > ~/namelist
mv /scratch/work/user/idytrych/CC-2015-32/warc /scratch/work/user/idytrych/
wc -l namelist
(The printed number of lines in namelist is NNN - it will be needed below.)
(Compute NU = NNN / MU, where MU is the maximal number of nodes that can be used in parallel according to Salomon's documentation.)
seq NU > numtasks
qsub -N vert -J 1-NNN:NU vert.sh
qsub dedup.sh
qsub -N tag -J 1-NNN:NU tag.sh
qsub -N parse -J 1-NNN:NU parse.sh
secapi/SEC_API/salomon/v3/start.sh 3
(M is the number of collections - approximately 6 for each destination server.)
bash createCollectionsList_single.sh M
(It is assumed that there are enough nodes to process all collections in parallel, both with 24 and with 8 processes.)
seq 24 > numtasks
qsub -N createShards -J 1-M:24 createShards.sh
qsub -N populateShards -J 1-M:24 populateShards.sh
qsub -N makeCollections -J 1-M:24 makeCollections.sh
seq 8 > numtasks
qsub -N makeIndexes -J 1-M:8 makeIndexes.sh
(NP is the number of parts into which Wikipedia should be divided - that is, the number of destination servers.)
bash createWikiParts_single.sh NP
(Not tested part:)
bash download_and_split_wikipedia_single.sh 20150805
qsub -N extract_wikipedia extract_wikipedia_html.sh
ls -1 /scratch/work/user/idytrych/wikitexts/ > ~/namelist
wc -l namelist
(The printed number of lines in namelist is NNN - it will be needed below.)
(Compute NU = MAX(14, NNN / MU), where MU is the maximal number of nodes that can be used in parallel according to Salomon's documentation - it makes no sense to use less than 14 per node.)
qsub -N vertWiki -J 1-NNN:NU vertWiki.sh
qsub -N tagWiki -J 1-NNN:NU tag.sh
qsub -N parseWiki -J 1-NNN:NU parseWiki.sh
(Compute NUL = MAX(3, NNN / MUL), where MUL is the maximal number of nodes that can be used in parallel in qlong according to Salomon's documentation - it makes no sense to use less than 3 per node.)
secapi/SEC_API/salomon/v3/start.sh NUL
(M is the number of collections - one for each destination server.)
bash createCollectionsList_single.sh M
(It is assumed that there are enough nodes to process all collections in parallel, both with 24 and with 8 processes.)
seq 24 > numtasks
qsub -N createShards -J 1-M:24 createShards.sh
qsub -N populateShards -J 1-M:24 populateShards.sh
qsub -N makeCollections -J 1-M:24 makeCollections.sh
seq 8 > numtasks
qsub -N makeIndexes -J 1-M:8 makeIndexes.sh
ls -1 /scratch/work/user/idytrych/warc | sed 's/\-warc.gz//' > ~/namelist
or
ls -1 /scratch/work/user/idytrych/wikitexts/ > ~/namelist
bash prepare_dl_warc_single.sh 2015-32
Download CC (manually on the login nodes):
/scratch/work/user/idytrych/CC-2015-32/download/dload_login1_all.sh
/scratch/work/user/idytrych/CC-2015-32/download/dload_login2_all.sh
/scratch/work/user/idytrych/CC-2015-32/download/dload_login3_all.sh
/scratch/work/user/idytrych/CC-2015-32/download/dload_login4_all.sh
ls /scratch/work/user/idytrych/CC-2015-32/warc/ | sed 's/\-warc.*//g' > ~/namelist
mv /scratch/work/user/idytrych/CC-2015-32/warc /scratch/work/user/idytrych/
bash start.sh vert 10 qprod
(Can be repeated N times to add more nodes - instead of 10 it is possible to use 20 - but overall it is not advisable to use over 70 nodes.)
bash start.sh dedup 4 qprod
(Nodes cannot be added.)
bash start.sh tag 10 qprod
(Can be repeated N times to add more nodes - instead of 10 it is possible to use 20 - but overall it is not advisable to use over 40 nodes.)
bash start.sh parse 10 qprod
(Can be repeated N times to add more nodes - instead of 10 it is possible to use 25 - but overall it is not advisable to use over 90 nodes.)
bash start.sh sec 1 qprod
(Can be repeated N times to add more nodes - instead of 1 it is possible to use 50 - but the first time wait about 10 minutes after the first node for the build, and overall it is not advisable to use over 100 nodes.)
mkdir /scratch/work/user/idytrych/CC-2015-32/mg4j_index
mkdir /scratch/work/user/idytrych/CC-2015-32/mg4j_index/final
(MMM is the number of desired collections - the number of destination servers * 6 might be appropriate.)
bash startIndexing.sh cList MMM
bash startIndexing.sh cShards qprod /scratch/work/user/idytrych/CC-2015-32/mg4j_index
bash startIndexing.sh pShards qprod /scratch/work/user/idytrych/CC-2015-32/mg4j_index
bash startIndexing.sh colls qprod /scratch/work/user/idytrych/CC-2015-32/mg4j_index
bash startIndexing.sh indexes qprod /scratch/work/user/idytrych/CC-2015-32/mg4j_index
bash start.sh vert 1:f qprod
bash start.sh vert 50 qprod
(The first node is launched separately and it must be guaranteed that it runs for the whole processing time - alternatively use qlong.)
(Further nodes can be added at any rate and in unlimited quantity - although it does not make much sense to use over 150, the discs cannot handle it.)
bash start.sh dedup 4 qprod
bash start.sh tag 1:f qprod
bash start.sh tag 50 qprod
bash start.sh parse 1:f qprod
bash start.sh parse 50 qprod
bash start.sh sec 1:f qlong
bash start.sh sec 20 qprod
bash startCheckSec.sh qprod namelist 2 1 /scratch/work/user/idytrych/parsed /scratch/work/user/idytrych/secsgeresult
bash startCheckSec.sh qprod namelist 2 2 /scratch/work/user/idytrych/parsed /scratch/work/user/idytrych/secsgeresult
("2 1" is the number of parts and which part should be launched - with multiple parts the result is obtained faster; the number of parts is passed to the next command with -n.)
python remove.py -n 2 -s /scratch/work/user/idytrych/secsgeresult > rm.sh
sed "s/secsgeresult/sec_finished/;s/.vert.dedup.parsed.tagged.mg4j//" rm.sh > rmf.sh
bash rm.sh
bash rmf.sh
rm rm.sh
rm rmf.sh
bash create_restlist.sh namelist /scratch/work/user/idytrych/secsgeresult > restlist
rm /scratch/work/user/idytrych/counter/*
bash start.sh sec 1:f qlong restlist
bash start.sh sec 20 qprod restlist
bash startCheckIndexes.sh /home/idytrych/collectionlist /scratch/work/user/idytrych/CC-2015-32/mg4j_index "/home/idytrych/check_i.txt"
(Each line in check_i.txt should contain the correct number of index files, and no error log should contain an exception.)
ls -1 /scratch/work/user/idytrych/warc | sed 's/\-warc.gz//' > ~/namelist
or
ls -1 /scratch/work/user/idytrych/wikitexts/ > ~/namelist
bash prepare_dl_warc_single.sh 2015-32
Download CC (manually on the login nodes):
/scratch/work/user/idytrych/CC-2015-32/download/dload_login1_all.sh
/scratch/work/user/idytrych/CC-2015-32/download/dload_login2_all.sh
/scratch/work/user/idytrych/CC-2015-32/download/dload_login3_all.sh
/scratch/work/user/idytrych/CC-2015-32/download/dload_login4_all.sh
ls /scratch/work/user/idytrych/CC-2015-32/warc/ | sed 's/\-warc.*//g' > ~/namelist
mv /scratch/work/user/idytrych/CC-2015-32/warc /scratch/work/user/idytrych/
rm /scratch/work/user/idytrych/counter/*
bash start.sh vert 1:f qprod
bash start.sh vert 50 qprod
(The first node is launched separately and it must be guaranteed that it runs for the whole processing time - use qlong if needed.)
(Other nodes can be added at any rate and in unlimited number - although there is not much point in running over 150 of them, the discs would be the bottleneck.)
bash start.sh dedup 4 qprod
rm /scratch/work/user/idytrych/counter/*
bash start.sh tag 1:f qprod
bash start.sh tag 50 qprod
rm /scratch/work/user/idytrych/counter/*
bash start.sh parse 1:f qprod
bash start.sh parse 50 qprod
rm /scratch/work/user/idytrych/counter/*
bash start.sh sec 1:f qlong
bash start.sh sec 20 qprod
mkdir /scratch/work/user/idytrych/CC-2015-32/mg4j_index
mkdir /scratch/work/user/idytrych/CC-2015-32/mg4j_index/final
(MMM is the number of required collections - the advised number is the number of target servers * 6.)
bash startIndexing.sh cList MMM
bash startIndexing.sh cShards qprod /scratch/work/user/idytrych/CC-2015-32/mg4j_index
bash startIndexing.sh pShards qprod /scratch/work/user/idytrych/CC-2015-32/mg4j_index
bash startIndexing.sh indexes qprod /scratch/work/user/idytrych/CC-2015-32/mg4j_index
bash startCheckSec.sh qprod namelist 2 1 /scratch/work/user/idytrych/parsed /scratch/work/user/idytrych/secsgeresult
bash startCheckSec.sh qprod namelist 2 2 /scratch/work/user/idytrych/parsed /scratch/work/user/idytrych/secsgeresult
("2 1" is the number of parts and which part should be launched - running multiple parts gives the result faster; the following command receives the number of parts via -n.)
python remove.py -n 2 -s /scratch/work/user/idytrych/secsgeresult > rm.sh
sed "s/secsgeresult/sec_finished/;s/.vert.dedup.parsed.tagged.mg4j//" rm.sh > rmf.sh
bash rm.sh
bash rmf.sh
rm rm.sh
rm rmf.sh
bash create_restlist_single.sh namelist /scratch/work/user/idytrych/secsgeresult > restlist
rm /scratch/work/user/idytrych/counter/*
bash start.sh sec 1:f qlong restlist
bash start.sh sec 20 qprod restlist
bash startCheckIndexes.sh /home/idytrych/collectionlist /scratch/work/user/idytrych/CC-2015-32/mg4j_index "/home/idytrych/check_i.txt"
(Every line in check_i.txt should contain the correct number of index files, and no error log should contain an exception.)
A good example of the manatee format, the Susanne corpus, can be downloaded here.
This format differs from ours - it has only 4 columns (we have 27). All tags that begin with < keep this format; nothing is transformed, unlike in the case of MG4J. As for the necessary changes, it only does not add GLUE as a token variant and does not generate things such as %%#DOC PAGE PAR SEN. In manatee an underscore is used instead of an empty annotation (in mg4j it is 0). In addition, a manatee configuration file, which defines the tags and determines the path to the vertical file, has to be created so that indexing with the encodevert program is possible.
The ElasticSearch format used for semantic annotations looks as follows:
Word[annotation1;annotation2...] and[annotation1;annotation2...] other[...;annotation26;annotation27] word[...;annotation26;annotation27]
The form of the annotations may be arbitrary; however, only alphanumeric characters and the underscore are allowed.
At the moment the following format is used - each annotation has the form typeOfAnnotation_value.
Types of annotations:
position, token, tag, lemma, parpos, function, parword, parlemma, paroffset, link, length, docuri, lower, nerid, nertag, param0, param1, param2, param3, param4, param5, param6, param7, param8, param9, nertype, nerlength
The actual annotated text then looks as follows:
Word[position_1;token_Word...]
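For illustration, a token carrying several of the annotation types listed above might look as follows (all annotation values are hypothetical examples):
Word[position_1;token_Word;tag_NN;lemma_word;lower_word] and[position_2;token_and;tag_CC;lemma_and;lower_and]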
For semantic querying, typical Lucene queries are used (see the testing query in the project directory).
Two new scripts (revert.py and reparse.py) were created in order to use the new SyntaxNet utility for text analysis. The script revert.py converts a vertical file into input suitable for SyntaxNet: every tag and link is moved to the end of the file and assigned a numerical identifier. These identifiers are then used by the reparse.py script, which restores the tags and links from the SyntaxNet output.
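A hedged sketch of the whole round trip (the SyntaxNet invocation itself is only indicated as a comment, and all file names are illustrative):
./processing_steps/5/syntaxnet/revert.py input.vert input.forparse
# run SyntaxNet on input.forparse, producing syntaxnet.out
./processing_steps/5/syntaxnet/reparse.py syntaxnet.out output.parsed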
Script revert.py:
./processing_steps/5/syntaxnet/revert.py
Usage:
./processing_steps/5/syntaxnet/revert.py INPUT_FILE OUTPUT_FILE
INPUT_FILE - input file, expected to be the output of the verticalizer
OUTPUT_FILE - output file usable by SyntaxNet
Script reparse.py:
./processing_steps/5/syntaxnet/reparse.py
Usage:
./processing_steps/5/syntaxnet/reparse.py INPUT_FILE OUTPUT_FILE
INPUT_FILE - input file, expected to be the SyntaxNet output
OUTPUT_FILE - output file enriched with the tags and links
Scripts to download CommonCrawl 2014-35:
dload_warc.sh - downloading script for WARC. BEWARE: it is necessary to add the target file to the end of the line or to download the content of each directory separately.
dload_wat.sh - downloading script for WAT. BEWARE: it is necessary to add the target file to the end of the line or to download the content of each directory separately.
dload_wet.sh - downloading script for WET. BEWARE: it is necessary to add the target file to the end of the line or to download the content of each directory separately.
lister.sh - main script that downloads segment.paths, creates the files dload_*.sh, list.*.sh and *.lst, and computes the file sizes for WARC, WAT and WET
list.WARC.sh - script creating the WARC listing
list.WAT.sh - script creating the WAT listing
list.WET.sh - script creating the WET listing
segment.paths - downloaded file with the description of segments
warcs.lst - listing of WARC files
wats.lst - listing of WAT files
wets.lst - listing of WET files
dl_warc.sh - gradually launches lister.sh to create the listing, then creator.py with the file warcs.lst, and starts the downloading on the machines listed in servers.cfg
creator.py - creates the downloading scripts; one of the w*.lst files is given on standard input, the argument is the path for downloading, and the list of machines is given in servers.cfg
servers.cfg - list of machines in the format name_of_machine number_of_threads coefficient (everything separated by a single space); the number of threads and the coefficient are integers. The coefficient determines the proportion that will be downloaded (e.g. if there are 3 machines with coefficients 2, 1, 1, then 2/4 = 1/2 of all files is downloaded on the first machine and 1/4 on each of the other two). An example is shown below, after the TreeTagger commands.
wet2vert1.py - script for processing vertical files from WET; the arguments are the input directory, the output directory and the number of processes
There is also a link to TreeTagger; run it using the following commands:
[path_to_corpproc/]tt/bin/tree-tagger -token -lemma -sgml -no-unknown [path_to_corpproc/]tt/lib/english.par
or by the script tagger.py from the directory vert2ner.
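Returning to servers.cfg from the list above, a hedged example with hypothetical machine names, matching the coefficients 2, 1, 1 described there:
knot01 8 2
knot02 4 1
knot03 4 1
# knot01 downloads 2/4 = 1/2 of all files, knot02 and knot03 download 1/4 each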
parser.py - divides the files between threads and then runs MDParser on every group of files
mdp.jar - main file, run by java -Xmx1g -jar mdp.jar props.xml (props.xml is the config file)
props1.xml - example config file. WARNING: do not change the path nor the content of this file, otherwise the parser.py script will not work.
tagger.py - TreeTagger, the arguments are the input directory, the output directory and the number of processes
Quantity:
WARC: 43430.7952731 GB (46633461334359 bytes)
WAT: 14702.6299882 GB (15786828741075 bytes)
WET: compressed 5300.3036399 GB (5691157698008 bytes), uncompressed 12319.1 GB
Statistics after verticalization (old verticalizer):
Number of files: 52 849
Number of documents: 2 744 133 462
Number of paragraphs: 316 212 991 122
Number of sentences: 358 101 267 144
Number of words (tokens): 2 534 513 098 452