Downloading, processing and indexing of a large text corpus

Table of Contents

1 Description of processing

Processing is divided into several steps. Scripts were created for each step to simplify the work; their arguments allow the processing to run either on a single machine or in parallel on multiple machines.

All source code of the programs and scripts is in the repository corpora_processing_sw; any missing libraries are in /mnt/minerva1/nlp/projects/corpproc. Copies found anywhere else may be old, non-functional versions!

1.1 Distribution of programs or data for processing

The script redistributes data (files) between servers. Files are divided according to their size so that all servers hold a similar amount of data. The parameter -a switches redistribution off and copies each file to all servers instead; this is particularly suitable for distributing the processing programs.

 ./processing_steps/1/distribute.py
        

Usage:

 ./distribute.py [-i INPUT_DIRECTORY] -o OUTPUT_DIRECTORY -s SERVER_LIST [-a] [-e ERRORS_LOG_FILE]

 -i   --input     input directory/file for distribution (if it is not set, file names are expected on stdin separated by '\n')
 -o   --output    output directory on target server, if directory doesn't exist script tries to create it
 -a   --all       each file will be copied to all servers (suitable for scripts/programs)
 -s   --servers   file with the list of servers, one hostname per line; if a line contains a tab character, it is treated as a
                  separator and the text before it is used as the hostname (this keeps the file compatible with scripts that
                  accept the format HOSTNAME \t THREADS on each line to specify the number of threads for a particular machine)
 -e   --errors    if set, errors are logged to this file with current date and time
        

Examples:

 ./distribute.py -i ~/data_to_distribution/ -o /mnt/data/project/data/ -s ~/servers
 # All files from the directory ~/data_to_distribution/ are redistributed among the servers listed in the file ~/servers.
 # The output directory is /mnt/data/project/data/. If the data directory does not exist on some machine, the script tries to create it.

 ./distribute.py -i ~/NLP-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar -o /mnt/data/commoncrawl/software/ -s ~/servers -a
 # Copies the NLP-slave program to the servers listed in the file ~/servers.
 # Output directory is /mnt/data/commoncrawl/software/. If directory "software" doesn't exist on a machine, the script creates it.
        

1.1.1 Downloading the dump of Wikipedia

Plain text is extracted from Wikipedia using a program called WikiExtractor.

Launching:

 cd /mnt/minerva1/nlp/corpora_datasets/monolingual/english/wikipedia/
 ./download_wikipedia_and_extract_html.sh 20151002
        

The script expects a file hosts.txt in the working directory containing the list of servers (one per line) on which it is supposed to run.
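
An illustrative hosts.txt might look like this (the hostnames are placeholders):

 knot01.fit.vutbr.cz
 knot02.fit.vutbr.cz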

 /mnt/minerva1/nlp/corpora_datasets/monolingual/english/wikipedia/tools/WikiExtractor.py
        

The output is a collection of files of approx. 100 MB each, distributed across the servers mentioned above, in /mnt/data/wikipedia/enwiki-.../html_from_xml/enwiki=...

1.1.2 Downloading CommonCrawl

To download the WARC files you need to know the exact CommonCrawl specification, e.g. "2015-18".

 ./processing_steps/1b/download_commoncrawl/dl_warc.sh
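
For example, downloading the 2015-18 crawl mentioned above would presumably be run as:

 ./dl_warc.sh 2015-18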
        

downloaded files are located in:

 /mnt/data/commoncrawl/CC-Commoncrawl_specification/warc/
        

support files are located in:

 /mnt/minerva1/nlp-2/download_commoncrawl/CC-Commoncrawl_specification/download/
        

Afterwards it is possible to calculate URI statistics using:

 ./processing_steps/1b/uri_stats.sh
        

result is saved in:

 /mnt/minerva1/nlp-2/download_commoncrawl/CC-Commoncrawl_specification/uri/
        

data on individual machines are located in:

 /mnt/data/commoncrawl/CC-Commoncrawl_specification/uri/
        

1.1.3 Downloading web pages from RSS feeds

To get URLs from the given RSS sources, use:

 ./processing_steps/1c/collect.py
        

Usage:

 ./collect.py [-i INPUT_FILE] [-o OUTPUT_FILE [-a]] [-d DIRECTORY|-] [-e ERRORS_LOG_FILE]
 
 -i   --input     input file containing RSS URLs separated by '\n' (if not set, input is expected on stdin)
 -o   --output    output file for the extracted URLs (if not set, they are printed to stdout)
 -a   --append    output file is opened in append mode
 -d   --dedup     deduplication of the obtained links (by matched URL); optionally a folder with files listing already collected URLs can be given,
                  these URLs are then included in the deduplication; in -a mode the output file is also included
 -e   --errors    if set, errors are logged to this file with current date and time
        

Examples:

 ./collect.py -i rss -o articles -a
 # Appends the URLs collected from the RSS sources listed in the file rss to the file articles
        

To download web pages according to the given list of URLs and save them to a WARC archive, use:

 ./processing_steps/1c/download.py
        

Usage:

 ./download.py [-i INPUT_FILE] -o OUTPUT_FILE [-e ERRORS_LOG_FILE]
 
 -i   --input     input file containing URLs (if not set, input is expected on stdin)
 -o   --output    output file for saving warc archive
 -r   --requsts   limit the number of requests per minute on one domain (default is 10)
 -e   --errors    if set, errors are logged into this file with current date and time
        

The script downloads pages evenly across domains rather than in the order of the input file, to avoid a possible "attack" on a single server. A limit on the number of requests per domain per minute can also be set. When the limit is reached for all domains, downloading is paused until the limit is replenished; the limit is replenished every 6 seconds (1/10 of a minute) by 1/10 of the total limit.
If an error occurs during the download (any response code other than 200), the whole domain is excluded.
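
The following Python sketch only illustrates the rate-limiting idea described above; it is not the actual download.py code, and the names and budget representation are illustrative:

 # Minimal sketch: each domain has a request budget, 1/10 of the per-minute
 # limit is restored every 6 seconds, and failing domains are excluded.
 from collections import defaultdict
 from urllib.parse import urlparse

 LIMIT = 10                                   # requests per minute per domain
 budget = defaultdict(lambda: float(LIMIT))   # remaining requests per domain
 excluded = set()                             # domains dropped after a non-200 response

 def replenish():
     # called every 6 seconds: restore 1/10 of the per-minute limit
     for domain in budget:
         budget[domain] = min(LIMIT, budget[domain] + LIMIT / 10.0)

 def pick_next(urls):
     # choose the next URL from a domain that still has budget and is not excluded
     for url in urls:
         domain = urlparse(url).netloc
         if domain not in excluded and budget[domain] >= 1:
             budget[domain] -= 1
             return url
     return None  # all domains exhausted -> wait for the next replenish()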

Example:

 ./download.py -i articles -o today.warc.gz
 # Downloads the pages whose URLs are listed in the file articles into the archive today.warc.gz
        

1.2 Verticalization

Vertical file format description

The input for the verticalization program is a warc.gz file. The program unpacks the individual records (documents) from it, strips the HTML, filters out non-English articles and performs tokenization. It can also process a single web page in .html and Wikipedia in preprocessed HTML. The output is saved with the extension .vert into the target folder.

New verticalizer:

 ./processing_steps/2/vertikalizator/main.py
        

Compile using the following command:

 make
        

If an error occurs while compiling justextcpp (aclocal.m4), run the following commands in justextcpp/htmlcxx:

 touch aclocal.m4 configure Makefile.in
 ./configure
        

On Salomon it may be necessary to load modules (some of the following modules and their dependencies):

 module load OpenMPI/1.8.8-GNU-4.9.3-2.25
 module load Autoconf/2.69
 module load Automake/1.15
 module load Autotools/20150215
 module load Python/2.7.9
 module load GCC/4.9.3-binutils-2.25
        

If LZMA-compressed WARC output is required, backports.lzma module has to be installed:

 pip install --user backports.lzma
        

Installation of this module can fail during compilation if the lzma library is not installed on the server. In that case run:

 make lzma
        

The liblzma-dev package needs to be installed; use the command:

 apt-get install -y liblzma-dev
        

Usage:

 ./main.py [-h] [-i INPUT] [-o OUTPUT] [-n] [-t INPUTTYPE] -s STOPWORDS [-l LOG] [-a WARCOUTPUT] [-d]
 
 -i   --input        input file for verticalization (if not set, WARC input is expected on stdin)
 -o   --output       output file (if not set, stdout is used)
 -n   --nolangdetect turns off language detection (speeds processing up with minimal difference in the output)
 -t   --inputtype    determines the expected input type, see Input types
 -s   --stopwords    file with the list of stop words, required for boilerplate removal using the Justext algorithm
 -l   --log          log file for debugging information; pass STDOUT or STDERR to print the debugging information to the respective stream.
                           If not present, no log is stored or displayed.
 -a   --warcoutput   path to the output WARC file. This file will contain the HTTP response records from the input file
                     whose contents were not completely removed by the Justext algorithm or the language detection.
                     No WARC output is generated if this parameter is not set. The parameter has no effect if the input
                     for the verticalization is a single HTML file or a Wikipedia archive. It does not cancel the standard
                     vertical output, so the verticalization process produces 2 different outputs at the same time.
                     Compression is used if the filename ends with .gz (gzip) or .xz (LZMA).
 -d   --dedup        the vertical output will be deduplicated. The file dedup_servers.txt has to be modified to contain the hostnames of the servers where the deduplication server processes should run.
                     Deduplication servers keep running even after the verticalization process ends, so they can be used by other verticalization processes. They have to be shut down manually.
 -m   --map          configuration map for deduplication (currently only in the hash redistribution branch).
        

Stdin input:

The verticalizer is faster with input on stdin than with an input file. For example:

 xzcat file | python main.py [-o OUTPUT] [-n] -s STOPWORDS [-l LOG] [-a WARCOUTPUT] [-d]
        

1.2.1 Warcreader

Warcreader is now located in the verticalizator folder.

Old warcreader: When upgrading to a newer version of the verticalizer it might be necessary to reinstall the warcreader package, which is developed together with the verticalizer but published on the Python Package Index as a standalone library. This package is installed into the current user's home directory by the make command and cannot be removed or upgraded by the pip utility. It has to be removed manually using:

 rm -r ~/.local/lib/pythonX.X/site-packages/warcreader*
        

where X.X is the Python version, usually 2.7. Then you can install the new version of warcreader using pip.
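
The exact command is not given here; something along these lines should work (the package name is as published on PyPI, the flags may differ in your environment):

 pip install --user warcreader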

1.2.2 Input types

The verticalizer can process several types of input. The expected input type must be specified with the -t/--inputtype argument. Possible values are:

DO NOT USE: old verticalizer

Originally, NLP-slave (which depends on a local langdetect) was used for verticalization; its source code is in the repository:

 ./processing_steps/2/NLP-slave
        

This program expects a warc.gz file as input (first parameter), from which it gradually extracts the individual records, filters out HTML, filters out articles that are not in English and performs tokenization. It is also possible to process a single web page in .html and Wikipedia in preprocessed plain text or HTML. The result is saved with the extension .vert into the destination directory.

Compile using this command:

 mvn clean compile assembly:single
        

Usage of old verticalizer:

 java -jar package [opts] input_file.html output_dir file_URI [langdetect_profiles]
 java -jar package [opts] input_file.txt output_dir [langdetect_profiles]
 java -jar package [opts] input_file-warc.gz output_dir [langdetect_profiles]
 java -jar package [opts] input_file output_dir [langdetect_profiles]
        

Script for launching in parallel on multiple machines:

 ./processing_steps/2/verticalize.py
        

Usage:

 ./verticalize.py -i INPUT_DIRECTORY -o OUTPUT_DIRECTORY [-s SERVERS_FILE] [-b BINARY] [-t THREADS] [-l] [-d] [-w STOPWORDS_LIST] [-e ERRORS_LOG]
 
 -i   --input     input directory containing files to verticalize
 -o   --output    output directory, if it does not exist, script attempts its creation
 -s   --servers   file with list of servers with format HOSTNAME '\t' THREADS '\n' (one line per one machine)
 -e   --errors    if it is set errors are logged to this file with current date and time
 -b   --binary    argument specifying path to verticalizer (default is ./processing_steps/2/vertikalizator/main.py)
 -t   --threads   sets number of threads (default is 6, if there are numbers of threads in file with list of the servers, they have higher priority)
 -l   --no-lang   turns off language detection
 -d   --debug     verticalizer prints debugging information
 -w   --stopwords file with list of stop words (default is ./processing_steps/2/vertikalizator/stoplists/English.txt)
        

Example:

 ./verticalize.py -i /mnt/data/commoncrawl/CC-2015-18/warc/ -o /mnt/data/commoncrawl/CC-2015-18/vert/ -s ~/servers -b /mnt/data/commoncrawl/software/NLP-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar
 # Data from the directory /mnt/data/commoncrawl/CC-2015-18/warc/ are verticalized into the directory /mnt/data/commoncrawl/CC-2015-18/vert/
 # Verticalization is executed on all servers listed in the file ~/servers
 # It uses the program /mnt/data/commoncrawl/software/NLP-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar, which has to be present on all machines
        

1.2.3 Profiling scripts

All scripts can be found in the profiling folder inside the verticalizer directory.

files_compare.py - a script for comparing two files. It takes two arguments, the paths to the two files to be compared.

Run example:

 python files_compare.py testregularpuvodni.txt testregular11.vert
        

profiling.py

A script to simplify profiling. All variants to be profiled are placed in the folder. If there are multiple variants of the same file, e.g. tokenizer.py, add "-" and a distinguishing suffix before the file extension.

Example:

 tokenizer-2.py or tokenizer-slouceniVyrazu.py
        

First, the original version of the verticalizer is run so that the profiling results can be compared. The script then takes the individual variant files, overwrites the tested file and runs verticalization. When verticalization ends, the original file is taken from the reset folder and written back into the verticalizer. (This is a precaution so that if, for example, verticalize.py is tested first and then tokenizer.py, the edited verticalize.py is not used.)

Requirements: create a folder reset and copy the original files into it. Then create a folder profiling, where the results will be saved. The script and both folders must be located in the verticalizator folder.

The script requires two arguments: the path to the input file and the file type.

Run example:

 python profiling.py /mnt/data/commoncrawl/CC-2016-40/warc/1474738659833.43_20160924173739-00094-warc.xz warc
        

profiling_results.py

A script used to display the results. It requires three arguments: the folder with the results, the sorting criterion and the number of items to display for each file.

Run example:

 python profiling_result.py profiling tottime 20
        

Most important sorting options:

1.3 Deduplication

Detailed documentation of deduplication can be found here.

The programs dedup and server are used for deduplication. Both can be compiled via the Makefile in the folder:

 processing_steps/3/dedup/
 in repository corpproc_dedup
        

Launch parameters:

 ./server -m=~/hashmap.conf [-i=INPUT_FILE] [-o=OUTPUT_FILE [-d]] [-s=STRUCT_SIZE] [-p=PORT] [-j=JOURNAL_FILE] [-k] [-a] [-P=RPORT]

 -i   --input     input file 
 -o   --output    output file
 -h   --help      lists launch arguments and their use
 -p   --port      server port (default 1234)
 -s   --size      change size of structure for saving hashes (default size is 300,000,000)
 -d   --debug     along with the output file, a file containing debugging dumps is generated
 -j   --journal   recovery of hashes from journal file - (use with -j at client)
 -k   --keep      archiving journal file after successfully saving hashes to output file
 -m   --map       hash distribution map
 -a   --altered   distribution map changed, enable hash migration
 -P   --rport     port for hash migration (default -p + 1)
        

The server runs until it is "killed". Hashes are saved into the output file (if specified) in reaction to the signals SIGHUP and SIGTERM.
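
For example, assuming the server runs under the PID shown by screen or ps (the PID is a placeholder), the hashes can be flushed with:

 kill -SIGHUP <server_pid>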

 ./dedup -i=INPUT_DIR -o=OUTPUT_DIR -s=SERVERS_STRING [-p=PORT] [-t=THREADS] [-n] [-d] [-wl] [-uf=FILTER_FILE]

 -i   --input       input folder with files for deduplication
 -o   --output      output folder, if it does not exist, script attempts its creation
 -p   --port        server port (default 1234)
 -t   --threads     sets number of threads (default 6)
 -n   --near        uses algorithm "nearDedup"
 -d   --debug       for every output .dedup file, a .dedup.debug file containing debugging logs is generated
 -wl  --wikilinks   deduplication of the WikiLinks format
 -dr  --dropped     for each output .dedup file, a .dedup.dropped file containing the deleted duplicates is generated
 -f   --feedback    records in the .dedup.dropped file carry a reference to the record responsible for their elimination (more below)
 -dd  --droppeddoc  for each output *.dedup file, a .dedup.dd file containing the list of URL addresses of completely eliminated documents is created
 -j   --journal     continue deduplication after a system crash - already processed files are skipped, unfinished ones are processed again
                    (use with -j on the server)
 -h   --help        prints help
 -m   --map         hash distribution map
        

For easier launching there are the scripts server.py and deduplicate.py (described below), which allow running deduplication in parallel on multiple machines. The programs have to be distributed to all machines beforehand and must be in the same location, e.g. /mnt/data/bin/dedup

Launching servers for deduplication

With the argument start the script first launches screens and then the servers inside them. With the argument stop it closes the screens. If neither start, stop nor restart is given, the script first checks whether the screens are running and then whether the servers are.

 ./processing_steps/3/server.py
        

Usage:

 ./server.py [start|stop|restart|migrate] -m MAP_FILE [-i INPUT_FILE] [-o OUTPUT_FILE [-d]] [-a] [-t THREADS] [-p PORT] [-e ERRORS_LOG_FILE] [-b BINARY] [-d] [-j] [-k] [-r SPARSEHASH_SIZE] [-P rport]


 start             start servers
 stop              stop servers, wait for hash serialization
 restart           restart servers
 migrate           migrate hashes and exit. 
 -i   --input      input file 
 -o   --output     output file
 -t   --threads    number of threads containing workers (default = 384)
 -p   --port       server port (default 1234)
 -r   --resize     change size of structure for saving hashes (default size is 300000000)
 -e   --errors     if set, errors are logged into this file with current date and time
 -b   --binary     argument specifying path to deduplication server (default is /mnt/data/commoncrawl/corpproc/bin/server)
 -d   --debug      simultaneously with output file a file containing debugging dumps is generated
 -j   --journal    recover hashes from journal file (suggested to use with -j at client)
 -k   --keep       archiving journal file after successful save of hashes to output file
 -v   --valgrind   Runs server in valgrind (emergency debugging)
 -P   --rport      Port for hash migration service (default is -p + 1)
 -E   --excluded   List of excluded servers for hash migration (old->new)
 -a   --altered    Enables hash migration. see -a at server
 -m   --map        hash distribution
        

Note: -j, --journal requires the path to the input file from which to recover the hashes - the file that was specified as the output file before the server crash. The script checks whether the file "input_file".backup exists. If the input file does not exist on a server, it will be created.

For example:

 ./server.py start -m ~/hashmap
 # Launches screens and then launches servers in them on machines specified in file ~/hashmap
 # Servers are waiting for workers to connect

 ./server.py -m ~/hashmap
 # Tests if screens and servers are launched on machines specified in file ~/hashmap

 ./server.py stop -m ~/hashmap -E ~/excluded.list
 # Closes the screens and servers on the machines specified in the files ~/hashmap and ~/excluded.list

 ./server.py migrate -m ~/hashmap -E ~/excluded.list -i ~/input.hash -o ~/output.hash
 # Runs hash migration according to distribution map, hashes are saved and servers are stopped when migration ends.
        

Launching workers for deduplication

The servers must have been launched with the same -m and -p parameters as used here.

 ./processing_steps/3/deduplicate.py
        

Usage:

 ./deduplicate.py -i INPUT_DIRECTORY -o OUTPUT_DIRECTORY -w WORKERS_FILE -m MAP_FILE [-p PORT] [-e ERRORS_LOG_FILE] [-b BINARY] [-n] [-d] [-wl]

 -i   --input      input folder with files for deduplication
 -o   --output     output folder, if it does not exist, script attempts its creation
 -w   --workers    file with list of workers, the format is: HOSTNAME '\t' THREADS '\n' (one line per one machine)                   
                   (beware - replacing tabulators with spaces will not work)
 -p   --port       server port (default 1234)
 -e   --errors     if set, errors are logged into this file with current date and time
 -b   --binary     argument specifying path to deduplication program (default is /mnt/data/commoncrawl/corpproc/bin/dedup)
 -t   --threads    number of threads (default is 6, if there are numbers of threads in file with list of the servers, they have higher priority)
 -n   --near       it uses algorithm "nearDedup"
 -d   --debug      along with each output .dedup file, a .dedup.dropped file containing removed duplicates and a .dedup.debug file containing debugging dumps are generated
 -wl  --wikilinks  deduplication of format Wikilinks
 -dr  --dropped    for each output file .dedup a .dedup.dropped containing deleted duplicates is generated
 -f   --feedback   records in the .dedup.dropped file carry a reference to the record responsible for their elimination (see below)
 -dd  --droppeddoc file "droppedDocs.dd" containing list of completely excluded documents will be created in the output directory
 -j   --journal    continue deduplication after a system crash - already processed files are skipped, unfinished ones are processed again
                   (use with -j on the server)
 -m   --map        hash distribution map
        

Deduplication of the wikilinks format computes the hash over the concatenation of columns 2, 3, 5 and 6; all of these columns have to be identical for a row to be evaluated as a duplicate. In neardedup, hashes over N-grams are computed from the concatenation of columns 5, 3 and 6 (in this order), and a hash of column 2 is computed separately; a row is considered a duplicate only if both of these match.
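
An illustrative Python sketch of the plain (non-near) wikilinks duplicate key; the real work is done by the distributed dedup tool above with its own hash function and hash store, so this only shows which columns are combined:

 import hashlib

 seen_hashes = set()

 def is_duplicate(line):
     cols = line.rstrip("\n").split("\t")
     key = "\t".join((cols[1], cols[2], cols[4], cols[5]))   # columns 2, 3, 5, 6
     digest = hashlib.sha1(key.encode("utf-8")).digest()
     if digest in seen_hashes:
         return True
     seen_hashes.add(digest)
     return False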

For example:

 ./deduplicate.py -i /mnt/data/commoncrawl/CC-2015-18/vert/ -o /mnt/data/commoncrawl/CC-2015-18/dedup/ -w ~/workers -s ~/servers 
 # Data from the folder /mnt/data/commoncrawl/CC-2015-18/vert/ are deduplicated into the folder /mnt/data/commoncrawl/CC-2015-18/dedup/
 # Deduplication runs on the machines specified in the file ~/workers
 # Running servers on machines specified in file ~/servers are expected
        

1.3.1 Deduplication for Salomon

This step differs on Salomon because it involves distributed computation. In a standard run the communication goes through sockets over standard TCP/IP. On Salomon this is not possible (InfiniBand), which is why the MPI library is used. The source code is in:

 /mnt/minerva1/nlp/projects/corpproc/corpora_processing_sw/processing_steps/salomon/mpidedup
        

It is highly recommended to use the module OpenMPI/1.8.8-GNU-4.9.3-2.25 due to compatibility issues with some other versions.

Compilation:

 module load OpenMPI/1.8.8-GNU-4.9.3-2.25
 make
        

Launching on Salomon

Deduplication is easiest to run using the v5/start.sh script; the launch parameters are set inside the script. It is recommended to use 4 nodes:

 bash start.sh dedup 4 qexp
        

Parameters:

 -h --hash - relative number of servers holding particular subspaces of total hashing space
 -w --work - relative number of worker servers executing their own deduplication
 -i --input - input directory with files for deduplication
 -o --output - output directory
 -l --load - optional directory used to load existing hashes
    (for incremental deduplication - the run obviously has to use the same infrastructure,
    i.e. the same number of workers and hash-space servers, as when the hashes were saved)
 -s --store - optional directory to save hashes
 -r --resize - optional parameter to change size of structure to save hashes
 -d --debug - turns on debugging mode (generates logs)
 -p --dropped - generates *.dropped file for each input vert, containing removed paragraphs
 -c --droppeddoc - generates *.dd file for each input vert, containing list of processed documents
 -n --near - switches to neardedup algorithm
 -j --journal - takes processed files and journals into consideration (unsaved hashes and unfinished files)
              - attempts to recover after crash and to continue deduplication
 -m --map - loads the distribution map from the KNOT servers and uses it
        

What to do when a crash occurs?

Examine what caused the crash, check dedup.eXXXXX and dedup.oXXXXX files and run the process again.

When launching via v5/start.sh, which then calls v5/dedup.sh after the environment parameters are set, the -j parameter is pre-set so that the deduplication continues from where it left off regardless of whether the previous run ended successfully. Keep in mind that -j causes already processed verticals to be skipped; to start the process over from scratch, delete the contents of the output folder with the deduplicated verticals. Recovery mode can be turned off in the v5/dedup.sh file.

How does recovery mode work?

Launching on KNOT servers:

 mpiexec dedup -h H -w W -i ~/vert/ -o ~/dedup/ -s ~/hash/
        

More information can be found here.

1.3.1.1 Hash migration KNOT <-> Salomon

Hashes distributed according to the hash map on the KNOT servers can be imported to Salomon, used there and exported back. Use the scripts in processing_steps/salomon/migration.

Hash import:

 python3 importhashes.py [-h] [-s SERVERS] [-m MAP] FILE TARGET

 FILE       path to hash map on KNOT servers, for example: /tmp/dedup/server.hashes
 TARGET     target folder on Salomon, for example: /scratch/work/user/$(whoami)/deduphashes
 -h --help  prints help
 -s SERVERS path to file containing list of servers for import (1 hostname per line), for example: athena5
 -m MAP     path to distribution map, alternative to -s, list of servers for import extracted from the distribution map
        

Hash export:

 python3 exporthashes.py [-h] SOURCE TARGET
        
 SOURCE source folder, for example: /scratch/work/user/$(whoami)/deduphashes
 TARGET target folder on KNOT servers, for example: tmp/dedup/dedup.hash
        

Use of the imported hashes on Salomon

Set the MAP_PATH parameter when launching the process, for example in file v5/start.sh:

 MAP_PATH="/scratch/work/user/${LOGIN}/hashmap.conf"
        

Deduplication adapts the number of hashholders to the number of servers in the distribution map. If the distribution map contains 46 KNOT servers, then at least 47 cores are required (46 hashholders and 1 worker), optimally at least twice that. The parameters NUM_HASHES and NUM_WORKERS are ignored in this case.

1.4 Tagging

Tagging is executed via the program TT-slave (which depends on /opt/TreeTagger); it can be found in:

 ./processing_steps/4/TT-slave
        

Compile using the following command:

 mvn clean compile assembly:single
        

Usage:

 java -jar package [opts] input_file output_dir [treetagger.home]
        

Script for parallel execution on multiple servers:

 ./processing_steps/4/tag.py
        

Usage:

 ./tag.py -i INPUT_DIRECTORY -o OUTPUT_DIRECTORY [-s SERVERS_FILE] [-e ERRORS_LOG_FILE] [-b BINARY] [-t THREADS] [-d] [-u]
 
 -i   --input     input directory with files for tagging
 -o   --output    output folder, if it does not exist, script attempts its creation
 -s   --servers   file containing the list of servers in the format HOSTNAME '\t' THREADS '\n' (one line per machine)
 -e   --errors    if it is set, errors are logged to this file with current date and time 
 -b   --binary    argument specifying path to tagging .jar program (default is ./processing_steps/4/TT-slave/target/TT-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar)
 -t   --threads   sets number of threads (default is 6, if there are numbers of threads in file with list of the servers, they have higher priority)
 -d   --debug     debugging listing
 -u   --uri       turns on deletion of URI from links
        

Example:

 ./tag.py -i /mnt/data/commoncrawl/CC-2015-18/dedup/ -o /mnt/data/commoncrawl/CC-2015-18/dedup/tagged/ -s ~/servers
 # Tags files from the folder /mnt/data/commoncrawl/CC-2015-18/dedup/ and saves them to
 # the directory /mnt/data/commoncrawl/CC-2015-18/dedup/tagged/ on the machines specified in the file ~/servers
        

1.5 Parsing

Parsing is done by a modified MDParser, which can be found in:

 ./processing_steps/5/MDP-package/MDP-1.0
        

Compile by command:

 ant make-mdp
        

Usage:

 java -jar package [opts] input_file output_dir [path_to_props]
        

Important files, in case something needs to be changed:

 ./processing_steps/5/MDP-package/MDP-1.0/src/de/dfki/lt/mdparser/test/MDParser.java
 ./processing_steps/5/MDP-package/MDP-1.0/src/de/dfki/lt/mdparser/outputformat/ConllOutput.java
        

Script for parallel execution on multiple servers:

 ./processing_steps/5/parse.py
        

Usage:

 ./parse.py -i INPUT_DIRECTORY -o OUTPUT_DIRECTORY [-s SERVERS_FILE] [-e ERRORS_LOG_FILE] [-b BINARY] [-t THREADS] [-u] [-p XML_FILE]
 
 -i  --input    input directory with files for parsing
 -o  --output   output directory, if directory doesn't exist, script will attempt to create it
 -s  --servers  file with the list of servers in the format HOSTNAME '\t' THREADS '\n' (one line per machine)
 -e  --errors   if set, errors are logged to this file with current date and time
 -b  --binary   argument specifies path to parsing .jar program (default is ./processing_steps/5/MDP-package/MDP-1.0/build/jar/mdp.jar)
 -t  --threads  sets the number of threads (default is 6; thread counts given in the server list file have higher priority)
 -u  --uri      turns on deleting URI from links
 -p  --props    Path to .xml file with program parameters (default is ./processing_steps/5/MDP-package/MDP-1.0/resources/props/propsKNOT.xml)
        

Examples:

 ./parse.py -i /mnt/data/commoncrawl/CC-2015-18/tagged/ -o /mnt/data/commoncrawl/CC-2015-18/parsed/ -s ~/servers
 # parses files from the folder /mnt/data/commoncrawl/CC-2015-18/tagged/ into
 # the folder /mnt/data/commoncrawl/CC-2015-18/parsed/ on the servers specified in the file ~/servers

 ./parse.py -i /mnt/data/commoncrawl/CC-2015-18/tagged/ -o /mnt/data/commoncrawl/CC-2015-18/parsed/ -s ~/servers \
  -p ./processing_steps/5/MDP-package/MDP-1.0/resources/props/propsKNOT.xml
 # parses files from the folder /mnt/data/commoncrawl/CC-2015-18/tagged/ into
 # the folder /mnt/data/commoncrawl/CC-2015-18/parsed/ on the machines specified in the file ~/servers
 # and passes the config file propsKNOT.xml to MDParser (it must be available on all machines it runs on)
        

1.6 SEC (NER)

SEC (see SEC) is used for named entity recognition. The client sec.py can be run in parallel on multiple servers via:

 ./processing_steps/6/ner.py
        

Usage:

 ./ner.py -i INPUT_DIRECTORY -o OUTPUT_DIRECTORY [-s SERVERS_FILE] [-e ERRORS_LOG_FILE] [-b BINARY]

 -i   --input     input directory 
 -o   --output    output directory
 -s   --servers   file with the list of servers, one hostname per line; if a line contains a tab character, it is treated as a separator and the text before it is used as the hostname (this keeps the file compatible with scripts that accept the format HOSTNAME \t THREADS on each line to specify the number of threads for a particular machine)
 -e   --errors    if it's set, errors are logged to this file with current date and time
 -b   --binary    path to the SEC client (default is /var/secapi/SEC_API/sec.py)
        

Example:

 ./ner.py -i /mnt/data/commoncrawl/CC-2015-18/parsed/ -o /mnt/data/commoncrawl/CC-2015-18/secresult/ -s ~/servers
 # processes files from folder /mnt/data/commoncrawl/CC-2015-18/parsed/ to 
 # folder /mnt/data/commoncrawl/CC-2015-18/secresult/ on machines specified in file ~/servers
        

An example SEC configuration mapping the input vertical to the output vertical, which annotates the vertical in the MG4J format:

 {
    "annotate_vertical": {
        "annotation_format": "mg4j",
        "vert_in_cols": [
           "position", "token", "postag", "lemma", "parabspos", "function", "partoken", "parpostag", "parlemma", "parrelpos", "link", "length"
        ],
        "vert_out_cols": [
           "position", "token", "postag", "lemma", "parabspos", "function", "partoken", "parpostag", "parlemma", "parrelpos", "link", "length"
        ]
    }
 }

1.7 MG4J Indexation

Introduction to MG4J: http://www.dis.uniroma1.it/~fazzone/mg4j-intro.pdf

Source files of a program designed for semantic indexing are located in directory:

 ./processing_steps/7/corpproc
        

Compile using this command:

 mvn package
        

Script for parallel execution on multiple servers

The script starts indexing: it creates 6 shards, fills them, creates a collection from them and then starts indexing the collection. To do this you need to specify the argument start. If it is not set, the status of the individual screens and programs is printed. The argument stop terminates the screens.

 ./processing_steps/7/index.py
        

Use of the script:

 ./index.py [start|stop] -i INPUT_DIRECTORY -o OUTPUT_DIRECTORY [-s SERVERS_FILE] [-e ERRORS_LOG_FILE] [-b BINARY]
 
 -i   --input     input directory
 -o   --output    output directory
 -s   --servers   file with the list of servers, one hostname per line; if a line contains a tab character, it is treated as a
                  separator and the text before it is used as the hostname (this keeps the file compatible with scripts that
                  accept the format HOSTNAME \t THREADS on each line to specify the number of threads for a particular machine)
 -e   --errors    if set, errors are logged to this file with current date and time
 -b   --binary    path to directory containing .jar files (default is /mnt/data/wikipedia/scripts/mg4j/)
        

Examples:

 ./index.py -s ~/servers
 # displays the status of startup screens and program on all servers from file ~/servers

 ./index.py -i /mnt/data/commoncrawl/CC-2015-18/secresult/ -o /mnt/data/indexes/CC-2015-18/ -s ~/servers start
 # runs indexation on selected servers

 ./index.py -s ~/servers stop
 # terminates the screens and thereby the processes on all servers from the file ~/servers
        

The columns can be found here:

 https://docs.google.com/spreadsheets/d/1S4sJ00akQqFTEKyGaVaC3XsCYDHh1xhaLtk58Di68Kk/edit#gid=0
        

Note: It is not clear whether the special XML characters in the columns should be escaped. The current implementation of SEC does not escape them (or rather, any escaping is removed). tagMG4JMultiproc.py does not escape them either; on the other hand, it leaves escaping present in the input (only partially - probably just in some columns).

If columns are changed, the following things have to be changed:

1.8 Daemon replying to requests

The daemon requires a .collection file in the folder with the index; the new indexer creates it automatically. It is, however, possible to use indexes created by the old indexer - the older .collection file is automatically converted to the new format. To move indexes or data files to a new location, simply change the path to the data files in the .collection file (JSON).

Run using these commands:

 cd /mnt/data/indexes/CC-2015-18/final
 java -jar processing_steps/7/mg4j/corpproc-1.0-SNAPSHOT-jar-with-dependencies.jar serve /mnt/data/indexes/CC-2015-18/final
        

Script for parallel execution on multiple servers

Launches daemons replying to requests. The launch takes place in a screen and the parameter start must be set, otherwise the script only prints the status of the screens and daemons. The parameter stop terminates the screens:

 ./processing_steps/7/daemon.py
        

Usage:

 ./daemon.py [start|stop|restart] -i INPUT_DIRECTORY [-s SERVERS_FILE] [-e ERRORS_LOG_FILE] [-b BINARY]
 
 -i   --input     input directory (automatically adds /final/ to the end of the path)
 -p   --port      port (default is 12000)
 -s   --servers   file with the list of servers, one hostname per line; if a line contains a tab character, it is treated as a
                  separator and the text before it is used as the hostname (this keeps the file compatible with scripts that
                  accept the format HOSTNAME \t THREADS on each line to specify the number of threads for a particular machine)
 -e   --errors    if set, errors are logged to this file with current date and time
 -b   --binary    path to directory containing .jar files (default is /mnt/data/wikipedia/scripts/mg4j/)
        

Examples:

 ./daemon.py -i /mnt/data/indexes/CC-2015-18/final -s ~/servers start
 # Launches screens and daemons in them over the collection in the directory /mnt/data/indexes/CC-2015-18/final
 # on machines specified in file ~/servers

 ./daemon.py -s ~/servers
 # displays the status of running screens and daemons on all machines specified in the file ~/servers

 ./daemon.py -s ~/servers stop
 # terminates screens on all machines specified in file ~/servers
        

1.9 Commandline query tool

The source code of the program is in the directory:

 ./processing_steps/8a/mg4jquery
        

Compilation:

 JAVA_HOME=/usr/lib/jvm/java-8-oracle/; mvn clean compile assembly:single
        

Launch examples:

 JAVA_HOME=/usr/lib/jvm/java-8-oracle/; java -jar mg4jquery-0.0.1-SNAPSHOT-jar-with-dependencies.jar -h ../../servers.txt -m ../src/main/java/mg4jquery/mapping.xml -s ../src/main/java/mg4jquery/config.xml  -q "\"was killed\""
 JAVA_HOME=/usr/lib/jvm/java-8-oracle/; java -jar mg4jquery-0.0.1-SNAPSHOT-jar-with-dependencies.jar -h ../../servers.txt -m ../src/main/java/mg4jquery/mapping.xml -s ../src/main/java/mg4jquery/config.xml  -q "1:nertag:person < 2:nertag:person" -c "1.nerid != 2.nerid"
        

For example, the second query returns documents in which there are at least two different people. The servers file expects one server address with port per line, for example:

 knot01.fit.vutbr.cz:12000
        

1.10 Web GUI

Preparation script:

 ./prepare_webGUI.py
        

Usage:

 python ./prepare_webGUI.py -po PORT -pa PATH -n NAME [-a] [-e ERRORS_LOG_FILE]
        

Example:

 python ./prepare_webGUI.py -po 9094 -pa ./src/main/webapp/WEB-INF/ -n users -a
 # installs required artifacts, compiles source files and sets a database "mem:users" up in directory ./src/main/webapp/WEB-INF/ on port 9094
        

Running on port 8086:

 mvn jetty:run -Djetty.port=8086
        

Termination:

 ./stop_webGUI.py
        

Usage:

 python ./stop_webGUI.py
 #necessary to launch every time you terminate the web GUI
        

The subdirectory maven_deps/src contains the sources of our own GWT components for displaying dynamic tooltips. For successful retrieval of query results from the server it is necessary to set the address options of the servers (in the usual format domain_name_of_server:port). You can set the number of results per page, the behaviour of info windows (dynamic - shown while hovering over an entity with the cursor; static - the window stays visible until the user moves the cursor over another entity or otherwise changes the state of the application) and the display type (the default is corpus-based, but you can switch to document-based).

1.10.1 Requesting

Querying is similar to MG4J; the semantic index automatically remaps queries between the indexes, so there is no need to write, for example:

 "(nertag:person{{nertag-> token}}) killed"
        

This request is enough:

 "nertag:person killed"
        

Compared to MG4J there is an extension, "global constraints", which allows tagging a token and applying a post-filter to it.

 1:nertag:person < 2:nertag:person
 1.fof != 2.fof AND 1.nerid = 2.nerid
        

For example, this returns documents in which the same person appears in different forms (often a name and a coreference). To query occurrences within the same sentence, you can use the difference operator.

 nertag:person < nertag:person - _SENT_
 nertag:person < nertag:person - _PAR_
        

These queries search for two persons within one sentence (or one paragraph, respectively). The difference operator restricts the match to a span of text in which the given token (here the sentence or paragraph boundary) does not occur.

1.11 Indexes

There are a lot of indexes which can be queried:

 position
 token
 tag
 lemma
 parpos
 function
 parword
 parlemma
 paroffset
 link
 length
 docuri
 lower
 nerid
 nertag
 person.name
 person.gender
 person.birthplace
 person.birthdate
 person.deathplace
 person.deathdate
 person.profession
 person.nationality
 artist.name
 artist.gender
 artist.birthplace 
 artist.birthdate
 artist.deathplace
 artist.deathdate
 artist.role
 artist.nationality
 location.name
 location.country
 artwork.name
 artwork.form
 artwork.datebegun
 artwork.datecompleted
 artwork.movement
 artwork.genre
 artwork.author
 event.name
 event.startdate
 event.enddate
 event.location
 museum.name
 museum.type
 museum.estabilished
 museum.director
 museum.location
 family.name
 family.role
 family.nationality
 family.members
 group.name
 group.role
 group.nationality
 nationality.name
 nationality.country
 date.year
 date.month
 date.day
 interval.fromyear
 interval.frommonth
 interval.fromday
 interval.toyear
 interval.tomonth
 interval.today
 form.name
 medium.name
 mythology.name
 movement.name
 genre.name
 nertype
 nerlength
        

Warning: every query term starting with a number or containing the character "*" must be surrounded by parentheses, otherwise it causes an error:

 nertag:event ^ event.startdate: (19*)
        

Queries on the attributes of a named entity should be combined with a query on the nertag index (the indexes overlap, so this saves space). Global constraints can additionally use the index "fof", which is an abbreviation of "full occurrence from".
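
For illustration, an attribute query combined with the nertag index (the attribute value here is made up) can follow the same pattern as the example above:

 nertag:person ^ person.profession:painter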

1.12 File redistribution

The script redistributes data (files) among the given servers in order to use the capacity of the disk arrays evenly.

Script location:

 ./processing_steps/1/redistribute.py
        

Usage:

 ./redistribute.py -i INPUT_DIRECTORY [-o OUTPUT_DIRECTORY [-d DISTRIB_DIRECTORY]] -s SERVER_LIST [-p RELATED_PATHS] [-x EXTENSION] [-r] [-m] [-e ERRORS_LOG_FILE]
        

The generated redistribution scripts then have to be launched via parallel-ssh and removed afterwards.

Example:

 python /mnt/minerva1/nlp/projects/corpproc/corpora_processing_sw/processing_steps/1/redistribute.py -i /mnt/data/commoncrawl/CC-2015-14/warc -s /mnt/minerva1/nlp/projects/corpproc/corpora_processing_sw/processing_steps/servers.txt -m -o /home/idytrych/redistributionScripts -d /mnt/data/commoncrawl/software/redistributionScripts -x "-warc.gz" -p /home/idytrych/CC-2015-14-rel.txt >moves.txt
        

where CC-2015-14-rel.txt contains:

 /mnt/data/commoncrawl/CC-2015-14/uri	-warc.domain
 /mnt/data/commoncrawl/CC-2015-14/uri	-warc.domain.srt
 /mnt/data/commoncrawl/CC-2015-14/uri	-warc.netloc
 ...
        

Afterwards the scripts are launched in a screen:

 parallel-ssh -h servery_b9_idytrych.txt -t 0 -A -i "bash /mnt/data/commoncrawl/software/redistributionScripts/\$HOSTNAME.sh"
        

and removed:

 parallel-ssh -h servery_b9_idytrych.txt -t 0 -A -i "rm /mnt/data/commoncrawl/software/redistributionScripts/\$HOSTNAME.sh"
        

Servers from which nothing is to be moved have no script; an error is printed for them.

1.13 Where can you test it?

A running search engine is located on the server athena1; try this as an example.

Indexes are running on almost all servers (for this search engine). For Wikipedia you can restart the daemons like this (only the person who launched the daemons can actually restart them):

 python daemon.py restart -i /mnt/data/indexes/wikipedia/enwiki-20150901/ -s /mnt/minerva1/nlp/projects/corpproc/serverLists/serversIW.txt -b /mnt/data/wikipedia/software/mg4j
        

For CC (in the order of incremental deduplication):

 python daemon.py restart -i /mnt/data/indexes/CC-2015-32/ -s /mnt/minerva1/nlp/projects/corpproc/serverLists/serversIC32.txt -b /mnt/data/commoncrawl/software/mg4j -p 12001
 python daemon.py restart -i /mnt/data/indexes/CC-2015-35/ -s /mnt/minerva1/nlp/projects/corpproc/serverLists/serversIC35.txt -b /mnt/data/commoncrawl/software/mg4j -p 12002
 python daemon.py restart -i /mnt/data/indexes/CC-2015-40/ -s /mnt/minerva1/nlp/projects/corpproc/serverLists/serversIC40.txt -b /mnt/data/commoncrawl/software/mg4j -p 12003
 python daemon.py restart -i /mnt/data/indexes/CC-2015-27/ -s /mnt/minerva1/nlp/projects/corpproc/serverLists/serversIC27.txt -b /mnt/data/commoncrawl/software/mg4j -p 12004
        

Indexes created on Salomon require symlinks - they contain paths such as /scratch/work/user/idytrych/CC-2015-32/mg4j_index.

1.14 Repacking warc by vertical

The script repacks an input warc.gz file according to the corresponding vertical. Only the records that appear in the input vertical are extracted from the original warc.gz file, and the result is saved as a warc.xz file. The script can be executed either for a single warc.gz file and a specific vertical, or for a folder of input warc.gz files and a folder of verticals. No other combinations are allowed.

Script location:

 ./processing_steps/1/warc_from_vert/warc_from_vert.py
        

Launching:

 -i --input    input warc.gz file or folder
 -v --vertical input vertical or folder
 -o --output   output folder, where the results are saved
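
An illustrative invocation over whole folders (the output folder name is an assumption):

 python warc_from_vert.py -i /mnt/data/commoncrawl/CC-2015-18/warc/ -v /mnt/data/commoncrawl/CC-2015-18/vert/ -o /mnt/data/commoncrawl/CC-2015-18/warc_xz/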
        

1.15 Packing wikipedia into warc

The script packs preprocessed Wikipedia into a warc.gz or warc.xz file. The input can be a single file or a folder containing the files to be processed.

Script location:

 ./processing_steps/1/pack_wikipedia/pack_wiki.py
        

Launching:

 -i --input    input file or folder
 -o --output   output folder
 -d --date     date of wiki download, format: DD-MM-YYYY
 -f --format   output file format, gz or xz
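
An illustrative invocation (the input and output folders are assumptions based on the Wikipedia layout used above):

 python pack_wiki.py -i /mnt/data/wikipedia/enwiki-20151002/html_from_xml/ -o /mnt/data/wikipedia/enwiki-20151002/warc/ -d 02-10-2015 -f xz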
        

1.16 Network overload monitoring

The script network_monitor.py allows monitoring the current load of a CESNET link. It can be launched as a standalone script or used as a Python library.

When used as a library, it offers the NetMonitor class, which can be given the name of the link to be monitored. An instance of this class provides the get_status function, which returns a result (in JSON) containing the incoming and outgoing bits per second and the load in percent.

When launched as a standalone script in a terminal, it prints the incoming/outgoing load in percent, or alternatively the worse of the two, depending on the specified parameters.

Values are refreshed every minute; a more frequent refresh rate is pointless.
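
A hedged usage sketch of the library mode; the class name NetMonitor and the get_status function come from the description above, while the constructor argument and the exact shape of the returned data are assumptions:

 from network_monitor import NetMonitor   # assuming the script is importable as a module

 monitor = NetMonitor("Telia")            # name of the link to watch (optional per the text)
 status = monitor.get_status()            # JSON with incoming/outgoing bits per second and load in %
 print(status)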

Script location:

 ./processing_steps/1/network_monitor.py
        

Launching:

 -i --datain    displays incoming load in percentages
 -o --dataout   displays outgoing load in percentages
 -w --worst     displays the worst load in percentages
 -l --link      link to be monitored, if not specified then the Telia link is monitored
        

1.17 Automated CommonCrawl downloading

This is used for automated downloading of large amounts of CC data while keeping the network load within a certain range. The previously mentioned script network_monitor.py is used to monitor the network status. The script is given a lower network load limit in percent and maintains the load between the specified value and a value 10% higher. The input file is warcs.lst, or a part of it if downloading on multiple servers. The maximum number of download processes, in the range 1 to 15, can be specified via a parameter (see below). It is advisable to use the script on fewer servers with more processes.

It was tested on Salomon while downloading CC-2016-44. The script ran on 4 servers with a maximum of 12 processes and the lower network load limit set to 55%, keeping the network load between 55% and 65%.

Script location:

 ./processing_steps/1/cc_auto_download.py
        

Launching:

 -i --input_file   path to the input file warcs.lst or a part of it
 -o --output_dir   path to output folder
 -p --process      number of downloading processes (1-15)
 -l --limit        lower network load limit in percentages (10-80)
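
An illustrative invocation matching the test described above (the paths are assumptions):

 python cc_auto_download.py -i ~/warcs.lst -o /mnt/data/commoncrawl/CC-2016-44/warc/ -p 12 -l 55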
        

1.18 Zimlib

A program that extracts Wikipedia pages from a file in the ZIM format. The output is in the preprocessed Wikipedia format; it is saved to the output folder, split into multiple files of approx. 100 MB each.

Program location:

 /mnt/minerva1/nlp/corpora_datasets/monolingual/english/wikipedia/zim_data/zimlib-1.2
        

It is compiled with the make command; compilation creates the executable zimdump, which can be found here:

 /mnt/minerva1/nlp/corpora_datasets/monolingual/english/wikipedia/zim_data/zimlib-1.2/src/tools
        

Launch example:

 ./zimdump -a ~/output -J cs wikipedia_en_2016_05.zim

Parameters:


2 Getting it working on Salomon

  1. Clone repository with SEC API (minerva1.fit.vutbr.cz:/mnt/minerva1/nlp/repositories/decipher/secapi) to home on Salomon
  2. Download KB (in directory secapi/NER ./downloadKB.sh)
  3. Copy to home on Salomon:
  4. Search for the string "idytrych" (and possibly "smrz") in all files in home (and in subdirectories) and change the absolute paths accordingly
  5. Build mpidedup (cd mpidedup; make)
  6. Create working directory by script createWorkDirs_single.sh
  7. The file warcDownloadServers.cfg contains the list of nodes on which CommonCrawl will be downloaded - it is advisable to check that it is up to date.
  8. For downloading CommonCrawl it is necessary to obtain the file .s3cfg and put it into home.

For version 4 of the scripts:

  1. Clone repository with SEC API (minerva1.fit.vutbr.cz:/mnt/minerva1/nlp/repositories/decipher/secapi) to home on Salomon
  2. Download KB (in directory secapi/NER ./downloadKB.sh)
  3. Copy to home on Salomon:
  4. Build mpidedup (cd mpidedup; make)
  5. Create working directory by script createWorkDirs_single.sh
  6. The file warcDownloadServers.cfg contains the list of nodes on which CommonCrawl will be downloaded - it is advisable to check that it is up to date.
  7. For downloading CommonCrawl it is necessary to obtain the file .s3cfg and put it into home.

3 Launching on our servers

3.1 Processing of Wikipedia

Complete sequence for launching (not tested yet):

 (dump is labeled RRRRMMDD)
 cd 1a
 ./download_wikipedia_and_extract_html.sh RRRRMMDD
 cd ..
 
 python ./1/distribute.py -i ./2/NLP-slave/target/NLP-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar -o /mnt/data/wikipedia/software/ -s servers.txt -a
 python ./2/verticalize.py -i /mnt/data/wikipedia/enwiki-RRRRMMDD/html_from_xml/AA/ -o /mnt/data/wikipedia/enwiki-RRRRMMDD/vert/ -s servers.txt -b /mnt/data/wikipedia/software/NLP-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar

 python ./1/distribute.py -i ./4/TT-slave/target/TT-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar -o /mnt/data/wikipedia/software/ -s servers.txt -a
 python ./4/tag.py -i /mnt/data/wikipedia/enwiki-RRRRMMDD/vert/ -o /mnt/data/wikipedia/enwiki-RRRRMMDD/tagged/ -s servers.txt -b /mnt/data/wikipedia/software/TT-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar

 python ./1/distribute.py -i ./5/MDP-package/MDP-1.0/build/jar/mdp.jar -o /mnt/data/wikipedia/software/ -s servers.txt -a
 python ./5/parse.py -i /mnt/data/wikipedia/enwiki-RRRRMMDD/tagged/ -o /mnt/data/wikipedia/enwiki-RRRRMMDD/parsed/ -s servers.txt -b /mnt/data/wikipedia/software/mdp.jar

 python ./6/ner.py -i /mnt/data/wikipedia/enwiki-RRRRMMDD/parsed/ -o /mnt/data/wikipedia/enwiki-RRRRMMDD/secresult/ -s servers.txt

 python ./1/distribute.py -i ./7/mg4j/ -o /mnt/data/wikipedia/software/mg4j/ -s servers.txt -a
 python ./7/index.py -i /mnt/data/wikipedia/enwiki-RRRRMMDD/secresult/ -o /mnt/data/indexes/enwiki-RRRRMMDD/ -s servers.txt start        
        

Launch of daemons corresponding to queries:

 python ./daemon.py -i /mnt/data/indexes/enwiki-RRRRMMDD/final -s servers.txt -b /mnt/data/wikipedia/software/mg4j/ start
        

3.2 Processing of CommonCrawl

Complete sequence for launching (not tested yet):

 (CC labeled RRRR-MM)
 cd processing_steps/1b/download_commoncrawl
 ./dl_warc.sh RRRR-MM
 cd ../..

 python ./1/distribute.py -i ./2/NLP-slave/target/NLP-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar -o /mnt/data/commoncrawl/software/ -s servers.txt -a
 python ./2/verticalize.py -i /mnt/data/commoncrawl/CC-RRRR-MM/warc/ -o /mnt/data/commoncrawl/CC-RRRR-MM/vert/ -s servers.txt -b /mnt/data/commoncrawl/software/NLP-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar
 
 python ./1/distribute.py -i ./3/dedup/server -o /mnt/data/commoncrawl/software/dedup/ -s servers.txt -a
 python ./1/distribute.py -i ./3/dedup/dedup -o /mnt/data/commoncrawl/software/dedup/ -s servers.txt -a
 cd 3
 parallel-ssh -h servers_only.txt -t 0 -i "mkdir /mnt/data/commoncrawl/CC-RRRR-MM/hashes/"
 ( to load hashes from previous processing, use parameter -i /mnt/data/commoncrawl/CC-RRRR-MM/hashes/)
 python ./server.py start -s servers.txt -w workers.txt -o /mnt/data/commoncrawl/CC-RRRR-MM/hashes/ -b /mnt/data/commoncrawl/software/dedup/server
 python ./deduplicate.py -i /mnt/data/commoncrawl/CC-RRRR-MM/vert/ -o /mnt/data/commoncrawl/CC-RRRR-MM/dedup/ -w workers.txt -s servers.txt -b /mnt/data/commoncrawl/software/dedup/dedup
 python ./server.py stop -s servers.txt -w workers.txt -b /mnt/data/commoncrawl/software/dedup/server
 cd ..

 python ./1/distribute.py -i ./4/TT-slave/target/TT-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar -o /mnt/data/commoncrawl/software/ -s servers.txt -a
 python ./4/tag.py -i /mnt/data/commoncrawl/CC-RRRR-MM/dedup/ -o /mnt/data/commoncrawl/CC-RRRR-MM/dedup/tagged/ -s servers.txt -b /mnt/data/commoncrawl/software/TT-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar
 
 python ./1/distribute.py -i ./5/MDP-package/MDP-1.0/build/jar/mdp.jar -o /mnt/data/commoncrawl/software/ -s servers.txt -a
 python ./5/parse.py -i /mnt/data/commoncrawl/CC-RRRR-MM/tagged/ -o /mnt/data/commoncrawl/CC-RRRR-MM/parsed/ -s servers.txt -b /mnt/data/commoncrawl/software/mdp.jar

 python ./6/ner.py -i /mnt/data/commoncrawl/CC-RRRR-MM/parsed/ -o /mnt/data/commoncrawl/CC-RRRR-MM/secresult/ -s servers.txt

 python ./1/distribute.py -i ./7/mg4j/ -o /mnt/data/commoncrawl/software/mg4j/ -s servers.txt -a
 python ./7/index.py -i /mnt/data/commoncrawl/CC-RRRR-MM/secresult/ -o /mnt/data/indexes/CC-RRRR-MM/ -s servers.txt start
        

Launch of daemons corresponding to queries:

 python ./daemon.py -i /mnt/data/indexes/CC-RRRR-MM/final -s servers.txt -b /mnt/data/commoncrawl/software/mg4j/ start
        

3.3 Run_local

Scripts used to launch the processing on the KNOT servers in a uniform way. Processing is launched with the run.py script, which runs the required step according to the configuration file config.ini; this file has to be edited before processing. Set the path to the folder to be processed in the root_data parameter in the [shared] section. The other variables point to the folders used by the scripts etc. and do not need to be changed. It is advisable to copy this file, e.g. into your home directory, so that the original remains unchanged.

When launched, a screen named USERNAME-proc-STEP_NUMBER is created on every server where the processing happens. All screens can be terminated by running the script with the -a kill parameter.

Script location:

 ./processing_steps/run_local/run.py
        

Launching:

 -c --config   required parameter, specifies path to config file
 -p --proc     required parameter, specifies processing step, options:
                       vert, tag, pars, sec, index, index_daemon, shards or their numerical values (2, 4, 5, 6, 7, 8, 9)
                        the value test can be used to test steps; it does not have a numerical equivalent
                        and requires the tested step to be specified via the -e parameter as well
 -a --action   optional, specifies an action to be executed, options:
                       start, kill, check, progress, eval (default is start)
 -s --servers  optional, specifies path to file containing server list used to process, if not set, the processing will only execute locally
 -t --threads  optional, specifies the number of threads for processing
 -e --examine  required if the value of parameter -p is set to test
 -l --logging  optional, turns on logging of test results to files in the log folder (test_* and main_* files)
 -h --help     prints help
        

3.3.1 Verticalization

Settings are in the [vert] section of the config file. The variable exe_path specifies the path to the verticalizer script. The variables input_dir, output_dir and log_path specify the paths to the input, output and log folders. The variable no_log turns logging on and off. The variable Stoplist_path specifies the path to the file containing the list of stop words.

Launch example:

 ./processing_steps/run_local/run.py -c ~/config.ini -s ~/servers.txt -p vert
        

3.3.2 Deduplication

The servers and then the workers are launched sequentially on the port specified in the config file. The variables in the [dedup] section need to be set. The folder containing run.py must also contain the script dedup_handler.py, which is responsible for launching the servers and workers.

Variables:

 exe_path          path to folder containing files server.py and deduplicate.py
 bin_path          path to folder containing executable files
 input_path        path to folder containing verticals
 output_path       path to output folder
 map_file          path to hashmap.conf file
 log_path          path to folder containing logs
 progress_tester   path to file dedup_check.py
 hash_path         path to folder with hash files
 port              port where servers and workers run
 dropped           True or False, responsible for -dr --dropped
 droppeddoc        True or False, responsible for -dd --droppeddoc
 debug             True or False, responsible for -d --debug
 neardedup         True or False, responsible for -n --near
        
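
An illustrative sketch of the [dedup] section (all paths, the port and the flag values are hypothetical examples):

 [dedup]
 exe_path        = /mnt/data/commoncrawl/software/dedup/
 bin_path        = /mnt/data/commoncrawl/software/dedup/bin/
 input_path      = /mnt/data/commoncrawl/CC-RRRR-MM/vert/
 output_path     = /mnt/data/commoncrawl/CC-RRRR-MM/dedup/
 map_file        = /mnt/data/commoncrawl/software/dedup/hashmap.conf
 log_path        = /mnt/data/commoncrawl/CC-RRRR-MM/logs/dedup/
 progress_tester = /mnt/data/commoncrawl/software/dedup/dedup_check.py
 hash_path       = /mnt/data/commoncrawl/CC-RRRR-MM/hashes/
 port            = 12321
 dropped         = True
 droppeddoc      = False
 debug           = False
 neardedup       = False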

Launch example:

 python3 ./processing_steps/run_local/run.py -c ~/config.ini -s ~/servers.txt -p dedup
        

Checking the progress of deduplication:

 python3 ./processing_steps/run_local/run.py -c ~/config.ini -s ~/servers.txt -p dedup -a progress
        

3.3.3 Tagging

Settings are in the [tag] section of the config file. Variables exe_path, input_path, output_path and log_path have the same meaning as in verticalization. Additional variables are remove_uri, which turns deletion of URIs from links on and off, and ttagger_path, which specifies the path to the TreeTagger installation directory.

Launch example:

 ./processing_steps/run_local/run.py -c ~/config.ini -s ~/servers.txt -p tag
        

3.3.4 Parsing

Settings are in the [pars] section of the config file. Variables exe_path, input_path, output_path and log_path have the same meaning as in verticalization and tagging. Variable config_path specifies the path to the config file for parsing.

Launch example:

 ./processing_steps/run_local/run.py -c ~/config.ini -s ~/servers.txt -p pars
        

3.3.5 SEC

Settings are in the [sec] section of the config file. It contains the same variables as verticalization, tagging and parsing. One new variable, config_path, specifies the path to the file containing the SEC queries in JSON.

Launch example:

 ./processing_steps/run_local/run.py -c ~/config.ini -s ~/servers.txt -p sec
        

3.3.6 Creating and populating shards

Script to prepare data for indexing. The required number of collections is created in the output folder of each server, and the files in MG4J format (SEC output) are distributed equally among these collections (see the sketch below). The [shards] section of the config file contains variables for the input and output folders as well as the number of required collections (based on the number of CPU cores). The output folder is not required.
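
The even distribution of the MG4J files into the collections can be illustrated by the following Python sketch. It is only a simplified illustration, not the actual shards script; the file suffix and the collection folder names are assumptions:

 import os
 import shutil

 def populate_shards(input_dir, output_dir, collections):
     """Distribute MG4J files round-robin into the given number of collections."""
     # the .mg4j suffix and the col00, col01, ... folder names are hypothetical
     files = sorted(f for f in os.listdir(input_dir) if f.endswith('.mg4j'))
     for i, name in enumerate(files):
         shard = os.path.join(output_dir, 'col%02d' % (i % collections))
         os.makedirs(shard, exist_ok=True)
         shutil.copy(os.path.join(input_dir, name), shard)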

Launch example:

 ./processing_steps/run_local/run.py -c ~/config.ini -s ~/servers.txt -p shards
        

3.3.7 Indexing

Runs the indexing of the collections. The input folder contains all the collections; the output is saved to a single folder (final). Variables can be changed in the [index] section of the config file.

Launch example:

 ./processing_steps/run_local/run.py -c ~/config.ini -s ~/servers.txt -p index
        

3.3.8 Launching daemon for indexes

Settings are in the [index_daemon] section of the config file. Variable exe_path specifies the path to the indexer, input_path the path to the final folder on which the daemon runs, and log_path the path to the log folder. If log_path is set to "/", no logs are saved. Variable port_number specifies the port on which the daemon runs, and config_path the path to the config file for the indexer.
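
An illustrative sketch of the [index_daemon] section (the port number and all paths are hypothetical examples):

 [index_daemon]
 exe_path    = /mnt/data/commoncrawl/software/mg4j/
 input_path  = /mnt/data/indexes/CC-RRRR-MM/final
 # setting log_path = / would disable logging
 log_path    = /mnt/data/indexes/CC-RRRR-MM/logs/
 port_number = 12000
 config_path = /mnt/data/commoncrawl/software/mg4j/indexer.config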

Launch example:

 ./processing_steps/run_local/run.py -c ~/config.ini -s ~/servers.txt -p index_daemon
        

3.3.9 Checking outputs after each step

Scripts for checking the output files and their content after each step of processing. The links from the input, output and log files are loaded and their occurrences are compared across these files; if a mismatch is detected, an error is printed to STDOUT or to a file (a rough illustration follows this paragraph). The newest version can be found on the "test_scripts_xgrigo02" branch in the folder processing_steps/step_check/.
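
A rough Python illustration of the comparison (this is not the actual step_check.py; the assumption that every document carries its link in a header line such as <doc ... url="..."> is hypothetical):

 import glob
 import os
 import re

 # hypothetical pattern for the document link in a header line
 LINK_RE = re.compile(r'<doc[^>]*\burl="([^"]+)"')

 def links_in(folder):
     """Collect all document links found in the files of a folder."""
     found = set()
     for path in glob.glob(os.path.join(folder, '*')):
         if os.path.isfile(path):
             with open(path, errors='replace') as f:
                 found.update(LINK_RE.findall(f.read()))
     return found

 def compare(input_dir, output_dir):
     """Print links that are present in the input but missing in the output."""
     for link in sorted(links_in(input_dir) - links_in(output_dir)):
         print('mismatch, link missing in output:', link)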

Usage:

 python3 step_check.py -i IN_PATH -o OUT_PATH -l LOGS_PATH -t STEP -s [-c PATH_TO_CONFIG]
        

Parameters:

 -c PATH, --config PATH    Path to config file. 
                                    If a config path is specified, arguments -i, -o and -l
                                    will be ignored; their values will be obtained from the config.
 -t STEP, --target STEP    Step for script to test. 
                                    Possible STEP values:
                                     '2' or 'vert'
                                     '3' or 'dedup'
                                     '4' or 'tag'
                                     '5' or 'pars'
                                     '6' or 'sec'
                                     '7' or 'index'
                                    Default value is 'vert'
 -i IN_PATH, --input_path   Path with input files
 -o OUT_PATH, --output_path Path with output files
 -l LOG_PATH, --log_path    Path with log files to check. The test output will also
                                      be saved there if the -s option is specified
 -s, --save_out             Enable output writing to log file
 -h, --help                 Prints this message
 --inline                   Enables inline output
        

You can either launch the script with parameters -i, -o and -l, or set the path to a config file, in which case all required settings are read from the config and the arguments -i, -o and -l are ignored.

When parameter -s is specified, the output of the scripts is saved to files with the suffixes .test_out or .main_out. When the testing of a pair of folders finishes, another file with the suffix .tested is created, indicating that the scripts finished without an error.

To run using the run_local script, you need to set exe_path in [test] section of config file. This file can be found in the corpora_processing_sw/processing_steps/run_local/ folder.

Launch example using run.py:

 python3 run.py -a start -p test -e vert -l -c ~/config.ini -s ~/servers.txt
        

Manual launch example:

 python3 step_check.py -i ./warc_path -o ./vert_path -l ./logfiles/verticalization -t vert -s
        

alternatively:

 python3 step_check.py -c ./config.ini -t vert
        

4 Launching on Salomon


5 Launching on Salomon with new scripts


6 Launching on Salomon with new scripts v5


7 Data format

7.1 Manatee

A good example of the Manatee format can be downloaded here.

Susanne corpus
It differs from ours in that it has only 4 columns (we have 27). All tags that begin with < keep their format; nothing is transformed as in the case of MG4J. As for the necessary changes, the conversion only does not add GLUE as a token variant and does not generate things such as %%#DOC PAGE PAR SEN. In Manatee an underscore is used instead of an empty annotation (in MG4J it is 0). In addition, a Manatee configuration file, which defines the tags and determines the path to the vertical file, has to be created so that the corpus can be indexed with the program encodevert.
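
A minimal sketch of such a configuration (registry) file for encodevert; the paths are hypothetical and only a few of the 27 columns and structures are shown as placeholders:

 NAME "commoncrawl"
 PATH /mnt/data/manatee/commoncrawl/
 VERTICAL /mnt/data/manatee/commoncrawl.vert
 ENCODING utf-8
 ATTRIBUTE word
 ATTRIBUTE lemma
 ATTRIBUTE tag
 STRUCTURE doc {
     ATTRIBUTE uri
 }
 STRUCTURE p
 STRUCTURE s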

7.2 Elasticsearch

The ElasticSearch format used for semantic annotations looks as follows:

 Word[annotation1;annotation2...] and[annotation1;annotation2...] other[...;annotation26;annotation27] word[...;annotation26;annotation27]
        

The form of the annotations may be arbitrary; however, only alphanumeric characters and the underscore are allowed. At the moment each annotation has the form typeOfAnnotation_value.

Types of annotations:

position token tag lemma parpos function parword parlemma paroffset link length docuri lower nerid nertag param0 param1 param2 param3 param4 param5 param6 param7 param8 param9 nertype nerlength

The actual annotated text looks as follows:

 Word[position_1;token_Word...]
        
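
Parsing this token format is straightforward. The following Python sketch is only an illustration (it is not part of the processing pipeline); it splits each token into its annotation types and values:

 import re

 TOKEN_RE = re.compile(r'(\S+?)\[([^\]]*)\]')

 def parse_annotated(text):
     """Yield (word, {annotation_type: value}) for tokens like Word[position_1;token_Word]."""
     for word, raw in TOKEN_RE.findall(text):
         annotations = {}
         for item in raw.split(';'):
             if '_' in item:
                 # each annotation has the form typeOfAnnotation_value
                 a_type, value = item.split('_', 1)
                 annotations[a_type] = value
         yield word, annotations

 # prints ('Word', {'position': '1', 'token': 'Word'})
 print(next(parse_annotated('Word[position_1;token_Word]')))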

For semantic querying, standard Lucene queries are used (see the testing query in the project directory).

8 Preparation of new tagging and parsing

Two new scripts (revert.py, reparse.py) were created in order to use the new syntaxnet utility, which analyzes text and builds on well-known tools. Script revert.py converts a vertical into input suited for syntaxnet: every tag and link is moved to the end of the file and assigned a numerical identifier. These identifiers are then used by the reparse.py script, which restores the tags and links from the syntaxnet output.
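
The general idea of revert.py can be sketched in Python as follows; this is only a conceptual illustration, and the identifier and table format used by the real script may differ:

 def revert(input_file, output_file):
     """Conceptual sketch: move SGML-like tags and links out of the vertical,
     assign them numerical identifiers and append them at the end of the file."""
     tags = []
     with open(input_file) as fin, open(output_file, 'w') as fout:
         for number, line in enumerate(fin):
             if line.startswith('<'):                   # a tag or link line
                 # the line number serves as the numerical identifier here
                 tags.append('%d\t%s' % (number, line.rstrip('\n')))
             else:
                 fout.write(line)                       # plain tokens go to syntaxnet
         # the table at the end lets reparse.py restore the tags and links later
         fout.write('\n'.join(tags) + '\n')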

Script revert.py:

 ./processing_steps/5/syntaxnet/revert.py
        

Usage:

 ./processing_steps/5/syntaxnet/revert.py INPUT_FILE OUTPUT_FILE
        

Script reparse.py:

 ./processing_steps/5/syntaxnet/reparse.py
        

Usage:

 ./processing_steps/5/syntaxnet/reparse.py INPUT_FILE OUTPUT_FILE
        

9 Description of files in directory old

9.1 CC-2014-35

Scripts to download CommonCrawl 2014-35:

9.2 ccrawl_dloader

9.3 wet2vert

9.4 tt

Link to TreeTagger; run it using the following command:

 [path_to_corpproc/]tt/bin/tree-tagger -token -lemma -sgml -no-unknown \
 [path_to_corpproc/]tt/lib/english.par
        

or by script tagger.py from directory vert2ner.

9.5 mdparser

9.6 vert2ner

10 Processing of CommonCrawl 2014-35

10.1 Statistics

Quantity:

WARC: 43430.7952731 GB (46633461334359 bytes)

WAT: 14702.6299882 GB (15786828741075 bytes)

WET: compressed 5300.3036399 GB (5691157698008 bytes), uncompressed 12319.1 GB

Statistics after verticalization (old verticalizer):

Number of files: 52 849

Number of documents: 2 744 133 462

Number of paragraphs: 316 212 991 122

Number of sentences: 358 101 267 144

Number of words (tokens): 2 534 513 098 452