Processing is divided into several steps. A script was created for each step to simplify the work; the scripts take arguments and allow the processing to run on a single machine or in parallel on multiple machines.
All source code of the programs and scripts is in the repository corpora_processing_sw, and any missing libraries are in /mnt/minerva1/nlp/projects/corpproc. Anywhere else you might find old, non-functional versions!
The script reallocates data (files) between servers. Files are divided according to their size so that all servers hold a similar amount of data (a sketch of the idea follows the examples below). The parameter -a switches reallocation off and copies each file to all servers, which is particularly suitable for distributing processing programs.
./processing_steps/1/distribute.py
Usage:
./distribute.py [-i INPUT_DIRECTORY] -o OUTPUT_DIRECTORY -s SERVER_LIST [-a] [-e ERRORS_LOG_FILE]
-i --input input directory/file for distribution (if not set, file names are expected on stdin, separated by '\n')
-o --output output directory on the target server; if the directory doesn't exist, the script tries to create it
-a --all each file will be copied to all servers (suitable for scripts/programs)
-s --servers file containing the list of servers, one hostname per line; if a line contains a tabulator, it is treated as a separator and the text before it is used as the hostname (this keeps the file compatible with scripts in which the number of threads can be specified per machine, using the format HOSTNAME \t THREADS on each line)
-e --errors if set, errors are logged to this file with the current date and time
Examples:
./distribute.py -i ~/data_to_distribution/ -o /mnt/data/project/data/ -s ~/servers
# All files from the directory ~/data_to_distribution/ are reallocated between the servers listed in the file ~/servers.
# The output directory is /mnt/data/project/data/. If the data directory doesn't exist on some machine, the script tries to create it.
./distribute.py -i ~/NLP-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar -o /mnt/data/commoncrawl/software/ -s ~/servers -a
# Copies the NLP-slave program to the servers listed in the file ~/servers.
# The output directory is /mnt/data/commoncrawl/software/. If the directory "software" doesn't exist on a machine, the script creates it.
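The size balancing itself boils down to greedy assignment by file size. A minimal sketch of the idea (illustrative only, not the actual distribute.py implementation):

import os

def balance_files(files, servers):
    # Greedy size balancing: always give the next (largest) file to the
    # currently least-loaded server, so all servers end up with a similar
    # total amount of data.
    loads = {server: 0 for server in servers}
    plan = {server: [] for server in servers}
    for path in sorted(files, key=os.path.getsize, reverse=True):
        target = min(loads, key=loads.get)
        plan[target].append(path)
        loads[target] += os.path.getsize(path)
    return plan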
There is a program for extracting plain text from Wikipedia called WikiExtractor.
Launching:
cd /mnt/minerva1/nlp/corpora_datasets/monolingual/english/wikipedia/
./download_wikipedia_and_extract_html.sh 20151002
The script expects a file hosts.txt in the working directory containing the list of servers (one per line) on which it is supposed to run.
/mnt/minerva1/nlp/corpora_datasets/monolingual/english/wikipedia/tools/WikiExtractor.py
The output is a collection of files of approx. 100 MB each, distributed across the servers in /mnt/data/wikipedia/enwiki-.../html_from_xml/enwiki=...
To download WARC files you need to know the exact CommonCrawl crawl specification, e.g. "2015-18".
./processing_steps/1b/download_commoncrawl/dl_warc.sh
downloaded files are located in:
/mnt/data/commoncrawl/CC-Commoncrawl_specification/warc/
support files are located in:
/mnt/minerva1/nlp-2/download_commoncrawl/CC-Commoncrawl_specification/download/
Afterwards, URI statistics can be calculated using:
./processing_steps/1b/uri_stats.sh
result is saved in:
/mnt/minerva1/nlp-2/download_commoncrawl/CC-Commoncrawl_specification/uri/
data on individual machines are located in:
/mnt/data/commoncrawl/CC-Commoncrawl_specification/uri/
To collect URLs from given RSS sources, use:
./processing_steps/1c/collect.py
Usage:
./collect.py [-i INPUT_FILE] [-o OUTPUT_FILE [-a]] [-d DIRECTORY|-] [-e ERRORS_LOG_FILE]
-i --input input file containing RSS URLs separated by '\n' (if not set, they are expected on stdin)
-o --output output file for saving the parsed URLs (if not set, they are printed to stdout)
-a --append the output file is opened in append mode
-d --dedup deduplication of the obtained links (according to the matched URL); you can optionally enter a folder with files containing lists of already collected URLs, which are then included in the deduplication; in mode -a the output file is also included
-e --errors if set, errors are logged to this file with the current date and time
Examples:
./collect.py -i rss -o articles -a
# Appends the URLs found in the RSS sources listed in the file rss to the file articles
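Conceptually, collecting the article URLs from one RSS feed is just parsing the feed and reading the entry links. A rough sketch using the feedparser library (an assumption - collect.py may be implemented differently):

import feedparser

def links_from_rss(feed_url):
    # Parse the RSS feed and return the URLs of its entries.
    feed = feedparser.parse(feed_url)
    return [entry.link for entry in feed.entries if hasattr(entry, "link")]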
To download websites according to a given list of URLs and save them to a WARC archive, use:
./processing_steps/1c/download.py
Usage:
./download.py [-i INPUT_FILE] -o OUTPUT_FILE [-r REQUESTS] [-e ERRORS_LOG_FILE]
-i --input input file containing URLs (if not set, they are expected on stdin)
-o --output output file for saving the WARC archive
-r --requests limit on the number of requests per minute to one domain (default is 10)
-e --errors if set, errors are logged to this file with the current date and time
The script downloads sites evenly across domains, not in the order of the input file, to avoid what could look like an attack. A limit on requests per domain per minute can also be set. When the limit is reached for all domains, downloading is paused until the limit is restored. The limit is restored every 6 seconds (1/10 of a minute) by 1/10 of the total limit.
If an error occurs during the download (any response code other than 200), the whole domain is excluded.
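The per-domain limiting can be pictured as a budget of requests per domain that is topped up in tenths, every 6 seconds by 1/10 of the per-minute limit. A simplified sketch of that idea (illustrative only, not the actual download.py code):

from urllib.parse import urlparse

LIMIT = 10          # requests per domain per minute (the -r option)
budget = {}         # domain -> remaining requests in the current window

def may_download(url):
    # Returns True and charges the domain's budget, or False if the domain
    # has reached its limit and the URL should be retried later.
    domain = urlparse(url).netloc
    if budget.setdefault(domain, LIMIT) < 1:
        return False
    budget[domain] -= 1
    return True

def restore_budgets():
    # Called every 6 seconds: every domain regains 1/10 of the total limit.
    for domain in budget:
        budget[domain] = min(LIMIT, budget[domain] + LIMIT / 10.0)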
Example:
./download.py -i articles -o today.warc.gz
# downloads the sites whose URLs are listed in the file articles into the archive today.warc.gz
Vertical file format description
The input for the verticalization program is a warc.gz file. From this file the program unpacks individual records (documents), strips HTML, filters out non-English articles and performs tokenization. It can also process a single web page in .html and Wikipedia in preprocessed HTML. The output is saved with the extension .vert into the target folder.
New verticalizer:
./processing_steps/2/vertikalizator/main.py
Compile using the following command:
make
In case of error during compiling justextcpp (aclocal.m4), run the following commands in justextcpp/htmlcxx:
touch aclocal.m4 configure Makefile.in
./configure
Salomon may require loading modules (some of the following modules and their dependencies):
module load OpenMPI/1.8.8-GNU-4.9.3-2.25
module load Autoconf/2.69
module load Automake/1.15
module load Autotools/20150215
module load Python/2.7.9
module load GCC/4.9.3-binutils-2.25
If LZMA-compressed WARC output is required, backports.lzma module has to be installed:
pip install --user backports.lzma
Installation of this module can fail during compilation, which means the lzma library is not installed on the server. In that case run:
make lzma
The liblzma-dev package needs to be installed; use the command:
apt-get install -y liblzma-dev
Usage:
./main.py [-h] [-i INPUT] [-o OUTPUT] [-n] [-t INPUTTYPE] -s STOPWORDS [-l LOG] [-a WARCOUTPUT] [-d] [-m MAP]
-i --input input file for verticalization (if not set, WARC is expected on stdin)
-o --output output file (if not set, stdout is used)
-n --nolangdetect turns off language detection (speeds processing up with minimal difference in output)
-t --inputtype determines the expected input type, see Input types
-s --stopwords file with a list of stop words, required for boilerplate removal using the Justext algorithm
-l --log log file containing debugging information; STDOUT or STDERR can be given to print the debugging information to the respective output; if not present, no log is stored or displayed
-a --warcoutput path to the output WARC file; this file will contain the HTTP response records from the input whose contents were not completely removed by the Justext algorithm or language detection; no WARC output is generated if this parameter is not set; the parameter has no effect if the input is a single HTML file or a Wikipedia archive; it does not cancel the standard vertical output, so verticalization produces two outputs at the same time; compression is used if the filename ends with .gz (GZip) or .xz (LZMA)
-d --dedup the vertical output will be deduplicated; the file dedup_servers.txt has to be modified to contain the hostnames of the servers where the deduplication server processes should run; deduplication servers keep running even after the verticalization process ends, so they can be used by other verticalization processes, and have to be shut down manually
-m --map configuration map for deduplication (currently only in the hash redistribution branch)
Stdin input:
The verticalizer is faster with stdin input than with an input file. For example:
xzcat file | python main.py [-o OUTPUT] [-n] -s STOPWORDS [-l LOG] [-a WARCOUTPUT] [-d]
Warcreader is now located in the verticalizator folder.
Old warcreader
When upgrading to a newer version of the verticalizer it might be necessary to reinstall the warcreader package, which is developed together with the verticalizer but published on the Python Package Index as a standalone library. The package is installed into the current user's home directory by the make command and cannot be removed or upgraded by the pip utility. It has to be removed manually using:
rm -r ~/.local/lib/pythonX.X/site-packages/warcreader*
where X.X is the Python version, usually 2.7. Then you can install the new version of warcreader using pip.
The verticalizer can process several types of input. The expected input type must be specified with the -t/--inputtype argument. Possible values are:
warc - the verticalizer expects a WARC archive; the archive can be gzip-compressed, but in that case the .gz suffix is required in the input file name
wiki - the verticalizer expects a Wikipedia pages archive in the internal research group format
html - the verticalizer expects the HTML source code of a single web page
universal - the verticalizer expects input in the Universal Verticalization Format; the file can be gzip-compressed, but in that case the .gz suffix is required in the input file name
DO NOT USE: old verticalizer
Originally, NLP-slave (dependent on a local langdetect) was used for verticalization; its source code is in the repository:
./processing_steps/2/NLP-slave
This program expects a warc.gz file as input (first parameter), from which it gradually extracts individual records, strips HTML, filters out non-English articles and performs tokenization. It can also process a single web page in .html and Wikipedia in preprocessed plaintext or HTML. The result is saved with the extension .vert into the destination directory.
Compile using this command:
mvn clean compile assembly:single
Usage of old verticalizer:
java -jar package [opts] input_file.html output_dir file_URI [langdetect_profiles]
java -jar package [opts] input_file.txt output_dir [langdetect_profiles]
java -jar package [opts] input_file-warc.gz output_dir [langdetect_profiles]
java -jar package [opts] input_file output_dir [langdetect_profiles]
package - binary file of the program, e.g. ./processing_steps/2/NLP-slave/target/NLP-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar
opts - switches:
-d - turns on logging of debugging information
-z - sets the document URI in links to zero
-o - if set, there will be no document URI in links (the column disappears)
input_file.html - input file in uncompressed HTML (extension .html)
input_file-warc.gz - input file in compressed WARC format (name ends with warc and extension is .gz)
input_file.txt - input file from Wikipedia in preprocessed plaintext (extension .txt)
input_file - input file from Wikipedia in preprocessed HTML (extension is not .html, .txt, nor warc.gz)
output_dir - path to the output directory without a trailing slash
file_URI - URI of the file (it cannot be obtained directly from the file, because it is not WARC)
langdetect_profiles - optional parameter with the path to langdetect profiles; default is /usr/share/langdetect/langdetect-03-03-2014/profiles

Script for launching in parallel on multiple machines:
./processing_steps/2/verticalize.py
Usage:
./verticalize.py -i INPUT_DIRECTORY -o OUTPUT_DIRECTORY [-s SERVERS_FILE] [-b BINARY] [-t THREADS] [-l] [-d] [-w STOPWORDS_LIST] [-e ERRORS_LOG]
-i --input input directory containing files to verticalize
-o --output output directory; if it does not exist, the script attempts to create it
-s --servers file with the list of servers in the format HOSTNAME '\t' THREADS '\n' (one line per machine)
-e --errors if set, errors are logged to this file with the current date and time
-b --binary path to the verticalizer (default is ./processing_steps/2/vertikalizator/main.py)
-t --threads sets the number of threads (default is 6; thread counts given in the server list file have higher priority)
-l --no-lang turns off language detection
-d --debug the verticalizer prints debugging information
-w --stopwords file with a list of stop words (default is ./processing_steps/2/vertikalizator/stoplists/English.txt)
Example:
./verticalize.py -i /mnt/data/commoncrawl/CC-2015-18/warc/ -o /mnt/data/commoncrawl/CC-2015-18/vert/ -s ~/servers -b /mnt/data/commoncrawl/software/NLP-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar
# Data from the directory /mnt/data/commoncrawl/CC-2015-18/warc/ are verticalized into the directory /mnt/data/commoncrawl/CC-2015-18/vert/
# Verticalization is executed on all servers listed in the file ~/servers
# It uses the program /mnt/data/commoncrawl/software/NLP-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar, which has to be present on all machines
All profiling scripts can be found in the verticalizer folder profiling.
files_compare.py
Script for comparing two files. It takes two arguments, both paths to the files we want to compare.
Run example:
python files_compare.py testregularpuvodni.txt testregular11.vert
profiling.py
Script to simplify profiling. All variants to profile are uploaded to the folder. If there are multiple files of the same type, e.g. tokenizer.py, add "-" and a distinguishing suffix before the extension.
Example:
tokenizer-2.py or tokenizer-slouceniVyrazu.py
First, the script runs the original version of the verticalizer to get a baseline profiling result. It then takes the individual files, overwrites the tested file and runs verticalization. When verticalization ends, the original file is taken from the reset folder and written back into the verticalizer. (This is a precaution for the case when verticalize.py is tested first and then tokenizer.py, so that the edited verticalize.py is not used.)
Requirements: create a folder reset and copy the original files there. Then create a folder profiling, where the results will be saved. The script and both folders must be located in the verticalizator folder.
The script requires two arguments: the path to the input file and the file type.
Run example:
python profiling.py /mnt/data/commoncrawl/CC-2016-40/warc/1474738659833.43_20160924173739-00094-warc.xz warc
profiling_results.py
Script used to display the results. It requires three arguments: the folder with results, the sorting criterion and the number of items to display for each file.
Run example:
python profiling_result.py profiling tottime 20
Most important sorting options:
ncalls - the number of calls
tottime - the total time spent in the given function (excluding time spent in calls to sub-functions)
cumtime - the cumulative time spent in this function and all subfunctions (from invocation till exit); this figure is accurate even for recursive functions

Detailed documentation of deduplication can be found here.
The programs dedup and server are used for deduplication. Both can be compiled via the Makefile in the folder:
processing_steps/3/dedup/ in repository corpproc_dedup
Launch parameters:
./server -m=~/hashmap.conf [-i=INPUT_FILE] [-o=OUTPUT_FILE [-d]] [-s=STRUCT_SIZE] [-p=PORT] [-j=JOURNAL_FILE] [-k] [-a] [-P=RPORT]
-i --input input file
-o --output output file
-h --help lists the launch arguments and their use
-p --port server port (default 1234)
-s --size changes the size of the structure for saving hashes (default size is 300,000,000)
-d --debug a file containing debugging dumps is generated alongside the output file
-j --journal recovery of hashes from a journal file (use together with -j at the client)
-k --keep archives the journal file after successfully saving hashes to the output file
-m --map hash distribution map
-a --altered the distribution map has changed, enable hash migration
-P --rport port for hash migration (default -p + 1)
The server runs until it is "killed". Hashes are saved to the output file (if specified) in reaction to the signals SIGHUP and SIGTERM.
./dedup -i=INPUT_DIR -o=OUTPUT_DIR -s=SERVERS_STRING [-p=PORT] [-t=THREADS] [-n] [-d] [-wl] [-uf=FILTER_FILE]
-i --input input folder with files for deduplication
-o --output output folder; if it does not exist, the program attempts to create it
-p --port server port (default 1234)
-t --threads sets the number of threads (default 6)
-n --near uses the "nearDedup" algorithm
-d --debug for every output file .dedup, a file .dedup.debug containing debugging logs is generated
-wl --wikilinks deduplication of the WikiLinks format
-dr --dropped for each output file .dedup, a file .dedup.dropped containing the deleted duplicates is generated
-f --feedback records in the .dedup.dropped file contain a reference to the records responsible for their elimination (more below)
-dd --droppeddoc for each output file *.dedup, a .dedup.dd file containing the list of URL addresses of completely eliminated documents is created
-j --journal continue deduplication after a system crash - already processed files are skipped, unfinished ones are processed (use together with -j on the server)
-h --help prints help
-m --map hash distribution map
For easier launching there are the scripts server.py and deduplicate.py (described below), which allow running deduplication in parallel on multiple machines. The programs have to be pre-distributed to all machines in use and must be in the same location, e.g. /mnt/data/bin/dedup.
Launching servers for deduplication
With the argument start, the script first launches screens and then the servers inside them. With the argument stop, the script closes the screens. If neither start, stop nor restart is given, the script first checks whether the screens are running and then the servers.
./processing_steps/3/server.py
Usage:
./server.py [start|stop|restart|migrate] -m MAP_FILE [-i INPUT_FILE] [-o OUTPUT_FILE [-d]] [-a] [-t THREADS] [-p PORT] [-e ERRORS_LOG_FILE] [-b BINARY] [-d] [-j] [-k] [-r SPARSEHASH_SIZE] [-P RPORT]
start start servers
stop stop servers, wait for hash serialization
restart restart servers
migrate migrate hashes and exit
-i --input input file
-o --output output file
-t --threads number of threads containing workers (default is 384)
-p --port server port (default 1234)
-r --resize changes the size of the structure for saving hashes (default size is 300000000)
-e --errors if set, errors are logged to this file with the current date and time
-b --binary path to the deduplication server (default is /mnt/data/commoncrawl/corpproc/bin/server)
-d --debug a file containing debugging dumps is generated alongside the output file
-j --journal recover hashes from a journal file (suggested to use together with -j at the client)
-k --keep archives the journal file after successfully saving hashes to the output file
-v --valgrind runs the server in valgrind (emergency debugging)
-P --rport port for the hash migration service (default is -p + 1)
-E --excluded list of excluded servers for hash migration (old -> new)
-a --altered enables hash migration, see -a at the server
-m --map hash distribution map
Note: -j, --journal requires the path to the input file from which hashes should be recovered, i.e. the file that was specified as the output file before the server crash. The script verifies whether the file "input_file".backup exists. If the input file does not exist on a server, it will be created.
For example:
./server.py start -m ~/hashmap
# Launches screens on the machines listed in the file ~/hashmap and then launches servers in them
# The servers wait for workers to connect
./server.py -m ~/hashmap
# Tests whether screens and servers are running on the machines listed in the file ~/hashmap
./server.py stop -m ~/hashmap -E ~/excluded.list
# Closes screens and servers on the machines listed in the files ~/hashmap and ~/excluded.list
./server.py migrate -m ~/hashmap -E ~/excluded.list -i ~/input.hash -o ~/output.hash
# Runs hash migration according to the distribution map; when migration ends, the hashes are saved and the servers are stopped
Launching workers for deduplication
The workers must be launched with the same -m and -p parameters as the servers.
./processing_steps/3/deduplicate.py
Usage:
./deduplicate.py -i INPUT_DIRECTORY -o OUTPUT_DIRECTORY -w WORKERS_FILE -m MAP_FILE [-p PORT] [-e ERRORS_LOG_FILE] [-b BINARY] [-n] [-d] [-wl]
-i --input input folder with files for deduplication
-o --output output folder; if it does not exist, the script attempts to create it
-w --workers file with the list of workers in the format HOSTNAME '\t' THREADS '\n' (one line per machine) (beware - replacing tabulators with spaces will not work)
-p --port server port (default 1234)
-e --errors if set, errors are logged to this file with the current date and time
-b --binary path to the deduplication program (default is /mnt/data/commoncrawl/corpproc/bin/dedup)
-t --threads number of threads (default is 6; thread counts given in the worker list file have higher priority)
-n --near uses the "nearDedup" algorithm
-d --debug alongside each output file .dedup, a .dedup.dropped file containing the removed duplicates and a .dedup.debug file containing debugging dumps are generated
-wl --wikilinks deduplication of the WikiLinks format
-dr --dropped for each output file .dedup, a .dedup.dropped file containing the deleted duplicates is generated
-f --feedback records in the .dedup.dropped file contain a reference to the records responsible for their elimination (see below)
-dd --droppeddoc a file "droppedDocs.dd" containing the list of completely excluded documents is created in the output directory
-j --journal continue deduplication after a system crash - already processed files are skipped, unfinished ones are processed (use together with -j on the server)
-m --map hash distribution map
Deduplication of the WikiLinks format computes a hash over the concatenation of columns 2, 3, 5 and 6; all of these columns have to match for a row to be evaluated as a duplicate. In nearDedup, N-gram hashes are computed over the concatenation of columns 5, 3 and 6 (in this order), and then a hash of column 2 is computed; a row is considered a duplicate only if both of these match.
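A conceptual Python sketch of the column hashing described above (illustrative only - the real dedup binary is C++ and the hash function used here is just an assumption):

import hashlib

def wikilinks_hash(cols):
    # Exact WikiLinks dedup: one hash over columns 2, 3, 5 and 6 (1-based).
    key = "".join(cols[i - 1] for i in (2, 3, 5, 6))
    return hashlib.md5(key.encode("utf-8")).hexdigest()

def wikilinks_near_hashes(cols):
    # nearDedup: a hash over columns 5, 3, 6 (in this order) plus a hash of
    # column 2; a row counts as a duplicate only when both were seen before.
    ngram_key = "".join(cols[i - 1] for i in (5, 3, 6))
    return (hashlib.md5(ngram_key.encode("utf-8")).hexdigest(),
            hashlib.md5(cols[1].encode("utf-8")).hexdigest())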
For example:
./deduplicate.py -i /mnt/data/commoncrawl/CC-2015-18/vert/ -o /mnt/data/commoncrawl/CC-2015-18/dedup/ -w ~/workers -s ~/servers
# Data from the folder /mnt/data/commoncrawl/CC-2015-18/vert/ are deduplicated into the folder /mnt/data/commoncrawl/CC-2015-18/dedup/
# Deduplication runs on the machines listed in the file ~/workers
# Running servers on the machines listed in the file ~/servers are expected
This step only differs on Salomon because it deals with distributed computation. In standard execution the communication goes through sockets via standard TCP/IP. On Salomon this is not possible (InfiniBand), which is why the MPI library is used. Source codes are in:
/mnt/minerva1/nlp/projects/corpproc/corpora_processing_sw/processing_steps/salomon/mpidedup
It is highly recommended to use the module OpenMPI/1.8.8-GNU-4.9.3-2.25 due to compatibility issues with some other versions.
Compilation:
module load OpenMPI/1.8.8-GNU-4.9.3-2.25
make
Launching on Salomon
Deduplication is easiest to run using the v5/start.sh script; the launch parameters are set in the script. It is recommended to use 4 nodes:
bash start.sh dedup 4 qexp
Parameters:
-h --hash - relative number of servers holding particular subspaces of the total hashing space
-w --work - relative number of worker servers executing the deduplication itself
-i --input - input directory with files for deduplication
-o --output - output directory
-l --load - optional directory used to load existing hashes (for incremental deduplication - it obviously needs to be launched the same way (infrastructure, number of workers and servers with hashing space) as when the hashes were saved)
-s --store - optional directory to save hashes
-r --resize - optional parameter to change the size of the structure for saving hashes
-d --debug - turns on debugging mode (generates logs)
-p --dropped - generates a *.dropped file for each input vertical, containing the removed paragraphs
-c --droppeddoc - generates a *.dd file for each input vertical, containing the list of processed documents
-n --near - switches to the nearDedup algorithm
-j --journal - takes processed files and journals into consideration (unsaved hashes and unfinished files) - attempts to recover after a crash and continue the deduplication
-m --map - loads the distribution map from the KNOT servers and uses it
What to do when a crash occurs?
Examine what caused the crash, check the dedup.eXXXXX and dedup.oXXXXX files and run the process again.
When launching via v5/start.sh, which calls v5/dedup.sh after the environment parameters are set, the -j parameter is pre-set to ensure that the deduplication continues regardless of whether it ended successfully. Keep in mind that -j causes already processed verticals to be skipped; to start the process over from scratch, delete the contents of the output folder with deduplicated verticals. Recovery mode can be turned off in the v5/dedup.sh file.
How does recovery mode work?
Launching on KNOT servers:
mpiexec dedup -h H -w W -i ~/vert/ -o ~/dedup/ -s ~/hash/
More information can be found here.
Hashes distributed according to the hash map on the KNOT servers can be imported, used on Salomon and exported back. Use the scripts in processing_steps/salomon/migration.
Hash import:
python3 importhashes.py [-h] [-s SERVERS] [-m MAP] FILE TARGET
FILE path to the hash map on the KNOT servers, for example: /tmp/dedup/server.hashes
TARGET target folder on Salomon, for example: /scratch/work/user/$(whoami)/deduphashes
-h --help prints help
-s SERVERS path to a file containing the list of servers for import (1 hostname per line), for example: athena5
-m MAP path to the distribution map; an alternative to -s, the list of servers for import is extracted from the distribution map
Hash export:
python3 exporthashes.py [-h] SOURCE TARGET
SOURCE source folder, for example: /scratch/work/user/$(whoami)/deduphashes
TARGET target folder on the KNOT servers, for example: tmp/dedup/dedup.hash
Use of the imported hashes on Salomon
Set the MAP_PATH parameter when launching the process, for example in the file v5/start.sh:
MAP_PATH="/scratch/work/user/${LOGIN}/hashmap.conf"
Deduplication changes the number of hashholders based on the number of servers in the distribution map. If the distribution map contains 46 KNOT servers, then at least 47 cores are required (46 hashholders and 1 worker), optimally at least double that. The parameters NUM_HASHES and NUM_WORKERS are ignored in this case.
Tagging is executed via the program TT-slave (dependent on /opt/TreeTagger), which can be found in:
./processing_steps/4/TT-slave
Compile using the following command:
mvn clean compile assembly:single
Usage:
java -jar package [opts] input_file output_dir [treetagger.home]
package - binary file of the program, e.g. ./processing_steps/4/TT-slave/target/TT-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar
opts - switches:
-d - turns on listing of debugging information
-o - turns on deletion of the URI from links
input_file - input file in vertical format (extension .vert)
output_dir - path to the output directory without a trailing slash
treetagger.home - path to the TreeTagger installation directory, default is /opt/TreeTagger

Script for parallel execution on multiple servers:
./processing_steps/4/tag.py
Usage:
./tag.py -i INPUT_DIRECTORY -o OUTPUT_DIRECTORY [-s SERVERS_FILE] [-e ERRORS_LOG_FILE] [-b BINARY] [-t THREADS] [-d] [-u]
-i --input input directory with files for tagging
-o --output output folder; if it does not exist, the script attempts to create it
-s --servers file containing the list of servers in the format HOSTNAME '\t' THREADS '\n' (one line per machine)
-e --errors if set, errors are logged to this file with the current date and time
-b --binary path to the tagging .jar program (default is ./processing_steps/4/TT-slave/target/TT-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar)
-t --threads sets the number of threads (default is 6; thread counts given in the server list file have higher priority)
-d --debug debugging listing
-u --uri turns on deletion of the URI from links
Example:
./tag.py -i /mnt/data/commoncrawl/CC-2015-18/dedup/ -o /mnt/data/commoncrawl/CC-2015-18/dedup/tagged/ -s ~/servers
# tags files from the folder /mnt/data/commoncrawl/CC-2015-18/dedup/ and saves them to
# the directory /mnt/data/commoncrawl/CC-2015-18/dedup/tagged/ on the machines listed in the file ~/servers
Parsing is done by a modified MDParser, which can be found in:
./processing_steps/5/MDP-package/MDP-1.0
Compile by command:
ant make-mdp
Usage:
java -jar package [opts] input_file output_dir [path_to_props]
package - binary file of the program, e.g. ./processing_steps/5/MDP-package/MDP-1.0/build/jar/mdp.jar
opts - switches:
-o - turns on deletion of unnecessary document URIs from links
input_file - tagged input file (extension .tagged)
output_dir - path to the output directory without a trailing slash
path_to_props - path to the .xml file with program parameters (if omitted, the parameters from the file ./processing_steps/5/MDP-package/MDP-1.0/resources/props/props.xml are used; on the knot and athena servers the appropriate file is ./processing_steps/5/MDP-package/MDP-1.0/resources/props/propsKNOT.xml)

Important files, if something needs to be changed:
./processing_steps/5/MDP-package/MDP-1.0/src/de/dfki/lt/mdparser/test/MDParser.java ./processing_steps/5/MDP-package/MDP-1.0/src/de/dfki/lt/mdparser/outputformat/ConllOutput.java
Script for parallel execution on multiple servers:
./processing_steps/5/parse.py
Usage:
./parse.py -i INPUT_DIRECTORY -o OUTPUT_DIRECTORY [-s SERVERS_FILE] [-e ERRORS_LOG_FILE] [-b BINARY] [-t THREADS] [-u] [-p XML_FILE]
-i --input input directory with files for parsing
-o --output output directory; if it does not exist, the script attempts to create it
-s --servers file with the list of servers in the format HOSTNAME '\t' THREADS '\n' (one line per machine)
-e --errors if set, errors are logged to this file with the current date and time
-b --binary path to the parsing .jar program (default is ./processing_steps/5/MDP-package/MDP-1.0/build/jar/mdp.jar)
-t --threads sets the number of threads (default is 6; thread counts given in the server list file have higher priority)
-u --uri turns on deletion of the URI from links
-p --props path to the .xml file with program parameters (default is ./processing_steps/5/MDP-package/MDP-1.0/resources/props/propsKNOT.xml)
Examples:
./parse.py -i /mnt/data/commoncrawl/CC-2015-18/tagged/ -o /mnt/data/commoncrawl/CC-2015-18/parsed/ -s ~/servers
# parses files from the folder /mnt/data/commoncrawl/CC-2015-18/tagged/ into
# the folder /mnt/data/commoncrawl/CC-2015-18/parsed/ on the servers listed in the file ~/servers
./parse.py -i /mnt/data/commoncrawl/CC-2015-18/tagged/ -o /mnt/data/commoncrawl/CC-2015-18/parsed/ -s ~/servers \
-p ./processing_steps/5/MDP-package/MDP-1.0/resources/props/propsKNOT.xml
# parses files from the folder /mnt/data/commoncrawl/CC-2015-18/tagged/ into
# the folder /mnt/data/commoncrawl/CC-2015-18/parsed/ on the machines listed in the file ~/servers
# and hands the config file propsKNOT.xml to MDParser (it must be available on all machines it runs on)
SEC (see SEC) is used for recognition of named entities. The client sec.py can be run in parallel on multiple servers using:
./processing_steps/6/ner.py
Usage:
./ner.py -i INPUT_DIRECTORY -o OUTPUT_DIRECTORY [-s SERVERS_FILE] [-e ERRORS_LOG_FILE] [-b BINARY]
-i --input input directory
-o --output output directory
-s --servers file with the list of servers, one hostname per line; if a line contains a tabulator, it is treated as a separator and the text before it is used as the hostname (this keeps the file compatible with scripts in which the number of threads can be specified per machine, using the format HOSTNAME \t THREADS on each line)
-e --errors if set, errors are logged to this file with the current date and time
-b --binary changes the path to the SEC client (default is /var/secapi/SEC_API/sec.py)
Example:
./ner.py -i /mnt/data/commoncrawl/CC-2015-18/parsed/ -o /mnt/data/commoncrawl/CC-2015-18/secresult/ -s ~/servers
# processes files from the folder /mnt/data/commoncrawl/CC-2015-18/parsed/ into
# the folder /mnt/data/commoncrawl/CC-2015-18/secresult/ on the machines listed in the file ~/servers
Example SEC configuration mapping an input vertical to an output vertical, annotating the vertical in the MG4J format:
{ "annotate_vertical": { "annotation_format": "mg4j", "vert_in_cols": [ "position", "token", "postag", "lemma", "parabspos", "function", "partoken", "parpostag", "parlemma", "parrelpos", "link", "length" ], "vert_out_cols": [ "position", "token", "postag", "lemma", "parabspos", "function", "partoken", "parpostag", "parlemma", "parrelpos", "link", "length" ] }
Introduction to MG4J: http://www.dis.uniroma1.it/~fazzone/mg4j-intro.pdf
Source files of a program designed for semantic indexing are located in directory:
./processing_steps/7/corpproc
Compile using this command:
mvn package
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
mkdir /mnt/data/indexes/CC-2015-18/collPart001
(an alternative configuration file can be passed with -c <CONFIG_FILE>):
java -jar processing_steps/7/corpproc/target/corpproc-1.0-SNAPSHOT-jar-with-dependencies.jar -i
java -jar processing_steps/7/corpproc/target/corpproc-1.0-SNAPSHOT-jar-with-dependencies.jar index /mnt/data/indexes/CC-2015-18/collPart001 /mnt/data/indexes/CC-2015-18/final
Script for parallel execution on multiple servers
The script starts the indexation: it creates 6 shards, fills them, creates a collection from them and then starts indexation on the collection. To do this, the argument start has to be specified. If it is not set, the status of the individual screens and programs is printed. The argument stop terminates the screens.
./processing_steps/7/index.py
Use of the script:
./index.py [start|stop] -i INPUT_DIRECTORY -o OUTPUT_DIRECTORY [-s SERVERS_FILE] [-e ERRORS_LOG_FILE] [-b BINARY]
-i --input input directory
-o --output output directory
-s --servers file containing the list of servers, one hostname per line; if a line contains a tabulator, it is treated as a separator and the text before it is used as the hostname (this keeps the file compatible with scripts in which the number of threads can be specified per machine, using the format HOSTNAME \t THREADS on each line)
-e --errors if set, errors are logged to this file with the current date and time
-b --binary path to the directory containing the .jar files (default is /mnt/data/wikipedia/scripts/mg4j/)
Examples:
./index.py -s ~/servers
# displays the status of the startup screens and the program on all servers from the file ~/servers
./index.py -i /mnt/data/commoncrawl/CC-2015-18/secresult/ -o /mnt/data/indexes/CC-2015-18/ -s ~/servers start
# runs indexation on the selected servers
./index.py -s ~/servers stop
# ends the screens, and thereby the processes, on all servers from the file ~/servers
The columns can be found here:
https://docs.google.com/spreadsheets/d/1S4sJ00akQqFTEKyGaVaC3XsCYDHh1xhaLtk58Di68Kk/edit#gid=0
Note: it is not clear whether the special XML characters in the columns should be escaped. The current implementation of SEC does not escape them (and any existing escaping is removed). tagMG4JMultiproc.py does not escape them either; on the other hand, it leaves escaping in the input (only partially - probably just in some columns).
If columns are changed, the following things have to be changed:
The daemon requires a .collection file in the folder with the index; the new indexer creates it automatically. It is, however, possible to use indexes created by the old indexer: the older .collection file is automatically converted to the new format. To move indexes or data files to a new location, simply change the path to the data files in the .collection file (JSON).
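Because the .collection file is JSON, relocating the data files can be scripted; a hedged sketch (the key name "files" is only a guess at the .collection structure):

import json

def retarget_collection(collection_file, old_prefix, new_prefix):
    # Rewrite the data-file paths stored in a .collection file after the
    # data has been moved; the "files" key is hypothetical.
    with open(collection_file) as f:
        coll = json.load(f)
    coll["files"] = [p.replace(old_prefix, new_prefix, 1) for p in coll["files"]]
    with open(collection_file, "w") as f:
        json.dump(coll, f, indent=2)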
Run using these commands:
cd /mnt/data/indexes/CC-2015-18/final
java -jar processing_steps/7/mg4j/corpproc-1.0-SNAPSHOT-jar-with-dependencies.jar serve /mnt/data/indexes/CC-2015-18/final
Script for parallel execution on multiple servers
Launches daemons replying to requests. They are launched in screens and the parameter start must be set, otherwise the script only prints the status of the screens and daemons. The parameter stop terminates the screens:
./processing_steps/7/daemon.py
Usage:
./daemon.py [start|stop|restart] -i INPUT_DIRECTORY [-s SERVERS_FILE] [-e ERRORS_LOG_FILE] [-b BINARY]
-i --input input directory (automatically adds /final/ to the end of the path)
-p --port port (default is 12000)
-s --servers file containing the list of servers, one hostname per line; if a line contains a tabulator, it is treated as a separator and the text before it is used as the hostname (this keeps the file compatible with scripts in which the number of threads can be specified per machine, using the format HOSTNAME \t THREADS on each line)
-e --errors if set, errors are logged to this file with the current date and time
-b --binary path to the directory containing the .jar files (default is /mnt/data/wikipedia/scripts/mg4j/)
Examples:
./daemon.py -i /mnt/data/indexes/CC-2015-18/final -s ~/servers start
# Runs screens, and in them daemons over the collection in the directory /mnt/data/indexes/CC-2015-18/final,
# on the machines listed in the file ~/servers
./daemon.py -s ~/servers
# displays the status of running screens and daemons on all machines listed in the file ~/servers
./daemon.py -s ~/servers stop
# terminates the screens on all machines listed in the file ~/servers
The source code of the program is in the directory:
./processing_steps/8a/mg4jquery
Compilation:
JAVA_HOME=/usr/lib/jvm/java-8-oracle/; mvn clean compile assembly:single
Launch examples:
JAVA_HOME=/usr/lib/jvm/java-8-oracle/; java -jar mg4jquery-0.0.1-SNAPSHOT-jar-with-dependencies.jar -h ../../servers.txt -m ../src/main/java/mg4jquery/mapping.xml -s ../src/main/java/mg4jquery/config.xml -q "\"was killed\""
JAVA_HOME=/usr/lib/jvm/java-8-oracle/; java -jar mg4jquery-0.0.1-SNAPSHOT-jar-with-dependencies.jar -h ../../servers.txt -m ../src/main/java/mg4jquery/mapping.xml -s ../src/main/java/mg4jquery/config.xml -q "1:nertag:person < 2:nertag:person" -c "1.nerid != 2.nerid"
For example, the second query returns documents that contain at least two different people. The servers file expects the address of a server with a port on each line, for example:
knot01.fit.vutbr.cz:12000
Preparation script:
./prepare_webGUI.py
Usage:
python ./prepare_webGUI.py -po PORT -pa PATH -n NAME [-a] [-e ERRORS_LOG_FILE]
-po --port - port where the database will be launched
-pa --path - path to the directory where the database will be created
-n --name - name of the database
-a --artifacts - installs the required artifacts
-e --errors - error log file; logs are saved with time and date, if not specified the logs are printed to STDERR

Example:
python ./prepare_webGUI.py -po 9094 -pa ./src/main/webapp/WEB-INF/ -n users -a # installs required artifacts, compiles source files and sets a database "mem:users" up in directory ./src/main/webapp/WEB-INF/ on port 9094
Running on port 8086:
mvn jetty:run -Djetty.port=8086
Termination:
./stop_webGUI.py
Usage:
python ./stop_webGUI.py #necessary to launch every time you terminate the web GUI
The subdirectory maven_deps/src contains the sources of custom GWT components for displaying dynamic tooltips. For successful retrieval of query results from the server, it is necessary to set the address options of the servers (in the usual format domain_name_of_server:port). You can set the number of results per page, the behaviour of info windows (dynamic - shown while hovering over an entity with the cursor; static - it stays visible until the user moves the cursor over another entity or otherwise changes the state of the application) and the display type (the default is corpus based, but you can switch to document based).
It is similar to MG4J; the semantic index automatically remaps queries to the same index, so there is no need to write, for example:
"(nertag:person{{nertag-> token}}) killed"
This request is enough:
"nertag:person killed"
Compared to MG4J there is an extension, "global constraints", which allows tagging a token and post-filtering on it.
1:nertag:person < 2:nertag:person 1.fof != 2.fof AND 1.nerid = 2.nerid
For example, this returns documents containing the same person in different forms (often a name and a coreference). When querying occurrences within the same sentence, you can use the difference operator.
nertag:person < nertag:person - _SENT_ nertag:person < nertag:person - _PAR_
These queries search for two persons within one sentence (or paragraph, respectively); the difference operator keeps only the parts of the text where the given token (here the sentence or paragraph boundary) does not occur.
There are a lot of indexes which can be queried:
position token tag lemma parpos function parword parlemma paroffset link length docuri lower nerid nertag person.name person.gender person.birthplace person.birthdate person.deathplace person.deathdate person.profession person.nationality artist.name artist.gender artist.birthplace artist.birthdate artist.deathplace artist.deathdate artist.role artist.nationality location.name location.country artwork.name artwork.form artwork.datebegun artwork.datecompleted artwork.movement artwork.genre artwork.author event.name event.startdate event.enddate event.location museum.name museum.type museum.estabilished museum.director museum.location family.name family.role family.nationality family.members group.name group.role group.nationality nationality.name nationality.country date.year date.month date.day interval.fromyear interval.frommonth interval.fromday interval.toyear interval.tomonth interval.today form.name medium.name mythology.name movement.name genre.name nertype nerlength
Warning: every query starting with a number or containing the character "*" must be surrounded by parentheses, otherwise it causes an error:
nertag:event ^ event.startdate: (19*)
Queries on attributes of a named entity should be combined with a query on the index nertag (the indexes overlap, so this saves space). In "global constraints" you can additionally use the index "fof", which is an abbreviation of "full occurrence from".
The script redistributes data (files) among the given servers in order to use the capacity of the disk arrays evenly.
Script location:
./processing_steps/1/redistribute.py
Usage:
./redistribute.py -i INPUT_DIRECTORY [-o OUTPUT_DIRECTORY [-d DISTRIB_DIRECTORY]] -s SERVER_LIST [-p RELATED_PATHS] [-x EXTENSION] [-r] [-m] [-e ERRORS_LOG_FILE]
-i --input - input directory for distribution
-o --output - output directory for the generated scripts
-d --distrib - output directory on all servers; the generated scripts are distributed to those servers
-s --servers - file containing the list of servers, one hostname per line
-p --paths - file with paths and extensions of related files in the format PATH \t EXTENSION
-x --extension - extension in INPUT_DIRECTORY (no extension by default)
-m --moves - print the movements on stdout
-e --errors - if set, errors are logged to this file with the current date and time

The final redistribution scripts have to be launched via parallel ssh and removed afterwards.
Example:
python /mnt/minerva1/nlp/projects/corpproc/corpora_processing_sw/processing_steps/1/redistribute.py -i /mnt/data/commoncrawl/CC-2015-14/warc -s /mnt/minerva1/nlp/projects/corpproc/corpora_processing_sw/processing_steps/servers.txt -m -o /home/idytrych/redistributionScripts -d /mnt/data/commoncrawl/software/redistributionScripts -x "-warc.gz" -p /home/idytrych/CC-2015-14-rel.txt >moves.txt
where CC-2015-14-rel.txt contains:
/mnt/data/commoncrawl/CC-2015-14/uri -warc.domain /mnt/data/commoncrawl/CC-2015-14/uri -warc.domain.srt /mnt/data/commoncrawl/CC-2015-14/uri -warc.netloc ...
Afterwards the scripts are launched in screen:
parallel-ssh -h servery_b9_idytrych.txt -t 0 -A -i "bash /mnt/data/commoncrawl/software/redistributionScripts/\$HOSTNAME.sh"
and removed:
parallel-ssh -h servery_b9_idytrych.txt -t 0 -A -i "rm /mnt/data/commoncrawl/software/redistributionScripts/\$HOSTNAME.sh"
Servers to which nothing will be moved have no script; an error is printed for them.
A running search engine is located on the server athena1; try it as an example.
Indexes are running on almost all servers (for this search engine). For Wikipedia you can restart the daemons like this (only the person who launched the daemons can actually restart them):
python daemon.py restart -i /mnt/data/indexes/wikipedia/enwiki-20150901/ -s /mnt/minerva1/nlp/projects/corpproc/serverLists/serversIW.txt -b /mnt/data/wikipedia/software/mg4j
For CC (in the order of incremental deduplication):
python daemon.py restart -i /mnt/data/indexes/CC-2015-32/ -s /mnt/minerva1/nlp/projects/corpproc/serverLists/serversIC32.txt -b /mnt/data/commoncrawl/software/mg4j -p 12001
python daemon.py restart -i /mnt/data/indexes/CC-2015-35/ -s /mnt/minerva1/nlp/projects/corpproc/serverLists/serversIC35.txt -b /mnt/data/commoncrawl/software/mg4j -p 12002
python daemon.py restart -i /mnt/data/indexes/CC-2015-40/ -s /mnt/minerva1/nlp/projects/corpproc/serverLists/serversIC40.txt -b /mnt/data/commoncrawl/software/mg4j -p 12003
python daemon.py restart -i /mnt/data/indexes/CC-2015-27/ -s /mnt/minerva1/nlp/projects/corpproc/serverLists/serversIC27.txt -b /mnt/data/commoncrawl/software/mg4j -p 12004
Indexes created on Salomon require symlinks - they contain paths such as /scratch/work/user/idytrych/CC-2015-32/mg4j_index.
The script repacks an input warc.gz file according to the respective vertical. Only records that are present in the input vertical are extracted from the original warc.gz file, and the result is saved to a warc.xz file. The script can be executed for a single warc.gz file and a specific vertical, or for a folder containing input warc.gz files and a folder containing verticals. No other combinations are allowed.
Script location:
./processing_steps/1/warc_from_vert/warc_from_vert.py
Launching:
-i --input input warc.gz file or folder
-v --vertical input vertical or folder
-o --output output folder where the results are saved
The script packs preprocessed Wikipedia into a warc.gz or warc.xz file. The input can be a single file or a folder containing the files to be processed.
Script location:
./processing_steps/1/pack_wikipedia/pack_wiki.py
Launching:
-i --input input file or folder
-o --output output folder
-d --date date of the wiki download, format: DD-MM-YYYY
-f --format output file format, gz or xz
The script network_monitor.py allows monitoring of the current CESNET link load. It can be launched as a standalone script or used as a Python library.
When used as a library, it offers the NetMonitor class, which can be given the name of the network to monitor. An instance of this class provides the get_status function, which returns a result containing the incoming and outgoing bits per second and the load in percent (in JSON).
When launched as a standalone script in a terminal, it prints the incoming/outgoing load in percent, or alternatively the worse of the two, according to the specified parameters.
The values are refreshed every minute; a more frequent refresh rate is pointless.
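Library use might look roughly like this (the constructor argument and the exact shape of the returned result are assumptions based on the description above):

from network_monitor import NetMonitor

monitor = NetMonitor("Telia")   # name of the link to monitor (assumed signature)
status = monitor.get_status()   # JSON with incoming/outgoing bits per second and load in %
print(status)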
Script location:
./processing_steps/1/network_monitor.py
Launching:
-i --datain displays the incoming load in percent
-o --dataout displays the outgoing load in percent
-w --worst displays the worse of the two loads in percent
-l --link link to be monitored; if not specified, the Telia link is monitored
Used for automated downloading of large amounts of CC data while keeping the network load in a certain range. The script network_monitor.py mentioned above is used to monitor the status of the network. The script is given a lower network load limit in percent and keeps the load between the specified value and a value 10 % higher. The input file is warcs.lst, or a part of it when downloading on multiple servers. The maximum number of downloading processes, in the range 1 to 15, can be specified via a parameter (see below). It is advised to use the script on fewer servers with more processes.
It was tested on Salomon when downloading CC-2016-44: the script ran on 4 servers with a maximum of 12 processes and the lower network load limit set to 55 %, keeping the network load between 55-65 %.
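The load-keeping behaviour can be thought of as a simple feedback loop around the monitor: add a download process while the load is below the lower limit, remove one when it rises more than about 10 % above it. A simplified sketch (illustrative only; the worker-pool methods and the status key used here are hypothetical):

import time

def keep_load_in_range(monitor, low_limit, pool, max_procs=12):
    # pool is assumed to expose active(), pending(), spawn() and stop_one().
    while pool.pending() or pool.active():
        load = monitor.get_status()["worst"]      # current load in %, key name assumed
        if load < low_limit and pool.active() < max_procs:
            pool.spawn()                          # load too low: add a downloader
        elif load > low_limit + 10 and pool.active() > 1:
            pool.stop_one()                       # load too high: remove a downloader
        time.sleep(60)                            # the values refresh once a minute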
Script location:
./processing_steps/1/cc_auto_download.py
Launching:
-i --input_file path to the input file warcs.lst or a part of it
-o --output_dir path to the output folder
-p --process number of downloading processes (1-15)
-l --limit lower network load limit in percent (10-80)
A program that extracts Wikipedia pages from a file in the ZIM format. The output is in the preprocessed Wikipedia format; it is saved to the output folder, split into multiple files of approx. 100 MB.
Program location:
/mnt/minerva1/nlp/corpora_datasets/monolingual/english/wikipedia/zim_data/zimlib-1.2
Compile using the make command; compilation creates the executable program zimdump, which can be found here:
/mnt/minerva1/nlp/corpora_datasets/monolingual/english/wikipedia/zim_data/zimlib-1.2/src/tools
Launch example:
./zimdump -a ~/output -J cs wikipedia_en_2016_05.zim
Parameters:
-a - output folder
-j - optional, specifies the language of the input file, used to generate correct URIs (default is en)

For the 4th version of the scripts:
Complete sequence for launching (not tested yet):
(the dump is labeled RRRRMMDD)
cd 1a
./download_wikipedia_and_extract_html.sh RRRRMMDD
cd ..
python ./1/distribute.py -i ./2/NLP-slave/target/NLP-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar -o /mnt/data/wikipedia/software/ -s servers.txt -a
python ./2/verticalize.py -i /mnt/data/wikipedia/enwiki-RRRRMMDD/html_from_xml/AA/ -o /mnt/data/wikipedia/enwiki-RRRRMMDD/vert/ -s servers.txt -b /mnt/data/wikipedia/software/NLP-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar
python ./1/distribute.py -i ./4/TT-slave/target/TT-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar -o /mnt/data/wikipedia/software/ -s servers.txt -a
python ./4/tag.py -i /mnt/data/wikipedia/enwiki-RRRRMMDD/vert/ -o /mnt/data/wikipedia/enwiki-RRRRMMDD/tagged/ -s servers.txt -b /mnt/data/wikipedia/software/TT-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar
python ./1/distribute.py -i ./5/MDP-package/MDP-1.0/build/jar/mdp.jar -o /mnt/data/wikipedia/software/ -s servers.txt -a
python ./5/parse.py -i /mnt/data/wikipedia/enwiki-RRRRMMDD/tagged/ -o /mnt/data/wikipedia/enwiki-RRRRMMDD/parsed/ -s servers.txt -b /mnt/data/wikipedia/software/mdp.jar
python ./6/ner.py -i /mnt/data/wikipedia/enwiki-RRRRMMDD/parsed/ -o /mnt/data/wikipedia/enwiki-RRRRMMDD/secresult/ -s servers.txt
python ./1/distribute.py -i ./7/mg4j/ -o /mnt/data/wikipedia/software/mg4j/ -s servers.txt -a
python ./7/index.py -i /mnt/data/wikipedia/enwiki-RRRRMMDD/secresult/ -o /mnt/data/indexes/enwiki-RRRRMMDD/ -s servers.txt start
Launch of daemons corresponding to queries:
python ./daemon.py -i /mnt/data/indexes/enwiki-RRRRMMDD/final -s servers.txt -b /mnt/data/wikipedia/software/mg4j/ start
Complete sequence for launching (not tested yet):
(the CC crawl is labeled RRRR-MM)
cd processing_steps/1b/download_commoncrawl
./dl_warc.sh RRRR-MM
cd ../..
python ./1/distribute.py -i ./2/NLP-slave/target/NLP-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar -o /mnt/data/commoncrawl/software/ -s servers.txt -a
python ./2/verticalize.py -i /mnt/data/commoncrawl/CC-RRRR-MM/warc/ -o /mnt/data/commoncrawl/CC-RRRR-MM/vert/ -s servers.txt -b /mnt/data/commoncrawl/software/NLP-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar
python ./1/distribute.py -i ./3/dedup/server -o /mnt/data/commoncrawl/software/dedup/ -s servers.txt -a
python ./1/distribute.py -i ./3/dedup/dedup -o /mnt/data/commoncrawl/software/dedup/ -s servers.txt -a
cd 3
parallel-ssh -h servers_only.txt -t 0 -i "mkdir /mnt/data/commoncrawl/CC-RRRR-MM/hashes/"
(to load hashes from a previous processing, use the parameter -i /mnt/data/commoncrawl/CC-RRRR-MM/hashes/)
python ./server.py start -s servers.txt -w workers.txt -o /mnt/data/commoncrawl/CC-RRRR-MM/hashes/ -b /mnt/data/commoncrawl/software/dedup/server
python ./deduplicate.py -i /mnt/data/commoncrawl/CC-RRRR-MM/vert/ -o /mnt/data/commoncrawl/CC-RRRR-MM/dedup/ -w workers.txt -s servers.txt -b /mnt/data/commoncrawl/software/dedup/dedup
python ./server.py stop -s servers.txt -w workers.txt -b /mnt/data/commoncrawl/software/dedup/server
cd ..
python ./1/distribute.py -i ./4/TT-slave/target/TT-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar -o /mnt/data/commoncrawl/software/ -s servers.txt -a
python ./4/tag.py -i /mnt/data/commoncrawl/CC-RRRR-MM/dedup/ -o /mnt/data/commoncrawl/CC-RRRR-MM/dedup/tagged/ -s servers.txt -b /mnt/data/commoncrawl/software/TT-slave-0.0.1-SNAPSHOT-jar-with-dependencies.jar
python ./1/distribute.py -i ./5/MDP-package/MDP-1.0/build/jar/mdp.jar -o /mnt/data/commoncrawl/software/ -s servers.txt -a
python ./5/parse.py -i /mnt/data/commoncrawl/CC-RRRR-MM/tagged/ -o /mnt/data/commoncrawl/CC-RRRR-MM/parsed/ -s servers.txt -b /mnt/data/commoncrawl/software/mdp.jar
python ./6/ner.py -i /mnt/data/commoncrawl/CC-RRRR-MM/parsed/ -o /mnt/data/commoncrawl/CC-RRRR-MM/secresult/ -s servers.txt
python ./1/distribute.py -i ./7/mg4j/ -o /mnt/data/commoncrawl/software/mg4j/ -s servers.txt -a
python ./7/index.py -i /mnt/data/commoncrawl/CC-RRRR-MM/secresult/ -o /mnt/data/indexes/CC-RRRR-MM/ -s servers.txt start
Launch of daemons corresponding to queries:
python ./daemon.py -i /mnt/data/indexes/CC-RRRR-MM/final -s servers.txt -b /mnt/data/commoncrawl/software/mg4j/ start
Run_local
Scripts used to launch processing on the KNOT servers in a uniform way. Processing is launched using the run.py script, which launches the required step according to the configuration file config.ini; this file has to be edited before processing. Set the path to the folder to be processed in the root_data parameter in the [shared] section. The other variables point to folders with scripts etc. and do not need to be changed. It is advised to copy this file, for example to your home directory, so the original remains unchanged.
On every server where the processing happens, a screen named USERNAME-proc-STEP_NUMBER is created. All screens can be terminated by running the script with the parameter -a kill.
Script location:
./processing_steps/run_local/run.py
Launching:
-c --config required parameter, specifies the path to the config file
-p --proc required parameter, specifies the processing step; options: vert, tag, pars, sec, index, index_daemon, shards or their numerical values (2, 4, 5, 6, 7, 8, 9); the value test can be used to test steps and has no numerical equivalent; when using it, the step must also be given in the -e parameter
-a --action optional, specifies the action to be executed; options: start, kill, check, progress, eval (default is start)
-s --servers optional, specifies the path to a file containing the list of servers used for processing; if not set, the processing is executed only locally
-t --threads optional, specifies the number of threads for processing
-e --examine required if the value of parameter -p is set to test
-l --logging optional, turns on logging of test results to files in the log folder (test_* and main_* files)
-h --help prints help
Settings are in the [vert] section of the config file. The variable exe_path specifies the path to the verticalizer script. The variables input_dir, output_dir and log_path specify the paths to the input, output and log folders. The variable no_log turns logging on and off. The variable Stoplist_path specifies the path to the file containing the list of stop words.
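An illustrative fragment of such a config.ini (the variable names follow the description above; the paths are placeholders):

[shared]
root_data = /mnt/data/commoncrawl/CC-2015-18

[vert]
exe_path = ./processing_steps/2/vertikalizator/main.py
input_dir = /mnt/data/commoncrawl/CC-2015-18/warc/
output_dir = /mnt/data/commoncrawl/CC-2015-18/vert/
log_path = /mnt/data/commoncrawl/CC-2015-18/log/vert/
no_log = False
Stoplist_path = ./processing_steps/2/vertikalizator/stoplists/English.txt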
Launch example:
./processing_steps/run_local/run.py -c ~/config.ini -s ~/servers.txt -p vert
Servers are launched sequentially, as well as workers, on the port specified in the config file. The variables in the [dedup] section need to be set. The folder containing run.py must also contain the script dedup_handler.py, which is responsible for launching the servers and workers.
Variables:
exe_path path to the folder containing the files server.py and deduplicate.py
bin_path path to the folder containing the executable files
input_path path to the folder containing the verticals
output_path path to the output folder
map_file path to the hashmap.conf file
log_path path to the folder containing the logs
progress_tester path to the file dedup_check.py
hash_path path to the folder with hashes
port port where the servers and workers run
dropped True or False, corresponds to -dr --dropped
droppeddoc True or False, corresponds to -dd --droppeddoc
debug True or False, corresponds to -d --debug
neardedup True or False, corresponds to -n --near
Launch example:
python3 ./processing_steps/run_local/run.py -c ~/config.ini -s ~/servers.txt -p dedup
Deduplication process verification:
python3 ./processing_steps/run_local/run.py -c ~/config.ini -s ~/servers.txt -p dedup -a progress
Settings are in the [tag] section of the config file. The variables exe_path, input_path, output_path and log_path have the same meaning as in verticalization. Additional variables are remove_uri, which turns deletion of URIs from links on and off, and ttagger_path, which specifies the path to the TreeTagger installation directory.
Launch example:
./processing_steps/run_local/run.py -c ~/config.ini -s ~/servers.txt -p tag
Settings are in the [pars] section of the config file. The variables exe_path, input_path, output_path and log_path have the same meaning as in verticalization and tagging. The variable config_path specifies the path to the config file for parsing.
Launch example:
./processing_steps/run_local/run.py -c ~/config.ini -s ~/servers.txt -p pars
Settings are in the [sec] section of the config file. It contains the same variables as verticalization, tagging and parsing. One new variable, config_path, specifies the path to the file containing the SEC queries in JSON.
Launch example:
./processing_steps/run_local/run.py -c ~/config.ini -s ~/servers.txt -p sec
Script to prepare for indexing. The required number of collections is created in the output folder of each server, and the files in MG4J format (SEC output) are distributed equally among these collections. The [shards] section of the config file contains variables for the input and output folders as well as the number of required collections (based on the number of CPU cores). An output file is not required.
Launch example:
./processing_steps/run_local/run.py -c ~/config.ini -s ~/servers.txt -p shards
Runs the collection indexing. The input folder contains all collections; the output is saved to a single folder (final). The variables can be changed in the [index] section of the config file.
Launch example:
./processing_steps/run_local/run.py -c ~/config.ini -s ~/servers.txt -p index
Settings are in the [index_daemon] section of the config file. The variable exe_path specifies the path to the indexer, input_path specifies the path to the final folder on which the daemon runs, and log_path specifies the path to the log folder. If log_path is set to "/", no logs are saved. The variable port_number specifies the port number on which the daemon runs, and config_path specifies the path to the config file for the indexer.
Launch example:
./processing_steps/run_local/run.py -c ~/config.ini -s ~/servers.txt -p index_daemon
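For orientation, a hedged sketch of the [index_daemon] section; all values are hypothetical examples (remember that setting log_path to "/" disables logging):
[index_daemon]
; hypothetical example values
exe_path = /mnt/data/project/software/indexer
input_path = /mnt/data/project/data/final
log_path = /mnt/data/project/logs/index_daemon
port_number = 9090
config_path = /mnt/data/project/software/indexer.conf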
Scripts that provide options for checking the output files and their content after each step of processing. Links from the input, output and log files are loaded and their occurrences are compared across the files. If a mismatch is detected, an error is printed to STDOUT or to a file. The newest version can be found on the "test_scripts_xgrigo02" branch in the folder processing_steps/step_check/.
Usage:
python3 step_check.py -i IN_PATH -o OUT_PATH -l LOGS_PATH -t STEP -s [-c PATH_TO_CONFIG]
Parameters:
-c PATH, --config PATH Path to the config file. If a config path is specified, the arguments -i, -o and -l are ignored and their values are obtained from the config.
-t STEP, --target STEP Step for the script to test. Possible STEP values: '2' or 'vert', '3' or 'dedup', '4' or 'tag', '5' or 'pars', '6' or 'sec', '7' or 'index'. The default value is 'vert'.
-i IN_PATH, --input_path Path with input files
-o OUT_PATH, --output_path Path with output files
-l LOG_PATH, --log_path Path with log files to check. The test output is also saved there if the -s option is specified.
-s, --save_out Enable writing the output to a log file
-h, --help Prints this message
--inline Enables inline output
You can either launch it with the parameters -i, -o and -l, or set the path to the config file, in which case all the required settings are read from there and the arguments -i, -o and -l are ignored.
When the parameter -s is specified, the output of the scripts is saved to files with the suffixes .test_out or .main_out. When the testing of a pair of folders has finished, another file with the suffix .tested is created. This indicates that the scripts finished without an error.
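For example, assuming the .tested marker files are created in the log folder, the folder pairs that already finished testing can be listed like this (the path is illustrative):
find ./logfiles/verticalization -name '*.tested'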
To run it using the run_local script, you need to set exe_path in the [test] section of the config file. This file can be found in the corpora_processing_sw/processing_steps/run_local/ folder.
Launch example using run.py:
python3 run.py -a start -p test -e vert -l -c ~/config.ini -s ~/servers.txt
Manual launch example:
python3 step_check.py -i ./warc_path -o ./vert_path -l ./logfiles/verticalization -t vert -s
alternatively:
python3 step_check.py -c ./config.ini -t vert
ls -1 /scratch/work/user/idytrych/warc | sed 's/\-warc.gz//' > ~/namelist
or
ls -1 /scratch/work/user/idytrych/wikitexts/ > ~/namelist
seq 14 > numtasks
bash createCollectionsList_single.sh 192
In the job scripts, replace #PBS -q qexp, for example, by #PBS -q qprod together with #PBS -A IT4I-9-16, and add the following: #PBS -l walltime=48:00:00
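After this change, the header of a job script would contain, for example, the following lines:
#PBS -q qprod
#PBS -A IT4I-9-16
#PBS -l walltime=48:00:00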
bash prepare_dl_warc_single.sh 2015-32
Download CC (manually on the login nodes):
/scratch/work/user/idytrych/CC-2015-32/download/dload_login1_all.sh
/scratch/work/user/idytrych/CC-2015-32/download/dload_login2_all.sh
/scratch/work/user/idytrych/CC-2015-32/download/dload_login3_all.sh
/scratch/work/user/idytrych/CC-2015-32/download/dload_login4_all.sh
ls /scratch/work/user/idytrych/CC-2015-32/warc/ | sed 's/\-warc.*//g' > ~/namelist
mv /scratch/work/user/idytrych/CC-2015-32/warc /scratch/work/user/idytrych/
wc -l namelist
(The printed number of lines in namelist is NNN - it will be needed below.)
(Compute NU = NNN / MU, where MU is the maximal number of nodes that can be used in parallel according to Salomon's documentation.)
seq NU > numtasks
qsub -N vert -J 1-NNN:NU vert.sh
qsub dedup.sh
qsub -N tag -J 1-NNN:NU tag.sh
qsub -N parse -J 1-NNN:NU parse.sh
secapi/SEC_API/salomon/v3/start.sh 3
(M is the number of collections - approximately 6 for each destination server.)
bash createCollectionsList_single.sh M
(It is assumed that there are enough nodes to process all collections in parallel, both with 24 and with 8 processes.)
seq 24 > numtasks
qsub -N createShards -J 1-M:24 createShards.sh
qsub -N populateShards -J 1-M:24 populateShards.sh
qsub -N makeCollections -J 1-M:24 makeCollections.sh
seq 8 > numtasks
qsub -N makeIndexes -J 1-M:8 makeIndexes.sh
(NP is the number of parts into which Wikipedia should be divided - that is, the number of destination servers.)
bash createWikiParts_single.sh NP
(Not tested part:)
bash download_and_split_wikipedia_single.sh 20150805
qsub -N extract_wikipedia extract_wikipedia_html.sh
ls -1 /scratch/work/user/idytrych/wikitexts/ > ~/namelist
wc -l namelist
(The printed number of lines in namelist is NNN - it will be needed below.)
(Compute NU = MAX(14, NNN / MU), where MU is the maximal number of nodes that can be used in parallel according to Salomon's documentation - it makes no sense to use less than 14 per node.)
qsub -N vertWiki -J 1-NNN:NU vertWiki.sh
qsub -N tagWiki -J 1-NNN:NU tag.sh
qsub -N parseWiki -J 1-NNN:NU parseWiki.sh
(Compute NUL = MAX(3, NNN / MUL), where MUL is the maximal number of nodes that can be used in parallel in qlong according to Salomon's documentation - it makes no sense to use less than 3 per node.)
secapi/SEC_API/salomon/v3/start.sh NUL
(M is the number of collections - one for each destination server.)
bash createCollectionsList_single.sh M
(It is assumed that there are enough nodes to process all collections in parallel, both with 24 and with 8 processes.)
seq 24 > numtasks
qsub -N createShards -J 1-M:24 createShards.sh
qsub -N populateShards -J 1-M:24 populateShards.sh
qsub -N makeCollections -J 1-M:24 makeCollections.sh
seq 8 > numtasks
qsub -N makeIndexes -J 1-M:8 makeIndexes.sh
ls -1 /scratch/work/user/idytrych/warc | sed 's/\-warc.gz//' > ~/namelist
or
ls -1 /scratch/work/user/idytrych/wikitexts/ > ~/namelist
bash prepare_dl_warc_single.sh 2015-32
Download CC (manually on the login nodes):
/scratch/work/user/idytrych/CC-2015-32/download/dload_login1_all.sh
/scratch/work/user/idytrych/CC-2015-32/download/dload_login2_all.sh
/scratch/work/user/idytrych/CC-2015-32/download/dload_login3_all.sh
/scratch/work/user/idytrych/CC-2015-32/download/dload_login4_all.sh
ls /scratch/work/user/idytrych/CC-2015-32/warc/ | sed 's/\-warc.*//g' > ~/namelist
mv /scratch/work/user/idytrych/CC-2015-32/warc /scratch/work/user/idytrych/
bash start.sh vert 10 qprod
(Can be repeated N times to add more nodes - instead of 10 it is possible to use 20 - but overall it is not advisable to use over 70 nodes.)
bash start.sh dedup 4 qprod
(Nodes cannot be added.)
bash start.sh tag 10 qprod
(Can be repeated N times to add more nodes - instead of 10 it is possible to use 20 - but overall it is not advisable to use over 40 nodes.)
bash start.sh parse 10 qprod
(Can be repeated N times to add more nodes - instead of 10 it is possible to use 25 - but overall it is not advisable to use over 90 nodes.)
bash start.sh sec 1 qprod
(Can be repeated N times to add more nodes - instead of 1 it is possible to use 50 - but the first time wait about 10 minutes after the first node for the build, and overall it is not advisable to use over 100 nodes.)
mkdir /scratch/work/user/idytrych/CC-2015-32/mg4j_index
mkdir /scratch/work/user/idytrych/CC-2015-32/mg4j_index/final
(MMM is the number of desired collections - the number of destination servers * 6 might be appropriate.)
bash startIndexing.sh cList MMM
bash startIndexing.sh cShards qprod /scratch/work/user/idytrych/CC-2015-32/mg4j_index
bash startIndexing.sh pShards qprod /scratch/work/user/idytrych/CC-2015-32/mg4j_index
bash startIndexing.sh colls qprod /scratch/work/user/idytrych/CC-2015-32/mg4j_index
bash startIndexing.sh indexes qprod /scratch/work/user/idytrych/CC-2015-32/mg4j_index
bash start.sh vert 1:f qprod
bash start.sh vert 50 qprod
(The first node is launched separately and it must be guaranteed that it runs for the whole processing time - alternatively use qlong.)
(Further nodes can be added at any rate and in unlimited quantity - although it does not make much sense to use over 150, the discs cannot handle it.)
bash start.sh dedup 4 qprod
bash start.sh tag 1:f qprod
bash start.sh tag 50 qprod
bash start.sh parse 1:f qprod
bash start.sh parse 50 qprod
bash start.sh sec 1:f qlong
bash start.sh sec 20 qprod
bash startCheckSec.sh qprod namelist 2 1 /scratch/work/user/idytrych/parsed /scratch/work/user/idytrych/secsgeresult
bash startCheckSec.sh qprod namelist 2 2 /scratch/work/user/idytrych/parsed /scratch/work/user/idytrych/secsgeresult
("2 1" is the number of parts and which part should be launched - with multiple parts the result is obtained faster; the number of parts is passed to the next command with -n.)
python remove.py -n 2 -s /scratch/work/user/idytrych/secsgeresult > rm.sh
sed "s/secsgeresult/sec_finished/;s/.vert.dedup.parsed.tagged.mg4j//" rm.sh > rmf.sh
bash rm.sh
bash rmf.sh
rm rm.sh
rm rmf.sh
bash create_restlist.sh namelist /scratch/work/user/idytrych/secsgeresult > restlist
rm /scratch/work/user/idytrych/counter/*
bash start.sh sec 1:f qlong restlist
bash start.sh sec 20 qprod restlist
bash startCheckIndexes.sh /home/idytrych/collectionlist /scratch/work/user/idytrych/CC-2015-32/mg4j_index "/home/idytrych/check_i.txt"
(Each line in check_i.txt should contain the correct number of index files, and no error log should contain an exception.)
ls -1 /scratch/work/user/idytrych/warc | sed 's/\-warc.gz//' > ~/namelist
or
ls -1 /scratch/work/user/idytrych/wikitexts/ > ~/namelist
bash prepare_dl_warc_single.sh 2015-32
Download CC (manually on the login nodes):
/scratch/work/user/idytrych/CC-2015-32/download/dload_login1_all.sh
/scratch/work/user/idytrych/CC-2015-32/download/dload_login2_all.sh
/scratch/work/user/idytrych/CC-2015-32/download/dload_login3_all.sh
/scratch/work/user/idytrych/CC-2015-32/download/dload_login4_all.sh
ls /scratch/work/user/idytrych/CC-2015-32/warc/ | sed 's/\-warc.*//g' > ~/namelist
mv /scratch/work/user/idytrych/CC-2015-32/warc /scratch/work/user/idytrych/
rm /scratch/work/user/idytrych/counter/*
bash start.sh vert 1:f qprod
bash start.sh vert 50 qprod
(The first node is launched separately and it must be guaranteed that it runs for the whole processing time - use qlong if needed.)
(Other nodes can be added at any rate and in unlimited number - although there is not much point in running over 150 of them, the discs would be the bottleneck.)
bash start.sh dedup 4 qprod
rm /scratch/work/user/idytrych/counter/*
bash start.sh tag 1:f qprod
bash start.sh tag 50 qprod
rm /scratch/work/user/idytrych/counter/*
bash start.sh parse 1:f qprod
bash start.sh parse 50 qprod
rm /scratch/work/user/idytrych/counter/*
bash start.sh sec 1:f qlong
bash start.sh sec 20 qprod
mkdir /scratch/work/user/idytrych/CC-2015-32/mg4j_index
mkdir /scratch/work/user/idytrych/CC-2015-32/mg4j_index/final
(MMM is the number of required collections - the advised number is the number of target servers * 6.)
bash startIndexing.sh cList MMM
bash startIndexing.sh cShards qprod /scratch/work/user/idytrych/CC-2015-32/mg4j_index
bash startIndexing.sh pShards qprod /scratch/work/user/idytrych/CC-2015-32/mg4j_index
bash startIndexing.sh indexes qprod /scratch/work/user/idytrych/CC-2015-32/mg4j_index
bash startCheckSec.sh qprod namelist 2 1 /scratch/work/user/idytrych/parsed /scratch/work/user/idytrych/secsgeresult
bash startCheckSec.sh qprod namelist 2 2 /scratch/work/user/idytrych/parsed /scratch/work/user/idytrych/secsgeresult
("2 1" is the number of parts and which part should be launched - running multiple parts gives the result faster; the following command receives the number of parts via -n.)
python remove.py -n 2 -s /scratch/work/user/idytrych/secsgeresult > rm.sh
sed "s/secsgeresult/sec_finished/;s/.vert.dedup.parsed.tagged.mg4j//" rm.sh > rmf.sh
bash rm.sh
bash rmf.sh
rm rm.sh
rm rmf.sh
bash create_restlist_single.sh namelist /scratch/work/user/idytrych/secsgeresult > restlist
rm /scratch/work/user/idytrych/counter/*
bash start.sh sec 1:f qlong restlist
bash start.sh sec 20 qprod restlist
bash startCheckIndexes.sh /home/idytrych/collectionlist /scratch/work/user/idytrych/CC-2015-32/mg4j_index "/home/idytrych/check_i.txt"
(Every line in check_i.txt should contain the correct number of index files, and no error log should contain an exception.)
A good example of the manatee format, the Susanne corpus, can be downloaded here.
This format differs from ours - it has only 4 columns (we have 27). All tags that begin with < keep this format; nothing is transformed, unlike in the case of MG4J. As for the necessary changes, it only does not add GLUE as a token variant and does not generate things such as %%#DOC PAGE PAR SEN. In manatee an underscore is used instead of an empty annotation (in mg4j it is 0). In addition, a manatee configuration file, which defines the tags and determines the path to the vertical file, has to be created so that indexing with the encodevert program is possible.
The ElasticSearch format used for semantic annotations looks as follows:
Word[annotation1;annotation2...] and[annotation1;annotation2...] other[...;annotation26;annotation27] word[...;annotation26;annotation27]
The form of the annotations may be arbitrary; however, only alphanumeric characters and the underscore are allowed.
At the moment the following format is used - each annotation has the form typeOfAnnotation_value.
Types of annotations:
position, token, tag, lemma, parpos, function, parword, parlemma, paroffset, link, length, docuri, lower, nerid, nertag, param0, param1, param2, param3, param4, param5, param6, param7, param8, param9, nertype, nerlength
The actual annotated text then looks as follows:
Word[position_1;token_Word...]
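For illustration, a token carrying several of the annotation types listed above might look as follows (all annotation values are hypothetical examples):
Word[position_1;token_Word;tag_NN;lemma_word;lower_word] and[position_2;token_and;tag_CC;lemma_and;lower_and]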
For semantic querying, typical Lucene queries are used (see the testing query in the project directory).
Two new scripts (revert.py and reparse.py) were created in order to use the new SyntaxNet utility for text analysis. The script revert.py converts a vertical file into input suitable for SyntaxNet: every tag and link is moved to the end of the file and assigned a numerical identifier. These identifiers are then used by the reparse.py script, which restores the tags and links from the SyntaxNet output.
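A hedged sketch of the whole round trip (the SyntaxNet invocation itself is only indicated as a comment, and all file names are illustrative):
./processing_steps/5/syntaxnet/revert.py input.vert input.forparse
# run SyntaxNet on input.forparse, producing syntaxnet.out
./processing_steps/5/syntaxnet/reparse.py syntaxnet.out output.parsed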
Script revert.py:
./processing_steps/5/syntaxnet/revert.py
Usage:
./processing_steps/5/syntaxnet/revert.py INPUT_FILE OUTPUT_FILE
INPUT_FILE - input file, expected to be the output of the verticalizer
OUTPUT_FILE - output file usable by SyntaxNet
Script reparse.py:
./processing_steps/5/syntaxnet/reparse.py
Usage:
./processing_steps/5/syntaxnet/reparse.py INPUT_FILE OUTPUT_FILE
INPUT_FILE - input file, expected to be the SyntaxNet output
OUTPUT_FILE - output file enriched with the tags and links
Scripts to download CommonCrawl 2014-35:
dload_warc.sh - downloading script for WARC. BEWARE: it is necessary to add the target file to the end of the line or to download the content of each directory separately.
dload_wat.sh - downloading script for WAT. BEWARE: it is necessary to add the target file to the end of the line or to download the content of each directory separately.
dload_wet.sh - downloading script for WET. BEWARE: it is necessary to add the target file to the end of the line or to download the content of each directory separately.
lister.sh - main script that downloads segment.paths, creates the files dload_*.sh, list.*.sh and *.lst, and computes the file sizes for WARC, WAT and WET
list.WARC.sh - script creating the WARC listing
list.WAT.sh - script creating the WAT listing
list.WET.sh - script creating the WET listing
segment.paths - downloaded file with the description of segments
warcs.lst - listing of WARC files
wats.lst - listing of WAT files
wets.lst - listing of WET files
dl_warc.sh - gradually launches lister.sh to create the listing, then creator.py with the file warcs.lst, and starts the downloading on the machines listed in servers.cfg
creator.py - creates the downloading scripts; one of the w*.lst files is given on standard input, the argument is the path for downloading, and the list of machines is given in servers.cfg
servers.cfg - list of machines in the format name_of_machine number_of_threads coefficient (everything separated by a single space); the number of threads and the coefficient are integers. The coefficient determines the proportion that will be downloaded (e.g. if there are 3 machines with coefficients 2, 1, 1, then 2/4 = 1/2 of all files is downloaded on the first machine and 1/4 on each of the other two). An example is shown below, after the TreeTagger commands.
wet2vert1.py - script for processing vertical files from WET; the arguments are the input directory, the output directory and the number of processes
There is also a link to TreeTagger; run it using the following commands:
[path_to_corpproc/]tt/bin/tree-tagger -token -lemma -sgml -no-unknown [path_to_corpproc/]tt/lib/english.par
or by the script tagger.py from the directory vert2ner.
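Returning to servers.cfg from the list above, a hedged example with hypothetical machine names, matching the coefficients 2, 1, 1 described there:
knot01 8 2
knot02 4 1
knot03 4 1
# knot01 downloads 2/4 = 1/2 of all files, knot02 and knot03 download 1/4 each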
parser.py - divides the files between threads and then runs MDParser on every group of files
mdp.jar - main file, run by java -Xmx1g -jar mdp.jar props.xml (props.xml is the config file)
props1.xml - example config file. WARNING: do not change the path nor the content of this file, otherwise the parser.py script will not work.
tagger.py - TreeTagger, the arguments are the input directory, the output directory and the number of processes
Quantity:
WARC: 43430.7952731 GB (46633461334359 bytes)
WAT: 14702.6299882 GB (15786828741075 bytes)
WET: compressed 5300.3036399 GB (5691157698008 bytes), uncompressed 12319.1 GB
Statistics after verticalization (old verticalizer):
Number of files: 52 849
Number of documents: 2 744 133 462
Number of paragraphs: 316 212 991 122
Number of sentences: 358 101 267 144
Number of words (tokens): 2 534 513 098 452