The goal of this project was to develop a program that extracts Wikipedia links from the ClueWeb and CommonCrawl data sets, similar to those in the Wikilinks data set. Another goal was to evaluate the speed of the developed tool. The entire functionality is packed into the wikilinks.py script. A detailed description can be found below.
wikilinks.py description
Since the data type and charset have to be determined for every single WARC record, records whose type is not application/http;msgtype=response or application/https;msgtype=response are excluded from the analysis. Each record that meets this requirement should contain an HTTP header and the specified type of content. The only analyzed content type is text/html, as content such as image/jpeg is not expected to contain links.
The charset of a document can sometimes be determined from the HTTP header. The script also searches for a charset declaration in the body of the HTTP record (more specifically in the HTML meta tags). A charset from the HTML has higher priority than one from the HTTP header. If the charset cannot be determined at all, the default HTTP charset ISO-8859-1 is used. Characters that cannot be converted to UTF-8 are excluded.
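The priority order can be illustrated with a short Python sketch (a minimal illustration only; the regex and function names are not the actual ones used in wikilinks.py):

import re

META_CHARSET_RE = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)
HEADER_CHARSET_RE = re.compile(r'charset=([\w-]+)', re.IGNORECASE)

def resolve_charset(http_content_type, html_body):
    # HTML meta tag wins over the HTTP header; default is ISO-8859-1
    charset = "ISO-8859-1"
    header_match = HEADER_CHARSET_RE.search(http_content_type or "")
    if header_match:
        charset = header_match.group(1)
    meta_match = META_CHARSET_RE.search(html_body or b"")
    if meta_match:
        charset = meta_match.group(1).decode("ascii", "ignore")
    return charset

def decode_body(html_body, charset):
    # characters that cannot be converted are dropped
    return html_body.decode(charset, errors="ignore")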
The script then looks for links, using regular expressions. Only HTML links (<a> elements) are accepted. If a link to Wikipedia or Freebase is found, the script determines the associated information (see the output format below), which is saved to the output file together with the charset and the name of the source file. If any piece of information cannot be found, the program saves an empty string in its place; this behaviour can be changed in constants.py.
The program only saves Wikipedia links from the main namespace. This filter is based on the occurrence of ':' in the name of the Wikipedia page: if the name matches :[^_], the page is considered to belong to a different Wikipedia namespace. The filter is implemented this way for the sake of simplicity, because namespace names change with the language of the Wikipedia edition.
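A minimal sketch of this filter (the helper name is illustrative):

import re

# A colon followed by anything other than "_" marks a page outside the main
# namespace, e.g. "Category:France" or "File:Logo.png".
NON_MAIN_NAMESPACE_RE = re.compile(r':[^_]')

def is_main_namespace(page_name):
    # True when the Wikipedia page name looks like a main-namespace article
    return NON_MAIN_NAMESPACE_RE.search(page_name) is None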
The program can search for Freebase or Wikipedia links, using the parameters -f or -w respectively.
Links can also be filtered to a specific language version of Wikipedia: -e for English and -l for other versions. The program determines the language of a given Wikipedia page from the third- (or fourth-) level domain.
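A sketch of this domain-based language detection (an illustration only, assuming the usual en.wikipedia.org / en.m.wikipedia.org host layout):

from urllib.parse import urlparse

def wiki_language(url):
    # 'en' for en.wikipedia.org, and also for en.m.wikipedia.org (fourth-level domain)
    labels = urlparse(url).netloc.lower().split('.')
    if len(labels) >= 3 and labels[-2] == 'wikipedia':
        lang = labels[-3]
        if lang == 'm' and len(labels) >= 4:
            lang = labels[-4]
        return lang
    return ''

# wiki_language("http://en.wikipedia.org/wiki/KPMG")  ->  'en'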
If a file containing Wikipedia redirect data is available, the program can save the URL of the redirect target rather than the URL found in the WARC document; this is enabled with the -r parameter.
The program saves logs to a file named <output_file>.log.
The number of words saved before and after the link can be changed in constants.py.
The wikilinks.py script can now process VERT files as well as WARC files. The script is launched the same way; it determines the file type on its own and processes it accordingly.
A VERT file contains verticalized data, where tokens (words) are listed one per line, separated by \n. A record starts with a <doc ...> tag (which carries the id, url and title of the document) and ends with </doc>. Individual links in the document are marked by <link="wikipedia/freebase/other url"/>; the element <length="number"/> on the same line gives the number of tokens (words) that the link text consists of. These tokens are located immediately before the <link> tag. Regular expressions are used to process this format.
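A hedged sketch of such regex-based processing of a VERT record (the exact quoting of the tags in real VERT files may differ):

import re

LINK_RE = re.compile(r'<link="([^"]+)"/>')
LENGTH_RE = re.compile(r'<length="(\d+)"/>')

def iter_vert_links(lines):
    # yields (url, anchor_text); the anchor text is taken from the tokens
    # located directly before the <link=.../> line, as described above
    tokens = []
    for line in lines:
        line = line.rstrip('\n')
        link = LINK_RE.search(line)
        if link:
            length = LENGTH_RE.search(line)
            n = int(length.group(1)) if length else 0
            anchor = ' '.join(tokens[-n:]) if n else ''
            yield link.group(1), anchor
        elif line and not line.startswith('<'):
            tokens.append(line)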
Used for Wikipedia addresses and their normalization. An address that originally contained the CUID of an article, e.g. http://en.wikipedia.org/wiki/index.html?curid=75960, has its .../wiki/ part replaced with the name of the article, e.g. https://en.wikipedia.org/wiki/KPMG. To activate this function, launch the program with the parameter -p followed by a .tsv file where each line contains a CUID and a title (e.g. "75960\tKPMG"). This file is loaded into memory upon launch and the CUID suffix in the URL is replaced with the corresponding article title (a sketch of this normalization follows the path below).
Path to file:
/mnt/minerva1/nlp/projects/decipher_wikipedia/redirectsFromWikipedia/enwiki-latest-page.sql
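A minimal sketch of this normalization (illustrative names; the real implementation lives in wikilinks.py):

import csv

def load_page_table(tsv_path):
    # CUID -> article title, e.g. {'75960': 'KPMG'}
    with open(tsv_path, encoding='utf-8') as f:
        return {row[0]: row[1] for row in csv.reader(f, delimiter='\t') if len(row) >= 2}

def normalize_curid_url(url, page_table):
    # .../wiki/index.html?curid=75960  ->  .../wiki/KPMG (when the CUID is known)
    if '?curid=' not in url:
        return url
    curid = url.split('?curid=', 1)[1].split('&', 1)[0]
    title = page_table.get(curid)
    if title is None:
        return url
    return url.split('/wiki/', 1)[0] + '/wiki/' + title.replace(' ', '_')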
wikilinks.py parameters
Launch example:
wikilinks.py -i <input_file> -o <output_file> [-p <path>] [-v] [-w] [-f] [-e] [-l <language>[:<language>]] [-r <redirection_file>] [-s] [-d <disambiguation_file>] [-W] [-h] [-a <all_url_file>] [-k]
-i - specifies file containing a list of WARC/VERT files to be analysed; it should contain one record per line; the record should be in the format <[path]warc|vert_file_name>\t<number> or <[path]warc|vert_file_name>
-p - specifies a path prefix shared by all files from the input file
-o - specifies the name of the output file; the log file will be called <output_file_name>.log
-v - verbose mode
-w - searches for links to Wikipedia
-f - searches for links to Freebase
-e - only links to English Wikipedia will be recognized
-l - only links to a given language-specific Wikipedia will be recognized
-r - a tsv file containing redirections from English Wikipedia
-d - a tsv file containing disambiguation pages from English Wikipedia
-a - a tsv file containing all URLs from English Wikipedia
-W - filtering out documents from Wikipedia
-s - filtering out all contexts that are not referenced from anchor text similar to the Wikipedia page name
-I - filtering out URLs with a fragment identifier
-P - a tsv file containing the Wikipedia page table (CUIDs and their titles)
-h - prints help
-k - filtering out documents from unwanted domains {wikipedia|wikidata|wiktionary|wikimedia|wikisource|edwardbetts|instantchess|viaf|werelate|wmflabs|paheal|fallingrain|isni-url.oclc|blekko}
Parameter -p can be used to analyze ClueWeb12 when a file containing a list of WARC files with their relative paths is available.
Example launch for a single archive can be found here:
/mnt/minerva1/nlp/projects/wikilinks/test/test.sh
Wikilinks project source files can be found here:
/mnt/minerva1/nlp/projects/wikilinks/program/
What does each file do:
wikilinks.py - the program extracting Wikipedia and Freebase links
warcm.py - WARC file processing module
vert.py - VERT file processing module
httpm.py - HTTP header processing module
constants.py - constants
Output files are in TSV (Tab-Separated Values) format. The meaning of each column is as follows:
------+-------------------------------------------------
Number| Value
======+=================================================
    1 | original address
------+-------------------------------------------------
    2 | normalized address
------+-------------------------------------------------
    3 | link text (no entities)
------+-------------------------------------------------
    4 | original link text
------+-------------------------------------------------
    5 | context before link (no entities)
------+-------------------------------------------------
    6 | context after link (no entities)
------+-------------------------------------------------
    7 | encoding
------+-------------------------------------------------
    8 | address offset in WARC/VERT file
------+-------------------------------------------------
    9 | text offset in WARC/VERT file
------+-------------------------------------------------
   10 | WARC/VERT file name
------+-------------------------------------------------
   11 | URL of corresponding document in WARC/VERT file
------+-------------------------------------------------
Automatic launch script for Wikilinks can be found here:
/mnt/data/nlp/projects/wikilinks/runWikilinks.sh
Required launch parameters:
-l LOGIN - username
-s SOURCE_FOLDER - folder containing files to be processed
-d DESTINATION_FOLDER - folder for output files (has to exist already)
-w | -v - process WARC or VERT files
To launch successfully, set up login via SSH keys to the machines where the files will be processed and save their names to:
/mnt/data/nlp/projects/wikilinks/servers.txt
Script launch order:
list.sh, list_dirs.sh - create the list of files to be processed
split.sh - splits the list for multiple processes
prepare.sh - prepares directories for Wikilinks outputs
run.sh - runs the Wikilinks tool on the servers specified in servers.txt
copy.sh - copies output from the machines to minerva1.fit.vutbr.cz
Logs of the process are saved here:
/mnt/data/nlp/projects/wikilinks/autorunOutput.log /mnt/data/nlp/projects/wikilinks/autorunError.log
./runWikilinks.sh -l iotrusina -s /mnt/data/commoncrawl/CC-2015-06/ -d /mnt/data/nlp/projects/wikilinks/results.CC-2015-06/ -w
Note:
It is advised to run the script inside screen sessions, as it takes a long time.
Durations listed below represent the estimated time it takes to process the respective CommonCrawl 2014 part on individual machines.
Machine | No of files | 1 process | 2 processes | 3 processes | 4 processes | 5 processes | 6 processes |
---|---|---|---|---|---|---|---|
knot01 | 2500 | 192:21:07 | 98:59:31 | 70:23:59 | 53:55:42 | 40:49:35 | 37:28:19 |
knot03 | 2500 | 112:46:48 | 61:00:58 | 39:32:38 | 31:49:12 | 26:10:57 | 25:22:56 |
knot04 | 2500 | 116:28:20 | 63:05:04 | 41:55:47 | 33:26:21 | 24:45:13 | 24:34:53 |
knot05 | 2600 | 128:58:11 | 65:43:50 | 44:50:05 | 35:59:16 | 28:19:13 | 27:28:51 |
knot06 | 2500 | 115:49:10 | 58:59:10 | 42:36:59 | 30:47:32 | 24:38:07 | 21:32:13 |
knot07 | 2500 | 114:32:38 | 58:57:01 | 42:39:27 | 30:55:40 | 26:51:25 | 22:19:48 |
knot08 | 2700 | 122:51:36 | 62:10:48 | 42:11:48 | 34:15:05 | 31:29:38 | 26:20:27 |
knot10 | 2600 | 115:47:47 | 52:34:57 | 36:45:08 | 27:36:03 | 23:27:30 | 19:33:34 |
knot11 | 2200 | 87:14:54 | 46:49:46 | 10:38:42 | 24:32:21 | 19:27:05 | 16:42:27 |
athena3 | 2700 | 104:35:51 | 56:23:20 | 37:19:09 | 27:11:53 | 23:42:00 | 20:12:17 |
athena2 | 2100 | 65:13:28 | 33:27:54 | 24:09:19 | 17:06:51 | 14:07:46 | 12:39:36 |
knot14 | 2100 | 83:13:48 | 37:41:42 | 37:08:53 | 26:42:57 | 22:29:44 | 17:42:52 |
knot15 | 2100 | 52:24:24 | 29:11:31 | 20:07:23 | 14:50:38 | 12:16:03 | 09:29:28 |
knot16 | 2100 | 56:42:56 | 29:29:54 | 24:36:32 | 20:29:35 | 17:52:11 | 17:07:22 |
knot17 | 2100 | 69:30:29 | 31:06:02 | 24:48:12 | 21:45:46 | 19:29:25 | 16:10:47 |
knot18 | 2100 | 92:17:35 | 49:52:09 | 34:15:19 | 26:23:49 | 20:55:41 | 17:59:25 |
knot19 | 2100 | 75:19:05 | 44:09:02 | 34:44:27 | 28:17:23 | 22:39:17 | 18:55:44 |
knot20 | 2100 | 67:19:14 | 38:36:01 | 31:22:48 | 26:02:59 | 21:03:23 | 16:37:21 |
knot21 | 2100 | 64:03:42 | 32:14:10 | 23:18:45 | 20:10:55 | 18:43:55 | 17:07:27 |
knot22 | 2100 | 86:28:38 | 49:30:10 | 33:37:08 | 25:25:29 | 20:17:49 | 16:01:15 |
knot23 | 2100 | 63:20:18 | 50:32:17 | 33:36:58 | 25:24:19 | 21:35:29 | 16:58:37 |
knot24 | 2100 | 50:41:02 | 27:54:03 | 19:10:53 | 14:48:27 | 10:59:45 | 09:59:11 |
knot25 | 2700 | 147:13:30 | 76:50:11 | 50:02:03 | 39:45:54 | 32:17:29 | 27:46:18 |
athena1 | 2700 | 75:02:42 | 39:11:11 | 27:23:12 | 20:47:35 | 17:07:30 | 14:20:14 |
For this comparison, a raw file [2] from the Wikilinks set was processed and compared to the data extracted with the Wikilinks project [1].
Due to the computational cost, only a part of the files was processed.
The original full-content file [2] had to be split into parts, which were then merged into a WARC file that was analyzed with Wikilinks. The program cannot process files that large (13 GB) with reasonable memory consumption. Only one WARC file is processed at a time, and these usually contain a single page (a few KB).
The comparison could only be made between individual Wikipedia page addresses and the text used in their links.
The difference between the original Wikilinks and our Wikilinks is that the links extracted by the original Wikilinks also include links to Wikipedia namespaces other than 0, i.e. links to images, files and so on. Some differences arose during the normalization of redirects, others came from different handling of whitespace and similar details.
[1] http://iesl.cs.umass.edu/downloads/wiki-link/context-only/001.gz
[2] http://iesl.cs.umass.edu/downloads/wiki-link/full-content/part1/001.gz
Number of links found through Wikilinks (original): 123810
Number of links found through Wikilinks: 187985
Number of overlapping links: 102185
Number of unique links: 110547
Number of unique links represents the number of links that differ from the ones found using the original Wikilinks. Number of overlapping links represents the number of identical links.
Results were generated like this:
diff clueweb12.out wikilinks.out | wc -l
comm -1 -2 out1.out.sorted 001.out.sorted | wc -l
They can be found here:
/mnt/minerva1/nlp/projects/wikilinks/porovnani/
The goal was to develop a script that analyzes identical link contexts and creates a histogram of the results produced by the Wikilinks tool on CommonCrawl data.
Script duplicateContexts.py loads data from a specified directory and searches for duplicate link contexts across the files. Duplicate link contexts are merged and counted, then ordered by links and saved, together with additional data, to an output file in the RESULTS directory.
Script can be found here:
/mnt/minerva1/nlp/projects/wikilinks/duplicateContexts.py
Launch:
./duplicateContexts.py [-f FOLDER] [-n NUMBER] [-W]
-f FOLDER - location of the folder to be processed; if not set, files in the location of the script are processed instead
-n NUMBER - number of the most commonly occurring duplicate link contexts to be saved to files
-W - only processes records that do not come from foreign-language Wikipedia pages; their histogram is split by source into 3 files: English, other, and records containing at least one of both English and other languages; if parameter -n is set, the script also writes the most commonly occurring duplicate link contexts to files duplicate.wdocs.N.result (records with no foreign-language Wikipedia sources) and duplicate.wdocs.BOTH.N.result (records with at least one entry from English Wikipedia and one from another page)
The RESULT directory is created for the output of the script and contains the following files:
script.result
Contains link contexts histogram:
01 Normalized address
02 Context before link (no entities)
03 Context after link
04 Number of duplicate link contexts
Individual columns in the file are separated by tabs.
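A simplified sketch of how such a histogram can be built from the wikilinks output files (assuming the TSV column order documented above: column 2 is the normalized address, columns 5 and 6 are the contexts):

from collections import Counter

def context_histogram(paths):
    counts = Counter()
    for path in paths:
        with open(path, encoding='utf-8') as f:
            for line in f:
                cols = line.rstrip('\n').split('\t')
                if len(cols) >= 6:
                    # key: (normalized address, context before, context after)
                    counts[(cols[1], cols[4], cols[5])] += 1
    return counts

# context_histogram(files).most_common(N) gives the N most frequent duplicate link contexts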
script.result.log
Contains additional information, such as the total number of lines in all files, the number of unique link contexts, the number of duplicate link contexts, and the time it took to process the smaller parts.
duplicate.N.result
These files contain the most commonly occurring duplicate link contexts.
Data tested by the script:
/mnt/minerva1/nlp/projects/wikilinks/results.CC-2014-49
Output is saved here:
/mnt/minerva1/nlp/projects/wikilinks/results.CC-2014-49/RESULTS
Script domains.py loads data from a specified folder that contains the results of the duplicateContexts.py script and lists the domains found in the duplicateN.result files.
Script can be found here:
/mnt/minerva1/nlp/projects/wikilinks/domains.py
Launch:
./domains.py [-f FOLDER] [-H]
-f - location of the folder to be processed; if not set, files in the location of the script are processed instead
-H - prints a histogram of domains from all files
domains.result
Contains the names of the files and, under each name, the domains with their counts.
domains.histogram.result
Contains a histogram of domains and their counts from all processed files.
moreDomainFiles.result
Contains files with more than one domain and their counts.
The goal was to compare how long it takes to process CC-2014-49 and CC-2014-52 on individual machines, based on the *.result.log files in the following folders, as well as the duration of each process, and finally to create a graph using the gnuplot tool.
/mnt/minerva1/nlp/projects/wikilinks/results.CC-2014-49 /mnt/minerva1/nlp/projects/wikilinks/results.CC-2014-52
Script processingTime.py can be found here:
/mnt/minerva1/nlp/projects/wikilinks/charts/processingTime.py
Launch:
./processingTime.py DIR OUTPUTFILENAME
DIR - directory containing the processed *.result.log files
OUTPUTFILENAME - name of the output file
File format:
[number] [machine_name.thread] [time in hh:mm:ss] [time in seconds]
Processing times of CC-2014-49 and CC-2014-52 can be found here:
/mnt/minerva1/nlp/projects/wikilinks/results.CC-2014-49/timeComparision-2014-49.dat /mnt/minerva1/nlp/projects/wikilinks/results.CC-2014-52/timeComparision-2014-52.dat
The gnuplot script timeChart.gpl creates graphs based on the two files above:
/mnt/minerva1/nlp/projects/wikilinks/charts/timeChart.gpl
The output files timeComparision.CC-2014-49.png and timeComparision.CC-2014-52.png compare the total processing time on individual machine threads in seconds.
/mnt/minerva1/nlp/projects/wikilinks/charts/timeComparision.CC-2014-49.png /mnt/minerva1/nlp/projects/wikilinks/charts/timeComparision.CC-2014-52.png
The goal was to create a graph displaying the size of data processed by individual processes based on files located here:
/mnt/minerva1/nlp/projects/wikilinks/results.CC-2014-49/list /mnt/minerva1/nlp/projects/wikilinks/results.CC-2014-52/list
Script dataSize.py searches for files that match the regular expression, calculates the total size of the matching files and saves it to the output file.
Script location:
/mnt/minerva1/nlp/projects/wikilinks/charts/dataSize.py
Launch:
./dataSize.py DIR REGEX OUTPUTFILE
DIR - directory containing the input files
REGEX - regular expression used to find the desired files
OUTPUTFILE - output file
Script associateFilesData.py merges data from the files created by the first script into a single output file; the files containing partial data are then deleted.
Script location:
/mnt/minerva1/nlp/projects/wikilinks/charts/associateFilesData.py
Launch:
./associateFilesData.py DIR OUTPUTFILE
DIR - directory containing the files with partial data
OUTPUTFILE - output file for the merged data
File format:
[number] [machine_name.thread] [size in GB]
Output files are saved here:
/mnt/minerva1/nlp/projects/wikilinks/results.CC-2014-49/list/dataSizeComparision.CC-2014-49 /mnt/minerva1/nlp/projects/wikilinks/results.CC-2014-52/list/dataSizeComparision.CC-2014-52
Graphs are saved here:
/mnt/minerva1/nlp/projects/wikilinks/charts/dataSizeComparision.CC-2014-49.png /mnt/minerva1/nlp/projects/wikilinks/charts/dataSizeComparision.CC-2014-52.png
Scripts and programs for deduplication of the Wikilinks format are described here.
Deduplication is performed by two programs, dedup and server.
Source files (wikilinks branch of git repository):
/mnt/minerva1/nlp/repositories/corpora_processing_sw/
Directory:
/mnt/minerva1/nlp/repositories/corpora_processing_sw/processing_steps/3/dedup/
Build using the Makefile.
The dedup program is the "worker" and performs the deduplication itself. It needs a running server program, which acts as the "hash holder". To run deduplication, at least one server and one worker must be running. The principle in short: workers calculate hashes of records and ask the servers whether they have seen them; if no server recognizes a hash, the record is considered unique. Multiple workers and servers can run simultaneously; usually one worker runs on each machine, and a few specific machines also run a server.
Since running N workers and M servers on different machines by hand is inefficient, the scripts deduplicate.py and server.py were created to manage this.
They are located here:
/mnt/minerva1/nlp/repositories/corpora_processing_sw/processing_steps/3/
The scripts require the dedup and server programs to be available on all machines at the same location (parameter -b). Specify the file containing the list of workers (parameter -w) and the file containing the list of servers (parameter -s); both files have the same format, one HOSTNAME per line. Both scripts use the parameter -p to set the port.
Script server.py has the optional parameters -i and -o, which specify input and output files for loading and saving hashes. The servers run in the background; use the parameter start to launch them, stop to terminate them, or restart to restart them. If none of these parameters is given, the script only checks the status of the servers and of the screens in which they run.
Script deduplicate.py uses the parameters -i and -o to specify the input and output folders for deduplication. Use the parameter -wl to process Wikilinks data. Another optional parameter is -n, which enables so-called near deduplication.
Columns "context before link" and "context after link" need to be identical in all deduplicated files.
There is a script responsible for that:
/mnt/minerva1/nlp/projects/wikilinks/wikilinksConvertContext.py
It truncates the contexts to 10 words.
Example use:
wikilinksConvertContext.py < input_file > output_file
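A minimal stdin/stdout sketch of this truncation (not the actual script; it assumes the documented column order where columns 5 and 6 hold the contexts):

import sys

LIMIT = 10
BEFORE, AFTER = 4, 5   # 0-based indices of columns 5 and 6

for line in sys.stdin:
    cols = line.rstrip('\n').split('\t')
    if len(cols) > AFTER:
        cols[BEFORE] = ' '.join(cols[BEFORE].split()[-LIMIT:])  # last 10 words before the link
        cols[AFTER] = ' '.join(cols[AFTER].split()[:LIMIT])     # first 10 words after the link
    sys.stdout.write('\t'.join(cols) + '\n')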
python server.py -s servers.txt -w workers.txt -p 11111 start   # launch servers on the machines listed in servers.txt on port 11111
python server.py -s servers.txt -w workers.txt -p 11111         # check whether the servers are running
python deduplicate.py -s servers.txt -w workers.txt -p 11111 -wl -i /mnt/data/wikilinks -o /mnt/data/wikilinks_dedup
# launch deduplication of the files in /mnt/data/wikilinks and save the output to /mnt/data/wikilinks_dedup
# runs in parallel on the machines listed in workers.txt (add -n to run nearDedup)
Deduplication of the Wikilinks format is implemented as follows: a hash is calculated over the concatenation of columns 2, 3, 5 and 6, and all of these columns have to match for a row to be evaluated as a duplicate. NearDedup calculates N-gram-based hashes over the concatenation of columns 5, 3 and 6 (in this order) and, in addition, a hash of column 2; a row is a duplicate only if both of these match an existing entry.
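The keying scheme can be illustrated in Python (the dedup and server programs themselves are separate compiled tools; this is only a simplified illustration, and the real near-deduplication hashing may use a different N-gram/similarity scheme):

import hashlib

def dedup_key(cols):
    # exact deduplication: one hash over columns 2, 3, 5 and 6 (1-based)
    payload = '\t'.join(cols[i] for i in (1, 2, 4, 5))
    return hashlib.md5(payload.encode('utf-8')).hexdigest()

def near_dedup_key(cols, n=3):
    # near deduplication: word n-grams over columns 5, 3, 6 (in this order)
    # plus a separate hash of column 2; both parts must match for a duplicate
    words = ' '.join((cols[4], cols[2], cols[5])).split()
    ngrams = sorted(set(tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))))
    text_hash = hashlib.md5(repr(ngrams).encode('utf-8')).hexdigest()
    url_hash = hashlib.md5(cols[1].encode('utf-8')).hexdigest()
    return text_hash, url_hash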
Table containing the numbers of Wikilinks records from the processed CCs:
CC-2014-42 | CC-2014-49 | CC-2014-52 | CC-2015-06 | CC-2015-11 | CC-2015-14 | CC-2015-18 | CC-2015-22 | CC-2015-27 | CC-2015-32 | CC-2015-35 | CC-2015-40 | CC-2015-48 | CC-2016-07 | CC-2016-18 | CC-2016-22 | CC-2016-26 | CC-2016-30 | Sum | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
before deduplication | 92,814,880 | 45,171,854 | 40,843,668 | 36,321,605 | 38,847,017 | 35,095,339 | 42,807,744 | 41,439,440 | 34,624,096 | 38,110,741 | 40,552,152 | 28,382,908 | 40,465,043 | 38,538,458 | 31,179,951 | 30,295,311 | 25,488,631 | 42,734,222 | 723,713,060 |
after deduplication | 15,454,213 | 13,160,368 | 15,007,065 | 13,497,642 | 14,595,606 | 13,713,196 | 15,482,543 | 15,294,670 | 13,497,513 | 14,283,612 | 14,417,289 | 11,401,264 | 14,225,466 | 14,258,449 | 12,869,089 | 13,065,771 | 10,712,890 | 13,986,207 | 248,922,853 |
after nearDedup | 11,675,540 | 10,235,720 | 11,497,064 | 10,364,118 | 11,368,100 | 10,739,974 | 11,928,285 | 11,856,781 | 10,605,308 | 11,125,573 | 11,296,278 | 9,061,115 | 11,252,212 | 11,159,926 | 10,046,860 | 10,266,312 | 8,333,024 | 10,897,033 | 193,709,223 |
after incremental deduplication | 15,453,792 | 1,382,063 | 1,726,778 | 1,227,687 | 1,453,991 | 1,062,187 | 1,260,128 | 1,244,134 | 955,326 | 924,314 | 836,878 | 714,636 | 945,781 | 1,130,096 | 1,087,456 | 936,315 | 722,957 | 985,277 | 34,049,796 |
after incremental nearDedup | 10,429,073 | 463,743 | 542,288 | 262,169 | 475,620 | 209,076 | 256,643 | 260,088 | 157,469 | 137,950 | 154,390 | 132,593 | 217,288 | 246,024 | 215,418 | 310,357 | 102,328 | 197,985 | 14,770,502 |
Incremental deduplication means that the CCs were processed from left to right with the hashes preserved between runs. Incremental deduplication was executed in this order: Wikipedia -> CCs -> ClueWebs.
Table containing sizes of individual processed CCs
CC-2014-42 | CC-2014-49 | CC-2014-52 | CC-2015-06 | CC-2015-11 | CC-2015-14 | CC-2015-18 | CC-2015-22 | CC-2015-27 | CC-2015-32 | CC-2015-35 | CC-2015-40 | CC-2015-48 | CC-2016-07 | CC-2016-18 | CC-2016-22 | CC-2016-26 | CC-2016-30 | Sum | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
before deduplication | 70 GB | 36 GB | 35 GB | 31 GB | 33 GB | 30 GB | 37 GB | 36 GB | 30 GB | 33 GB | 34 GB | 25 GB | 34 GB | 33 GB | 26 GB | 27 GB | 21 GB | 34 GB | 605 GB |
after deduplication | 6.2 GB | 5.3 GB | 6 GB | 5.4 GB | 5.9 GB | 5.5 GB | 6.2 GB | 6.1 GB | 5.4 GB | 5.7 GB | 5.8 GB | 4.6 GB | 5.7 GB | 5.7 GB | 5.1 GB | 5.2 GB | 4.3 GB | 5.6 GB | 99.7 GB |
after nearDedup | 4.7 GB | 4.1 GB | 4.6 GB | 4.2 GB | 4.6 GB | 4.3 GB | 4.8 GB | 4.8 GB | 4.2 GB | 4.4 GB | 4.5 GB | 3.6 GB | 4.5 GB | 4.5 GB | 4 GB | 4.1 GB | 3.3 GB | 4.4 GB | 77.6 GB |
after incremental deduplication | 6.2 GB | 590 MB | 735 MB | 524 MB | 622 MB | 453 MB | 544 MB | 533 MB | 409 MB | 397 MB | 357 MB | 305 MB | 408 MB | 483 MB | 462 MB | 399 MB | 304 MB | 421 MB | 14 GB |
after incremental nearDedup | 4.2 GB | 196 MB | 232 MB | 110 MB | 204 MB | 89 MB | 110 MB | 106 MB | 65 MB | 57 MB | 66 MB | 58 MB | 94 MB | 106 MB | 94 MB | 137 MB | 43 MB | 87 MB | 6 GB |
Results of each CC are saved on these machines:
athena1 knot01 knot14 knot26 athena2 knot02 knot15 knot27 athena3 knot03 knot16 knot28 athena4 knot04 knot17 knot29 athena5 knot05 knot18 knot30 athena6 knot06 knot19 knot31 athena7 knot07 knot20 knot32 athena8 knot08 knot21 knot33 athena9 knot10 knot22 knot34 athena10 knot11 knot23 knot35 athena11 knot12 knot24 minerva2 athena12 knot13 knot25 minerva3
Results are located on each machine here:
/mnt/data/commoncrawl/${CC}/wikilinks.dedup/ /mnt/data/commoncrawl/${CC}/wikilinks.near_dedup/ /mnt/data/commoncrawl/${CC}/wikilinks.dedup_inc/ /mnt/data/commoncrawl/${CC}/wikilinks.near_dedup_inc/
wikilinksDedup.sh
Deduplication script for Wikilinks from CommonCrawl can be found here:
/mnt/minerva1/nlp/repositories/corpora_processing_sw/processing_steps/3/wikilinksDedup.sh
Suitable for incremental deduplication. Example:
wikilinksDedup.sh CC-2015-14 CC-2015-32   # first process CC-2015-14 and then CC-2015-32, preserving hashes
wikilinksDedup.sh CC-2015-14              # process CC-2015-14
More specific paths, as well as the paths to the files specifying the machines on which deduplication runs, can be changed within the script (add the parameter -n to run nearDedup).
Table containing the number of Wikilinks records from the processed sets:
ClueWeb09 | ClueWeb12 | Sum | |
---|---|---|---|
before deduplication | 11,744,461 | 19,803,070 | 31,547,531 |
after deduplication | 6,907,001 | 7,186,902 | 14,093,903 |
after nearDedup | 5,135,733 | 5,631,135 | 10,766,868 |
after incremental deduplication | 6,268,126 | 5,173,682 | 11,441,808 |
after incremental nearDedup | 3,820,834 | 3,036,299 | 6,857,133 |
Incremental deduplication was executed in this order: Wikipedia -> CCs -> Cluewebs.
Table containing sizes of individual processed sets:
ClueWeb09 | ClueWeb12 | Sum | |
---|---|---|---|
before deduplication | 5.3 GB | 8.0 GB | 13.3 GB |
after deduplication | 2.4 GB | 2.5 GB | 4.9 GB |
after nearDedup | 1.8 GB | 2 GB | 3.8 GB |
after incremental deduplication | 2.2 GB | 1.8 GB | 4 GB |
after incremental nearDedup | 1.3 GB | 1.1 GB | 2.4 GB |
Results of both Cluewebs are saved on these machines:
athena1 knot01 knot14 knot26 athena2 knot02 knot15 knot27 athena3 knot03 knot16 knot28 athena4 knot04 knot17 knot29 athena5 knot05 knot18 knot30 athena6 knot06 knot19 knot31 athena7 knot07 knot20 knot32 athena8 knot08 knot21 knot33 athena9 knot10 knot22 knot34 athena10 knot11 knot23 athena11 knot12 knot24 athena12 knot13 knot25
and they are located in these folders:
/mnt/data/clueweb/${09|12}/wikilinks.dedup/ /mnt/data/clueweb/${09|12}/wikilinks.near_dedup/ /mnt/data/clueweb/${09|12}/wikilinks.dedup_inc/ /mnt/data/clueweb/${09|12}/wikilinks.near_dedup_inc/
Number of Wikipedia records | Size of individual versions | |
---|---|---|
before deduplication | 125,192,634 | 36 GB |
after deduplication | 119,021,063 | 34.4 GB |
after nearDedup | 76,920,163 | 22.2 GB |
after incremental deduplication | 119,021,063 | 34.4 GB |
after incremental nearDedup | 76,919,387 | 22.2 GB |
Results are saved on the same machines as CommonCrawl, in the folder:
/mnt/data/wikipedia/wikipedia.wikilinks/
Wikipedia serves as input for incremental deduplication of CCs and Cluewebs.
thrift(hashes from CC) | thrift(hashes from Wikipedia) | thrift(hashes from Clueweb) | thrift(hashes from CC+Wikipedia) | thrift(hashes from CC+Clueweb) | thrift(hashes from CC+Wikipedia+Clueweb) | |
---|---|---|---|---|---|---|
before deduplication | 35,359,606 | 35,359,606 | 35,359,606 | 35,359,606 | 35,359,606 | 35,359,606 |
after deduplication | 20,927,543 | 20,927,543 | 20,927,543 | 20,927,543 | 20,927,543 | 20,927,543 |
after nearDedup | 18,622,532 | 18,622,532 | 18,622,532 | 18,622,532 | 18,622,532 | 18,622,532 |
after incremental deduplication | 20,927,499 | 20,926,875 | 20,927,493 | 20,926,831 | 20,927,450 | 20,926,782 |
after incremental nearDedup | 15,750,839 | 16,815,066 | 18,618,274 | 14,432,668 | 14,560,364 | 13,348,602 |
The CC and ClueWeb sets were processed. All datasets were processed with runWikilinks.sh, the script for automatic launching of the Wikilinks tool. The script runs the Wikilinks tool with the following parameters:
wikilinks.py -i <input_file> -o <output_file> -v -w -e -r <redirection_file> -d <disambiguation_file> -a <all_url_file> -W -I -P <pageTable_file> -k
-i - specifies file containing a list of WARC files to be analysed; it should contain one record per line; the record should be in the format <[path]warc_file_name>\t<number> or <[path]warc_file_name>
-o - specifies the name of the output file; the log file will be called <output_file_name>.log
-v - verbose mode
-w - searches for links to Wikipedia
-e - only links to English Wikipedia will be recognized
-r - a tsv file containing redirections from English Wikipedia (in this case: /mnt/minerva1/nlp/projects/wikilinks/program/wikipedia-redirects.tsv)
-d - a tsv file containing disambiguation pages from English Wikipedia (in this case: /mnt/minerva1/nlp/projects/wikilinks/program/wikipedia-disambiguations.tsv)
-a - a tsv file containing all URLs from English Wikipedia (in this case: /mnt/minerva1/nlp/projects/wikilinks/program/wikipedia-all.tsv)
-W - filtering out documents from Wikipedia
-I - filtering out URLs with a fragment identifier
-P - a tsv file containing the Wikipedia page table (CUIDs and their titles) (in this case: /mnt/minerva1/nlp/projects/wikilinks/xbolje00/wikipedia-pageTable.tsv)
-k - filtering out documents from unwanted domains {wikipedia|wikidata|wiktionary|wikimedia|wikisource|edwardbetts|instantchess|viaf|werelate|wmflabs|paheal|fallingrain|isni-url.oclc|blekko}
Processed datasets:
CC-2014-42, CC-2014-49, CC-2014-52, CC-2015-06, CC-2015-11, CC-2015-14, CC-2015-18, CC-2015-22, CC-2015-27, CC-2015-32, CC-2015-35, CC-2015-40, CC-2015-48, CC-2016-07, CC-2016-18, CC-2016-22, CC-2016-26, CC-2016-30, ClueWeb09, ClueWeb12
Results can be found here:
/mnt/minerva1/nlp/projects/wikilinks/results.{CC|clueweb}
Or on each machine in:
/mnt/data/commoncrawl/{CC}/wikilinks /mnt/data/clueweb/{09|12}/wikilinks
The Wikilinks version suitable for distribution contains the incrementally processed CommonCrawl, ClueWeb and Wikipedia datasets (known together as the Total Dataset) and a subset containing only ambiguous names of people and locations from Wikipedia (the Disambiguation Subset).
The Total Dataset contains the incremental version of deduplication, executed in this order: Wikipedia -> CommonCrawl -> ClueWeb. It is represented by multiple files with the *.result suffix, 22 datasets in total, plus a file containing only the unique links - Total_dataset.referred_entities.
This chapter deals with extracting the subset of ambiguous names of people and locations from the processed CC, ClueWeb and Wikipedia datasets.
File containing ambiguous names of people can be found here:
/mnt/minerva1/nlp/projects/decipher_wikipedia/wikipedia_statistics/person_statistics
File containing ambiguous names of locations can be found here:
/mnt/minerva1/nlp/projects/decipher_wikipedia/wikipedia_statistics/location_statistics/data/
Both files were merged into a single file:
/mnt/minerva1/nlp/projects/wikilinks/disambiguation_subset_extraction/all_ambiguous_names
Extraction is performed by scripts adapted from the Opal Disambiguation project. The script all_servers_extract_context.sh runs the script extract_contexts.sh, which executes the extraction itself, on all machines. The script all_servers_extract_context.sh requires that the following are set manually: project_dir (the directory containing these scripts), output_file_destination (the directory where the extraction results are saved) and ambiguous_names_file (the file containing the ambiguous names).
Changed versions of scripts used for extraction can be found here:
/mnt/minerva1/nlp/projects/wikilinks/disambiguation_subset_extraction/all_servers_extract_contexts.sh /mnt/minerva1/nlp/projects/wikilinks/disambiguation_subset_extraction/extract_contexts.sh
The Disambiguation Subset is split into 3 files:
CommonCrawl.result Clueweb.result Wikipedia.result
Again, these were used to create a single file containing only the unique links:
Disambiguation_subset.referred_entities
Total Dataset | Disambiguation Subset | |
---|---|---|
Number of mentions | 164,512,667 | 10,557,592 |
Referred entities | 4,417,107 | 286,273 |
Size in GB | 52 | 3.1 |
Total Dataset is the complete Wikilinks dataset (CC + ClueWeb + Wikipedia). Disambiguation Subset is the extracted subset containing ambiguous names of people and locations from the Total Dataset. Number of mentions is the total number of wiki links in each dataset. Referred entities is the number of unique links within a dataset.
Total Dataset [1] and Disambiguation Subset [2] can be found here:
[1] /mnt/minerva1/nlp/projects/wikilinks/wikilinks_for_distribution/Total_dataset/*.result [2] /mnt/minerva1/nlp/projects/wikilinks/wikilinks_for_distribution/Disambiguation_subset/*.result
Total Dataset and Disambiguation Subset were archived:
[1] /mnt/minerva1/nlp/projects/wikilinks/wikilinks_for_distribution/Total_dataset.tar.gz [2] /mnt/minerva1/nlp/projects/wikilinks/wikilinks_for_distribution/Disambiguation_subset.tar.gz
Both archives contain a README.txt file with a description of the archive and its files.
The TSV files required for processing the datasets with the Wikilinks tool are located here:
/mnt/minerva1/nlp/projects/wikilinks/program
Script tsv_update.py to keep them up to date can be found here:
/mnt/minerva1/nlp/projects/wikilinks
To increase the efficiency of the Wikilinks tool, full URL addresses were removed from the tsv files and only the titles remain (e.g. the prefix http://en.wikipedia.org/wiki/ is removed and "_" is replaced with " "). wikilinks.py then rebuilds the URL addresses from the names during further processing.
Contains all URL address names of the English Wikipedia, one name per row. Both regular names and redirects are included. Two files from the Decipher Wikipedia project are used for updating it:
/mnt/minerva1/nlp/projects/decipher_wikipedia/redirectsFromWikipedia/wikipedia-regular-pages.tsv /mnt/minerva1/nlp/projects/decipher_wikipedia/redirectsFromWikipedia/wikipedia-redirects.tsv
These two files combined give us the final:
wikipedia-all.tsv
Format example:
Napoleon
Napoleon's Barber
Napoleon's Campaigns in Miniature
Contains only the names of disambiguation pages, one per row. A file from the Wikipedia disambiguation pages project is used for updating it:
/mnt/minerva1/nlp/projects/wikipedia_disambiguation_pages/output/wp_disambiguation_titles.sorted
This gives us the final:
wikipedia-disambiguations.tsv
Format example:
Napoleon at St. Helena
Napoleon (disambiguation)
Napoleone Orsini (disambiguation)
Contains all redirecting URL address names. Each row holds a final address name and the group of all names redirecting to that address, separated by "|".
Format:
final_adress_name\tname_of_redirecting_adress1|name_of_redirecting_adress2|...
A file from the Decipher Wikipedia project is used for updating it:
/mnt/minerva1/nlp/projects/decipher_wikipedia/redirectsFromWikipedia/wikipedia-redir-table.tsv
This gives us the final:
wikipedia-redirects.tsv
Format example:
Napoleon\tNapolean Bonapart|Napolean Bonaparte
Napoleon Dynamite\tNapolean Dynamite
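A small sketch of how such a redirect table can be loaded and used (illustrative only):

def load_redirects(tsv_path):
    # maps each redirecting name to its final page name
    redirect_to_final = {}
    with open(tsv_path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip('\n').split('\t')
            if len(parts) < 2:
                continue
            final, names = parts[0], parts[1]
            for name in names.split('|'):
                if name:
                    redirect_to_final[name] = final
    return redirect_to_final

# load_redirects('wikipedia-redirects.tsv').get('Napolean Bonaparte')  ->  'Napoleon'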
Comparison between datasets CC-2015-32 and CC-2015-40. Both datasets were first processed from the input WARC format and then from the input VERT format; in both cases the Wikilinks tool was launched with the same parameters. Surprisingly, the VERT format produced more wiki links, see the table:
Dataset | CC-2015-32 | CC-2015-32 | CC-2015-40 | CC-2015-40 |
---|---|---|---|---|
File type | WARC | VERT | WARC | VERT |
Total number of found wiki links | 60,488,837 | 87,742,422 | 34,500,684 | 66,292,376 |
Total processing time | 10 hr, 32 min | 4 hr | 14 hr, 49 min | 1 hr, 50 min |
Executed by script:
/mnt/minerva1/nlp/projects/wikilinks/xbolje00/pageTable.py
The script extracts information from an SQL script enriched with the commands to create and fill the table. More can be found here.
Table format:
+--------------------+---------------------+------+-----+----------------+----------------+
| Field              | Type                | Null | Key | Default        | Extra          |
+--------------------+---------------------+------+-----+----------------+----------------+
| page_id            | int(10) unsigned    | NO   | PRI | NULL           | auto_increment |
| page_namespace     | int(11)             | NO   | MUL | NULL           |                |
| page_title         | varbinary(255)      | NO   |     | NULL           |                |
| page_restrictions  | tinyblob            | NO   |     | NULL           |                |
| page_counter       | bigint(20) unsigned | NO   |     | 0              |                |
| page_is_redirect   | tinyint(3) unsigned | NO   | MUL | 0              |                |
| page_is_new        | tinyint(3) unsigned | NO   |     | 0              |                |
| page_random        | double unsigned     | NO   | MUL | NULL           |                |
| page_touched       | binary(14)          | NO   |     |                |                |
| page_latest        | int(10) unsigned    | NO   |     | NULL           |                |
| page_len           | int(10) unsigned    | NO   | MUL | NULL           |                |
| page_content_model | varbinary(32)       | YES  |     | NULL           |                |
| page_links_updated | varbinary(14)       | YES  |     | NULL           |                |
+--------------------+---------------------+------+-----+----------------+----------------+
The columns page_id and page_title are important for further processing. Their values from the INSERT commands are collected during processing and subsequently saved to the output file.
Launch:
./pageTable.py -i {input_file} -o {output_file}
Input SQL script:
/mnt/minerva1/nlp/projects/decipher_wikipedia/redirectsFromWikipedia/enwiki-latest-page.sql
Output file path:
/mnt/minerva1/nlp/projects/wikilinks/xbolje00/wikipedia-pageTable.tsv
Output file format:
page_id\tpage_title
page_id\tpage_title
...
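A hedged sketch of this kind of extraction (not the actual pageTable.py; the regular expression assumes the usual layout of the MediaWiki page-table dump, and the namespace filter is an assumption):

import re

# a tuple in the dump starts with: (page_id, page_namespace, 'page_title', ...
ROW_RE = re.compile(r"\((\d+),(-?\d+),'((?:[^'\\]|\\.)*)'")

def extract_page_table(sql_path, out_path):
    with open(sql_path, encoding='utf-8', errors='ignore') as src, \
         open(out_path, 'w', encoding='utf-8') as dst:
        for line in src:
            if not line.startswith('INSERT INTO'):
                continue
            for page_id, namespace, title in ROW_RE.findall(line):
                if namespace == '0':   # assumption: keep only main-namespace pages
                    dst.write(page_id + '\t' + title + '\n')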
The program gsoc-wikilinks is used to decode the Wikilinks dataset from the Apache Thrift format to the required format. It is launched with 2 parameters specifying the input and output directories. Files in the input directory are read, decoded and saved to the output directory. If an error occurs while reading a file (it cannot be read, it is not in Apache Thrift format, etc.), an exception is thrown and the process continues with the next file. The output has to contain the address offset and the text offset. The text offset is always given, but the address offset has to be calculated, which is only possible when the original HTML document is available.
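One possible way to compute the missing address offset, assuming the link URL appears verbatim in the original HTML near the known text offset (gsoc-wikilinks itself is a Java program; this Python fragment is only an illustration of the idea):

def address_offset(html, url, text_offset):
    # prefer the last occurrence of the URL before the text offset; fall back to any occurrence
    pos = html.rfind(url, 0, text_offset)
    return pos if pos != -1 else html.find(url)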
The program can be found here:
/mnt/minerva1/nlp/projects/wikilinks/gsoc-wikilinks
Run the following command in the program directory:
mvn package
When the build finishes successfully, the following jar archives are created in this directory:
gsoc-wikilinks-1.0-SNAPSHOT.jar
gsoc-wikilinks-1.0-SNAPSHOT-jar-with-dependencies.jar
Launch using:
java -Xmx1g -jar gsoc-wikilinks-1.0-SNAPSHOT-jar-with-dependencies.jar -i "input_directory" -o "output_directory"
input_directory - directory containing the files to be processed
output_directory - directory for the output files
The dataset used can be found here:
/mnt/data-2/nlp/wiki-links/Dataset (with complete webpages)
Output can be found here:
/mnt/data/nlp/projects/wikilinks/result_from_thrift
Number of files | 109 |
Total time | 14:33:31 |
Number of documents | 10,629,486 |
Number of links | 36,166,829 |
Calculated offsets | 32,638,607 |
Missing offsets | 3,528,222 |
Calculated offsets in % | 89% |