Wikilinks

The goal of this project was to develop a program that extracts Wikipedia and Freebase links from the ClueWeb and CommonCrawl data sets, similar to those in the original Wikilinks data set. Another goal was to evaluate the speed of the developed tool. The entire functionality is packed into the wikilinks.py script; a detailed description follows.

1 wikilinks.py description

Since the content type and charset have to be determined for every single WARC record, records whose type is not application/http;msgtype=response or application/https;msgtype=response are excluded from the analysis. Each record that meets this requirement should contain an HTTP header followed by content of the specified type. Only text/html content is analyzed, since content such as image/jpeg is not expected to contain links.

The charset of a document can sometimes be determined from the HTTP header. The script also searches for a charset in the body of the HTTP record (more specifically, in the HTML meta tags). A charset found in the HTML has higher priority than the one from the HTTP header. If the charset cannot be determined at all, the default HTTP charset ISO-8859-1 is used. Characters that cannot be converted to UTF-8 are discarded.
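
The resulting priority rule can be sketched in a few lines. This is an illustrative sketch, not the actual implementation in wikilinks.py; the function names are made up:

 import re

 DEFAULT_CHARSET = "ISO-8859-1"          # HTTP default used as a fallback

 def resolve_charset(http_header, html_body):
     """Pick the charset for a record: an HTML meta tag beats the HTTP header,
     and ISO-8859-1 is used when neither declares one."""
     # charset from the Content-Type line of the HTTP header
     http_match = re.search(rb"Content-Type:.*?charset=([\w-]+)", http_header, re.I)
     # charset from an HTML meta tag, e.g. <meta charset="utf-8">
     html_match = re.search(rb"<meta[^>]+charset=[\"']?([\w-]+)", html_body, re.I)
     if html_match:
         return html_match.group(1).decode("ascii", "ignore")
     if http_match:
         return http_match.group(1).decode("ascii", "ignore")
     return DEFAULT_CHARSET

 def to_utf8(raw_body, charset):
     # characters that cannot be decoded are dropped, mirroring the behaviour above
     return raw_body.decode(charset, errors="ignore")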

The script then searches for links using regular expressions. Only HTML links (<a> elements) are accepted. If a link to Wikipedia or Freebase is found, the script extracts the values described in section 1.5 (the normalized address, the link text and its context, and the offsets).

These are then saved to the output file together with the charset and the name of the source file.

If any piece of information cannot be found, the program saves an empty string in its place; this behaviour can be changed in constants.py.

The program only saves Wikipedia links from the main (article) namespace. The filter is based on the occurrence of ':' in the name of the Wikipedia page: if the page name matches :[^_], it is considered a page from a different Wikipedia namespace. The check is implemented this way for the sake of simplicity, because namespace names change with the language of the Wikipedia edition.
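
A simplified sketch of the two checks above (accepting only <a> links and filtering out non-article namespaces); the regular expressions are illustrative, not the exact ones used in wikilinks.py:

 import re

 # <a href="..."> links only; other elements are ignored
 A_HREF_RE = re.compile(r'<a\s[^>]*href=["\']?([^"\'\s>]+)', re.I)
 # page names containing ":" not followed by "_" belong to another namespace
 OTHER_NAMESPACE_RE = re.compile(r':[^_]')

 def wikipedia_links(html):
     """Yield Wikipedia article links found in an HTML string."""
     for match in A_HREF_RE.finditer(html):
         url = match.group(1)
         if "wikipedia.org/wiki/" not in url:
             continue
         page_name = url.rsplit("/wiki/", 1)[1]
         if OTHER_NAMESPACE_RE.search(page_name):
             continue                   # e.g. "File:Foo.jpg" or "Talk:Bar" is skipped
         yield url

 html = '<a href="http://en.wikipedia.org/wiki/KPMG">KPMG</a> <a href="http://en.wikipedia.org/wiki/File:Logo.png">logo</a>'
 print(list(wikipedia_links(html)))     # only the KPMG link survives the filter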

The program can search for Freebase or Wikipedia links; use the -f or -w parameter, respectively.

Links can also be filtered to a specific language version of Wikipedia: -e for English and -l for other versions. The program determines the language of a given Wikipedia page from the third-level (or fourth-level) domain.

If a file with Wikipedia redirect data is available, the program can save the URL of the redirect target instead of the URL found in the WARC document; this is enabled with the -r parameter.

The program saves logs to a file named <output_file>.log.

The number of context words saved before and after the link can be changed in constants.py.

1.1 VERT format processing extension

The wikilinks.py script can now process VERT files as well as WARC files. The script is launched the same way; it determines the file type on its own and processes it accordingly.

A VERT file contains verticalized data, where tokens (words) are stored one below another as a list, separated by \n. A record in the file starts with a <doc ...> tag, whose attributes hold the id, url and title of the document, and ends with </doc>. Individual links in the document start with <link="wikipedia/freebase/other url"/>; an element <length="number"/> can be found on the same line and gives the number of tokens (words) the link text is composed of. These tokens are located before the <link> tag. Regular expressions are used to process the format.
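
A minimal sketch of walking such a record with regular expressions; the exact attribute layout of real VERT files may differ, so the patterns below are assumptions:

 import re

 # illustrative patterns for the VERT tags described above
 DOC_OPEN_RE = re.compile(r'<doc\b[^>]*\burl="([^"]*)"[^>]*>')
 LINK_RE = re.compile(r'<link="([^"]*)"/>\s*<length="(\d+)"/>')

 def vert_links(lines):
     """Yield (document url, link url, link text) triples from the lines of a VERT file."""
     doc_url = None
     tokens = []                        # tokens seen so far in the current document
     for line in lines:
         line = line.rstrip("\n")
         doc_match = DOC_OPEN_RE.match(line)
         if doc_match:
             doc_url, tokens = doc_match.group(1), []
             continue
         if line == "</doc>":
             doc_url, tokens = None, []
             continue
         link_match = LINK_RE.search(line)
         if link_match:
             url, length = link_match.group(1), int(link_match.group(2))
             # the link text is made of the `length` tokens preceding the <link> tag
             text_tokens = tokens[-length:] if length > 0 else []
             yield doc_url, url, " ".join(text_tokens)
             continue
         if line:
             tokens.append(line)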

1.2 Replacing CUID in URL extension

This extension is used to normalize Wikipedia addresses. An address that originally contained the CUID of an article, e.g. http://en.wikipedia.org/wiki/index.html?curid=75960, has the part after .../wiki/ replaced with the name of the article, e.g. https://en.wikipedia.org/wiki/KPMG. To activate this function, launch the program with the -p parameter followed by a .tsv file in which each line contains a CUID and a title (e.g. "75960\tKPMG"). The file is loaded into memory at launch and the CUID suffix in the URL is replaced with the corresponding article title.

Path to file:

 /mnt/minerva1/nlp/projects/decipher_wikipedia/redirectsFromWikipedia/enwiki-latest-page.sql
        
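
The replacement can be pictured as a dictionary lookup keyed by the CUID. A sketch, assuming the TSV file has exactly two columns (CUID and title); the function names are illustrative and this is not the code in wikilinks.py:

 import csv
 import re

 CURID_RE = re.compile(r'/wiki/index\.html\?curid=(\d+)$')

 def load_curid_map(tsv_path):
     """Load a CUID -> article title mapping from a two-column TSV file."""
     with open(tsv_path, encoding="utf-8") as handle:
         return {curid: title for curid, title in csv.reader(handle, delimiter="\t")}

 def normalize_curid_url(url, curid_map):
     """Replace .../wiki/index.html?curid=NNN with .../wiki/<article title>."""
     match = CURID_RE.search(url)
     if not match:
         return url
     title = curid_map.get(match.group(1))
     if title is None:
         return url                     # unknown CUID: keep the original URL
     return url[:match.start()] + "/wiki/" + title.replace(" ", "_")

 # example mapping taken from the text above
 curid_map = {"75960": "KPMG"}
 print(normalize_curid_url("http://en.wikipedia.org/wiki/index.html?curid=75960", curid_map))
 # -> http://en.wikipedia.org/wiki/KPMG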

1.3 wikilinks.py parameters

Launch example:

 wikilinks.py -i <input_file> -o <output_file> [-p <path>] [-v] [-w] [-f] [-e] [-l <language>[:<language>]] [-r <redirection_file>] [-s] [-d <disambiguation_file>] [-W] [-h] [-a <all_url_file>] [-k]
        

Parameter -p can be used to analyze ClueWeb12 when a file containing a list of WARC files with their relative paths is available.

An example launch for a single archive can be found here:

 /mnt/minerva1/nlp/projects/wikilinks/test/test.sh
        

1.4 Source code

Wikilinks project source files can be found here:

 /mnt/minerva1/nlp/projects/wikilinks/program/
        

What does each file do:

 wikilinks.py - the program extracting Wikipedia and Freebase links
 warcm.py - WARC file processing module
 vert.py - VERT file processing module
 httpm.py - HTTP header processing module
 constants.py - constants

1.5 Output format description

Output files are in TSV (Tab-Separated Values) format. The meaning of each column is as follows:

 ------+-------------------------------------------------
 Number| Value
 ======+=================================================
  1    | original address      
 ------+-------------------------------------------------
  2    | normalized address   
 ------+-------------------------------------------------
  3    | link text (no entities)
 ------+-------------------------------------------------
  4    | original link text  
 ------+-------------------------------------------------
  5    | context before link (no entities)
 ------+-------------------------------------------------
  6    | context after link (no entities)
 ------+-------------------------------------------------
  7    | encoding
 ------+-------------------------------------------------
  8    | address offset in WARC/VERT file
 ------+-------------------------------------------------
  9    | text offset in WARC/VERT file
 ------+-------------------------------------------------
  10   | WARC/VERT file name                         
 ------+-------------------------------------------------
  11   | URL of corresponding document in WARC/VERT file                      
 ------+-------------------------------------------------
        
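
For downstream processing, each output line maps naturally onto a record with one field per column above. A minimal reader sketch (the field names are chosen here for readability and are not defined by the format):

 import csv
 from collections import namedtuple

 # one field per column of the output format above
 WikilinksRecord = namedtuple("WikilinksRecord", [
     "original_address", "normalized_address", "link_text", "original_link_text",
     "context_before", "context_after", "encoding",
     "address_offset", "text_offset", "source_file", "document_url",
 ])

 def read_records(path):
     """Yield one WikilinksRecord per line of a Wikilinks output file."""
     with open(path, encoding="utf-8", newline="") as handle:
         for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
             if len(row) == 11:         # skip malformed lines
                 yield WikilinksRecord(*row)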

2 Wikilinks tool launch automation

Automatic launch script for Wikilinks can be found here:

 /mnt/data/nlp/projects/wikilinks/runWikilinks.sh
        

Required launch parameters:

To launch successfully, set up SSH key login to the machines where the files will be processed and save their names to:

 /mnt/data/nlp/projects/wikilinks/servers.txt
        

Script launch order:

 list.sh, list_dirs.sh - create the list of files to be processed
 split.sh - splits the list for multiple processes
 prepare.sh - prepares directories for Wikilinks outputs
 run.sh - runs the Wikilinks tool on the servers specified in servers.txt
 copy.sh - copies the outputs from the machines to minerva1.fit.vutbr.cz
        

Logs of the process are saved here:

 /mnt/data/nlp/projects/wikilinks/autorunOutput.log
 /mnt/data/nlp/projects/wikilinks/autorunError.log
        

2.1 Example launch

 ./runWikilinks.sh -l iotrusina -s /mnt/data/commoncrawl/CC-2015-06/ -d /mnt/data/nlp/projects/wikilinks/results.CC-2015-06/ -w
        

Note:
It is advised to run the script inside screen, as it takes a long time.


3 CommonCrawl 2014 processing speed

Durations listed below represent the estimated time it takes to process the respective CommonCrawl 2014 part on individual machines.

Machine No of files 1 process 2 processes 3 processes 4 processes 5 processes 6 processes
knot01 2500 192:21:07 98:59:31 70:23:59 53:55:42 40:49:35 37:28:19
knot03 2500 112:46:48 61:00:58 39:32:38 31:49:12 26:10:57 25:22:56
knot04 2500 116:28:20 63:05:04 41:55:47 33:26:21 24:45:13 24:34:53
knot05 2600 128:58:11 65:43:50 44:50:05 35:59:16 28:19:13 27:28:51
knot06 2500 115:49:10 58:59:10 42:36:59 30:47:32 24:38:07 21:32:13
knot07 2500 114:32:38 58:57:01 42:39:27 30:55:40 26:51:25 22:19:48
knot08 2700 122:51:36 62:10:48 42:11:48 34:15:05 31:29:38 26:20:27
knot10 2600 115:47:47 52:34:57 36:45:08 27:36:03 23:27:30 19:33:34
knot11 2200 87:14:54 46:49:46 10:38:42 24:32:21 19:27:05 16:42:27
athena3 2700 104:35:51 56:23:20 37:19:09 27:11:53 23:42:00 20:12:17
athena2 2100 65:13:28 33:27:54 24:09:19 17:06:51 14:07:46 12:39:36
knot14 2100 83:13:48 37:41:42 37:08:53 26:42:57 22:29:44 17:42:52
knot15 2100 52:24:24 29:11:31 20:07:23 14:50:38 12:16:03 09:29:28
knot16 2100 56:42:56 29:29:54 24:36:32 20:29:35 17:52:11 17:07:22
knot17 2100 69:30:29 31:06:02 24:48:12 21:45:46 19:29:25 16:10:47
knot18 2100 92:17:35 49:52:09 34:15:19 26:23:49 20:55:41 17:59:25
knot19 2100 75:19:05 44:09:02 34:44:27 28:17:23 22:39:17 18:55:44
knot20 2100 67:19:14 38:36:01 31:22:48 26:02:59 21:03:23 16:37:21
knot21 2100 64:03:42 32:14:10 23:18:45 20:10:55 18:43:55 17:07:27
knot22 2100 86:28:38 49:30:10 33:37:08 25:25:29 20:17:49 16:01:15
knot23 2100 63:20:18 50:32:17 33:36:58 25:24:19 21:35:29 16:58:37
knot24 2100 50:41:02 27:54:03 19:10:53 14:48:27 10:59:45 09:59:11
knot25 2700 147:13:30 76:50:11 50:02:03 39:45:54 32:17:29 27:46:18
athena1 2700 75:02:42 39:11:11 27:23:12 20:47:35 17:07:30 14:20:14

4 Comparison with the original Wikilinks

For this comparison, a raw full-content file [2] from the original Wikilinks set was processed with our tool and the result was compared to the data extracted by the original Wikilinks project [1].

Due to the computational cost, only a part of the files was processed.

The original full-content file [2] had to be split into parts, which were then merged into a WARC file and analyzed with Wikilinks. The program cannot process a file that large (13 GB) with reasonable memory usage; it processes one WARC record at a time, and a record usually contains a single page (on the order of kilobytes).

The comparison could therefore only be made on individual Wikipedia page addresses and the text used in their links.

The main difference between the original Wikilinks and ours is that the links extracted by the original Wikilinks also cover Wikipedia namespaces other than 0, i.e. they include links to images, files and so on. Further differences arose during redirect normalization, and others came from different handling of white space.

 [1] http://iesl.cs.umass.edu/downloads/wiki-link/context-only/001.gz
 [2] http://iesl.cs.umass.edu/downloads/wiki-link/full-content/part1/001.gz
        

4.1 Results

 Number of links found through Wikilinks (original): 123810
 Number of links found through Wikilinks: 187985
 Number of overlapping links: 102185
 Number of unique links: 110547
        

Number of unique links represents the number of links that differ from the ones found using the original Wikilinks. Number of overlapping links represents the number of identical links.

Results were generated like this:

 diff clueweb12.out wikilinks.out | wc -l
 comm -1 -2 out1.out.sorted 001.out.sorted | wc -l
        

They can be found here:

 /mnt/minerva1/nlp/projects/wikilinks/porovnani/
        

5 Histogram of links with identical contexts

The goal was to develop a script that analyzes identical link contexts and creates a histogram from the results of the Wikilinks tool on CommonCrawl data.

5.1 Launch and execution

The duplicateContexts.py script loads data from the specified directory and searches for duplicate link contexts across the files. Duplicate link contexts are merged and counted, then ordered by link and saved with additional data to an output file in the RESULTS directory.
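
Conceptually, the histogram is a counter keyed by the triple (normalized address, context before link, context after link). A sketch, assuming the input directory holds Wikilinks output files in the TSV format described in section 1.5 (this is not the actual duplicateContexts.py):

 import csv
 import glob
 import os
 from collections import Counter

 def context_histogram(result_dir):
     """Count identical (normalized address, context before, context after) triples."""
     counts = Counter()
     for path in glob.glob(os.path.join(result_dir, "*")):
         if not os.path.isfile(path):
             continue
         with open(path, encoding="utf-8", newline="") as handle:
             for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
                 if len(row) >= 6:
                     # columns 2, 5 and 6 of the Wikilinks output format (1-based)
                     counts[(row[1], row[4], row[5])] += 1
     return counts.most_common()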

Script can be found here:

 /mnt/minerva1/nlp/projects/wikilinks/duplicateContexts.py
        

Launch:

 ./duplicateContexts.py [-f FOLDER] [-n NUMBER] [-W]
        

The RESULTS directory is created for the output of the script and contains the following files:

script.result

Contains link contexts histogram:

 01 Normalized address
 02 Context before link (no entities)
 03 Context after link
 04 Number of duplicate link contexts
        

Individual columns in the file are separated by tabs.

script.result.log

Contains additional information, such as the total number of lines in all files, the number of unique link contexts, the number of duplicate link contexts and the time spent processing the smaller parts.

duplicate.N.result

These files contain the most commonly occurring duplicate link contexts.

5.2 Output

Data tested by the script:

 /mnt/minerva1/nlp/projects/wikilinks/results.CC-2014-49
        

Output is saved here:

 /mnt/minerva1/nlp/projects/wikilinks/results.CC-2014-49/RESULTS
        

5.3 Listing of domains from results

The domains.py script loads data from the specified folder containing the results of the duplicateContexts.py script and lists the domains found in the duplicate.N.result files.

Script can be found here:

 /mnt/minerva1/nlp/projects/wikilinks/domains.py
        

Launch:

 ./domains.py [-f FOLDER] [-H]

5.3.1 Output files

domains.result

Contains the names of the processed files and, under each name, the domains together with their counts.

domains.histogram.result

Contains a histogram of domains and their counts across all processed files.

moreDomainFiles.result

Contains the files that reference more than one domain, together with their counts.


6 Visualizing processing time

6.1 Time comparison

The goal was to compare how long it takes to process CC-2014-49 and CC-2014-52 on individual machines, based on the *.result.log files in the folders below, along with the duration of each process, and finally to create graphs with the gnuplot tool.

 /mnt/minerva1/nlp/projects/wikilinks/results.CC-2014-49
 /mnt/minerva1/nlp/projects/wikilinks/results.CC-2014-52
        

6.1.1 Scripts

Script processingTime.py can be found here:

 /mnt/minerva1/nlp/projects/wikilinks/charts/processingTime.py
        

Launch:

 ./processingTime.py DIR OUTPUTFILENAME
        

File format:

 [number] [machine_name.thread] [time in hh:mm:ss] [time in seconds]
        

6.1.2 Outputs

Processing times of CC-2014-49 and CC-2014-52 can be found here:

 /mnt/minerva1/nlp/projects/wikilinks/results.CC-2014-49/timeComparision-2014-49.dat
 /mnt/minerva1/nlp/projects/wikilinks/results.CC-2014-52/timeComparision-2014-52.dat
        

gnuplot script timeChart.gpl creates graphs based on the two files above:

 /mnt/minerva1/nlp/projects/wikilinks/charts/timeChart.gpl
        

The output files timeComparision.CC-2014-49.png and timeComparision.CC-2014-52.png compare the total processing time on individual machine threads in seconds.

 /mnt/minerva1/nlp/projects/wikilinks/charts/timeComparision.CC-2014-49.png
 /mnt/minerva1/nlp/projects/wikilinks/charts/timeComparision.CC-2014-52.png
        

6.2 Processed data size comparison

The goal was to create a graph displaying the size of data processed by individual processes based on files located here:

 /mnt/minerva1/nlp/projects/wikilinks/results.CC-2014-49/list
 /mnt/minerva1/nlp/projects/wikilinks/results.CC-2014-52/list
        

6.2.1 Scripts

The dataSize.py script searches for files that match a regular expression, calculates their total size and saves it to the output file.

Script location:

 /mnt/minerva1/nlp/projects/wikilinks/charts/dataSize.py
        

Launch:

 ./dataSize.py DIR REGEX OUTPUTFILE
        
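
The core of such a size summary is a walk over the directory tree, a filename match and a running sum of file sizes. The sketch below is only an approximation of dataSize.py; the real script also records the machine name and process number in its output, which is omitted here:

 import os
 import re
 import sys

 def total_size_gb(directory, pattern):
     """Sum the sizes of the files under `directory` whose names match `pattern`."""
     regex = re.compile(pattern)
     total_bytes = 0
     for root, _dirs, files in os.walk(directory):
         for name in files:
             if regex.search(name):
                 total_bytes += os.path.getsize(os.path.join(root, name))
     return total_bytes / 1024 ** 3

 if __name__ == "__main__":
     directory, pattern, output_file = sys.argv[1:4]
     with open(output_file, "w") as out:
         out.write("%.2f GB\n" % total_size_gb(directory, pattern))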

The associateFilesData.py script merges the data from the files created by the first script into a single output file; the files containing the partial data are then deleted.

Script location:

 /mnt/minerva1/nlp/projects/wikilinks/charts/associateFilesData.py
        

Launch:

 ./associateFilesData.py DIR OUTPUTFILE
        

File format:

 [number] [machine_name.thread] [size in GB]
        

6.2.2 Outputs

Output files are saved here:

 /mnt/minerva1/nlp/projects/wikilinks/results.CC-2014-49/list/dataSizeComparision.CC-2014-49
 /mnt/minerva1/nlp/projects/wikilinks/results.CC-2014-52/list/dataSizeComparision.CC-2014-52
        

Graphs are saved here:

 /mnt/minerva1/nlp/projects/wikilinks/charts/dataSizeComparision.CC-2014-49.png
 /mnt/minerva1/nlp/projects/wikilinks/charts/dataSizeComparision.CC-2014-52.png
        

7 Wikilinks records deduplication

Scripts and programs for deduplication of Wikilinks format are described here.

7.1 Launch

Deduplication is executed by two programs, dedup and server.

Source files (wikilinks branch of git repository):

 /mnt/minerva1/nlp/repositories/corpora_processing_sw/
        

Directory:

 /mnt/minerva1/nlp/repositories/corpora_processing_sw/processing_steps/3/dedup/
        

Build them using the provided Makefile.

The dedup program is the "worker" and performs the deduplication itself. It needs a running server program, which acts as the "hash holder". To run deduplication, at least one server and one worker must be running. The principle in short: workers calculate hashes of records and query the servers; if no server recognizes a hash, the record is considered unique. Multiple workers and servers can run simultaneously; usually one worker runs on each machine and a few dedicated machines also run a server.

7.1.1 Scripts

Since manually running N workers and M servers on different machines is inefficient, the deduplicate.py and server.py scripts were created to manage this.

They are located here:

 /mnt/minerva1/nlp/repositories/corpora_processing_sw/processing_steps/3/
        

The scripts require that the dedup and server programs are available on all machines in the same location (parameter -b). Files containing the list of workers (parameter -w) and the list of servers (parameter -s) must be specified; both files have the same format, one HOSTNAME per line. Both scripts use the -p parameter to set the port.

The server.py script has optional -i and -o parameters specifying the input and output files for loading and saving hashes. The servers run in the background; use the start argument to launch them, stop to terminate them, or restart to restart them. If none of these arguments is given, the script only reports the status of the servers and the screens they run in.

The deduplicate.py script uses the -i and -o parameters to specify the input and output folders for deduplication. Use the -wl parameter to process Wikilinks data. Another optional parameter is -n, which enables so-called near deduplication.

7.1.2 Context truncation

The "context before link" and "context after link" columns need to have the same length in all deduplicated files.

There is a script responsible for that:

 /mnt/minerva1/nlp/projects/wikilinks/wikilinksConvertContext.py
        

It truncates the contexts to 10 words.

Example use:

 wikilinksConvertContext.py < input_file >  output_file
        
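
The same behaviour can be sketched as a stdin/stdout filter. This is not the original wikilinksConvertContext.py; in particular, keeping the last ten words of the preceding context and the first ten of the following context is an assumption:

 import sys

 MAX_WORDS = 10                                    # contexts are cut to this many words

 def truncate(context, keep_tail=False):
     words = context.split()
     kept = words[-MAX_WORDS:] if keep_tail else words[:MAX_WORDS]
     return " ".join(kept)

 # read the Wikilinks TSV on stdin, truncate columns 5 and 6, write the result to stdout
 for line in sys.stdin:
     cols = line.rstrip("\n").split("\t")
     if len(cols) >= 6:
         cols[4] = truncate(cols[4], keep_tail=True)   # context before the link
         cols[5] = truncate(cols[5])                   # context after the link
     sys.stdout.write("\t".join(cols) + "\n")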

7.1.3 Launch example

 python server.py -s servers.txt -w workers.txt -p 11111 start
 # launch servers on machines specified in servers.txt on port 11111
 python server.py -s servers.txt -w workers.txt -p 11111
 # check if servers are running
 python deduplicate.py -s servers.txt -w workers.txt -p 11111 -wl -i /mnt/data/wikilinks -o /mnt/data/wikilinks_dedup
 # launch deduplication of files in /mnt/data/wikilinks and save output to /mnt/data/wikilinks_dedup
 # running in parallel on machines specified in workers.txt (add -n to run nearDedup)
        

7.2 Hash calculation

Deduplication of the Wikilinks format works as follows: a hash is calculated from the concatenation of columns 2, 3, 5 and 6, and all of these columns have to match for a row to be evaluated as a duplicate. NearDedup calculates an N-gram based hash over the concatenation of columns 5, 3 and 6 (in this order) and additionally a hash of column 2; a row is considered a duplicate only if both hashes have been seen before.
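
The exact-deduplication rule above can be sketched as follows. The hash function (SHA-1) and the column separator are assumptions; only the choice of columns 2, 3, 5 and 6 follows the description (the N-gram based nearDedup variant is not sketched here):

 import hashlib

 def record_hash(columns):
     """Hash used for exact deduplication: concatenation of columns 2, 3, 5 and 6 (1-based)."""
     key = "\t".join(columns[i - 1] for i in (2, 3, 5, 6))
     return hashlib.sha1(key.encode("utf-8")).hexdigest()

 seen = set()

 def is_duplicate(columns):
     """Report whether an identical record has been seen before."""
     digest = record_hash(columns)
     if digest in seen:
         return True
     seen.add(digest)
     return False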

7.3 CommonCrawl deduplication

Table containing the numbers of Wikilinks records from the processed CCs:

CC-2014-42 CC-2014-49 CC-2014-52 CC-2015-06 CC-2015-11 CC-2015-14 CC-2015-18 CC-2015-22 CC-2015-27 CC-2015-32 CC-2015-35 CC-2015-40 CC-2015-48 CC-2016-07 CC-2016-18 CC-2016-22 CC-2016-26 CC-2016-30 Sum
before deduplication 92,814,880 45,171,854 40,843,668 36,321,605 38,847,017 35,095,339 42,807,744 41,439,440 34,624,096 38,110,741 40,552,152 28,382,908 40,465,043 38,538,458 31,179,951 30,295,311 25,488,631 42,734,222 723,713,060
after deduplication 15,454,213 13,160,368 15,007,065 13,497,642 14,595,606 13,713,196 15,482,543 15,294,670 13,497,513 14,283,612 14,417,289 11,401,264 14,225,466 14,258,449 12,869,089 13,065,771 10,712,890 13,986,207 248,922,853
after nearDedup 11,675,540 10,235,720 11,497,064 10,364,118 11,368,100 10,739,974 11,928,285 11,856,781 10,605,308 11,125,573 11,296,278 9,061,115 11,252,212 11,159,926 10,046,860 10,266,312 8,333,024 10,897,033 193,709,223
after incremental deduplication 15,453,792 1,382,063 1,726,778 1,227,687 1,453,991 1,062,187 1,260,128 1,244,134 955,326 924,314 836,878 714,636 945,781 1,130,096 1,087,456 936,315 722,957 985,277 34,049,796
after incremental nearDedup 10,429,073 463,743 542,288 262,169 475,620 209,076 256,643 260,088 157,469 137,950 154,390 132,593 217,288 246,024 215,418 310,357 102,328 197,985 14,770,502

Incremental deduplication means that the CCs were processed from left to right with hashes preserved between runs. Incremental deduplication was executed in this order: Wikipedia -> CCs -> ClueWebs.

Table containing the sizes of the individual processed CCs:

CC-2014-42 CC-2014-49 CC-2014-52 CC-2015-06 CC-2015-11 CC-2015-14 CC-2015-18 CC-2015-22 CC-2015-27 CC-2015-32 CC-2015-35 CC-2015-40 CC-2015-48 CC-2016-07 CC-2016-18 CC-2016-22 CC-2016-26 CC-2016-30 Sum
before deduplication 70 GB 36 GB 35 GB 31 GB 33 GB 30 GB 37 GB 36 GB 30 GB 33 GB 34 GB 25 GB 34 GB 33 GB 26 GB 27 GB 21 GB 34 GB 605 GB
after deduplication 6.2 GB 5.3 GB 6 GB 5.4 GB 5.9 GB 5.5 GB 6.2 GB 6.1 GB 5.4 GB 5.7 GB 5.8 GB 4.6 GB 5.7 GB 5.7 GB 5.1 GB 5.2 GB 4.3 GB 5.6 GB 99.7 GB
after nearDedup 4.7 GB 4.1 GB 4.6 GB 4.2 GB 4.6 GB 4.3 GB 4.8 GB 4.8 GB 4.2 GB 4.4 GB 4.5 GB 3.6 GB 4.5 GB 4.5 GB 4 GB 4.1 GB 3.3 GB 4.4 GB 77.6 GB
after incremental deduplication 6.2 GB 590 MB 735 MB 524 MB 622 MB 453 MB 544 MB 533 MB 409 MB 397 MB 357 MB 305 MB 408 MB 483 MB 462 MB 399 MB 304 MB 421 MB 14 GB
after incremental nearDedup 4.2 GB 196 MB 232 MB 110 MB 204 MB 89 MB 110 MB 106 MB 65 MB 57 MB 66 MB 58 MB 94 MB 106 MB 94 MB 137 MB 43 MB 87 MB 6 GB

Results of each CC are saved on these machines:

 athena1        knot01        knot14        knot26
 athena2        knot02        knot15        knot27
 athena3        knot03        knot16        knot28
 athena4        knot04        knot17        knot29
 athena5        knot05        knot18        knot30
 athena6        knot06        knot19        knot31
 athena7        knot07        knot20        knot32
 athena8        knot08        knot21        knot33
 athena9        knot10        knot22        knot34
 athena10       knot11        knot23        knot35
 athena11       knot12        knot24        minerva2
 athena12       knot13        knot25        minerva3
        

Results are located on each machine here:

 /mnt/data/commoncrawl/${CC}/wikilinks.dedup/
 /mnt/data/commoncrawl/${CC}/wikilinks.near_dedup/ 
 /mnt/data/commoncrawl/${CC}/wikilinks.dedup_inc/
 /mnt/data/commoncrawl/${CC}/wikilinks.near_dedup_inc/
        

7.3.1 wikilinksDedup.sh

Deduplication script for Wikilinks from CommonCrawl can be found here:

 /mnt/minerva1/nlp/repositories/corpora_processing_sw/processing_steps/3/wikilinksDedup.sh
        

Suitable for incremental deduplication. Example:

 wikilinksDedup.sh  CC-2015-14 CC-2015-32
 # first process CC-2015-14 and then CC-2015-32, preserve hashes
 wikilinksDedup.sh  CC-2015-14
 # process CC-2015-14
        

Specific paths can be changed within the script, as well as the paths to the files specifying the machines the deduplication runs on (add the -n parameter to run nearDedup).

7.4 ClueWeb deduplication

Table containing the numbers of Wikilinks records from the processed sets:

                                  ClueWeb09    ClueWeb12    Sum
 before deduplication             11,744,461   19,803,070   31,547,531
 after deduplication              6,907,001    7,186,902    14,093,903
 after nearDedup                  5,135,733    5,631,135    10,766,868
 after incremental deduplication  6,268,126    5,173,682    11,441,808
 after incremental nearDedup      3,820,834    3,036,299    6,857,133

Incremental deduplication was executed in this order: Wikipedia -> CCs -> Cluewebs.

Table containing sizes of individual processed sets:

                                  ClueWeb09   ClueWeb12   Sum
 before deduplication             5.3 GB      8.0 GB      13.3 GB
 after deduplication              2.4 GB      2.5 GB      4.9 GB
 after nearDedup                  1.8 GB      2 GB        3.8 GB
 after incremental deduplication  2.2 GB      1.8 GB      4 GB
 after incremental nearDedup      1.3 GB      1.1 GB      2.4 GB

Results of both Cluewebs are saved on these machines:

 athena1        knot01        knot14        knot26
 athena2        knot02        knot15        knot27
 athena3        knot03        knot16        knot28
 athena4        knot04        knot17        knot29
 athena5        knot05        knot18        knot30
 athena6        knot06        knot19        knot31
 athena7        knot07        knot20        knot32
 athena8        knot08        knot21        knot33
 athena9        knot10        knot22        knot34
 athena10       knot11        knot23        
 athena11       knot12        knot24        
 athena12       knot13        knot25
        

and they are located in these folders:

 /mnt/data/clueweb/${09|12}/wikilinks.dedup/
 /mnt/data/clueweb/${09|12}/wikilinks.near_dedup/ 
 /mnt/data/clueweb/${09|12}/wikilinks.dedup_inc/
 /mnt/data/clueweb/${09|12}/wikilinks.near_dedup_inc/
        

7.5 Wikipedia deduplication

                                  Wikipedia     Size of individual versions
 before deduplication             125,192,634   36 GB
 after deduplication              119,021,063   34.4 GB
 after nearDedup                  76,920,163    22.2 GB
 after incremental deduplication  119,021,063   34.4 GB
 after incremental nearDedup      76,919,387    22.2 GB

Results are saved on the same machines as CommonCrawl, in the folder:

 /mnt/data/wikipedia/wikipedia.wikilinks/
        

Wikipedia serves as input for incremental deduplication of CCs and Cluewebs.

7.6 Thrift deduplication

thrift(hashes from CC) thrift(hashes from Wikipedia) thrift(hashes from Clueweb) thrift(hashes from CC+Wikipedia) thrift(hashes from CC+Clueweb) thrift(hashes from CC+Wikipedia+Clueweb)
before deduplication 35,359,606 35,359,606 35,359,606 35,359,606 35,359,606 35,359,606
after deduplication 20,927,543 20,927,543 20,927,543 20,927,543 20,927,543 20,927,543
after nearDedup 18,622,532 18,622,532 18,622,532 18,622,532 18,622,532 18,622,532
after incremental deduplication 20,927,499 20,926,875 20,927,493 20,926,831 20,927,450 20,926,782
after incremental nearDedup 15,750,839 16,815,066 18,618,274 14,432,668 14,560,364 13,348,602

8 Datasets processing

8.1 CC and ClueWeb processing

The CC and ClueWeb sets were processed. All datasets were processed by the runWikilinks.sh script for automatic Wikilinks launch, which runs the Wikilinks tool with the following parameters:

 wikilinks.py -i <input_file> -o <output_file> -v -w -e -r <redirection_file> -d <disambiguation_file> -a <all_url_file> -W -I -P <pageTable_file> -k
        

Processed datasets:

 CC-2014-42         CC-2015-32         ClueWeb09
 CC-2014-49         CC-2015-35         ClueWeb12
 CC-2014-52         CC-2015-40
 CC-2015-06         CC-2015-48
 CC-2015-11         CC-2016-07
 CC-2015-14         CC-2016-18
 CC-2015-18         CC-2016-22
 CC-2015-22         CC-2016-26
 CC-2015-27         CC-2016-30
        

Results can be found here:

 /mnt/minerva1/nlp/projects/wikilinks/results.{CC|clueweb}
        

Or on each machine in:

 /mnt/data/commoncrawl/{CC}/wikilinks
 /mnt/data/clueweb/{09|12}/wikilinks
        

9 Wikilinks for distribution

The Wikilinks version suitable for distribution contains the incrementally processed CommonCrawl, ClueWeb and Wikipedia datasets (known together as the Total Dataset) and a subset containing only ambiguous names of people and locations from Wikipedia (the Disambiguation Subset).

9.1 Total dataset

The Total Dataset contains the incremental version of the deduplication, executed in this order: Wikipedia -> CommonCrawl -> ClueWeb. It is represented by multiple files with the *.result suffix, 22 datasets in total, plus a file containing only the unique links, Total_dataset.referred_entities.

9.2 Disambiguation dataset

This chapter deals with extracting the subset of ambiguous names of people and locations from the processed CC, ClueWeb and Wikipedia datasets.

File containing ambiguous names of people can be found here:

 /mnt/minerva1/nlp/projects/decipher_wikipedia/wikipedia_statistics/person_statistics
        

File containing ambiguous names of locations can be found here:

 /mnt/minerva1/nlp/projects/decipher_wikipedia/wikipedia_statistics/location_statistics/data/
        

Both files were merged into a single file:

 /mnt/minerva1/nlp/projects/wikilinks/disambiguation_subset_extraction/all_ambiguous_names
        

The extraction is executed by scripts taken from the Opal Disambiguation project and adapted accordingly. The all_servers_extract_contexts.sh script runs the extract_contexts.sh script, which performs the extraction itself on all machines. all_servers_extract_contexts.sh requires that project_dir (the directory containing these scripts), output_file_destination (the directory where the extraction results are saved) and ambiguous_names_file (the file containing the ambiguous names) are set manually.

Changed versions of scripts used for extraction can be found here:

 /mnt/minerva1/nlp/projects/wikilinks/disambiguation_subset_extraction/all_servers_extract_contexts.sh
 /mnt/minerva1/nlp/projects/wikilinks/disambiguation_subset_extraction/extract_contexts.sh
        

The Disambiguation Subset is split into three files:

 CommonCrawl.result
 Clueweb.result
 Wikipedia.result
        

Again, these were used to create a single file containing only unique links:

 Disambiguation_subset.referred_entities
        

9.3 Results

                      Total Dataset   Disambiguation Subset
 Number of mentions   164,512,667     10,557,592
 Referred entities    4,417,107       286,273
 Size in GB           52              3.1

The Total Dataset is the complete Wikilinks dataset (CC + ClueWeb + Wikipedia). The Disambiguation Subset is the extracted part of the Total Dataset containing ambiguous names of people and locations. Number of mentions is the total number of wiki links in each dataset; referred entities is the number of unique links within a dataset.

Total Dataset [1] and Disambiguation Subset [2] can be found here:

 [1] /mnt/minerva1/nlp/projects/wikilinks/wikilinks_for_distribution/Total_dataset/*.result
 [2] /mnt/minerva1/nlp/projects/wikilinks/wikilinks_for_distribution/Disambiguation_subset/*.result
        

Total Dataset and Disambiguation Subset were archived:

 [1] /mnt/minerva1/nlp/projects/wikilinks/wikilinks_for_distribution/Total_dataset.tar.gz
 [2] /mnt/minerva1/nlp/projects/wikilinks/wikilinks_for_distribution/Disambiguation_subset.tar.gz
        

Both archives contain a README.txt file with a description of the archive and its files.


10 Description of input TSV files

The TSV files required for processing datasets with the Wikilinks tool are located here:

 /mnt/minerva1/nlp/projects/wikilinks/program
        

The tsv_update.py script that keeps them up to date can be found here:

 /mnt/minerva1/nlp/projects/wikilinks
        

To increase the efficiency of the Wikilinks tool, full URL addresses were removed from the TSV files and only the page titles remain (i.e. the prefix http://en.wikipedia.org/wiki/ is removed and "_" is replaced with " "). wikilinks.py rebuilds the URL addresses from these names for further processing.
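
The transformation between stored titles and URLs is a simple string operation in both directions. A sketch, assuming the English Wikipedia prefix and ignoring percent-encoding; this is not the code in wikilinks.py:

 WIKI_PREFIX = "http://en.wikipedia.org/wiki/"     # assumed prefix for the English Wikipedia

 def url_to_title(url):
     """Strip the prefix and turn underscores into spaces, as done when building the TSV files."""
     return url[len(WIKI_PREFIX):].replace("_", " ")

 def title_to_url(title):
     """Rebuild the URL from a stored title; the inverse of url_to_title()."""
     return WIKI_PREFIX + title.replace(" ", "_")

 print(title_to_url("Napoleon's Barber"))          # -> http://en.wikipedia.org/wiki/Napoleon's_Barber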

10.1 wikipedia-all.tsv

Contains all URL address names of the English Wikipedia, one name per row. Both regular page names and redirects are included. Two files from the Decipher Wikipedia project are used for updating:

 /mnt/minerva1/nlp/projects/decipher_wikipedia/redirectsFromWikipedia/wikipedia-regular-pages.tsv
 /mnt/minerva1/nlp/projects/decipher_wikipedia/redirectsFromWikipedia/wikipedia-redirects.tsv
        

These two files combined give us the final:

 wikipedia-all.tsv
        

Format example:

 Napoleon
 Napoleon's Barber
 Napoleon's Campaigns in Miniature
        

10.2 wikipedia-disambiguations.tsv

Contains only the names of disambiguation pages, one per row. A file from the Wikipedia disambiguation pages project is used for updating:

 /mnt/minerva1/nlp/projects/wikipedia_disambiguation_pages/output/wp_disambiguation_titles.sorted
        

This gives us the final:

 wikipedia-disambiguations.tsv
        

Format example:

 Napoleon at St. Helena
 Napoleon (disambiguation)
 Napoleone Orsini (disambiguation)
        

10.3 wikipedia-redirects.tsv

Contains all redirecting URL address names. Each row holds a final address name and the group of all names redirecting to that address, separated by "|".

Format:

 final_adress_name\tname_of_redirecting_adress1|name_of_redirecting_adress2|...
        

A file from the Decipher Wikipedia project is used for updating:

 /mnt/minerva1/nlp/projects/decipher_wikipedia/redirectsFromWikipedia/wikipedia-redir-table.tsv
        

This gives us the final:

 wikipedia-redirects.tsv
        

Format example:

 Napoleon            Napolean Bonapart|Napolean Bonaparte
 Napoleon Dynamite   Napolean Dynamite
        

11 Comparison of results from VERT and WARC files processing

A comparison between the CC-2015-32 and CC-2015-40 datasets. Both datasets were first processed from the input WARC format and then from the input VERT format; in both cases the Wikilinks tool was launched with the same parameters. Surprisingly, the VERT format yielded more wiki links, see the table:

 Dataset                            CC-2015-32     CC-2015-32    CC-2015-40     CC-2015-40
 File type                          WARC           VERT          WARC           VERT
 Total number of found wiki links   60,488,837     87,742,422    34,500,684     66,292,376
 Total processing time              10 hr 32 min   4 hr          14 hr 49 min   1 hr 50 min

12 Page table processing

Executed by script:

 /mnt/minerva1/nlp/projects/wikilinks/xbolje00/pageTable.py
        

The script extracts information from an SQL dump that contains the commands to create and fill the page table.

Table format:

 +--------------------+---------------------+------+-----+----------------+----------------+
 | Field              | Type                | Null | Key | Default        | Extra          |
 +--------------------+---------------------+------+-----+----------------+----------------+
 | page_id            | int(10) unsigned    | NO   | PRI | NULL           | auto_increment |
 | page_namespace     | int(11)             | NO   | MUL | NULL           |                |
 | page_title         | varbinary(255)      | NO   |     | NULL           |                |
 | page_restrictions  | tinyblob            | NO   |     | NULL           |                |
 | page_counter       | bigint(20) unsigned | NO   |     | 0              |                |
 | page_is_redirect   | tinyint(3) unsigned | NO   | MUL | 0              |                |
 | page_is_new        | tinyint(3) unsigned | NO   |     | 0              |                |
 | page_random        | double unsigned     | NO   | MUL | NULL           |                |
 | page_touched       | binary(14)          | NO   |     |                |                |
 | page_latest        | int(10) unsigned    | NO   |     | NULL           |                |
 | page_len           | int(10) unsigned    | NO   | MUL | NULL           |                |
 | page_content_model | varbinary(32)       | YES  |     | NULL           |                |
 | page_links_updated | varbinary(14)       | YES  |     | NULL           |                |
 +--------------------+---------------------+------+-----+----------------+----------------+
        

The page_id and page_title columns are important for further processing. Their values are taken from the INSERT commands during processing and subsequently saved to the output file.
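
The extraction can be pictured as a regular expression applied to every INSERT line of the dump. A sketch, assuming page_id, page_namespace and page_title are the first three fields of each tuple (as in the table layout above); this is not the actual pageTable.py:

 import re

 # matches the first three fields of each tuple in an INSERT statement:
 # (page_id, page_namespace, 'page_title', ...)
 ROW_RE = re.compile(r"\((\d+),(-?\d+),'((?:[^'\\]|\\.)*)'")

 def page_titles(sql_path):
     """Yield (page_id, page_title) pairs from the page-table SQL dump."""
     with open(sql_path, encoding="utf-8", errors="replace") as handle:
         for line in handle:
             if not line.startswith("INSERT INTO"):
                 continue
             for page_id, _namespace, title in ROW_RE.findall(line):
                 # only apostrophe escapes are unescaped in this sketch
                 yield page_id, title.replace("\\'", "'")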

Launch:

 ./pageTable.py -i {input_file} -o {output_file}
        

Input SQL script:

 /mnt/minerva1/nlp/projects/decipher_wikipedia/redirectsFromWikipedia/enwiki-latest-page.sql
        

Output file path:

 /mnt/minerva1/nlp/projects/wikilinks/xbolje00/wikipedia-pageTable.tsv
        

Output file format:

 page_id\tpage_title
 page_id\tpage_title
 ...
        

13 Decoding the Wikilinks dataset from Apache Thrift

The gsoc-wikilinks program is used to decode the Wikilinks dataset from the Apache Thrift format into the required format. The program is launched with two parameters specifying the input and output directories. Files in the input directory are read, decoded and saved to the output directory. If an error occurs while reading a file (the file cannot be read, is not in Apache Thrift format, etc.), an exception is thrown and processing continues with the next file. The output has to contain the address offset and the text offset. The text offset is always given, but the address offset has to be calculated, which is only possible if the original HTML document is available.

The program can be found here:

 /mnt/minerva1/nlp/projects/wikilinks/gsoc-wikilinks
        

13.1 Build

Run the following command in the program directory:

 mvn package
        

When the build successfully ends, the following jar archives are created in this directory:

 gsoc-wikilinks-1.0-SNAPSHOT.jar
 gsoc-wikilinks-1.0-SNAPSHOT-jar-with-dependencies.jar
        

13.2 Launch

Launch using:

 java -Xmx1g -jar gsoc-wikilinks-1.0-SNAPSHOT-jar-with-dependencies.jar -i "input_directory" -o "output_directory"
        

13.3 Decoding

Used dataset can be found here:

 /mnt/data-2/nlp/wiki-links/Dataset (with complete webpages)
        

Output can be found here:

 /mnt/data/nlp/projects/wikilinks/result_from_thrift
        

13.3.1 Statistics

 Number of files          109
 Total time               14:33:31
 Number of documents      10,629,486
 Number of links          36,166,829
 Calculated offsets       32,638,607
 Missing offsets          3,528,222
 Calculated offsets in %  89%