The goal of this project was to develop a program that extracts Wikipedia links from the ClueWeb and CommonCrawl data sets, similar to those in the Wikilinks data set. Another goal was to evaluate the speed of the developed tool. The entire functionality is packed into the wikilinks.py script. A detailed description can be found below.
wikilinks.py description
Since the data type and charset have to be determined for every single WARC record, records whose type is not application/http;msgtype=response or application/https;msgtype=response are excluded from the analysis. Each record that meets this requirement should contain an HTTP header and the specified type of content. The only analyzed content type is text/html, as content such as image/jpeg is not expected to contain links.
The charset of a document can sometimes be determined from the HTTP header. The script also searches for a charset declaration in the body of the HTTP record (more specifically in the HTML meta tags). A charset from the HTML has higher priority than one from the HTTP header. If the charset cannot be determined at all, the default HTTP charset ISO-8859-1 is used. Characters that cannot be converted to UTF-8 are excluded.
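The priority order can be illustrated with a short Python sketch (a minimal illustration only; the regex and function names are not the actual ones used in wikilinks.py):

import re

META_CHARSET_RE = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)
HEADER_CHARSET_RE = re.compile(r'charset=([\w-]+)', re.IGNORECASE)

def resolve_charset(http_content_type, html_body):
    # HTML meta tag wins over the HTTP header; default is ISO-8859-1
    charset = "ISO-8859-1"
    header_match = HEADER_CHARSET_RE.search(http_content_type or "")
    if header_match:
        charset = header_match.group(1)
    meta_match = META_CHARSET_RE.search(html_body or b"")
    if meta_match:
        charset = meta_match.group(1).decode("ascii", "ignore")
    return charset

def decode_body(html_body, charset):
    # characters that cannot be converted are dropped
    return html_body.decode(charset, errors="ignore")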
The script then looks for links, using regular expressions. Only HTML links (<a> elements) are accepted. If a link to Wikipedia or Freebase is found, the script determines the associated information (see the output format below), which is saved to the output file together with the charset and the name of the source file. If any piece of information cannot be found, the program saves an empty string in its place; this behaviour can be changed in constants.py.
The program only saves Wikipedia links from the main namespace. This filter is based on the occurrence of ':' in the name of the Wikipedia page: if the name matches :[^_], the page is considered to belong to a different Wikipedia namespace. The filter is implemented this way for the sake of simplicity, because namespace names change with the language of the Wikipedia edition.
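A minimal sketch of this filter (the helper name is illustrative):

import re

# A colon followed by anything other than "_" marks a page outside the main
# namespace, e.g. "Category:France" or "File:Logo.png".
NON_MAIN_NAMESPACE_RE = re.compile(r':[^_]')

def is_main_namespace(page_name):
    # True when the Wikipedia page name looks like a main-namespace article
    return NON_MAIN_NAMESPACE_RE.search(page_name) is None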
The program can search for Freebase or Wikipedia links, using the parameters -f or -w respectively.
Links can also be filtered to a specific language version of Wikipedia: -e for English and -l for other versions. The program determines the language of a given Wikipedia page from the third- (or fourth-) level domain.
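A sketch of this domain-based language detection (an illustration only, assuming the usual en.wikipedia.org / en.m.wikipedia.org host layout):

from urllib.parse import urlparse

def wiki_language(url):
    # 'en' for en.wikipedia.org, and also for en.m.wikipedia.org (fourth-level domain)
    labels = urlparse(url).netloc.lower().split('.')
    if len(labels) >= 3 and labels[-2] == 'wikipedia':
        lang = labels[-3]
        if lang == 'm' and len(labels) >= 4:
            lang = labels[-4]
        return lang
    return ''

# wiki_language("http://en.wikipedia.org/wiki/KPMG")  ->  'en'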
If a file containing Wikipedia redirect data is available, the program can save the URL of the redirect target rather than the URL found in the WARC document; this is enabled with the -r parameter.
The program saves logs to a file named <output_file>.log.
The number of words saved before and after the link can be changed in constants.py.
The wikilinks.py script can now process VERT files as well as WARC files. The script is launched the same way; it determines the file type on its own and processes it accordingly.
A VERT file contains verticalized data, where tokens (words) are listed one per line, separated by \n. A record starts with a <doc ...> tag (which carries the id, url and title of the document) and ends with </doc>. Individual links in the document are marked by <link="wikipedia/freebase/other url"/>; the element <length="number"/> on the same line gives the number of tokens (words) that the link text consists of. These tokens are located immediately before the <link> tag. Regular expressions are used to process this format.
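A hedged sketch of such regex-based processing of a VERT record (the exact quoting of the tags in real VERT files may differ):

import re

LINK_RE = re.compile(r'<link="([^"]+)"/>')
LENGTH_RE = re.compile(r'<length="(\d+)"/>')

def iter_vert_links(lines):
    # yields (url, anchor_text); the anchor text is taken from the tokens
    # located directly before the <link=.../> line, as described above
    tokens = []
    for line in lines:
        line = line.rstrip('\n')
        link = LINK_RE.search(line)
        if link:
            length = LENGTH_RE.search(line)
            n = int(length.group(1)) if length else 0
            anchor = ' '.join(tokens[-n:]) if n else ''
            yield link.group(1), anchor
        elif line and not line.startswith('<'):
            tokens.append(line)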
Used for Wikipedia addresses and their normalization. An address that originally contained the CUID of an article, e.g. http://en.wikipedia.org/wiki/index.html?curid=75960, has its .../wiki/ part replaced with the name of the article, e.g. https://en.wikipedia.org/wiki/KPMG. To activate this function, launch the program with the parameter -p followed by a .tsv file where each line contains a CUID and a title (e.g. "75960\tKPMG"). This file is loaded into memory upon launch and the CUID suffix in the URL is replaced with the corresponding article title (a sketch of this normalization follows the path below).
Path to file:
/mnt/minerva1/nlp/projects/decipher_wikipedia/redirectsFromWikipedia/enwiki-latest-page.sql
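A minimal sketch of this normalization (illustrative names; the real implementation lives in wikilinks.py):

import csv

def load_page_table(tsv_path):
    # CUID -> article title, e.g. {'75960': 'KPMG'}
    with open(tsv_path, encoding='utf-8') as f:
        return {row[0]: row[1] for row in csv.reader(f, delimiter='\t') if len(row) >= 2}

def normalize_curid_url(url, page_table):
    # .../wiki/index.html?curid=75960  ->  .../wiki/KPMG (when the CUID is known)
    if '?curid=' not in url:
        return url
    curid = url.split('?curid=', 1)[1].split('&', 1)[0]
    title = page_table.get(curid)
    if title is None:
        return url
    return url.split('/wiki/', 1)[0] + '/wiki/' + title.replace(' ', '_')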
wikilinks.py parameters
Launch example:
wikilinks.py -i <input_file> -o <output_file> [-p <path>] [-v] [-w] [-f] [-e] [-l <language>[:<language>]] [-r <redirection_file>] [-s] [-d <disambiguation_file>] [-W] [-h] [-a <all_url_file>] [-k]
-i - specifies file containing a list of WARC/VERT files to be analysed; it should contain one record per line; the record should be in the format <[path]warc|vert_file_name>\t<number> or <[path]warc|vert_file_name>
-p - specifies a path prefix shared by all files from the input file
-o - specifies the name of the output file; the log file will be called <output_file_name>.log
-v - verbose mode
-w - searches for links to Wikipedia
-f - searches for links to Freebase
-e - only links to English Wikipedia will be recognized
-l - only links to a given language-specific Wikipedia will be recognized
-r - a tsv file containing redirections from English Wikipedia
-d - a tsv file containing disambiguation pages from English Wikipedia
-a - a tsv file containing all URLs from English Wikipedia
-W - filtering out documents from Wikipedia
-s - filtering out all contexts that are not referenced from anchor text similar to the Wikipedia page name
-I - filtering out URLs with a fragment identifier
-P - a tsv file containing the Wikipedia page table (CUIDs and their titles)
-h - prints help
-k - filtering out documents from unwanted domains {wikipedia|wikidata|wiktionary|wikimedia|wikisource|edwardbetts|instantchess|viaf|werelate|wmflabs|paheal|fallingrain|isni-url.oclc|blekko}
Parameter -p can be used to analyze ClueWeb12 when a file containing a list of WARC files with their relative paths is available.
Example launch for a single archive can be found here:
/mnt/minerva1/nlp/projects/wikilinks/test/test.sh
Wikilinks project source files can be found here:
/mnt/minerva1/nlp/projects/wikilinks/program/
What does each file do:
wikilinks.py - the program extracting Wikipedia and Freebase links
warcm.py - WARC file processing module
vert.py - VERT file processing module
httpm.py - HTTP header processing module
constants.py - constants
Output files are in TSV (Tab-Separated Values) format. The meaning of each column is as follows:
------+-------------------------------------------------
Number| Value
======+=================================================
    1 | original address
------+-------------------------------------------------
    2 | normalized address
------+-------------------------------------------------
    3 | link text (no entities)
------+-------------------------------------------------
    4 | original link text
------+-------------------------------------------------
    5 | context before link (no entities)
------+-------------------------------------------------
    6 | context after link (no entities)
------+-------------------------------------------------
    7 | encoding
------+-------------------------------------------------
    8 | address offset in WARC/VERT file
------+-------------------------------------------------
    9 | text offset in WARC/VERT file
------+-------------------------------------------------
   10 | WARC/VERT file name
------+-------------------------------------------------
   11 | URL of corresponding document in WARC/VERT file
------+-------------------------------------------------
Automatic launch script for Wikilinks can be found here:
/mnt/data/nlp/projects/wikilinks/runWikilinks.sh
Required launch parameters:
-l LOGIN - username
-s SOURCE_FOLDER - folder containing files to be processed
-d DESTINATION_FOLDER - folder for output files (has to exist already)
-w | -v - process WARC or VERT files
To launch successfully, set up login via SSH keys to the machines where the files will be processed and save their names to:
/mnt/data/nlp/projects/wikilinks/servers.txt
Script launch order:
list.sh, list_dirs.sh - create the list of files to be processed
split.sh - splits the list for multiple processes
prepare.sh - prepares directories for Wikilinks outputs
run.sh - runs the Wikilinks tool on the servers specified in servers.txt
copy.sh - copies output from the machines to minerva1.fit.vutbr.cz
Logs of the process are saved here:
/mnt/data/nlp/projects/wikilinks/autorunOutput.log /mnt/data/nlp/projects/wikilinks/autorunError.log
./runWikilinks.sh -l iotrusina -s /mnt/data/commoncrawl/CC-2015-06/ -d /mnt/data/nlp/projects/wikilinks/results.CC-2015-06/ -w
Note:
It is advised to run the script inside screen sessions, as it takes a long time.
Durations listed below represent the estimated time it takes to process the respective CommonCrawl 2014 part on individual machines.
Machine | No of files | 1 process | 2 processes | 3 processes | 4 processes | 5 processes | 6 processes |
---|---|---|---|---|---|---|---|
knot01 | 2500 | 192:21:07 | 98:59:31 | 70:23:59 | 53:55:42 | 40:49:35 | 37:28:19 |
knot03 | 2500 | 112:46:48 | 61:00:58 | 39:32:38 | 31:49:12 | 26:10:57 | 25:22:56 |
knot04 | 2500 | 116:28:20 | 63:05:04 | 41:55:47 | 33:26:21 | 24:45:13 | 24:34:53 |
knot05 | 2600 | 128:58:11 | 65:43:50 | 44:50:05 | 35:59:16 | 28:19:13 | 27:28:51 |
knot06 | 2500 | 115:49:10 | 58:59:10 | 42:36:59 | 30:47:32 | 24:38:07 | 21:32:13 |
knot07 | 2500 | 114:32:38 | 58:57:01 | 42:39:27 | 30:55:40 | 26:51:25 | 22:19:48 |
knot08 | 2700 | 122:51:36 | 62:10:48 | 42:11:48 | 34:15:05 | 31:29:38 | 26:20:27 |
knot10 | 2600 | 115:47:47 | 52:34:57 | 36:45:08 | 27:36:03 | 23:27:30 | 19:33:34 |
knot11 | 2200 | 87:14:54 | 46:49:46 | 10:38:42 | 24:32:21 | 19:27:05 | 16:42:27 |
athena3 | 2700 | 104:35:51 | 56:23:20 | 37:19:09 | 27:11:53 | 23:42:00 | 20:12:17 |
athena2 | 2100 | 65:13:28 | 33:27:54 | 24:09:19 | 17:06:51 | 14:07:46 | 12:39:36 |
knot14 | 2100 | 83:13:48 | 37:41:42 | 37:08:53 | 26:42:57 | 22:29:44 | 17:42:52 |
knot15 | 2100 | 52:24:24 | 29:11:31 | 20:07:23 | 14:50:38 | 12:16:03 | 09:29:28 |
knot16 | 2100 | 56:42:56 | 29:29:54 | 24:36:32 | 20:29:35 | 17:52:11 | 17:07:22 |
knot17 | 2100 | 69:30:29 | 31:06:02 | 24:48:12 | 21:45:46 | 19:29:25 | 16:10:47 |
knot18 | 2100 | 92:17:35 | 49:52:09 | 34:15:19 | 26:23:49 | 20:55:41 | 17:59:25 |
knot19 | 2100 | 75:19:05 | 44:09:02 | 34:44:27 | 28:17:23 | 22:39:17 | 18:55:44 |
knot20 | 2100 | 67:19:14 | 38:36:01 | 31:22:48 | 26:02:59 | 21:03:23 | 16:37:21 |
knot21 | 2100 | 64:03:42 | 32:14:10 | 23:18:45 | 20:10:55 | 18:43:55 | 17:07:27 |
knot22 | 2100 | 86:28:38 | 49:30:10 | 33:37:08 | 25:25:29 | 20:17:49 | 16:01:15 |
knot23 | 2100 | 63:20:18 | 50:32:17 | 33:36:58 | 25:24:19 | 21:35:29 | 16:58:37 |
knot24 | 2100 | 50:41:02 | 27:54:03 | 19:10:53 | 14:48:27 | 10:59:45 | 09:59:11 |
knot25 | 2700 | 147:13:30 | 76:50:11 | 50:02:03 | 39:45:54 | 32:17:29 | 27:46:18 |
athena1 | 2700 | 75:02:42 | 39:11:11 | 27:23:12 | 20:47:35 | 17:07:30 | 14:20:14 |
For this comparison, a raw file [2] from the Wikilinks set was processed and compared to the data extracted with the Wikilinks project [1].
Due to the computational cost, only a part of the files was processed.
The original full-content file [2] had to be split into parts, which were then merged into a WARC file that was analyzed with Wikilinks. The program cannot process files that large (13 GB) with reasonable memory consumption. Only one WARC file is processed at a time, and these usually contain a single page (a few KB).
The comparison could only be made between individual Wikipedia page addresses and the text used in their links.
The difference between the original Wikilinks and our Wikilinks is that the links extracted by the original Wikilinks also include links to Wikipedia namespaces other than 0, i.e. links to images, files and so on. Some differences arose during the normalization of redirects, others came from different handling of whitespace and similar details.
[1] http://iesl.cs.umass.edu/downloads/wiki-link/context-only/001.gz
[2] http://iesl.cs.umass.edu/downloads/wiki-link/full-content/part1/001.gz
Number of links found through Wikilinks (original): 123810
Number of links found through Wikilinks: 187985
Number of overlapping links: 102185
Number of unique links: 110547
Number of unique links represents the number of links that differ from the ones found using the original Wikilinks. Number of overlapping links represents the number of identical links.
Results were generated like this:
diff clueweb12.out wikilinks.out | wc -l
comm -1 -2 out1.out.sorted 001.out.sorted | wc -l
They can be found here:
/mnt/minerva1/nlp/projects/wikilinks/porovnani/
The goal was to develop a script that analyzes identical link contexts and creates a histogram of the results produced by the Wikilinks tool on CommonCrawl data.
Script duplicateContexts.py loads data from a specified directory and searches for duplicate link contexts across the files. Duplicate link contexts are merged and counted, then ordered by links and saved, together with additional data, to an output file in the RESULTS directory.
Script can be found here:
/mnt/minerva1/nlp/projects/wikilinks/duplicateContexts.py
Launch:
./duplicateContexts.py [-f FOLDER] [-n NUMBER] [-W]
-f FOLDER - location of the folder to be processed; if not set, files in the location of the script are processed instead
-n NUMBER - number of the most commonly occurring duplicate link contexts to be saved to files
-W - only processes records that do not come from foreign-language Wikipedia pages; their histogram is split by source into 3 files: English, other, and records containing at least one of both English and other languages; if parameter -n is set, the script also writes the most commonly occurring duplicate link contexts to files duplicate.wdocs.N.result (records with no foreign-language Wikipedia sources) and duplicate.wdocs.BOTH.N.result (records with at least one entry from English Wikipedia and one from another page)
The RESULT directory is created for the output of the script and contains the following files:
script.result
Contains link contexts histogram:
01 Normalized address
02 Context before link (no entities)
03 Context after link
04 Number of duplicate link contexts
Individual columns in the file are separated by tabs.
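A simplified sketch of how such a histogram can be built from the wikilinks output files (assuming the TSV column order documented above: column 2 is the normalized address, columns 5 and 6 are the contexts):

from collections import Counter

def context_histogram(paths):
    counts = Counter()
    for path in paths:
        with open(path, encoding='utf-8') as f:
            for line in f:
                cols = line.rstrip('\n').split('\t')
                if len(cols) >= 6:
                    # key: (normalized address, context before, context after)
                    counts[(cols[1], cols[4], cols[5])] += 1
    return counts

# context_histogram(files).most_common(N) gives the N most frequent duplicate link contexts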
script.result.log
Contains additional information, such as the total number of lines in all files, the number of unique link contexts, the number of duplicate link contexts, and the time it took to process the smaller parts.
duplicate.N.result
These files contain the most commonly occurring duplicate link contexts.
Data tested by the script:
/mnt/minerva1/nlp/projects/wikilinks/results.CC-2014-49
Output is saved here:
/mnt/minerva1/nlp/projects/wikilinks/results.CC-2014-49/RESULTS
Script domains.py loads data from a specified folder that contains the results of the duplicateContexts.py script and lists the domains found in the duplicateN.result files.
Script can be found here:
/mnt/minerva1/nlp/projects/wikilinks/domains.py
Launch:
./domains.py [-f FOLDER] [-H]
-f - location of the folder to be processed; if not set, files in the location of the script are processed instead
-H - prints a histogram of domains from all files
domains.result
Contains the names of the files and, under each name, the domains with their counts.
domains.histogram.result
Contains a histogram of domains and their counts from all processed files.
moreDomainFiles.result
Contains files with more than one domain and their counts.
The goal was to compare how long it takes to process CC-2014-49 and CC-2014-52 on individual machines, based on the *.result.log files in the following folders, as well as the duration of each process, and finally to create a graph using the gnuplot tool.
/mnt/minerva1/nlp/projects/wikilinks/results.CC-2014-49 /mnt/minerva1/nlp/projects/wikilinks/results.CC-2014-52
Script processingTime.py can be found here:
/mnt/minerva1/nlp/projects/wikilinks/charts/processingTime.py
Launch:
./processingTime.py DIR OUTPUTFILENAME
DIR - directory containing the processed *.result.log files
OUTPUTFILENAME - name of the output file
File format:
[number] [machine_name.thread] [time in hh:mm:ss] [time in seconds]
Processing times of CC-2014-49 and CC-2014-52 can be found here:
/mnt/minerva1/nlp/projects/wikilinks/results.CC-2014-49/timeComparision-2014-49.dat /mnt/minerva1/nlp/projects/wikilinks/results.CC-2014-52/timeComparision-2014-52.dat
The gnuplot script timeChart.gpl creates graphs based on the two files above:
/mnt/minerva1/nlp/projects/wikilinks/charts/timeChart.gpl
The output files timeComparision.CC-2014-49.png and timeComparision.CC-2014-52.png compare the total processing time on individual machine threads in seconds.
/mnt/minerva1/nlp/projects/wikilinks/charts/timeComparision.CC-2014-49.png /mnt/minerva1/nlp/projects/wikilinks/charts/timeComparision.CC-2014-52.png
The goal was to create a graph displaying the size of data processed by individual processes based on files located here:
/mnt/minerva1/nlp/projects/wikilinks/results.CC-2014-49/list /mnt/minerva1/nlp/projects/wikilinks/results.CC-2014-52/list
Script dataSize.py searches for files that match the regular expression, calculates the total size of the matching files and saves it to the output file.
Script location:
/mnt/minerva1/nlp/projects/wikilinks/charts/dataSize.py
Launch:
./dataSize.py DIR REGEX OUTPUTFILE
DIR - directory containing the input files
REGEX - regular expression used to find the desired files
OUTPUTFILE - output file
Script associateFilesData.py merges data from the files created by the first script into a single output file; the files containing partial data are then deleted.
Script location:
/mnt/minerva1/nlp/projects/wikilinks/charts/associateFilesData.py
Launch:
./associateFilesData.py DIR OUTPUTFILE
DIR - directory containing the files with partial data
OUTPUTFILE - output file for the merged data
File format:
[number] [machine_name.thread] [size in GB]
Output files are saved here:
/mnt/minerva1/nlp/projects/wikilinks/results.CC-2014-49/list/dataSizeComparision.CC-2014-49 /mnt/minerva1/nlp/projects/wikilinks/results.CC-2014-52/list/dataSizeComparision.CC-2014-52
Graphs are saved here:
/mnt/minerva1/nlp/projects/wikilinks/charts/dataSizeComparision.CC-2014-49.png /mnt/minerva1/nlp/projects/wikilinks/charts/dataSizeComparision.CC-2014-52.png
Scripts and programs for deduplication of the Wikilinks format are described here.
Deduplication is performed by two programs, dedup and server.
Source files (wikilinks branch of git repository):
/mnt/minerva1/nlp/repositories/corpora_processing_sw/
Directory:
/mnt/minerva1/nlp/repositories/corpora_processing_sw/processing_steps/3/dedup/
Build using the Makefile.
The dedup program is the "worker" and performs the deduplication itself. It needs a running server program, which acts as the "hash holder". To run deduplication, at least one server and one worker must be running. The principle in short: workers calculate hashes of records and ask the servers whether they have seen them; if no server recognizes a hash, the record is considered unique. Multiple workers and servers can run simultaneously; usually one worker runs on each machine, and a few specific machines also run a server.
Since running N workers and M servers on different machines by hand is inefficient, the scripts deduplicate.py and server.py were created to manage this.
They are located here:
/mnt/minerva1/nlp/repositories/corpora_processing_sw/processing_steps/3/
The scripts require the dedup and server programs to be available on all machines at the same location (parameter -b). Specify the file containing the list of workers (parameter -w) and the file containing the list of servers (parameter -s); both files have the same format, one HOSTNAME per line. Both scripts use the parameter -p to set the port.
Script server.py has the optional parameters -i and -o, which specify input and output files for loading and saving hashes. The servers run in the background; use the parameter start to launch them, stop to terminate them, or restart to restart them. If none of these parameters is given, the script only checks the status of the servers and of the screens in which they run.
Script deduplicate.py uses the parameters -i and -o to specify the input and output folders for deduplication. Use the parameter -wl to process Wikilinks data. Another optional parameter is -n, which enables so-called near deduplication.
Columns "context before link" and "context after link" need to be identical in all deduplicated files.
There is a script responsible for that:
/mnt/minerva1/nlp/projects/wikilinks/wikilinksConvertContext.py
It truncates the contexts to 10 words.
Example use:
wikilinksConvertContext.py < input_file > output_file
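A minimal stdin/stdout sketch of this truncation (not the actual script; it assumes the documented column order where columns 5 and 6 hold the contexts):

import sys

LIMIT = 10
BEFORE, AFTER = 4, 5   # 0-based indices of columns 5 and 6

for line in sys.stdin:
    cols = line.rstrip('\n').split('\t')
    if len(cols) > AFTER:
        cols[BEFORE] = ' '.join(cols[BEFORE].split()[-LIMIT:])  # last 10 words before the link
        cols[AFTER] = ' '.join(cols[AFTER].split()[:LIMIT])     # first 10 words after the link
    sys.stdout.write('\t'.join(cols) + '\n')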
python server.py -s servers.txt -w workers.txt -p 11111 start   # launch servers on the machines listed in servers.txt on port 11111
python server.py -s servers.txt -w workers.txt -p 11111         # check whether the servers are running
python deduplicate.py -s servers.txt -w workers.txt -p 11111 -wl -i /mnt/data/wikilinks -o /mnt/data/wikilinks_dedup
# launch deduplication of the files in /mnt/data/wikilinks and save the output to /mnt/data/wikilinks_dedup
# runs in parallel on the machines listed in workers.txt (add -n to run nearDedup)
Deduplication of the Wikilinks format is implemented as follows: a hash is calculated over the concatenation of columns 2, 3, 5 and 6, and all of these columns have to match for a row to be evaluated as a duplicate. NearDedup calculates N-gram-based hashes over the concatenation of columns 5, 3 and 6 (in this order) and, in addition, a hash of column 2; a row is a duplicate only if both of these match an existing entry.
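The keying scheme can be illustrated in Python (the dedup and server programs themselves are separate compiled tools; this is only a simplified illustration, and the real near-deduplication hashing may use a different N-gram/similarity scheme):

import hashlib

def dedup_key(cols):
    # exact deduplication: one hash over columns 2, 3, 5 and 6 (1-based)
    payload = '\t'.join(cols[i] for i in (1, 2, 4, 5))
    return hashlib.md5(payload.encode('utf-8')).hexdigest()

def near_dedup_key(cols, n=3):
    # near deduplication: word n-grams over columns 5, 3, 6 (in this order)
    # plus a separate hash of column 2; both parts must match for a duplicate
    words = ' '.join((cols[4], cols[2], cols[5])).split()
    ngrams = sorted(set(tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))))
    text_hash = hashlib.md5(repr(ngrams).encode('utf-8')).hexdigest()
    url_hash = hashlib.md5(cols[1].encode('utf-8')).hexdigest()
    return text_hash, url_hash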
Table containing the numbers of Wikilinks records from the processed CCs:
CC-2014-42 | CC-2014-49 | CC-2014-52 | CC-2015-06 | CC-2015-11 | CC-2015-14 | CC-2015-18 | CC-2015-22 | CC-2015-27 | CC-2015-32 | CC-2015-35 | CC-2015-40 | CC-2015-48 | CC-2016-07 | CC-2016-18 | CC-2016-22 | CC-2016-26 | CC-2016-30 | Sum | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
before deduplication | 92,814,880 | 45,171,854 | 40,843,668 | 36,321,605 | 38,847,017 | 35,095,339 | 42,807,744 | 41,439,440 | 34,624,096 | 38,110,741 | 40,552,152 | 28,382,908 | 40,465,043 | 38,538,458 | 31,179,951 | 30,295,311 | 25,488,631 | 42,734,222 | 723,713,060 |
after deduplication | 15,454,213 | 13,160,368 | 15,007,065 | 13,497,642 | 14,595,606 | 13,713,196 | 15,482,543 | 15,294,670 | 13,497,513 | 14,283,612 | 14,417,289 | 11,401,264 | 14,225,466 | 14,258,449 | 12,869,089 | 13,065,771 | 10,712,890 | 13,986,207 | 248,922,853 |
after nearDedup | 11,675,540 | 10,235,720 | 11,497,064 | 10,364,118 | 11,368,100 | 10,739,974 | 11,928,285 | 11,856,781 | 10,605,308 | 11,125,573 | 11,296,278 | 9,061,115 | 11,252,212 | 11,159,926 | 10,046,860 | 10,266,312 | 8,333,024 | 10,897,033 | 193,709,223 |
after incremental deduplication | 15,453,792 | 1,382,063 | 1,726,778 | 1,227,687 | 1,453,991 | 1,062,187 | 1,260,128 | 1,244,134 | 955,326 | 924,314 | 836,878 | 714,636 | 945,781 | 1,130,096 | 1,087,456 | 936,315 | 722,957 | 985,277 | 34,049,796 |
after incremental nearDedup | 10,429,073 | 463,743 | 542,288 | 262,169 | 475,620 | 209,076 | 256,643 | 260,088 | 157,469 | 137,950 | 154,390 | 132,593 | 217,288 | 246,024 | 215,418 | 310,357 | 102,328 | 197,985 | 14,770,502 |
Incremental deduplication means that the CCs were processed from left to right with the hashes preserved between runs. Incremental deduplication was executed in this order: Wikipedia -> CCs -> ClueWebs.
Table containing sizes of individual processed CCs
CC-2014-42 | CC-2014-49 | CC-2014-52 | CC-2015-06 | CC-2015-11 | CC-2015-14 | CC-2015-18 | CC-2015-22 | CC-2015-27 | CC-2015-32 | CC-2015-35 | CC-2015-40 | CC-2015-48 | CC-2016-07 | CC-2016-18 | CC-2016-22 | CC-2016-26 | CC-2016-30 | Sum | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
before deduplication | 70 GB | 36 GB | 35 GB | 31 GB | 33 GB | 30 GB | 37 GB | 36 GB | 30 GB | 33 GB | 34 GB | 25 GB | 34 GB | 33 GB | 26 GB | 27 GB | 21 GB | 34 GB | 605 GB |
after deduplication | 6.2 GB | 5.3 GB | 6 GB | 5.4 GB | 5.9 GB | 5.5 GB | 6.2 GB | 6.1 GB | 5.4 GB | 5.7 GB | 5.8 GB | 4.6 GB | 5.7 GB | 5.7 GB | 5.1 GB | 5.2 GB | 4.3 GB | 5.6 GB | 99.7 GB |
after nearDedup | 4.7 GB | 4.1 GB | 4.6 GB | 4.2 GB | 4.6 GB | 4.3 GB | 4.8 GB | 4.8 GB | 4.2 GB | 4.4 GB | 4.5 GB | 3.6 GB | 4.5 GB | 4.5 GB | 4 GB | 4.1 GB | 3.3 GB | 4.4 GB | 77.6 GB |
after incremental deduplication | 6.2 GB | 590 MB | 735 MB | 524 MB | 622 MB | 453 MB | 544 MB | 533 MB | 409 MB | 397 MB | 357 MB | 305 MB | 408 MB | 483 MB | 462 MB | 399 MB | 304 MB | 421 MB | 14 GB |
after incremental nearDedup | 4.2 GB | 196 MB | 232 MB | 110 MB | 204 MB | 89 MB | 110 MB | 106 MB | 65 MB | 57 MB | 66 MB | 58 MB | 94 MB | 106 MB | 94 MB | 137 MB | 43 MB | 87 MB | 6 GB |
Results of each CC are saved on these machines:
athena1 knot01 knot14 knot26 athena2 knot02 knot15 knot27 athena3 knot03 knot16 knot28 athena4 knot04 knot17 knot29 athena5 knot05 knot18 knot30 athena6 knot06 knot19 knot31 athena7 knot07 knot20 knot32 athena8 knot08 knot21 knot33 athena9 knot10 knot22 knot34 athena10 knot11 knot23 knot35 athena11 knot12 knot24 minerva2 athena12 knot13 knot25 minerva3
Results are located on each machine here:
/mnt/data/commoncrawl/${CC}/wikilinks.dedup/ /mnt/data/commoncrawl/${CC}/wikilinks.near_dedup/ /mnt/data/commoncrawl/${CC}/wikilinks.dedup_inc/ /mnt/data/commoncrawl/${CC}/wikilinks.near_dedup_inc/
wikilinksDedup.sh
Deduplication script for Wikilinks from CommonCrawl can be found here:
/mnt/minerva1/nlp/repositories/corpora_processing_sw/processing_steps/3/wikilinksDedup.sh
Suitable for incremental deduplication. Example:
wikilinksDedup.sh CC-2015-14 CC-2015-32   # first process CC-2015-14 and then CC-2015-32, preserving hashes
wikilinksDedup.sh CC-2015-14              # process CC-2015-14
More specific paths, as well as the paths to the files specifying the machines on which deduplication runs, can be changed within the script (add the parameter -n to run nearDedup).
Table containing the number of Wikilinks records from the processed sets:
ClueWeb09 | ClueWeb12 | Sum | |
---|---|---|---|
before deduplication | 11,744,461 | 19,803,070 | 31,547,531 |
after deduplication | 6,907,001 | 7,186,902 | 14,093,903 |
after nearDedup | 5,135,733 | 5,631,135 | 10,766,868 |
after incremental deduplication | 6,268,126 | 5,173,682 | 11,441,808 |
after incremental nearDedup | 3,820,834 | 3,036,299 | 6,857,133 |
Incremental deduplication was executed in this order: Wikipedia -> CCs -> Cluewebs.
Table containing sizes of individual processed sets:
ClueWeb09 | ClueWeb12 | Sum | |
---|---|---|---|
before deduplication | 5.3 GB | 8.0 GB | 13.3 GB |
after deduplication | 2.4 GB | 2.5 GB | 4.9 GB |
after nearDedup | 1.8 GB | 2 GB | 3.8 GB |
after incremental deduplication | 2.2 GB | 1.8 GB | 4 GB |
after incremental nearDedup | 1.3 GB | 1.1 GB | 2.4 GB |
Results of both Cluewebs are saved on these machines:
athena1 knot01 knot14 knot26 athena2 knot02 knot15 knot27 athena3 knot03 knot16 knot28 athena4 knot04 knot17 knot29 athena5 knot05 knot18 knot30 athena6 knot06 knot19 knot31 athena7 knot07 knot20 knot32 athena8 knot08 knot21 knot33 athena9 knot10 knot22 knot34 athena10 knot11 knot23 athena11 knot12 knot24 athena12 knot13 knot25
and they are located in these folders:
/mnt/data/clueweb/${09|12}/wikilinks.dedup/ /mnt/data/clueweb/${09|12}/wikilinks.near_dedup/ /mnt/data/clueweb/${09|12}/wikilinks.dedup_inc/ /mnt/data/clueweb/${09|12}/wikilinks.near_dedup_inc/
Number of Wikipedia records | Size of individual versions | |
---|---|---|
before deduplication | 125,192,634 | 36 GB |
after deduplication | 119,021,063 | 34.4 GB |
after nearDedup | 76,920,163 | 22.2 GB |
after incremental deduplication | 119,021,063 | 34.4 GB |
after incremental nearDedup | 76,919,387 | 22.2 GB |
Results are saved on the same machines as CommonCrawl, in the folder:
/mnt/data/wikipedia/wikipedia.wikilinks/
Wikipedia serves as input for incremental deduplication of CCs and Cluewebs.
thrift(hashes from CC) | thrift(hashes from Wikipedia) | thrift(hashes from Clueweb) | thrift(hashes from CC+Wikipedia) | thrift(hashes from CC+Clueweb) | thrift(hashes from CC+Wikipedia+Clueweb) | |
---|---|---|---|---|---|---|
before deduplication | 35,359,606 | 35,359,606 | 35,359,606 | 35,359,606 | 35,359,606 | 35,359,606 |
after deduplication | 20,927,543 | 20,927,543 | 20,927,543 | 20,927,543 | 20,927,543 | 20,927,543 |
after nearDedup | 18,622,532 | 18,622,532 | 18,622,532 | 18,622,532 | 18,622,532 | 18,622,532 |
after incremental deduplication | 20,927,499 | 20,926,875 | 20,927,493 | 20,926,831 | 20,927,450 | 20,926,782 |
after incremental nearDedup | 15,750,839 | 16,815,066 | 18,618,274 | 14,432,668 | 14,560,364 | 13,348,602 |
The CC and ClueWeb sets were processed. All datasets were processed with runWikilinks.sh, the script for automatic launching of the Wikilinks tool. The script runs the Wikilinks tool with the following parameters:
wikilinks.py -i <input_file> -o <output_file> -v -w -e -r <redirection_file> -d <disambiguation_file> -a <all_url_file> -W -I -P <pageTable_file> -k
-i - specifies file containing a list of WARC files to be analysed; it should contain one record per line; the record should be in the format <[path]warc_file_name>\t<number> or <[path]warc_file_name>
-o - specifies the name of the output file; the log file will be called <output_file_name>.log
-v - verbose mode
-w - searches for links to Wikipedia
-e - only links to English Wikipedia will be recognized
-r - a tsv file containing redirections from English Wikipedia (in this case: /mnt/minerva1/nlp/projects/wikilinks/program/wikipedia-redirects.tsv)
-d - a tsv file containing disambiguation pages from English Wikipedia (in this case: /mnt/minerva1/nlp/projects/wikilinks/program/wikipedia-disambiguations.tsv)
-a - a tsv file containing all URLs from English Wikipedia (in this case: /mnt/minerva1/nlp/projects/wikilinks/program/wikipedia-all.tsv)
-W - filtering out documents from Wikipedia
-I - filtering out URLs with a fragment identifier
-P - a tsv file containing the Wikipedia page table (CUIDs and their titles) (in this case: /mnt/minerva1/nlp/projects/wikilinks/xbolje00/wikipedia-pageTable.tsv)
-k - filtering out documents from unwanted domains {wikipedia|wikidata|wiktionary|wikimedia|wikisource|edwardbetts|instantchess|viaf|werelate|wmflabs|paheal|fallingrain|isni-url.oclc|blekko}
Processed datasets:
CC-2014-42, CC-2014-49, CC-2014-52, CC-2015-06, CC-2015-11, CC-2015-14, CC-2015-18, CC-2015-22, CC-2015-27, CC-2015-32, CC-2015-35, CC-2015-40, CC-2015-48, CC-2016-07, CC-2016-18, CC-2016-22, CC-2016-26, CC-2016-30, ClueWeb09, ClueWeb12
Results can be found here:
/mnt/minerva1/nlp/projects/wikilinks/results.{CC|clueweb}
Or on each machine in:
/mnt/data/commoncrawl/{CC}/wikilinks /mnt/data/clueweb/{09|12}/wikilinks
The Wikilinks version suitable for distribution contains the incrementally processed CommonCrawl, ClueWeb and Wikipedia datasets (known together as the Total Dataset) and a subset containing only ambiguous names of people and locations from Wikipedia (the Disambiguation Subset).
The Total Dataset contains the incremental version of deduplication, executed in this order: Wikipedia -> CommonCrawl -> ClueWeb. It is represented by multiple files with the *.result suffix, 22 datasets in total, plus a file containing only the unique links - Total_dataset.referred_entities.
This chapter deals with extracting the subset of ambiguous names of people and locations from the processed CC, ClueWeb and Wikipedia datasets.
File containing ambiguous names of people can be found here:
/mnt/minerva1/nlp/projects/decipher_wikipedia/wikipedia_statistics/person_statistics
File containing ambiguous names of locations can be found here:
/mnt/minerva1/nlp/projects/decipher_wikipedia/wikipedia_statistics/location_statistics/data/
Both files were merged into a single file:
/mnt/minerva1/nlp/projects/wikilinks/disambiguation_subset_extraction/all_ambiguous_names
Extraction is performed by scripts adapted from the Opal Disambiguation project. The script all_servers_extract_context.sh runs the script extract_contexts.sh, which executes the extraction itself, on all machines. The script all_servers_extract_context.sh requires that the following are set manually: project_dir (the directory containing these scripts), output_file_destination (the directory where the extraction results are saved) and ambiguous_names_file (the file containing the ambiguous names).
Changed versions of scripts used for extraction can be found here:
/mnt/minerva1/nlp/projects/wikilinks/disambiguation_subset_extraction/all_servers_extract_contexts.sh /mnt/minerva1/nlp/projects/wikilinks/disambiguation_subset_extraction/extract_contexts.sh
The Disambiguation Subset is split into 3 files:
CommonCrawl.result Clueweb.result Wikipedia.result
Again, these were used to create a single file containing only the unique links:
Disambiguation_subset.referred_entities
Total Dataset | Disambiguation Subset | |
---|---|---|
Number of mentions | 164,512,667 | 10,557,592 |
Referred entities | 4,417,107 | 286,273 |
Size in GB | 52 | 3.1 |
Total Dataset is the complete Wikilinks dataset (CC + ClueWeb + Wikipedia). Disambiguation Subset is the extracted subset containing ambiguous names of people and locations from the Total Dataset. Number of mentions is the total number of wiki links in each dataset. Referred entities is the number of unique links within a dataset.
Total Dataset [1] and Disambiguation Subset [2] can be found here:
[1] /mnt/minerva1/nlp/projects/wikilinks/wikilinks_for_distribution/Total_dataset/*.result [2] /mnt/minerva1/nlp/projects/wikilinks/wikilinks_for_distribution/Disambiguation_subset/*.result
Total Dataset and Disambiguation Subset were archived:
[1] /mnt/minerva1/nlp/projects/wikilinks/wikilinks_for_distribution/Total_dataset.tar.gz [2] /mnt/minerva1/nlp/projects/wikilinks/wikilinks_for_distribution/Disambiguation_subset.tar.gz
Both archives contain a README.txt file with a description of the archive and its files.
The TSV files required for processing the datasets with the Wikilinks tool are located here:
/mnt/minerva1/nlp/projects/wikilinks/program
Script tsv_update.py to keep them up to date can be found here:
/mnt/minerva1/nlp/projects/wikilinks
To increase the efficiency of the Wikilinks tool, full URL addresses were removed from the tsv files and only the titles remain (e.g. the prefix http://en.wikipedia.org/wiki/ is removed and "_" is replaced with " "). wikilinks.py then rebuilds the URL addresses from the names during further processing.
Contains all URL address names of the English Wikipedia, one name per row. Both regular names and redirects are included. Two files from the Decipher Wikipedia project are used for updating it:
/mnt/minerva1/nlp/projects/decipher_wikipedia/redirectsFromWikipedia/wikipedia-regular-pages.tsv /mnt/minerva1/nlp/projects/decipher_wikipedia/redirectsFromWikipedia/wikipedia-redirects.tsv
These two files combined give us the final:
wikipedia-all.tsv
Format example:
Napoleon
Napoleon's Barber
Napoleon's Campaigns in Miniature
Contains only the names of disambiguation pages, one per row. A file from the Wikipedia disambiguation pages project is used for updating it:
/mnt/minerva1/nlp/projects/wikipedia_disambiguation_pages/output/wp_disambiguation_titles.sorted
This gives us the final:
wikipedia-disambiguations.tsv
Format example:
Napoleon at St. Helena
Napoleon (disambiguation)
Napoleone Orsini (disambiguation)
Contains all redirecting URL address names. Each row holds a final address name and the group of all names redirecting to that address, separated by "|".
Format:
final_adress_name\tname_of_redirecting_adress1|name_of_redirecting_adress2|...
A file from the Decipher Wikipedia project is used for updating it:
/mnt/minerva1/nlp/projects/decipher_wikipedia/redirectsFromWikipedia/wikipedia-redir-table.tsv
This gives us the final:
wikipedia-redirects.tsv
Format example:
Napoleon\tNapolean Bonapart|Napolean Bonaparte
Napoleon Dynamite\tNapolean Dynamite
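A small sketch of how such a redirect table can be loaded and used (illustrative only):

def load_redirects(tsv_path):
    # maps each redirecting name to its final page name
    redirect_to_final = {}
    with open(tsv_path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip('\n').split('\t')
            if len(parts) < 2:
                continue
            final, names = parts[0], parts[1]
            for name in names.split('|'):
                if name:
                    redirect_to_final[name] = final
    return redirect_to_final

# load_redirects('wikipedia-redirects.tsv').get('Napolean Bonaparte')  ->  'Napoleon'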
Comparison between datasets CC-2015-32 and CC-2015-40. Both datasets were first processed from the input WARC format and then from the input VERT format; in both cases the Wikilinks tool was launched with the same parameters. Surprisingly, the VERT format produced more wiki links, see the table:
Dataset | CC-2015-32 | CC-2015-32 | CC-2015-40 | CC-2015-40 |
---|---|---|---|---|
File type | WARC | VERT | WARC | VERT |
Total number of found wiki links | 60,488,837 | 87,742,422 | 34,500,684 | 66,292,376 |
Total processing time | 10 hr, 32 min | 4 hr | 14 hr, 49 min | 1 hr, 50 min |
Executed by script:
/mnt/minerva1/nlp/projects/wikilinks/xbolje00/pageTable.py
The script extracts information from an SQL script enriched with the commands to create and fill the table. More can be found here.
Table format:
+--------------------+---------------------+------+-----+----------------+----------------+
| Field              | Type                | Null | Key | Default        | Extra          |
+--------------------+---------------------+------+-----+----------------+----------------+
| page_id            | int(10) unsigned    | NO   | PRI | NULL           | auto_increment |
| page_namespace     | int(11)             | NO   | MUL | NULL           |                |
| page_title         | varbinary(255)      | NO   |     | NULL           |                |
| page_restrictions  | tinyblob            | NO   |     | NULL           |                |
| page_counter       | bigint(20) unsigned | NO   |     | 0              |                |
| page_is_redirect   | tinyint(3) unsigned | NO   | MUL | 0              |                |
| page_is_new        | tinyint(3) unsigned | NO   |     | 0              |                |
| page_random        | double unsigned     | NO   | MUL | NULL           |                |
| page_touched       | binary(14)          | NO   |     |                |                |
| page_latest        | int(10) unsigned    | NO   |     | NULL           |                |
| page_len           | int(10) unsigned    | NO   | MUL | NULL           |                |
| page_content_model | varbinary(32)       | YES  |     | NULL           |                |
| page_links_updated | varbinary(14)       | YES  |     | NULL           |                |
+--------------------+---------------------+------+-----+----------------+----------------+
The columns page_id and page_title are important for further processing. Their values from the INSERT commands are collected during processing and subsequently saved to the output file.
Launch:
./pageTable.py -i {input_file} -o {output_file}
Input SQL script:
/mnt/minerva1/nlp/projects/decipher_wikipedia/redirectsFromWikipedia/enwiki-latest-page.sql
Output file path:
/mnt/minerva1/nlp/projects/wikilinks/xbolje00/wikipedia-pageTable.tsv
Output file format:
page_id\tpage_title
page_id\tpage_title
...
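A hedged sketch of this kind of extraction (not the actual pageTable.py; the regular expression assumes the usual layout of the MediaWiki page-table dump, and the namespace filter is an assumption):

import re

# a tuple in the dump starts with: (page_id, page_namespace, 'page_title', ...
ROW_RE = re.compile(r"\((\d+),(-?\d+),'((?:[^'\\]|\\.)*)'")

def extract_page_table(sql_path, out_path):
    with open(sql_path, encoding='utf-8', errors='ignore') as src, \
         open(out_path, 'w', encoding='utf-8') as dst:
        for line in src:
            if not line.startswith('INSERT INTO'):
                continue
            for page_id, namespace, title in ROW_RE.findall(line):
                if namespace == '0':   # assumption: keep only main-namespace pages
                    dst.write(page_id + '\t' + title + '\n')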
The program gsoc-wikilinks is used to decode the Wikilinks dataset from the Apache Thrift format to the required format. It is launched with 2 parameters specifying the input and output directories. Files in the input directory are read, decoded and saved to the output directory. If an error occurs while reading a file (it cannot be read, it is not in Apache Thrift format, etc.), an exception is thrown and the process continues with the next file. The output has to contain the address offset and the text offset. The text offset is always given, but the address offset has to be calculated, which is only possible when the original HTML document is available.
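One possible way to compute the missing address offset, assuming the link URL appears verbatim in the original HTML near the known text offset (gsoc-wikilinks itself is a Java program; this Python fragment is only an illustration of the idea):

def address_offset(html, url, text_offset):
    # prefer the last occurrence of the URL before the text offset; fall back to any occurrence
    pos = html.rfind(url, 0, text_offset)
    return pos if pos != -1 else html.find(url)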
The program can be found here:
/mnt/minerva1/nlp/projects/wikilinks/gsoc-wikilinks
Run the following command in the program directory:
mvn package
When the build finishes successfully, the following jar archives are created in this directory:
gsoc-wikilinks-1.0-SNAPSHOT.jar
gsoc-wikilinks-1.0-SNAPSHOT-jar-with-dependencies.jar
Launch using:
java -Xmx1g -jar gsoc-wikilinks-1.0-SNAPSHOT-jar-with-dependencies.jar -i "input_directory" -o "output_directory"
input_directory - directory containing the files to be processed
output_directory - directory for the output files
The dataset used can be found here:
/mnt/data-2/nlp/wiki-links/Dataset (with complete webpages)
Output can be found here:
/mnt/data/nlp/projects/wikilinks/result_from_thrift
Number of files | 109 |
Total time | 14:33:31 |
Number of documents | 10,629,486 |
Number of links | 36,166,829 |
Calculated offsets | 32,638,607 |
Missing offsets | 3,528,222 |
Calculated offsets in % | 89% |