The Decipher NER tool was developed within the Decipher project and is used to find and disambiguate entities (e.g. person, artist, location, ...) in a text. It uses a Knowledge Base, which was created by extracting and combining relevant information about entities from many sources such as Wikipedia, Freebase, Geonames or Getty ULAN. The actual search for entities is performed by the figa tool, which is based on finite state machines.
The NER tool is part of secapi, a larger component of the Decipher project. Most of the tool is stored in the git repository minerva1.fit.vutbr.cz/mnt/minerva1/nlp/repositories/decipher/secapi/; this page uses paths relative to that repository. However, some parts (e.g. the extraction of information about entities from various sources) are not in the repository yet. For those special cases, absolute paths to files located on the school servers are given.
To get an idea of what NER actually does, feel free to visit the demo app on the knot24 server. The autocomplete tool, which can efficiently find and suggest entities from the Knowledge Base, is available on the same server.
The Knowledge Base (KB) of the DECIPHER project is stored in TSV format (tab-separated values). The file contains information about entities, one entity per row. The number of columns per line depends on the type of the particular entity. The type of an entity is determined by the prefix of the ID column (first column, outdated) or by the TYPE column (second column), and possibly by the SUBTYPE column (if present, it is the third column). The prefixes are expected to be abolished in the future, so it is preferable to use the TYPE column to identify the type. Subtypes extend the parent type with additional columns (which is why they are numbered with a plus sign). Columns that can contain multiple values are tagged MULTIPLE VALUES; the | (vertical bar) character is used to separate the individual values. A complete overview of the available types, including a description of the relevant columns, can be found below. Columns whose names are written in italics are blank in KB.all and filled in KBstatsMetrics.all.
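As an illustration of the format, a KB row can be parsed as follows. This is only a sketch: the entity values below are made up, and only the first few PERSON columns are listed; the full column layout for each type is given below.

```python
# Sketch: parse one KB line of type PERSON (hypothetical example data).
# Columns are tab-separated; MULTIPLE VALUES columns use "|" as a separator.

PERSON_COLUMNS = [
    "ID", "TYPE", "SUBTYPE", "NAME", "ALIAS", "ROLE", "NATIONALITY",
]  # only the first few PERSON columns are shown here
MULTIVALUE = {"SUBTYPE", "ALIAS", "ROLE", "NATIONALITY"}

def parse_person_row(line):
    values = line.rstrip("\n").split("\t")
    row = {}
    for name, value in zip(PERSON_COLUMNS, values):
        row[name] = value.split("|") if name in MULTIVALUE else value
    return row

# Hypothetical example row (not a real KB entry):
line = "p:123\tperson\tartist\tJane Doe\tJ. Doe|Jane D.\tpainter\tBritish"
row = parse_person_row(line)
print(row["NAME"])    # Jane Doe
print(row["ALIAS"])   # ['J. Doe', 'Jane D.']
```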
The current version of the KB can be downloaded from the athena3 server (alternatively KBstatsMetrics.all), or it is available directly on athena3 in /mnt/data/kb/KB.all.
PERSON (prefix: p)
==========================================
01 ID
02 TYPE
03 SUBTYPE (MULTIPLE VALUES)
04 NAME
05 ALIAS (MULTIPLE VALUES)
06 ROLE (MULTIPLE VALUES)
07 NATIONALITY (MULTIPLE VALUES)
08 DESCRIPTION (MULTIPLE VALUES)
09 DATE OF BIRTH
10 PLACE OF BIRTH
11 DATE OF DEATH
12 PLACE OF DEATH
13 GENDER
14 PERIOD OR MOVEMENT (MULTIPLE VALUES)
15 PLACE LIVED (MULTIPLE VALUES)
16 WIKIPEDIA URL
17 FREEBASE URL
18 DBPEDIA URL
19 IMAGE (MULTIPLE VALUES)
20 WIKI BACKLINKS
21 WIKI HITS
22 WIKI PRIMARY SENSE
23 SCORE WIKI
24 SCORE METRICS
25 CONFIDENCE

ARTIST (PERSON subtype) (prefix: a)
==========================================
+01 ART FORM (MULTIPLE VALUES)
+02 INFLUENCED (MULTIPLE VALUES)
+03 INFLUENCED BY (MULTIPLE VALUES)
+04 ULAN ID
+05 OTHER URL (MULTIPLE VALUES)

LOCATION (prefix: l)
==========================================
01 ID
02 TYPE
03 SUBTYPE (MULTIPLE VALUES)
04 NAME
05 ALTERNATIVE NAME (MULTIPLE VALUES)
06 LATITUDE
07 LONGITUDE
08 FEATURE CODE
09 COUNTRY
10 POPULATION
11 ELEVATION
12 WIKIPEDIA URL
13 DBPEDIA URL
14 FREEBASE URL
15 GEONAMES ID (MULTIPLE VALUES)
16 SETTLEMENT TYPE (MULTIPLE VALUES)
17 TIMEZONE (MULTIPLE VALUES)
18 DESCRIPTION
19 IMAGE (MULTIPLE VALUES)
20 WIKI BACKLINKS
21 WIKI HITS
22 WIKI PRIMARY SENSE
23 SCORE WIKI
24 SCORE METRICS
25 CONFIDENCE

ARTWORK (prefix: w)
==========================================
01 ID
02 TYPE
03 SUBTYPE (MULTIPLE VALUES)
04 NAME
05 ALIAS (MULTIPLE VALUES)
06 DESCRIPTION
07 IMAGE (MULTIPLE VALUES)
08 ARTIST (MULTIPLE VALUES)
09 ART SUBJECT (MULTIPLE VALUES)
10 ART FORM
11 ART GENRE (MULTIPLE VALUES)
12 MEDIA (MULTIPLE VALUES)
13 SUPPORT (MULTIPLE VALUES)
14 LOCATION (MULTIPLE VALUES)
15 DATE BEGUN
16 DATE COMPLETED
17 OWNER (MULTIPLE VALUES)
18 HEIGHT
19 WIDTH
20 DEPTH
21 WIKIPEDIA URL
22 FREEBASE URL
23 DBPEDIA URL
24 PAINTING ALIGNMENT (MULTIPLE VALUES)
25 MOVEMENT
26 WIKI BACKLINKS
27 WIKI HITS
28 WIKI PRIMARY SENSE
29 SCORE WIKI
30 SCORE METRICS
31 CONFIDENCE

MUSEUM (prefix: c)
==========================================
01 ID
02 TYPE
03 SUBTYPE (MULTIPLE VALUES)
04 NAME
05 ALIAS (MULTIPLE VALUES)
06 DESCRIPTION
07 IMAGE (MULTIPLE VALUES)
08 MUSEUM TYPE (MULTIPLE VALUES)
09 ESTABLISHED
10 DIRECTOR
11 VISITORS (MULTIPLE VALUES)
12 CITYTOWN
13 POSTAL CODE
14 STATE PROVINCE REGION
15 STREET ADDRESS
16 LATITUDE
17 LONGITUDE
18 WIKIPEDIA URL
19 FREEBASE URL
20 ULAN ID
21 GEONAMES ID (MULTIPLE VALUES)
22 WIKI BACKLINKS
23 WIKI HITS
24 WIKI PRIMARY SENSE
25 SCORE WIKI
26 SCORE METRICS
27 CONFIDENCE

EVENT (prefix: e)
==========================================
01 ID
02 TYPE
03 NAME
04 ALIAS (MULTIPLE VALUES)
05 DESCRIPTION
06 IMAGE (MULTIPLE VALUES)
07 START DATE
08 END DATE
09 LOCATION (MULTIPLE VALUES)
10 NOTABLE TYPE
11 WIKIPEDIA URL
12 FREEBASE URL
13 WIKI BACKLINKS
14 WIKI HITS
15 WIKI PRIMARY SENSE
16 SCORE WIKI
17 SCORE METRICS
18 CONFIDENCE

VISUAL ART FORM (prefix: f)
==========================================
01 ID
02 TYPE
03 NAME
04 ALIAS (MULTIPLE VALUES)
05 DESCRIPTION
06 IMAGE (MULTIPLE VALUES)
07 WIKIPEDIA URL
08 FREEBASE URL
09 WIKI BACKLINKS
10 WIKI HITS
11 WIKI PRIMARY SENSE
12 SCORE WIKI
13 SCORE METRICS
14 CONFIDENCE

VISUAL ART MEDIUM (prefix: d)
==========================================
01 ID
02 TYPE
03 NAME
04 ALIAS (MULTIPLE VALUES)
05 DESCRIPTION
06 IMAGE (MULTIPLE VALUES)
07 WIKIPEDIA URL
08 FREEBASE URL
09 WIKI BACKLINKS
10 WIKI HITS
11 WIKI PRIMARY SENSE
12 SCORE WIKI
13 SCORE METRICS
14 CONFIDENCE

VISUAL ART GENRE (prefix: g)
==========================================
01 ID
02 TYPE
03 NAME
04 ALIAS (MULTIPLE VALUES)
05 DESCRIPTION
06 IMAGE (MULTIPLE VALUES)
07 WIKIPEDIA URL
08 FREEBASE URL
09 WIKI BACKLINKS
10 WIKI HITS
11 WIKI PRIMARY SENSE
12 SCORE WIKI
13 SCORE METRICS
14 CONFIDENCE

ART PERIOD MOVEMENT (prefix: m)
==========================================
01 ID
02 TYPE
03 NAME
04 ALIAS (MULTIPLE VALUES)
05 DESCRIPTION
06 IMAGE (MULTIPLE VALUES)
07 WIKIPEDIA URL
08 FREEBASE URL
09 WIKI BACKLINKS
10 WIKI HITS
11 WIKI PRIMARY SENSE
12 SCORE WIKI
13 SCORE METRICS
14 CONFIDENCE

NATIONALITY (prefix: n)
==========================================
01 ID
02 TYPE
03 NAME
04 ALIAS (MULTIPLE VALUES)
05 DESCRIPTION
06 IMAGE (MULTIPLE VALUES)
07 SHORT NAME
08 COUNTRY NAME
09 ADJECTIVAL FORM (MULTIPLE VALUES)
10 WIKIPEDIA URL
11 FREEBASE URL
12 WIKI BACKLINKS
13 WIKI HITS
14 WIKI PRIMARY SENSE
15 SCORE WIKI
16 SCORE METRICS
17 CONFIDENCE

MYTHOLOGY (prefix: y)
==========================================
01 ID
02 TYPE
03 NAME
04 ALTERNATIVE NAME (MULTIPLE VALUES)
05 WIKIPEDIA URL
06 IMAGE (MULTIPLE VALUES)
07 DESCRIPTION
08 WIKI BACKLINKS
09 WIKI HITS
10 WIKI PRIMARY SENSE
11 SCORE WIKI
12 SCORE METRICS
13 CONFIDENCE

FAMILY (prefix: i)
==========================================
01 ID
02 TYPE
03 NAME
04 ALTERNATIVE NAME (MULTIPLE VALUES)
05 WIKIPEDIA URL
06 IMAGE (MULTIPLE VALUES)
07 ROLE (MULTIPLE VALUES)
08 NATIONALITY
09 DESCRIPTION
10 MEMBERS (MULTIPLE VALUES)
11 WIKI BACKLINKS
12 WIKI HITS
13 WIKI PRIMARY SENSE
14 SCORE WIKI
15 SCORE METRICS
16 CONFIDENCE

GROUP (prefix: r)
==========================================
01 ID
02 TYPE
03 NAME
04 ALTERNATIVE NAME (MULTIPLE VALUES)
05 WIKIPEDIA URL
06 IMAGE (MULTIPLE VALUES)
07 ROLE (MULTIPLE VALUES)
08 NATIONALITY
09 DESCRIPTION
10 FORMATION
11 HEADQUARTERS
12 WIKI BACKLINKS
13 WIKI HITS
14 WIKI PRIMARY SENSE
15 SCORE WIKI
16 SCORE METRICS
17 CONFIDENCE

OTHER (prefix: o)
==========================================
01 ID
02 TYPE
03 TITLE
04 ALIAS (MULTIPLE VALUES)
05 DESCRIPTION
06 IMAGE (MULTIPLE VALUES)
07 WIKIPEDIA URL
08 WIKI BACKLINKS
09 WIKI HITS
10 WIKI PRIMARY SENSE
11 SCORE WIKI
12 SCORE METRICS
13 CONFIDENCE
The sub-files necessary to generate the KB are listed in this section, for each type separately. If a type is not listed, there are no scripts needed to generate that part of the KB and therefore no git directory for it; instead, the merge happens within the prepare_data script (see below). For each listed type, a git directory is given that contains the scripts and configuration files necessary to create the KB of that type.
PERSON (secapi/NER/KnowBase/persons)
==========================================
/mnt/minerva1/nlp/projects/decipher_freebase/data/freebase.persons
/mnt/minerva1/nlp/projects/decipher_wikipedia/Ludia_Wikipedia/outputs/categories_death_birth_data.csv

ARTIST (secapi/NER/KnowBase/artists)
==========================================
/mnt/minerva1/nlp/datasets/art/artbiogs/artbiogs.tsv
/mnt/minerva1/nlp/datasets/art/bbc/bbc.tsv
/mnt/minerva1/nlp/datasets/art/council/council.tsv
/mnt/minerva1/nlp/datasets/art/distinguishedwomen/distinguishedwomen.tsv
/mnt/minerva1/nlp/datasets/art/artists2artists/artists2artists.tsv
/mnt/minerva1/nlp/datasets/art/the-artists/the-artists.tsv
/mnt/minerva1/nlp/datasets/art/nationalgallery/nationalgallery.tsv
/mnt/minerva1/nlp/datasets/art/wikipaint/final_data/wikipaint_artist.tsv
/mnt/minerva1/nlp/datasets/art/davisart/davisart.tsv
/mnt/minerva1/nlp/datasets/art/apr/apr.tsv
/mnt/minerva1/nlp/datasets/art/rkd/rkd.tsv
/mnt/minerva1/nlp/datasets/art/open/open.tsv
/mnt/minerva1/nlp/projects/decipher_freebase/data/freebase.artists
/mnt/minerva1/nlp/projects/decipher_ner/ULAN/ulan_rel_13/KB.artists
/mnt/minerva1/nlp/projects/decipher_wikipedia/wiki_template/artists_extended
/mnt/minerva1/nlp/datasets/art/artrepublic/artrepublic.tsv
/mnt/minerva1/nlp/datasets/art/biography/biography.tsv
/mnt/minerva1/nlp/datasets/art/englandgallery/englandgallery.tsv
/mnt/minerva1/nlp/datasets/art/infoplease/infoplease.tsv
/mnt/minerva1/nlp/datasets/art/nmwa/nmwa.tsv

LOCATION (secapi/NER/KnowBase/locations)
==========================================
/mnt/minerva1/nlp/projects/decipher_dbpedia/extraction_results/v39/v39_location_finall.tsv
/mnt/minerva1/nlp/projects/decipher_freebase/data/freebase.locations
/mnt/minerva1/nlp/projects/decipher_geonames/geonames.locations

ARTWORK (secapi/NER/KnowBase/artworks)
==========================================
/mnt/minerva1/nlp/projects/decipher_dbpedia/extraction_results/v39/v39_artwork_finall.tsv
/mnt/minerva1/nlp/projects/decipher_freebase/data/freebase.artworks

MUSEUM (secapi/NER/KnowBase/museums)
==========================================
/mnt/minerva1/nlp/projects/decipher_dbpedia/extraction_results/v39/v39_museum_finall.tsv
/mnt/minerva1/nlp/projects/decipher_freebase/data/freebase.museums
/mnt/minerva1/nlp/projects/decipher_ner/ULAN/ulan_rel_13/KB.corporations
/mnt/minerva1/nlp/projects/decipher_geonames/geonames.museums

MYTHOLOGY (secapi/NER/KnowBase/mythology)
==========================================
/mnt/minerva1/nlp/projects/extrakce_z_wikipedie/xgraca00/MythologyKB.txt

FAMILY (secapi/NER/KnowBase/artist_families)
==========================================
/mnt/minerva1/nlp/projects/extrakce_z_wikipedie/xdosta40/final/finalFamilies.xml

GROUP (secapi/NER/KnowBase/artist_group_or_collective)
==========================================
/mnt/minerva1/nlp/projects/extrakce_z_wikipedie/xdosta40/final/finalGroupsAndCollectives.xml

OTHER (secapi/NER/KnowBase/other)
==========================================
/mnt/minerva1/nlp/projects/wikify/wikipedia/data/new_KB/KB.all

Other sub-KBs which are integrated into the final KB (see the script secapi/NER/prepare_data)
==========================================
/mnt/minerva1/nlp/projects/decipher_ner/KnowBase/PERSONS
/mnt/minerva1/nlp/projects/decipher_ner/KnowBase/ARTISTS
/mnt/minerva1/nlp/projects/decipher_ner/KnowBase/LOCATIONS
/mnt/minerva1/nlp/projects/decipher_ner/KnowBase/ARTWORKS
/mnt/minerva1/nlp/projects/decipher_ner/KnowBase/MUSEUMS
/mnt/minerva1/nlp/projects/decipher_ner/KnowBase/MYTHOLOGY
/mnt/minerva1/nlp/projects/decipher_ner/KnowBase/ARTIST_FAMILIES
/mnt/minerva1/nlp/projects/decipher_ner/KnowBase/ARTIST_GROUP_OR_COLLECTIVE
/mnt/minerva1/nlp/projects/decipher_freebase/data/freebase.events
/mnt/minerva1/nlp/projects/decipher_freebase/data/freebase.visual_art_forms
/mnt/minerva1/nlp/projects/decipher_freebase/data/freebase.visual_art_mediums
/mnt/minerva1/nlp/projects/decipher_freebase/data/freebase.art_period_movements
/mnt/minerva1/nlp/projects/decipher_freebase/data/freebase.visual_art_genres
/mnt/minerva1/nlp/projects/decipher_freebase/data/freebase.nationalities
/mnt/minerva1/nlp/projects/ie_foreign_wikipedia/xbajan01/wiki/output/es_only_data.tsv
/mnt/minerva1/nlp/projects/ie_foreign_wikipedia/xklima22/wiki/outputs/de-wiki_only_people_data.tsv

File containing statistics
==========================================
/mnt/minerva1/nlp/projects/decipher_wikipedia/wikipedia_statistics/wikipedia_statistics_2014-12-09.tsv
The file HEAD-KB in /mnt/minerva1/nlp/repositories/decipher/secapi/ (consider this the working path from now on) serves as a configuration file for loading the KB. It contains a complete specification of all types and columns in the KB (see the chapter Knowledge Base). Each column has its type and, where applicable, prefixes. The individual columns of the header are separated by tabs so that they line up with the data columns. It is used by the script NER/metrics_knowledge_base.py when generating the KB and the file KB-HEAD.all.
KB header syntax:
<KB header>     ::= <row> "\n" | <row> <KB header>
<row>           ::= <first_column> "\n" | <first_column> "\t" <other_columns> "\n"
<first_column>  ::= "<" <type_name> ">" <other_columns> | "<" <type_name> ":" <subtype_name> ">" <other_columns>
<other_columns> ::= <column> | <column> "\t" <other_columns>
<column>        ::= <column_name> | "{" <flags> "}" <column_name> | "{[" <prefix_of_value> "]}" <column_name> | "{" <flags> "[" <prefix_of_value> "]}" <column_name>
where:
type_name - string specifying a type (see Knowledge Base)
subtype_name - string specifying a subtype
flags - characters specifying the data type (s - string, ...), multiple values in the column (m - multivalue) and identifier (i - identifier)
prefix_of_value - string that will be prepended to the data in this column
column_name - string containing the column name

Example of the first column in the header: <location>ID
Example of the first column in the data: l:21931315183
Example of another column in the header: {iu[http://en.wikipedia.org/]}WIKIPEDIA URL
Example of another column in the data: wiki/city_of_london
A regular expression to extract the individual items in Python (the placeholder strings in the example have been translated to English):

PARSER = re.compile(r"""(?ux)
        ^
        (?:<(?P<TYPE>[^:>]+)(?:[:](?P<SUBTYPE>[^>]+))?>)?
        (?:\{(?P<FLAGS>(?:\w|[ ])*)(?:\[(?P<PREFIX_OF_VALUE>[^\]]+)\])?\})?
        (?P<NAME>(?:\w|[ ])+)
        $
""")

column = u"<type name:subtype name>{flags[value prefix]}column name"
PARSER.search(column).groupdict()
# {'TYPE': u'type name', 'SUBTYPE': u'subtype name', 'FLAGS': u'flags',
#  'PREFIX_OF_VALUE': u'value prefix', 'NAME': u'column name'}
PARSER.search(column).group("TYPE")
# u'type name'
The task was to create a program in C that loads the KB as a string into shared memory. The program and its libraries are stored in the git repository /mnt/minerva1/nlp/repositories/decipher/secapi/ (consider this the working path from now on). For this purpose the KB must contain a header that gives meaning to the individual columns; the header is separated from the data by one empty line.
Three variants have been created:
The first two variants were created for the original task, and it was necessary to determine which of them is faster; the second one proved to be faster.
Each variant has its own dynamic library libKB_shm.so and a Python module KB_shm.py that builds on the library. Nowadays the Python extension is maintained only for the second variant, which is the one known to be in use.
The daemon loads the KB, and optionally also the namelist, into shared memory (SHM). A copy of this memory is saved to disk next to the KB, with the same name as the KB and the suffix .bin. This copy is used to speed up the next load and should not be transferred between different architectures. Changes of the KB and the namelist are monitored: if they are newer than the SHM copies, they are reloaded and the copies are re-created.
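The freshness check described above can be sketched in Python (an illustration only; the actual daemon is implemented in C, and the `.bin` naming follows the description above):

```python
import os

def bin_copy_is_fresh(kb_path):
    """Return True if the on-disk SHM copy (<kb_path>.bin) exists and is
    at least as new as the KB file, so it can be reused instead of
    re-parsing the KB; otherwise the copy must be re-created."""
    bin_path = kb_path + ".bin"
    if not os.path.exists(bin_path):
        return False
    return os.path.getmtime(bin_path) >= os.path.getmtime(kb_path)
```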
Once the daemon has loaded the data into SHM, it prints "./decipherKB-daemon: Waiting for signal..." on stdout. At this stage it waits for SIGTERM, SIGINT or SIGQUIT; when one of these three signals is received, it deletes the loaded KB from SHM and terminates.
The two previously mentioned files, libKB_shm.so and KB_shm.py, are used to work with the data in SHM. Their descriptions can be found in libKB_shm.h and KB_shm.py respectively, so they are omitted here.
1. variant:
decipherKB-daemon [{path_to_KB | -b path_to_KB_bin}]
2. variant:
decipherKB-daemon [-s SHM_NAME] [{path_to_KB | -b path_to_KB_bin}]
3. variant:
decipherKB-daemon [{path_to_KB path_to_namelist | -b path_to_KB_bin}]
path_to_KB - path to the KB (default "./KB-HEAD.all")
path_to_namelist - path to the namelist (default "./namelist.ASCII")
-b path_to_KB_bin - loads an SHM copy
-s SHM_NAME - name of the shared-memory object (default "/decipherKB-daemon_shm")

The tool for detection and disambiguation of named entities is implemented in the script ner.py. This chapter contains information only about how to launch it; further information about how it works can be found here.
The ner.py script uses the KB, which is loaded into shared memory by SharedKB.
The current version is available in git in the branch D114-NER (replace username with your login):

git clone ssh://username@minerva1.fit.vutbr.cz/mnt/minerva1/nlp/repositories/decipher/secapi/
git checkout -b D114-NER origin/D114-NER
It is necessary to perform the following sequence of commands before attempting a launch:
./downloadKB.sh
make
Make sure that old versions of the KB and the automata (*.fsa) obtained by downloadKB.sh are not left in the directories secapi/NER and secapi/NER/figa during launch; it is advised to delete them beforehand using the deleteKB.sh script.
The tool works with a knowledge base extended by columns containing statistical data from Wikipedia and a precomputed disambiguation score. Entity search in a text and disambiguation are provided by the script:
secapi/NER/ner.py
Usage:
ner.py [-h] [-a | -s] [-d] [-f FILE] [-r] [-l]
-h, --help - prints help and terminates
-a, --all - prints all entities from the input without disambiguation
-s, --score - prints every possible meaning and its score for each entity in the text
-d, --daemon-mode - "daemon mode" (see below)
-f FILE, --file FILE - uses the specified file as input
-r, --remove-accent - removes accents from the input
-l, --lowercase - converts the input to lowercase and uses a special automaton containing only lowercase letters

It is also possible to read input from standard input (use redirection).
Test texts for ner.py can be found in the directory:
secapi/NER/data/input
Daemon mode, activated with the -d parameter, allows processing multiple texts with a single instance. Text is expected on standard input and is terminated by one of the following commands on a separate line:
NER_NEW_FILE - prints found entities with disambiguation
NER_NEW_FILE_ALL - prints found entities without disambiguation
NER_NEW_FILE_SCORE - prints found entities without disambiguation, including scores for each entity
After one of these commands is entered, the tool prints the list of entities found in the input text, terminates the output with the same command, and expects another text on input. To process the input text received since the last command and terminate the program, use one of the following commands on a separate line:
NER_END - ends the program and prints found entities with disambiguation
NER_END_ALL - ends the program and prints found entities without disambiguation
NER_END_SCORE - ends the program and prints found entities without disambiguation, including scores for each entity
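The daemon-mode protocol above can be driven from another program. Below is a minimal Python sketch; the command line used to launch ner.py (and the name run_ner_daemon) are assumptions, not part of the tool itself:

```python
import subprocess

def run_ner_daemon(texts, cmd=("python", "ner.py", "-d")):
    """Send several texts to one ner.py instance running in daemon mode.
    The texts are separated by the NER_NEW_FILE command and the session is
    closed with NER_END; the combined stdout is returned. The default cmd
    is an assumption about how ner.py is started locally."""
    payload = "NER_NEW_FILE\n".join(t + "\n" for t in texts) + "NER_END\n"
    result = subprocess.run(list(cmd), input=payload,
                            capture_output=True, universal_newlines=True)
    return result.stdout

# Example (requires the KB and automata to be set up, see above):
# print(run_ner_daemon(["Leonardo da Vinci painted the Mona Lisa."]))
```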
The tool prints the list of found entities on standard output in the order in which they occur in the input text. Each entity occupies one line, with columns separated by tabs.
Output lines are in the following format:
BEGIN_OFFSET END_OFFSET TYPE TEXT OTHER
BEGIN_OFFSET and END_OFFSET give the positions of the beginning and the end of the entity in the text.
TYPE specifies the type of the entity: kb for a knowledge base item, date and interval for dates and intervals, coref for a coreference by a pronoun or by a part of a person's name.
TEXT contains the textual form of the entity exactly as it appears in the input text.
For the types kb and coref, OTHER is a list of the numbers of the corresponding rows in the knowledge base, separated by the character ";". If disambiguation is enabled, only the single most likely row is selected. When the script is run with the -s flag, pairs of row number and entity score are displayed, separated from each other by semicolons. For the types date and interval, OTHER contains the data in standardized ISO format.
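A sketch of parsing one output line in this format follows (the example line and its row numbers are made up; the -s score pairs are not handled here):

```python
# Sketch: parse one output line of ner.py (tab-separated columns).
def parse_ner_line(line):
    begin, end, etype, text, other = line.rstrip("\n").split("\t", 4)
    entry = {"begin": int(begin), "end": int(end),
             "type": etype, "text": text}
    if etype in ("kb", "coref"):
        # OTHER holds KB row numbers separated by ";"
        entry["rows"] = [int(r) for r in other.split(";")]
    else:
        entry["value"] = other  # ISO date or interval
    return entry

# Hypothetical example line (row numbers are made up):
example = "10\t26\tkb\tLeonardo da Vinci\t1234;5678"
print(parse_ner_line(example)["rows"])   # [1234, 5678]
```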
The finite state machine figa finds various types of entities (person, location, artwork, museum, event, art form, art medium, art period movement, art genre, nationality) from the KB, but their textual form may correspond to multiple meanings. The task is to disambiguate the found entities, i.e. to choose the one among the possible meanings that most likely corresponds to reality.
For every possible meaning of an entity in the text a numerical score is calculated, and the meaning with the highest score is selected as the result of the disambiguation. The final score is the sum of a static component and a contextual component.
The static score represents the significance of the particular knowledge base item. It is calculated from statistical data about the corresponding Wikipedia article: the number of backlinks, the number of article visits, and an indication of whether the article is the primary sense of the keyword. If these are not available, other metrics of the knowledge base item are used. When the Wikipedia statistics are used, a sub-score in the range 0 to 100 is calculated for each component; the partial values are then averaged evenly into the final score.
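The even averaging of the sub-scores can be sketched as follows (the normalisation of the raw statistics into the 0-100 range is an assumption and is not shown; only the averaging step is described in the text):

```python
# Sketch of the even averaging described above: each component is first
# mapped to a sub-score in the range 0-100, and the sub-scores are then
# averaged with equal weight to form the static score.
def static_score(subscores):
    """subscores: list of component scores, each already in the range 0-100."""
    return sum(subscores) / float(len(subscores))

print(static_score([80, 60, 100]))  # 80.0
```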
In order to determine the meaning of entities in the text that correspond only to a part of a person's name (first name or last name), additional possible meanings beyond those found by figa are looked up for each entity.
Before the disambiguation itself, the columns DISPLAY TERM and PREFERRED TERM and every value of OTHER TERMS of each person in the knowledge base are split into individual names (originally separated by spaces), and the relevant knowledge base row is recorded for each of these names. The result of this process is a dictionary of all name parts occurring in the knowledge base, which assigns to each name the set of knowledge base rows in which that name is used.
Each entity found in the text is split into words; for each word the relevant set of knowledge base rows is looked up, and the intersection of these sets is calculated. This yields the people whose names include all the words found within the entity in the text. These meanings are added to those found by figa.
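The two steps above (building the name-part dictionary, then intersecting row sets) can be sketched like this; the names and row numbers below are made up:

```python
# Sketch of the name-part lookup: build a dictionary mapping each single
# name to the set of KB rows where it occurs, then intersect the sets for
# all words of an entity found in the text.
def build_name_index(rows):
    """rows: iterable of (row_number, full_name) pairs."""
    index = {}
    for row, full_name in rows:
        for part in full_name.split():
            index.setdefault(part, set()).add(row)
    return index

def candidate_rows(entity_text, index):
    sets = [index.get(word, set()) for word in entity_text.split()]
    return set.intersection(*sets) if sets else set()

index = build_name_index([(1, "Vincent van Gogh"), (2, "Theo van Gogh")])
print(candidate_rows("van Gogh", index))      # {1, 2}
print(candidate_rows("Vincent Gogh", index))  # {1}
```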
Contextual disambiguation adds criteria that compare the candidate meaning of an entity with the meaning of the rest of the document.
After the first disambiguation iteration, each location is assigned the region it belongs to. A value representing the number of locations belonging to each region is then calculated, and this value is used as part of the meaning score during the second iteration of the disambiguation.
The occurrences of individual persons are counted after the first disambiguation iteration; this value is again used in the second iteration of the disambiguation.
The tool marks entities from the knowledge base, dates and also English pronouns, and then attempts to determine what they refer to. The last found entity of the matching grammatical gender is taken as the meaning of the pronoun. The pronouns he, him, his and himself correspond to males; she, her, hers and herself to females; who, whom and whose to either gender. Furthermore, the pronouns here, there and where are handled.
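The pronoun rule above can be sketched as follows (the entity data is made up, and the genders of the entities are assumed to be known from the KB):

```python
# Sketch of the pronoun rule: a pronoun refers to the last previously
# seen entity of a matching grammatical gender.
MALE = {"he", "him", "his", "himself"}
FEMALE = {"she", "her", "hers", "herself"}
EITHER = {"who", "whom", "whose"}

def resolve_pronoun(pronoun, preceding_entities):
    """preceding_entities: list of (name, gender) pairs in text order,
    with gender 'M' or 'F'."""
    p = pronoun.lower()
    for name, gender in reversed(preceding_entities):
        if ((p in MALE and gender == "M")
                or (p in FEMALE and gender == "F")
                or p in EITHER):
            return name
    return None

ents = [("Frida Kahlo", "F"), ("Diego Rivera", "M")]
print(resolve_pronoun("she", ents))  # Frida Kahlo
```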
The finite state machine figa and the script dates.py can produce multiple results for one place in the text. A date can also be used as the name of a KB item; in that case the date is preferred. When figa finds multiple overlapping entities, the longest one is preferred.
If the closest word before the beginning or after the end of an entity starts with a capital letter, the tagged text is very likely part of a longer name and its meaning may not correspond to the meaning of the marked section; such entities are therefore not listed. The exceptions are cases where the adjacent words are separated by punctuation, or where the capital letter marks the start of a sentence (the previous word ends with a period, question mark or exclamation mark).
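The overlap rules described above (dates win over same-length KB matches, longer spans win over shorter overlapping ones) can be sketched like this; the tie-breaking details are an assumption based on the description:

```python
# Sketch of the overlap resolution: among overlapping matches the longest
# span is kept, and on equal spans a 'date' match is preferred over 'kb'.
def resolve_overlaps(matches):
    """matches: list of (begin, end, kind) with kind 'date' or 'kb'."""
    # Sort so that preferred matches come first: longer spans, then dates.
    order = sorted(matches, key=lambda m: (-(m[1] - m[0]), m[2] != "date"))
    kept = []
    for b, e, kind in order:
        if all(e <= kb or b >= ke for kb, ke, _ in kept):  # no overlap
            kept.append((b, e, kind))
    return sorted(kept)

ms = [(0, 4, "kb"), (0, 4, "date"), (5, 20, "kb"), (5, 12, "kb")]
print(resolve_overlaps(ms))  # [(0, 4, 'date'), (5, 20, 'kb')]
```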
Please note that the following measurements are outdated.
The text processing speed of the tool (searching for entities identifying people, places and works of art, disambiguating their meaning, and locating dates and time intervals in a text) was measured on the athena1 server; each measurement was repeated three times.
The test data consist of entity names (1 name = 1 line), in quantities ranging from 10,000 to 10,000,000.
The test data can be found here:
secapi/NER/data/performance
Initialization of the tool takes 4.902 s; it is included in the following times.
Performance characteristics of NER - without disambiguation
Number of entities on input | Input size | Processing time
---|---|---
10,000 | 146 kB | 5.6 s
100,000 | 1.5 MB | 13.221 s
1,000,000 | 15.8 MB | 89.282 s
10,000,000 | 152.8 MB | 815.583 s
Performance characteristics of NER - with disambiguation
Number of entities on input | Input size | Processing time
---|---|---
10,000 | 146 kB | 5.822 s
100,000 | 1.5 MB | 18.197 s
1,000,000 | 15.8 MB | 240.959 s
10,000,000 | 152.8 MB | 2,533.679 s
The partial KBs PERSONS, ARTISTS, ARTWORKS, LOCATIONS, MUSEUMS, MYTHOLOGY, ARTIST_FAMILIES, ARTIST_GROUP_OR_COLLECTIVE and OTHER are created in the directory:
secapi/NER/KnowBase
The individual KBs are created by the script start.sh, which launches the start.sh scripts of the individual types in their respective subdirectories (artworks, locations, museums, persons, mythology, artist_families and artist_group_or_collective), so it is possible to re-generate one specific type. The creation of a partial KB is performed by the kb_compare.py script (see the matching section below). This step also assigns alternative names from JRC-Names to the types ARTIST and PERSON.
More useful scripts can be found in the directory secapi/NER/KnowBase. The script backupKB.sh backs up the input files for the creation of the partial KBs to /mnt/data-in/knot/iotrusina/KB_data_backups. The script copyKB.sh uploads newly created partial KBs to /mnt/minerva1/nlp/projects/decipher_ner/KnowBase. The script deleteKB.sh deletes the created partial KBs. The scripts copyKB.sh and deleteKB.sh are launched within the script secapi/NER/start.sh.
The partial KB OTHER is based on the KB.all of the Wikify project:
/mnt/minerva1/nlp/projects/wikify/wikipedia/data/new_KB/KB.all
Names that are already used in the previously created partial KBs are filtered out: the script loads the entity names from the partial KBs PERSONS, ARTISTS, etc., and removes these names from the KB.all of the Wikify project. The entities that remain, i.e. those not used in any of the partial KBs, form the basis of the new partial KB OTHER. OTHER must therefore be created as the last partial KB (and whenever a new partial KB is added, it has to be added to the script secapi/NER/KnowBase/other/kb_filter_entity_out.py).
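The filtering step can be sketched as follows. This is only an illustration: the real logic lives in secapi/NER/KnowBase/other/kb_filter_entity_out.py, and the column layout (name in the third column) and the example rows are assumptions made up for the sketch:

```python
# Sketch of the OTHER filtering: collect entity names already present in
# the partial KBs and keep only those Wikify KB.all lines whose name is
# not among them. The name_column index is a made-up assumption.
def filter_other(wikify_lines, partial_kb_lines, name_column=2):
    used = {line.split("\t")[name_column] for line in partial_kb_lines}
    return [line for line in wikify_lines
            if line.split("\t")[name_column] not in used]

partial = ["p:1\tperson\tClaude Monet"]
wikify = ["o:7\tother\tClaude Monet", "o:8\tother\tImpressionism"]
print(filter_other(wikify, partial))  # ['o:8\tother\tImpressionism']
```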
The Knowledge Base KB.all is created by merging the partial KBs with data from Freebase and from some foreign-language Wikipedias.
This merge is done by the script:
secapi/NER/prepare_data
With the -i parameter, missing images are downloaded at the same time into the image database /mnt/athena3/kb/images (Wikimedia images only at the moment).
The newly created Knowledge Base is located in the file:
secapi/NER/KB.all
The Knowledge Base KBstatsMetrics.all is basically the original KB.all expanded by several columns with statistics. It is also created by the prepare_data script, in which the relevant statistics are added to each row by the scripts wiki_stats_to_KB.py and metrics_to_KB.py.
Compared to the KB.all format described above, each line is expanded by six columns: the first three contain statistics of the relevant Wikipedia article (backlinks, hits, primary sense); the fourth holds the disambiguation score calculated from the previous three columns; the fifth contains a disambiguation score calculated from metrics such as length, the number of filled columns in the KB, or the number of inhabitants of a location; and the sixth contains the confidence score, which combines all of the previous values into one.
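Splitting a KBstatsMetrics.all line into the original columns and the six appended ones can be sketched like this (the example row and its base columns are made up for illustration):

```python
# Sketch: split a KBstatsMetrics.all line into the original KB.all columns
# and the six appended statistics columns described above.
def split_stats(line):
    cols = line.rstrip("\n").split("\t")
    base, stats = cols[:-6], cols[-6:]
    names = ["backlinks", "hits", "primary_sense",
             "score_wiki", "score_metrics", "confidence"]
    return base, dict(zip(names, stats))

# Hypothetical row: three base columns plus the six statistics columns.
line = "p:1\tperson\tJane Doe\t120\t4500\t1\t73.5\t60.0\t68.2"
base, stats = split_stats(line)
print(stats["confidence"])  # 68.2
```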
The Knowledge Base created this way is located in the file:
secapi/NER/KBstatsMetrics.all
The ner.py tool uses the figa tool, which recognizes entities in a text using several dictionaries. A description of figa and the dictionaries can be found on the Ner4 project page; this chapter describes only their creation. (Previously, the finite state machines described on the Decipher fsa project page were used for the same purpose.)
The dictionaries for NER and for autocomplete are created by the scripts create_cedar.sh and create_cedar_autocomplete.sh. These scripts work with the file KBstatsMetrics.all, from which a list of names is extracted and then passed to the figav1.0 tool, which creates the actual finite state machines.
secapi/NER/figa/make_automat/create_cedar.sh
Usage:
create_cedar.sh [-h] [-l|-u] [-c|-d] --knowledge-base=KBstatsMetrics.all
Required arguments:

-k KB, --knowledge-base=KB - path to KBstatsMetrics.all

Optional arguments:

-h, --help - prints help and terminates
-l, --lowercase - converts names to lowercase
-u, --uri - generates a list of URIs
-c, --cedar (default) - generates CEDAR dictionaries (.ct suffix)
-d, --darts - generates DARTS dictionaries (.dct suffix)

The script create_cedar.sh generates the following dictionaries:
automata.[ct|dct] - basic automaton for ner.py
automata-lower.[ct|dct] - automaton used for the lowercase variant of recognition
automata-uri.[ct|dct] - automaton for URIs
The script create_cedar_autocomplete.sh generates the following dictionaries:
art_period_movement_automata.[ct|dct] - dictionary for autocomplete (type ART PERIOD MOVEMENT)
artwork_automata.[ct|dct] - dictionary for autocomplete (type ARTWORK)
event_automata.[ct|dct] - dictionary for autocomplete (type EVENT)
family_automata.[ct|dct] - dictionary for autocomplete (type FAMILY)
group_automata.[ct|dct] - dictionary for autocomplete (type GROUP)
location_automata.[ct|dct] - dictionary for autocomplete (type LOCATION)
museum_automata.[ct|dct] - dictionary for autocomplete (type MUSEUM)
mythology_automata.[ct|dct] - dictionary for autocomplete (type MYTHOLOGY)
nationality_automata.[ct|dct] - dictionary for autocomplete (type NATIONALITY)
person_automata.[ct|dct] - dictionary for autocomplete (type PERSON with subtype ARTIST)
visual_art_form_automata.[ct|dct] - dictionary for autocomplete (type VISUAL ART FORM)
visual_art_genre_automata.[ct|dct] - dictionary for autocomplete (type VISUAL ART GENRE)
visual_art_medium_automata.[ct|dct] - dictionary for autocomplete (type VISUAL ART MEDIUM)
x_automata.[ct|dct] - dictionary for autocomplete (all types together)
Both of these scripts are launched within the script secapi/NER/start.sh.
The process of creating the Knowledge Base and all the necessary automata (including those for autocomplete) is automated and can be launched using the following script:
secapi/NER/start.sh
This script gradually creates the partial KBs, merges them into KB.all, creates KBstatsMetrics.all, creates the automata for NER and autocomplete, and can upload them (parameter -u or --upload) to the athena3 server to the location athena3:/mnt/data/kb, from where they can be easily downloaded.
In addition to script start.sh
, some other useful scripts are found on git in the NER directory. Script deleteKB.sh
deletes all created KBs and machines. Script uploadKB.sh
uploads all created KBs and machines to athena3:/mnt/data/kb
. Script downloadKB.sh
downloads the latest stable version of KB and machines from the athena3 server. Script TestAndRunStart.sh
performs a test of all the prerequisites necessary to generate the finite state machines and the KB, and can run their generation in a managed way (described below), reporting a list of errors that occurred during generation.
The process of creating KBs and machines should only be performed by authorized people. Students are forbidden to run the start.sh
script with the -u or --upload parameter because they could cause the NER tool to malfunction.
If someone uploads a malfunctioning version of KB or machines to athena3 server, it is possible to manually return to the last functional version. Individual versions are located in athena3:/mnt/data/kb/kb
and are numbered by Unix timestamp (e.g. 1413053397). To roll back, copy all the files from the directory of the particular version to the location /mnt/data/kb/. For example, to return to version 1413053397, use the following command:
cp /mnt/data/kb/kb/1413053397/* /mnt/data/kb/.
The aim was to match entities from two Knowledge Bases and create a new KB from the matched entities. The script works on "any" KB composed of one type of entity (person or location, possibly others), according to the configuration files. Matching is implemented in the kb_compare.py
script, which is available on git:
secapi/NER/KnowBase/kb_compare.py
Script kb_compare.py
can now also deduplicate its own KBs using the --deduplicate_kb1
or --deduplicate_kb2
attributes. This function can be used separately using script kb_dedup.py
available on git:
secapi/NER/KnowBase/kb_dedup.py
Usage:
kb_compare.py [-h] --first FIRST --second SECOND [--first_fields FIRST_FIELDS] [--second_fields SECOND_FIELDS] --rel_conf REL_CONF [--output_conf OUTPUT_CONF] [--other_output_conf OTHER_OUTPUT_CONF] [--first_sep FIRST_SEP] [--second_sep SECOND_SEP] [--id_prefix ID_PREFIX] [--output OUTPUT]
Optional arguments:
-h, --help - show this help message and exit
--first FIRST - filename of the first KB (also used as a prefix for config files)
--second SECOND - filename of the second KB (also used as a prefix for config files)
--first_fields FIRST_FIELDS - filename of the first KB fields list (default: '(--first option).fields')
--second_fields SECOND_FIELDS - filename of the second KB fields list (default: '(--second option).fields')
--rel_conf REL_CONF - filename of a relationships config
--output_conf OUTPUT_CONF - filename of an output format
--other_output_conf OTHER_OUTPUT_CONF - filename of an output format for unmatched entities
--first_sep FIRST_SEP - first multiple value separator (default: '|')
--second_sep SECOND_SEP - second multiple value separator (default: '|')
--id_prefix ID_PREFIX - prefix for ids
--deduplicate_kb1 - deduplicate KB1 (default: False)
--deduplicate_kb2 - deduplicate KB2 (default: False)
--id_fields ID_FIELD [ID_FIELD ...] - names of fields with unique ids for deduplication (default: ['WIKIPEDIA URL', 'FREEBASE URL', 'DBPEDIA URL', 'ULAN ID', 'GEONAMES ID'])
--output OUTPUT - filename of the output
--second_output SECOND_OUTPUT - filename of the output for the remaining unmatched entities from the second KB
--treshold TRESHOLD - matching threshold

--first_fields and --second_fields – Contain the names of the fields of the particular KBs, separated by newline characters.

--output_conf – A configuration file describing the format of the newly created KB (created from UNIQUE and AMBIGUOUS_OK records). It contains entries of the form name_kb.name_field, separated by newline characters. Empty fields from the first KB can be filled in from the second KB according to --rel_conf, and vice versa.

--other_output_conf – Because a KB needs to be created from the data that were not matched as well, there is a configuration file similar to --output_conf. This file always contains fields with the prefix given by --first. The special field None creates a blank field; it is used to preserve the consistency of the KB format (that is, if some fields from --second do not exist in --first, None is used instead).

--first_sep and --second_sep – Some data sources do not follow the convention of separating MULTIPLE VALUES with the '|' character. For these cases, the separator can be set with these switches.

--id_prefix – The files --output_conf and --other_output_conf may also contain the type ID. An ID is generated within the script; its prefix is set using --id_prefix. For instance, --id_prefix='p' means that all IDs will start with the string "p:" (conventionally used for person-type entities).

--rel_conf – These files describe the relations between the two compared databases. Lines defining individual relations must begin with a tab character. The configuration file may contain the labels UNIQUE, NAME and OTHER (see the rel_conf syntax below).

--treshold – The threshold an entity must reach in order to be matched. Its value was determined experimentally so that the script provides the best possible results; it depends on the number of fields in the KB and the amount of information in them (e.g. for locations it is set to 4). Because of this, the parameter also had to be added to the shell script start.sh. The --treshold parameter is required.

--second_output=filename – Lists entities from KB2 that have not been assigned to any entity from KB1 into a separate file. When this parameter is used, these entities no longer appear in the file specified by --output! The script is used with this parameter to compare the already generated KBs of types ARTISTS and PERSONS.

--deduplicate_kb1 and --deduplicate_kb2 – Enable deduplication of KB1 and KB2, respectively, according to the columns with identifiers. These columns can be changed using the --id_fields attribute.

Launch example:

./kb_compare.py --first=DBPEDIA --second=GEONAMES --rel_conf=dbpedia_geonames_rel.conf --output_conf=DG_output.conf --output=DG --id_prefix=l --other_output_conf=DG_other_output.conf --treshold=4
Usage:
kb_dedup.py [-h] --kb KB [--kb_fields KB_FIELDS] [--kb_sep KB_SEP] [--id_fields ID_FIELDS [ID_FIELDS ...]] --output OUTPUT
Removes duplicate entries from the Knowledge Base.
Optional arguments:
-h, --help - show this help message and exit
--kb KB - filename of the KB (also used as a prefix for config files)
--kb_fields KB_FIELDS - filename of the KB fields list (default: '(--kb option).fields')
--kb_sep KB_SEP - multiple value separator (default: '|')
--id_fields ID_FIELD [ID_FIELD ...] - names of fields with unique ids for deduplication (default: ['WIKIPEDIA URL', 'FREEBASE URL', 'DBPEDIA URL', 'ULAN ID', 'GEONAMES ID'])
--output OUTPUT - filename of the output

--kb_fields – path to a file containing the names of the columns of the particular KB, separated by newline characters
--kb_sep – specifies the character separating values in columns flagged "MULTIPLE VALUES"
--id_fields – deduplication is based on the columns with identifiers, which can be changed using this attribute

Syntax (BNF):
<file> ::= <row> | <row> <file>
<row> ::= <column source> "\n"
<column source> ::= "ID" | "None" | '"' <column content> '"' | <column from kb>
<column from kb> ::= <name of kb> "." <name of column>
<name of kb> ::= <name of kb1> | <name of kb2>
where:
ID – generates an identifier consisting of the prefix specified by the parameter --id_prefix and a hexadecimal sha224 hash of the current counter value in the output column
None – generates a blank output column
<column content> – fills the column with this string
<column from kb> – determines which column is used for the output; one value is used, or all values from the assigned columns according to the relations given in *_rel.conf
<name of column> – is taken from the *.fields file of the corresponding KB <name of kb>, without the possible "(MULTIPLE VALUES)" flag
<name of kb1> – specified by the parameter --first
<name of kb2> – specified by the parameter --second

Example:

ID
WF.TYPE
WF.SUBTYPE
ULAN.DISPLAY TERM
WF.ALIAS
WF.PROFESSION
WF.NATIONALITY
WF.DESCRIPTION
ULAN.DATE OF BIRTH
ULAN.PLACE OF BIRTH
ULAN.DATE OF DEATH
ULAN.PLACE OF DEATH
ULAN.GENDER
WF.PERIOD OR MOVEMENT
WF.PLACE LIVED
WF.WIKIPEDIA URL
WF.FREEBASE URL
WF.DBPEDIA URL
WF.IMAGE
WF.ART FORM
WF.INFLUENCED
WF.INFLUENCED BY
ULAN.ID
Syntax (BNF):
<file> ::= <row> | <row> <file>
<row> ::= <column source> "\n"
<column source> ::= "ID" | "None" | '"' <column content> '"' | <list of columns from kb>
<list of columns from kb> ::= <column from kb> | <column from kb> "|" <list of columns from kb>
<column from kb> ::= <name of kb> "." <name of column>
<name of kb> ::= <name of kb1> | <name of kb2>
where:
ID – generates an identifier consisting of the prefix specified by the parameter --id_prefix and a hexadecimal sha224 hash of the current counter value in the output column
None – generates a blank output column
<column content> – fills the column with this string
<list of columns from kb> – if it contains more than one column, the output column is assumed to have the "(MULTIPLE VALUES)" flag
<column from kb> – determines which column is used for the output
<name of column> – is taken from the *.fields file of the corresponding KB <name of kb>, without the possible "(MULTIPLE VALUES)" flag
<name of kb1> – specified by the parameter --first
<name of kb2> – specified by the parameter --second

Example:

ID
"person"
"artist"
ARTREPUBLIC.NAME
None
None
None
ARTREPUBLIC.DESCRIPTION|ARTREPUBLIC.ABOUT
None
None
None
None
None
None
None
None
None
None
ARTREPUBLIC.LOCAL IMAGE
None
None
None
None
ARTREPUBLIC.PROFILE LINK
Syntax (BNF):
<file> ::= <unique> <name> <other>
<unique> ::= "UNIQUE:" "\n" <list of relations>
<name> ::= "NAME:" "\n" <list of relations>
<other> ::= "OTHER:" "\n" <list of relations>
<list of relations> ::= "" | <relation> <list of relations>
<relation> ::= "\t" <column from kb1> "=" <column from kb2> "\n"
<column from kb1> ::= <name of kb1> "." <name of column>
<column from kb2> ::= <name of kb2> "." <name of column>
where:
<unique> – relations between unique values, i.e. entity identifiers, typically the Wikipedia URL or Freebase URL
<name> – relations between names, alternative names and so on
<other> – used to evaluate individual candidates; it is NOT used to assign entities to each other
<relation> – a relation between the columns <column from kb1> and <column from kb2>
<name of column> – is taken from the corresponding *.fields file; if it also appears in *_output.conf, it affects the content of the output file
<name of kb1> – specified by the parameter --first
<name of kb2> – specified by the parameter --second

Example:

UNIQUE:
	WIKIPEDIA.WIKIPEDIA URL=FREEBASE.WIKIPEDIA URL
	WIKIPEDIA.FREEBASE URL=FREEBASE.FREEBASE URL
NAME:
	WIKIPEDIA.NAME=FREEBASE.NAME
	WIKIPEDIA.NAME=FREEBASE.ALIAS
	WIKIPEDIA.ALTERNATIVE NAME=FREEBASE.NAME
	WIKIPEDIA.ALTERNATIVE NAME=FREEBASE.ALIAS
OTHER:
	WIKIPEDIA.DATE OF BIRTH=FREEBASE.DATE OF BIRTH
	WIKIPEDIA.DATE OF DEATH=FREEBASE.DATE OF DEATH
	WIKIPEDIA.PLACE OF BIRTH=FREEBASE.PLACE OF BIRTH
	WIKIPEDIA.PLACE OF DEATH=FREEBASE.PLACE OF DEATH
	WIKIPEDIA.WORK=FREEBASE.PROFESSION
	WIKIPEDIA.NATIONALITY=FREEBASE.NATIONALITY
	WIKIPEDIA.INFLUENCED=FREEBASE.INFLUENCED
	WIKIPEDIA.INFLUENCED BY=FREEBASE.INFLUENCED BY
	WIKIPEDIA.PERIOD OR MOVEMENT=FREEBASE.PERIOD OR MOVEMENT
	WIKIPEDIA.IMAGE=FREEBASE.IMAGE
Data sources available from:
secapi/NER/KnowBase/artworks
There are several files in the artworks folder. These are:
DBPEDIA (data source obtained from DBpedia)
DBPEDIA.fields (format of the DBPEDIA data source - note especially the "(MULTIPLE VALUES)" postfix needed to distinguish multiple values in a field)
FREEBASE (data source retrieved from Freebase)
FREEBASE.fields (format of the FREEBASE data source)
ARTWORKS (the newly created KB - the one we want to create again)
ARTWORKS.fields (format of the new KB, needed by the other members co-operating on Decipher)
ARTWORKS_other_output.conf (configuration file for the switch --conf_other_output)
ARTWORKS_output.conf (configuration file for the switch --conf_output)
dbpedia_freebase_rel.conf (configuration file defining the relations between DBPEDIA and FREEBASE fields)
start.sh (command to start the creation of the new KB with all the necessary switches)
Two Knowledge Bases, KB1 and KB2, are always compared. KB1 is read sequentially by rows, and each row is matched against a record in KB2. First, the script searches for an item in KB2 with the same unique identifier (typically the Wikipedia URL); the fields labeled UNIQUE are compared, and the search can use multiple identifiers. If a corresponding record in KB2 is found, it is assigned to the record in KB1 and the search ends. If no result is found, or the record in KB1 does not contain an identifier, the script searches for candidates based on the relations specified in the field labeled NAME; usually names, pseudonyms and alternative names are compared. The result of this search is a list of candidates suitable for assignment. Each candidate is scored by the number of matching strings, and then the relations in the field labeled OTHER are evaluated. The candidate with the best score is assigned if its score reaches or exceeds the threshold; otherwise nothing is assigned. An assigned entity from KB2 is given a "used" flag so it cannot be used again.
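The matching procedure described above can be sketched as follows. This is a simplified illustration only; the record representation, relation lists and scoring are assumptions, not the actual kb_compare.py implementation:

```python
# Simplified sketch of the matching strategy described above.
# Records are dicts; UNIQUE, NAME and OTHER hold hypothetical relation
# pairs (field in KB1, field in KB2) -- not the real configuration.

UNIQUE = [("WIKIPEDIA URL", "WIKIPEDIA URL")]
NAME = [("NAME", "NAME"), ("NAME", "ALIAS")]
OTHER = [("DATE OF BIRTH", "DATE OF BIRTH"), ("NATIONALITY", "NATIONALITY")]

def match_record(rec1, kb2, threshold):
    """Return the KB2 record matched to rec1, or None."""
    # 1. Try the unique identifiers first.
    for f1, f2 in UNIQUE:
        if rec1.get(f1):
            for rec2 in kb2:
                if not rec2.get("used") and rec2.get(f2) == rec1[f1]:
                    rec2["used"] = True
                    return rec2
    # 2. Collect candidates via the NAME relations.
    candidates = [r for r in kb2 if not r.get("used") and any(
        rec1.get(f1) and rec1.get(f1) == r.get(f2) for f1, f2 in NAME)]
    # 3. Score candidates on matching strings; the best one is assigned
    #    only if it reaches the threshold.
    best, best_score = None, -1
    for rec2 in candidates:
        score = sum(1 for f1, f2 in NAME + OTHER
                    if rec1.get(f1) and rec1.get(f1) == rec2.get(f2))
        if score > best_score:
            best, best_score = rec2, score
    if best is not None and best_score >= threshold:
        best["used"] = True
        return best
    return None
```

Once a record is matched, its "used" flag prevents it from being assigned to another entity, exactly as described above.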
In some cases, two entities may not be matched even though they should be. An example is the following pair of records:
a:8b4f4e2666 artist Charles Alexander Smith Charles Alexander Smith Charles Alexander Smith Canadian Charles Alexander Smith was a Canadian painter from Ontario. 1864 1915 M http://www.freebase.com/m/0269dw5 http://en.wikipedia.org/wiki/Charles_Alexander_Smith http://dbpedia.org/page/Charles_Alexander_Smith freebase/02cy_gx.jpg
and
a:3f368917fe artist Charles Alexander Alexander, Charles Alexander Charles Smith artist photographer|painter British Canadian British painter, 1864-1915 1864 Ontario (Canada) (province) 1915 London (Greater London, England, United Kingdom) (inhabited place) M 500026312
These two records could not be assigned to each other. Most importantly, the second record does not contain any unique identifier, such as a Wikipedia URL, by which it could be unambiguously assigned to the first record. The assignment by name match was also unsuccessful: the first record contains only the name "Charles Alexander Smith", while the second record includes "Charles Alexander", "Alexander, Charles" and "Alexander Charles Smith". Because the given names appear in reverse order, the strings do not match, so the entities were not assigned to each other.
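The script compares names as plain strings. A sketch of the kind of token-order normalization that would let such records match (this is an illustration only, not part of kb_compare.py):

```python
def name_key(name):
    """Normalize a personal name to an order-independent key:
    drop commas, lowercase, and compare the set of tokens."""
    return frozenset(name.replace(",", "").lower().split())

# Different strings, identical token sets:
print(name_key("Charles Alexander Smith") == name_key("Alexander Charles Smith"))  # True
print(name_key("Charles Alexander Smith") == name_key("Smith, Charles Alexander"))  # True
# The plain string comparison the script relies on fails here:
print("Charles Alexander Smith" == "Alexander Charles Smith")  # False
```

Token-set comparison would, of course, also increase the risk of false matches between different people sharing the same name parts, which is why a scoring threshold would still be needed.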
The task is to create a regular expression able to extract every common format of date occurring in a text (2004-04-30, 02/30/1999, 31 January 2003 etc.). The regular expression is supposed to be complex (one regular expression only) and should extract as many common formats as possible (we are not interested in specific times, only in years, months and days).
After extraction, the dates are normalized (to ISO 8601) so that they can be processed further.
Alongside the regular expression, it was necessary to write code (a function and a class) to process the dates and pass them to other scripts.
The input is plain English text (type str) and the output is a list (type list) of instances of class Date.
If the script were to be perfect, it would have to include a semantic analysis of English, to recognize what is and what is not a date or a year.
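A minimal sketch of such an alternation-based regular expression, covering only the three formats mentioned above (the real expression in dates.py is far more extensive):

```python
import re

MONTH = (r"(?:January|February|March|April|May|June|July|"
         r"August|September|October|November|December)")

# One combined regex, one alternative per date format.
DATE_RE = re.compile(
    r"\b\d{4}-\d{2}-\d{2}\b"                  # 2004-04-30 (ISO 8601)
    r"|\b\d{1,2}/\d{1,2}/\d{4}\b"             # 02/30/1999 (slash format)
    r"|\b\d{1,2}\s" + MONTH + r"\s\d{4}\b"    # 31 January 2003
    r"|\b" + MONTH + r"\s\d{1,2},\s\d{4}\b")  # April 2, 1918

text = "Born 31 January 2003; see 2004-04-30, 02/30/1999 and April 2, 1918."
print([m.group(0) for m in DATE_RE.finditer(text)])
# ['31 January 2003', '2004-04-30', '02/30/1999', 'April 2, 1918']
```

Each alternative only captures the date substring and its offsets; validation (e.g. rejecting the impossible 02/30/1999) and normalization to ISO 8601 happen in a later step.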
The script is located on git:
secapi/NER/dates.py
Class Date
A class for the dates found. Besides the year, month and day, it also stores the position in the source string and the substring from which the date was parsed. After creating a new instance, it is necessary to call init_date() or init_interval() to initialize the attributes.
It has two types: DATE for a plain date and INTERVAL for an interval between two dates. Here is an example with comments:
class_type: DATE        # attribute indicating the type of the data found (plain date or interval)
source: April 2, 1918   # source string from the source text
iso8601: 1918-04-02     # date as an ISO_date
s_offset: 265           # start of the source string in the source text
end_offset: 278         # end of the source string in the source text (calculated as s_offset + len(source))
---------------------------------
class_type: INTERVAL
source: 1882-83
date_from: 1882-00-00   # starting date as an ISO_date
date_to: 1883-00-00     # ending date as an ISO_date
s_offset: 467
end_offset: 474
Class ISO_date
A class storing a date (year, month and day) in the attributes day, month and year. It was created to replace datetime.date in situations when only the year is known; in that case the unknown parts are set to zero (e.g. 1881-00-00).
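A minimal sketch of such a class (the attribute names follow the description above; the actual implementation is in dates.py):

```python
class ISO_date:
    """Date that tolerates unknown parts: an unknown month/day stays 0,
    which plain datetime.date does not allow."""

    def __init__(self, year=0, month=0, day=0):
        self.year, self.month, self.day = year, month, day

    def __str__(self):
        # Zero-padded, so a year-only date prints e.g. "1881-00-00".
        return "%04d-%02d-%02d" % (self.year, self.month, self.day)

print(ISO_date(1881))        # 1881-00-00
print(ISO_date(1918, 4, 2))  # 1918-04-02
```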
Month and year only:
Year only:
When using dateutil.parser:
str(dateutil.parser.parse("Jul 18 '30").date()) -> '2030-07-18' # automatically adds the current first two digits - considered correct
str(dateutil.parser.parse("Jul 18 30").date())  -> '2030-07-18' # what if the year is 30?
str(dateutil.parser.parse("0030-01-01").date()) -> '2001-01-30' # it looks like dateutil.parser does not accept years less than one hundred
str(dateutil.parser.parse("0099-01-01").date()) -> '1999-01-01' # does not always add the current first two digits, but the closest ones
str(dateutil.parser.parse("Jul 18 '62").date()) -> '2062-07-18' # because it is year 2013
str(dateutil.parser.parse("Jul 18 '63").date()) -> '1963-07-18' # because it is year 2013
str(dateutil.parser.parse("0100-01-01").date()) -> '0100-01-01' # correct

Given DD/MM/YYYY, dateutil.parser takes the date as MM/DD/YYYY if DD < 13, otherwise as DD/MM/YYYY:

str(dateutil.parser.parse("10/1/2000").date()) -> '2000-10-01'
str(dateutil.parser.parse("13/1/2000").date()) -> '2000-01-13'
During the scanning of 45,764,556 words, 561,744 entries were found (of which 177,336 were intervals) in 11 m 17.836 s, i.e. a speed of 67,515.6764 words per second.
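The reported speed follows directly from the totals:

```python
words = 45_764_556
entries = 561_744
seconds = 11 * 60 + 17.836          # 11 m 17.836 s = 677.836 s

print(round(words / seconds, 4))    # ~67515.6764 words per second
print(round(entries / seconds, 1))  # ~828.7 entries found per second
```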
The task was to create a test set for the start.sh script stored in secapi/NER
.
List of created files:
secapi/NER/TestAndRunStart.sh - test script
secapi/NER/AllKB.txt - list of files needed to create a new KB
secapi/NER/fields.txt - list of field files
secapi/NER/MatchKBAndFields.py - script that assigns the relevant fields file to each KB
If the script is launched with the -t parameter, TestAndRunStart.sh checks the availability of the files that start.sh works with and of the scripts that start.sh launches. It also checks whether the *.py and *.sh files have the appropriate permissions set. Next, the script tests the files needed to create a new KB (listed in the AllKB.txt file) for the correct format, specifically the number of columns, which must match the number of columns in the fields files stored in NER/KnowBase.
If the script is launched with the -r parameter, the start.sh script is called and its stderr is redirected to a file. After the script finishes, the stderr is analyzed and the names of any missing files, the number of run-time errors and the number of warnings are printed out. After the creation of the new machines and KB, a comparison with the old versions of the machines and of the KB.all and KBstatsMetrics.all files takes place. The script reports an error if the difference between a newly created file and the older version is greater than 5%, or if the new file is smaller by more than 10%. It also reports an error if no machine or KB has been created.
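The size comparison can be sketched like this (a hypothetical helper, not the actual TestAndRunStart.sh code):

```python
def check_kb_size(old_size, new_size):
    """Mimic the described sanity check on a newly generated file:
    flag a size difference of more than 5% against the previous
    version, a shrink of more than 10%, and a missing file."""
    errors = []
    if old_size and abs(new_size - old_size) / old_size > 0.05:
        errors.append("size differs from the previous version by more than 5%")
    if old_size and new_size < 0.90 * old_size:
        errors.append("new file is more than 10% smaller than the previous version")
    if new_size == 0:
        errors.append("file was not created")
    return errors

print(check_kb_size(1000, 1020))  # [] - within tolerance
print(check_kb_size(1000, 850))   # both thresholds exceeded
```

Such relative checks catch a truncated or empty KB without requiring the new file to be byte-identical to the previous version.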
Testing of the ner.py script occurs when launching with the -n parameter, and the figav08 tool is tested with the -f flag. 75% of the testing is done on data stored in secapi/NER/data/input; in the remaining cases a specific entity is tested: "Paris 1984" for ner.py and "A Alewijn" for figav08. The ner.py utility (also when launching with the -n parameter) is then tested with different switches: without any switches, with the -l switch (converts the input to lower case), with the -r switch (input with an error in diacritics), and with both -l and -r on invalid input text.
Launch example:
./TestAndRunStart.sh -t -> starts testing of file existence, permissions and the format of the files needed to create the KB
./TestAndRunStart.sh -r -> starts the start.sh script
./TestAndRunStart.sh -f -> starts testing of the figa tool
./TestAndRunStart.sh -n -> starts testing of the ner tool
./TestAndRunStart.sh -h -> shows help
./TestAndRunStart.sh -c -> starts the CheckNerOut.py script - tests the output from ner against manual annotation
The script CheckNerOut.py
is used for this comparison. This script prints a percentage match to the output. The file with entities from annotation that were not paired is saved into the secapi/NER/CheckNerOut
directory. The same is done with entities from the file that contains the output from NER. The script has been embedded in TestAndRunStart.sh
, where it tests files that have manual annotations.
Launch example:
python CheckNerOut.py -a SouborSAnotaci.tsv -o NerOutput.txt
Paul_Kane: 56.6037735849%
travelling_artist_part2: 69.0909090909%
travelling_artist_part3: 81.6513761468%
Dossier_and_OS_texts.docx: 56.5020576132%
The purpose of the script is to identify people's given names and surnames in a selected text using a list of names. The tool is located in the git repository in the directory secapi/NameRecognizer (it is necessary to switch to the NameRecognizer branch).
1. step - Creating a list of names:
First, it was necessary to create a list of names for the finite state machine; for this I created the Name Collector tool, which is described below. Of all the output text files, only outputs/all.txt is needed by the finite state machine.
2. step - Compilation of figa tool:
First, you need to download the figa tool git repository (see Decipher_fsa) to:
/mnt/minerva1/nlp/projects/decipher_ner/xklima22
and launch script:
/mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/create_figa.sh
The script compiles the tool and copies it to directory:
/mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/data
3. step - Formation of the finite state machine:
As in Step 2, it is necessary to download the git figa tool repository and then run the script:
/mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/create_fsa.sh
The script assembles the finite state machine final.fsa using the list of names from the first step and puts it in directory:
/mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/data
4. step - Processing outputs of figa:
Figa outputs are first processed with sort -u and then by the process_outputs.py
script:
/mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/process_outputs.py
The script accepts 2 parameters - the input and output file - and combines the acquired names based on offsets.
The process_outputs.py script uses the following lists, stored in the data/lists directory, for filtering:
blist_locations.txt - list of locations created by the kb_locations script
custom_names.txt - list of manually added first names
custom_surrnames.txt - list of manually added surnames
names.txt - list of first names, obtained from the results of the name_collector and kb_list scripts
nationalities.txt - list of nationalities
notfirst.txt - list of words that cannot be in the first position
replace.txt - list of words or phrases that are to be removed from the names
surrnames.txt - list of surnames, obtained from the results of the name_collector and kb_list scripts
The whole step is automated by the run.sh
script, which loads the data from stdin and prints the results to stdout.
In addition to names, the script also searches the text for initials (two-character tokens composed of a capital letter and a dot) to find their start and end offsets, and then adds them to the list of names from the figa tool output. After obtaining the list of names, the "'s" suffix is stripped from names ending with it and the end offset is recalculated.
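A sketch of these two post-processing steps (the regex and the offset handling are illustrative assumptions, not the actual process_outputs.py code):

```python
import re

# Two-character initials: a capital letter followed by a dot, e.g. "J.".
INITIAL_RE = re.compile(r"\b[A-Z]\.")

def find_initials(text):
    """Return (start, end, token) triples for the initials in the text."""
    return [(m.start(), m.end(), m.group(0)) for m in INITIAL_RE.finditer(text)]

def strip_possessive(name, end_offset):
    """Drop a trailing "'s" and shorten the end offset accordingly."""
    if name.endswith("'s"):
        return name[:-2], end_offset - 2
    return name, end_offset

print(find_initials("Painted by J. B. Yeats."))   # [(11, 13, 'J.'), (14, 16, 'B.')]
print(strip_possessive("Walter Osborne's", 360))  # ('Walter Osborne', 358)
```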
Usage:
bash /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/run.sh [--show-filtered]
Optional parameter --show-filtered
causes the filtered names to be written to a file.
Example use:
bash /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/run.sh < /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/test_data/test.txt
Output of the example above:
0 26 39 Nathaniel Hill
0 41 54 Nathaniel Hill
1 202 224 Joseph Malachy Kavanagh
1 226 248 Joseph Malachy Kavanagh
0 328 341 Walter Osborne
0 347 360 Nathaniel Hill
0 539 552 Edward McGuire
0 554 567 Edward McGuire
Output format:
1. column - type of name according to how it was found in the text (0 - the whole name was found by the figa tool, 1 - part of the name was found by a match and part was added by the script)
2. column - start offset
3. column - end offset
4. column - name
The list of obtained names can also include invalid names (such as Post Office) or names that are sub-strings of other found names. This occurs because, when the number of obtained names is increased, the new names are not compared with the contents of surrnames.txt and could therefore be surnames. The way a name was found is indicated by the first column of the results; the possible flags are:
0 - the whole name was found by the figa tool
1 - part of the name was found by a match and the other part was added by the script
4 - names created by joining names of types 0 and 1
7 - names whose surname part was not found in surrnames.txt
8 - names that are a sub-string of another found name
Names tagged with number 4 are created by merging names of types 0 and 1. First, the script determines which of the combined names has the lower start offset; all the words from this name are added, followed by the words from the second name (only those that do not yet appear in the resulting name). Finally, the offsets are recalculated, the first name is replaced by the new one and the second one is removed.
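The merge can be sketched as follows (simplified; representing a name as a (start_offset, text) pair is an assumption, not the script's actual data structure):

```python
def merge_names(a, b):
    """Merge two overlapping names into a single type-4 name.
    a and b are (start_offset, text) pairs; the one with the lower
    start offset contributes its words first, then the other name
    adds only the words not already present."""
    first, second = sorted([a, b])
    words = first[1].split()
    words += [w for w in second[1].split() if w not in words]
    text = " ".join(words)
    # New offsets: start of the earlier name, end recalculated.
    return first[0], first[0] + len(text), text

print(merge_names((202, "Joseph Malachy"), (209, "Malachy Kavanagh")))
# (202, 225, 'Joseph Malachy Kavanagh')
```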
The script is capable of learning new type 1 words (see paragraph above for info). The words learned like this are stored in:
/mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/outputs/learned.txt
These words cannot be unambiguously identified as first names or surnames, so they are only included in the general list and not in the names.txt and surrnames.txt lists (the list descriptions are at the beginning of this step).
Names that have been filtered out for a certain reason are stored in:
/mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/outputs/filtered.txt
Only the results of the last run are saved. Writing of the filtered names must be enabled by the optional parameter --show-filtered. The number at the beginning of each line specifies the reason for the filtering (the list that caused the name to be filtered is given in parentheses):
0 - the name contained fewer than 2 words
1 - the name contained a location name (list blist_locations.txt)
2 - the name had, as its first word, a word that cannot be in that position (list notfirst.txt)
3 - the name contained, as its first word, a word which is not a first name (names.txt)
4 - the name contained, as its final word, a word which is not a surname (surrnames.txt) [DELETED]
5 - the name contained a nationality (nationalities.txt)
5. step - Highlighting of the names found in text:
Outputs of figa are further used by the following script:
/mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/highlight_names.py
The script requires the outputs of the previous step stored in figa.out
file, takes the text on stdin and prints HTML text with colored highlighted names on stdout:
green - names marked 0 in the first column of the figa outputs, i.e. names found by a full match with names in names.txt and surrnames.txt
red - names marked 1 in the first column of the figa outputs, i.e. names that have at least one part added by the script, so they may not be valid names
blue - names that have been found in the text more times than their count in the figa outputs
purple - co-references of the names highlighted by the colors above
lime - names whose surnames are not in the surrnames.txt list
olive - names that are a sub-string of a longer name (green, red or blue)
The first column determines the way the name was found in the text (see step 4). Names marked blue may indicate an error in the processing of outputs by the process_outputs.py script, or an error in the search by the figa tool.
Usage:
python /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/highlight_names.py
Example use:
bash /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/run.sh < /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/test_data/test.txt > figa.out python /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/highlight_names.py < /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/test_data/test.txt > outputs/examples/example.html
The example output file is stored in:
/mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/outputs/example.html
Results
Names recognized by the tool in test texts:
Richard Moynan
Mary Magdalene
Jack B.
Lawrence J.
To use NameRecognizer in other scripts, the wrapper script name_recognizer.py is available. It exposes the functionality of process_outputs.py and highlight_names.py through two functions, process(text) and highlight(text, output_fce_process), which return the processed figa outputs or the text with highlighted names, respectively.
Before use, it is necessary to instantiate the NameRecognizer class with two mandatory constructor parameters: the path to the executable figa binary (figav08) and the path to the finite state machine (*.fsa).
To test functionality, you can run the script with this input text:
/mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/test_data/example_input.txt
The output is written to stdout:
python /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/name_recognizer.py
A tool that obtains a list of names from several websites. It is located in:
/mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/name_collector
Directory structure:
main.py - launch script of the tool
base.py - defines the base class for resource (site) classes with a previously known number of pages
base_pagination.py - defines the base class for resource (site) classes that use paging (i.e. they have a "next" button)
rsrc_*_base.py - classes that define a common interface for certain sources
rsrc_*.py - source classes
./outputs/names.txt - alphabetically sorted list of first names
./outputs/surrnames.txt - alphabetically sorted list of surnames
./outputs/all_raw.txt - alphabetically sorted list of first names and surnames
./outputs/all.txt - the file all_raw.txt edited into a format suitable for the figa tool
./outputs/name/*.txt - outputs of the sources with first names
./outputs/surrname/*.txt - outputs of the sources with surnames
Sites implemented as sources:
http://german.about.com/library/blname_Girls.htm
http://german.about.com/library/blname_Boys.htm
http://babynames.net
http://surname.sofeminine.co.uk/w/surnames/most-common-surnames-in-great-britain.html
http://www.surnamedb.com/Surname
http://en.wikipedia.org/wiki/Old_Frisian_given_names
http://en.wikipedia.org/wiki/List_of_biblical_names
http://en.wikipedia.org/wiki/Slavic_names
http://genealogy.familyeducation.com/browse/origin/
Usage:
python /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/name_collector/main.py
The script automatically runs when using the run.sh
script in the KB List tool.
A tool used to extract names from KB.all; its results are combined with the results from Name Collector. It is located in:
/mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/kb_list
Complete processing, merging and categorization of data into names and lastnames is performed by run.sh
.
Usage:
bash /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/kb_list/run.sh
Tool that creates a list of locations used to filter the outputs of figa. The result of the script is a list of locations:
data/lists/blist_locations.txt
The script is located in:
/mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/kb_locations
and is launched using the following command:
bash /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/kb_locations/run.sh
Script that searches for all one-word names in KB.all
. The script is located in directory:
/mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts
Launched using:
python /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/swn.py
The script outputs one-word names together with information about their occurrence in KB.all. The first line contains the version of KB.all; the format of the second and following lines is:
1. column - line number in KB.all at which the name was found
2. column - ID of the person in KB.all
3. column - the name
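Assuming the three columns are tab-separated (the separator is an assumption, not stated above), one data line of this output can be parsed like this:

```python
def parse_swn_line(line):
    """Split one data line of the swn.py output into its three columns."""
    kb_line, person_id, name = line.rstrip("\n").split("\t")
    return int(kb_line), person_id, name

print(parse_swn_line("42\tp:6fa1cac12f\tErjon"))  # (42, 'p:6fa1cac12f', 'Erjon')
```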
Example outputs of process_outputs.py (files with the .out suffix), highlight_names.py (.html files), and swn.py (swn.txt) are stored in:
/mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/outputs/examples/
The aim of the project is to completely replace the FSA automata in the FIGA project with CEDAR (implemented in the Ner4 project).
This project has its own branch on git: NER-figa_cedar.
The following sections briefly describe the changes made during the migration from fsa figa to cedar figa.
For the Ner project, the scripts were only slightly updated to use/support the new version of Figa.
The following files were kept, even though they were not used even in the old version:
NER/figa/sources/kb_loader_fast.cc NER/figa/sources/kb_loader_slow.cc
Updated scripts in order to use the new version of figa:
NER/start.sh NER/uploadKB.sh NER/deleteKB.sh NER/ner.py
Major changes have been made in the Figa project. The most important one is the replacement of the FSA machines with CEDAR tries. The aim was to modify the new Figa as little as possible; mostly only trivial bugs were fixed. The interface between Figa (C++) and Ner (Python) was slightly modified. The new Figa machines occupy much more disk space, but processing is faster.
create_fsa[_autocomplete].sh - rewritten and adapted as create_cedar[_autocomplete].sh
autocomplete.py - updated
Below is a comparison of the requirements for creating the machines and for searching in them, for the individual libraries. All tests were run on the athena1 server.
To create the namelists, Knowledge Bases from the Ner project were used: KB.all (5,400,067 entities) and its smaller version KB.11 (490,915 entities; it contains every 11th line of KB.all).
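A KB.11-style sample ("every 11th line") can be produced with a short sketch like this; the exact line offset used to build the original KB.11 is an assumption:

```python
def every_nth_line(lines, n=11):
    """Keep every n-th line, starting with the first one."""
    return [line for i, line in enumerate(lines) if i % n == 0]

# 22 input lines with n=11 keep exactly 2 lines.
print(len(every_nth_line([f"entity {i}" for i in range(22)])))  # 2
```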
The input texts example_input2 (3,693 words) and testing_data.txt (17,931 words) were used for the entity search.
Creation of the automaton (FSA) and of the dictionaries (CEDAR/DARTS). A prepared namelist is assumed, so only the time of generating the machine itself by figav1.0/fsa_build is measured, not the whole create_[cedar|fsa].sh process:
| | Time (FSA) | Time (CEDAR) | Time (DARTS) | Memory usage (FSA) | Memory usage (CEDAR) | Memory usage (DARTS) | Machine size (FSA) | Machine size (CEDAR) | Machine size (DARTS) |
|---|---|---|---|---|---|---|---|---|---|
| KB.11 | 56 s | 31 s | 30 s | 2.4 GB | 331.1 MB | 251.1 MB | 24.3 MB | 98.4 MB | 31.3 MB |
| KB.all | 115 m | 4 m 46 s | 4 m 50 s | 8.8 GB | 2.8 GB | 2.9 GB | 237.1 MB | 921.1 MB | 334.3 MB |
Creation of the spellcheck automaton (FSA only; CEDAR and DARTS do not need a special dictionary):
| | Time | Memory usage | Machine size |
|---|---|---|---|
| KB.11 | 12 s | 329.4 MB | 11.6 MB |
| KB.all | 3 m 42 s | 4.0 GB | 90.8 MB |
Search in created automata/dictionaries:
| KB | Input text | Time (FSA) | Time (CEDAR) | Time (DARTS) | Memory usage (FSA) | Memory usage (CEDAR) | Memory usage (DARTS) |
|---|---|---|---|---|---|---|---|
| KB.11 | example_input2 | 0.11 s | 0.49 s | 0.16 s | 36.8 MB | 110.9 MB | 43.9 MB |
| KB.11 | testing_data.txt | 0.15 s | 0.51 s | 0.18 s | 36.8 MB | 110.9 MB | 43.9 MB |
| KB.all | example_input2 | 1.21 s | 4.74 s | 1.89 s | 249.7 MB | 933.7 MB | 346.8 MB |
| KB.all | testing_data.txt | 1.27 s | 4.84 s | 1.93 s | 249.7 MB | 933.7 MB | 346.8 MB |
Search with spellcheck enabled in automata/dictionaries:
| KB | Input text | Time (FSA) | Time (CEDAR) | Time (DARTS) | Memory usage (FSA) | Memory usage (CEDAR) | Memory usage (DARTS) |
|---|---|---|---|---|---|---|---|
| KB.11 | example_input2 | 0.12 s | 4.3 s | 3.4 s | 48.4 MB | 110.9 MB | 43.8 MB |
| KB.11 | testing_data.txt | 0.17 s | 20.8 s | 17.1 s | 48.5 MB | 110.9 MB | 43.8 MB |
| KB.all | example_input2 | 1.64 s | 23.5 s | 17.3 s | 340.5 MB | 934.1 MB | 347.9 MB |
| KB.all | testing_data.txt | 1.67 s | 2 m 26 s | 2 m | 340.5 MB | 934.8 MB | 347.3 MB |
Conclusion
The CEDAR and DARTS libraries create a machine much faster than FSA and with much smaller memory requirements (for some really big namelists from the Wikify project, the memory of the athena1 server was not sufficient, and even athena3 failed to create an FSA dictionary without an error). CEDAR and DARTS also do not need special spellchecking machines. On the other hand, the FSA machine uses less disk space and searching in it is much faster.
The script secapi/NER/KB_changes_comparator.py was created in the branch "wikipedia_update" to compare different versions of the KB and report changes for entities with a Wikipedia link.
Required arguments:
oldKB_path - path to the old KB file
newKB_path - path to the new KB file
Optional arguments:
-h, --help - show the help message and exit
-w, --word - if toggled, the changes are displayed in the context of the whole word, not just the changed part
-e CATEGORY, --exclude CATEGORY - exclude some categories from the comparison. Usage: --exclude "ALIAS,DATE OF DEATH" excludes the categories ALIAS and DATE OF DEATH from the output. You can also use --exclude WIKI, which excludes all categories with WIKI in their name. Category names can be found in the HEAD-KB file.
-c CATEGORY, --category CATEGORY - explicitly set one category to be compared. Category names can be found in the HEAD-KB file.
Launch example:
python ./KB_changes_comparator.py oldKB.tsv newKB.tsv -w -e "GENDER,NATIONALITY"  # compares oldKB with newKB, prints whole words, omits the categories GENDER and NATIONALITY
python ./KB_changes_comparator.py oldKB.tsv newKB.tsv -c DESCRIPTION  # compares oldKB with newKB, only the category DESCRIPTION
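The substring semantics of --exclude described above can be sketched like this (the function name is an assumption, not part of the script's API):

```python
def is_excluded(category, patterns):
    """True if any exclude pattern occurs as a substring of the category name."""
    return any(p in category for p in patterns)

print(is_excluded("WIKI BACKLINKS", ["WIKI"]))  # True
print(is_excluded("GENDER", ["WIKI"]))          # False
```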
Example of output without and with the -w parameter:
without -w:
4116 4116 DESCRIPTION http://en.wikipedia.org/wiki/Štepán_Wagner replace . -> er
with -w:
4116 4116 DESCRIPTION http://en.wikipedia.org/wiki/Štepán_Wagner replace jump. -> jumper
Output structure:
row_number_in_newKB \t row_number_in_oldKB \t column_category \t wiki_link \t change_type \t original_value \t -> \t new_value
Exception in case a new entity is found:
row_number \t wiki_link \t new \t entity_row_content
Example:
452196 3323492 {e}PLACE OF BIRTH http://en.wikipedia.org/wiki/Jan_van_der_Heyden insert -> Gorinchem (South Holland, Netherlands) (inhabited place)
17641 17641 PLACE OF BIRTH http://en.wikipedia.org/wiki/African_Spir replace Elisabethgrad, -> Elizabethgrad,
278171 http://en.wikipedia.org/wiki/Erjon_Vucaj new p:6fa1cac12f person Erjon Vucaj footballer Albania, Shkodër 1990-12-25 http://en.wikipedia.org/wiki/Erjon_Vucaj http://www.freebase.com/m/0b__zy7 http://dbpedia.org/page/Erjon_Vucaj
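Assuming the tab-separated structure described above, one output line can be parsed as follows (a sketch; the "new" marker in the third column distinguishes newly found entities):

```python
def parse_change(line):
    """Parse one line of KB_changes_comparator.py output (tab-separated)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 4 and fields[2] == "new":
        # New-entity form: row_number, wiki_link, "new", entity_row_content.
        row, wiki, _, entity = fields
        return {"row": int(row), "wiki": wiki, "entity": entity}
    # Regular form: new row, old row, category, wiki link, change type,
    # original value, literal "->", new value.
    new_row, old_row, category, wiki, change, old_val, _, new_val = fields
    return {"new_row": int(new_row), "old_row": int(old_row),
            "category": category, "wiki": wiki, "change": change,
            "old": old_val, "new": new_val}

line = "\t".join(["4116", "4116", "DESCRIPTION", "http://example.org",
                  "replace", "jump.", "->", "jumper"])
print(parse_change(line)["new"])  # jumper
```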
The outputs were generated using the following command:
python KB_changes_comparator.py /mnt/data/kb/1455196205/KB.all /mnt/data/kb/1476090552/KB.all
and can be found here:
/mnt/minerva1/nlp/projects/wikipedia_update/output.out
The run time for 4,500,000 rows is 5-6 minutes. The script labels the columns of the individual entity types using this file:
/mnt/minerva1/nlp/repositories/decipher/secapi/HEAD-KB