Decipher NER (en)

Decipher NER (iotrusina)

The Decipher NER tool was developed within the Decipher project and is used to find and disambiguate entities (e.g. persons, artists, locations, ...) in text. It uses a Knowledge Base, which was created by extracting and combining relevant information about entities from many sources such as Wikipedia, Freebase, GeoNames or Getty ULAN. The actual search for entities is done by the tool figa, which is based on finite-state automata.

The NER tool is part of a larger Decipher project called secapi. Most of the tool is stored in the git repository minerva1.fit.vutbr.cz/mnt/minerva1/nlp/repositories/decipher/secapi/; paths on this page are given relative to this repository. However, some parts (e.g. the extraction of information about entities from various sources) are not in the repository yet. In these cases, absolute paths to the school servers are given.

To get an idea of what NER actually does, visit the demo application on the server knot24. The same server also hosts the autocomplete tool, which can efficiently suggest entities from the Knowledge Base.

Knowledge Base (iotrusina)

The Knowledge Base (KB) of the DECIPHER project is stored in TSV format (tab-separated values). The file contains information about entities, with exactly one entity per line. The number of columns on a line depends on the type of the particular entity. The type of an entity is determined by the prefix of the ID column (first column) or by the TYPE column (second column). Since the prefixes are expected to be abolished in the future, it is preferable to use the TYPE column to identify the type. Columns that can contain multiple values are tagged as MULTIPLE VALUES; multiple values are separated by the | (vertical bar) character. A complete overview of the available types, including a description of the relevant columns, can be found below; a short parsing sketch in Python follows the overview.

The current version of the KB can be downloaded from the server athena3, where it is available at /mnt/data/kb/KB.all.

 PERSON (prefix: p)
 ==========================================
 01 ID
 02 TYPE
 03 NAME
 04 ALIAS (MULTIPLE VALUES)
 05 DESCRIPTION
 06 IMAGE (MULTIPLE VALUES)
 07 PERIOD OR MOVEMENT (MULTIPLE VALUES)
 08 PLACE OF BIRTH
 09 PLACE OF DEATH
 10 DATE OF BIRTH
 11 DATE OF DEATH
 12 PROFESSION (MULTIPLE VALUES)
 13 PLACE LIVED (MULTIPLE VALUES)
 14 GENDER
 15 NATIONALITY (MULTIPLE VALUES)
 16 WIKIPEDIA URL
 17 FREEBASE URL
 18 DBPEDIA URL
 ARTIST (prefix: a)
 ==========================================
 01 ID
 02 TYPE
 03 DISPLAY TERM
 04 PREFERRED TERM
 05 OTHER TERM (MULTIPLE VALUES)
 06 PREFERRED ROLE
 07 OTHER ROLE (MULTIPLE VALUES)
 08 PREFERRED NATIONALITY
 09 OTHER NATIONALITY (MULTIPLE VALUES)
 10 DESCRIPTION (MULTIPLE VALUES)
 11 DATE OF BIRTH
 12 PLACE OF BIRTH
 13 DATE OF DEATH
 14 PLACE OF DEATH
 15 GENDER
 16 NOTE
 17 PERIOD OR MOVEMENT (MULTIPLE VALUES)
 18 INFLUENCED (MULTIPLE VALUES)
 19 INFLUENCED BY (MULTIPLE VALUES)
 20 ART FORM (MULTIPLE VALUES)
 21 PLACE LIVED (MULTIPLE VALUES)
 22 WIKIPEDIA URL
 23 FREEBASE URL
 24 ULAN ID
 25 DBPEDIA URL
 26 OTHER URL (MULTIPLE VALUES)
 27 IMAGE (MULTIPLE VALUES)
 LOCATION (prefix: l)
 ==========================================
 01 ID
 02 TYPE
 03 NAME
 04 ALTERNATIVE NAME (MULTIPLE VALUES)
 05 LATITUDE
 06 LONGITUDE
 07 FEATURE CODE
 08 COUNTRY
 09 POPULATION
 10 ELEVATION
 11 WIKIPEDIA URL
 12 DBPEDIA URL
 13 FREEBASE URL
 14 GEONAMES ID
 15 SETTLEMENT TYPE (MULTIPLE VALUES)
 16 TIMEZONE (MULTIPLE VALUES)
 17 DESCRIPTION
 18 IMAGE (MULTIPLE VALUES)
 ARTWORK (prefix: w)
 ==========================================
 01 ID
 02 TYPE
 03 NAME
 04 ALIAS (MULTIPLE VALUES)
 05 DESCRIPTION
 06 IMAGE (MULTIPLE VALUES)
 07 ARTIST (MULTIPLE VALUES)
 08 ART SUBJECT (MULTIPLE VALUES)
 09 ART FORM
 10 ART GENRE (MULTIPLE VALUES)
 11 MEDIA (MULTIPLE VALUES)
 12 SUPPORT (MULTIPLE VALUES)
 13 LOCATION (MULTIPLE VALUES)
 14 DATE BEGUN
 15 DATE COMPLETED
 16 OWNER (MULTIPLE VALUES)
 17 HEIGHT
 18 WIDTH
 19 DEPTH
 20 WIKIPEDIA URL
 21 FREEBASE URL
 22 DBPEDIA URL
 23 PAINTING ALIGNMENT (MULTIPLE VALUES)
 24 MOVEMENT
 MUSEUM (prefix: c)
 ==========================================
 01 ID
 02 TYPE
 03 NAME
 04 ALIAS (MULTIPLE VALUES)
 05 DESCRIPTION
 06 IMAGE (MULTIPLE VALUES)
 07 MUSEUM TYPE (MULTIPLE VALUES)
 08 ESTABLISHED
 09 DIRECTOR
 10 VISITORS (MULTIPLE VALUES)
 11 CITYTOWN
 12 POSTAL CODE
 13 STATE PROVINCE REGION
 14 STREET ADDRESS
 15 LATITUDE
 16 LONGITUDE
 17 WIKIPEDIA URL
 18 FREEBASE URL
 19 ULAN ID
 20 GEONAMES ID
 EVENT (prefix: e)
 ==========================================
 01 ID
 02 TYPE
 03 NAME
 04 ALIAS (MULTIPLE VALUES)
 05 DESCRIPTION
 06 IMAGE (MULTIPLE VALUES)
 07 START DATE
 08 END DATE
 09 LOCATION (MULTIPLE VALUES)
 10 NOTABLE TYPE
 11 WIKIPEDIA URL
 12 FREEBASE URL
 VISUAL ART FORM (prefix: f)
 ==========================================
 01 ID
 02 TYPE
 03 NAME
 04 ALIAS (MULTIPLE VALUES)
 05 DESCRIPTION
 06 IMAGE (MULTIPLE VALUES)
 07 WIKIPEDIA URL
 08 FREEBASE URL
 VISUAL ART MEDIUM (prefix: d)
 ==========================================
 01 ID
 02 TYPE
 03 NAME
 04 ALIAS (MULTIPLE VALUES)
 05 DESCRIPTION
 06 IMAGE (MULTIPLE VALUES)
 07 WIKIPEDIA URL
 08 FREEBASE URL
 VISUAL ART GENRE (prefix: g)
 ==========================================
 01 ID
 02 TYPE
 03 NAME
 04 ALIAS (MULTIPLE VALUES)
 05 DESCRIPTION
 06 IMAGE (MULTIPLE VALUES)
 07 WIKIPEDIA URL
 08 FREEBASE URL
 ART PERIOD MOVEMENT (prefix: m)
 ==========================================
 01 ID
 02 TYPE
 03 NAME
 04 ALIAS (MULTIPLE VALUES)
 05 DESCRIPTION
 06 IMAGE (MULTIPLE VALUES)
 07 WIKIPEDIA URL
 08 FREEBASE URL
 NATIONALITY (prefix: n)
 ==========================================
 01 ID
 02 TYPE
 03 NAME
 04 ALIAS (MULTIPLE VALUES)
 05 DESCRIPTION
 06 IMAGE (MULTIPLE VALUES)
 07 SHORT NAME
 08 COUNTRY NAME
 09 ADJECTIVAL FORM (MULTIPLE VALUES)
 10 WIKIPEDIA URL
 11 FREEBASE URL
 MYTHOLOGY (prefix: y)
 ==========================================
 01 ID
 02 TYPE
 03 NAME
 04 ALTERNATIVE NAME (MULTIPLE VALUES)
 05 WIKIPEDIA URL
 06 IMAGE (MULTIPLE VALUES)
 07 DESCRIPTION
 FAMILY (prefix: i)
 ==========================================
 01 ID
 02 TYPE
 03 NAME
 04 ALTERNATIVE NAME (MULTIPLE VALUES)
 05 WIKIPEDIA URL
 06 IMAGE
 07 ROLE (MULTIPLE VALUES)
 08 NATIONALITY
 09 DESCRIPTION
 10 MEMBERS (MULTIPLE VALUES)
 GROUP (prefix: r)
 ==========================================
 01 ID
 02 TYPE
 03 NAME
 04 ALTERNATIVE NAME (MULTIPLE VALUES)
 05 WIKIPEDIA URL
 06 IMAGE
 07 ROLE (MULTIPLE VALUES)
 08 NATIONALITY
 09 DESCRIPTION
 10 FORMATION
 11 HEADQUARTERS
 OTHER (prefix: o) -- !!!! wikifree branch only for now !!!
 ==============================================================
 01 ID
 02 TYPE
 03 TITLE
 04 ALIAS (MULTIPLE VALUES)
 05 ID_WIKIPEDIA
 06 WIKIPEDIA URL
 07 DESCRIPTION
 08 ID_FREEBASE
 09 FREEBASE URL
 10 DESC_FREEBASE
 11 W_BACKLINKS
 12 VIEWS
 13 PRIMARY TAG
 14 TYPE (MULTIPLE VALUES)
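
For illustration, a minimal Python sketch of reading the KB follows. The file path is an example and the PERSON layout above is assumed; per the note above, the TYPE column is the preferred way to identify the type.

 import io
 
 # Minimal sketch: read KB.all (path is an example) and split each entity
 # line into its tab-separated columns; MULTIPLE VALUES columns use '|'.
 with io.open("KB.all", encoding="utf-8") as kb:
     for line in kb:
         columns = line.rstrip("\n").split("\t")
         entity_id, entity_type = columns[0], columns[1]
         # Checking the ID prefix here for brevity; the TYPE column should
         # be preferred, since the prefixes may be abolished.
         if entity_id.startswith("p:"):       # PERSON
             name, aliases = columns[2], columns[3].split("|")
             print(name, aliases)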

Location of the files necessary to generate KB

This section lists the source files needed to generate the KB, for each type separately. If a type is not listed, no scripts are needed to generate the sectional KB of that type and no directory is created for it in git; instead, the merge happens within the prepare_data script (see below). For each type, the git directory containing the scripts and configuration files required to create the KB of that type is given in brackets.

 PERSON (secapi/NER/KnowBase/persons)
 ==========================================
 /mnt/minerva1/nlp/projects/decipher_freebase/data/freebase.persons
 /mnt/minerva1/nlp/projects/decipher_wikipedia/Ludia_Wikipedia/outputs/categories_death_birth_data.csv
 ARTIST (secapi/NER/KnowBase/artists)
 ==========================================
 /mnt/minerva1/nlp/datasets/art/artbiogs/artbiogs.tsv
 /mnt/minerva1/nlp/datasets/art/bbc/bbc.tsv
 /mnt/minerva1/nlp/datasets/art/council/council.tsv
 /mnt/minerva1/nlp/datasets/art/distinguishedwomen/distinguishedwomen.tsv
 /mnt/minerva1/nlp/datasets/art/artists2artists/artists2artists.tsv
 /mnt/minerva1/nlp/datasets/art/the-artists/the-artists.tsv
 /mnt/minerva1/nlp/datasets/art/nationalgallery/nationalgallery.tsv
 /mnt/minerva1/nlp/datasets/art/wikipaint/final_data/wikipaint_artist.tsv
 /mnt/minerva1/nlp/datasets/art/davisart/davisart.tsv
 /mnt/minerva1/nlp/datasets/art/apr/apr.tsv
 /mnt/minerva1/nlp/datasets/art/rkd/rkd.tsv
 /mnt/minerva1/nlp/datasets/art/open/open.tsv
 /mnt/minerva1/nlp/projects/decipher_freebase/data/freebase.artists
 /mnt/minerva1/nlp/projects/decipher_ner/ULAN/ulan_rel_13/KB.artists
 /mnt/minerva1/nlp/projects/decipher_wikipedia/wiki_template/artists_extended
 /mnt/minerva1/nlp/datasets/art/artrepublic/artrepublic.tsv
 /mnt/minerva1/nlp/datasets/art/biography/biography.tsv
 /mnt/minerva1/nlp/datasets/art/englandgallery/englandgallery.tsv
 /mnt/minerva1/nlp/datasets/art/infoplease/infoplease.tsv
 /mnt/minerva1/nlp/datasets/art/nmwa/nmwa.tsv
 LOCATION (secapi/NER/KnowBase/locations)
 ==========================================
 /mnt/minerva1/nlp/projects/decipher_dbpedia/extraction_results/v39/v39_location_finall.tsv
 /mnt/minerva1/nlp/projects/decipher_freebase/data/freebase.locations
 /mnt/minerva1/nlp/projects/decipher_geonames/geonames.locations
 ARTWORK (secapi/NER/KnowBase/artworks)
 ==========================================
 /mnt/minerva1/nlp/projects/decipher_dbpedia/extraction_results/v39/v39_artwork_finall.tsv
 /mnt/minerva1/nlp/projects/decipher_freebase/data/freebase.artworks
 MUSEUM (secapi/NER/KnowBase/museums)
 ==========================================
 /mnt/minerva1/nlp/projects/decipher_dbpedia/extraction_results/v39/v39_museum_finall.tsv
 /mnt/minerva1/nlp/projects/decipher_freebase/data/freebase.museums
 /mnt/minerva1/nlp/projects/decipher_ner/ULAN/ulan_rel_13/KB.corporations
 /mnt/minerva1/nlp/projects/decipher_geonames/geonames.museums
 MYTHOLOGY (secapi/NER/KnowBase/mythology)
 ==========================================
 /mnt/minerva1/nlp/projects/extrakce_z_wikipedie/xgraca00/MythologyKB.txt 
 FAMILY (secapi/NER/KnowBase/artist_families)
 ==========================================
 /mnt/minerva1/nlp/projects/extrakce_z_wikipedie/xdosta40/final/finalFamilies.xml
 GROUP (secapi/NER/KnowBase/artist_group_or_collective)
 ==========================================
 /mnt/minerva1/nlp/projects/extrakce_z_wikipedie/xdosta40/final/finalGroupsAndCollectives.xml
 Other sub-KBs that are integrated into the final KB (see the script secapi/NER/prepare_data)
 ==========================================
 /mnt/minerva1/nlp/projects/decipher_ner/KnowBase/PERSONS
 /mnt/minerva1/nlp/projects/decipher_ner/KnowBase/ARTISTS
 /mnt/minerva1/nlp/projects/decipher_ner/KnowBase/LOCATIONS
 /mnt/minerva1/nlp/projects/decipher_ner/KnowBase/ARTWORKS
 /mnt/minerva1/nlp/projects/decipher_ner/KnowBase/MUSEUMS
 /mnt/minerva1/nlp/projects/decipher_ner/KnowBase/MYTHOLOGY
 /mnt/minerva1/nlp/projects/decipher_ner/KnowBase/ARTIST_FAMILIES
 /mnt/minerva1/nlp/projects/decipher_ner/KnowBase/ARTIST_GROUP_OR_COLLECTIVE
 /mnt/minerva1/nlp/projects/decipher_freebase/data/freebase.events
 /mnt/minerva1/nlp/projects/decipher_freebase/data/freebase.visual_art_forms
 /mnt/minerva1/nlp/projects/decipher_freebase/data/freebase.visual_art_mediums
 /mnt/minerva1/nlp/projects/decipher_freebase/data/freebase.art_period_movements
 /mnt/minerva1/nlp/projects/decipher_freebase/data/freebase.visual_art_genres
 /mnt/minerva1/nlp/projects/decipher_freebase/data/freebase.nationalities
 /mnt/minerva1/nlp/projects/ie_foreign_wikipedia/xbajan01/wiki/output/es_only_data.tsv
 /mnt/minerva1/nlp/projects/ie_foreign_wikipedia/xklima22/wiki/outputs/de-wiki_only_people_data.tsv
 File with stats
 ==========================================
 /mnt/minerva1/nlp/projects/decipher_wikipedia/wikipedia_statistics/wikipedia_statistics_2014-12-09.tsv

Specification of HEAD-KB header (xdolez52)

The file HEAD-KB in "/mnt/minerva1/nlp/repositories/decipher/secapi/" (consider this the working path from now on) serves as a configuration file for loading the KB. It contains a complete specification of all types and columns in the KB (see the chapter Knowledge Base). For each column, its data type and, optionally, a prefix are stated. Individual columns of the header are separated by tabs so that they align with the data columns.

It is used when generating the KB in the script "NER/metrics_knowledge_base.py" and when creating the file "KB-HEAD.all".

Columns in HEAD-KB

Syntax:

<first column> ::= "<" <name of type> ">" <other columns>
                  | "<" <name of type> ":" <name of subtype> ">" <other columns>

<other columns> ::= <name_of_column>
                    | "{" <flags> "}" <name_of_column>
                    | "{[" <prefix_of_value> "]}" <name_of_column>
                    | "{" <flasy> "[" <prefix_of_value> "]}" <name_of_column>

where:

  • name of type is a string indicating the type of an entity (see Knowledge Base)
  • name of subtype is a string indicating a subtype
  • flags are characters indicating the data type (s-string, ...), multiple values in the column data (m-multivalue) and an identifier (i-identifier)
    • Data types are intended for internal use - they suggest the format of SXML, which is used by programs built on top of SEC API.
    • Data types:
      • 's' - "string" (used by default in SXML)
      • 'd' - "decimal" (real numbers)
      • 'e' - "date" (date in ISO 8601 format, specifically YYYY-MM-DD)
      • 'g' - "image" (URI of an image)
      • 'r' - "integer" (integer)
      • 'u' - "uri" (link)
    • When a column can contain multiple values in the data, it has the flag 'm'. This flag is set according to MULTIPLE VALUES in the chapter Knowledge Base.
    • An identifier (flag 'i') is a column whose value is unique for each line across the whole table. If there are multiple identifiers on a line, the first non-empty one from the left is chosen.
  • prefix_of_value is a string that will be prepended to the data in this column
  • name_of_column is a string containing the column name
Example of the first column in the header: <location>ID
Example of the first column in the data: l:21931315183
Example of another column in the header: {iu[http://en.wikipedia.org/]}WIKIPEDIA URL
Example of another column in the data: wiki/city_of_london

Regular expression to get individual items in Python:

regex = re.compile(u'(?u)^(?:<([^:>]+)(?:[:]([^>]+))?>)?(?:\{((?:\w|[ ])*)(?:\[([^\]]+)\])?\})?((?:\w|[ ])+)$')
regex.search(u"<name of type:name of subtype>{flags[prefix_of_value]}name_of_column").groups()
(u'name of type', u'name of subtype', u'flags', u'prefix_of_value', u'name_of_column')
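
A short sketch of how the parsed groups can be used follows: the prefix_of_value is prepended to the data value (the cells are taken from the examples above).

 import re
 
 # Parse one header cell with the regular expression above, then prepend the
 # parsed prefix to a data value to obtain the full URL.
 regex = re.compile(r'(?u)^(?:<([^:>]+)(?:[:]([^>]+))?>)?'
                    r'(?:\{((?:\w|[ ])*)(?:\[([^\]]+)\])?\})?((?:\w|[ ])+)$')
 
 cell = "{iu[http://en.wikipedia.org/]}WIKIPEDIA URL"
 type_, subtype, flags, prefix, name = regex.search(cell).groups()
 print(name, flags)                             # WIKIPEDIA URL iu
 print((prefix or "") + "wiki/city_of_london")  # the full URL for the data value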

Program and libraries for KB in shared memory (xdolez52)

The task was to create a program in C that loads the KB as a string into shared memory. The program and libraries are stored in the git repository "/mnt/minerva1/nlp/repositories/decipher/secapi/" (consider this the working path from now on). For this purpose, the KB must contain a header that gives meaning to the individual columns of the KB. The KB header is separated from the data by one empty line.

Three variants have been created:

  • The 1st variant (SharedKB/var1) uses only '\n' as a separator, so one line of KB is one string.
  • The 2nd variant (SharedKB/var2) divides the KB not only into lines, but also into columns separated by '\t'.
  • The 3rd variant (SharedKB/var3) is based on the 2nd variant, but it also builds a table from "namelist.ASCII". It was created to reduce the RAM requirements when building the automata for figa.

The first two variants were created for the original task, to determine which one would be faster; the second variant proved faster.

For each variant, the dynamic library "libKB_shm.so" and its Python binding "KB_shm.py" are built. At the moment, the Python binding is up to date only for the 2nd variant.

Usage

The daemon loads the KB, and optionally also the namelist, into shared memory. A copy of this memory is saved to disk next to the KB, with the same name as the KB plus the extension ".bin". This copy is used to speed up subsequent loads and should not be transferred between different architectures. Changes to the KB and the namelist are monitored; if they are newer than the SHM copies, they are reloaded and the copies are re-created.

Once the daemon has loaded the data into SHM, it prints "./decipherKB-daemon: Waiting for signal..." on stdout. At this stage it waits for SIGTERM, SIGINT or SIGQUIT; when one of these three signals is received, it deletes the loaded KB from SHM and terminates.

To work with the data in SHM, "libKB_shm.so" and "KB_shm.py" are used. They are not described here - they are documented in comments in "libKB_shm.h" and "KB_shm.py". If that is not enough, let me know.

Launching the daemon

1st variant:

 decipherKB-daemon [{path_to_KB | -b path_to_KB_bin}]

2nd variant:

 decipherKB-daemon [-s SHM_NAME] [{path_to_KB | -b path_to_KB_bin}]

3rd variant:

 decipherKB-daemon [{path_to_KB path_to_namelist | -b path_to_KB_bin}]
 path_to_KB        - path to KB (default value "./KB-HEAD.all")
 path_to_namelist  - path to namelist (default value "./namelist.ASCII")
 -b path_to_KB_bin - loads a copy of SHM
 -s SHM_NAME       - specification of the object name in the shared memory (default is "/decipherKB-daemon_shm")
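
As a heavily hedged aside: on Linux, POSIX shared memory objects appear under /dev/shm, so the presence and size of the daemon's segment (default name "/decipherKB-daemon_shm" in the 2nd variant) can be checked from Python. Real access to the data should go through libKB_shm.so or KB_shm.py, as described above.

 import os
 
 # Illustration only: check the daemon's shared memory object via /dev/shm.
 shm_path = "/dev/shm/decipherKB-daemon_shm"
 if os.path.exists(shm_path):
     print("KB loaded in SHM:", os.path.getsize(shm_path), "bytes")
 else:
     print("daemon not running or KB not loaded")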

Tool ner.py (xmagdo00, iotrusina)

The tool for detection and disambiguation of named entities is implemented in the script ner.py. This chapter only covers how to launch it; further information about how it works can be found below.

The script ner.py uses the KB, which is loaded into shared memory by SharedKB.

Prerequisites

The current version is available in git in the branch D114-NER (replace username with your login).

 git clone ssh://username@minerva1.fit.vutbr.cz/mnt/minerva1/nlp/repositories/decipher/secapi/
 git checkout -b D114-NER origin/D114-NER

It is necessary to perform the following sequence of commands before launching:

 ./downloadKB.sh
 make

Be aware that when using the script downloadKB.sh, the KB and the automata (*.fsa) must not already be present in the directories secapi/NER and secapi/NER/figa. It is advisable to delete them first with the script deleteKB.sh.

Script ner.py

The tool works with the knowledge base extended by columns containing statistical data from Wikipedia and a precomputed disambiguation score. Searching for entities in text and their disambiguation is provided by the script:

 secapi/NER/ner.py
 usage: ner.py [-h] [-a | -s] [-d] [-f FILE]
 
 Optional arguments:
   -h, --help            Prints help and exits.
   -a, --all             Prints all entities from the input without disambiguation.
   -s, --score           Prints every possible meaning of each entity in the text and the score of each of these meanings.
   -d, --daemon-mode     "Daemon mode" (see below)
   -f FILE, --file FILE  Uses the specified file as input.
   -r, --remove-accent   Removes accents from the input.
   -l, --lowercase       Converts the input to lowercase and uses a
                         special automaton with only lowercase letters.

Input can also be read from standard input (redirection may be used).

Test texts for ner.py can be found in directory:

 secapi/NER/data/input

Daemon mode

The mode activated by the -d switch allows processing multiple texts with one instance. Text is expected on standard input, terminated by one of the following commands on a separate line:

 NER_NEW_FILE - prints found entities with disambiguation
 NER_NEW_FILE_ALL - prints found entities without disambiguation
 NER_NEW_FILE_SCORE - prints found entities without disambiguation, including scores for each entity

After the command is entered, the tool prints the list of entities found in the entered text, and the output is terminated by repeating the command. Further text is then expected on the input. Processing of the text entered since the last command, followed by program termination, is triggered by one of the following commands on a separate line:

 NER_END - ends the program and prints found entities with disambiguation
 NER_END_ALL - ends the program and prints found entities without disambiguation
 NER_END_SCORE - ends the program and prints found entities without disambiguation, including scores for each entity

Output

The tool prints the list of found entities to standard output in the order in which they occur in the input text. Each entity occupies one line, with columns separated by tabs. Output lines have the format:

 BEGIN_OFFSET    END_OFFSET      TYPE    TEXT    OTHER

BEGIN_OFFSET and END_OFFSET represent the positions of the beginning and end of an entity in the text.

TYPE indicates the type of the entity: kb for a knowledge base item, date and interval for a date and an interval, coref for a coreference by a pronoun or by a part of a person's name.

TEXT contains the text form of the entity exactly as it occurred in the input text.

For the types kb and coref, OTHER is a list of the corresponding knowledge base line numbers, separated by the character ";". If disambiguation is on, only the one line corresponding to the most likely meaning is given. When the script is used with the -s flag, pairs of line number and entity score are displayed, separated from each other by semicolons. For the types date and interval, OTHER contains the date in standardized ISO format.
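
For illustration, a minimal sketch of parsing one output line follows (parse_ner_line is a hypothetical helper, not part of ner.py; the example line is made up).

 # Hypothetical helper: split one ner.py output line into the five
 # tab-separated columns described above.
 def parse_ner_line(line):
     begin, end, etype, text, other = line.rstrip("\n").split("\t")
     if etype in ("kb", "coref"):
         other = other.split(";")  # KB line numbers (or number and score pairs with -s)
     return int(begin), int(end), etype, text, other
 
 print(parse_ner_line("10\t29\tkb\tWilliam Shakespeare\t12345"))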

How does ner.py work? (xmagdo00)

The finite-state automaton figa finds various types of entities from the KB in text (person, location, artwork, museum, event, art form, art medium, art period movement, art genre, nationality), whose textual form may correspond to several possible meanings. The task is to disambiguate the found entities, i.e. to choose among the possible meanings the one that most likely corresponds to reality.

For every possible meaning of an entity in the text, a numerical score is calculated; the meaning with the highest score is selected as the result of disambiguation. The final score is the sum of a static component and a contextual component.

Static score

The static score represents the significance of the particular knowledge base item. It is calculated from statistical data about the corresponding Wikipedia article: the number of backlinks, the number of article visits, and an indication of whether the article is the primary sense for the keyword. If these are not available, other metrics of the Knowledge Base item are used. For each component, a score in the range 0 to 100 is calculated; the partial values are then evenly averaged into the final score.
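
A minimal sketch of the final averaging step (the component values are made up and assumed to be already normalized to the 0 to 100 range):

 # Evenly average the per-component scores (each 0-100) into the static score.
 def static_score(components):
     return sum(components) / float(len(components))
 
 # Made-up components: backlinks, article visits, primary-sense indication.
 print(static_score([80, 55, 100]))  # -> 78.33...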

Partial matches of names

In order to determine the meaning of entities in a text that correspond only to a part of a person's name (first name or last name), additional possible meanings beyond those found by figa are looked up for each entity.

Before the disambiguation itself, for each person in the knowledge base, the columns DISPLAY TERM and PREFERRED TERM and every value of OTHER TERM are split into individual names (separated by spaces), and the relevant knowledge base line is recorded for each of them. The result of this process is a dictionary of all name parts appearing in the knowledge base, which assigns to each name the set of knowledge base rows where that name is used.

Each entity found in the text is split into words; for each word the relevant set of knowledge base lines is found and the intersection of these sets is calculated. This yields the people whose names include all the words found within the entity in the text. These meanings are added to those found by figa.
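
A minimal sketch of the name-part dictionary and the set intersection follows (the names and row numbers are made-up examples):

 from collections import defaultdict
 
 # Map each name part to the set of KB rows where it appears, then intersect
 # the sets for all words of an entity found in the text.
 kb_names = {1: "Vincent van Gogh", 2: "Theo van Gogh", 3: "Paul Gauguin"}
 
 index = defaultdict(set)
 for row, full_name in kb_names.items():
     for part in full_name.split():
         index[part].add(row)
 
 entity_words = "van Gogh".split()
 rows = set.intersection(*(index[word] for word in entity_words))
 print(rows)  # {1, 2} - names containing both "van" and "Gogh"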

Contextual disambiguation

Contextual disambiguation adds criteria that compare the meaning of an entity with the meaning of the rest of the document.

For locations, after the first round of disambiguation (which is performed regardless of context), the regions in which the locations chosen by disambiguation lie are recorded. For each region, the proportion of the input locations belonging to it is calculated. During the second round of disambiguation, this value is used as a new component of the meaning score.

For people, after the first round of disambiguation, the occurrences of individual persons among the results are counted. In the second round, the proportion of occurrences of a person among all persons found in the first round is likewise used as a score component.
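
A minimal sketch of the contextual component for locations (made-up data): after the first round, the region of each disambiguated location is counted and the counts are turned into the proportions used as the extra score component.

 from collections import Counter
 
 first_round_regions = ["Tuscany", "Tuscany", "Provence", "Tuscany"]
 
 counts = Counter(first_round_regions)
 total = sum(counts.values())
 region_score = {region: n / float(total) for region, n in counts.items()}
 print(region_score)  # {'Tuscany': 0.75, 'Provence': 0.25}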

Determining the meaning of pronouns – coreference resolution

The tool marks not only entities from the knowledge base and dates but also English pronouns, and then tries to determine what they refer to. The last found entity of the matching grammatical gender is taken as the meaning of the pronoun. The pronouns he, him, himself and his correspond to males; she, her, hers and herself correspond to females; who, whom and whose correspond to either sex. In addition, the pronouns here, there and where are handled.
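
A minimal sketch of this rule (resolve is a hypothetical helper; the entity list is made up):

 MALE = {"he", "him", "himself", "his"}
 FEMALE = {"she", "her", "hers", "herself"}
 EITHER = {"who", "whom", "whose"}
 
 def resolve(pronoun, previous_entities):
     """previous_entities: list of (name, gender) in order of appearance."""
     p = pronoun.lower()
     for name, gender in reversed(previous_entities):
         if (p in EITHER or (p in MALE and gender == "M")
                 or (p in FEMALE and gender == "F")):
             return name  # the last entity of the matching grammatical gender
     return None
 
 people = [("Frida Kahlo", "F"), ("Diego Rivera", "M")]
 print(resolve("she", people))  # Frida Kahlo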

Removal of overlapping entities

The finite-state automaton figa and the script dates.py can produce multiple results for one place in the text. A date can also be the name of a KB item; in that case the date is preferred. When figa finds multiple overlapping entities, the longest one is preferred.
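
A minimal sketch of both preferences (made-up matches): among overlapping results, a date wins over a KB item and a longer match wins over a shorter one.

 matches = [  # (begin, end, type)
     (10, 14, "kb"),
     (10, 20, "kb"),
     (10, 20, "date"),
 ]
 
 def overlaps(a, b):
     return a[0] < b[1] and b[0] < a[1]
 
 def better(a, b):  # prefer a date, then the longer span
     key = lambda m: (m[2] == "date", m[1] - m[0])
     return a if key(a) >= key(b) else b
 
 kept = []
 for m in sorted(matches):
     if kept and overlaps(kept[-1], m):
         kept[-1] = better(kept[-1], m)
     else:
         kept.append(m)
 print(kept)  # [(10, 20, 'date')]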

Removing entities adjacent to a word that begins with a capital

If the nearest word before the beginning or after the end of an entity starts with a capital letter, the tagged text is very likely part of a longer name and its meaning need not correspond to the meaning of the marked section; therefore, such entities are not listed. The exceptions are cases where the neighbouring word is separated by punctuation or where the capital letter marks the start of a sentence (the previous word ends with a period, question mark or exclamation mark).
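
A simplified sketch of this filter (keep_entity is a hypothetical helper):

 import re
 
 # Drop an entity when the nearest word before or after it starts with a
 # capital letter, unless punctuation separates them or a sentence boundary
 # precedes the entity.
 def keep_entity(text, begin, end):
     words_before = text[:begin].split()
     prev_word = words_before[-1] if words_before else ""
     after = text[end:]
     next_word = after.split()[0] if after.split() else ""
     punct_follows = bool(re.match(r"\s*[.,;:!?]", after))
     punct_before = prev_word[-1:] in (",", ";", ":")
     sentence_start = (not prev_word) or prev_word[-1:] in (".", "?", "!")
     if next_word[:1].isupper() and not punct_follows:
         return False
     if prev_word[:1].isupper() and not (sentence_start or punct_before):
         return False
     return True
 
 text = "The Metropolitan Museum of Art opened."
 print(keep_entity(text, 17, 23))  # False - "Metropolitan" precedes "Museum"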

Performance characteristics (outdated)

The processing speed of the tool, which searches for entities identifying people, places and works of art, disambiguates their meaning, and locates dates and time intervals in text, was measured on the server athena1; the measurement was repeated three times.

The test data consist of entity names (1 name = 1 line), ranging from 10,000 to 10,000,000 names.

Test data can be found in the repository at:

 secapi/NER/data/performance

Initialization of the tool takes 4.902 s; this is included in the times below.

 Performance characteristics of NER - without disambiguation
 ============================================================
 Number of entities on input   Input size   Processing time
 10,000                        146 kB       5.6 s
 100,000                       1.5 MB       13.221 s
 1,000,000                     15.8 MB      89.282 s
 10,000,000                    152.8 MB     815.583 s
 
 Performance characteristics of NER - with disambiguation
 ============================================================
 Number of entities on input   Input size   Processing time
 10,000                        146 kB       5.822 s
 100,000                       1.5 MB       18.197 s
 1,000,000                     15.8 MB      240.959 s
 10,000,000                    152.8 MB     2,533.679 s

Creation of Knowledge Base (iotrusina)

Creation of sectional KB

The sectional KBs PERSONS, ARTISTS, ARTWORKS, LOCATIONS, MUSEUMS, MYTHOLOGY, ARTIST_FAMILIES and ARTIST_GROUP_OR_COLLECTIVE are created in the directory

 secapi/NER/KnowBase.

The individual KBs are created by the script

 start.sh,

which, for each type separately, launches the start.sh script in the subdirectory of the particular type (artworks, locations, museums, persons, mythology, artist_families and artist_group_or_collective). It is thus possible to launch the regeneration of a single type independently. The actual creation of a sectional KB is done by the script kb_compare.py (described in more detail here). In this step, alternative names from JRC-Names are added for the types ARTIST and PERSON.

More useful scripts can be found in the directory secapi/NER/KnowBase. The script backupKB.sh backs up the input files for the creation of sectional KBs to /mnt/data-in/knot/iotrusina/KB_data_backups. The script copyKB.sh uploads newly created sectional KBs to /mnt/minerva1/nlp/projects/decipher_ner/KnowBase. The script deleteKB.sh deletes the created sectional KBs. The scripts copyKB.sh and deleteKB.sh are launched within the script secapi/NER/start.sh.

Creation of KB.all

The Knowledge Base KB.all is created by merging the sectional KBs with data from Freebase and some foreign-language Wikipedias. This merge is done by the script:

 secapi/NER/prepare_data

When the -i parameter is used, missing images are simultaneously downloaded to the image database /mnt/athena3/kb/images (Wikimedia images only at the moment).

The newly created Knowledge Base is located in the file

 secapi/NER/KB.all.

Creation of KBstatsMetrics.all

The Knowledge Base KBstatsMetrics.all is basically the original KB.all expanded by several columns of statistics. It is also created by the prepare_data script, where the relevant statistics are added to each row by the scripts wiki_stats_to_KB.py and metrics_to_KB.py.

Compared to the described KB.all format, each line is extended by six columns: the first three contain statistics of the relevant Wikipedia article (backlinks, hits, primary sense). The fourth is a disambiguation score calculated from the previous three columns. The fifth column contains a disambiguation score calculated from metrics such as length, the number of filled columns in the KB, or the population of a location. The sixth column contains a confidence score, which combines all the previous values into one.

The Knowledge Base created this way is located in the file

 secapi/NER/KBstatsMetrics.all
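
For illustration, a minimal sketch follows (the path is an example) that separates the original KB.all columns from the six appended statistics columns described above.

 import io
 
 # Split each line of KBstatsMetrics.all into the base KB.all columns and
 # the six appended statistics columns.
 with io.open("KBstatsMetrics.all", encoding="utf-8") as kb:
     for line in kb:
         cols = line.rstrip("\n").split("\t")
         base, stats = cols[:-6], cols[-6:]
         backlinks, hits, primary_sense, wiki_score, metrics_score, confidence = stats
         print(base[0], confidence)  # entity ID and the combined confidence score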

Script dates.py for extracting dates in various formats from plain text (xdolez52)

The task was to create a regular expression able to extract dates in every common format occurring in text (2004-04-30, 02/30/1999, 31 January 2003, etc.). The regular expression is supposed to be complex (a single regular expression only) and should extract as many common formats as possible (we are not interested in specific times, only in years, months and days).

After extraction, the dates are normalized (to ISO 8601) so that they can be processed further.

Besides the regular expression, it was necessary to write code (a function and a class) to process the matches and to pass the processed dates to other scripts.

The input is plain English text (type str) and the output is a list (type list) of instances of the class Date.

For the script to be perfect, it would have to include a semantic analysis of English to recognize what is and what is not a date or a year.

The script is located in git at:

 secapi/NER/dates.py

Classes

Class Date

Class for found dates. Besides the year, month and day, it includes the position of the match in the source string and the string from which the date was parsed. After creating a new instance, it is necessary to call init_date() or init_interval() to initialize the attributes.

It has two types: DATE for a plain date and INTERVAL for an interval between two dates. Here is an example with comments:

 class_type: DATE          # Attribute indicating the type of the find (a plain date or an interval).
 source:     April 2, 1918 # The source string from the source text.
 iso8601:    1918-04-02    # Date as **ISO_date**
 s_offset:   265           # Start of the source string in the source text
 end_offset: 278           # End of the source string in the source text (calculated as s_offset + len(source))
 ---------------------------------
 class_type: INTERVAL
 source:     1882-83
 date_from:  1882-00-00    # Starting date as **ISO_date**
 date_to:    1883-00-00    # Ending date as **ISO_date**
 s_offset:   467
 end_offset: 474

Class ISO_date

Class holding a date (year, month and day). It includes the attributes day, month and year. It was created to replace datetime.date in situations where only a year is known; unknown parts are then replaced by zero (e.g. 1881-00-00).
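
A simplified illustration of the idea follows (ISODateSketch is not the project's actual class): unknown parts are stored as 0, which datetime.date does not allow.

 # Simplified illustration: a date class that tolerates unknown parts.
 class ISODateSketch(object):
     def __init__(self, year, month=0, day=0):
         self.year, self.month, self.day = year, month, day
 
     def __str__(self):
         return "%04d-%02d-%02d" % (self.year, self.month, self.day)
 
 print(ISODateSketch(1881))  # 1881-00-00 - only the year is known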

Properties

  • Extraction of dates from the source text
  • Extraction of date intervals from the source text
  • dateutil.parser is used for normalization (to ISO 8601)
  • If a bare number is found, it is taken as a year only if it has 4 digits
  • If an incomplete date is found in the text, only the month and year, or only the year, are taken

Supported formats

Date

  • Sept. 12, 2007
  • Jul 18 '10
  • 1999-12-28
  • 12-11-1694, 12/11/1694
  • 12.11.1694, 12. 11. 1694
  • 12th Nov. 1694, 8th of November 2003, 27 May 1859

Month and year only:

  • November 2003

Year only:

  • 1694, 1690s

Interval

  • June 6-Sept. 23, 2007
  • Aug. 4-31, 2007
  • June. 6, 2005 – Sept. 12, 2007
  • 20 March 1856 – 10 January 1941
  • 1694-99
  • 1693-1734, 1693 to 1734
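
A toy illustration follows; the actual single regular expression in dates.py is far more complex. The toy pattern matches only two of the formats listed above.

 import re
 
 # Toy fragment for illustration only: match "1999-12-28" and "20 March 1856".
 MONTHS = ("January|February|March|April|May|June|July|"
           "August|September|October|November|December")
 toy = re.compile(r"\b(\d{4}-\d{2}-\d{2}|\d{1,2} (?:%s) \d{4})\b" % MONTHS)
 
 for match in toy.finditer("Born 20 March 1856; the file is dated 1999-12-28."):
     print(match.start(), match.group(1))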

Problems and ambiguities

When using dateutil.parser

 str(dateutil.parser.parse("Jul 18 '30").date()) -> '2030-07-18' # automatically adds the current first two digits - considered as correct
 str(dateutil.parser.parse("Jul 18 30").date()) -> '2030-07-18' # what if the year is 30?
 str(dateutil.parser.parse("0030-01-01").date()) -> '2001-01-30' # looks like that dateutil.parser does not take year less than hundred
 str(dateutil.parser.parse("0099-01-01").date()) -> '1999-01-01' # will not always add the current first two digits, but the closest ones
 str(dateutil.parser.parse("Jul 18 '62").date()) -> '2062-07-18' # because it is year 2013
 str(dateutil.parser.parse("Jul 18 '63").date()) -> '1963-07-18' # because it is year 2013
 str(dateutil.parser.parse("0100-01-01").date()) -> '0100-01-01' # correct
 
 If I get DD/MM/YYYY, then dateutil.parser takes this date as MM/DD/YYYY if DD < 13 otherwise as DD/MM/YYYY
 str(dateutil.parser.parse("10/1/2000").date()) -> '2000-10-01'
 str(dateutil.parser.parse("13/1/2000").date()) -> '2000-01-13'

Statistical information

While scanning 45,764,556 words, 561,744 entries were found (of which 177,336 were intervals) in 11 m 17.836 s, i.e. a speed of 67,515.6764 words per second.