Decipher NER

Table of Contents

1 Decipher NER

The Decipher NER tool was developed within the Decipher project and is used to find and disambiguate entities (e.g. persons, artists, locations, ...) in a text. It uses a Knowledge Base, which was created by extracting and combining relevant information about entities from many sources, such as Wikipedia, Freebase, Geonames and Getty ULAN. The actual search for entities is performed by the tool figa, which is based on finite state machines.

The NER tool is part of a larger Decipher project called secapi. Most of the tool is stored in the git repository minerva1.fit.vutbr.cz/mnt/minerva1/nlp/repositories/decipher/secapi/, and this page gives paths relative to that repository. However, some parts (e.g. the extraction of information about entities from the various sources) are not in the repository yet; for these special cases, absolute paths to files located on the school servers are given.

To get an idea of what NER actually does, feel free to visit the demo app on the server knot24. The tool autocomplete, which can efficiently find and offer entities from the Knowledge Base, is available on the same server.


2 Knowledge Base

The Knowledge Base (KB) of the DECIPHER project is stored in TSV (tab-separated values) format. The file contains information about entities, one entity per row. The number of columns per row depends on the type of the particular entity. The type of an entity is determined either by the prefix of the ID column (first column, outdated) or by the TYPE column (second column), optionally refined by the SUBTYPE column (when present, it is the third column). The prefixes are expected to be abolished in the future, so it is preferable to use the TYPE column to identify the type. Subtypes extend their parent type with additional columns, which is why those columns are numbered with a plus sign. Columns that can contain multiple values are tagged MULTIPLE VALUES; the individual values are separated by the | (vertical bar) character. A complete overview of the available types, including a description of the relevant columns, can be found below. Columns whose names are written in italics are blank in KB.all and filled in KBstatsMetrics.all.

The current version of the KB can be downloaded from the server athena3 (or alternatively KBstatsMetrics.all); it is also available directly on athena3 in /mnt/data/kb/KB.all.
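As a minimal illustration of the recommended type detection (TYPE column first, ID prefix only as a fallback), the following Python sketch maps the one-letter prefixes from the listings below to type names; the sample rows are made up:

```python
# Illustrative sketch: determine the entity type of one KB.all row.
# The prefix table mirrors the type listings in this chapter; the
# sample rows used here are not real KB data.

PREFIX_TO_TYPE = {
    "p": "person", "a": "artist", "l": "location", "w": "artwork",
    "c": "museum", "e": "event", "f": "visual art form",
    "d": "visual art medium", "g": "visual art genre",
    "m": "art period movement", "n": "nationality", "y": "mythology",
    "i": "family", "r": "group", "o": "other",
}

def entity_type(row: str) -> str:
    """Return the type of one TSV row, preferring the TYPE column
    (second column) over the outdated ID prefix (first column)."""
    columns = row.rstrip("\n").split("\t")
    if len(columns) > 1 and columns[1]:
        return columns[1]
    # Fall back to the one-letter prefix of the ID column, e.g. "l:123".
    return PREFIX_TO_TYPE.get(columns[0].split(":", 1)[0], "unknown")
```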

 PERSON (prefix: p)
 ==========================================
 01 ID
 02 TYPE
 03 SUBTYPE (MULTIPLE VALUES)
 04 NAME
 05 ALIAS (MULTIPLE VALUES)
 06 ROLE (MULTIPLE VALUES)
 07 NATIONALITY (MULTIPLE VALUES)
 08 DESCRIPTION (MULTIPLE VALUES)
 09 DATE OF BIRTH
 10 PLACE OF BIRTH
 11 DATE OF DEATH
 12 PLACE OF DEATH
 13 GENDER
 14 PERIOD OR MOVEMENT (MULTIPLE VALUES)
 15 PLACE LIVED (MULTIPLE VALUES)
 16 WIKIPEDIA URL
 17 FREEBASE URL
 18 DBPEDIA URL
 19 IMAGE (MULTIPLE VALUES)
 20 WIKI BACKLINKS
 21 WIKI HITS
 22 WIKI PRIMARY SENSE
 23 SCORE WIKI
 24 SCORE METRICS
 25 CONFIDENCE

 ARTIST (PERSON subtype) (prefix: a)
 ==========================================
 +01 ART FORM (MULTIPLE VALUES)
 +02 INFLUENCED (MULTIPLE VALUES)
 +03 INFLUENCED BY (MULTIPLE VALUES)
 +04 ULAN ID
 +05 OTHER URL (MULTIPLE VALUES)

 LOCATION (prefix: l)
 ==========================================
 01 ID
 02 TYPE
 03 SUBTYPE (MULTIPLE VALUES)
 04 NAME
 05 ALTERNATIVE NAME (MULTIPLE VALUES)
 06 LATITUDE
 07 LONGITUDE
 08 FEATURE CODE
 09 COUNTRY
 10 POPULATION
 11 ELEVATION
 12 WIKIPEDIA URL
 13 DBPEDIA URL
 14 FREEBASE URL
 15 GEONAMES ID (MULTIPLE VALUES)
 16 SETTLEMENT TYPE (MULTIPLE VALUES)
 17 TIMEZONE (MULTIPLE VALUES)
 18 DESCRIPTION
 19 IMAGE (MULTIPLE VALUES)
 20 WIKI BACKLINKS
 21 WIKI HITS
 22 WIKI PRIMARY SENSE
 23 SCORE WIKI
 24 SCORE METRICS
 25 CONFIDENCE

 ARTWORK (prefix: w)
 ==========================================
 01 ID
 02 TYPE
 03 SUBTYPE (MULTIPLE VALUES)
 04 NAME
 05 ALIAS (MULTIPLE VALUES)
 06 DESCRIPTION
 07 IMAGE (MULTIPLE VALUES)
 08 ARTIST (MULTIPLE VALUES)
 09 ART SUBJECT (MULTIPLE VALUES)
 10 ART FORM
 11 ART GENRE (MULTIPLE VALUES)
 12 MEDIA (MULTIPLE VALUES)
 13 SUPPORT (MULTIPLE VALUES)
 14 LOCATION (MULTIPLE VALUES)
 15 DATE BEGUN
 16 DATE COMPLETED
 17 OWNER (MULTIPLE VALUES)
 18 HEIGHT
 19 WIDTH
 20 DEPTH
 21 WIKIPEDIA URL
 22 FREEBASE URL
 23 DBPEDIA URL
 24 PAINTING ALIGNMENT (MULTIPLE VALUES)
 25 MOVEMENT
 26 WIKI BACKLINKS
 27 WIKI HITS
 28 WIKI PRIMARY SENSE
 29 SCORE WIKI
 30 SCORE METRICS
 31 CONFIDENCE

 MUSEUM (prefix: c)
 ==========================================
 01 ID
 02 TYPE
 03 SUBTYPE (MULTIPLE VALUES)
 04 NAME
 05 ALIAS (MULTIPLE VALUES)
 06 DESCRIPTION
 07 IMAGE (MULTIPLE VALUES)
 08 MUSEUM TYPE (MULTIPLE VALUES)
 09 ESTABLISHED
 10 DIRECTOR
 11 VISITORS (MULTIPLE VALUES)
 12 CITYTOWN
 13 POSTAL CODE
 14 STATE PROVINCE REGION
 15 STREET ADDRESS
 16 LATITUDE
 17 LONGITUDE
 18 WIKIPEDIA URL
 19 FREEBASE URL
 20 ULAN ID
 21 GEONAMES ID (MULTIPLE VALUES)
 22 WIKI BACKLINKS
 23 WIKI HITS
 24 WIKI PRIMARY SENSE
 25 SCORE WIKI
 26 SCORE METRICS
 27 CONFIDENCE

 EVENT (prefix: e)
 ==========================================
 01 ID
 02 TYPE
 03 NAME
 04 ALIAS (MULTIPLE VALUES)
 05 DESCRIPTION
 06 IMAGE (MULTIPLE VALUES)
 07 START DATE
 08 END DATE
 09 LOCATION (MULTIPLE VALUES)
 10 NOTABLE TYPE
 11 WIKIPEDIA URL
 12 FREEBASE URL
 13 WIKI BACKLINKS
 14 WIKI HITS
 15 WIKI PRIMARY SENSE
 16 SCORE WIKI
 17 SCORE METRICS
 18 CONFIDENCE

 VISUAL ART FORM (prefix: f)
 ==========================================
 01 ID
 02 TYPE
 03 NAME
 04 ALIAS (MULTIPLE VALUES)
 05 DESCRIPTION
 06 IMAGE (MULTIPLE VALUES)
 07 WIKIPEDIA URL
 08 FREEBASE URL
 09 WIKI BACKLINKS
 10 WIKI HITS
 11 WIKI PRIMARY SENSE
 12 SCORE WIKI
 13 SCORE METRICS
 14 CONFIDENCE

 VISUAL ART MEDIUM (prefix: d)
 ==========================================
 01 ID
 02 TYPE
 03 NAME
 04 ALIAS (MULTIPLE VALUES)
 05 DESCRIPTION
 06 IMAGE (MULTIPLE VALUES)
 07 WIKIPEDIA URL
 08 FREEBASE URL
 09 WIKI BACKLINKS
 10 WIKI HITS
 11 WIKI PRIMARY SENSE
 12 SCORE WIKI
 13 SCORE METRICS
 14 CONFIDENCE

 VISUAL ART GENRE (prefix: g)
 ==========================================
 01 ID
 02 TYPE
 03 NAME
 04 ALIAS (MULTIPLE VALUES)
 05 DESCRIPTION
 06 IMAGE (MULTIPLE VALUES)
 07 WIKIPEDIA URL
 08 FREEBASE URL
 09 WIKI BACKLINKS
 10 WIKI HITS
 11 WIKI PRIMARY SENSE
 12 SCORE WIKI
 13 SCORE METRICS
 14 CONFIDENCE

 ART PERIOD MOVEMENT (prefix: m)
 ==========================================
 01 ID
 02 TYPE
 03 NAME
 04 ALIAS (MULTIPLE VALUES)
 05 DESCRIPTION
 06 IMAGE (MULTIPLE VALUES)
 07 WIKIPEDIA URL
 08 FREEBASE URL
 09 WIKI BACKLINKS
 10 WIKI HITS
 11 WIKI PRIMARY SENSE
 12 SCORE WIKI
 13 SCORE METRICS
 14 CONFIDENCE

 NATIONALITY (prefix: n)
 ==========================================
 01 ID
 02 TYPE
 03 NAME
 04 ALIAS (MULTIPLE VALUES)
 05 DESCRIPTION
 06 IMAGE (MULTIPLE VALUES)
 07 SHORT NAME
 08 COUNTRY NAME
 09 ADJECTIVAL FORM (MULTIPLE VALUES)
 10 WIKIPEDIA URL
 11 FREEBASE URL
 12 WIKI BACKLINKS
 13 WIKI HITS
 14 WIKI PRIMARY SENSE
 15 SCORE WIKI
 16 SCORE METRICS
 17 CONFIDENCE

 MYTHOLOGY (prefix: y)
 ==========================================
 01 ID
 02 TYPE
 03 NAME
 04 ALTERNATIVE NAME (MULTIPLE VALUES)
 05 WIKIPEDIA URL
 06 IMAGE (MULTIPLE VALUES)
 07 DESCRIPTION
 08 WIKI BACKLINKS
 09 WIKI HITS
 10 WIKI PRIMARY SENSE
 11 SCORE WIKI
 12 SCORE METRICS
 13 CONFIDENCE

 FAMILY (prefix: i)
 ==========================================
 01 ID
 02 TYPE
 03 NAME
 04 ALTERNATIVE NAME (MULTIPLE VALUES)
 05 WIKIPEDIA URL
 06 IMAGE (MULTIPLE VALUES)
 07 ROLE (MULTIPLE VALUES)
 08 NATIONALITY
 09 DESCRIPTION
 10 MEMBERS (MULTIPLE VALUES)
 11 WIKI BACKLINKS
 12 WIKI HITS
 13 WIKI PRIMARY SENSE
 14 SCORE WIKI
 15 SCORE METRICS
 16 CONFIDENCE

 GROUP (prefix: r)
 ==========================================
 01 ID
 02 TYPE
 03 NAME
 04 ALTERNATIVE NAME (MULTIPLE VALUES)
 05 WIKIPEDIA URL
 06 IMAGE (MULTIPLE VALUES)
 07 ROLE (MULTIPLE VALUES)
 08 NATIONALITY
 09 DESCRIPTION
 10 FORMATION
 11 HEADQUARTERS
 12 WIKI BACKLINKS
 13 WIKI HITS
 14 WIKI PRIMARY SENSE
 15 SCORE WIKI
 16 SCORE METRICS
 17 CONFIDENCE

 OTHER (prefix: o) 
 ==============================================================
 01 ID
 02 TYPE
 03 TITLE
 04 ALIAS (MULTIPLE VALUES)
 05 DESCRIPTION
 06 IMAGE (MULTIPLE VALUES)
 07 WIKIPEDIA URL
 08 WIKI BACKLINKS
 09 WIKI HITS
 10 WIKI PRIMARY SENSE
 11 SCORE WIKI
 12 SCORE METRICS
 13 CONFIDENCE
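The column layouts above can be used to map a row onto named columns. A minimal Python sketch for the PERSON type follows (the sample data is made up; the point is the MULTIPLE VALUES handling with the | separator):

```python
# Illustrative sketch: load one PERSON row into a dict using the
# column layout documented above.

PERSON_COLUMNS = [
    "ID", "TYPE", "SUBTYPE", "NAME", "ALIAS", "ROLE", "NATIONALITY",
    "DESCRIPTION", "DATE OF BIRTH", "PLACE OF BIRTH", "DATE OF DEATH",
    "PLACE OF DEATH", "GENDER", "PERIOD OR MOVEMENT", "PLACE LIVED",
    "WIKIPEDIA URL", "FREEBASE URL", "DBPEDIA URL", "IMAGE",
    "WIKI BACKLINKS", "WIKI HITS", "WIKI PRIMARY SENSE",
    "SCORE WIKI", "SCORE METRICS", "CONFIDENCE",
]
MULTIPLE = {"SUBTYPE", "ALIAS", "ROLE", "NATIONALITY", "DESCRIPTION",
            "PERIOD OR MOVEMENT", "PLACE LIVED", "IMAGE"}

def parse_person(row):
    values = row.rstrip("\n").split("\t")
    record = {}
    for name, value in zip(PERSON_COLUMNS, values):
        # MULTIPLE VALUES columns use "|" as the value separator.
        record[name] = value.split("|") if name in MULTIPLE else value
    return record
```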
        

2.1 Location of the files necessary to generate KB

The sub-files necessary to generate the KB are listed in this section, separately for each type. Each listed type has a git directory specified, containing the scripts and configuration files necessary to create the KB of that type. If a type is not listed, no scripts are needed to generate that part of the KB, so there is no git directory for it; instead, the merge happens within the prepare_data script (see below).

 PERSON (secapi/NER/KnowBase/persons)
 ==========================================
 /mnt/minerva1/nlp/projects/decipher_freebase/data/freebase.persons 
 /mnt/minerva1/nlp/projects/decipher_wikipedia/Ludia_Wikipedia/outputs/categories_death_birth_data.csv

 ARTIST (secapi/NER/KnowBase/artists)
 ==========================================
 /mnt/minerva1/nlp/datasets/art/artbiogs/artbiogs.tsv
 /mnt/minerva1/nlp/datasets/art/bbc/bbc.tsv
 /mnt/minerva1/nlp/datasets/art/council/council.tsv
 /mnt/minerva1/nlp/datasets/art/distinguishedwomen/distinguishedwomen.tsv
 /mnt/minerva1/nlp/datasets/art/artists2artists/artists2artists.tsv
 /mnt/minerva1/nlp/datasets/art/the-artists/the-artists.tsv
 /mnt/minerva1/nlp/datasets/art/nationalgallery/nationalgallery.tsv
 /mnt/minerva1/nlp/datasets/art/wikipaint/final_data/wikipaint_artist.tsv
 /mnt/minerva1/nlp/datasets/art/davisart/davisart.tsv
 /mnt/minerva1/nlp/datasets/art/apr/apr.tsv
 /mnt/minerva1/nlp/datasets/art/rkd/rkd.tsv
 /mnt/minerva1/nlp/datasets/art/open/open.tsv
 /mnt/minerva1/nlp/projects/decipher_freebase/data/freebase.artists
 /mnt/minerva1/nlp/projects/decipher_ner/ULAN/ulan_rel_13/KB.artists
 /mnt/minerva1/nlp/projects/decipher_wikipedia/wiki_template/artists_extended
 /mnt/minerva1/nlp/datasets/art/artrepublic/artrepublic.tsv
 /mnt/minerva1/nlp/datasets/art/biography/biography.tsv
 /mnt/minerva1/nlp/datasets/art/englandgallery/englandgallery.tsv
 /mnt/minerva1/nlp/datasets/art/infoplease/infoplease.tsv
 /mnt/minerva1/nlp/datasets/art/nmwa/nmwa.tsv

 LOCATION (secapi/NER/KnowBase/locations)
 ==========================================
 /mnt/minerva1/nlp/projects/decipher_dbpedia/extraction_results/v39/v39_location_finall.tsv
 /mnt/minerva1/nlp/projects/decipher_freebase/data/freebase.locations
 /mnt/minerva1/nlp/projects/decipher_geonames/geonames.locations

 ARTWORK (secapi/NER/KnowBase/artworks)
 ==========================================
 /mnt/minerva1/nlp/projects/decipher_dbpedia/extraction_results/v39/v39_artwork_finall.tsv
 /mnt/minerva1/nlp/projects/decipher_freebase/data/freebase.artworks

 MUSEUM (secapi/NER/KnowBase/museums)
 ==========================================
 /mnt/minerva1/nlp/projects/decipher_dbpedia/extraction_results/v39/v39_museum_finall.tsv
 /mnt/minerva1/nlp/projects/decipher_freebase/data/freebase.museums
 /mnt/minerva1/nlp/projects/decipher_ner/ULAN/ulan_rel_13/KB.corporations
 /mnt/minerva1/nlp/projects/decipher_geonames/geonames.museums

 MYTHOLOGY (secapi/NER/KnowBase/mythology)
 ==========================================
 /mnt/minerva1/nlp/projects/extrakce_z_wikipedie/xgraca00/MythologyKB.txt 

 FAMILY (secapi/NER/KnowBase/artist_families)
 ==========================================
 /mnt/minerva1/nlp/projects/extrakce_z_wikipedie/xdosta40/final/finalFamilies.xml

 GROUP (secapi/NER/KnowBase/artist_group_or_collective)
 ==========================================
 /mnt/minerva1/nlp/projects/extrakce_z_wikipedie/xdosta40/final/finalGroupsAndCollectives.xml

 OTHER (secapi/NER/KnowBase/other)
 ==========================================
 /mnt/minerva1/nlp/projects/wikify/wikipedia/data/new_KB/KB.all

 Other sub-KBs integrated into the final KB (see the script secapi/NER/prepare_data)
 ==========================================
 /mnt/minerva1/nlp/projects/decipher_ner/KnowBase/PERSONS
 /mnt/minerva1/nlp/projects/decipher_ner/KnowBase/ARTISTS
 /mnt/minerva1/nlp/projects/decipher_ner/KnowBase/LOCATIONS
 /mnt/minerva1/nlp/projects/decipher_ner/KnowBase/ARTWORKS
 /mnt/minerva1/nlp/projects/decipher_ner/KnowBase/MUSEUMS
 /mnt/minerva1/nlp/projects/decipher_ner/KnowBase/MYTHOLOGY
 /mnt/minerva1/nlp/projects/decipher_ner/KnowBase/ARTIST_FAMILIES
 /mnt/minerva1/nlp/projects/decipher_ner/KnowBase/ARTIST_GROUP_OR_COLLECTIVE
 /mnt/minerva1/nlp/projects/decipher_freebase/data/freebase.events
 /mnt/minerva1/nlp/projects/decipher_freebase/data/freebase.visual_art_forms
 /mnt/minerva1/nlp/projects/decipher_freebase/data/freebase.visual_art_mediums
 /mnt/minerva1/nlp/projects/decipher_freebase/data/freebase.art_period_movements
 /mnt/minerva1/nlp/projects/decipher_freebase/data/freebase.visual_art_genres
 /mnt/minerva1/nlp/projects/decipher_freebase/data/freebase.nationalities
 /mnt/minerva1/nlp/projects/ie_foreign_wikipedia/xbajan01/wiki/output/es_only_data.tsv
 /mnt/minerva1/nlp/projects/ie_foreign_wikipedia/xklima22/wiki/outputs/de-wiki_only_people_data.tsv

 File containing statistics
 ==========================================
 /mnt/minerva1/nlp/projects/decipher_wikipedia/wikipedia_statistics/wikipedia_statistics_2014-12-09.tsv
        

3 Specification of HEAD-KB header

The file HEAD-KB in /mnt/minerva1/nlp/repositories/decipher/secapi/ (consider this the working path from now on) is used as a configuration file for loading the KB. It contains a complete specification of all types and columns in the KB (see chapter Knowledge Base). Each column has its type and, where applicable, prefixes. The individual columns of the header are separated by tabs so that they line up with the data columns.

It is used in script NER/metrics_knowledge_base.py when generating KB and file KB-HEAD.all.

3.1 Columns in HEAD-KB

KB header syntax:

 <KB header> ::= <row> "\n"
               | <row> <KB header>
 <row> ::= <first_column> "\n"
         | <first_column> "\t" <other columns> "\n"
 <first_column> ::= "<" <type_name> ">" <other_columns>
                  | "<" <type_name> ":" <subtype_name> ">" <other_columns>
 <other_columns> ::= <column>
                   | <column> "\t" <other_columns>
 <column> ::= <column_name>
            | "{" <flags> "}" <column_name>
            | "{[" <prefix_of_value> "]}" <column_name>
            | "{" <flags> "[" <prefix_of_value> "]}" <column_name>
        

where:

 example of the first column in header: <location>ID
 example of first column in data: l:21931315183

 example of other columns in header: {iu[http://en.wikipedia.org/]}WIKIPEDIA URL
 example of other columns in data: wiki/city_of_london
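The value prefix lets the KB store shortened values, as in the WIKIPEDIA URL example above. How the full value is reconstructed can be sketched as follows; the function name and the exact semantics are assumptions, not the tool's API:

```python
# Hypothetical sketch: apply a column's [prefix_of_value] from HEAD-KB
# to a stored data value, e.g. "wiki/city_of_london" plus the prefix
# "http://en.wikipedia.org/" yields the full URL.

def expand_value(prefix_of_value, value):
    """Prepend the column's prefix to a stored value, unless the value
    is empty or already a full URL."""
    if not value or value.startswith(("http://", "https://")):
        return value
    return (prefix_of_value or "") + value
```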
        

Regular expression to get individual items in Python:

 PARSER = re.compile(r"""(?ux)
	 ^
	 (?:<(?P<TYPE>[^:>]+)(?:[:](?P<SUBTYPE>[^>]+))?>)?
	 (?:\{(?P<FLAGS>(?:\w|[ ])*)(?:\[(?P<PREFIX_OF_VALUE>[^\]]+)\])?\})?
	 (?P<NAME>(?:\w|[ ])+)
	 $
 """)
 column = u"<type name:subtype name>{flags[value prefix]}column name"
 PARSER.search(column).groupdict()
 {
 	 'TYPE': u'type name',
 	 'SUBTYPE': u'subtype name',
 	 'FLAGS': u'flags',
 	 'PREFIX_OF_VALUE': u'value prefix',
 	 'NAME': u'column name'
 }
 PARSER.search(column).group("TYPE")
 u'type name'
        

4 Program and libraries for KB in shared memory

The task was to create a program in C that loads the KB as a string into shared memory. The program and libraries are stored in the git repository /mnt/minerva1/nlp/repositories/decipher/secapi/ (consider this the working path from now on). For this purpose the KB must contain a header that gives meaning to the individual columns of the KB. The KB header is separated from the data by one empty line.

Three variants have been created:

The first two variants were created for the original task; it was necessary to determine which one would be faster, and the second variant proved faster.

Each variant has its own dynamic library libKB_shm.so and a Python module KB_shm.py that builds on the library. Nowadays the Python extension is only maintained for the second variant, which is the one known to be in use.

4.1 Usage

The daemon loads the KB into shared memory, and optionally also the namelist. A copy of this memory is saved to disk next to the KB, with the same name as the KB plus the suffix .bin. This copy is used to speed up the next load and should not be transferred between different architectures. Changes to the KB and the namelist are monitored: if they are newer than the SHM copies, they are reloaded and the copies are re-created.
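The freshness check described above can be sketched as a simple modification-time comparison; the .bin naming comes from the text, the rest is an illustrative sketch rather than the daemon's actual C code:

```python
import os

def bin_copy_is_stale(kb_path: str) -> bool:
    """Return True if the on-disk SHM copy (<KB>.bin) is missing or
    older than the KB file itself, i.e. it must be re-created."""
    bin_path = kb_path + ".bin"
    if not os.path.exists(bin_path):
        return True
    return os.path.getmtime(bin_path) < os.path.getmtime(kb_path)
```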

Once the daemon has loaded the data into SHM, it prints "./decipherKB-daemon: Waiting for signal..." on stdout. At this stage it waits for SIGTERM, SIGINT or SIGQUIT; when one of those three signals is received, it deletes the loaded KB from SHM and terminates.

The two previously mentioned files, libKB_shm.so and KB_shm.py, are used to work with the data in SHM. Their descriptions can be found in libKB_shm.h and KB_shm.py, so they are omitted here.

4.1.1 Daemon launch

1. variant:

 decipherKB-daemon [{path_to_KB | -b path_to_KB_bin}]
        

2. variant:

 decipherKB-daemon [-s SHM_NAME] [{path_to_KB | -b path_to_KB_bin}]
        

3. variant:

 decipherKB-daemon [{path_to_KB path_to_namelist | -b path_to_KB_bin}]
        

5 Tool ner.py

The tool for detection and disambiguation of named entities is implemented in the script ner.py. This chapter covers only how to launch it; further information about how it works can be found in the chapter "How does ner.py work?".

The script ner.py uses the KB, which is loaded into shared memory by SharedKB.

5.1 Prerequisites

The current version is available on git in branch D114-NER (change username to your login).

 git clone ssh://username@minerva1.fit.vutbr.cz/mnt/minerva1/nlp/repositories/decipher/secapi/
 git checkout -b D114-NER origin/D114-NER
        

It is necessary to perform the following sequence of commands before attempting a launch:

 ./downloadKB.sh
 make
        

Make sure that no older KB and automata (*.fsa) are present in the directories secapi/NER and secapi/NER/figa when downloadKB.sh is launched. It is advisable to delete them beforehand using the script deleteKB.sh.

5.2 Script ner.py

The tool works with the knowledge base extended by columns containing statistical data from Wikipedia and a precomputed score for disambiguation. Searching for entities in a text and their disambiguation is provided by the script:

 secapi/NER/ner.py
        

Usage:

 ner.py [-h] [-a | -s] [-d] [-f FILE]
        

It is also possible to read input from standard input (use redirection).

Test texts for ner.py can be found in directory:

 secapi/NER/data/input
        

5.3 Daemon mode

Daemon mode, activated by the parameter -d, allows processing of multiple texts with one instance. Text is expected on standard input and is terminated by one of the following commands on a separate line:

 NER_NEW_FILE - prints found entities with disambiguation
 NER_NEW_FILE_ALL - prints found entities without disambiguation
 NER_NEW_FILE_SCORE - prints found entities without disambiguation, including scores for each entity
        

After entering a command, the tool prints the list of entities found in the input text, terminates the output with the same command, and then expects another text on input. Processing of the input text received since the last command, followed by termination of the program, is triggered by one of the following commands on a separate line:

 NER_END - ends the program and prints found entities with disambiguation
 NER_END_ALL - ends the program and prints found entities without disambiguation
 NER_END_SCORE - ends the program and prints found entities without disambiguation, including scores for each entity
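The framing that daemon mode expects on standard input can be sketched as follows; build_daemon_input is a hypothetical helper, not part of the tool:

```python
# Hypothetical helper: build the stdin payload for ner.py daemon mode.
# Each text is terminated by NER_NEW_FILE on its own line, and the
# batch is closed by NER_END, as the protocol above requires.

def build_daemon_input(texts, command="NER_NEW_FILE", end="NER_END"):
    lines = []
    for text in texts:
        lines.append(text.rstrip("\n"))
        lines.append(command)   # terminate this text, request its entities
    if lines:
        lines[-1] = end         # the last text is closed by NER_END instead
    return "\n".join(lines) + "\n"
```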
        

5.4 Output

The tool prints the list of found entities on standard output, in the order in which they occur in the input text. Each entity occupies one line, with the columns separated by tabs.

Output lines are in the following format:

 BEGIN_OFFSET    END_OFFSET      TYPE    TEXT    OTHER
        

BEGIN_OFFSET and END_OFFSET represent the position of beginning and end of an entity in text.

TYPE specifies the type of the entity: kb for a knowledge base item, date and interval for a date and an interval, coref for a coreference by a pronoun or by a part of a person's name.

TEXT contains the textual form of the particular entity, exactly as it appears in the input text.

OTHER, for the types kb and coref, is a list of the corresponding knowledge base row numbers separated by the character ";". If disambiguation is on, only the single most likely row is selected. When the script is used with the -s flag, pairs of row number and entity score are printed, separated from each other by semicolons. For the types date and interval, OTHER contains the data in standardized ISO format.
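An output line can be parsed as follows; this is a sketch based on the format described above, and parse_ner_line is not part of the tool:

```python
def parse_ner_line(line):
    """Parse one tab-separated output line of ner.py into a dict.
    For kb/coref entities OTHER holds KB row numbers separated by ";";
    for date/interval it holds the ISO-formatted data."""
    begin, end, etype, text, other = line.rstrip("\n").split("\t")
    entry = {"begin": int(begin), "end": int(end),
             "type": etype, "text": text}
    if etype in ("kb", "coref"):
        entry["rows"] = [int(r) for r in other.split(";") if r]
    else:
        entry["other"] = other  # e.g. an ISO date for date/interval
    return entry
```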


6 How does ner.py work?

The finite state machine figa searches the text for various types of entities (person, location, artwork, museum, event, art form, art medium, art period movement, art genre, nationality) from the KB, while a textual form may correspond to multiple meanings. The task is to disambiguate the found entities, i.e. to choose, among the possible meanings, the one that most likely corresponds to reality.

For every possible meaning of an entity in the text a numerical score is calculated, and the meaning with the highest score is selected as the result of the disambiguation. The final score is the sum of a static component and a contextual component.

6.1 Static score

The static score represents the significance of a particular knowledge base item. It is calculated from statistical data about the corresponding Wikipedia article: the number of backlinks, the number of article visits (hits), and an indication whether the article is the primary sense for the keyword. If these are not available, other metrics of the Knowledge Base item are used. For the score based on Wikipedia statistics, a partial score in the range 0 to 100 is computed for each component; the partial values are evenly averaged into the final score.
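The even averaging of the partial scores can be sketched as follows, assuming the per-component scores have already been scaled to the 0-100 range (the scaling itself is not shown):

```python
def static_score(backlinks, hits, primary_sense):
    """Evenly average the per-component partial scores (each 0-100)
    into the final static score, as described above. Sketch only."""
    components = [backlinks, hits, primary_sense]
    return sum(components) / len(components)
```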

6.2 Partial matches of names

In order to determine the meaning of entities in the text that correspond only to a part of a person's name (a first name or a last name), the possible meanings of each such entity are looked up in addition to those found by figa.

Before the disambiguation itself, the columns DISPLAY TERM and PREFERRED TERM and every value of OTHER TERMS of each person in the knowledge base are divided into individual names (originally separated by spaces), and the relevant knowledge base row is recorded for each of these names. The result of this process is a dictionary of all name parts occurring in the knowledge base, assigning to each name the set of knowledge base rows in which that particular name is used.

Each entity found in the text is divided into words, the relevant set of knowledge base rows is looked up for each word, and the intersection of these sets is calculated. This yields the people whose names include all the words found within the entity in the text. These meanings are added to those found by figa.
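The dictionary building and intersection just described can be sketched as follows; the sample names are made up:

```python
# Sketch of the partial-name matching described above: every name in
# the KB is split into individual words, each word maps to the set of
# KB rows where it occurs, and an entity found in the text matches the
# intersection of the sets of its words.

def build_name_index(kb_names):
    """kb_names: {row_number: full name}."""
    index = {}
    for row, full_name in kb_names.items():
        for part in full_name.split():
            index.setdefault(part, set()).add(row)
    return index

def candidate_rows(entity_text, index):
    sets = [index.get(word, set()) for word in entity_text.split()]
    return set.intersection(*sets) if sets else set()
```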

6.3 Contextual disambiguation

Contextual disambiguation adds a criterion that compares the context of an entity's meaning with the meaning of the rest of the document.

After the first disambiguation iteration, locations have regions assigned to them, based on which region each location belongs to. A value representing the number of locations belonging to a specific region is then calculated and used as a part of the meaning score during the second iteration of disambiguation.

The occurrences of individual persons are counted after the first disambiguation iteration; this value is again used in the second iteration of disambiguation.
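The per-region counting used between the two iterations can be sketched with a simple counter; the region names here are illustrative:

```python
from collections import Counter

def region_weights(locations):
    """locations: iterable of region names assigned to the locations
    after the first disambiguation iteration. The per-region counts
    feed into the meaning score in the second iteration (sketch)."""
    return Counter(locations)
```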

6.4 Determining the meaning of pronouns – coreference resolution

The tool marks entities from the knowledge base, dates, and also English pronouns, and then attempts to determine what they refer to. The last found entity of the matching grammatical gender is taken as the meaning of a pronoun. The pronouns he, him, his and himself correspond to males; she, her, hers and herself correspond to females; who, whom and whose correspond to either gender. Furthermore, the pronouns here, there and where are handled.
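The "last entity of the matching gender" rule can be sketched as follows; the entity representation is an assumption, not the tool's real data structure:

```python
PRONOUN_GENDERS = {
    "he": "M", "him": "M", "his": "M", "himself": "M",
    "she": "F", "her": "F", "hers": "F", "herself": "F",
    "who": "MF", "whom": "MF", "whose": "MF",
}

def resolve_pronoun(pronoun, preceding_entities):
    """Return the last preceding person entity whose gender matches
    the pronoun. preceding_entities is a list of (name, gender)
    pairs in text order; a sketch, not the real API."""
    wanted = PRONOUN_GENDERS.get(pronoun.lower())
    if wanted is None:
        return None
    for name, gender in reversed(preceding_entities):
        if gender in wanted:   # "M" or "F" matches "MF" for who/whom/whose
            return name
    return None
```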

6.5 Removing overlapping entities

The finite state machine figa and the script dates.py can produce multiple results for one place in the text. A date can also be used as a name of a KB item; in that case the date is preferred. When figa finds multiple overlapping entities, the longest one is preferred.
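A simplified sketch of this preference order (dates first, then the longest span) follows; the tuple representation of entities is an assumption:

```python
def remove_overlaps(entities):
    """entities: list of (begin, end, type) tuples. Dates win over KB
    items on the same span; among overlapping entities the longest is
    kept. A simplified sketch of the behaviour described above."""
    # Order candidates: dates first, then longer spans, then earlier start.
    ordered = sorted(entities,
                     key=lambda e: (e[2] != "date", -(e[1] - e[0]), e[0]))
    kept = []
    for begin, end, etype in ordered:
        # Keep only entities that do not overlap an already kept one.
        if all(end <= b or begin >= e for b, e, _ in kept):
            kept.append((begin, end, etype))
    return sorted(kept)
```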

6.6 Removing entities adjacent to a word that begins with a capital

If the closest word before the beginning or behind the end of an entity starts with a capital letter, it is very likely that the tagged text is part of a longer name, and its meaning may not correspond to the meaning of the marked section. Therefore, such entities are not listed. The exceptions are cases where the adjacent words are separated by punctuation, or where the capital letter marks the start of a sentence (the previous word ends with a period, question mark or exclamation mark).
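This filtering rule can be sketched as follows; the exact punctuation handling is an assumption based on the exceptions described above:

```python
def keep_entity(text, begin, end):
    """Drop an entity when the closest word before or after it starts
    with a capital letter, unless the capital merely starts a sentence
    or the words are punctuation-separated. Sketch of the rule above."""
    after = text[end:].lstrip()
    # Word after the entity starting with a capital -> likely a longer name.
    if after and after[0].isupper():
        return False
    before = text[:begin].rstrip()
    if before:
        prev = before.split()[-1]
        # A capitalized previous word is allowed only when it ends with
        # sentence-final or separating punctuation.
        if prev[0].isupper() and prev[-1] not in ".?!,;:":
            return False
    return True
```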

6.7 Performance characteristics

Please note that these numbers are outdated.
The text processing speed of the tool (searching for entities identifying people, places and works of art, disambiguating their meaning, and locating dates and time intervals in a text) was measured on the server athena1; the measurement was repeated three times.

The test data consist of entity names (1 name = 1 line), in amounts ranging from 10,000 to 10,000,000 names.

Tested data can be found here:

 secapi/NER/data/performance
        

Initialization of the tool takes 4.902 s; it is included in the times below.

Performance characteristics of NER - without disambiguation

 Number of entities on input   Input size   Length of processing
 10,000                        146 kB       5.6 s
 100,000                       1.5 MB       13.221 s
 1,000,000                     15.8 MB      89.282 s
 10,000,000                    152.8 MB     815.583 s

Performance characteristics of NER - with disambiguation

 Number of entities on input   Input size   Length of processing
 10,000                        146 kB       5.822 s
 100,000                       1.5 MB       18.197 s
 1,000,000                     15.8 MB      240.959 s
 10,000,000                    152.8 MB     2,533.679 s

7 Creating Knowledge Base

7.1 Creating partial KB

Creation of partial KB PERSONS, ARTISTS, ARTWORKS, LOCATIONS, MUSEUMS, MYTHOLOGY, ARTIST_FAMILIES, ARTIST_GROUP_OR_COLLECTIVE and OTHER takes place in directory:

 secapi/NER/KnowBase
        

The individual KBs are created by the script

 start.sh

which launches the start.sh script of each type in its respective subdirectory (artworks, locations, museums, persons, mythology, artist_families and artist_group_or_collective); it is therefore possible to re-generate a specific type. The creation of a partial KB is performed by the kb_compare.py script (see the matching section below). This step also assigns alternative names from JRC-Names to the types ARTIST and PERSON.

More useful scripts can be found in the directory secapi/NER/KnowBase. The script backupKB.sh backs up the input files for the creation of partial KBs to /mnt/data-in/knot/iotrusina/KB_data_backups. The script copyKB.sh uploads newly created partial KBs to /mnt/minerva1/nlp/projects/decipher_ner/KnowBase. The script deleteKB.sh deletes the created partial KBs. The scripts copyKB.sh and deleteKB.sh are launched within the script secapi/NER/start.sh.

7.1.1 Creating partial KB OTHER

This KB is based on the KB.all of the project Wikify:

 /mnt/minerva1/nlp/projects/wikify/wikipedia/data/new_KB/KB.all
        

Names that are already used in the previously created partial KBs are filtered out: the script checks the partial KBs PERSONS, ARTISTS, etc., loads the names of the individual entities, and filters these names out of the KB.all of the Wikify project. The entities that remain, i.e. those not used in any of the partial KBs, form the foundation of the new partial KB OTHER. OTHER must therefore be created as the last partial KB (and if a new partial KB is added, it has to be added to the script secapi/NER/KnowBase/other/kb_filter_entity_out.py).
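The filtering step can be sketched as follows; the name column index and the sample rows are illustrative, not the real kb_filter_entity_out.py logic:

```python
def filter_other(wikify_rows, used_names, name_column=2):
    """Keep only rows of the Wikify KB.all whose name does not already
    occur in a previously created partial KB. used_names is the set of
    entity names collected from those partial KBs (sketch)."""
    kept = []
    for row in wikify_rows:
        name = row.split("\t")[name_column]
        if name not in used_names:
            kept.append(row)
    return kept
```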

7.2 Creating KB.all

The Knowledge Base KB.all is created by merging the partial KBs with data from Freebase and from some foreign-language Wikipedias.

This merge is done by the script:

 secapi/NER/prepare_data
        

When the -i parameter is used, missing images are simultaneously downloaded to the image database /mnt/athena3/kb/images (Wikimedia images only at the moment).

Newly created Knowledge Base is located in file:

 secapi/NER/KB.all

7.3 Creating KBstatsMetrics.all

The Knowledge Base KBstatsMetrics.all is essentially the original KB.all expanded by several columns with statistics. It is also created by the script prepare_data, where the relevant statistics are added to each row by the scripts wiki_stats_to_KB.py and metrics_to_KB.py.

Compared to the described format of KB.all, each line is expanded by six columns: the first three contain statistics of the relevant Wikipedia article (backlinks, hits, primary sense). The fourth holds the disambiguation score calculated from the previous three columns. The fifth column contains a disambiguation score calculated from metrics such as name length, the number of filled columns in the KB, or the population of a location. The sixth column contains the confidence score, which combines all of the previous values into one.

Knowledge Base created this way is located in the file:

 secapi/NER/KBstatsMetrics.all
        

8 Creating dictionaries

The ner.py tool uses the tool figa, which recognizes entities in text using several dictionaries. The description of figa and of the dictionaries can be found on the project page Ner4; this chapter describes only their creation. (Previously, the finite state machines described on the project page Decipher fsa were used for the same purpose.)

The creation of the dictionaries for NER and for autocomplete is carried out by the scripts create_cedar.sh and create_cedar_autocomplete.sh. These scripts work with the file KBstatsMetrics.all, from which a list of names is obtained and subsequently submitted to the tool figav1.0, which creates the particular finite state machines.

 secapi/NER/figa/make_automat/create_cedar.sh
        

Usage:

 create_fsa.sh [-h] [-l|-u] [-c|-d] --knowledge-base=KBstatsMetrics.all
        

Required arguments:

Optional arguments:

Using script create_cedar.sh the following dictionaries are generated:

 automata.[ct|dct]       - basic machine for ner.py
 automata-lower.[ct|dct] - machine used for lowercase variant of recognition
 automata-uri.[ct|dct]   - machine for URI
        

Using script create_cedar_autocomplete.sh the following dictionaries are generated:

 art_period_movement_automata.[ct|dct] - dictionary for autocomplete (type ART PERIOD MOVEMENT)
 artwork_automata.[ct|dct]             - dictionary for autocomplete (type ARTWORK)
 event_automata.[ct|dct]               - dictionary for autocomplete (type EVENT)
 family_automata.[ct|dct]              - dictionary for autocomplete (type FAMILY)
 group_automata.[ct|dct]               - dictionary for autocomplete (type GROUP)
 location_automata.[ct|dct]            - dictionary for autocomplete (type LOCATION)
 museum_automata.[ct|dct]              - dictionary for autocomplete (type MUSEUM)
 mythology_automata.[ct|dct]           - dictionary for autocomplete (type MYTHOLOGY)
 nationality_automata.[ct|dct]         - dictionary for autocomplete (type NATIONALITY)
 person_automata.[ct|dct]              - dictionary for autocomplete (type PERSON with subtype ARTIST)
 visual_art_form_automata.[ct|dct]     - dictionary for autocomplete (type VISUAL ART FORM)
 visual_art_genre_automata.[ct|dct]    - dictionary for autocomplete (type VISUAL ART GENRE)
 visual_art_medium_automata.[ct|dct]   - dictionary for autocomplete (type VISUAL ART MEDIUM)
 x_automata.[ct|dct]                   - dictionary for autocomplete (type all types together)
        

Both of these scripts are launched within the script secapi/NER/start.sh.


9 Automation of Knowledge Base and machines creation

The process of creating the Knowledge Base and all necessary automata (including those for autocomplete) is automated. The process can be launched using the following script:

 secapi/NER/start.sh
        

This script gradually creates the partial KBs, merges them into KB.all, creates KBstatsMetrics.all, builds the machines for NER and autocomplete, and can upload them to the athena3 server (parameter -u or --upload) to the location athena3:/mnt/data/kb, from where they can be easily downloaded.

In addition to start.sh, some other useful scripts are found on git in the NER directory. Script deleteKB.sh deletes all created KBs and machines. Script uploadKB.sh uploads all created KBs and machines to athena3:/mnt/data/kb. Script downloadKB.sh downloads the latest stable version of the KB and machines from the athena3 server. Script TestAndRunStart.sh tests all the prerequisites necessary to generate the finite state machines and the KB, and enables their managed generation (see the section on testing the start.sh script below) with a list of errors that occurred during generation.

The process of creating KBs and machines should only be performed by authorized people. Students are forbidden to run the start.sh script with the -u or --upload parameter because they could cause the NER tool to malfunction.

If someone uploads a malfunctioning version of the KB or machines to the athena3 server, it is possible to return manually to the last functional version. Individual versions are located in athena3:/mnt/data/kb/kb and are numbered by Unix timestamp (e.g. 1413053397). You can simply copy all the files of a particular version to the location /mnt/data/kb/. For example, to go back to version 1413053397, use the following command:

 cp /mnt/data/kb/kb/1413053397/* /mnt/data/kb/.
        

10 Matching entities from two different Knowledge Bases

The aim is to match entities from two Knowledge Bases and create a new KB from the matched entities. The script works on "any" KB composed of one type of entity (person, location, possibly others), according to configuration files. Matching is implemented in the kb_compare.py script, which is available on git:

 secapi/NER/KnowBase/kb_compare.py
        

Script kb_compare.py can now also perform deduplication of its own input KBs using the --deduplicate_kb1 or --deduplicate_kb2 arguments. This function can be used separately via the script kb_dedup.py, available on git:

 secapi/NER/KnowBase/kb_dedup.py
        

10.1 Script kb_compare.py

Usage:

 kb_compare.py [-h] --first FIRST --second SECOND [--first_fields FIRST_FIELDS] [--second_fields SECOND_FIELDS] --rel_conf REL_CONF 
 [--output_conf OUTPUT_CONF] [--other_output_conf OTHER_OUTPUT_CONF] [--first_sep FIRST_SEP] [--second_sep SECOND_SEP] 
 [--id_prefix ID_PREFIX] [--output OUTPUT]

Optional arguments:

10.1.1 Description of parameters

10.1.2 Launch example

 ./kb_compare.py --first=DBPEDIA --second=GEONAMES --rel_conf=dbpedia_geonames_rel.conf --output_conf=DG_output.conf --output=DG
  --id_prefix=l --other_output_conf=DG_other_output.conf --treshold=4
        

10.2 Script kb_dedup.py

Usage:

kb_dedup.py [-h] --kb KB [--kb_fields KB_FIELDS] [--kb_sep KB_SEP] [--id_fields ID_FIELDS [ID_FIELDS ...]] --output OUTPUT
        

Removes duplicates from a Knowledge Base.

Optional arguments:

10.2.1 Description of parameters

10.3 Description of configuration files *_output.conf

Syntax (BNF):

 <file> ::= <row> | <row> <file>
 <row> ::= <column source> "\n"
 <column source> ::= "ID" | "None" | '"' <column content> '"' | <column from kb>
 <column from kb> ::= <name of kb> "." <name of column>
 <name of kb> ::= <name of kb1> | <name of kb2>
        

where:

10.3.1 Example configuration file artists/ARTISTS_output.conf

 ID
 WF.TYPE
 WF.SUBTYPE
 ULAN.DISPLAY TERM
 WF.ALIAS
 WF.PROFESSION
 WF.NATIONALITY
 WF.DESCRIPTION
 ULAN.DATE OF BIRTH
 ULAN.PLACE OF BIRTH
 ULAN.DATE OF DEATH
 ULAN.PLACE OF DEATH
 ULAN.GENDER
 WF.PERIOD OR MOVEMENT
 WF.PLACE LIVED
 WF.WIKIPEDIA URL
 WF.FREEBASE URL
 WF.DBPEDIA URL
 WF.IMAGE
 WF.ART FORM
 WF.INFLUENCED
 WF.INFLUENCED BY
 ULAN.ID
        
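How such a configuration drives the generation of one output row can be sketched as follows (a simplified, hypothetical illustration only; the real logic lives in class kb_match.Output, and the helper below is not part of the repository):

```python
def build_output_row(conf_lines, rec_id, kbs):
    """Produce one TSV output row from an *_output.conf description.

    conf_lines: the lines of the configuration file;
    kbs: dict mapping a KB name (e.g. "WF") to the matched record,
    itself a dict of column name -> value.
    """
    row = []
    for line in conf_lines:
        line = line.strip()
        if line == "ID":
            row.append(rec_id)                 # identifier of the merged entity
        elif line == "None":
            row.append("")                     # column left blank
        elif line.startswith('"') and line.endswith('"'):
            row.append(line[1:-1])             # literal column content
        else:
            kb_name, column = line.split(".", 1)   # <name of kb>.<name of column>
            row.append(kbs.get(kb_name, {}).get(column, ""))
    return "\t".join(row)

conf = ["ID", '"person"', "WF.NAME", "None"]
print(build_output_row(conf, "a:123", {"WF": {"NAME": "Walter Osborne"}}))
```

Each line of the configuration thus maps directly to one column of the resulting KB row.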

10.4 Description of configuration files *_other_output.conf

Syntax (BNF):

 <file> ::= <row> | <row> <file>
 <row> ::= <column source> "\n"
 <column source> ::= "ID" | "None" | '"' <column content> '"' | <list of columns from kb>
 <list of columns from kb> ::= <column from kb> | <column from kb> "|" <list of columns from kb>
 <column from kb> ::= <name of kb> "." <name of column>
 <name of kb> ::= <name of kb1> | <name of kb2>
        

where:

10.4.1 Example configuration file artists/artrepublic/OUT_other_output.conf

 ID
 "person"
 "artist"
 ARTREPUBLIC.NAME
 None
 None
 None
 ARTREPUBLIC.DESCRIPTION|ARTREPUBLIC.ABOUT
 None
 None
 None
 None
 None
 None
 None
 None
 None
 None
 ARTREPUBLIC.LOCAL IMAGE
 None
 None
 None
 None
 ARTREPUBLIC.PROFILE LINK
        

10.5 Description of configuration files *_rel.conf

Syntax (BNF):

 <file> ::= <unique> <name> <other>
 <unique> ::= "UNIQUE:" "\n" <list of relations>
 <name> ::= "NAME:" "\n" <list of relations>
 <other> ::= "OTHER:" "\n" <list of relations>
 <list of relations> ::= "" | <relation> <list of relations>
 <relation> ::= "\t" <column from kb1> "=" <column from kb2> "\n"
 <column from kb1> ::= <name of kb1> "." <name of column>
 <column from kb2> ::= <name of kb2> "." <name of column>
        

where:

10.5.1 Example configuration file artists/wikipedia_freebase_rel.conf

 UNIQUE:
	 WIKIPEDIA.WIKIPEDIA URL=FREEBASE.WIKIPEDIA URL
	 WIKIPEDIA.FREEBASE URL=FREEBASE.FREEBASE URL
 NAME:
 	 WIKIPEDIA.NAME=FREEBASE.NAME
	 WIKIPEDIA.NAME=FREEBASE.ALIAS
	 WIKIPEDIA.ALTERNATIVE NAME=FREEBASE.NAME
	 WIKIPEDIA.ALTERNATIVE NAME=FREEBASE.ALIAS
 OTHER:
	 WIKIPEDIA.DATE OF BIRTH=FREEBASE.DATE OF BIRTH
	 WIKIPEDIA.DATE OF DEATH=FREEBASE.DATE OF DEATH
	 WIKIPEDIA.PLACE OF BIRTH=FREEBASE.PLACE OF BIRTH
	 WIKIPEDIA.PLACE OF DEATH=FREEBASE.PLACE OF DEATH
	 WIKIPEDIA.WORK=FREEBASE.PROFESSION
	 WIKIPEDIA.NATIONALITY=FREEBASE.NATIONALITY
	 WIKIPEDIA.INFLUENCED=FREEBASE.INFLUENCED
	 WIKIPEDIA.INFLUENCED BY=FREEBASE.INFLUENCED BY
	 WIKIPEDIA.PERIOD OR MOVEMENT=FREEBASE.PERIOD OR MOVEMENT
	 WIKIPEDIA.IMAGE=FREEBASE.IMAGE
        

10.6 Creating KB - type Artworks

Data sources available from:

 secapi/NER/KnowBase/artworks
        

There are several files in the artworks folder. These are:

 DBPEDIA (data source obtained from DBPEDIA)
 DBPEDIA.fields (format of the DBPEDIA data source - note especially the postfix "(MULTIPLE VALUES)" needed to distinguish multiple values in a field)
 FREEBASE (data source retrieved from FREEBASE)
 FREEBASE.fields (format of the FREEBASE data source)
 ARTWORKS (the newly created KB - which we want to create again)
 ARTWORKS.fields (format of the new KB, needed by other members co-operating on Decipher)
 ARTWORKS_other_output.conf (configuration file for switch --conf_other_output)
 ARTWORKS_output.conf (configuration file for switch --conf_output)
 dbpedia_freebase_rel.conf (configuration file defining relations between DBPEDIA and FREEBASE fields)
 start.sh (command to start creation of a new KB with all the necessary switches)
        

10.7 How does generating KB work?

  1. After you launch the script, it parses the switches from the command line and opens the required files (class Init).
  2. It processes all configuration files (class kb_config.Config).
  3. It loads the data sources into internal data structures (each row is an object of type kb_data.Data).
  4. It creates indexes over --second - needed to reduce the time complexity of matching (class kb_index.Index).
  5. The matching itself takes place (class kb_match.Match). Matching is one large loop that takes a row from the --first data source and compares it with the data from --second based on the relations defined in --rel_conf, using the indexed --second data source.
  6. The last step is to generate the KB itself from the internal data structures (class kb_match.Output).

10.8 Description of search for matching entities

Two Knowledge Bases, KB1 and KB2, are always compared. KB1 is read sequentially by rows, and each row is matched against a record in KB2. First, the script searches KB2 for an item with the same unique identifier (typically a Wikipedia URL); the fields labeled UNIQUE are compared, and the search can use multiple identifiers. If a corresponding record in KB2 is found, it is assigned to the record in KB1 and the search ends. If no result is found, or the record in KB1 does not contain an identifier, the script searches for candidates based on the relations specified in the field labeled NAME; usually names, pseudonyms and alternative names are compared. The result of the search is a list of candidates suitable for assignment, and each candidate is scored by the number of matching strings. Then the relations in the field labeled OTHER are evaluated. The candidate with the best score is assigned if its score reaches or exceeds the threshold; otherwise nothing is selected. An assigned entity from KB2 is given a "used" flag so it cannot be used again.
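The search strategy described above can be sketched roughly as follows (a simplified, hypothetical illustration, not the actual kb_compare.py code; records are modeled as dicts and each relation list as (kb1_column, kb2_column) pairs):

```python
def match_record(rec1, kb2, unique_rels, name_rels, other_rels, threshold):
    """Find the best KB2 match for one KB1 record, or None.

    rec1 is one row of KB1 (a dict); kb2 is a list of KB2 rows.
    """
    # 1. UNIQUE identifiers (e.g. Wikipedia URL): an exact match wins immediately.
    for rec2 in kb2:
        if rec2.get("used"):
            continue
        for col1, col2 in unique_rels:
            if rec1.get(col1) and rec1.get(col1) == rec2.get(col2):
                rec2["used"] = True
                return rec2

    # 2. NAME relations: collect candidates that share a name or alias,
    #    scored by the number of matching strings.
    candidates = []
    for rec2 in kb2:
        if rec2.get("used"):
            continue
        score = sum(1 for col1, col2 in name_rels
                    if rec1.get(col1) and rec1.get(col1) == rec2.get(col2))
        if score:
            candidates.append((score, rec2))

    # 3. OTHER relations: refine each candidate's score.
    best_score, best = 0, None
    for score, rec2 in candidates:
        score += sum(1 for col1, col2 in other_rels
                     if rec1.get(col1) and rec1.get(col1) == rec2.get(col2))
        if score > best_score:
            best_score, best = score, rec2

    # 4. Accept the best candidate only if it reaches the threshold.
    if best is not None and best_score >= threshold:
        best["used"] = True
        return best
    return None

kb2 = [{"NAME": "Charles Alexander Smith", "DATE OF BIRTH": "1864"}]
rec1 = {"NAME": "Charles Alexander Smith", "DATE OF BIRTH": "1864"}
match = match_record(rec1, kb2, [], [("NAME", "NAME")],
                     [("DATE OF BIRTH", "DATE OF BIRTH")], threshold=2)
print(match is kb2[0])  # prints True
```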

10.9 Occasions when entities are not assigned to each other

In some cases, two entities may not be matched even though they should be. Consider, for example, the following two records:

 a:8b4f4e2666	artist	Charles Alexander Smith	Charles Alexander Smith	Charles Alexander Smith	Canadian
 	 Charles Alexander Smith was a Canadian painter from Ontario.	1864	1915	M	
 http://www.freebase.com/m/0269dw5 http://en.wikipedia.org/wiki/Charles_Alexander_Smith	http://dbpedia.org/page/Charles_Alexander_Smith freebase/02cy_gx.jpg

and

 a:3f368917fe	artist	Charles Alexander	Alexander, Charles	Alexander Charles Smith	artist	
 photographer|painter	British	Canadian British painter, 1864-1915	1864	Ontario 
 (Canada) (province)	1915	London (Greater London, England, United Kingdom) (inhabited place)	M	500026312

These two records could not be assigned to each other. Above all, the second record does not contain any unique identifier, such as a wiki URL, by which it could be unambiguously assigned to the first record. Unfortunately, the assignment by name match was also unsuccessful: the first record contains only the name "Charles Alexander Smith", while the second record contains "Charles Alexander", "Alexander, Charles" and "Alexander Charles Smith". The first names appear in reverse order, so the strings do not match and the entities were not assigned to each other.
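A possible remedy for the word-order problem, not implemented in kb_compare.py and shown here only as a hypothetical sketch, is to compare names as unordered sets of words:

```python
def names_match(name1, name2):
    """Compare two personal names ignoring word order, case and commas,
    so that "Smith, Charles Alexander" matches "Charles Alexander Smith"."""
    tokens1 = set(name1.replace(",", " ").lower().split())
    tokens2 = set(name2.replace(",", " ").lower().split())
    return tokens1 == tokens2

print(names_match("Charles Alexander Smith", "Alexander Charles Smith"))  # True
print(names_match("Charles Alexander Smith", "Charles Alexander"))        # False
```

A set comparison this loose would, of course, also introduce new false matches, so in practice it would have to contribute to the candidate score rather than decide the match on its own.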


11 Script dates.py

The task is to create a regular expression able to extract every common date format occurring in a text (2004-04-30, 02/30/1999, 31 January 2003, etc.). The regular expression is supposed to be complex (a single regular expression only) and should be able to extract as many common formats as possible (we are not interested in specific times, only in years, months and days).

After the extraction of a date, its normalization (to ISO 8601) is performed so that the data can be worked with further.

Alongside the regular expression, it was necessary to write code (a function and a class) to process the dates and pass them on to other scripts.

Input is plain English text (type str) and output is a list (type list) of instances of class Date.

For the script to be perfect, it would have to include semantic analysis of English, which would recognize what is and what is not a date (or a year, respectively).
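For illustration, a much-simplified single regular expression covering three of the formats mentioned above might look like this (the real expression in dates.py is considerably more complex):

```python
import re

# Simplified sketch: matches ISO dates (2004-04-30), US slash dates
# (02/30/1999) and "31 January 2003" style dates.
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
DATE_RE = re.compile(
    r"\b(?:"
    r"\d{4}-\d{2}-\d{2}"          # 2004-04-30
    r"|\d{1,2}/\d{1,2}/\d{4}"     # 02/30/1999
    r"|\d{1,2} (?:%s) \d{4}"      # 31 January 2003
    r")\b" % MONTHS)

text = "Born 31 January 2003, registered 2004-04-30."
for m in DATE_RE.finditer(text):
    # finditer also yields the offsets needed for the Date class
    print(m.group(), m.start(), m.end())
```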

The script is located on git:

 secapi/NER/dates.py
        

11.1 Classes

Class Date

Class for the found dates. Besides year, month and day, it also stores the position in the source string and the string from which the date was parsed. After creating a new instance, it is necessary to call init_date() or init_interval() to initialize the attributes.

It has two types, DATE for a plain date and INTERVAL for an interval between two dates. Here is an example with comments:

 class_type: DATE          # Attribute indicating the type of the find (plain date or interval)
 source:     April 2, 1918 # Source string from the source text
 iso8601:    1918-04-02    # Date as ISO_date
 s_offset:   265           # Start of the source string in the source text
 end_offset: 278           # End of the source string in the source text (calculated as s_offset + len(source))
 ---------------------------------
 class_type: INTERVAL
 source:     1882-83
 date_from:  1882-00-00    # Starting date as ISO_date
 date_to:    1883-00-00    # Ending date as ISO_date
 s_offset:   467        
 end_offset: 474       
        

Class ISO_date

Class storing a date (year, month and day) in the attributes day, month and year. It was created to replace datetime.date in situations when only a year is known; in that case the unknown parts are replaced by the value zero (e.g. 1881-00-00).
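The idea can be sketched in a few lines (a minimal illustration; the attribute names follow the description above, everything else is assumed):

```python
class ISO_date:
    """Date holder that, unlike datetime.date, allows unknown parts.

    Unknown month/day are stored as 0, so a bare year prints as
    '1881-00-00'.
    """
    def __init__(self, year=0, month=0, day=0):
        self.year = year
        self.month = month
        self.day = day

    def __str__(self):
        # Zero-padded ISO 8601-like form, zeros marking unknown parts
        return "%04d-%02d-%02d" % (self.year, self.month, self.day)

print(ISO_date(1881))        # 1881-00-00
print(ISO_date(1918, 4, 2))  # 1918-04-02
```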

11.2 Properties

11.3 Supported formats

11.3.1 Date

Month and year only:

Year only:

11.3.2 Interval

11.4 Problems

When using dateutil.parser:

 str(dateutil.parser.parse("Jul 18 '30").date()) -> '2030-07-18' # automatically adds the current first two digits - considered correct
 str(dateutil.parser.parse("Jul 18 30").date()) -> '2030-07-18' # but what if the year really is 30?
 str(dateutil.parser.parse("0030-01-01").date()) -> '2001-01-30' # it seems dateutil.parser does not accept years below one hundred
 str(dateutil.parser.parse("0099-01-01").date()) -> '1999-01-01' # it does not always add the current first two digits, but the closest ones
 str(dateutil.parser.parse("Jul 18 '62").date()) -> '2062-07-18' # because it is year 2013
 str(dateutil.parser.parse("Jul 18 '63").date()) -> '1963-07-18' # because it is year 2013
 str(dateutil.parser.parse("0100-01-01").date()) -> '0100-01-01' # correct
 
 Given DD/MM/YYYY, dateutil.parser interprets it as MM/DD/YYYY if DD < 13, otherwise as DD/MM/YYYY:
 str(dateutil.parser.parse("10/1/2000").date()) -> '2000-10-01'
 str(dateutil.parser.parse("13/1/2000").date()) -> '2000-01-13'
        

11.5 Statistical information

While scanning 45,764,556 words, 561,744 entries were found (of which 177,336 were intervals) in 11m 17.836s, i.e. a speed of 67,515.6764 words per second.


12 Testing the start.sh script

The task was to create a test set for the start.sh script stored in secapi/NER.

List of created files:

 secapi/NER/TestAndRunStart.sh - test script
 secapi/NER/AllKB.txt - list of files needed to create a new KB
 secapi/NER/fields.txt - list of field files
 secapi/NER/MatchKBAndFields.py - script that connects relevant fields file to each KB
        

12.1 Test sets

If the script is launched with the -t parameter, TestAndRunStart.sh checks the availability of the files that start.sh works with and of the scripts that start.sh launches. The script also checks that the *.py and *.sh files have the appropriate permissions set. Next, it tests the files needed to create a new KB (listed in the AllKB.txt file) for the correct format, specifically the number of columns, which must match the number of columns in the fields files stored in NER/KnowBase.

If the script is launched with the -r parameter, the start.sh script is called and its stderr is redirected to a file. After the script finishes, the stderr is analyzed and the names of any missing files, the number of run-time errors and the number of warnings are printed out. After the new machines and KB are created, they are compared against the old versions of the machines and of the KB.all and KBstatsMetrics.all files. The script reports an error if the difference between a newly created file and its older version is greater than 5%, or if the new file is smaller by more than 10%. The script also reports an error if no machine or KB was created.
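The size comparison rule can be sketched like this (an illustration of the rule as described, not the actual TestAndRunStart.sh code):

```python
def size_regression(old_size, new_size):
    """Apply the two size checks described above to one file.

    Returns a list of problems; an empty list means the new file
    looks sane compared to its previous version.
    """
    problems = []
    # Rule 1: any difference greater than 5% is suspicious.
    if old_size and abs(new_size - old_size) / old_size > 0.05:
        problems.append("size differs from previous version by more than 5%")
    # Rule 2: shrinking by more than 10% is an error.
    if old_size and new_size < 0.9 * old_size:
        problems.append("new file is more than 10% smaller")
    return problems

print(size_regression(100, 103))  # []
print(size_regression(100, 85))
```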

The ner.py script is tested when launching with the -n parameter, and the figav08 tool when launching with the -f parameter. 75% of the testing is done on data stored in secapi/NER/data/input; in the remaining cases a specific entity is tested: "Paris 1984" for ner.py and "A Alewijn" for figav08. The ner.py utility (also under the -n parameter) is then tested with different switches: without any switches, with the -l switch (converts input to lower case), with the -r switch (with an error in input diacritics), and with -l and -r combined on invalid input text.

Launch example:

 ./TestAndRunStart.sh -t -> tests the existence, permissions, and format of the files needed to create the KB
 
 ./TestAndRunStart.sh -r -> starts start.sh script
 
 ./TestAndRunStart.sh -f -> starts testing of figa tool

 ./TestAndRunStart.sh -n -> starts testing of ner tool

 ./TestAndRunStart.sh -h -> shows help

 ./TestAndRunStart.sh -c -> starts CheckNerOut.py script - testing output from ner with manual annotation
        

12.2 Comparison of output from NER with manual annotation

The script CheckNerOut.py is used for this comparison; it prints the percentage match on the output. Entities from the annotation that were not paired are saved to a file in the secapi/NER/CheckNerOut directory, and the same is done with entities from the file containing the NER output. The script is embedded in TestAndRunStart.sh, where it tests the files that have manual annotations.
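The percentage match can be computed roughly like this (a hypothetical sketch; CheckNerOut.py itself may pair entities differently, e.g. using offsets):

```python
def percentage_match(annotated, ner_output):
    """Share of manually annotated entities that also appear in the
    NER output, with entities compared as plain strings."""
    if not annotated:
        return 100.0
    found_entities = set(ner_output)
    found = sum(1 for entity in annotated if entity in found_entities)
    return 100.0 * found / len(annotated)

print(percentage_match(["Paul Kane", "Toronto", "Ontario"],
                       ["Paul Kane", "Ontario"]))
```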

Launch example:

 python CheckNerOut.py -a SouborSAnotaci.tsv -o NerOutput.txt
        

12.2.1 Current match results as of March 24, 2015 for files from secapi/NER/data/input:

 Paul_Kane: 56.6037735849%
 travelling_artist_part2: 69.0909090909%
 travelling_artist_part3: 81.6513761468%
 Dossier_and_OS_texts.docx: 56.5020576132%
        

13 Recognizing names of people in text

The purpose of the script is to identify people's first names and surnames in a selected text using a list of names. The tool is located in the git repository in the directory secapi/NameRecognizer (it is necessary to switch to the NameRecognizer branch).

13.1 Procedure

1. step - Creating a list of names:

First, it was necessary to create a list of names for the finite state machine. That is why I created the Name Collector tool, which is described below. Out of all the output text files, only the outputs/all.txt file is fundamental for the finite state machine.

2. step - Compilation of figa tool:

First, you need to download the figa tool's git repository (see Decipher_fsa) to:

 /mnt/minerva1/nlp/projects/decipher_ner/xklima22 
        

and launch script:

 /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/create_figa.sh
        

The script compiles the tool and copies it to directory:

 /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/data
        

3. step - Formation of the finite state machine:

As in Step 2, it is necessary to download the git figa tool repository and then run the script:

 /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/create_fsa.sh
        

The script assembles the finite state machine final.fsa using the list of names from the first step and puts it in directory:

 /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/data
        

4. step - Processing outputs of figa:

Figa outputs are edited by sort -u and later on by process_outputs.py script:

 /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/process_outputs.py
        

The script accepts 2 parameters - the input and output file - and combines the acquired names based on offsets.

The process_outputs.py script uses lists that are stored in the data/lists directory when filtering:

 blist_locations.txt  - list of locations created by the kb_locations script
 custom_names.txt     - list of manually added first names
 custom_surrnames.txt - list of manually added surnames
 names.txt            - list of first names, gained from the results of the name_collector and kb_list scripts
 nationalities.txt    - list of nationalities
 notfirst.txt         - list of words that cannot appear in the first position
 replace.txt          - list of words or phrases to be removed from the names 
 surrnames.txt        - list of surnames, gained from the results of the name_collector and kb_list scripts
        

The whole step is automated by the run.sh script, which loads the data from stdin and prints the results to stdout.

In addition to names, the script also searches the text for initials (two-character words composed of a capital letter and a dot), finds their start and end offsets, and adds them to the list of names from the figa tool output. After the list of names is obtained, a trailing "'s" is removed from names ending with it and the end offset is recalculated.
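Both operations can be sketched in a few lines (hypothetical helpers, not the actual process_outputs.py code):

```python
import re

def find_initials(text):
    """Find two-character initials (capital letter plus dot) together
    with their start and end offsets."""
    return [(m.start(), m.end(), m.group())
            for m in re.finditer(r"\b[A-Z]\.", text)]

def strip_possessive(name, end_offset):
    """Remove a trailing \"'s\" from a name and recompute the end offset."""
    if name.endswith("'s"):
        return name[:-2], end_offset - 2
    return name, end_offset

print(find_initials("Jack B. Yeats met J. Doe."))
print(strip_possessive("Walter Osborne's", 360))  # ('Walter Osborne', 358)
```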

Usage:

 bash /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/run.sh [--show-filtered]
        

The optional parameter --show-filtered causes the filtered names to be written to a file.

Example use:

 bash /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/run.sh < /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/test_data/test.txt
        

Output of the example above:

 0	26	39	Nathaniel Hill
 0	41	54	Nathaniel Hill
 1	202	224	Joseph Malachy Kavanagh
 1	226	248	Joseph Malachy Kavanagh
 0	328	341	Walter Osborne
 0	347	360	Nathaniel Hill
 0	539	552	Edward McGuire
 0	554	567	Edward McGuire
        

Output format:

 1. column - Type of name according to how the name was found in the text (0 - all names were found by the figa tool, 1 - the names in which the part was found by the match and part added by the script).
 2. column - Start offset
 3. column - End offset
 4. column - Name
        

The list of obtained names can also include invalid names (such as Post Office) or names that are substrings of other found names. This happens because, when the set of obtained names is expanded, the new names are not compared with the contents of surrnames.txt, so their surname part is unverified. The way a name was found is indicated by the first column of the results; the possible flags are:

 0 - all names were found by the figa tool
 1 - names in which a part was found by a match, and other part was added by the script
 4 - names created by conjunction of names of types 0 and 1
 7 - names whose surname part was not found in surnames.txt
 8 - names that are a sub-string of another name found
        

Names tagged with number 4 are created by merging types 0 and 1. First, it is determined which of the combined names has the lower start offset; all words from that name are added, followed by the words from the second name (only those that do not yet appear in the resulting name). Finally, the offsets are recalculated, the first name is replaced by the new one and the second one is removed.
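The merging of two names into a type 4 name can be sketched as follows (names modeled as (start, end, text) tuples; a simplified illustration, not the actual process_outputs.py code):

```python
def merge_names(name_a, name_b):
    """Merge two (start, end, text) name records as described above:
    take the one with the lower start offset first, append words from
    the other that are not already present, and recompute the offsets."""
    first, second = sorted([name_a, name_b], key=lambda n: n[0])
    words = first[2].split()
    for word in second[2].split():
        if word not in words:
            words.append(word)
    merged = " ".join(words)
    return (first[0], first[0] + len(merged), merged)

print(merge_names((10, 24, "Joseph Malachy"), (17, 33, "Malachy Kavanagh")))
# (10, 33, 'Joseph Malachy Kavanagh')
```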

The script is capable of learning new type 1 words (see the paragraph above). Words learned this way are stored in:

 /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/outputs/learned.txt
        

These words cannot be unambiguously identified as first names or surnames, so they are only included in the general list; they are not in the names.txt and surrnames.txt lists (the list descriptions are at the beginning of this step).

Names that have been filtered out for a certain reason are stored in:

 /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/outputs/filtered.txt
        

Only the results of the last run are saved. Writing of filtered names must be enabled by the optional parameter --show-filtered. The number at the beginning of each line specifies the reason for filtering (the list that caused the name to be filtered is given in parentheses):

 0 - name contained fewer than 2 words
 1 - name contained a location name (list blist_locations.txt)
 2 - name had, as its first word, a word that cannot appear in that position (list notfirst.txt)
 3 - name contained, as its first word, a word which is not a first name (names.txt)
 4 - name contained, as its final word, a word which is not a surname (surrnames.txt) [DELETED]
 5 - name contained a nationality (nationalities.txt)
        

5. step - Highlighting of the names found in text:

Outputs of figa are further used by the following script:

 /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/highlight_names.py
        

The script requires the outputs of the previous step stored in figa.out file, takes the text on stdin and prints HTML text with colored highlighted names on stdout:

 green           names marked 0 in the first column of the figa output, i.e. names found by a full match with names in names.txt and surrnames.txt
 red             names marked 1 in the first column of the figa output, i.e. names with at least one part added by the script, so they may not be valid names
 blue            names that were found in the text more times than they occur in the figa output
 purple          co-references of the names highlighted by the colors above
 lime            names whose surname part is not in the surrnames.txt list
 olive           names that are a substring of a longer name (green, red or blue)  
        

The first column determines how the name was found in the text (see step 4). Names marked blue may indicate an error in processing the outputs by the process_outputs.py script or an error in the figa tool's search.

Usage:

 python /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/highlight_names.py
        

Example use:

 bash /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/run.sh < /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/test_data/test.txt > figa.out
 python /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/highlight_names.py < /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/test_data/test.txt > outputs/examples/example.html
        

The example output file is stored in:

 /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/outputs/example.html
        

Results

Names recognized by the tool in test texts:

 Richard Moynan
 Mary Magdalene
 Jack B.
 Lawrence J.
        

13.2 Wrapper script

To use NameRecognizer from other scripts, the wrapper script name_recognizer.py is available. It exposes the functionality of the scripts process_outputs.py and highlight_names.py through two functions, process(text) and highlight(text, output_fce_process), which return the processed figa output or the text with highlighted names, respectively.

Before use, it is necessary to instantiate the NameRecognizer class with two mandatory constructor parameters: the path to the figa executable (figav08) and the path to the finite state machine (*.fsa).

To test functionality, you can run the script with this input text:

 /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/test_data/example_input.txt
        

The output is written to stdout:

 python /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/name_recognizer.py
        

13.3 Helpful tools

13.3.1 Name Collector

Tool that obtains a list of names from several websites, it is located in:

 /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/name_collector
        

Directory structure:

 main.py                  - launch script of the tool
 base.py                  - defines the base class inherited by resource (site) classes with a previously known number of pages
 base_pagination.py       - defines the base class inherited by resource (site) classes that use paging (i.e. they have a "next" button)
 rsrc_*_base.py           - classes that define a common interface for certain sources
 rsrc_*.py                - source classes
 ./outputs/names.txt      - alphabetically sorted list of first names
 ./outputs/surrnames.txt  - alphabetically sorted list of surnames
 ./outputs/all_raw.txt    - alphabetically sorted list of first names and surnames
 ./outputs/all.txt        - all_raw.txt converted to a format suitable for the figa tool
 ./outputs/name/*.txt     - outputs of sources with first names
 ./outputs/surrname/*.txt - outputs of sources with surnames
        

Sites implemented as sources:

 http://german.about.com/library/blname_Girls.htm
 http://german.about.com/library/blname_Boys.htm
 http://babynames.net
 http://surname.sofeminine.co.uk/w/surnames/most-common-surnames-in-great-britain.html
 http://www.surnamedb.com/Surname
 http://en.wikipedia.org/wiki/Old_Frisian_given_names
 http://en.wikipedia.org/wiki/List_of_biblical_names
 http://en.wikipedia.org/wiki/Slavic_names
 http://genealogy.familyeducation.com/browse/origin/
        

Usage:

 python /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/name_collector/main.py
        

The script is run automatically by the run.sh script of the KB List tool.

13.3.2 KB List

Tool used to extract names from KB.all; its results are combined with the results from Name Collector. It is located in:

 /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/kb_list
        

Complete processing, merging and categorization of the data into first names and surnames is performed by run.sh.

Usage:

 bash /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/kb_list/run.sh
        

13.3.3 KB Locations

Tool that creates a list of locations used to filter the outputs of figa. The result of the script is a list of locations:

 data/lists/blist_locations.txt
        

The script is located in:

 /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/kb_locations
        

and launched using the following script:

 bash /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/kb_locations/run.sh
        

13.3.4 SWN

Script that searches for all one-word names in KB.all. The script is located in directory:

 /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts
        

Launched using:

 python /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/swn.py
        

The output of the script consists of one-word names along with information about their occurrence in KB.all. The first line contains the version of KB.all; the format of the second and following lines is as follows:

 1. column - line number of KB.all at which the name was found
 2. column - ID of a person in KB.all
 3. column - name
        

13.4 Example outputs

The example outputs of process_outputs.py (.out suffix files), highlight_names.py (.html suffix files), and swn.py (swn.txt) are stored in:

 /mnt/minerva1/nlp/projects/decipher_ner/xklima22/scripts/outputs/examples/
        

14 Ner4 project integration

The aim of the project is to completely replace the FSA automata in the FIGA project with CEDAR (implemented in the Ner4 project).

This project has its own branch on git: NER-figa_cedar.

The following chapters briefly describe the changes made during the migration from fsa figa to cedar figa.

14.1 NER

For the Ner project, the scripts were only slightly updated to use and support the new version of Figa.

The following files were kept, even though they were not used even in the old version:

 NER/figa/sources/kb_loader_fast.cc
 NER/figa/sources/kb_loader_slow.cc
        

Updated scripts in order to use the new version of figa:

 NER/start.sh
 NER/uploadKB.sh
 NER/deleteKB.sh
 NER/ner.py
        

14.2 FIGA

Major changes have been made in the Figa project. The most important is the replacement of the FSA machines with CEDAR tries. There was an effort to modify the new Figa as little as possible; mostly only trivial bugs were fixed. The Figa (C++) <-> Ner (Python) interface was slightly modified. The new Figa machines occupy much more disk space, but processing is faster.

14.3 Performance comparison (FIGA)

Below is a comparison of the requirements for creating machines and for searching in the machines of the individual libraries. All tests were run on the athena1 server.

To create the namelists, Knowledge Bases from the Ner project were used: KB.all (5,400,067 entities) and its smaller version KB.11 (490,915 entities; it contains every 11th line of KB.all).

The input texts example_input2 (3,693 words) and testing_data.txt (17,931 words) were used for the entity search.

Creation of automata (FSA) and dictionaries (CEDAR/DARTS). It is assumed the namelist is already prepared, so only the time of generating the machine itself by figav1.0/fsa_build is measured (not the whole create_[cedar|fsa].sh process):

               Time                     Memory usage                  Machine size
        FSA    CEDAR   DARTS    FSA     CEDAR     DARTS     FSA       CEDAR     DARTS
KB.11   56s    31s     30s      2.4 GB  331.1 MB  251.1 MB  24.3 MB   98.4 MB   31.3 MB
KB.all  115m   4m 46s  4m 50s   8.8 GB  2.8 GB    2.9 GB    237.1 MB  921.1 MB  334.3 MB
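A measurement like the one above can be taken with a small wrapper along these lines (a sketch for Linux; a placeholder child command is timed here, since the exact fsa_build/CEDAR invocations are project-specific):

```python
# Sketch: measure wall time and peak memory of a build command on Linux.
# The command below is a placeholder; substitute the real fsa_build call.
import resource
import subprocess
import time

start = time.monotonic()
subprocess.run(["python3", "-c", "x = list(range(10**6))"], check=True)
elapsed = time.monotonic() - start

# ru_maxrss over finished children = peak resident set size, in kB on Linux.
peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
print("time: %.2fs, peak RSS: %d kB" % (elapsed, peak_kb))
```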

Creation of the spellcheck automaton (FSA only; CEDAR and DARTS do not need a special spellcheck dictionary):

        Time    Memory usage  Machine size
KB.11   12s     329.4 MB      11.6 MB
KB.all  3m 42s  4.0 GB        90.8 MB

Search in created automata/dictionaries:

                          Time                   Memory usage
                          FSA    CEDAR  DARTS    FSA       CEDAR     DARTS
KB.11   example_input2    0.11s  0.49s  0.16s    36.8 MB   110.9 MB  43.9 MB
KB.11   testing_data.txt  0.15s  0.51s  0.18s    36.8 MB   110.9 MB  43.9 MB
KB.all  example_input2    1.21s  4.74s  1.89s    249.7 MB  933.7 MB  346.8 MB
KB.all  testing_data.txt  1.27s  4.84s  1.93s    249.7 MB  933.7 MB  346.8 MB

Search with spellcheck enabled in automata/dictionaries:

                          Time                    Memory usage
                          FSA    CEDAR   DARTS    FSA       CEDAR     DARTS
KB.11   example_input2    0.12s  4.3s    3.4s     48.4 MB   110.9 MB  43.8 MB
KB.11   testing_data.txt  0.17s  20.8s   17.1s    48.5 MB   110.9 MB  43.8 MB
KB.all  example_input2    1.64s  23.5s   17.3s    340.5 MB  934.1 MB  347.9 MB
KB.all  testing_data.txt  1.67s  2m 26s  2m       340.5 MB  934.8 MB  347.3 MB

Conclusion

The CEDAR and DARTS libraries create a machine much faster than FSA and with much smaller memory requirements (for some really big namelists from the Wikify project, the memory of the athena1 server was not enough, and even athena3 failed to create an FSA dictionary without an error). CEDAR and DARTS also do not need special spellchecking machines. On the other hand, the FSA machine uses less disk space and searching in it is much faster.


15 Comparing changes in two KBs

The script secapi/NER/KB_changes_comparator.py was created in the branch "wikipedia_update" to compare the changes between different versions of the KB for entities with a Wikipedia link.

Required arguments:

Optional arguments:

Launch example:

 python ./KB_changes_comparator.py oldKB.tsv newKB.tsv -w -e "GENDER,NATIONALITY"
 # Compares oldKB with newKB, prints out words and omits categories GENDER and NATIONALITY

 python ./KB_changes_comparator.py oldKB.tsv newKB.tsv -c DESCRIPTION
 # Compares oldKB with newKB, only compares category DESCRIPTION
        

Example of output with and without the parameter -w:

 without -w
 4116    4116    DESCRIPTION     http://en.wikipedia.org/wiki/Štepán_Wagner      replace .       ->      er
 
 with -w
 4116    4116    DESCRIPTION     http://en.wikipedia.org/wiki/Štepán_Wagner      replace  jump.  ->       jumper
        
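The character-level versus word-level (-w) behaviour shown above can be reproduced with difflib (using difflib here is an assumption about the mechanism; the script may compute the changes differently):

```python
# Sketch: character-level vs word-level (-w style) diff of a column value.
# difflib is an assumption about how the script computes changes.
import difflib

def changes(a, b):
    sm = difflib.SequenceMatcher(None, a, b)
    return [(op, a[i1:i2], b[j1:j2])
            for op, i1, i2, j1, j2 in sm.get_opcodes() if op != "equal"]

old, new = "ski jump.", "ski jumper"
print(changes(old, new))                  # [('replace', '.', 'er')]
print(changes(old.split(), new.split()))  # [('replace', ['jump.'], ['jumper'])]
```

On characters the change is the minimal edit ". -> er"; on whitespace-split words it becomes the whole-word change "jump. -> jumper", matching the two output lines above.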

Output structure:

 row_number_in_newKB \t row_number_in_oldKB \t column_category \t wiki_link \t change_type \t original_value \t -> \t new_value
        
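A line with this structure can be split straight into its fields (the variable names below follow the structure above; this snippet is written for illustration and is not part of the script):

```python
# Sketch: splitting one comparator output line into its fields.
line = ("4116\t4116\tDESCRIPTION\t"
        "http://en.wikipedia.org/wiki/Stepan_Wagner\t"
        "replace\tski jump.\t->\tski jumper")
(new_row, old_row, category, wiki_link,
 change_type, original_value, arrow, new_value) = line.split("\t")
print(category, change_type, original_value, "->", new_value)
```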

The structure differs when a new entity is found:

 row_number \t wiki_link \t new \t entity_row_content
        

Example:

 452196	3323492	{e}PLACE OF BIRTH	http://en.wikipedia.org/wiki/Jan_van_der_Heyden	insert		->	Gorinchem (South Holland, Netherlands) (inhabited place)
17641   17641   PLACE OF BIRTH  http://en.wikipedia.org/wiki/African_Spir       replace Elisabethgrad,  ->      Elizabethgrad,
278171  http://en.wikipedia.org/wiki/Erjon_Vucaj        new     p:6fa1cac12f    person  Erjon Vucaj             footballer                      Albania, Shkodër                1990-12-25                      
                      http://en.wikipedia.org/wiki/Erjon_Vucaj        http://www.freebase.com/m/0b__zy7       http://dbpedia.org/page/Erjon_Vucaj
        

The outputs were generated by running the following command:

 python KB_changes_comparator.py /mnt/data/kb/1455196205/KB.all /mnt/data/kb/1476090552/KB.all
        

and can be found here:

 /mnt/minerva1/nlp/projects/wikipedia_update/output.out
        

The run time for 4500000 rows is 5-6 minutes. The script labels the columns of the different entity types using this file:

 /mnt/minerva1/nlp/repositories/decipher/secapi/HEAD-KB