Decipher FSA

Table of contents

1 Modifying figa tool

1.1 Introduction

The goal of this project was to create a finite automaton (FA) from Knowledge Base (KB), that can be used in conjunction with figa (FIt GAzetter) to search for entities and their fragments in text.

Program autocomplete (recognizing names of saved entities) exists within the scope of this project that can be used for figa FA.

1.2 Project Repository

All the programs and scripts can be found in a git repository on server knot09:

 git clone ssh://user@knot09.fit.vutbr.cz/var/decipher/secapi/
        

Files:

 secapi/NER/figav08 // directory containing figa tool, version 08
 secapi/NER/figav08/make_automat/ // directory containing tools to create FA
 secapi/NER/autocomplete // directory containing autocomplete tool for FA
        

1.3 Figav08

Figa is adjusted to search in FA, that has name of the entity as well as information about it's location in KB (row number). This information is later used to find entities in file with Knowledge Base (KB.all) that was used to create the FA. The output is saved to a string type (for the purposes of python binding). To print the data found to stdout, use switch -p (see parameters).

Launch syntax:

 figav08 -d <automata> [-f <input_file>|-p|-h]
        

Parameters:

Launch examples:

 ./figav08 -d automata.fsa -p < soubor   (input from stdin | output to stdout)
 ./figav08 -d automata.fsa -f soubor     (vstup from file | output only to string)
 ./figav08 -d automata.fsa <<< "A Bird"  (input from stdin | output only to string)
        

Example output:

 row number in KB    start offset    end offset   entity name
 36739               1               6            A Bird
        

1.4 Finite automaton *.fsa

Finite automaton (automata.fsa) is created using tools fsa from Jan Daciuk (fsa_build or fsa_ubild). Script create_fsa_v05.sh was written to create FA for figa. However, it is necessary to respect the input format (see below).

Input FA data format:

 entity_name|row_number_in_KB[;another_number;...]
 (e.g. A Bird|36739)
        

Output data format. If the term corresponds to multiple entities (rows) in KB, they're separated using ;:

 row_number[;row_number;...] \t beginning_word_offset \t end_word_offset \t entity_name
        

Creating FA

Automaton is created using the earlier mentioned create_fsa_v05.sh script. File with Knowledge Base (KB.all) is a required parameter as well as files names and activities.

Usage:

 secapi/NER/figa/make_automat/create_fsa_v05.sh ../../KB.all names activities
        

FA is then created in a directory above.

The FA needs to be recreated whenever KB.all is changed. Otherwise there's a chance that records won't correspond with the searched string.

1.5 Identifying entity fragments

Figa can identify entity fragments in texts from KB.all. Output of a fragment differs from entity. row_number_in_KB is set to 0.

Example output:

 row number in KB    beginning offset    end offset    entity name
 0                   1                   5             James
        

1.6 Backtracking

This feature allows figa to go back to a word, that follows immediately after the current search start point. Figa searches for entities with the longest possible name.

More information can be found in comments of figa.cc file (functions lookup, lookup_file or lookup_string).

Example:

 "Já a James Blain jsme navštívili divadlo. A A Bird šel s námi."

 1. searched entity -> James Blain (beginning of string "^James Blain jsme ....$")
 2. -> Blain (beginning of string "^Blain jsme ....$")
 3. -> A A (beginning of string  "^A A Bird Å¡el ....$")
 4. -> A Bird (beginning of string "^A Bird Å¡el ....$")
        

This feature increases the odds of finding an entity in text.

1.7 UTF-8 support

All the tools used for FA (creating and figa) can process input string/file containing UTF-8 symbols. Converting function in Figa can be found in file figa.cc (utf2symbols and dec2hex). Conversion for the creation of FA can be found in file utf2symbols.cc. More information can be found in comments of the source code.

Tools use algorithm for conversion of UTF-8 symbols to ASCII ("Å¡" -> "&#x0161;").

1.8 Character skipping

To search for entities in full text a set of rules for characters skipping was created. These characters are simple skipped in input string/file before the search begins.

This only applies to the beginning of words, not during search or at the end of words.

Characters to be skipped can be added to the figa source code (figa.cc).

Adding characters to be skipped:

Find function bool marker::to_skip(char) and add another branch -> case 'character':.

1.9 Knowledge Base

Knowledge base (KB.all) contains several entity types. Data on each row are separated by tab. More information about KB.all can be found here.

1.10 Performance statistics

Testing data contain 10.000 - 10.000.000 names of entities.

Testing data can be found in git repository:

 secapi/NER/data/performance/
        

These stats are were measured on 22.11.2013 on server athena1:

 time ./figav08 -d automata.fsa -p < data10.test > /dev/null
 real 0.197    user 0.140    sys 0.054

 time ./figav08 -d automata.fsa -p < data100.test > /dev/null
 real 1.969    user 1.886    sys 0.058

 time ./figav08 -d automata.fsa -p < data1000.test > /dev/null
 real 15.420   user 15.282   sys 0.119

 time ./figav08 -d automata.fsa -p < data10000.test > /dev/null
 real 145.513  user 144.369  sys 0.728
        

1.11 Autocomplete

Autocomplete is a program created for the purposes of figa FA. It is a modified version of figa. Returns entities from FA with corresponding prefix for each input string.

The tool prints out 5 entities that share the same prefix and are in the FA for each input string.

Entities are ordered based on this priority list:

 " ',-.:aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ&0123456789#;|"
        

Program is a working prototype and can be found in git repository:

 secapi/NER/figa
        

There are multiple types of FAs for autocomplete. Characterised by prefixes, see KB.

Short description:

 VISUAL ARTIST (prefix: p)
 LOCATION (prefix: l)
 ARTWORK (prefix: w)
 MUSEUM ( prefix: c)
 EVENT (prefix: e)
 ART FORM (prefix: f) 
 ART MEDIUM (prefix: d)
 ART PERIOD MOVEMENT (prefix: m)
 ART GENRE (prefix: g)
 NATIONALITY (prefix: n)
 ALL (prefix: x)
        

Usage:

 secapi/NER/autocomplete/figa_autocomplete_v01 [-d -f -p -h -m]
        

Parameters:

figa_autocomplete launch example:

 ./figav08 -a -d l_automata.fsa -p < soubor  (input from stdin | output to stdout)
 ./figav08 -a -d w_automata.fsa -f soubor        (input from file | output only to string)
 ./figav08 -a -d p_automata.fsa -m 2 <<< "Paris"     (input from stdin | output only to string | 2 records)
        

Output example for input "Paris":

 Paris   
 2604662;2640630;2682383;2686449;2703450;2709352;2711956;2714570;2741670;2747387;2801217;2833100;2837 451;2861981;2875907;2901014;2904894;2954273;2956455;2975011;3116612;3135687;3179239;3184675;3238521;3306306;3400867;3401125;3425723;3464984;3515932
 Paris - Charles de Gaulle Airport       3181703
 Paris - Le Bourget Airport      3400405
 Paris - Orly Air Base   3462031
 Paris - Orly Airport    3147659
 Paris Appeal Court      3239613
        

2 Modifying tool figa + autocomplete

2.1 str2bin tool

This tool is used to convert numbers of rows in namelist.ASCII to binary format (to conserve space). It is implemented within the script creating FAs create_fsa_v05.sh. Due to issues with creating FA, the tool uses a rather simple encoding method (see below). Whenever a fragment is found (fragment_name|N), the tool writes the record without changing it.

Source code together with the program can be found here:

 secapi/NER/figa/make_automat/
        

Algorithm is described in the comments.

Input data format:

Entity name consists of a sequence of characters except '|', which separates entity name from row number. Numbers are in a string format, multiple numbers are separated by ';'. Row (record) ends with newline character '\n'.

 entity_name|row_number_in_KB[;next_number;...]

Output data format:

Entity name consists of a sequence of characters except '|', which separates entita name from row number. Numbers are in binary format (4 bytes per number), multiple numbers are not separated by any character. Row (record) ends with newline character '\n'.

 entity_name|row_number_in_KB[next_number][next_number]...
        

Conserved space in input file namelist.ASCII is roughly 12% (35 629 774 B for the current version of KB.all).

Conserved space in finite automata is roughly 5% (9 616 084 B for the current version of KB.all).

Encoding:

Numbers are loaded (along with text) from stdin, converted to a binary format and stored in an unsigned integer variable. Consequently 4 left-most bits are cut off and the remaining 28 are split in 4 parts (7 bits each), each of these is saved into a single byte (char). Each byte is shifted left by one bit and bit with index [0] is set to 1. These modified bytes are then sent to stdout.

This encoding was chosen due to an error when creating FA with binary number format, where some bytes would correspond with '\0' or '\n', which are used as delimiters. The downside of this encoding is lower highest possible number on (2^28 - 1) = 268 435 455.

Binary format of a number (32 bits):

 [7_bits] + [1] + [7_bits] + [1] + [7_bits] + [1] + [7_bits] + [1]
        

2.2 Creating automatons

An automaton is created for each tool. Scripts handling automatons creation can be found here:

 secapi/NER/figa/make_automat/create_fsa.sh
 secapi/NER/figa/make_automat/create_fsa_autocomplete.sh
        

Launch example (parameters are in source files):

 ./create_fsa.sh [-l] [-s] --knowledge-base=../../KBstatsMetrics.all --names=names
        

Parameters:

 ./create_fsa_autocomplete.sh ../../KBstatsMetrics.all # autocomplete automaton
        

2.3 Figa

Both figa and autocomplete are merged into a single program. However, autocomplete uses script autcomplete.py (see Autocomplete).

Using command make in directory secapi/NER/figa/sources/ creates a binary program named figav08 in directory secapi/NER/figa/.

Launch parameters:

Output description:

 [numbers of rows in KB separated by ;] \tab [beginning offset] \tab [end offset] \tab [entity/fragment] \tab [FLAG F/S]
        

Figa - delimiters

Delimiters serve multiple purposes in figa (see below). Function marker::to_skip() uses characters from string static const char *delimiters (figa.h file) - characters can be added and removed.

Purposes:

  1. Entity terminating characters
  2. Characters surrounding entities to be skipped
  3. Checkpoints for Backtracking

Example:

Fragment:

 ./figav08 -d automata.fsa -p <<< "Putin"
 0	1	5	Putin	F
        

Entity:

 ./figav08 -d automata.fsa -p <<< "London"
 2936223;3019186;3032807;3043156;3055700;3098156;3120755;3187307;3233389;3366587;3408764    1       6       London	F
        

Offset in characters:

 ./figav08 -d automata.fsa -p <<< "Tren�ín"
 2860691;3110343;3179887;3183231	1	7	Tren쒍쎭n	F  #wrong character interpretation
        

Offset in bytes:

 ./figav08 -d automata.fsa -p -b <<< "Tren�ín"
2860691;3110343;3179887;3183231	1	9	Tren쒍쎭n	F  # wrong character interpretation
        

2.4 Autocomplete

Script autocomplete.py is used for the purposes of autocomplete. Autocomplete supports all characters in the alphabet above (character priority list) in UTF-8, meaning it returns only characters from this alphabet and their base corresponds to input text.

UTF-8 characters are saved in FA in format "&#x0000;" where 0000 represents hexadecimal character [0-9A-F].

Usage:

 ./autocomplete_py [opts]

Parameters:

Example launch:

 ./autocomplete.py -d a_automata.fsa <<< "Ir"  # ./figav08 -a -d a_automata.fsa -p <<< "Ir"
 Ir Herbowo      2334250
 Ira Bach J      2454591
 Ira Belmont     2270971
 Ira Block       2334212
 Ira C Goodell   2382662
        

Entity with UTF-8 characters:

 ./autocomplete.py -d a_automata.fsa -m 7 <<< "Ire"     #./figav08 -a -d p_automata.fsa -p -m 7 <<<
 "Ire"
 Ire     1182597;1863634
  Ire Wardlaw     1542701
 Ire, Yuba       1024623
 Irec쎪 Valad쎣o    1097357  # wrong character interpretation
 Ireen Kirsch    446907
 Ireen Ntonyo    269368
 Ireen Sheer     668236
        

2.5 Backtracking and overlapping entities

Figa returns entities with longest possible names. Each word from input is in at most one entity on output. To return all found entities, use switch -o. Backtracking uses checkpoint in the form of characters from variable static const char *delimiters (inside figa.h file), that it returns to if a search is unsuccsessful or if switch -o was used. Characters can be added to and removed from this string.

Launch example without -o:

 ./figav08 -d automata.fsa -p <<< "Paris Hilton"
 1347157;3339718	1	12	Paris Hilton	F
        

Launch example with -o:

 ./figav08 -d automata.fsa -p -o <<< "Paris Hilton"
 3409299;3417530	1	5	Paris	F
 1347157;3339718	1	12	Paris Hilton	F
 3230791;3418233	7	12	Hilton	F
        

2.6 Spellchecking

Spellchecking was implemented within figa to identify entities, that have a typo in their name. It is launched using parameter -s FILE where FILE represents name of spellchecking automaton. This automaton has to be generated from the same KB as the base automaton.

Principle:

When identifying entities in input text, indexes of the first capital letters are stored and if no entity is identified in a word that starts with a capital letter, spellchecking launches and attempts to correct that word. The fuction first checks if the automaton contains records that are similar to that word (one different character, one character longer or shorter). If no such record is found in spellchecking automaton, processing of the input text continues. If a similar entity is found, the original word from input is returned + offsets of this word and row number of the similar entity. If multiple similar entities are found, their row numbers are returned as well as the word that is being "corrected".

Input and output examples:

 ./figav08 -d automata.fsa -p -s automata-spell.fsa <<< "Nathaniel"
 1755311;1992639	1	9	Nathaniel	F

 ./figav08 -d automata.fsa -p -s automata-spell.fsa <<< "Nsthaniel"
 1755311;1992639	1	9	Nsthaniel	S
        

2.7 check_namelist.py - namelist checking tool

Script is designed to check for errors that should not be in namelists. See switch --help for usage. Examples of errors that the script detects can be found below.

When the script is done checking namelist, it prints it's statistics (this can be expanded based on your needs).

There is an error threshold, meaning that when the script detects a certain amount of errors, it will stop checking and move onto the next namelist, or end (switch -n, see --help).

The script can also check namelists, that have KB row numbers in binary format (switch -b, see --help).

Detectable errors:

 Entity on line with missing pipe 2547
 Missing number after semicolon|1542;
 |1545
 Missing number|
 N is not at the end|N;5500
 utf_char-š�|55554
        

Script location:

 secapi/NER/figa/make_automat/check_namelist.py
        

Launch examples:

 ./check_namelist.py random_namelist.ASCII

 =====  random_namelist.ASCII  =====
 max. lenght of line: 158
     number of lines: 11713837
 -----------------------------------
 O.K.

 ./check_namelist.py -b namelist.bin

 =====  namelist.bin  =====
 max. lenght of line: 147
     number of lines: 11713837
 -----------------------------------
 O.K.
        

Error examples:

 ./check_namelist.py small_namelist.ASCII

 =====  small_namelist.ASCII  =====
 max. lenght of line: 55
     number of lines: 50
 -----------------------------------
 14: Alder River, Canada|    # Bad format.
 21: Alder Rock|3408516;    # Bad format.
 23: Alder Sherwood|N;1477681    # Bad format.
 32: |1371382    # Bad format.
 33: A�lder burn|3821588    # There is an unicode character in the namelist.