The goal of this project was to create a finite automaton (FA) from Knowledge Base (KB), that can be used in conjunction with figa (FIt GAzetter) to search for entities and their fragments in text.
Program autocomplete (recognizing names of saved entities) exists within the scope of this project that can be used for figa FA.
All the programs and scripts can be found in a git repository on server knot09:
git clone ssh://user@knot09.fit.vutbr.cz/var/decipher/secapi/
Files:
secapi/NER/figav08 // directory containing figa tool, version 08 secapi/NER/figav08/make_automat/ // directory containing tools to create FA secapi/NER/autocomplete // directory containing autocomplete tool for FA
Figa is adjusted to search in FA, that has name of the entity as well as information about it's location in KB (row number). This information is later used to find entities in file with Knowledge Base (KB.all
) that was used to create the FA. The output is saved to a string type (for the purposes of python binding). To print the data found to stdout, use switch -p
(see parameters).
Launch syntax:
figav08 -d <automata> [-f <input_file>|-p|-h]
Parameters:
-d
- specifies name of file with FA (default automata.fsa)-f
- specifies input from file-p
- prints output to stdout-h
- prints help and terminatesLaunch examples:
./figav08 -d automata.fsa -p < soubor (input from stdin | output to stdout) ./figav08 -d automata.fsa -f soubor (vstup from file | output only to string) ./figav08 -d automata.fsa <<< "A Bird" (input from stdin | output only to string)
Example output:
row number in KB start offset end offset entity name 36739 1 6 A Bird
Finite automaton (automata.fsa
) is created using tools fsa from Jan Daciuk (fsa_build or fsa_ubild). Script create_fsa_v05.sh
was written to create FA for figa. However, it is necessary to respect the input format (see below).
Input FA data format:
entity_name|row_number_in_KB[;another_number;...] (e.g. A Bird|36739)
Output data format. If the term corresponds to multiple entities (rows) in KB, they're separated using ;
:
row_number[;row_number;...] \t beginning_word_offset \t end_word_offset \t entity_name
Creating FA
Automaton is created using the earlier mentioned create_fsa_v05.sh
script. File with Knowledge Base (KB.all
) is a required parameter as well as files names and activities.
Usage:
secapi/NER/figa/make_automat/create_fsa_v05.sh ../../KB.all names activities
FA is then created in a directory above.
The FA needs to be recreated whenever KB.all
is changed. Otherwise there's a chance that records won't correspond with the searched string.
Figa can identify entity fragments in texts from KB.all
. Output of a fragment differs from entity. row_number_in_KB
is set to 0.
Example output:
row number in KB beginning offset end offset entity name 0 1 5 James
This feature allows figa to go back to a word, that follows immediately after the current search start point. Figa searches for entities with the longest possible name.
More information can be found in comments of figa.cc
file (functions lookup, lookup_file
or lookup_string
).
Example:
"Já a James Blain jsme navÅ¡tÃvili divadlo. A A Bird Å¡el s námi." 1. searched entity -> James Blain (beginning of string "^James Blain jsme ....$") 2. -> Blain (beginning of string "^Blain jsme ....$") 3. -> A A (beginning of string "^A A Bird Å¡el ....$") 4. -> A Bird (beginning of string "^A Bird Å¡el ....$")
This feature increases the odds of finding an entity in text.
All the tools used for FA (creating and figa) can process input string/file containing UTF-8 symbols. Converting function in Figa can be found in file figa.cc
(utf2symbols and dec2hex). Conversion for the creation of FA can be found in file utf2symbols.cc
. More information can be found in comments of the source code.
Tools use algorithm for conversion of UTF-8 symbols to ASCII ("Å¡" -> "š").
To search for entities in full text a set of rules for characters skipping was created. These characters are simple skipped in input string/file before the search begins.
This only applies to the beginning of words, not during search or at the end of words.
Characters to be skipped can be added to the figa source code (figa.cc
).
Adding characters to be skipped:
Find function bool marker::to_skip(char)
and add another branch -> case 'character':
.
Knowledge base (KB.all
) contains several entity types. Data on each row are separated by tab. More information about KB.all can be found here.
Testing data contain 10.000 - 10.000.000 names of entities.
Testing data can be found in git repository:
secapi/NER/data/performance/
These stats are were measured on 22.11.2013 on server athena1:
time ./figav08 -d automata.fsa -p < data10.test > /dev/null real 0.197 user 0.140 sys 0.054 time ./figav08 -d automata.fsa -p < data100.test > /dev/null real 1.969 user 1.886 sys 0.058 time ./figav08 -d automata.fsa -p < data1000.test > /dev/null real 15.420 user 15.282 sys 0.119 time ./figav08 -d automata.fsa -p < data10000.test > /dev/null real 145.513 user 144.369 sys 0.728
Autocomplete is a program created for the purposes of figa FA. It is a modified version of figa. Returns entities from FA with corresponding prefix for each input string.
The tool prints out 5 entities that share the same prefix and are in the FA for each input string.
Entities are ordered based on this priority list:
" ',-.:aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ&0123456789#;|"
Program is a working prototype and can be found in git repository:
secapi/NER/figa
There are multiple types of FAs for autocomplete. Characterised by prefixes, see KB.
Short description:
VISUAL ARTIST (prefix: p) LOCATION (prefix: l) ARTWORK (prefix: w) MUSEUM ( prefix: c) EVENT (prefix: e) ART FORM (prefix: f) ART MEDIUM (prefix: d) ART PERIOD MOVEMENT (prefix: m) ART GENRE (prefix: g) NATIONALITY (prefix: n) ALL (prefix: x)
Usage:
secapi/NER/autocomplete/figa_autocomplete_v01 [-d -f -p -h -m]
Parameters:
-d
- required; specifies file with FA (typically [plwcefdmgnx]_automata.fsa)-f
- optional; specifies input from file-p
- optional; output to stdout-h
- optional; prints help and terminates-m
- optional; number of entities to be found (default 5)figa_autocomplete launch example:
./figav08 -a -d l_automata.fsa -p < soubor (input from stdin | output to stdout) ./figav08 -a -d w_automata.fsa -f soubor (input from file | output only to string) ./figav08 -a -d p_automata.fsa -m 2 <<< "Paris" (input from stdin | output only to string | 2 records)
Output example for input "Paris":
Paris 2604662;2640630;2682383;2686449;2703450;2709352;2711956;2714570;2741670;2747387;2801217;2833100;2837 451;2861981;2875907;2901014;2904894;2954273;2956455;2975011;3116612;3135687;3179239;3184675;3238521;3306306;3400867;3401125;3425723;3464984;3515932 Paris - Charles de Gaulle Airport 3181703 Paris - Le Bourget Airport 3400405 Paris - Orly Air Base 3462031 Paris - Orly Airport 3147659 Paris Appeal Court 3239613
This tool is used to convert numbers of rows in namelist.ASCII
to binary format (to conserve space). It is implemented within the script creating FAs create_fsa_v05.sh
. Due to issues with creating FA, the tool uses a rather simple encoding method (see below). Whenever a fragment is found (fragment_name|N
), the tool writes the record without changing it.
Source code together with the program can be found here:
secapi/NER/figa/make_automat/
Algorithm is described in the comments.
Input data format:
Entity name consists of a sequence of characters except '|', which separates entity name from row number. Numbers are in a string format, multiple numbers are separated by ';'. Row (record) ends with newline character '\n'.
entity_name|row_number_in_KB[;next_number;...]
Output data format:
Entity name consists of a sequence of characters except '|', which separates entita name from row number. Numbers are in binary format (4 bytes per number), multiple numbers are not separated by any character. Row (record) ends with newline character '\n'.
entity_name|row_number_in_KB[next_number][next_number]...
Conserved space in input file namelist.ASCII is roughly 12% (35 629 774 B for the current version of KB.all
).
Conserved space in finite automata is roughly 5% (9 616 084 B for the current version of KB.all
).
Encoding:
Numbers are loaded (along with text) from stdin, converted to a binary format and stored in an unsigned integer variable. Consequently 4 left-most bits are cut off and the remaining 28 are split in 4 parts (7 bits each), each of these is saved into a single byte (char). Each byte is shifted left by one bit and bit with index [0] is set to 1. These modified bytes are then sent to stdout.
This encoding was chosen due to an error when creating FA with binary number format, where some bytes would correspond with '\0' or '\n', which are used as delimiters. The downside of this encoding is lower highest possible number on (2^28 - 1) = 268 435 455.
Binary format of a number (32 bits):
[7_bits] + [1] + [7_bits] + [1] + [7_bits] + [1] + [7_bits] + [1]
An automaton is created for each tool. Scripts handling automatons creation can be found here:
secapi/NER/figa/make_automat/create_fsa.sh secapi/NER/figa/make_automat/create_fsa_autocomplete.sh
Launch example (parameters are in source files):
./create_fsa.sh [-l] [-s] --knowledge-base=../../KBstatsMetrics.all --names=names
Parameters:
-l
- lowercase automaton-s
- spellchecking automaton./create_fsa_autocomplete.sh ../../KBstatsMetrics.all # autocomplete automaton
Both figa and autocomplete are merged into a single program. However, autocomplete uses script autcomplete.py (see Autocomplete).
Using command make in directory secapi/NER/figa/sources/
creates a binary program named figav08 in directory secapi/NER/figa/
.
Launch parameters:
-d FILE
- required; specifies file with FA (*.fsa)-a
- enables autocomplete-b
- enables offset in bytes on output (offset on output is in characters by default) [ONLY FIGA]-f FILE
- input file-h
- prints help-m NUMBER
- number of entities to return if autocomplete is enabled (default 5) [ONLY AUTOCOMPLETE]-o
- figa includes overlapping entities in output [ONLY FIGA]-p
- prints output to stdout-q
- text in quotation marks/apostrophes is considered a whole, rather than parts as entities [ONLY FIGA]-s FILE
- enables spellchecking and specifies name of spellchecking automaton [ONLY FIGA]-x
- output contains all entities with prefix that was specified on input [ONLY AUTOCOMPLETE]Output description:
[numbers of rows in KB separated by ;] \tab [beginning offset] \tab [end offset] \tab [entity/fragment] \tab [FLAG F/S]
Figa - delimiters
Delimiters serve multiple purposes in figa (see below). Function marker::to_skip()
uses characters from string static const char *delimiters
(figa.h
file) - characters can be added and removed.
Purposes:
marker::to_skip()
returns true for character .
, returns the same if a the character is in string *delimiters
marker::to_skip()
or string *delimiters
between the two entities will be skipped.Example:
Fragment:
./figav08 -d automata.fsa -p <<< "Putin" 0 1 5 Putin F
Entity:
./figav08 -d automata.fsa -p <<< "London" 2936223;3019186;3032807;3043156;3055700;3098156;3120755;3187307;3233389;3366587;3408764 1 6 London F
Offset in characters:
./figav08 -d automata.fsa -p <<< "TrenÄ�Ãn" 2860691;3110343;3179887;3183231 1 7 Tren쒍쎭n F #wrong character interpretation
Offset in bytes:
./figav08 -d automata.fsa -p -b <<< "TrenÄ�Ãn" 2860691;3110343;3179887;3183231 1 9 Tren쒍쎭n F # wrong character interpretation
Script autocomplete.py
is used for the purposes of autocomplete. Autocomplete supports all characters in the alphabet above (character priority list) in UTF-8, meaning it returns only characters from this alphabet and their base corresponds to input text.
UTF-8 characters are saved in FA in format "�" where 0000 represents hexadecimal character [0-9A-F].
Usage:
./autocomplete_py [opts]
Parameters:
-a --return-all
- return all possible entities-d dictionary
- automaton file (multiple files allowed)-f FILE
- input file-h --help
- show this help message and exit-l --lowercase
- convert input to lowercase-m NUMBER
- define number of returned entities (default value is 5)-r --remove-accent
- remove accent from inputExample launch:
./autocomplete.py -d a_automata.fsa <<< "Ir" # ./figav08 -a -d a_automata.fsa -p <<< "Ir" Ir Herbowo 2334250 Ira Bach J 2454591 Ira Belmont 2270971 Ira Block 2334212 Ira C Goodell 2382662
Entity with UTF-8 characters:
./autocomplete.py -d a_automata.fsa -m 7 <<< "Ire" #./figav08 -a -d p_automata.fsa -p -m 7 <<< "Ire" Ire 1182597;1863634 Ire Wardlaw 1542701 Ire, Yuba 1024623 Irec쎪 Valad쎣o 1097357 # wrong character interpretation Ireen Kirsch 446907 Ireen Ntonyo 269368 Ireen Sheer 668236
Figa returns entities with longest possible names. Each word from input is in at most one entity on output. To return all found entities, use switch -o
. Backtracking uses checkpoint in the form of characters from variable static const char *delimiters
(inside figa.h
file), that it returns to if a search is unsuccsessful or if switch -o
was used. Characters can be added to and removed from this string.
Launch example without -o
:
./figav08 -d automata.fsa -p <<< "Paris Hilton" 1347157;3339718 1 12 Paris Hilton F
Launch example with -o
:
./figav08 -d automata.fsa -p -o <<< "Paris Hilton" 3409299;3417530 1 5 Paris F 1347157;3339718 1 12 Paris Hilton F 3230791;3418233 7 12 Hilton F
Spellchecking was implemented within figa to identify entities, that have a typo in their name. It is launched using parameter -s FILE
where FILE
represents name of spellchecking automaton. This automaton has to be generated from the same KB as the base automaton.
Principle:
When identifying entities in input text, indexes of the first capital letters are stored and if no entity is identified in a word that starts with a capital letter, spellchecking launches and attempts to correct that word. The fuction first checks if the automaton contains records that are similar to that word (one different character, one character longer or shorter). If no such record is found in spellchecking automaton, processing of the input text continues. If a similar entity is found, the original word from input is returned + offsets of this word and row number of the similar entity. If multiple similar entities are found, their row numbers are returned as well as the word that is being "corrected".
Input and output examples:
./figav08 -d automata.fsa -p -s automata-spell.fsa <<< "Nathaniel" 1755311;1992639 1 9 Nathaniel F ./figav08 -d automata.fsa -p -s automata-spell.fsa <<< "Nsthaniel" 1755311;1992639 1 9 Nsthaniel S
Script is designed to check for errors that should not be in namelists. See switch --help
for usage. Examples of errors that the script detects can be found below.
When the script is done checking namelist, it prints it's statistics (this can be expanded based on your needs).
There is an error threshold, meaning that when the script detects a certain amount of errors, it will stop checking and move onto the next namelist, or end (switch -n
, see --help
).
The script can also check namelists, that have KB row numbers in binary format (switch -b
, see --help
).
Detectable errors:
Entity on line with missing pipe 2547 Missing number after semicolon|1542; |1545 Missing number| N is not at the end|N;5500 utf_char-š�|55554
Script location:
secapi/NER/figa/make_automat/check_namelist.py
Launch examples:
./check_namelist.py random_namelist.ASCII ===== random_namelist.ASCII ===== max. lenght of line: 158 number of lines: 11713837 ----------------------------------- O.K. ./check_namelist.py -b namelist.bin ===== namelist.bin ===== max. lenght of line: 147 number of lines: 11713837 ----------------------------------- O.K.
Error examples:
./check_namelist.py small_namelist.ASCII ===== small_namelist.ASCII ===== max. lenght of line: 55 number of lines: 50 ----------------------------------- 14: Alder River, Canada| # Bad format. 21: Alder Rock|3408516; # Bad format. 23: Alder Sherwood|N;1477681 # Bad format. 32: |1371382 # Bad format. 33: A�lder burn|3821588 # There is an unicode character in the namelist.