A vertical file is a text file that contains exactly one token (word, number, XML tag or punctuation mark) on each line. It is used to store large amounts of natural language text data. A vertical file contains XML tags, but as a whole it is not a valid XML document: it has no root element, and some of its elements use unconventional syntax for defining attributes and setting their values. Attribute values are surrounded by quotation marks (") (ASCII code 34 dec).
The <length=N> tag is an exception. The \n character (ASCII code 10 dec) is used to indicate new lines.
||Paired tag, encapsulates one standalone document (one web page, for example) in the vertical file. A vertical file can contain multiple documents.||
||Paired tag, contains the verticalized document title and metadata (additional information about the document that is not part of its contents).||none|
||Paired tag, encapsulates one paragraph of text.||none|
||Paired tag, encapsulates one sentence of text.||none|
||Unpaired tag, glue. It is inserted between two tokens that were not separated by any whitespace character (such as a space) in the original document, for example between a word and a comma.||none|
||Unpaired tag, defines a link. Uses an unconventional syntax for its argument. The tag is always located on the same line as the last token of the link, separated from it by a TAB character.||the
||Unpaired tag, defines how many of the previous tokens belong to the link defined by the <link> tag. The tag is always located on the same line as the last token of the link after the
Every document in a vertical file has a unique ID, either provided in the input (supported by Universal_Verticalization_Format) or generated by the verticalization software. The format of a generated ID is
z represent one hexadecimal digit. Each of the 3 parts, represented by a different character, is a 64-bit hash created by the xxHash function. The
x part is a hash created from the verticalization software input file name (not the whole path), the
y part is a hash of the first half of the document's absolute URL and the
z part is a hash of the second half of the document's absolute URL.
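The three hashes can be sketched in Python as follows. This is only an illustration under stated assumptions: xxHash is not part of the Python standard library, so a truncated BLAKE2b digest stands in for it and the resulting values will differ from real IDs. Splitting the URL at half its character count, the UTF-8 encoding, and the helper names `hash64_hex` and `id_parts` are assumptions too. The sketch returns the three parts separately and does not assemble them into a final ID string.

```python
import hashlib


def hash64_hex(data: bytes) -> str:
    """Return a 64-bit hash of data as 16 hexadecimal digits.

    Stand-in only: the real software uses xxHash; BLAKE2b truncated
    to 8 bytes is used here to keep the sketch dependency-free.
    """
    return hashlib.blake2b(data, digest_size=8).hexdigest()


def id_parts(input_path: str, url: str) -> tuple[str, str, str]:
    """Compute the x, y and z parts of a generated document ID."""
    # x: hash of the input file name only, not the whole path.
    file_name = input_path.replace("\\", "/").rsplit("/", 1)[-1]
    # y and z: hashes of the first and second half of the absolute URL.
    # Assumption: the URL is split at len(url) // 2 characters.
    middle = len(url) // 2
    return (
        hash64_hex(file_name.encode("utf-8")),
        hash64_hex(url[:middle].encode("utf-8")),
        hash64_hex(url[middle:].encode("utf-8")),
    )


x, y, z = id_parts("/data/in/pages.warc", "http://example.com/a")
print(x, y, z)
```

Because x depends only on the file name, the same input file found under a different directory yields the same x part.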
The <head></head> element contains the verticalized title and metadata (one token per row). The title and the metadata are separated by the ; character. This character is required at the beginning of the metadata even if the title is empty.
The metadata themselves look like this:
name : value ;
<head> Universities in Brno ; places : czech republic brno ; topics : technology school university ; </head>
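A minimal Python sketch of how such a head could be read back, assuming the tokens between the opening and closing tag have already been collected into a list. The function name `parse_head` is hypothetical, and the sketch assumes that metadata values never contain a standalone `;` or `:` token.

```python
def parse_head(tokens: list[str]) -> tuple[str, dict[str, str]]:
    """Split the tokens inside a head element into title and metadata."""
    # Group tokens into fields; every ';' token closes one field.
    fields: list[list[str]] = [[]]
    for token in tokens:
        if token == ";":
            fields.append([])
        else:
            fields[-1].append(token)

    # The first field is the title; it may be empty.
    title = " ".join(fields[0])

    metadata: dict[str, str] = {}
    for field in fields[1:]:
        if ":" not in field:
            continue  # skips the empty field after the final ';'
        colon = field.index(":")
        name = " ".join(field[:colon])
        metadata[name] = " ".join(field[colon + 1:])
    return title, metadata


head = ("Universities in Brno ; places : czech republic brno ; "
        "topics : technology school university ;")
title, metadata = parse_head(head.split())
print(title)
print(metadata)
```

With an empty title the token list starts with `;`, so the first field is empty and the metadata are still parsed correctly.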
An image occurrence in a document is represented by the
__IMG__ string followed by the TAB-separated tags <link> and <length> on the same line.
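The following Python sketch shows one way to split such a line into its token and the tags that follow it. The helper names are hypothetical, and the link tag is treated as an opaque string because its exact argument syntax is not reproduced here; `"<link ...>"` in the sample is only a placeholder, not real syntax. Only the `<length=N>` form is parsed.

```python
import re
from typing import Optional

# Matches the unquoted <length=N> tag and captures N.
LENGTH_TAG = re.compile(r"<length=(\d+)>")


def split_token_line(line: str) -> tuple[str, list[str]]:
    """Split one line into its token and any TAB-separated tags after it."""
    token, *tags = line.rstrip("\n").split("\t")
    return token, tags


def link_length(tags: list[str]) -> Optional[int]:
    """Return N from a <length=N> tag, or None if no such tag is present."""
    for tag in tags:
        match = LENGTH_TAG.fullmatch(tag)
        if match:
            return int(match.group(1))
    return None


def is_image(line: str) -> bool:
    """Tell whether the line's token is the __IMG__ image marker."""
    token, _ = split_token_line(line)
    return token == "__IMG__"


# "<link ...>" is a placeholder for the real link tag.
sample = "__IMG__\t<link ...>\t<length=1>\n"
token, tags = split_token_line(sample)
print(token, tags, link_length(tags), is_image(sample))
```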