Vertical

A vertical file is a text file that contains exactly one token (a word, number, XML tag, or punctuation mark) on each line. It is used to store large amounts of natural-language text. A vertical file contains XML tags, but as a whole it is not a valid XML document: it has no root element, and some elements use unconventional syntax for defining attributes and setting their values. Attribute values are enclosed in double quotation marks (", ASCII code 34 decimal); the <length=N> tag is an exception. Lines are terminated by the \n character (ASCII code 10 decimal).

1 Semantics of used XML elements

Each entry lists the tag or element, its description, and its attributes.

<doc>
  Description: Paired tag. Encapsulates one standalone document (for example, one web page) in the vertical file. A vertical file can contain multiple documents.
  Attributes:
    title - document title. It can be extracted, for example, from the HTML <title> element of a web page.
    url - absolute URL of the original document.
    id - unique document ID, see Document ID.

<head>
  Description: Paired tag. Contains the verticalized document title and metadata (additional information about the document that is not part of its contents).
  Attributes: none

<p>
  Description: Paired tag. Encapsulates one paragraph of text.
  Attributes: none

<s>
  Description: Paired tag. Encapsulates one sentence of text.
  Attributes: none

<g/>
  Description: Unpaired tag, glue. It is inserted between two tokens that were not separated by any blank character (such as a space) in the original document, for example between a word and a following comma.
  Attributes: none

<link="URL">
  Description: Unpaired tag. Defines a link. Uses unconventional syntax for its attribute. The tag is always located on the same line as the last token of the link, separated from it by a TAB character.
  Attributes: the URL string is replaced by the actual absolute URL of the link.

<length=N>
  Description: Unpaired tag. Defines how many of the previous tokens belong to the link defined by the <link> tag. The tag is always located on the same line as the last token of the link, after the <link="URL"> tag, and is also separated by a TAB character.
  Attributes: N is replaced by the actual number of previous tokens belonging to the link.
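The tag descriptions above can be illustrated with a minimal Python sketch that reads vertical lines, collects links, and rebuilds plain text using the glue tag. The function names (parse_line, collect_links, detokenize) are hypothetical and not part of any verticalization tool. The sketch assumes that N in <length=N> counts the link's tokens including the token on the same line, that structural tags and <g/> are not counted as tokens, and that any line of the form <...> is a tag.

```python
import re

# Annotation shapes as described in the tag list: <link="URL"> and <length=N>.
LINK_RE = re.compile(r'^<link="(.*)">$')
LENGTH_RE = re.compile(r'^<length=(\d+)>$')


def parse_line(line):
    """Split one vertical line into (token, url, length).

    url and length are None when the line carries no link annotation.
    """
    fields = line.rstrip("\n").split("\t")
    token, url, length = fields[0], None, None
    for field in fields[1:]:
        link_match = LINK_RE.match(field)
        if link_match:
            url = link_match.group(1)
            continue
        length_match = LENGTH_RE.match(field)
        if length_match:
            length = int(length_match.group(1))
    return token, url, length


def is_tag(token):
    """Assumption: every token of the form <...> is a tag, not text."""
    return token.startswith("<") and token.endswith(">")


def collect_links(lines):
    """Return a list of (url, link_tokens) pairs.

    Assumption: N counts the tokens of the link including the token
    that shares a line with the <link> tag; tags are not counted.
    """
    words, links = [], []
    for line in lines:
        token, url, length = parse_line(line)
        if is_tag(token):
            continue
        words.append(token)
        if url is not None:
            count = length or 1  # a missing or zero length falls back to 1
            links.append((url, words[-count:]))
    return links


def detokenize(lines):
    """Rebuild plain text: tokens are joined by spaces unless <g/> precedes them."""
    text = ""
    glue = True  # no space before the very first token
    for line in lines:
        token, _, _ = parse_line(line)
        if token == "<g/>":
            glue = True
            continue
        if is_tag(token):
            continue
        text += token if glue else " " + token
        glue = False
    return text
```

For the hypothetical lines Visit, our, home, page (followed by TAB, <link="http://example.com/">, TAB, <length=2>), <g/>, and a full stop, collect_links would pair the URL with the tokens home and page, and detokenize would rebuild the sentence without a space before the full stop.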

2 Document ID

Every document in a vertical file has a unique ID, either provided in the input (supported by Universal_Verticalization_Format) or generated by the verticalization software. A generated ID has the format xxxxxxxxxxxxxxxx-yyyyyyyyyyyyyyyyzzzzzzzzzzzzzzzz, where each x, y, and z represents one hexadecimal digit. Each of the three parts is a 64-bit hash computed by the xxHash function. The x part is a hash of the name of the verticalization software's input file (the file name only, not the whole path), the y part is a hash of the first half of the document's absolute URL, and the z part is a hash of the second half of the document's absolute URL.
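The ID layout described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the verticalization software's actual code: the function name make_document_id is hypothetical, strings are assumed to be hashed as UTF-8 bytes, the URL is assumed to be split at half its byte length (with the extra byte going to the second half when the length is odd), and the 64-bit hash is passed in as a callable so that the seed and library choice stay explicit.

```python
import os
from typing import Callable


def make_document_id(input_path: str, url: str,
                     hash64: Callable[[bytes], int]) -> str:
    """Build an ID of the form x{16}-y{16}z{16} (hexadecimal digits).

    hash64 is any function mapping bytes to a 64-bit integer; the format
    described above uses a 64-bit xxHash here.
    """
    # Only the file name is hashed, not the whole path.
    file_name = os.path.basename(input_path).encode("utf-8")

    # Assumption: the URL is halved by byte count after UTF-8 encoding.
    url_bytes = url.encode("utf-8")
    middle = len(url_bytes) // 2
    first_half, second_half = url_bytes[:middle], url_bytes[middle:]

    # Each part is rendered as 16 zero-padded lowercase hex digits.
    x, y, z = (
        format(hash64(part) & 0xFFFFFFFFFFFFFFFF, "016x")
        for part in (file_name, first_half, second_half)
    )
    return f"{x}-{y}{z}"
```

With the third-party xxhash package installed, one possible choice is hash64=lambda data: xxhash.xxh64(data).intdigest(), which uses the default seed of 0. Whether the real software uses that seed, or lowercase digits, is not stated here, so IDs produced by this sketch are not guaranteed to match generated ones.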


3 Title and Metadata

The <head></head> element contains the verticalized title and metadata (one token per line). The title and the metadata are separated by the ; character. This character is required at the beginning of the metadata even if the title is empty.

Each metadata entry has the following form:

 name
 :
 value
 ;
        

Example of <head> element:

 <head>
 Universities
 in
 Brno
 ;
 places
 :
 czech
 republic
 brno
 ;
 topics
 :
 technology
 school
 university
 ;
 </head>
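
The layout of the <head> element can be read back with a short Python sketch. The function name parse_head is hypothetical; it takes the token lines found between the opening and closing head tags. The sketch assumes that the tokens ; and : act only as separators (they never occur inside a title, name, or value) and that every metadata entry is terminated by ; as in the example above.

```python
def parse_head(tokens):
    """Split the tokens inside <head> into (title_tokens, metadata).

    metadata maps each name to the list of its value tokens.
    """
    if ";" not in tokens:
        raise ValueError("the ';' separator is required even for an empty title")

    # Everything before the first ';' is the verticalized title.
    split = tokens.index(";")
    title = tokens[:split]

    metadata = {}
    entry = []
    for token in tokens[split + 1:]:
        if token != ";":
            entry.append(token)
            continue
        # A ';' closes the current entry: name tokens, ':', value tokens.
        if ":" in entry:
            colon = entry.index(":")
            name = " ".join(entry[:colon])
            metadata[name] = entry[colon + 1:]
        entry = []
    return title, metadata
```

Applied to the example above, this would yield the title tokens Universities, in, Brno and two metadata entries, places and topics, each with three value tokens.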
        

4 Image Representation

An image occurrence in a document is represented by the __IMG__ string, followed on the same line by the <link> and <length> tags, each separated by a TAB character.

Example:

 __IMG__    <link="https://upload.wikimedia.org/wikipedia/commons/7/7a/Nohat-wiki-logo.png">
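
As a minimal sketch of how image occurrences could be collected, the following Python function scans vertical lines for the __IMG__ placeholder and returns the linked URLs. The name image_urls is hypothetical, and the code assumes that the fields on a line are separated by TAB characters as described above.

```python
def image_urls(lines):
    """Return the URL of every __IMG__ line that carries a <link="URL"> tag."""
    prefix, suffix = '<link="', '">'
    urls = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "__IMG__":
            continue  # not an image placeholder
        for field in fields[1:]:
            if field.startswith(prefix) and field.endswith(suffix):
                urls.append(field[len(prefix):-len(suffix)])
    return urls
```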