6.4 Corpus Construction
The VisSE corpus was built as part of the eponymous VisSE project (“Visualizando la SignoEscritura”, Spanish for “Visualizing SignWriting”), which had the goal of improving the usability and accessibility of SignWriting in digital contexts. To form a base for further study, as well as a source of training samples for machine learning development, around 1950 instances of raw SignWriting were collected. These samples had been produced by Dr. José María Lahoz-Bengoechea over a span of years while learning Spanish Sign Language at Universidad Complutense de Madrid, and were originally a tool for his private study.
The samples were handwritten in pen or pencil and collected in vocabulary sheets. As part of the project, they were digitized, separated into a different file for each entry, and graphically enhanced to reduce scanning artifacts and other noise. During this process, a reference to the original vocabulary entry was kept, and it remains as a tentative gloss (in Spanish) for the logograms in the corpus.
The logograms are grouped into subsets according to the academic level at which they were collected. This has no intended meaning beyond being a way to organize the corpus. Nonetheless, due to the temporal separation of the records, this organization results in greater graphical and usage consistency within each of the subsets. It has also let us annotate incrementally, learning from each phase and improving the annotation schema each time a new set was added. The current release of the corpus does not contain all of the original logograms, but only those which have been annotated and revised: sets A1_1, A1_2, A1_3, A2 and B2_2, which together comprise 1146 annotations.
Apart from letting us learn from the annotation process and improve it for the following subset, this incremental approach allowed us to bootstrap the annotation. Once the first set had been fully annotated by hand, machine learning algorithms were trained on it and used to produce a preliminary annotation of the next subset. The resulting annotations still had to be checked and corrected manually, but the process was noticeably faster: some graphemes are easy for the algorithms to detect, so the human annotators could focus on the more difficult parts. As the algorithms improved, the speedup grew, and some tasks could be partially automated, such as drawing the bounding boxes for each grapheme. Readers interested in the machine learning aspect of our research can find more information in Chapter 8 (Sevilla, Díaz, and Lahoz-Bengoechea 2023).
6.4.1 Quevedo
The process of collecting, organizing, annotating, and performing machine learning on the data was complex, compounded by the fact that we were developing the annotation schema in parallel with the actual annotation of the data. Moreover, our annotations are complex and very specific, including both visual annotation of logograms and multi-feature annotation of graphemes. To deal with this complexity and the specific requirements of our task, a specialized tool named Quevedo was developed as part of the VisSE project. Computational access to the corpus and its features is therefore easiest when using Quevedo, and the on-disk format and organization of the corpus is that of a Quevedo dataset.
Quevedo is available on the Python Package Index, so, provided Python and pip are available, it can be installed with the command python3 -m pip install quevedo[web]. This also installs the web interface, which can be launched with quevedo web from the corpus root directory, allowing visual inspection of the logograms and their annotations.
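For instance, in a shell with a working Python 3 installation:

    # install Quevedo together with its web interface
    # (some shells, such as zsh, require quoting the brackets)
    python3 -m pip install "quevedo[web]"

    # then, from the corpus root directory, launch the web interface
    quevedo web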
The features of Quevedo and the on-disk format are all explained in the online documentation, but they are also briefly detailed in the following for parties who want to use the corpus with other tools or need access to the low-level details. All formats are open and standard, so all the data and features in the corpus can be accessed directly.
6.4.2 Computational representation and access
Logograms in the corpus are stored in the logograms directory, in subdirectories representing each of the subsets. Files are numbered sequentially, starting from 1, and each instance consists of two files: the source image is named with the index of the annotation plus the file extension .png, and the annotation data uses the same filename (the index) but with the .json extension. For example, the annotation data corresponding to the image logograms/A1_1/1.png can be found in the file logograms/A1_1/1.json.
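Since these are plain PNG and JSON files, they can be read with standard tools. As a minimal sketch, the following Python snippet (using only the standard library, and assuming the corpus root as the working directory) loads the annotation of the first logogram of set A1_1:

    import json
    from pathlib import Path

    # image and annotation share the same stem (the index)
    image_path = Path("logograms/A1_1/1.png")
    annotation_path = image_path.with_suffix(".json")

    # the annotation is a plain JSON dictionary of attributes
    with annotation_path.open(encoding="utf-8") as f:
        annotation = json.load(f)

    print(sorted(annotation.keys()))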
The JSON annotation file is a dictionary of attributes, among which there is a graphemes key containing an array of the different graphemes found in the logogram. Each of them is a dictionary as well, having a box key with the coordinates of the bounding box, and a tags key which is another dictionary representing the mapping from feature names to feature values.
The coordinates of the bounding boxes are 4-tuples of floating point numbers, in the format \((cx, cy, w, h)\). \((cx, cy)\) are the coordinates of the center of the box relative to the logogram, ranging from 0 to 1, with \((0, 0)\) being the top left corner and \((1, 1)\) the bottom right one. \((w, h)\) are the width and height of the grapheme region, again relative to the width and height of the logogram, so also ranging from 0 to 1. The grapheme tags are stored as strings, both the feature names and their values.
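As an illustration of this format, the following hypothetical helper (not part of Quevedo) converts such a relative box into absolute pixel coordinates of the top left and bottom right corners, given the pixel dimensions of the logogram image:

    def box_to_pixels(box, img_width, img_height):
        # box is (cx, cy, w, h), all relative to the image dimensions
        cx, cy, w, h = box
        left = (cx - w / 2) * img_width
        top = (cy - h / 2) * img_height
        right = (cx + w / 2) * img_width
        bottom = (cy + h / 2) * img_height
        return left, top, right, bottom

    # a grapheme centered in a 200x100 image, covering half of each dimension:
    print(box_to_pixels((0.5, 0.5, 0.5, 0.5), 200, 100))
    # (50.0, 25.0, 150.0, 75.0)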
Aside from the graphemes key, logogram annotation files include some other information used by Quevedo. The meta key stores a dictionary of additional metadata for the logogram, where the original gloss can be found, as well as some boolean flags which we have used to mark and exclude a few problematic graphemes.
There is also a fold key which stores a number: the index of the fold to which the annotation belongs. Folds can be used to split the data in the corpus along a dimension different from that of the subsets, which can be useful, for example, for logically partitioning the data into training and evaluation sets. Storing this split in the annotations as a fold number helps make experiments reproducible and sound, and their results comparable. Each logogram is assigned a number ranging from 0 to 9, and these numbers split the corpus data into 10 approximately equally-sized folds. In our experiments we use folds 0-7 for training and folds 8-9 for testing.
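For example, such a split can be reproduced with a short standard-library Python script (a sketch, assuming the file layout described above):

    import json
    from pathlib import Path

    train, test = [], []
    for annotation_path in Path("logograms").glob("*/*.json"):
        with annotation_path.open(encoding="utf-8") as f:
            fold = json.load(f)["fold"]
        # folds 0-7 train, folds 8-9 test
        (train if fold <= 7 else test).append(annotation_path)

    print(len(train), "training /", len(test), "test logograms")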
An example annotation file can be seen in Listing 6.1 at the end of the article.
6.4.3 Other files
Inside the corpus root directory there are a number of other files and directories not mentioned above. Especially relevant is the networks directory, where the weights of neural networks trained on the corpus data are stored. These are included with the corpus so that interested parties can reproduce some of our experiments and pipelines without having to train the algorithms themselves. Visual testing of the networks and pipelines is also possible using the Quevedo web interface.
In the scripts directory, some utility Python scripts are also included, which some researchers might find useful. A dvc.yaml file can also be found in the root directory, for use with DVC (Kuprieiev et al. 2021). This file can be used to run some common tasks on the corpus data. Relevant here may be the extract step, which extracts all graphemes from the logograms and turns them into annotation files of their own in a graphemes directory, allowing them to be processed independently of the rest of the logogram; a sketch of this operation is given below.
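With DVC installed, the step should be runnable with dvc repro extract, DVC's standard command for reproducing a named stage. Conceptually, the extraction amounts to cropping each grapheme's bounding box out of the logogram image; the following Pillow-based Python sketch illustrates the idea (it is not Quevedo's actual implementation, and the output naming is hypothetical):

    import json
    from pathlib import Path
    from PIL import Image  # Pillow

    logogram_path = Path("logograms/A1_1/1.png")
    annotation = json.loads(
        logogram_path.with_suffix(".json").read_text(encoding="utf-8"))

    image = Image.open(logogram_path)
    width, height = image.size

    for i, grapheme in enumerate(annotation["graphemes"], start=1):
        cx, cy, w, h = grapheme["box"]
        # relative center format -> absolute pixel corners
        crop = image.crop((round((cx - w / 2) * width),
                           round((cy - h / 2) * height),
                           round((cx + w / 2) * width),
                           round((cy + h / 2) * height)))
        crop.save(f"grapheme_{i}.png")  # hypothetical naming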
For more information on the dataset structure or other files, please refer to Quevedo’s documentation at https://agarsev.github.io/quevedo, or to our other articles.