6.4 Corpus Construction
The VisSE corpus was built as part of the eponymous VisSE project (“Visualizando la SignoEscritura”, Spanish for “Visualizing SignWriting”), which had the goal of improving the usability and accessibility of SignWriting in digital contexts. To form a base for further study, as well as a source of training samples for machine learning development, around 1950 instances of raw SignWriting were collected. These samples had been produced by Dr. José María Lahoz-Bengoechea over a span of years while learning Spanish Sign Language at Universidad Complutense de Madrid, and were originally a tool for his private study.
The samples were handwritten in pen or pencil and collected in vocabulary sheets. As part of the project, they were digitized, separated into a different file for each entry, and graphically enhanced to reduce scanning artifacts and other noise. During this process, a reference to the original vocabulary entry was kept, and it remains as a tentative gloss (in Spanish) for the logograms in the corpus.
The logograms are grouped into subsets according to the academic level at which they were collected. This has no intended meaning beyond being a way to organize the corpus. Nonetheless, due to the temporal separation of the records, this organization results in greater graphical and usage consistency within each of the subsets. It has also let us annotate incrementally, learning from each phase and improving the annotation schema each time a new set was added. The current release of the corpus does not contain all of the original logograms, but only those which have been annotated and revised: sets A1_1, A1_2, A1_3, A2 and B2_2, which together comprise 1146 annotations.
Apart from letting us learn from the annotation process and improve it for the following subset, this incremental approach allowed us to bootstrap the annotation. Once the first set had been fully annotated by hand, machine learning algorithms were trained on it and used to produce a preliminary annotation of the next subset. The resulting annotations still had to be checked and corrected manually, but the process was noticeably faster: some graphemes are easy for the algorithms to detect, so the human annotators could focus on the more difficult parts. As the algorithms improved, the speedup grew, and some tasks could be partially automated, such as drawing the bounding boxes for each grapheme. Readers interested in the machine learning aspect of our research can find more information in Chapter 8 (Sevilla, Díaz, and Lahoz-Bengoechea 2023).
6.4.1 Quevedo
The process of collecting, organizing, annotating, and performing machine learning on the data was complex, compounded by the fact that we were developing the annotation schema in parallel with the actual annotation of the data. Moreover, our annotations are complex and very specific, including both visual annotation of logograms and multi-feature annotation of graphemes. To deal with this complexity and the specific requirements of our task, a specialized tool named Quevedo was developed as part of the VisSE project. Computational access to the corpus and its features is therefore easiest when using Quevedo, and the on-disk format and organization of the corpus is that of a Quevedo dataset.
Quevedo is available on the Python Package Index, so, provided Python and pip are available, it can be installed with the command python3 -m pip install quevedo[web]. This also installs the web interface, which can be launched with quevedo web from the corpus root directory, allowing visual inspection of the logograms and their annotations.
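For instance, in a shell with a working Python 3 installation:

    # install Quevedo together with its web interface
    # (some shells, such as zsh, require quoting the brackets)
    python3 -m pip install "quevedo[web]"

    # then, from the corpus root directory, launch the web interface
    quevedo web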
The features of Quevedo and the on-disk format are all explained in the online documentation, but they are also briefly detailed in the following for parties who want to use the corpus with other tools or need access to the low-level details. All formats are open and standard, so all the data and features in the corpus can be accessed directly.
6.4.2 Computational representation and access
Logograms in the corpus are stored in the logograms directory, in subdirectories representing each of the subsets. Files are numbered sequentially, starting from 1, and each instance consists of two files: the source image is named with the index of the annotation plus the file extension .png, and the annotation data uses the same filename (the index) but with the .json extension. For example, the annotation data corresponding to the image logograms/A1_1/1.png can be found in the file logograms/A1_1/1.json.
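Since these are plain PNG and JSON files, they can be read with standard tools. As a minimal sketch, the following Python snippet (using only the standard library, and assuming the corpus root as the working directory) loads the annotation of the first logogram of set A1_1:

    import json
    from pathlib import Path

    # image and annotation share the same stem (the index)
    image_path = Path("logograms/A1_1/1.png")
    annotation_path = image_path.with_suffix(".json")

    # the annotation is a plain JSON dictionary of attributes
    with annotation_path.open(encoding="utf-8") as f:
        annotation = json.load(f)

    print(sorted(annotation.keys()))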
The JSON annotation file is a dictionary of attributes, among which there is a graphemes key containing an array of the different graphemes found in the logogram. Each of them is a dictionary as well, having a box key with the coordinates of the bounding box, and a tags key which is another dictionary representing the mapping from feature names to feature values.
The coordinates of the bounding boxes are 4-tuples of floating point numbers, in the format \((cx, cy, w, h)\). \((cx, cy)\) are the coordinates of the center of the box relative to the logogram, ranging from 0 to 1, with \((0, 0)\) being the top left corner and \((1, 1)\) the bottom right one. \((w, h)\) are the width and height of the grapheme region, again relative to the width and height of the logogram, so also ranging from 0 to 1. The grapheme tags are stored as strings, both the feature names and their values.
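As an illustration of this format, the following hypothetical helper (not part of Quevedo) converts such a relative box into absolute pixel coordinates of the top left and bottom right corners, given the pixel dimensions of the logogram image:

    def box_to_pixels(box, img_width, img_height):
        # box is (cx, cy, w, h), all relative to the image dimensions
        cx, cy, w, h = box
        left = (cx - w / 2) * img_width
        top = (cy - h / 2) * img_height
        right = (cx + w / 2) * img_width
        bottom = (cy + h / 2) * img_height
        return left, top, right, bottom

    # a grapheme centered in a 200x100 image, covering half of each dimension:
    print(box_to_pixels((0.5, 0.5, 0.5, 0.5), 200, 100))
    # (50.0, 25.0, 150.0, 75.0)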
Aside from the graphemes key, logogram annotation files include some other information used by Quevedo. The meta key stores a dictionary of additional metadata for the logogram, where the original gloss can be found, as well as some boolean flags which we have used to mark and exclude a few problematic graphemes.
There is also a fold key which stores a number: the index of the fold to which the annotation belongs. Folds can be used to split the data in the corpus along a dimension different from that of the subsets, which can be useful, for example, for logically partitioning the data into training and evaluation sets. Storing this split in the annotations as a fold number helps make experiments reproducible and sound, and their results comparable. Each logogram is assigned a number ranging from 0 to 9, and these numbers split the corpus data into 10 approximately equally-sized folds. In our experiments we use folds 0-7 for training and folds 8-9 for testing.
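For example, such a split can be reproduced with a short standard-library Python script (a sketch, assuming the file layout described above):

    import json
    from pathlib import Path

    train, test = [], []
    for annotation_path in Path("logograms").glob("*/*.json"):
        with annotation_path.open(encoding="utf-8") as f:
            fold = json.load(f)["fold"]
        # folds 0-7 train, folds 8-9 test
        (train if fold <= 7 else test).append(annotation_path)

    print(len(train), "training /", len(test), "test logograms")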
An example annotation file can be seen in Listing 6.1 at the end of the article.
6.4.3 Other files
Inside the corpus root directory there are a number of other files and directories not mentioned above. Especially relevant is the networks directory, where the weights of neural networks trained on the corpus data are stored. These are included with the corpus so that interested parties can reproduce some of our experiments and pipelines without having to train the algorithms themselves. Visual testing of the networks and pipelines is also possible using the Quevedo web interface.
In the scripts directory, some utility Python scripts are also included, which some researchers might find useful. A dvc.yaml file can also be found in the root directory, for use with DVC (Kuprieiev et al. 2021). This file can be used to run some common tasks on the corpus data. Relevant here may be the extract step, which extracts all graphemes from the logograms and turns them into annotation files of their own in a graphemes directory, allowing them to be processed independently of the rest of the logogram; a sketch of this operation is given below.
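With DVC installed, the step should be runnable with dvc repro extract, DVC's standard command for reproducing a named stage. Conceptually, the extraction amounts to cropping each grapheme's bounding box out of the logogram image; the following Pillow-based Python sketch illustrates the idea (it is not Quevedo's actual implementation, and the output naming is hypothetical):

    import json
    from pathlib import Path
    from PIL import Image  # Pillow

    logogram_path = Path("logograms/A1_1/1.png")
    annotation = json.loads(
        logogram_path.with_suffix(".json").read_text(encoding="utf-8"))

    image = Image.open(logogram_path)
    width, height = image.size

    for i, grapheme in enumerate(annotation["graphemes"], start=1):
        cx, cy, w, h = grapheme["box"]
        # relative center format -> absolute pixel corners
        crop = image.crop((round((cx - w / 2) * width),
                           round((cy - h / 2) * height),
                           round((cx + w / 2) * width),
                           round((cy + h / 2) * height)))
        crop.save(f"grapheme_{i}.png")  # hypothetical naming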
For more information on the dataset structure or other files, please refer to Quevedo’s documentation at https://agarsev.github.io/quevedo, or to our other articles.