6.1 Introduction

Modern linguistics relies increasingly on digital data: source instances of language along with annotations of their origin, meaning, or features. These assets are often organized into datasets or corpora, collections of annotated linguistic data sharing a theme or object of study. The creation and sharing of datasets can help immensely in research on a given subject, allowing empirical investigation and providing a shared substrate on which to discuss and compare theories.

As an object of increasing linguistic scrutiny, sign languages have also seen the construction of diverse corpora in recent times, covering some of the more than a hundred different sign languages in use in the world. Sign languages, however, present unique challenges due to their visual-gestural nature and, especially for linguists, their lack of a standard and widespread form of writing.

Fig. 6.1 − SignWriting transcription of the Spanish Sign Language sign for “lie”. A video can be seen online at SpreadTheSign: https://www.spreadthesign.com/es.es/word/22502/mentira/0/?q=mentira

Often, sign languages are recorded using video, and the meaning is annotated using glosses from the oral language in the same geographic region. This is, however, not a proper transcription, since sign languages are natural languages with a grammar and lexicon of their own, and, in order to properly capture them, a native system is needed.

There are a few existing proposals for transcribing sign languages into a written form. Unique among them, SignWriting (Sutton and Frost 2008) transposes the three-dimensional nature of signs into a two-dimensional arrangement of symbols, as can be seen in Figure 6.1. Different iconic symbols are used to represent the head, hands and other body parts, and their location and movement are recorded in an abstract and systematic manner.

However, the graphical nature of SignWriting means it is very different from the usual writing systems for oral languages, making it harder to process with existing tools and standards. Moreover, while some computational representations exist, SignWriting is very often shared in the form of images, which can be viewed without special fonts or software installed, but are impossible to process as text.

Therefore, to be able to linguistically process SignWriting in its image form, tools and standards are required. If these are to be developed empirically, source data are also needed. A few collections of sign language transcribed using SignWriting exist, but they are not research-oriented and deal only with the digital representation. Additionally, SignWriting can also be handwritten, and there are no research corpora of handwritten SignWriting that we are aware of.

In this article, we present the VisSE corpus of Spanish SignWriting, a collection of handwritten SignWriting instances representing signs of Spanish Sign Language. The instances have been graphically annotated, for which an extensive schema has been developed, recording both the lexical meaning of the different symbols involved and their spatial information, a fundamental part of SignWriting.

This corpus can be used to extract information on SignWriting, for research on the processing of graphical languages, for empirical study of the features of sign languages, for the training and evaluation of machine learning methods, or for other research purposes which have not occurred to us. We have used it in our previous and ongoing research and, believing it may be of use to the research community at large, we have freely released it online (Sevilla, Lahoz-Bengoechea, and Díaz 2022). We will continue to expand and improve it; this article describes its current state, how it has been built, and the annotation schema used.

In Section 6.2, related corpora and tools are discussed, as well as systems for computationally storing SignWriting. Section 6.3 explains the different objects in the corpus and how they have been annotated, while Section 6.4 focuses on the concrete details and computational aspects of its construction. Section 6.5 gives an overview of the data and some statistics, and in Section 6.6 a few conclusions are drawn and future work is described.

Due to its complexity, a full explanation of SignWriting is beyond the scope of this article, but enough detail will be given to allow the reader to follow the discussion. For more information, interested readers can consult the extensive documentation available online at https://signwriting.org.