6.5 Data Description
There are 1146 annotations in the corpus, 982 of which are fully
annotated logograms. The rest of the annotations are marked using the
exclude flag, most of them being long transcriptions of
polysyllabic signs rather than single logograms. These have been split
into two (or more) independent logograms, but the original annotations
are also included for reference. Some other transcriptions, marked
with the problem flag, present some kind of graphical or
representational problem. They have been excluded from annotation for
now, but are kept in the corpus.
Within the 982 fully annotated logograms, 6060 individual graphemes
can be found. Of these, 330 belong to the HEAD class, 1047 to DIAC,
1649 to HAND, 1369 to ARRO, 1292 to STEM and 373 to ARC. In Table
6.3, these numbers can be compared to the number of different
SHAPEs that can be found for each CLASS. As can be seen, the
proportions are very different: some classes, like ARRO or DIAC, have
only a few different possible grapheme SHAPEs but are abundantly
represented in the data, while other classes, like HAND or ARC, are
less abundant relative to the variability within the class.
CLASS | Graphemes | Unique Observations | SHAPEs | Appearance Rate |
---|---|---|---|---|
HAND | 1649 | 560 | 72 | 1.68 |
ARRO | 1369 | 23 | 3 | 1.39 |
STEM | 1292 | 15 | 2 | 1.32 |
DIAC | 1047 | 19 | 19 | 1.07 |
ARC | 373 | 37 | 6 | 0.38 |
HEAD | 330 | 20 | 20 | 0.34 |
total | 6060 | 674 | 122 | 6.17 |
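
The appearance rates in the table are consistent with dividing each class count by the 982 fully annotated logograms. The following is a minimal sketch of that arithmetic, not the code used to build the corpus statistics; the names `GRAPHEME_COUNTS`, `N_LOGOGRAMS` and `appearance_rates` are illustrative assumptions.

```python
# Grapheme counts per CLASS, copied from the table above.
GRAPHEME_COUNTS = {
    "HAND": 1649,
    "ARRO": 1369,
    "STEM": 1292,
    "DIAC": 1047,
    "ARC": 373,
    "HEAD": 330,
}
N_LOGOGRAMS = 982  # number of fully annotated logograms


def appearance_rates(counts, n_logograms):
    """Return the mean number of graphemes of each class per logogram."""
    return {cls: n / n_logograms for cls, n in counts.items()}


rates = appearance_rates(GRAPHEME_COUNTS, N_LOGOGRAMS)
total_rate = sum(GRAPHEME_COUNTS.values()) / N_LOGOGRAMS

for cls, rate in rates.items():
    print(f"{cls:5s} {rate:.2f}")
print(f"total {total_rate:.2f}")
```

Rounded to two decimals, the values match the last column of the table, including the total of 6.17 graphemes per logogram.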
Examining the rate of appearance of grapheme classes per logogram,
we can also make some interesting observations. The average number of
graphemes per logogram is about 6, 1.68 of which are hands.
Unfortunately, we cannot distinguish between bimanual signs and
transcriptions where a transformation in handshape is encoded, but
further semantic annotation would make this clear. Easier to compute
is the complexity of movement paths. Since most movements are marked
with an arrow head, the ratio of segments (STEM and ARC) to arrow
heads (ARRO) can give us an approximate measure of the mean number of
segments per path: \(\frac{1292+373}{1369} \approx 1.22\).
This means that most movement markers are simple, with just one
segment, but a non-trivial proportion (roughly one in five) is
more complex, having two or more segments.
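
The ratio above can be checked with a few lines of code. This is only a sketch under a simplifying assumption that the text does not state exactly: every path ends in one arrow head and has at least one segment. Under that assumption, the excess over 1 is an upper bound on the share of multi-segment paths, reached when every complex path has exactly two segments. The function name `mean_segments_per_path` is hypothetical.

```python
def mean_segments_per_path(stems, arcs, arrow_heads):
    """Approximate mean number of segments per movement path.

    Assumes (as a simplification) one arrow head per path.
    """
    return (stems + arcs) / arrow_heads


# Counts of STEM, ARC and ARRO graphemes reported in this section.
ratio = mean_segments_per_path(stems=1292, arcs=373, arrow_heads=1369)

# Segments beyond the first one, averaged over all paths.
excess = ratio - 1

print(f"mean segments per path: {ratio:.2f}")
print(f"extra segments per path: {excess:.2f}")
```

The excess of about 0.22 corresponds to the "one in five" estimate for paths with two or more segments.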
If we examine the distribution of tag combinations, however, we can
see that it is very skewed, as depicted in Figure 6.8. Some graphemes
are very common, while many combinations are rare, forming a very long
tail of infrequent graphemes. This also happens if we just look at the
SHAPE feature, and across classes, as can be seen in Figure 6.9.
Since this is not a corpus of real utterances or texts, but rather
a vocabulary, conclusions cannot be directly inferred about the
frequency of elements in SignWriting in use. However, we can make
observations across the vocabulary of the transcribed sign language,
seeing that some particular gestures (understood in the broad
linguistic sense) are much more common than others. For example, the
“touch” grapheme is extremely common, as is the PICAM- hand shape
(fingers extended but together), which acts as the “flat” object
descriptor in LSE (top left in Table 6.1).