English Morphological Analyzer

Jump To … +

case_ignore.foma fallback.EN.foma fallback.ES.foma hyphenate.EN.foma hyphenate.ES.foma morfo.EN.foma morfo.ES.foma normalise.foma tokenize.foma verbs.ES.foma

¶

English Morphological Analyzer
¶

This is the component of the pipeline that analyzes English words based on a number of open-class word lists, present in the directory data/EN/OP/.

The analysis is done with Penn Treebank-style tags.
¶

Numerals
¶

Numbers written with letters are a closed class, so they’re automatically detected by the script composed by the makefile. Numbers written with digits are open class, and we detect them with this regex. This is a bit of a lenient regex, since it doesn’t care whether commas or dots are being used to separate thousands or decimals. However, since the input is already tokenized, it is better to recognize any such sequence as a number.
```
define Digit [%0|1|2|3|4|5|6|7|8|9];
define Number [Digit* [(%.|%,) Digit+]*] "+CD":0;
```
¶

Nouns

Noun lexemes are loaded from the corresponding file, and then the pluralization morpheme is added. A morpheme boundary symbol %^ is inserted, to let rewrite rules handle inter-morpheme phonology.

Nouns with irregular plurals bypass the suffix with the priority union operator (.P.). A few are listed here, and adding more would be a matter of just writing them in. Nouns that can form the plural both ways are also included.

define NLex @txt"data/EN/OP/NN.txt";

define NSuf "+NN":0 | "+NNS":[%^ s];

define NIrreg [{goose} "+NNS"]:{geese}
            | [{mouse} "+NNS"]:{mice}
            | [{sheep} "+NNS"]:{sheep}
            ;
define NBoth [{cactus} "+NNS"]:{cacti}
           | [{virus} "+NNS"]:{viri}
           ;
define Noun NIrreg .P. [NBoth | NLex NSuf];

¶

Verbs
¶

Verbs have more forms than nouns, and a tag is assigned to each of them. Each has its own specific morpheme, but a lot of irregularities also exist. As before, only a few are listed here. However, there is an example of a verb with only one irregular form (speak, spoke), a verb with two identical forms (say, said both as past and past participle), and a highly irregular verb (to be).

Expanding the list would only require adding the irregularities desired in the variable VIrreg. In a more serious endeavour to write an English analyzer, a better formalism (for example continuation classes, such as in lexc) might be more appropriate and comfortable to use.
```
define VLex @txt"data/EN/OP/VB.txt";

define VSuf "+VB":0 | "+VBZ":[%^ s] | "+VBG":[%^ i n g]
          | "+VBD":[%^ e d] | "+VBN":[%^ e n]
          ;
define VIrreg [{say} "+VBD"]:{said}
            | [{say} "+VBN"]:{said}
            | [{be} "+VBD"]:{was}
            | [{be} "+VBZ"]:{is}
            | [{be} "+VBP"]:{am}
            | [{be} "+VBP"]:{are}
            | [{be} "+VBN"]:{been}
            | [{speak} "+VBD"]:{spoke}
            ;
define Verb VIrreg .P. VLex VSuf;
```
¶

Adjectives & Adverbs

Adjectives and adverbs have very similary morphology, with comparative and superlative forms made with the same suffixes. Of course, a syntactic analyzer might find these forms made with the words “more” and “most”, but this is a word-by-word morphological analyzer, so that case is not studied. It is to be noted, though, that a perhaps more clever tokenizer might recognize multiword expressions, which then would give the analyzer more flexibility for some cases.

define JLex @txt"data/EN/OP/JJ.txt";

define JSuf "+JJ":0 | "+JJR":[%^ e r] | "+JJS":[%^ e s t];

define JIrreg [{good} "+JJR"]:{better}
            | [{good} "+JJS"]:{best}
            | [{bad} "+JJR"]:{worse}
            | [{bad} "+JJS"]:{worst}
            ;
define Adjective JIrreg .P. JLex JSuf;

define RBLex @txt"data/EN/OP/RB.txt";

define RBSuf "+RB":0 | "+RBR":[%^ e r] | "+RBS":[%^ e s t];

define Adverb RBLex RBSuf;

¶

Lexicon

The lexicon is just the possible words that we are going to recognize, with suffixes appended.

define Lexicon Number
             | Adverb
             | Noun
             | Verb
             | Adjective
             ;

¶

Ortographic rules
¶

The lexicon is then passed through ortographic rewrite rules, which take care of morpheme boundary phenomena. This include deletion (-> 0), insertion ([..] ->, where [..] represents an empty transition, but only one), and transformation. Finally the morpheme boundary indicators are dropped.
```
regex Lexicon
  .o. e -> 0 || _ %^ [e|i]
  .o. y -> i || _ %^ e
  .o. y -> i e || _ %^ s
  .o. [..] -> e || [s|z|x|{ch}|{sh}] _ %^ s
  .o. %^ -> 0;
```

English Morphological Analyzer

Numerals

Nouns

Verbs

Adjectives & Adverbs

Lexicon

Ortographic rules