Inflectional morphology with linear mappings

This methodological study provides a step-by-step introduction to a computational implementation of word and paradigm morphology using linear mappings between vector spaces for form and meaning. Taking as starting point the linear regression model, the main concepts underlying linear mappings are introduced and illustrated with R code. It is then shown how vector spaces can be set up for Latin verb conjugations, using 672 inflected variants of 2 verbs each from the four main conjugation classes. It turns out that mappings from form to meaning (comprehension), and from meaning to form (production) can be carried out loss-free. This study concludes with a demonstration that when the graph of triphones, the units that underlie the form space, is mapped onto a 2-dimensional space with a self-organising algorithm from physics (graphopt), morphological functions show topological clustering, even though morphemic units do not play any role whatsoever in the model. It follows, first, that evidence for morphemes emerging from experimental studies using, for instance, fMRI, to localize morphemes in the brain, does not guarantee the existence of morphemes in the brain, and second, that potential topological organization of morphological form in the cortex may depend to a high degree on the morphological system of a language.


Introduction
Introductions to linguistics and textbooks on morphology frame the discussion of form variation in terms of the theoretical construct of the morpheme, defined as the smallest linguistic unit combining form and meaning (see, e.g., Spencer, 1991; Plag, 2003; Booij, 2012; Lardier, 2014). Unsurprisingly, the vast majority of studies on morphological processing assume that morphemes are well-established theoretical notions and also exist in the mind. The dominant view of the mental lexicon in psychology and cognitive science is well represented by Zwitserlood (2018, p. 583):
"Parsing and composition - for which there is ample evidence from many languages - require morphemes to be stored, in addition to information as to how morphemes are combined, or to whole-word representations specifying the combination."
as well as by Butz and Kutter (2016, Chapter 13), according to whom "morphemes are the smallest meaning bearing grammatical units". Words are believed to be built from morphemes, and the meanings of complex words are assumed to be a compositional function of the meanings of their parts. In a morpheme-based lexicon, an agent noun such as worker can be derived by unification of the representations of its constituents (Lieber, 1980). Thus, given the verb work and the suffix -er, the agent noun worker is obtained. Reflecting a longstanding confound, the term 'morpheme' is often used in experimental work on the mental lexicon to refer solely to the form part of the linguistic morpheme (i.e., the 'morph'). In comprehension, accessing this form is seen as essential for gaining access to its meaning and its combinatorial properties (Taft, 1994; Marantz, 2013). In production, given the semantics to be expressed, a morphemic form is selected and aligned with the form of its base to prepare for articulation (Levelt et al., 1999).
However, the theoretical construct of the morpheme as the smallest unit combining form and meaning is highly problematic. Aronoff and Fudeman (2011) explicitly back off from the idea that morphemes are linguistic signs.
"We have purposely chosen not to use this definition. Some morphemes have no concrete form or no continuous form … and some do not have meanings in the conventional sense of the term."
Beard (1977) observed that in language change, morphological form and morphological meaning follow their own trajectories, and as a consequence the theoretical construct of the morpheme as a minimal sign combining form and meaning stands in the way of understanding the temporal dynamics of language change. Matthews, in his introduction to morphology (Matthews, 1974, 1991), pointed out that the inflectional system of a language such as Latin is not well served by analyses positing that its fusional system is best analyzed as underlyingly agglutinative (see also Hockett, 1960). One striking example of the noncompositional nature of inflectional morphology is provided by Estonian. In this language, most of the plural forms of nouns are built starting with the form of the partitive singular, without the semantics of these plural forms expressing in any way the semantics of the singular and the partitive (e.g., for jalg 'foot', jalga, partitive singular, jalgadele, adessive plural, see Erelt, 2003; Blevins, 2006).
A word-based perspective is common to a range of morphological approaches. As Booij (2018, p. 4-5) clarifies, construction morphology is
"… word-based morphology. That is, complex words are not seen primarily as a concatenation of morphemes, but as independent meaningful units within which certain subcomponents (morphemes) may be distinguished on the basis of paradigmatic relations with other words. That is, morphology is not to be equated with the 'syntax of morphemes'. Morphological schemas characterize the 'Gestalt' of complex words and their holistic properties."
In what follows, we restrict the use of the term morpheme to the concept of the morpheme as a minimal linguistic sign combining form and meaning in a syntax of minimal signs and combinations thereof, a concept that goes back to post-Bloomfieldian American structuralism (Blevins, 2016). A word-based perspective is fully compatible with notions of word-internal structure. We will use the term exponent (a term that goes back to Matthews) to refer to recurrent units of subword form (or form variation). These subcomponents or variants correlate with paradigmatic relations between words. Exponents are often understood in realizational models as the phonological 'spell-out' of one or more morphosyntactic properties (Stump, 2001), but they can also be interpreted discriminatively, as markers that distinguish larger meaningful units.
Although realizational theories eschew the construct of the morpheme, they embrace the construct of the exponent. For the most part, realizational models have focused on patterns of inflectional exponence. For example, the rule proposed in Matthews (1991, p. 127) spells out regular plurals in English by concatenating [z] to the noun stem x.
$$\langle [\text{Plural}, x] \rightarrow x + [z] \rangle \tag{1}$$
The notion of an exponent as a 'spell-out' or 'marker' can also be extended to the derivational domain. The constructional schema proposed by Booij (2016) (Example 3) illustrates this for English deverbal agent nouns, where the verbal stem x is followed by the exponent er.
For derived words with their own shades of meanings, further subschemata may be required. For worker, http://www.dictionary.com/ lists several meanings, including (1) a person or thing that works, (2) a laborer or employee, (3) a person engaged in a particular field, activity, or cause, or (4) a member of a caste of sexually underdeveloped, nonreproductive bees, specialized to collect food and maintain the hive. Schema (2) is adequate for the first reading. The other meanings require additional subschemata. For the worker-bee, such a subschema could be:
Using the computational mechanism of inheritance hierarchies, the idiosyncratic part of the lexical entry for worker-bee, [IS_BEE[]], can be separated out from the more general agentive nominalization defined by (2). Inheritance hierarchies have been applied most systematically in approaches that aim to factor variation into patterns of varying levels of generality, from patterns that characterize whole word classes, through those that distinguish inflection classes or subclasses, down to lexical idiosyncrasies. The most highly developed of these approaches include models of Network Morphology (Corbett and Fraser, 1993; Brown and Hippisley, 2012) and allied DATR-based accounts (Cahill and Gazdar, 1999).
For constructional analyses of inflectional paradigms, second-order schemata can be set up. The second-order schema formulated by Booij (2016) (Example 39) for English singular and plural nouns, illustrates that whereas a fully compositional calculus is set up for the meanings of complex words (cf. Jackendoff, 1990), the forms of these words are not hierarchically structured. Thus, an exponent realizing one or more semantic functions is nothing more than an additive change to the phonological form of a word. At a high level of symbolic abstraction, realizational and constructional theories isolate variation in form and meaning and clarify how form and meaning can go together in complex words. This type of analysis may be of practical value, especially in the context of adult second language acquisition. It is less clear whether the corresponding theories, whose practical utility derives ultimately from their pedagogical origins, can be accorded any cognitive plausibility. Constructional schemata, inheritance, and mechanisms spelling out exponents are all products of descriptive traditions that evolved without any influence from research traditions in psychology. As a consequence, it is not self-evident that these notions would provide an adequate characterization of the representations and processes underlying comprehension and production. It seems particularly implausible that children would be motivated to replicate the descriptive scaffolding of theoretical accounts and attempt to establish the systems of 'inflection classes' proposed for languages such as Estonian or the 'templates' associated with languages like Navajo.
The representations in current morphological theories and descriptions also tend to gloss over the actual details of words' forms as they are used in daily life. In spontaneous conversational speech, words can have substantially reduced forms (Ernestus et al., 2002;Johnson, 2004;Kemps et al., 2004). Furthermore, the actual phonetic realization of an exponent may depend on the semantic function that it realizes (Plag et al., 2017). Another complication is that the meanings of words, as illustrated above for worker, can be much richer than specified by the semantic function in constructional schemas such as (2). It therefore makes sense for Booij (2018) to characterize morphological schemas as the 'Gestalt' of complex words, but this raises new questions about the extent to which a schema such as (3) is truly transparent to the more general schema (2) of which it is an idiosyncratic instantiation.
The conception of words as 'recombinant Gestalts' (Ackerman et al., 2009) highlights a further difficulty. Realizational models retreat from the static form-meaning correspondence encapsulated in the structuralist morpheme. However, they retain a dynamic counterpart in assuming a stable correlation between the differences in meaning and differences in form between words. In the case of regular English plurals, the difference between the presence or absence of the 'plural' feature correlates with the presence or absence of the marker [z]. However, Gestalts do not work in this way. For example, the partitive singular of jalg 'foot' cited above, jalga, contrasts with the nominative singular jalg. Yet the contrast between partitive and nominative singular case is not realized by the presence or absence of -a. Although -a discriminates partitive and nominative singular forms, it is the theme vowel of jalg, and occurs in the genitive singular and nearly all forms of the noun.
A further problem is that morpho-syntactic features are not necessarily tied to a specific semantic function. For instance, ablative case in Latin can realize semantic functions that in English would be expressed by prepositions as different as from, with, by, in, and at. The reason that exponents such as o (for a particular class of masculine nouns) and ae (for a particular class of feminine nouns) are analyzed as realizing the ablative is that words with these exponents occur with the same range of abovementioned prepositional meanings. In other words, the morphosyntactic feature 'ablative' does not represent a semantic function, but a distribution class. This brings us to Word and Paradigm Morphology.
The approach of Word and Paradigm Morphology (Matthews, 1974, 1991) is different from that of construction morphology, in that proportional analogies between words within paradigms are assumed to make the lexicon as a system productive. As explained by Matthews (1991, p. 192f),
"In effect, we are predicting the inflections of servus by analogy with those of dominus. As Genitive Singular domini is to Nominative Singular dominus, so x (unknown) must be to Nominative Singular servus. What then is x? Answer: it must be servi. In notation, dominus : domini = servus : servi."
Only words exist, and an exponent such as the -i that realizes the genitive singular for masculine (and neuter) nouns of particular declension classes is implicit in the paradigmatic relations that characterize the Latin noun system. Importantly, exponents themselves have no independent existence; at best, they are a descriptive device useful to highlight paradigmatic analogies between the only units that do exist: full words.
An important aspect of Word and Paradigm Morphology is that morphosyntactic features in general represent distribution classes (Blevins, 2003, 2016), as explained above for the Latin ablative. But the same pertains to, for instance, the English singular, the use of which includes reference to single instances for count nouns (the pen), but also reference to non-individuated quantities (the milk), categories (the evil), and organizations (the church). The similarity in meaning of genitive or ablative forms in Latin, or the singular in English, then follows from the distributional hypothesis, which proposes that linguistic forms with similar distributions have similar meanings (Weaver, 1955; Firth, 1968; Landauer and Dumais, 1997). Thus, Matthews' analogy of forms dominus : domini = servus : servi is actually paralleled by an analogy of lexical co-occurrence distributions d:
$$d(\text{dominus}) : d(\text{domini}) = d(\text{servus}) : d(\text{servi})$$
We can highlight the complex system of multidimensional analogies at issue by writing
$$\begin{array}{ccccccc} \text{dominus} & : & \text{domini} & = & \text{servus} & : & \text{servi} \\ d(\text{dominus}) & : & d(\text{domini}) & = & d(\text{servus}) & : & d(\text{servi}) \end{array} \tag{5}$$
However, for this perspective on inflectional morphology and paradigmatic analogy to become computationally tractable, it is essential (1) to define words' forms in such a way that similarities between words can be calculated, (2) to represent the meanings of words in a distributional way so that the semantic similarity between words can be quantified, and (3) to formalize paradigmatic analogy mathematically so that it becomes computationally tractable. The theory laid out in Baayen et al. (2018) provides a simple but effective implementation of these ideas.
Word forms are represented by vectors of zeroes and ones specifying which triphones make up these forms. The choice for triphones is motivated by two considerations. First, phones are inherently contextual; for instance, information about the place of articulation of a plosive is carried by formant transitions in adjacent vowels. Second, triphones encode information about partial order: evidence for the triphones han and and implies a directed path han → and. For languages with strong phonotactic constraints, diphones are expected to be more effective as sublexical units of form (Pham and Baayen, 2015). For a richly inflecting language such as Latin, it is conceivable that four-phone sequences are also effective sublexical units. Given form units such as triphones, each word form is conceptualized as a point in a high-dimensional form space. How similar two forms are can be evaluated straightforwardly with the Pearson correlation of the pertinent form vectors.
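The following R code is a minimal sketch of how such binary triphone vectors could be set up and compared. The helper triphones() and the use of '#' as a word boundary marker are assumptions made for this illustration; they are not taken from the WpmWithLdl package. The three English words are toy examples only.

```r
# Hypothetical helper (not part of WpmWithLdl): split a word into triphones,
# using '#' as a word boundary marker (an assumption of this sketch).
triphones <- function(word) {
  w <- paste0("#", word, "#")
  n <- nchar(w)
  substring(w, 1:(n - 2), 3:n)
}

words <- c("hand", "hands", "land")
cues  <- sort(unique(unlist(lapply(words, triphones))))

# Binary form matrix: one row per word, one column per triphone.
Cmat <- t(sapply(words, function(w) as.numeric(cues %in% triphones(w))))
colnames(Cmat) <- cues

# Pearson correlations between the form vectors (the rows of Cmat).
form_sim <- cor(t(Cmat))
form_sim
```

In this toy example, hand shares three triphones with hands but only two with land, so its form vector correlates more strongly with that of hands.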
Word meanings are also represented by numeric vectors, taking inspiration from computational models of distributional semantics (see, e.g., Landauer and Dumais, 1997; Shaoul and Westbury, 2010; Mikolov et al., 2013). In what follows, we use the Pearson correlation to evaluate the similarity between semantic vectors. Another commonly used measure is the cosine similarity. Which measure is optimal depends in part on how semantic vectors are calculated. For the present study, the correlation measure turns out to be effective and is therefore selected.
Given vector representations for forms and meanings, we can now evaluate mathematically (and computationally) how similar words are and how similar meanings are. In other words, we can evaluate the similarities between the representations on the top row of (5) and the corresponding representations on the bottom row. What is still missing is a way of evaluating analogies within the rows of (5). For this, Baayen et al. (2018) propose to use the linear mapping from linear algebra. They show that for English, their Linear Discriminative Learner (LDL) model, which implements such mappings, is remarkably successful for the computational modeling of basic inflectional and derivational morphology.
The present study extends the LDL model to a language with a much richer inflectional system, Latin. We constructed a dataset that contains, for 8 verbs selected from the four major conjugation classes of Latin, all of the 6 forms in each of 14 paradigms. For each of the 672 paradigm forms in this dataset, we constructed the corresponding form vectors. We also constructed semantic vectors. Baayen et al. (2018) derived semantic vectors for English from a corpus.
In the present study, we simulated semantic vectors, leaving the construction of real semantic vectors from corpora of Latin to future research.
In what follows we pursue three goals. First, we seek to clarify, by means of a case study of a non-trivial inflectional system, whether it is actually possible for LDL to produce and understand inflected forms computationally without having to hand-craft stems, exponents, inflection classes, and inheritance hierarchies.
Our second goal is to provide an accessible introduction to the basic mathematical concepts used in LDL, and to provide a step-by-step guide to how this model can be applied using the statistical programming environment R (R Core Team, 2017). A package for R that implements Word and Paradigm Morphology with Linear Discriminative Learning (WpmWithLdl) can be downloaded from http://www.sfs.uni-tuebingen.de/~hbaayen/publications/WpmWithLdl_1.0.tar.gz.
Our third goal is to reflect on the cognitive plausibility of form units such as morphemes and exponents. A vast majority of experimental studies interprets effects observed in the laboratory as evidence for the psychological reality of the morpheme (see, e.g., Zwitserlood, 2018). However, reasoning from the consequent to the antecedent confuses necessity with sufficiency. We illustrate this fallacy for a study reporting the supposed localization of morphemes in the brain  by showing that our model predicts topological clustering of form that would seem to support the existence of morphemes -even though there are no morphemes or exponents whatsoever in our model. The hypothesis that we will put forward is that experimental effects traditionally understood as evidence for morphemes (or exponents) can be explained just as well with LDL.
Before introducing the Latin dataset, let us briefly clarify the similarities and differences between LDL and realizational/constructional theories of morphology (R/CM). Both LDL and R/CM assume that
1. morphology is word-based,
2. a word has at least one form and at least one meaning,
3. word forms do not have an internal hierarchical structure (no syntax of morphemes),
4. the meaning of an inflected or derived word incorporates 'grammatical' meanings such as agent or plural,
5. especially derived words can have their own idiosyncratic semantics, and
6. the system is productive.
LDL and R/CM differ in the following respects:
1. In R/CM, the word forms of complex words are constructed from persistent stems and exponents, whereas in LDL, there are no units for stems nor for exponents.
2. In R/CM, meanings tend to be tacitly associated with grammatical features. When specified, these meanings often take the form of monadic units, such as SEM_i in (2), or compositional functions, such as PLU[] in (4). In LDL, all meanings are represented by semantic vectors; importantly, inflectional and derivational meanings are represented by their own specific semantic vectors.
3. In R/CM, the meaning of a complex inflected word is determined by the features associated with the word. The meaning of a complex derived word is assumed to be a compositional function of the meanings of its parts. LDL, by contrast, is a discriminative theory in which the meaning of a transparent complex word is obtained by integrating over the semantic vectors of the base word and the semantic vector of the inflectional or derivational function (see Baayen et al., 2018, for detailed discussion). The semantic vector of a derived word, based on its own distribution, can differ from that obtained by integrating over the semantic vectors of stem and derivational function. The greater this difference, the more the derived word is perceived as semantically opaque (see also Marelli and Baroni, 2015).
4. In R/CM, representations and operations (such as inheritance) on representations are agnostic with respect to the tasks of production and comprehension. In LDL, production and comprehension have their own specific mappings (which mathematically are, approximately, each other's inverse).
In summary, LDL is a computational implementation of Word and Paradigm Morphology in which analogy, formalized as linear mappings over vectorized representations of form and meaning, drives comprehension and production. In this approach it is no longer necessary to hand-craft forms for stems and exponents, or to set up inflectional classes and inheritance hierarchies. LDL appears to work well for English. But does it also work for more complex inflectional systems such as the conjugations of the Latin verb? As a first step towards addressing this question, we introduce our dataset of Latin verbs.

Latin conjugations
The dataset on Latin conjugations is available in the R package WpmWithLdl.
To install the package, assuming the package source (WpmWithLdl_1.0.tar.gz) is available in the directory visible to R, proceed as follows:
install.packages("WpmWithLdl_1.0.tar.gz", repos = NULL)
The Latin dataset is available for use once the package and its dataset have been loaded in R.
The words under examination in this study are the finite verb forms of two verbs from each of the four major conjugation classes of Latin. For each verb, 14 paradigms (each with 3 persons × 2 numbers) were included (present/past × active/passive × indicative/subjunctive; perfect/pluperfect × indicative/subjunctive; future × active/passive). Table 1 presents the 14 inflected forms for the first person singular of the verbs vocaare, terrere, carpere, and audire, each belonging to one of the four conjugation classes. Long vowels are represented by vowel doubling. It can be seen that each conjugation class has its own idiosyncrasies. For instance, the first person singular of the 1st conjugation class does not contain the theme vowel characteristic of this class; the 2nd conjugation class has a perfect form without the v exponent found in the 1st and 4th classes; the 3rd conjugation class has a different stem form for the perfect tenses; and the 3rd and 4th conjugation classes do not make use of the b exponent for the future as used in the 1st and 2nd conjugation classes. In total, there are 672 different verb forms in the dataset.
The task that we set ourselves is to find a mapping from the Latin verb forms onto their meanings, and also a mapping from these meanings onto the corresponding forms, without making use of morphemes, exponents, and stems, and without having to set up inflectional classes. A trivial and uninteresting mapping would be to pair each of the 672 word forms with a monadic form unit, indexed from i = 1, 2, …, 672, and to set up a second set of monadic semantic units, indexed from j = 1, 2, …, 672.
We can then define pairs (i, j) and a function f(i) = j as well as a function g(j) = i that produce the meaning unit given a form unit, and a form unit given the meaning unit of a pair, respectively. Such a full listing set-up is uninteresting as it does not do justice to the similarities between forms, the similarities between meanings, the analogies between forms and meanings, and the productivity of the system. The approach that we adopt in what follows is to pair each word form i with a numeric vector $\mathbf{c}_i$ characterizing its form and a second numeric vector $\mathbf{s}_i$ characterizing its meaning. Below, we explain how form vectors and semantic vectors can be set up. Here, we note that the word forms can now be conceptualized as points in a high-dimensional form space {C}, and that word meanings can likewise be conceptualized as points in a second high-dimensional space {S}. We are interested in a mapping F that takes the points in {C} as input and produces the points in {S} as output. This mapping represents the comprehension of inflected forms. Likewise, we can set up a mapping from {S} to {C} to represent the production of inflected forms. In this study, we constrain these mappings to be linear. In what follows, we first introduce the mathematics of linear mappings. We then return to Latin and show how a form space and a semantic space can be constructed. We then examine how successful linear mappings are for moving back and forth between form and meaning.

Introduction to the mathematics of linear mappings
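Anyone who has run a linear regression analysis has made use of a particular instantiation of a linear mapping. Consider a dataset with n observations, each consisting of a response $y_i$ and k predictors $x_{i1}, x_{i2}, \ldots, x_{ik}$. By way of example, the response could be reaction time to words in a lexical decision task, and the predictors could be frequency of occurrence, word length, number of neighbors, …. A data table for such a data set with k = 2 predictors is of the form:
$$\begin{array}{ccc} y & x_{\cdot 1} & x_{\cdot 2} \\ \hline y_1 & x_{11} & x_{12} \\ y_2 & x_{21} & x_{22} \\ \vdots & \vdots & \vdots \\ y_n & x_{n1} & x_{n2} \end{array}$$
The estimates $\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n$ predicted by a linear regression model for observations i = 1, 2, …, n are a weighted sum of the data points $x_{i1}, x_{i2}, \ldots, x_{ik}$ (i = 1, 2, …, n), with weights $\beta_0, \beta_1, \ldots, \beta_k$ estimated from the data. For the above dataset with two predictors (k = 2), we can write
$$\begin{pmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_n \end{pmatrix} = \begin{pmatrix} \beta_0 + \beta_1 x_{11} + \beta_2 x_{12} \\ \beta_0 + \beta_1 x_{21} + \beta_2 x_{22} \\ \vdots \\ \beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2} \end{pmatrix} \tag{6}$$
Here, $\beta_0$ represents the estimated intercept, and $\beta_1$ and $\beta_2$ the estimated slopes for the predictors $x_{\cdot 1}$ and $x_{\cdot 2}$ respectively. Using notation from linear algebra, we can restate this equation more succinctly as follows. First, let
$$\hat{\mathbf{y}} = \begin{pmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_n \end{pmatrix},$$
let
$$\mathbf{X} = \begin{pmatrix} 1 & x_{11} & x_{12} \\ 1 & x_{21} & x_{22} \\ \vdots & \vdots & \vdots \\ 1 & x_{n1} & x_{n2} \end{pmatrix},$$
and let
$$\hat{\boldsymbol{\beta}} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{pmatrix}.$$
We use lower case bold font to denote column vectors, and upper case bold font to denote matrices (bundles of column vectors). The column vector with sums on the right hand side of (6) is the product of $\mathbf{X}$ and $\hat{\boldsymbol{\beta}}$. How to calculate the product of two matrices is illustrated for two 2 × 2 matrices in Figure 1. To obtain the value in the cell in the upper right of the resulting matrix ($c_{12}$), one takes the first row of the first matrix and the second column of the second matrix. These two vectors are aligned, the values are pairwise multiplied, and the resulting products are summed. The same procedure generalizes to larger matrices.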
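The correspondence between regression and matrix multiplication can be checked directly in R. The sketch below uses simulated data (the values and variable names are hypothetical, chosen only for illustration) and shows that the fitted values of lm are the product of the predictor matrix, extended with a column of ones for the intercept, and the vector of estimated weights. It also computes the upper right cell of a 2 × 2 product by hand.

```r
# Simulated data (hypothetical values, for illustration only).
set.seed(1)
n  <- 20
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n, sd = 0.1)

fit  <- lm(y ~ x1 + x2)
X    <- cbind(1, x1, x2)   # column of ones for the intercept
beta <- coef(fit)          # estimated intercept and slopes

# The fitted values are the matrix product of X and the weight vector.
yhat <- X %*% beta
all.equal(as.vector(yhat), unname(fitted(fit)))

# Upper right cell of a 2 x 2 product: first row of A times second column of B.
A   <- matrix(1:4, nrow = 2)
B   <- matrix(5:8, nrow = 2)
c12 <- sum(A[1, ] * B[, 2])
c12 == (A %*% B)[1, 2]
```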
In general, given matrices A (r rows and s columns) and B (s rows and t columns), the product C = AB is an r × t matrix such that the element in the i-th row and j-th column of C is given by
$$c_{ij} = \sum_{k=1}^{s} a_{ik} b_{kj}.$$
Thus, for the product of two matrices to be defined, the number of columns of the first matrix should be the same as the number of rows of the second matrix. In the case of the regression weights $\hat{\boldsymbol{\beta}}$, the number of rows of the 3 × 1 column matrix $\hat{\boldsymbol{\beta}}$ equals the number of columns of $\mathbf{X}$. Hence, (6) can be rewritten as
$$\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}. \tag{7}$$
In standard linear regression, we have one vector of observations, y. However, we can consider multiple response vectors simultaneously, which we bring together by placing their vectors into the columns of a matrix Y. We can now generalize (7) to obtain the following model:
$$\hat{\mathbf{Y}} = \mathbf{X}\mathbf{B}, \tag{8}$$
with B the matrix with the estimated regression weights. Geometrically, we can think of (8) as describing a mapping from one set of points in an n-dimensional space {X}, each point given by a row vector of X, onto another n-dimensional space {Y} in which each point is given by a row vector of Y. Ideally, B maps each point in {X} exactly onto its corresponding point in {Y}, but in practice, the mapping will often be approximate only. As in standard linear regression, we have to estimate the mapping B from the data. Given B, we can then obtain an estimate Ŷ of Y, which provides the estimated locations of the data points in {Y}. Figure 2 illustrates the geometry of such a mapping for two spaces {C} and {S}, and two mappings, F and G. The matrix C defines three datapoints in {C}, a, b, and c, with coordinates (1, 2), (−2, −2), and (−2, 1) respectively (shown in blue in the left part of Figure 2). The transformation matrix F maps these points onto the points (2, −4), (−4, 4), and (−4, −2) shown in red in the right part of this figure.
From the perspective of standard regression, C comprises two predictors, the x-coordinate and the y-coordinate. Each data point specifies values for these coordinates, and thus represents points in the plane spanned by the x and y axes, as shown in the left part of Figure 2. The first column of the matrix F, which maps points in {C} onto points in {S}, specifies the regression weights that we need to obtain the predicted x-coordinates in the space {S}, and the second column of F likewise provides the regression weights for the predicted y-coordinates in {S}. The following code snippet in R illustrates these mappings; %*% is R's operator for matrix multiplication.
# rows of C and S are the points a, b, c in {C} and {S}
C = matrix(c(1, -2, -2, 2, -2, 1), nrow = 3, ncol = 2)
S = matrix(c(2, -4, -4, -4, 4, -2), nrow = 3, ncol = 2)
# F maps {C} onto {S}; G is the inverse of F and maps {S} back onto {C}
F = matrix(c(2, 0, 0, -2), nrow = 2, ncol = 2)
G = matrix(c(0.5, 0, 0, -0.5), nrow = 2, ncol = 2)
C %*% F   # reproduces S
Given the points in spaces {C} and {S}, the question arises of how to obtain the transformation matrices F and G that map points from one space onto the other. For this, we require the matrix inverse. The inverse of a 2 × 2 square matrix A, denoted by $\mathbf{A}^{-1}$, is defined as follows:
$$\mathbf{A}^{-1} = \frac{1}{a_{11}a_{22} - a_{21}a_{12}} \begin{pmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{pmatrix}.$$
If the determinant $a_{11}a_{22} - a_{21}a_{12}$ is zero (i.e., when $a_{11}a_{22} = a_{21}a_{12}$), the inverse is not defined, and A is referred to as a singular matrix. In R, the inverse of a nonsingular, square matrix is obtained with the function solve. The matrices C and S are not square matrices, so we cannot use solve to obtain their inverse. We therefore derive the transformation matrix with a small detour, as follows.
$$\begin{aligned} \mathbf{C}\mathbf{F} &= \mathbf{S} \\ \mathbf{C}^T\mathbf{C}\mathbf{F} &= \mathbf{C}^T\mathbf{S} \\ (\mathbf{C}^T\mathbf{C})^{-1}\mathbf{C}^T\mathbf{C}\mathbf{F} &= (\mathbf{C}^T\mathbf{C})^{-1}\mathbf{C}^T\mathbf{S} \\ \mathbf{F} &= (\mathbf{C}^T\mathbf{C})^{-1}\mathbf{C}^T\mathbf{S} \end{aligned} \tag{9}$$
In this derivation, we pre-multiply C with its transpose because this results in a matrix that is square and smaller in size, and because it is square, we can now apply optimized algorithms as implemented in solve to obtain its inverse. By multiplying $\mathbf{C}^T\mathbf{C}$ with its inverse, the left hand side of (9) simplifies to the transformation matrix F, which is found to be equal to $(\mathbf{C}^T\mathbf{C})^{-1}\mathbf{C}^T\mathbf{S}$.
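The derivation can be traced step by step in R. The sketch below first checks the determinant formula for the inverse of a 2 × 2 matrix against solve, using an arbitrary matrix A chosen for illustration, and then estimates both transformation matrices for the three points of the running example. The names F_hat and G_hat are our own; they avoid masking R's built-in shorthand F for FALSE.

```r
# Toy matrices from the running example (points a, b, c as rows).
C <- matrix(c(1, -2, -2, 2, -2, 1), nrow = 3, ncol = 2)
S <- matrix(c(2, -4, -4, -4, 4, -2), nrow = 3, ncol = 2)

# Inverse of a nonsingular 2 x 2 matrix: determinant formula versus solve().
A     <- matrix(c(2, 1, 1, 3), nrow = 2)   # arbitrary example matrix
detA  <- A[1, 1] * A[2, 2] - A[2, 1] * A[1, 2]
A_inv <- matrix(c(A[2, 2], -A[2, 1], -A[1, 2], A[1, 1]), nrow = 2) / detA
all.equal(A_inv, solve(A))

# C is 3 x 2, so it has no ordinary inverse; t(C) %*% C is 2 x 2 and invertible.
F_hat <- solve(t(C) %*% C) %*% t(C) %*% S   # maps {C} onto {S}
G_hat <- solve(t(S) %*% S) %*% t(S) %*% C   # maps {S} onto {C}

all.equal(C %*% F_hat, S)
all.equal(S %*% G_hat, C)
```

For these three points the two estimated mappings are exact, and each is the inverse of the other.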
When working with large matrices in which rows can be similar -which is often the case for language, since words can be similar in form or similar in meaning -it is often not possible to use solve, as the matrix can be too close to singular. In practice, we therefore use the Moore-Penrose generalized inverse, which is implemented in R as the ginv function of the MASS package. The generalized inverse of X is denoted as X′ .
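A small sketch illustrates the problem and its remedy. The matrix X below is a hypothetical example in which the third column duplicates the first, so that the cross-product matrix is singular and solve fails, whereas ginv from the MASS package still returns a generalized inverse.

```r
library(MASS)  # provides ginv()

# Hypothetical matrix in which the third column duplicates the first,
# so that t(X) %*% X is singular.
X <- matrix(c(1, 0, 1,
              0, 1, 0,
              1, 1, 1), nrow = 3, byrow = TRUE)
XtX <- t(X) %*% X

# solve() stops with an error for a singular matrix.
solve_failed <- inherits(try(solve(XtX), silent = TRUE), "try-error")

# The Moore-Penrose generalized inverse exists for any matrix.
XtX_ginv <- ginv(XtX)

# Defining property of a generalized inverse: A A' A = A.
all.equal(XtX %*% XtX_ginv %*% XtX, XtX)
```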
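Once F has been estimated, the predicted data points in {S} are given by $\hat{\mathbf{S}} = \mathbf{C}\mathbf{F}$. We can rewrite this equation as follows:
$$\hat{\mathbf{S}} = \mathbf{C}\mathbf{F} = \mathbf{C}(\mathbf{C}^T\mathbf{C})^{-1}\mathbf{C}^T\mathbf{S} = \mathbf{H}\mathbf{S},$$
where $\mathbf{H} = \mathbf{C}(\mathbf{C}^T\mathbf{C})^{-1}\mathbf{C}^T$ is the so-called hat matrix of the linear regression model.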
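The two routes to the predicted points can be compared for the three points of the running example. The sketch below, in which the names F_hat and H are our own, also checks a general property of hat matrices: they are projections, so that applying the hat matrix twice gives the same result as applying it once.

```r
C <- matrix(c(1, -2, -2, 2, -2, 1), nrow = 3, ncol = 2)
S <- matrix(c(2, -4, -4, -4, 4, -2), nrow = 3, ncol = 2)

F_hat <- solve(t(C) %*% C) %*% t(C) %*% S   # 2 x 2 transformation matrix
H     <- C %*% solve(t(C) %*% C) %*% t(C)   # 3 x 3 hat matrix

# Two routes to the same predicted points in {S}.
all.equal(C %*% F_hat, H %*% S)

# A hat matrix is a projection: applying it twice changes nothing.
all.equal(H %*% H, H)
```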
In what follows, C denotes an n × k matrix describing the form properties of n words, with cell C ij taking on the value 1 if word form i contains triphone j, and 0 otherwise. Thus, words are observations in a k-dimensional triphone space, and triphones are predictors for words. Furthermore, we will use S to denote an n × k matrix the row vectors of which are real-valued semantic vectors of length k. Ideally, these vectors are estimated from corpora using one of the methods from computational distributional semantics. As described below in more detail, we simulated semantic vectors. Thus, words are points in a k-dimensional semantic space, and the k 'axes' of this space function as the semantic predictors of the words.
From S and C, we derive two hat matrices, one for comprehension (H c) and one for production (H p):

$$H_c = C(C^{T}C)^{-1}C^{T}, \qquad H_p = S(S^{T}S)^{-1}S^{T}.$$

These hat matrices in turn allow us to estimate the semantic vectors predicted by the form vectors, Ŝ, and the form vectors predicted by the semantic vectors, Ĉ:

$$\hat{S} = H_c S, \qquad \hat{C} = H_p C.$$

The R code for these calculations is:

library(MASS)   # for ginv
Hcomp = C %*% ginv(t(C) %*% C) %*% t(C)
Hprod = S %*% ginv(t(S) %*% S) %*% t(S)
Shat = Hcomp %*% S
Chat = Hprod %*% C

where we use the generalized inverse to invert matrices. For the present simple example, the predicted matrices are identical to the 'observed' matrices.

Matrices for form and meaning in Latin
To implement the linear mappings between form and meaning in Latin, we first load the Latin dataset in the R package WpmWithLdl. In this dataset, inflected word forms are listed in the first column, Word. The second column, Lexeme, specifies the verb (indexed by the infinitive), while the remaining columns specify words' inflectional features. Before we can study mappings between form and meaning, and between meaning and form, we have to define the matrices representing words' forms and meanings. The pertinent matrices, C and S, that we examine in this study are calculated with the function learn_mappings. This function requires a formula as its first argument, in which all the classes (i.e., column names of the content lexomes and inflectional features) required to construct the semantic matrices should be included. (Note that the column name of the content lexome should always be the first element in the formula.) The directive grams = 3 requests triphones as sublexical units of form. The object m returned by learn_mappings is a list with a series of data objects relevant for modeling with LDL. The C and S matrices are included in this list. For convenience, we extract the matrices from m.

In the C matrix, the form of word i is specified as a binary vector with a 1 in cell c ij coding the presence of the j-th triphone in that word. The matrix with words' semantic vectors has the same layout, with real-valued semantic features S 1, S 2, … as columns. For a vector space derived with latent semantic analysis (Landauer and Dumais, 1997), the features S 1, S 2, … would represent latent variables. For the vector space used by Baayen et al. (2018), the features would reflect the association strength between the words in the rows of the matrix with those words in a corpus that show the highest variability in lexical co-occurrence. The present study makes use of simulated semantic vectors.
One important reason for working with simulated semantic vectors for this study is that it is rarely the case that all possible inflected variants of a word are found in a corpus (see Karlsson, 1986; Loo et al., 2018a, b, for detailed discussion), and that as a consequence corpus data would be too sparse to derive proper semantic vectors.
One way of creating simulated vectors is to simply create a matrix with random numbers, without any correlational structure for the rows of the matrix. Such a matrix, however, does not do justice to the semantic similarities between words. For instance, one would expect inflectional variants of the same verb to be more similar to each other semantically than inflectional variants sampled from different verbs. To remedy this problem, in the semantic_space function, we assigned each lexome (including both content lexomes and all inflectional features) a semantic vector using standard normal random variates. The semantic vector of a given inflectional form was then the sum of its original random vector and the semantic vectors of all the pertinent lexomes. In this way, forms that share more lexomes are more similar to each other than forms that share fewer or no lexomes. The similarity structure among word forms can be seen in Figure 3, a heatmap of the correlation matrix for the row vectors of S. In R, this heatmap is obtained with the following code, where we transpose S because cor calculates correlations between column vectors.
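The construction just described can be sketched in base R as follows. This is not the code of semantic_space, but a minimal illustration under stated assumptions: the four forms, the feature table feats, and the dimensionality k = 1000 are made up for the example.

```r
set.seed(1)
k <- 1000   # dimensionality of the simulated semantic space (assumption)

# Hypothetical mini-dataset: two verbs, two persons
feats <- data.frame(
  Word   = c("vocoo", "vocaas", "terreoo", "terrees"),
  Lexeme = c("vocaare", "vocaare", "terreere", "terreere"),
  Person = c("p1", "p2", "p1", "p2"),
  stringsAsFactors = FALSE
)

# One standard normal random vector per lexome (content lexomes and features)
lexomes <- unique(unlist(feats[, c("Lexeme", "Person")]))
L <- matrix(rnorm(length(lexomes) * k), nrow = length(lexomes),
            dimnames = list(lexomes, NULL))

# Each form: its own random vector plus the vectors of its lexomes
own <- matrix(rnorm(nrow(feats) * k), nrow = nrow(feats))
S <- own + L[feats$Lexeme, ] + L[feats$Person, ]
rownames(S) <- feats$Word

# Forms sharing a lexome correlate more strongly than forms sharing none
r <- cor(t(S))
print(round(r, 2))
```

With one of three summed vectors shared, the expected correlation between two forms is about 1/3, against about 0 for forms that have no lexome in common.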

heatmap(cor(t(S)))
The yellow blocks on the diagonal bring together forms with the same lexomes. For example, all plural word forms are located in the upper right corner, and within that block there are small clusters of word forms with the same person (p1, p2, p3). The details of this simulated matrix are not of interest. What is important is that the matrix reflects, in a high-dimensional vector space, the major morpho-syntactic similarities specified in the dataset. (Another possible solution to the lack-of-semantic-similarity problem in S is to use the mvrnorm function from the MASS package. This function takes as one of its inputs a variance-covariance matrix. By choosing appropriate covariances in the variance-covariance matrix, we can generate row vectors of S with a correlational structure that respects the morphosyntactic similarity of the word forms. This method can be used in the semantic_space function by setting with_mvrnorm = TRUE. More details can be found in the documentation of this function.)

Mappings between form and meaning
We are now ready to consider mappings from C (form) to S (meaning), and back from S to C. Note that these mappings will almost always be different; compare the mappings F and G illustrated in Figure 2. And although in this simple example, F and G are each other's exact inverse, for more realistic and large matrices, this will no longer be the case, even though the inverse of, e.g., G may be similar to a considerable extent to F.

Comprehension: From form to meaning
Using Equation (11), we obtain the predicted semantic vectors Ŝ by first calculating the hat matrix H and then multiplying this matrix with the semantic matrix S.

Hcomp = C %*% ginv(t(C) %*% C) %*% t(C)
Shat = Hcomp %*% S

For evaluating model performance for word i, we compare the predicted vector ŝ i with the semantic vectors s j of all words j = 1, 2, …, n, using the Pearson correlation. The word for which this correlation is greatest is selected as the word meaning that is recognized. The function accuracy_comprehension carries out these comparisons, and returns a list with both the accuracy across all word forms and the correlation coefficients of the selected predicted vectors ŝ i with the targeted vectors s i. Performance turns out to be 100% accurate. In the output of this function, r refers to the correlation of the predicted semantic vector with the observed semantic vector of the paradigm form. As long as two homophones are assumed to be truly identical in form (see Gahl, 2008; Plag et al., 2017, for discussion of why this assumption may need to be relaxed), they will map onto exactly the same semantic vector. Thus, for any pair of homophones, one paradigm form will be mapped on the semantic vector for the same paradigm slot, whereas the other form will miss out on its own semantic vector and will instead be confused with the paradigm slot of its dominant homophonic twin. Without contextual disambiguation, there is no way in which the model could have performed better.
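The evaluation by correlation can be sketched in a few lines of base R. The function recognize below is a hypothetical stand-in written for this illustration, not the implementation of accuracy_comprehension, and the matrices are simulated. The second part mimics a pair of homophones by giving two words the same predicted vector.

```r
set.seed(123)
n <- 5; k <- 20
S <- matrix(rnorm(n * k), nrow = n)                    # 'observed' semantic vectors
Shat <- S + matrix(rnorm(n * k, sd = 0.1), nrow = n)   # predictions with small error

# For each predicted vector, select the word whose semantic vector
# correlates best with it (illustrative helper)
recognize <- function(Shat, S) {
  r <- cor(t(Shat), t(S))           # r[i, j]: predicted vector i with observed vector j
  best <- apply(r, 1, which.max)
  idx <- seq_len(nrow(S))
  list(acc = mean(best == idx), best = best, r = r[cbind(idx, best)])
}

res <- recognize(Shat, S)

# Homophones: identical forms yield identical predicted vectors,
# so word 2 is necessarily recognized as word 1
Shat_h <- Shat
Shat_h[2, ] <- Shat_h[1, ]
res_h <- recognize(Shat_h, S)

print(res$acc)
print(res_h$best)
```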
The use of semantic vectors might seem overly cumbersome given that one could set up, for each word, an indicator matrix T of lexical and inflectional features. Such a matrix can be extracted from the m object. The only forms for which the observed and predicted features are not identical are, unsurprisingly, the six homophones. Whereas it is possible to map from C to T, an accurate mapping in the reverse direction is not possible, as we shall see below. The problem that we encounter here is that a linear mapping from a lower-dimensional space into a higher-dimensional space can only have an image in that larger space whose dimensionality (rank) is at most that of the lower-dimensional space (Kaye and Wilson, 1998). For instance, a linear mapping from a space {A} in ℝ 2 to a space {B} in ℝ 3 will result in at most a plane in {B}. All the points in {B} that are not on this plane cannot be reached from {A}.
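The rank restriction is easy to verify numerically. The following base R sketch uses randomly generated matrices (A, M and Targ are made up for the illustration): the image of a map from ℝ 2 into ℝ 3 has rank 2, and a set of target points of rank 3 cannot be reproduced exactly, no matter which linear map is chosen.

```r
set.seed(42)
A <- matrix(rnorm(10 * 2), nrow = 10)   # 10 points in R^2
M <- matrix(rnorm(2 * 3), nrow = 2)     # a linear map from R^2 to R^3
B <- A %*% M                            # images of the points in R^3

rank_image <- qr(B)$rank                # the images lie on a plane

# Ten arbitrary target points in R^3, not confined to a plane
Targ <- matrix(rnorm(10 * 3), nrow = 10)
rank_target <- qr(Targ)$rank

# Best linear map in the least-squares sense, and what it fails to capture
M_hat <- solve(t(A) %*% A) %*% t(A) %*% Targ
residual_ss <- sum((Targ - A %*% M_hat)^2)

print(c(rank_image = rank_image, rank_target = rank_target))
print(residual_ss)
```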
A further problem associated with predicting lexical and inflectional features is that the matrix T provides a very impoverished representation of the semantics of the Latin inflectional forms. In Word and Paradigm Morphology, morpho-syntactic features do not represent semantic primitives, but rather distributional classes. Such distributional classes are much better represented by semantic vectors than by binary coding for inflectional features. The excellent performance observed for the mapping of C onto S, rather than the mapping from C to T, therefore, is the result of primary interest.

Production: from meaning to form
For speech production, we consider the mapping from the semantic matrix S onto the form vectors of C. The pertinent hat matrix H is

$$H_p = S(S^{T}S)^{-1}S^{T},$$

and the predicted form vectors are given by Ĉ = H p C. The resulting form vectors provide information on the amount of support for the individual triphones, but do not, as such, indicate how the triphones should be ordered to obtain a proper characterization of a word's form.
We therefore need to consider all the ways in which phone triplets can be joined into legal word forms. Triphones contain intrinsic order information: a triphone such as abc can be joined with bcd but not with dcb. We can exploit this partial ordering information efficiently using graph theory.
We assign triphones to the vertices of a graph, and connect these vertices with a directed edge when the corresponding triphones have the proper overlap (bc for abc and bcd). Figure 4 shows the graph of all the triphones contained in the Latin dataset, and the triphone path of the word sapiivisseemus is marked in red. The path starts with the word-initial triphone #sa and ends with the word-final triphone us# (with the # symbol representing the word boundary). Each edge in the graph is associated with a weight. For a given word i, these weights are taken from the predicted form vector ĉ i (the row vector of Ĉ corresponding to the semantic vector s i that is the input for production). Using j and k to index the positions of triphones in the columns of Ĉ, the weight on an edge from triphone t j to triphone t k is set to ĉ ik, i.e., to the k-th value in the predicted form vector ĉ i. The support for a path in the graph can now be defined as the sum of the weights on the edges of this path. Importantly, from a word's predicted form vector ĉ i, we calculate all m paths p 1, p 2, …, p m (m ≥ 1) with path weights ω 1, ω 2, …, ω m that start with an initial triphone and end with a final triphone.
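A minimal base R sketch of this construction, for a handful of made-up triphones and a made-up predicted form vector chat (neither is taken from the Latin model). The helper path_support is hypothetical; it sums the weights on the edges of a path, where the weight of an edge is the predicted support for the triphone the edge points to.

```r
# Illustrative set of triphones
cues <- c("#sa", "sap", "api", "pit", "it#", "pis", "is#")

# Directed edge from a to b when the last two phones of a
# are the first two phones of b
adj <- outer(cues, cues,
             function(a, b) substr(a, 2, 3) == substr(b, 1, 2)) * 1
dimnames(adj) <- list(cues, cues)

# Hypothetical predicted form vector for one word (one value per triphone)
chat <- c("#sa" = 0.9, "sap" = 0.8, "api" = 0.9, "pit" = 0.7,
          "it#" = 0.8, "pis" = 0.1, "is#" = 0.05)

# Support for a path: sum over its edges of the weight of the target triphone
path_support <- function(path, chat) sum(chat[path[-1]])

p1 <- c("#sa", "sap", "api", "pit", "it#")
p2 <- c("#sa", "sap", "api", "pis", "is#")

# Check that both paths only use edges that exist in the graph
p1_valid <- all(adj[cbind(head(p1, -1), p1[-1])] == 1)
p2_valid <- all(adj[cbind(head(p2, -1), p2[-1])] == 1)

print(c(p1 = path_support(p1, chat), p2 = path_support(p2, chat)))
```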
To find these paths, we make use of the igraph package (Csardi and Nepusz, 2006), which provides the all_simple_paths function to trace all paths that start from a given vertex and that do not contain cycles. (As cycles can be traversed ad libitum, paths with cycles cannot be enumerated.) From all simple paths, we select those that proceed from an initial triphone to a final triphone. A word's path can contain a cycle, however, as illustrated in Figure 5 for terreerees. (In this figure, for representational clarity, vertices and edges that have little or no support from the mapping from meaning to form have been suppressed.) Cycles of length two and length three can be extracted from the graph with the functions which_mutual and triangles from the igraph package. The function speak from WpmWithLdl inserts such cycles, if present, into any path where this is possible. Any new (and longer) paths found are added to the list of paths, and their associated path weights are calculated. It turns out that in order to avoid paths with unjustified repeated cycles, a path with a cycle is accepted as valid only if there is one, and only one, vertex of that cycle that is encountered twice in that path.
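To make the role of cycles concrete, the sketch below enumerates simple paths with a small depth-first search written in base R. The function simple_paths is a hypothetical stand-in for all_simple_paths, used here only to keep the example free of package dependencies. The triphones are those of terreerees.

```r
# Triphones of terreerees (each listed once)
cues <- c("#te", "ter", "err", "rre", "ree", "eer", "ere", "ees", "es#")
adj <- outer(cues, cues,
             function(a, b) substr(a, 2, 3) == substr(b, 1, 2)) * 1
dimnames(adj) <- list(cues, cues)

# Depth-first enumeration of all cycle-free paths from 'from' to 'to'
simple_paths <- function(adj, from, to) {
  out <- list()
  walk <- function(path) {
    last <- path[length(path)]
    if (last %in% to) out[[length(out) + 1]] <<- path
    nxt <- setdiff(colnames(adj)[adj[last, ] == 1], path)  # unvisited successors
    for (v in nxt) walk(c(path, v))
  }
  walk(from)
  out
}

# Collapse a triphone path into a string
to_word <- function(path) {
  paste0(path[1], paste(substr(path[-1], 3, 3), collapse = ""))
}

paths <- simple_paths(adj, from = "#te", to = "es#")
words <- sapply(paths, to_word)
print(words)
```

None of the simple paths spells out terreerees itself, because that word passes through the triphone ree twice. This is why cycles have to be inserted into candidate paths in a separate step.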
At this point, we have a set of candidate paths, each of which corresponds to a word that can be selected for articulation. The question is how to select the proper word. The weight of a path is not a good guide for selection, as longer paths typically have higher path weights. A heuristic that works quite well, but not without error, is to adjust path weights for path length. What works even better is 'synthesis by analysis'. For each candidate path, we take the triphones and map these back onto semantic vectors using the comprehension model. Each of the resulting semantic vectors can then be correlated with the semantic vector that is targeted for realization in speech. The path that generates the highest correlation can now be selected for articulation. For a general framework within which this interplay of comprehension and production receives justification, see Hickok (2014).
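The idea of synthesis by analysis can be sketched as follows, for a made-up lexicon of two words with simulated semantic vectors. The candidate paths are given by hand, and the comprehension mapping Fmat is estimated with ginv; none of this reproduces the internals of speak.

```r
library(MASS)  # provides ginv()
set.seed(2)

# Hypothetical lexicon: two words over five triphones
cues <- c("#ab", "aba", "ba#", "abb", "bb#")
C <- rbind(aba = c(1, 1, 1, 0, 0),
           abb = c(1, 0, 0, 1, 1))
colnames(C) <- cues
S <- matrix(rnorm(2 * 10), nrow = 2, dimnames = list(rownames(C), NULL))

# Comprehension mapping from form to meaning
Fmat <- ginv(t(C) %*% C) %*% t(C) %*% S

# Candidate paths proposed by production, and the meaning to be expressed
candidates <- list(aba = c("#ab", "aba", "ba#"),
                   abb = c("#ab", "abb", "bb#"))
target <- S["abb", ]

# Map each candidate back onto a semantic vector and correlate with the target
score <- sapply(candidates, function(p) {
  cvec <- as.numeric(cues %in% p)
  cor(as.vector(cvec %*% Fmat), target)
})

selected <- names(which.max(score))
print(score)
print(selected)
```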
A snag that has to be dealt with at this point is what to do when the highest correlation is shared between multiple candidate paths. Here, the edge weights come into play. Lower edge weights indicate weak links in the path, links that are not well supported by Ĉ. We therefore want to avoid, as much as possible, paths with weak edge weights. We also want to avoid unnecessarily long paths. Let R denote the ratio of a path's length to the smallest edge weight in the vector of edge weights ω of that path, henceforth the length to weakest link ratio. We select for production that path for which R is smallest.
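A small base R illustration of this tie-breaker. The function lwl_ratio and the two vectors of edge weights are invented for the example, and path length is taken here to be the number of edges, which is an assumption.

```r
# Length to weakest link ratio for a vector of edge weights
lwl_ratio <- function(w) length(w) / min(w)

# Hypothetical edge weights of two tied candidate paths
cands <- list(a = c(0.8, 0.9, 0.7, 0.8),
              b = c(0.8, 0.9, 0.1, 0.05, 0.6))

R <- sapply(cands, lwl_ratio)
selected <- names(which.min(R))   # the path with the smallest ratio wins

print(R)
print(selected)
```

Candidate b is penalized twice: it is longer, and it contains a very weak link.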
This algorithm is implemented in the function speak of WpmWithLdl. We illustrate its use for the first word of the Latin dataset (pos = 1).

s1 = speak(pos = 1, grams = 3, threshold = 0.1, amat = m$am, data = latin, C = m$C, S = m$S, Bcomp = m$Bcomp, Chat = m$Chat)

In the call to speak, the threshold = 0.1 directive thins the graph by removing all vertices for which ĉ ij < 0.1. Without thinning, the number of possible paths that are calculated by all_simple_paths can become very large; even for the present small dataset, the number of paths can run into the thousands. Discounting vertices and corresponding edges that are highly unlikely to contribute serious contending path candidates speeds up calculations substantially. The amat = m$am directive tells speak where it can find the adjacency matrix that defines the graph. The adjacency matrix is calculated by learn_mappings and returned in its output list under the name am. This output list also makes available the transformation matrix B c (Bcomp in the code), which is used to calculate the semantic vectors corresponding to the candidate paths. To speed up evaluation for all forms in the Latin paradigms, we run speak in parallel on four cores.

library(doParallel)
no_cores = 4
cl <- makeCluster(no_cores)
registerDoParallel(cl)
rs = foreach(i = 1:nrow(latin), .packages = c("igraph", "WpmWithLdl")) %dopar%
  speak(pos = i, grams = 3, threshold = 0.1, amat = m$am, data = latin,
        C = m$C, S = m$S, Bcomp = m$Bcomp, Chat = m$Chat)
stopCluster(cl)

Model accuracy is straightforwardly evaluated by comparing the predicted words with the observed words. Where production goes wrong, we turn out to have a perfect tie, with a non-existing form pre-empting (by alphabetic order) the correct form. This tie can be resolved by replacing triphones by 4-phones, in which case production performance is completely error-free. What does not work at all is mapping from the T matrix specifying morphosyntactic features to the C matrix with triphones (or 4-phones). There are only 19 forms for which all 254 features are correctly supported.
It is impossible to map, with any reasonable degree of accuracy, the low-dimensional space {T} onto the high-dimensional space {C} when, as in the present study, the mapping is constrained to be linear.

A novel perspective on traditional evidence for morphemes
We have shown that, once meanings and forms are reconceptualized as points in spaces with high dimension k in ℝ k , it is possible to set up linear mappings from form to meaning, and from meaning to form. These simple linear mappings achieve an astonishingly high degree of accuracy, without having to define stems, morphemes or exponents, theme vowels, and inflectional classes. This result raises the question of how to interpret the large body of experimental results that has been argued to support the psychological reality of the morpheme. To address this question, we begin with noting that the experimental literature typically reasons as follows.
First, a hypothesis is set up which has a conditional form: "if morphemes are real (P), we should observe effect Q". For instance, if words consist of morphemes as beads on a string, we should expect to find some kind of probabilistic gaps between the morphemes in a word. Such gaps can be operationalized through low transitional probabilities at morpheme boundaries (Hay, 2002; Hay and Baayen, 2003). Now some experiment is run, and effect Q is indeed observed. For instance, at morpheme boundaries, inter-keystroke intervals are longer (Weingarten et al., 2004; Bertram et al., 2015). Having observed effect Q, the conclusion drawn is that morphemes must indeed exist (P).
However, two premises are at issue here. The first is the conditional claim that if morphemes exist, effect Q should follow (P → Q). The second premise is that morphemes exist (P). When both premises are true, it follows that effect Q must exist:

$$P \rightarrow Q,\; P \vdash Q.$$

However, from observing effect Q, and given the validity of premise P → Q, we cannot conclude that morphemes exist (P). Affirming the consequent, P → Q, Q ⊢ P, is a fallacious line of reasoning, as P was never asserted as the only possible condition for Q. To continue with the current example of morpheme boundary effects, Baayen et al. (2018) show that such effects at morpheme junctions can arise in the present theory, even though this theory eschews morphemes or exponents altogether (see also Seidenberg, 1987). The reason is that in a word's graph, edge weights tend to be substantially lower when, at the end of a stem, paths fork to support different inflectional variants. The lower the edge weights at such forks are, the more costly the production of the triphone that the edge connects to.
One kind of evidence that has been advanced to support morphemes comes from neuroimaging studies. For instance, one study compared the BOLD response in fMRI to simple English words (e.g., cream) on the one hand with the BOLD response to words with a final d or s, which typically realizes number and tense inflection in English (e.g., played, packs). Its authors reported that (potentially) inflected words gave rise to a stronger BOLD response, compared to their control condition, in the left inferior frontal area BA 45, and conclude from this that "left frontal regions perform decompositional computations on grammatical morphemes, required for processing inflectionally complex words" (Bozic and Marslen-Wilson, 2010, p. 5-6). The logic of their reasoning is that if decompositional morphological processes exist, there must be brain areas that are differentially engaged. Having found differentiation in BA 45, the conclusion is that this area performs morphological decomposition. However, the conclusion that morphological decomposition exists (and apparently, for whatever reason, is executed in BA 45) is premature, as it is possible that area BA 45 lights up in an fMRI scanner for reasons other than morphological decomposition. Given that there are good reasons to reject the morpheme as a useful theoretical construct, and given that even exponents may not be required, we have to address the question of what factors other than morphological decomposition could give rise to topologically concentrated BOLD responses in fMRI scans. It is at present unclear whether the fMRI effects reported are actually trustworthy, given the replicability crisis in both psychology and neuroimaging (Button et al., 2013; Open Science Collaboration, 2015) and the laxness of the criteria in neuroimaging with respect to by-item and by-subject variability.
Nevertheless, the issue of the topological organization of linguistic representations is of theoretical interest, and has been pursued for morphology by Pirrelli and collaborators using temporal self-organizing maps (TSOMs; Ferro et al., 2011; Chersi et al., 2014; Pirrelli et al., 2015). In TSOMs, word forms become paths in a 2-D plane. However, in our experience, self-organizing maps do not scale up well to realistically sized lexicons.
Within the present framework, the question of the topological organization of morphological form can be addressed straightforwardly at the level of triphones. As mentioned above, we opted for triphones as central form features for two reasons. First, triphones do justice to the modulation of speech sounds by their context. The place of articulation of stops is retrievable from formant transitions present in adjacent vowels. Second, triphones provide rich information about sequencing, which we have exploited to construct wordform paths in triphone graphs. Thus far, however, we have remained agnostic about the topological organization of the vertices in these graphs. If triphones have some cognitive reality, and if there are cell assemblies subserving triphone-like units, then it makes sense to reflect on the spatial organization of triphones on a surface. In what follows, we make the simplifying assumption, just as TSOMs do, that this surface is a plane in ℝ 2 .
To obtain a mapping of triphones onto a plane in ℝ 2, we make use of an algorithm from physics that has been transferred to graph theory with the goal of obtaining well-interpretable layouts of large graphs, graphopt (http://www.schmuhl.org/graphopt/). The graphopt algorithm, which has also been implemented in the igraph package, uses basic principles of physics to iteratively determine an optimal layout. Each node in the graph is given both mass and an electric charge, and edges between nodes are modeled as springs. This sets up a system in which there are attracting and repelling forces between the vertices of the graph. This physical system is simulated until it reaches equilibrium. In other words, the graphopt algorithm provides us with a simple way of implementing spatial self-organization of triphones. Figure 6 (produced with the function graph_topology_flect) presents the results of the graphopt algorithm applied to the triphones of the Latin dataset. Triphones that are unique for a given inflectional function (perfect, pluperfect, future, and past) are highlighted. Interestingly, triphones show considerable topological clustering depending on what inflectional function they subserve. For instance, triphones unique for the perfect are strongly represented in the upper part of the graph. This clustering, however, does not imply that it is in the upper part of the plane that inflected forms realizing the perfect are decomposed or combined. The clustering arises as a consequence of physical constraints on low-level units that have to be allocated to a 2-D plane.

Figure 6. Topology of tenses in the triphone graph of the Latin dataset, using the graphopt layout. Note the local topological coherence of the triphones subserving the perfect, the pluperfect, the future, and the past.
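For readers who want to experiment with this layout, the sketch below applies igraph's layout_with_graphopt to a small triphone graph. The triphones are illustrative, the parameter values shown are simply igraph's documented defaults, and the code is not that of graph_topology_flect.

```r
library(igraph)

# Small illustrative triphone graph
cues <- c("#te", "ter", "err", "rre", "ree", "eer", "ere", "ees", "es#")
adj <- outer(cues, cues,
             function(a, b) substr(a, 2, 3) == substr(b, 1, 2)) * 1
dimnames(adj) <- list(cues, cues)
g <- graph_from_adjacency_matrix(adj, mode = "directed")

# Force-directed layout: vertices carry charge and mass, edges act as springs
set.seed(1)
xy <- layout_with_graphopt(g, niter = 500, charge = 0.001, mass = 30,
                           spring.length = 0, spring.constant = 1)
rownames(xy) <- V(g)$name

# Each triphone now has a position in the plane
print(round(xy, 1))
# plot(g, layout = xy, vertex.size = 20, edge.arrow.size = 0.4)
```

The coordinates depend on the random starting configuration, so different seeds give rotated or mirrored but otherwise comparable layouts.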
Figure 7 (obtained with the graph_topology_segments function) shows that in the very same network, triphones that are unique to the five vowels of Latin also show considerable clustering. (For phoneme-like clustering in the cortex, see Cibelli et al. (2015) and references cited there.) One might take clusterings like this to imply that phonemes exist in our model, yet there are no units for phonemes at all in the system that underlies the graph, but only sequences of triples of phones, and these units have a very different theoretical motivation than the phoneme in structuralist linguistics and offshoots thereof. A conjecture that follows from the present results is that the details of how triphones self-organize are highly dependent on how a language organizes its morphology.

Figure 7. Topology of vowels in the triphone graph of the Latin dataset, using the same graphopt layout as for Figure 6.

We end this section with a brief note on double dissociations in aphasia and their implications for understanding the nature of morphological processing. Selective impairments, such as relatively unimpaired performance on irregular forms and impaired performance on regulars, are often taken as providing evidence for distinct processing components. However, as shown by Juola (2000), double dissociations can arise in non-modular neural networks when across many simulation runs the network is lesioned randomly. We expect that similar results are within reach for the present approach, as linear mappings are equivalent to two-layer linear networks. Furthermore, when topological organization is imposed, as in Figure 6, local lesions, for instance in the upper left of this figure, can have an effect that is easily misunderstood as the perfect tense morpheme (or exponent) having been lost.

Concluding remarks
We have shown that it is computationally possible to map 672 different inflectional forms of Latin onto their semantic vectors, and to map these semantic vectors back onto their corresponding word forms, without requiring constructs such as stems, affixes, exponents, morphemes, allomorphs, theme vowels, and inflectional classes. We have also shown that when the basic units of analysis, triphones, are allowed to self-organize in a 2-dimensional plane, patterns emerge that are reminiscent of morphemes (or exponents) and phonemes, without such units being part of the generating system.
Interestingly, it is apparently not necessary to harness the power of deep learning to see clusterings resembling traditional units emerge. Together with the results of a much larger-scale study for English, we conclude that it is worthwhile to explore in further detail the potential of linear mappings for our understanding of lexical processing in comprehension and speech production.
Some caveats are in order, however. In actual language use, it is typically the case that only a minority of the set of possible inflectional forms is used, as shown by Karlsson (1986) and Loo et al. (2018a, b). For more realistic modeling of Latin inflectional morphology, a corpus of Latin will be indispensable. A corpus will not only allow us to test how well the present theory works for sparse data, but it will also make it possible to start working with actual semantic vectors rather than simulated ones. Given a corpus-informed model, it will also become possible to address the productivity of the system (see Baayen et al., 2018, for more detailed discussion and computational implementation) and the role of principal parts.
Undoubtedly, there are limits to what can be done with simple linear mappings. Possibly, these limits constrain morphological form. Even in Latin, periphrastic constructions were in use, possibly in order not to stretch the fusional system beyond its limits. It is also conceivable that there are black holes in a system, areas where uncertainties about which path to follow are too great to allow production with confidence. A possible case in point is the Polish genitive; see Dabrowska (2001) for detailed discussion of the problem and Divjak and Milin (2018) for a possible solution.
A further complication is that human performance is not error-free. The larger context within which words are used is of vital importance for both comprehension and production. One of the reasons that words 'by themselves' are fragile is the high variability of word forms, a factor that we have ignored altogether in this study. This variability characterizes not only the spoken word (Johnson, 2004), but also printed words. A printed word (given font and font size) might seem invariable, but how it actually appears on the retina depends on angle and viewing distance. As a consequence, there is much more variability in actual lexical processing than in the present simulations, so that the practically error-free performance observed here for small simulated data sets does not carry over to full-scale lexical processing (for detailed discussion and modeling results for auditory comprehension of real speech, see Arnold et al. (2017); Baayen et al. (2018); Shafaei Bajestan and Baayen (2018)).
Nevertheless, the results obtained thus far are promising in that they suggest that linear mappings between vectors of form, anchored in triphones, and vectors of reals representing meaning are surprisingly effective. In this study, we have made use of simulated semantic vectors, constructed in such a way that vectors of words that share more features are more similar to each other. Interestingly, performance does not degrade for random semantic vectors, which is mathematically unsurprising as uncorrelated semantic vectors are less confusable and hence afford more accurate mappings. Importantly, once we drop the assumption that meanings are symbolic and representable by isolated units, or by highly specific semantic logical functions, and instead start entertaining the possibility that meanings are representations in a high-dimensional vector space, mappings between form and meaning become much simpler. This result holds irrespective of whether meanings are equated with distributional vector space models constructed from corpora, or whether such vectors are seen as imperfect probes of much richer distributional models of our experience of the world. Thus, linear discriminative learning offers a new perspective on the functional architecture of lexical processing in comprehension and production.

Funding
This research was supported by an ERC advanced Grant (no. 742545) to the first author.