Phylogenetic signal in phonotactics

Phylogenetic methods have broad potential in linguistics beyond tree inference. Here, we show how a phylogenetic approach opens the possibility of gaining historical insights from entirely new kinds of linguistic data--in this instance, statistical phonotactics. We extract phonotactic data from 111 Pama-Nyungan vocabularies and apply tests for phylogenetic signal, quantifying the degree to which the data reflect phylogenetic history. We test three datasets: (1) binary variables recording the presence or absence of biphones (two-segment sequences) in a lexicon (2) frequencies of transitions between segments, and (3) frequencies of transitions between natural sound classes. Australian languages have been characterized as having a high degree of phonotactic homogeneity. Nevertheless, we detect phylogenetic signal in all datasets. Phylogenetic signal is greater in finer-grained frequency data than in binary data, and greatest in natural-class-based data. These results demonstrate the viability of employing a new source of readily extractable data in historical and comparative linguistics.


Introduction
A defining methodological development in 21st century historical linguistics has been the adoption of computational phylogenetic methods for inferring phylogenetic trees of languages (Bowern 2018a). The computational implementation of these methods means that it is possible to analyse large samples of languages, thereby inferring the phylogeny (evolutionary tree) of large language families at a scale and level of internal detail that would be difficult, if not impossible, to ascertain manually by a human researcher (Bowern & Atkinson 2012: 827). There is more to phylogenetics than building trees, and there exists untapped potential to explore the language sciences and human history with a phylogenetic approach. For example, in linguistics, phylogenetic methods have been integrated with geography to infer population movements (Walker & Ribeiro 2011;. In comparative biology, phylogenetic methods have been applied profitably to investigations of community ecology (Webb et al. 2002), ecological niche conservatism (Losos 2008), paeleobiology (Sallan & Friedman 2012) and quantitative genetics (Villemereuil & Nakagawa 2014). At the heart of these methods, however, is a sound understanding of the evolutionary dynamics of comparative structures. In this paper, we pesent a foundational step by detecting phylogenetic signal, the tendency of related species (in our case, language varieties) to share greater-than-chance resemblances (Blomberg & Garland 2002), in quantitative phonotactic variation.
Throughout recent advances in linguistic phylogenetics, less attention has been paid to methodological development at the stage of data preparation. Large-scale linguistic phylogenetic studies continue, by-andlarge, to rely on lexical data which have been manually coded according to the principles of the comparative method (as described by Meillet 1925;Campbell 2004;Weiss 2014)-the comparative method being the long-standing gold-standard of historical linguistic methodology (Chang et al. 2015;Kolipakam et al. 2018). This article demonstrates that phonotactics can also present a source of historical information. We find that, for a sample of 111 Pama-Nyungan language varieties, collections of relatively simple and semi-automatically-extracted phonotactic variables (termed characters throughout) contain phylogenetic signal. This has positive implications for the utility of such phonotactic data in linguistic phylogenetic inquiry, but also introduces methodological considerations for phonological typology.
In Sections 1-2, we discuss the motivations for looking at phonotactics as a source of historical signal, and we give some broader scientific context that motivates the methodological approach we take later on. In Sections 3-6, we present tests for phylogenetic signal in phonotactic characters extracted from wordlists for 111 Pama-Nyungan language varieties. Section 3 details the materials used and reference phylogeny. Section 4 tests for phylogenetic signal in binary characters that code the presence or absence of biphones (two-segment sequences) in each wordlist, capturing information on the permissibility of certain sequences in a language. Section 5 also tests for phylogenetic signal in biphones, but extracts a finer-grained level of variation by taking into account the relative frequencies of transitions between segments. Section 6 groups segments into natural sound classes and tests for phylogenetic signal in characters coding the relative frequencies of transitions between different classes. We finish in Sections 7-8 with discussion of the limitations of the study design, implications of the results and directions for future research.

Motivations
There are at least two reasons why consideration of alternative data sources could be fruitful in historical linguistics. The first is that a bottleneck persists in linguistic phylogenetics when it comes to data processing. The data for most linguistic phylogenetic studies are lexical cognate data-typically binary characters marking the presence or absence of a cognate word in the lexicon of each language-which have been assembled from the manual judgements of expert linguists using the traditional comparative method of historical linguistics (e.g. Weiss 2014). Although data assembled in this way is likely to remain the gold-standard in historical linguistics for the foreseeable future, it nevertheless constitutes slow and painstaking work (notwithstanding efforts to automate parts of the process; see List, Greenhill & Gray 2017;Rama et al. 2018;List et al. 2018). This restricts the pool of languages that can be included in phylogenetic research to those that have been more thoroughly documented, introducing the risk of a sampling bias, where relatively well-studied regions of the global linguistic landscape are over-represented in historical and comparative work.
The second motivation for considering alternative historical data sources is that there are inherent limitations associated with lexical data. Undetected semantic shifts and borrowed lexical items erode patterns of vertical inheritance in a language's lexicon. Put another way, these changes create noise in the historical signal of a language's lexicon. Chance resemblances between non-historically cognate words are another source of noise in lexical data. Eventually, semantic shifts, borrowings and chance resemblances will accumulate to a point where genuine historical signal is indistinguishable from noise. This imposes a maximal cap on the time-depth to which the comparative method can be applied, which is typically assumed to sit somewhere around 10,000 years BP, based on the approximate age of the Afro-Asiatic family (Nichols 1997: 135). Some phylogenetic studies have attempted to push back the time-depth limitations of lexical data by using characters that code for a range of grammatical features, under the rationale that a language's grammatical structures should be more historically stable than its lexicon (Dunn et al. 2005;Rexová, Bastin & Frynta 2006). However, contrary to expectation, a recent study suggests that grammatical characters evolve faster than lexical data . Differing rates of evolution are also found in phonology, specifically the rates of change in vowel inventories versus consonant inventories (Moran & Verkerk 2018;Moran, Grossman & Verkerk 2020). An additional issue with grammatical characters is that the space of possibilities for a grammatical variable is often restricted. This means that chance similarities due to homoplasy (parallel historical changes) will be much more frequent (c.f. Chang et al. 2015). For example, many unrelated languages will share the same basic word order by chance, because there is a logical limit on the number of basic word order categories.

Phonotactics as a source of historical signal
The motivation for considering a language's phonotactics as a potential source of historical information is based partly on practical and partly on theoretical observations. From a practical perspective, it is possible to extract phonotactic data with relative ease, at scale, from otherwise resource-poor languages. This is because the bulk of a language's phonotactic system can be extracted directly from phonemicized wordlists. As long as there is a wordlist of suitable length (Dockum & Bowern 2019) and a phonological analysis of the language, phonotactic information can be deduced and coded from the sequences of segments found in the wordlist with a high degree of automation. This modest minimum requirement with regards to language resources is a valuable property in less documented linguistic regions of the world. We detail the process of data extraction for this study in Section 3 below.
An additional benefit of extracting phonotactic data from wordlists is the potential for expanding the depth of comparative datasets. Although macro-scale studies, including hundreds or even thousands of the world's languages (in other words, broader datasets), are increasingly common in comparative linguistics, less attention has been paid to the number of characters per language (dataset depth). It is quite a different situation in evolutionary biology, where there has been tremendous growth in whole genome sequencing, thanks to technological advances and falling costs (Delsuc, Brinkmann & Philippe 2005;Wortley et al. 2005). This, consequently, has led to tremendous growth in the depth of biological datasets. This is an important consideration because the quantity of characters required by modern computational phylogenetic methods can be substantial (Wortley et al. 2005;Marin, Hedges & Tamura 2018). Certainly, phonotactic data is unlikely to approach the scale of large genomic datasets in biology, but it could effectively deepen historical linguistic datasets.
From a theoretical perspective, there is reason to suspect that the phonotactics of a language preserve a degree of historical signal. There is some evidence that when a borrowed word enters the lexicon of a language, speakers tend to adapt it to suit the phonotactic patterns of that language (Hyman 1970;Silverman 1992;Crawford 2009;Kang 2011). Consequently, in such a case the historical phonotactic structure of the lexicon remains largely intact, even as particular ancestral words are lost and replaced (a property termed pertinacity by Dresher & Lahiri 2005). Similarly, historical phonotactic properties of a language will remain in the phonotactics of a language in the case of an undetected semantic shift.
Laboratory evidence shows that speakers have a high degree of sensitivity to the statistical distribution of phonological segments and structures when producing novel words. Examples of such studies include Coleman & Pierrehumbert (1997), Albright & Hayes (2003) and Hayes & Londe (2006), among others (see Gordon 2016: 20-21). Lexical innovation then, should have a relatively conservative impact on the frequency distributions of phonotactic characters. Every new word that enters a language's lexicon will have a minute impact on the frequencies of segments and particular sequences of segments in that language. But, over time, the cumulative effect of new lexicon entering a language on phonological and phonotactic frequency distributions will be more modest than if speakers generated new words with no regard for existing frequencies. Thus, there is reason to expect that quantitative phonotactic characters are likely to be conservative.

Phylogenetic signal
The concept of phylogenetic signal (Blomberg & Garland 2002;Blomberg, Garland & Ives 2003: 717) originates in comparative biology, where it refers to the tendency of phylogenetically related species to resemble one another to a greater degree than would otherwise be expected by chance. This expectation derives from the evolutionary history shared between species. Two closely-related species, which share a relatively recent common ancestor, have had less time in which to diverge evolutionarily. We expect more distantly-related species, whose most recent common ancestor lies much further in the past, to tend to be more different, since they have spent longer on separate evolutionary paths.
Phylogenetic signal manifests itself as phylogenetic autocorrelation in comparative studies. That is, species observations in a comparative dataset tend not to behave as independent data points, but rather pattern as a function of the amount of shared evolutionary history between species. For many statistical methods that assume data are independent and identically distributed (i.i.d.), this is a problem. Phylogenetic autocorrelation has long been recognized as an issue in linguistic typology and comparative biology, and both fields share comparable histories of developing sampling methodologies that attempt to correct for or offset phylogenetic relatedness in some way. More recent times have seen the rise of phylogenetic comparative methods, statistical methods that directly account for phylogenetic autocorrelation, rather than offsetting it, beginning with foundational works by Felsenstein (1985) and Grafen (1989) 1 . Although now practically ubiquitous in comparative biology, uptake of phylogenetic comparative methods has been slower in comparative linguistics (notwithstanding studies such as Dunn et al. 2011;Maurits & Griffiths 2014;Verkerk 2014;Birchall 2015;Zhou & Bowern 2015;Calude & Verkerk 2016;Dunn et al. 2017;Verkerk 2017;Widmer et al. 2017;Blasi et al. 2019).
Since the turn of the century, methods have been developed for explicitly quantifying the degree of phylogenetic signal in a dataset (Revell et al. 2008: 591). Measuring phylogenetic signal can be the first step of a comparative study, to test whether there is sufficient phylogenetic signal to necessitate implementation of a phylogenetic comparative method in a later stage of analysis, or to establish the suitability of standard statistical methods if no phylogenetic signal is detected. Measures of phylogenetic signal can also be used to re-evaluate the validity of older results that pre-date modern phylogenetic comparative methods, as in Freckleton, Harvey & Pagel (2002). In other instances, the presence or absence of phylogenetic signal in certain data may be an interesting result in itself. In this study, we present a novel source of linguistic data which traditionally has not been considered a salient source of historical signal for historical linguistic study (indeed, given descriptions of Australian languages, it may have been considered a particularly unlikely source of historical signal; see Section 3.1). We use measures of phylogenetic signal to test the hypothesis that our data contain historical information and, therefore, could contribute to future historical linguistic study. Blomberg, Garland & Ives (2003) provide a set of statistics for measuring phylogenetic signal, which remains prevalent today (for example, Balisi, Casey & Valkenburgh 2018;Hutchinson, Gaiarsa & Stouffer 2018;Leff et al. 2018). We use one of these statistics, K. The K statistic has the desirable property of being independent of the size and shape of the phylogenetic tree being investigated, which means that studies with different sample sizes can be compared directly. Briefly (following Blomberg, Garland & Ives 2003: 722), the calculation of K requires three components: (i) character data (i.e., observations for the variable of interest); (ii) a reference phylogeny, a phylogenetic tree which has been generated independently from the character data; and (iii) a Brownian motion model of evolution. 2 These components entail two assumptions of the method: the assumption that the reference phylogeny is an accurate representation of the phylogenetic history of the populations being studied and the assumption that Brownian motion accurately models the evolution of the character data. In practice, the reference phylogeny will be subject to uncertainty. We return to this point in Section 7 and evaluate the robustness of our results against phylogenetic uncertainty. Similarly, in practice, the Brownian motion model may not be realistic. Nevertheless, it is a simple model and straightforward to implement, and thus commonly used as a starting point before exploring more complex models of evolution 1 See Nunn (2011) for discussion. 2 A Brownian motion model of evolution describes a model of character evolution where the character can move up or down with equal probability as it evolves through time. Under this model of evolution, variance in character values throughout a phylogeny will increase proportionally as time elapses. later on. We discuss this further in Section 7 and outline possible extensions to the model for future study, taking sound change processes into account. To the extent that Brownian motion fails to model the evolution of phonotactic characters, this should make it more difficult to detect phylogenetic signal.
The K statistic is then calculated by, firstly, taking the mean squared error of the data (M SE 0 ), as measured from a phylogenetic mean 3 , and dividing it by the mean squared error of the data (M SE), calculated using a variance-covariance matrix of phylogenetic distances between tips in the reference tree (see Blomberg, Garland & Ives 2003 for a complete formula). This latter value, M SE, will be small when the pattern of covariance in the data matches what would be expected given the phylogenetic distances in the reference tree, leading to a high M SE 0 /M SE ratio and vice versa. Thus, a high M SE 0 /M SE ratio indicates higher phylogenetic signal. Finally, the observed M SE 0 /M SE ratio can be scaled according to the expected M SE 0 /M SE ratio given a Brownian motion model of evolution. This gives a statistic, K, which can be compared directly between studies using different trees. When K = 1, this suggests a perfect match between the covariance observed in the data and what would be expected given the reference tree and the assumption of Brownian motion evolution. When K < 1, close relatives in the tree bear less resemblance in the data than would be expected under the Brownian motion assumption. K > 1 is also possible-this occurs where there is less variance in the data than expected, given the Brownian motion assumption and divergence times suggested by the reference tree. In other words, close relatives bear closer resemblance than would be expected if the variable evolved along the tree following a Brownian motion model of evolution. Blomberg, Garland & Ives (2003) also present a randomisation procedure for testing whether the degree of phylogenetic signal in a dataset is statistically significant. The randomisation procedure utilizes Felsenstein's (1985) phylogenetic independent contrasts (PICs) method. Felsenstein's insight is that, although two character values (x and y) from two sister taxa cannot be considered independent due to phylogenetic autocorrelation, the contrast between them (x − y) is phylogenetically independent, since these values can only diverge in the time since the two sisters split from their most recent common ancestor. Given a set of character data and a phylogenetic tree, Felsenstein (1985) presents a method for harvesting a whole set of phylogenetically independent data points, PICs, which can be used for statistical analysis in lieu of the raw set of observations. Blomberg, Garland & Ives (2003) take advantage of the expectation that, given a Brownian motion model of evolution, PIC variance is expected to be proportional to time. PICs among more closely-related taxa will tend to be lower than more distant relatives, since they have had less time to diverge from common ancestors. The randomisation procedure first extracts PICs for a given character and records the variance. Then, it extracts PICs and records the variance after randomly shuffling character data among taxa (thereby destroying phylogenetic signal). PIC variance is recorded typically for many thousands of such random permutations. If the true PIC variance (for original, unshuffled data) is lower than the variance of PICs in > 95% of random permutations, the null hypothesis of no phylogenetic signal can be rejected at the conventional 95% confidence level.
In this study, we also use a second statistic, D, which was developed to measure phylogenetic signal in binary data. The D statistic is described by Fritz & Purvis (2010). To summarize briefly, the D statistic is based on the sum of differences between sister tips and sister clades, Σd. First, differences between values at the tips of the tree are summed. Since D concerns binary variables, each taxon will either have a 0 or 1 value. At the level of the tips, then, all sister tips will either share the same value (in which case, the difference = 0) or one tip will have a 0 value and the other will have a 1 value (in which case, the difference = 1). Nodes immediately above the tree tips are given the average value of their daughter tips below (which, in a fully bifurcating phylogeny, will either be 0, 0.5 or 1). This process is repeated for all nodes in the tree, until a total sum of differences, Σd, is reached. At two extremes, data may be maximally clumped, such that all 1s are grouped together in the same clade in the tree and likewise for all 0s, or data may be maximally dispersed, such that no two sister tips share the same value (every pair of sisters contains a 1 and a 0, leading to a maximal sum of differences). Lying somewhere in between will be both (i) a distribution that is entirely random relative to phylogenetic structure and (ii) a distribution that is clumped exactly to the degree expected if the character evolved along the tree following a Brownian motion model of evolution. Two permutation procedures are used to determine where these two points lie for a given dataset and phylogenetic tree. Firstly, like Blomberg et al.'s permutation test described above, character values are shuffled at random among tips of the tree many times over, thereby destroying phylogenetic signal. The sums of differences are taken from each random permutation to obtain a distribution of sums of differences, given phylogenetic randomness: Σd r . Then, to obtain a contrasting distribution of sums of differences, the process of character evolution along the tree following a Brownian motion model is simulated many times over. Since Brownian motion is a model of evolution of continuous characters, and what we need here is a distribution of binary character values, the permutation test simulates the evolution of a continuous-valued character and then simply binarizes the tip values to 0 or 1 by observing whether they fall above or below a threshold value. This threshold is set to whatever level will produce the same proportion of 1s and 0s as observed in the real data. The sums of differences are then taken from each simulation, giving a distribution where phylogenetic signal is present: Σd b . Finally, the D statistic is determined by scaling the observed sum of differences relative to the means of the two reference distributions just described: Scaling D in this way provides a standardized statistic with the desirable property that it can be compared between different sets of data, with trees of different sizes and shapes, as with K for continuous characters. One disadvantage of D, however, is that it requires quite large sample sizes (>50), below which it loses statistical power.
Two p values determine the statistical significance of D, one each for the null hypotheses that D = 0 (phylogenetic signal present) and D = 1 (the character is distributed randomly relative to phylogenetic structure). These p values are obtained by comparing the observed D score to the two distributions of simulated D scores described above (Σd r and Σd b ). The fraction of randomly simulated D scores smaller than observed D is taken as the p value for H 0(D=1) . Likewise, the proportion of the simulated D scores greater than the observed D value is the p value for H 0(D=0) .

Materials
Our study measures phylogenetic signal in a variety of types of phonotactic characters, extracted using semi-automated methods from wordlists within the Pama-Nyungan family (Australia). Throughout, we take the doculect to be our unit of study. A doculect is a language variety as documented in a given resource (Cysouw & Good 2007;Good & Cysouw 2013). That is to say, we treat each wordlist as its own unit of study, without making any claims about the status of the documented language variety's status as a language or dialect. This agnosticism is advantageous in phylogenetic studies, since the terms 'language' and 'dialect' imply something about the relationship of a documented language variety to other documented language varieties, and a commitment to one term or the other therefore represents a phylogenetic assumption.

Language sample
Pama-Nyungan is by far the largest language family on the Australian continent, covering nearly 90% of its landmass (everywhere except for three areas: part of the Top End, part of the Kimberley, and the whole of Tasmania) and encompassing around two-thirds of the languages present at the time of European settlement (Bowern & Atkinson 2012: 817). Pama-Nyungan was first proposed and named by Kenneth Hale (Wurm 1963: 136) and it has been the subject of considerable historical linguistic study since this time. Although the family has presented some challenges for historical linguistics, the phylogenetic unity of Pama-Nyungan has been established on traditional historical linguistic grounds (Alpher 2004) with many subgroups identified within (for example, O'Grady, Voegelin & Voegelin 1966;Wurm 1972;Austin 1981). For an overview of the history of Pama-Nyungan classification, see Bowern & Koch (2004 ch. 1-5) and Koch (2014). Bowern & Atkinson (2012) perform a computational phylogenetic analysis of Pama-Nyungan using lexical data from 194 language varieties, providing for the first time a fully bifurcating phylogeny of the entire Pama-Nyungan family.  subsequently perform a phylogeographic analysis using the same dataset, but refined and expanded to 306 language varieties and including a geographic element to estimate the point of origin and spread pattern of the family through time and space.
The Pama-Nyungan family provides an excellent test case for this study. It holds practical advantages which make the task of phonological comparison easier, but it also provides us with a deliberately high bar to clear from a theoretical perspective. Both of these features are a result of the unusual degree of phonological homogeneity observed among Australian languages. Australian languages have been noted for a degree of similarity between phonological inventories of contrastive segments that is exceptional and unexpected in light of the phylogenetic and geographical breadth of the family, the level of diversity observed in vocabulary and aspects of grammar, and the level of phonological diversity found in comparably-sized families of languages elsewhere in the world. This has been noted as early as Schmidt (1919) and in more recent times by Capell (1956), Voegelin et al. (1963), Dixon (1980), Busby (1982), Hamilton (1996), Baker (2014), Bowern (2017) and Round (2021a), among others. This curious level of homogeneity extends to phonotactics too (Dixon 1980;Hamilton 1996;Baker 2014;Round 2021b).
On one hand, the abundance of similar phonological inventories makes the task of comparison between them easier, because it limits the problem of dataset sparsity. Consider a character coding the frequency of some sequence of two segments xy in a language: This character can only be compared between languages that contain both x and y segments in their inventories. If a language lacks either segment in its inventory, then the character will be coded as absent or missing (as distinct from 0, where a language possesses both segments but never permits them in sequence). We expect fewer missing values in Australia, where languages tend to share a large proportion of directly comparable segments, when compared to other parts of the world where we would expect to see many more missing values.
On the other hand, an ostensibly high degree of phonological homogeneity, in spite of considerable phylogenetic diversity, presents challenges for historical linguistics. Baker (2014: 141) and Alpher (2004: 103) have both written on the difficulties for historical reconstruction in Australia because of this. Moreover, a phylogeny implies some degree of historical divergence, but in the case of Australian languages, there would appear to be little by way of phonological divergence, let alone divergences which are phylogenetically patterned. We therefore choose to study an Australian language family as a deliberately difficult test case, where we expect the bar be set high with respect to detecting phylogenetic signal. Gasser & Bowern (2014) counter prevailing views on Australian phonological homogeneity. They find that common assumptions, of the kind discussed above and commonly found repeated in reference grammars, mask a degree of variation which is otherwise revealed by, firstly, extracting data on segmental inventories directly from wordlists and, secondly, considering segmental frequencies extracted from wordlists. This result motivates our current approach; here, we are also concerned with matters of frequency, extracted directly from language wordlists. However, we look at different kinds of characters, pertaining not to single segments but to biphones, and consider them with respect to their phylogenetic implications.

Wordlists
Our Pama-Nyungan phonotactic data is extracted from 111 wordlists which are part of a database under development by the last author (Round 2017b), extending the Chirila resources for Australian languages . In this study we restrict our attention to the most accurate sources available, and use only lexical data that is compiled by trained linguists and for which the underlying dataset is available in published or archived form. Additionally, we restrict our sample to wordlists containing a minimum of 250 words. We include this cut-off since measurement accuracy is a concern for smaller wordlists. A documented wordlist is necessarily only a subset of the complete lexicon of a language and it is unclear how big a wordlist must be before frequency statistics begin to stabilize around a sufficient level of accuracy. There is some work in this space concerning frequences of single segments (Dockum & Bowern 2019), suggesting a rapid decline in the accuracy of phoneme frequencies as wordlists drop below 250 words. Longer wordlists will always be better, however we select 250 words as a reasonable compromise which maintains a generally broad coverage of Pama-Nyungan languages. We return to the subject of wordlist sizes and potential implications for our results in Section 7.1 Bibliographic details for all underlying data is available in Section S2 of the Supplementary Information. Owing to differences in the length of primary sources, there is considerable diversity in the size of the lexicons we use. As shown in Figure 1, the difference from smallest to largest is over an order of magnitude (min. 250, max. 4955), with the middle fifty percent between 509 and 1367 items. Mean lexicon size is 1112 (SD 916))  Original source data, which is typically orthographic and, if digital, is sometimes mixed with metadata or other extraneous material, has undergone extensive data scrubbing, conversion to phonemic form using language-specific orthography profiles (Moran & Cysouw 2018), and additional automated and manual error checking. These procedures ensure basic data cleanliness. Separately however, it has long been recognized that the segmental-phonological analysis of languages is a non-deterministic process (Chao 1934;Hockett 1963;Hyman 2008;Dresher 2009). Two linguists faced with the same data may produce different analyses, not due to error but due to different applications of the very many analytic criteria that figure into any analysis of segments. Consequently, the cross-linguistic phonological record varies not only according to language facts per se, but due also to variation in the practice of linguistic analysis. Recent literature (Lass 1984;Hyman 2008; Van der Hulst 2017; Round 2017a; Kiparsky 2018) emphasizes the value of normalizing source descriptions prior to the analysis of cross-linguistic phonological datasets. This is not an information destroying process-it does not 'standardize' languages-but it may shift information from one part of the representation (e.g., contrast between individual symbols) to another (e.g., contrasts between sequences of symbols), in order that information is located in a comparable way across the languages in the dataset, and therefore is more amenable to comparative analysis. Our wordlist data is normalized is this sense. Complex segments are split into simple sequences (e.g., prenasalized stops are split into a homorganic nasal + stop sequence); long vowels are represented as a sequence of identical short vowels, and vowel-glide-vowel sequences in which the glide is homorganic with either vowel are normalized to vowel-vowel; fortis consonants are represented as a sequence of identical short consonants, and positionally neutralized fortis/lenis stops as singletons; laminal consonants which do not figure in a pre-palatal versus dental opposition are represented as palatal, and rhotic glides which do not figure in an alveolar versus post-alveolar opposition are represented as post-alveolar (see also Round 2019a; Round 2019b). The phonotactic character sets used in this study were extracted from these normalized, comparably segmented wordlists.

Reference phylogeny
The reference phylogeny we use is a maximum clade credibility tree 4 of 285 Pama-Nyungan language varieties inferred using lexical cognate characters by the second author ( Figure S1, Supplementary Information). It was inferred independently of this study, prior to this study's conception and without the involvement of the first and third authors. It was inferred using the same Stochastic Dollo model as Bowern & Atkinson (2012), but with an expanded and refined dataset. Further details of the model and phylogeny construction are described in Bowern & Atkinson (2012),  and . The cognate data used to infer the reference phylogeny is available on Zenodo (Bowern 2018b). See Section S1 of the Supplementary Information for more information on the reference phylogeny.
We considered a reference tree from a newer phylogeographic analysis of Pama-Nyungan based on largely the same data plus further expansion to 304 doculects and continued refinement , however, we opted against its use for this particular study. The reason for this is that, although Bayesian inference of phylogenetic tree topology is considered generally robust to the levels of lexical borrowing observed among Pama-Nyungan languages (Greenhill, Currie & Gray 2009;Bowern et al. 2011), borrowing still has the effect of reducing branch lengths across the tree (Greenhill, Currie & Gray 2009). This effect, and consequently the accuracy of branch length estimates, is equally applicable to both trees considered here. However, the geographic element in the phylogeographic study uses, in part, branch lengths to model geographic dispersal. The posterior distribution of trees, which is jointly informed by cognate data and geography, may therefore show a bias towards geographically proximal languages whose apparent divergence times have been reduced by high rates of borrowing. Thus, although branch length estimates will be impacted by borrowing in any phylogenetic study of Pama-Nyungan, there is more chance of borrowing affecting topology in the phylogeographic study.
We consider it unlikely that the overall conclusions of the study would be altered by the choice of which version of the Pama-Nyungan phylogeny we use as a reference tree. Each of the studies referenced above produced highly congruent Pama-Nyungan phylogenies (see Bouckaert, Bowern & Atkinson 2018 for a detailed comparison). Furthermore,  features fixed clade priors based on subgroups identified in earlier studies, so topological differences are constrained to some extent by design. Nevertheless, the accuracy of the reference tree is a key assumption of the methods we use in this study, and thus phylogenetic uncertainty is an important consideration. We return to this point in Section 7.1 and evaluate the overall robustness of our results to phylogenetic uncertainty by replicating a subset of phylogenetic signal tests over a posterior sample of trees.
As discussed above, we treat each wordlist in our study as its own doculect. The reference tree in this study was inferred using a similar approach, while remaining less-commital about the particular status of the unit of analysis. Resources were sometimes combined for a particular language, but they are also frequently broken up into separate units, particularly when the resources come from different authors and different time periods. We have taken care to match the wordlists in this study to their exact or best corresponding tip in the reference tree. In most cases, the wordlists we use here are the same as those used to generate the cognacy judgements used to infer the reference phylogeny. In other cases, we use a different source to the one used for the reference phylogeny but there is, nevertheless, a straightforward one-to-one mapping between the language variety our wordlist represents and a corresponding tip in the tree. In one case, our Mudburra source (Nash et al. 1988) matches neither of the sources for the two Mudburra tips in the reference phylogeny. However, the two varieties in the reference phylogeny have the same date. This entails that when either of them is removed from the tree, the exact same result is obtained in terms of tree geometry, which is what is significant for our investigation. Accordingly, we remove one and match our source to the other.

Phylogenetic signal in binary phonotactic data
In the simplest case, the phonotactics of different languages may be compared in terms of which sequences of two segments (biphones) they permit and which they do not. If claims about the relative homogeneity of phonotactic constraints in Australian languages holds, then we would expect this kind of comparison to yield little, if any, phylogenetic signal.
In this test, we construct the dataset as follows: We automatically extract from all wordlists every unique sequence of two segments-or more accurately, sequences of xy where each of x and y is either a phonological segment or a word boundary '#'. Each sequence becomes a character (variable) in the dataset, for which every language receives a binary value: 1 if the sequence xy is found in the language's wordlist (even if only once); 0 if the language contains both segments in its inventory but the sequence xy never appears in its wordlist; or, NA (not applicable, missing) if the language does not contain one (or both) of either segment x or y in its inventory (and therefore a prioi cannot contain the sequence xy). Binary data of this kind represents sequence permissibility: Where a language contains both segments in its inventory, it will either permit them to appear together in sequence or it will not. In this respect, the information encoded by these characters is similar to what one might find in the phonotactics section of a descriptive grammar, where one often encounters a description in prose and/or a basic tabulation of which segments are permitted and where, within syllable and word structures. However, this kind of information is also, in a sense, quite coarse-grained, since there are only two possible values. A sequence which is very common in one language will be coded in exactly the same way as a sequence which only appears a handful of times in another language.
We apply the D test individually to each character in the dataset that meets two conditions: at least 50 non-missing values (due to the aforementioned reliability issue with sample sizes smaller than this) and at least one instance of variation (we do not test characters where all languages share identical 1 or 0 values). Given the extensive history of description of Australian languages as phonotactically homogenous, our prior expectation is that testing binary data will fail to yield significant phylogenetic signal. Indeed, we might expect that D will fall significantly below 0, indicating that values are clumped among tips on the reference phylogeny even more conservatively than would be expected if they had evolved in the same phylogenetic pattern as lexical data.
To evaluate the statistical significance of D for any given character, a p value is estimated for each of two null hypotheses: The null hypothesis that D = 1 (H 0(D=1) , character values are distributed randomly with regards to phylogeny) and the null hypothesis that D = 0 (H 0(D=0) , character values are distributed as could be expected if the character has evolved along the phylogeny according to a Brownian motion model). Each p value is calculated using a randomization procedure: A random distribution of D scores is acquired by randomly shuffling character values among tips on the tree for 10,000 permutations. The conventional cutoff for statistical significance is p = 0.05. Here, we use the corresponding Bonferroni-corrected cutoff of 0.025. 5 For any given character, there are six possible results: • D is significantly below 0. Character values are even more tightly clumped among close relatives than Brownian motion alone would lead us to expect. • D is significantly below 1 and not significantly different from 0. The data patterns phylogenetically, i.e., there is phylogenetic signal. • D is significantly above 0 and below 1. The data is neither clearly random nor clearly phylogenetic. • D is significantly above 0 and not significantly different from 1. It is consistent with randomness, not phylogeny. • D is significantly above 1. It is even more dispersed than expected via a random process.
• D is not significantly distinct from 0 nor 1. The patterning of the data is indeterminate, and cannot be distinguished from randomness nor from Brownian phylogenetic evolution.
To summarize, the testing procedure proceeds as follows. For each binary biphone character, if the character has at least 50 non-NA values and the character has at least one '1' and one '0' value (i.e. not every value is identical), then test as follows (otherwise discard): • Calculate D • Conduct randomisation procedure to calculate p for H 0 : D = 0 • Conduct randomisation procedure to calculate p for H 0 : D = 1 A result is interpreted from the combination of D and two p values.

Results for binary phonotactic data
We estimate D for 415 biphone characters using a script based on the phylo.d function in the caper package (Orme et al. 2013), implemented in the statistical software R (R Core Team 2017) 6 . As described in Section 2, D is calibrated by simulating character evolution under two models-one where the character evolves at random relative to phylogeny and one where the character evolves following a Brownian motion threshold model. In this study, we conduct 10,000 permutations of each model for each character. The 415 D values cluster centrally around a mean of 0.59. The distribution is leptokurtic (kurtosis = 10.9), meaning there are more outliers relative to a normal distribution, making the distribution appear as a tall, narrow peak with long tails (Figure 2). The standard deviation is large (3.64). The D test was of indeterminate significance for half of all characters (238 characters, 57% of the dataset). The D scores for 157 characters (38% of the total dataset) show evidence of phylogenetic signal. Just 16 characters (4%) show the opposite result, where the character is consistent with randomness and there is no phylogenetic signal present. Both null hypotheses are rejected for the remaining 4 characters. Of these, 3 are more clumped than their phylogenetic expectation and 1 falls somewhere between phylogenetic and random expectations (0 < D < 1). None are more dispersed than the random expectation. The distribution of these results among different biphone characters is plotted in Figure 3. Note that we expect around 5% of null hypothesis rejections (approximately 18 of 180 rejected null hypotheses) to be false discoveries. Nevertheless, when considering the whole dataset as an ensemble rather than each character individually, a general result can be discerned. The clearest conclusion is that binary, permissibility-based characters tend to be low yielding in information, giving a statistically significant outcome in fewer than half of cases. Nevertheless, where a significant result can be determined, phylogenetic signal does tend to be present-to a degree that is perhaps surprising in light of previous literature describing the relative homogeneity of Australian phonotactic restrictions and their lack of utility in historical endeavours. This result suggests there may be a greater degree of historical information contained in Pama-Nyungan phonotactics than previously thought. However, it may be that a finer-grained approach to data extraction is needed in order to detect it.
These results can be compared to two earlier studies performing the same test on much smaller samples of languages. In the first, Macklin-Cordes & Round (2015) find no evidence for phylogenetic signal in the Yolngu subgroup of Pama-Nyungan-rather, data are significantly over-clumped, suggesting a higher degree of conservatism in phonotactic restrictions relative to lexical data. They fail to reject a null hypothesis of D = 0 for Ngumpin-Yapa, suggesting there may be a degree of phylogenetic signal in the Ngumpin-Yapa dataset. However, the pilot study results should be treated with caution-particularly the failure to reject the D = 0 null hypothesis in the case of Ngumpin-Yapa-due to the small sample sizes (10 languages for Ngumpin-Yapa, 7 for Yolngu), well below the minimum of 50 taxa recommended by Fritz & Purvis (2010). Dockum (2018) performs the same analysis using biphone characters from 20 Tai lects of the Kra-Dai family.
In contrast to Macklin-Cordes & Round (2015), Dockum finds some evidence of phylogenetic signal in the Tai data and suggests perhaps the earlier result was due to insufficient variation in that particular language sample rather than a limitation of binary biphone characters per se. Although the low information yield from binary data is to be expected, our results here appear to support Dockum's conclusion.  Figure 3: Phylogenetic signal significance testing for binary biphone characters. This grid colour-codes each biphone character according to the results of its respective significance tests. The grid is arranged such that the vertical axis represents the first segment of the biphone and the horizontal axis represents the second segment. Besides a tendency for phonotactic restrictions at word boundaries to show phylogenetic signal, few patterns stand out.

Robustness checks
Given a sufficient number of taxa for which data are available (>50), D scores should reflect a degree of phylogenetic signal present in the data, independently of tree size (the number of taxa) and shape (branching patterns). To check this, in Figure 4(A) for each character we plot its D score against the number of doculects for which it had non-missing values. Irrespective of the number of doculects that supply non-missing values, the D scores appear to cluster centrally around mean D for the dataset, suggesting that D is not being unduly affected by missing values for particular characters. A second check leads to rather different results, however. We check whether skewed distributions of character values affects D scores (Figure 4(B)). Here, we consider the distribution of 1s and 0s for each character and plot D against how skewed the distribution of character values is towards a particular value. For example, a character where there are 107 '1' values in the dataset and only 4 '0' values will have a skewing rating of 0.96 (the count of '1' values, 107, divided by 111 total observations). Here, we find that when the ratio of 1s and 0s for a character is highly unequal, estimates of D tend towards extreme magnitudes while also being unrevealing, that is, statistically distinguishable neither from 0 nor 1. 7 As described above, Australian languages are known for homogeneity in phonological inventories and phonotactic restrictions, so consequently there are many characters with skewed distributions affecting the results. In Figures 5-6, we plot a subset of the D test results, restricted to characters with less skewing.

Phylogenetic signal in continuous phonotactic data
We test whether a higher degree of phylogenetic signal is detectable in continuous-valued biphone characters. As in the previous test, we take every possible sequence of two segments, or biphones, in our sample of 111 Pama-Nyungan wordlists. In this case, however, rather than simply coding for the presence of a biphone in a language's lexicon, we consider the relative frequencies of transitions between the segments in that biphone across the language's lexicon. For each biphone character, we take two values: The Markov chain forward transition probability-that is, for a biphone xy, the probability of x being followed by y, normalized over all instances of x. This captures, if only in a basic way, our awareness that words do not consist of strings of independent segments, but rather the probability of observing some segment is very much dependent on what came before it. Secondly, we take Markov chain backward transition probabilities-that is, for the biphone xy, the probability of y being preceeded by x, normalized over all instances of y in the lexicon. The frequency characters we extract come from wordlists. This is advantageous in that they are somewhat independent of word frequency effects since each word is counted only once, in contrast to frequencies extracted from language corpora. On the other hand, speakers show sensitivity to phoneme frequencies in language use (for example, when coining novel words) (Coleman & Pierrehumbert 1997;Zuraw 2000;Ernestus & Baayen 2003;Albright & Hayes 2003;Eddington 2004;Hayes & Londe 2006;Gordon 2016) so word frequency will likely have some effect on phoneme and biphone frequency even in a wordlist. Investigation of phylogenetic signal in frequency characters extracted from corpora versus wordlists may be a possibility for future study.
We quantify phylogenetic signal by estimating K (Blomberg, Garland & Ives 2003) individually for each character, using the multiPhylosignal function, in the picante package (Kembel et al. 2010), in R statistical software. The K test has somewhat greater statistical power than the D test, enabling us to apply the test to characters with as few as 20 non-missing values. Calculation of K works with non-zero values only, so zero values (where the language contains both segments x and y but x is never followed by y, or vice versa) are considered not applicable and removed from calculation. A total of 490 characters (245 biphone forward transition probabilities and 245 backward transition probabilities) meet the criterion of at least 20 languages with non-missing and non-zero values for testing. Subsequently, to evaluate whether the level of phylogenetic signal is significant for a given character, we conduct Blomberg, Garland and Ives' (2003) randomization procedure with 10,000 random permutations per character.
To summarize, this testing procedure proceeds as follows. For each biphone frequency character, if the character has at least 20 non-NA values and the character has at least two unique frequency values (i.e. not every character value is the same), then test as follows (otherwise discard): • Calculate K • Conduct randomisation procedure to calculate p for H 0 : K = 0 Mean K for all 490 characters is 0.54 (SD 0.21) (Figure 7). This is comparable to certain physiological traits We consider whether phylogenetic signal is higher or lower in certain kinds of biphone characters. Figure 8 shows a matrix of K scores for forward transition characters, with rows and columns arranged by phonological natural class. No clear pattern stands out. This heat grid shows K scores for biphone characters (forward transition frequencies only). Each square represents a biphone (where the first segment is listed on the vertical axis and second segment on the horizontal axis). Data points are taken from the frequencies of each biphone, xy, over the total frequency of segment x in each language, and then phylogenetic signal K is measured for each biphone. Darker red shades indicate a stronger degree of phylogenetic signa.
As with the D test, no clear pattern of high versus low K scores stands out, although there is a high degree of phylogenetic signal in the dataset overall.

Robustness checks
Although K is intended to be a measure of phylogenetic signal that is independent of tree size and shape, tree size and shape can have some effect on results in practice (Münkemüller et al. 2012). We wish to check that the Pama-Nyungan tree does not contain any unusual properties that could cause either the K statistic or the randomization procedure to perform unexpectedly. To do this, we allow simulated characters to evolve specifically along the Pama-Nyungan reference tree. We vary the model of evolution, between perfect Brownian motion along the entire tree and pure randomness generated directly at the tips of the tree, by mixing different strengths of Brownian phylogenetic signal and non-phylogenetic noise. 1000 traits are simulated at each percentage point interval for 100,000 total simulated traits (in other words, 1000 traits simulated with 100% Brownian motion, then 1000 traits simulated with 99% Brownian motion and 1% randomness, and so on, until the traits evolve 100% at random). Each simulated trait is then tested for statistical significance using the randomisation procedure described in Section 2, with 1000 repetitions to determine a p value. In a robust testing scenario, K will scale appropriately between 0 and 1 according to the level of Brownian motion and random noise being simulated, and the randomisation procedure will distinguish between traits with and without a significant degree of phylogenetic signal with a satisfactory amount of Type I (false positive) and Type II (false negative) errors.
The results are plotted in Figure 9. The K statistic shows a considerable degree of variability but, in the absence of substantial random noise, centres slightly below K = 1 which suggests the statistic is behaving as expected (if not slightly conservatively) when phylogenetic signal is present. For characters whose simulated evolution is near-random, the baseline of K seems to be elevated a little by our particular reference tree, with non-phylogenetic simulated characters ranging from the expected K = 0 to around K = 0.3. This should be kept in mind when interpreting K scores across our results. As for the randomisation procedure, Figure 9(B) shows the percentage of simulated traits that were identified as having a significant degree of phylogenetic signal (p < 0.05) at a given level of Brownian motion mixed with random noise. Above around 65% Brownian motion, there are no Type II errors. The ability to detect significant phylogenetic signal drops as the level of random noise increases beyond 35%, though overall the test's sensitivity seems acceptable. At the opposite extreme, where characters are simulated completely at random (and, therefore, there is no phylogenetic signal to detect) the randomisation procedure falsely detects phylogenetic signal 5.2% of the time, very close to the expected false discovery rate of 5% (given the conventional threshold for statistical significance of α = 0.05).
On the basis of these simulations, we are satisfied that randomisation procedure is sufficiently robust, given the particular size and shape of the Pama-Nyungan reference phylogeny.  Figure 9: Behaviour of the K statistic and randomization procedure with the Pama-Nyungan reference phylogeny. Artificial characters are simulated evolving along the phylogeny with varying levels of non-Brownian noise. Where a pure Brownian motion process operates, K averages around 1, as expected. Where there is no Brownian process at all (and therefore no phylogenetic signal) K is elevated to around 0.2-likely an artefact of this particular tree size and shape.
As a final check, we consider whether the K statistic might be affected by the quantity of missing or 'not applicable' values for a given character. We inspect this visually by plotting, for all biphone characters, the relation between a biphone's K score and the number of language varieties with non-missing data points on which K was calculated ( Figure 10). When K is calculated on fewer than around 40 non-missing values, the statistic shows a wider degree of variability. In addition, phylogenetic signal is deemed statistically significant for fewer characters in this range, suggesting that the quantity of missing values is affecting the statistical power of the test. However, all K scores cluster centrally around the mean regardless of the number of languages with non-missing values, suggesting that the mean K we observe for the dataset overall is not significantly affected by missing data.

Forward transitions versus backward transitions
We find no significant difference in the means of K for forward transition characters (mean K = 0.54) and backward transition characters (mean K = 0.53) (t = 0.44, df = 487.98, p = 0.661, 95% CI [-0.03, 0.05]). The distributions of K scores for forward and backward transitions are plotted in Figure 11, showing a high degree of overlap between the two.  We find no significant difference between these character types.

Normalisation of character values
Visual inspection of the density plots for each character shows there is a tendency for character data to be negatively skewed (the weight of the distribution is left-of-centre), although this is not universally the case.
To test whether the particular, heavy-tailed nature of the data has an effect on tests for phylogenetic signal, we apply Tukey's Ladder of Powers transformation to each character in the dataset and re-run both the K test and randomization procedure. This is a power transformation, which makes the data fit a normal distribution as closely as possible. It does this by finding the power transformation value, λ, that maximizes the W statistic of the Shapiro-Wilk test for normality for each character individually. For our purposes, this transformation is effectively a change in the evolutionary model: A Brownian motion process is still assumed-a character value may wander up or down with equal probability-but, in this model, character values shift up or down along a transformed scale.
Mean K for normalized character data is 0.55 (SD 0.2). Of 490 characters, 408 or 83% (208 forward transitions, 200 backward transitions) contain phylogenetic signal significantly above the random expectation.

Phylogenetic signal in natural-class-based characters
One limitation of analysing phylogenetic signal in biphone characters is the assumption that every biphone character is a statistically independent observation. In historical linguistic processes, however, phonological segments rarely behave independently. Rather, sound changes are applied to whole sound classes, thereby affecting any one of various cross-cutting sets of phonological segments (and, therefore, biphone characters we have used in this study).
To account for this non-independence and more faithfully model what we know about how phonotactic systems operate in a language, we extract forward and backward transition probabilities for sequences of phonological features. For the purposes of this experiment, word boundaries are counted as a class and vowels are reduced to a single 'vowel' class. Three sets of characters are extracted: foward and backward transition probabilities between natural classes based on place of articulation (segments belonging to the following classes: word boundary, labial, dental, alveolar, retroflex, palatal, velar, glottal, vowel); forward and backward transition probabilities between natural classes based on major places of articulation, where coronal contrasts have been collapsed (word boundary, labial, apical, laminal, velar, vowel); and natural classes based on manner of articulation (word boundary, obstruent, nasal, vibrant, lateral, glide, rhotic glide, vowel). The choice of natural classes is based on well-established principles of organisation among segments in Australian languages (Dixon 1980;Hamilton 1996;Baker 2014;Round 2021a;Round 2021b). Table 1 presents mean K and the proportion of significant characters for each of these three natural-class-based datasets. All show highly similar distributions ( Figure 13). There is no statistically significant difference in the means of K for the three feature types, according to a one-way ANOVA (F (, ) = 0.46, p = 0.6288925). An Anderson-Darling k-sample test, which tests the hypothesis that k independent samples come from a common, unspecified distribution (i.e., no prior assumption about normality) also finds no significant difference in the distributions of K scores for the three natural-class-based datasets (AD = 2.15, T.AD = 0.14, p = 0.34).

natural-class-based characters versus biphones
We compare the degree of phylogenetic signal in the biphone data tested in Section 5 to the natural-classbased data tested here (see Figure 14). Mean K for all natural-class-based characters is 0.61, which is significantly higher than the mean K for biphone data, 0.54 (t = 4.37, df = 618.89, p = 0, 95% CI [0.04, 0.1]). The Kolmogorov-Smirnov test, which is more sensitive to the overall shape of the distribution than a t-test comparison of means, also finds a significant distinction between K for biphone characters and K for natural-class-based characters (D = 0.2, p < 0.001).  Figure 14: Distributions of K scores for biphone characters, coding the relative frequencies of transitions between phonological segments, and natural-class-based characters, coding the relative frequencies of transitions between natural classes of segments. Phylogenetic signal is higher overall in the natural-class-based dataset than the biphone-based dataset

Discussion
Computational phylogenetic methods are increasingly commonplace in historical linguistics. However, there has been relatively less consideration of the range of data types that might profitably be used with computational phylogenetic methods, beyond traditional, manually-assembled sets of lexical cognate data. In this study, we have considered the potential utility of quantitative phonotactic data for historical linguistics, for the reasons that quantitative phonotactic data is (i) readily extractible from basic wordlists, and (ii) may show certain kinds of historical conservatism, where the historical signal in more traditional lexical data would be affected by borrowing and lexical innovation.
We extracted frequencies of transitions between phonological segments in scrubbed and comparably segmented wordlists representing 111 Pama-Nyungan language varieties. As points of comparison, we extracted two additional datasets: Firstly, a binarized version of the dataset, which simply records whether or not particular two-segment sequences, a.k.a. biphones, are present in a language's wordlist. This is to emulate, in a simple sense, the kind of information which is often recorded in the phonology section of published descriptive grammars. We also extracted frequencies for transitions between natural classes of sounds. This is to account (at least, to some partial degree) for the fact that phonological segments tend not to evolve independently but pattern into natural classes, thereby limiting the independence of biphone-based variables.
To test whether historical information is preserved in our phonotactic datasets, we tested for phylogenetic signal, that is, the degree to which variance in the data reflects the evolutionary history of the 111 language varieties. We took an independent phylogenetic tree, inferred by the second author using lexical cognate data, and assumed a simple Brownian motion model of evolution. Our first key finding is that a significant degree of phylogenetic signal is detected in all three datasets-binary, segment-based and sound class-based. Finding phylogenetic signal in the binary dataset is somewhat surprising, given previous descriptions of homogeneity in the phonotactics of Australian languages. Our second key finding is that phylogenetic signal is significantly stronger in the higher-definition, frequency-based datasets than it is in binary dataset. In turn, phylogenetic signal in the sound class-based dataset is significantly stronger than the segment-based dataset. We took a closer look at certain comparisons within datasets, namely, whether there is a difference between forward and backward transitions, different kinds of sound classes, or between original (heavy-tailed) data and transformed data (to more closely fit a normal distribution). In all three cases, no significant differences were found.

Overall robustness
One important assumption in this study is that the tree being used as a reference phylogeny is an accurate depiction of the phylogeny for the languages in question. In the absence of time travel, it is impossible to observe directly the true Pama-Nyungan phylogeny and thus surely satisfy this assumption. Instead, we must rely on best available data and methods to infer the phylogeny. As described in Section 3.3, the Pama-Nyungan phylogeny we use in this study was inferred by the second author using Bayesian computational phylogenetic methods, which produce a posterior sample of possible trees. The specific reference tree we use in this study is a best possible summation of this posterior sample as determined by the maximum clade credibility method. Naturally, there is a degree of uncertainty associated with the topology of the maximum clade credibility tree. It may be the case that a better representation of the true Pama-Nyungan phylogeny exists among the many slight permutations contained within the posteior tree sample.
To evaluate the robustness of our results against phylogenetic uncertainty, we repeat the K test for phylogenetic signal on a subset of sound-class-based characters and each of a subset of 100 trees selected from the posterior sample. The subset of characters includes forward and backward transitions between place features, plus forward and backward transitions for manner features. The trees selected are the 100 best trees in the posterior sample, based on the maximum clade credibility metric. In this way, we capture a degree of uncertainty in the topology and branch lengths of the Pama-Nyungan phylogeny, while restricting our attention to a subset of the most credible alternatives. 214 sound class characters in total are tested, giving 21,400 K statistics in total (each character multiplied by 100 trees). The mean of these K scores is 0.59, which compares to a mean K of 0.6 for the same characters applied only to the maximum clade credibility tree as in Section 6. This difference is not statistically significant (t = -0.95, df = 217.17, p = 0.342, 95% CI [-0.05, 0.02]). The distributions of these K scores are illustrated in Figure 15.
In Section 3.2 we mentioned that our wordlists vary significantly is size. The length of a wordlist can correlate with many other linguistic properties that it has. For instance, in our data, the number of entries in a wordlist and the mean phonemic length of those entries have a Pearson's correlation coefficient of r = 0.71 (p < 0.001).
That is, longer lists tend to contain longer words (presumably since shorter lists are weighted towards more basic, shorter vocabulary items). The existence of correlations like this means that it is not possible simply to 'counterbalance' the length of wordlists by sub-sampling or resampling their items so that the resampled lists all have the same length. For example, the resampled lists that derive from longer underlying lists would still contain longer words. Nevertheless, it would still be desirable to know whether our results in Sections  4-6 are unduly influenced by the disparity in our wordlist lengths, by manipulating it in a controlled fashion.
To do this, we extracted two different language sub-samples from our dataset. For the first, we ranked all wordlists by size and selected every second language, producing a sample half the original size but with the same disparity in lengths. For the second, we ranked all wordlists by size and selected the middle 50% of the ranking, again producing a sample of half the size but this time with heavily reduced disparity, as shown in Figure 16. Wordlists in the middle 50% range in length from 528 to 1361 (mean 872).
For these samples, we ran K tests on place and manner natural class characters. Our reasoning is that if wordlist disparity strongly affects the estimation of phylogenetic signal, then we should see a clear difference in the results. Mean K for the middle 50% of wordlists is 0.77, which is somewhat greater than the mean for every second wordlist 0.72. This difference is small but statistically significant (t = 2.44, df = 423.93, p = 0.015, 95% CI [0.01, 0.1]). This suggests that disparities in wordlist length are somewhat degrading the phylogenetic signal in our data, although there remains broad similarity between them, as pictured in Figure  17.
One attributing factor for this small degradation in phylogenetic signal may be measurement error among the bottom quartile of small wordlists. Throughout this study, we have assumed that all character values are accurate and do not account for measurement error. Accounting for measurement error when testing for phylogenetic signal is an area of active development in comparative biology (Zheng et al. 2009). In future studies, this degradation in phylogenetic signal could be investigated by relating variation in K statistics carefully to various linguistic properties that correlate with the length of wordlists.

Limitations
Any investigation of phylogenetic signal in essence is an investigation of cross-linguistic (dis)similarity. Accordingly, whenever our representations of linguistic facts are altered, then the phylogenetic signal detectable in them will almost certainly change to some degree. In this paper our focus has been trained on the initial, fundamental question of whether phylogenetic signal is detectable in maximally simple phonotactic characters. However, as work like ours increases, one priority will be to investigate how investigators' choices about how data is represented affects results.
Relevant for the current study, in Section 3.2 we described a process of normalisation. The motivation for this was to attempt to minimize certain aspects of variation in phonological representation that can arise from variation in how different linguists analyses the same essential facts (Chao 1934;Hockett 1963;Hyman 2008;Dresher 2009). While normalisation per se ought to improve the quality of cross-linguistic comparisons that the data enables, there is still the question of which targets one ought to normalize the data towards, on what effect that choice can have. For example, a reviewer asks whether our choice to split up complex segments might amplify phylogenetic signal if it leads to certain phylogenetically distributed complex segments counting instead as biphones. This can be answered in three ways. First, in the general case, since splitting segments changes representations, it will alter aspects of (dis)similarity in the data, and so is very likely to affect phylogenetic signal in some manner. Second, in this particular case, the reviewer is likely to be correct, due to details of our method. Any cross-linguistically rare, complex segment would likely get excluded from our dataset. This is because, although it would figure in certain biphones, we have made use only of biphones that reach a minimal level of recurrence across our language sample, and thus the biphones containing the rare segment quite likely would not qualify. However, if we split this complex segment into two segments, thus into a biphone, the resulting biphone may well have sufficient cross-linguistic recurrence to qualify for inclusion in our dataset, and subsequently may contribute to raising phylogenetic signal. A final observation on this point is that such questions, about how choices in data representation interact with the results from corresponding quantitative analysis, are made tractable by our method of data preparation. Unlike many state-of-the-art cross-linguistic datasets, in which values for each language are hand-coded and thus incapable of being 'recalculated' under altered assumptions, our phonotactic characters are generated algorithmically from an underlying, very rich dataset. With a change to algorithmic parameters, we can systematically split segments or glue them together, neutralize them or keep them distinct, and document what we have done and how. As mentioned, in this paper our focus is on the simple existence of phylogenetic signal. Our methods, though, naturally extend to enable comprehensive checking of such interactions between data choices and results. Ultimately, as a discipline, we would like this to be true for all typological research, not just phylogenetics (Round 2017a). An advantage of our general approach, is that it open the doors to this rigorous mode of inquiry.
A further limitation of this study relates to the assumption that the data being tested for phylogenetic signal are independent of the data that was used to infer the reference phylogeny. In this study, the wordlists from which we extracted phonotactic characters contain, as a small subset, the basic vocabulary items from which lexical cognate characters were inferred and subsequently used to build the reference phylogeny. It is unclear exactly to what degree this inclusion of basic vocabulary compromizes the independence of our reference tree and phonotactic data. A reviewer points out that cognate data and phonotactic data are still somewhat independent, even when extracted from identical wordlists, since phonotactic attributes are not directly called on to make cognacy judgements. Nevertheless, sound change affects both phonotactics and cognate identification, so some degree of non-indepenence is to be expected. To ascertain whether this effect is significant, future studies could parameterize the inclusion/exclusion of basic vocabulary from the phonotactic data.
One reviewer raises the correlation between phylogeny and geography. A noted limitation of phylogenetic comparative methods is the inability to account for geography as a possible confound (Sookias, Passmore & Atkinson 2018) and this limitation applies to this study. Although we leave it as a priority for future work, the task of disentangling phylogeny and geography is not intractable. For example, Freckleton & Jetz (2009) present a method for quantifying the relative degree of spatial versus phylogenetic effects in comparative characters.
One final point to note is that recent research suggests that phylogenetic signal can be inflated when character values evolve according to a Lévy process, where a character value can wander as per a Brownian motion process, but with the addition of discontinuous paths (i.e., sudden jumps in the character's value) (Uyeda et al. 2018). This is a realistic concern in the linguistic context, where segment frequencies are subject to sudden shifts caused by phonological mergers and splits. The possibility of Lévy-like evolutionary processes is a matter of concern also in comparative biology and methods to investigate it are subject to active development in that field (Uyeda et al. 2018).

Conclusion
Historical and synchronic comparative linguistics are increasingly making use of phylogenetic methods for the same reasons that led biologist to switch to them several decades ago. Our central contention has been that phylogenetic methods not only give us new ways of studying existing comparative data sets, but open up the possibility to derive insights from new kinds of data. Here we demonstrate the potential for phylogenetically investigating phonotactic data, by showing that it indeed contains the kind of phylogenetic signal which is the prerequisite for a whole spectrum of phylogenetic analyses.
We find significant phylogenetic signal for several hundred phonotactic characters extracted semi-automatically from 111 Pama-Nyungan wordlists, demonstrating that historical information is detectable in phonotactic data, even at the relatively simple level of biphones and despite ostensibly high phonological uniformity. Contrary to the prevailing view in literature on Australian languages, and contrary to the findings of an earlier pilot study on a much smaller language sample, we find that binary characters marking the presence or absence of biphones in a doculect contain enough phylogenetically-patterned variation to detect phylogenetic signal. However, we find that statistical power is relatively low when operating with coarse-grained binary data and quantification of the degree of phylogenetic signal is affected by a large number of low-variation characters, where all but one or a few doculects share the same value. We find stronger phylogenetic signal in biphone characters of forward and backward transition frequencies. This reaffirms the results of earlier work, for the first time on a sample of languages spanning an entire large family and the vast majority of a continent. It also reaffirms earlier findings that Australian phonologies show a greater level of variation than traditionally has been appreciated, once matters of frequency are taken into account. We find a significantly greater level of phylogenetic signal again in characters based on the frequencies of forward and backward transitions between natural sound classes. The sound-class-based approach reduces the quantity of characters available to test, but limits sparsity in the dataset and accounts somewhat better for the role of sound-classes in the evolutionary processes that affect phonotactic patterns in human language. Interestingly, although there exists considerable variation in the level of phylogenetic signal found in individual characters, we find no observable pattern to this variation in segment-based biphone characters nor between the mean levels of phylogenetic signal observed for different kinds of sound classes (e.g., characters concerning place versus manner of articulation).
This work has implications for comparative linguistics, both typological and historical. Firstly, we recommend the use of phylogenetic comparative methods in typological work where the phylogenetic independence of a language sample (or lack thereof) is paramount. In the immediate term, this should be the case for any typological work concerning phonotactics, even in parts of the world such as Australia where phonotactics traditionally have been assumed to be relatively independent of phylogeny. Beyond phonotactics, however, explicit measurements of phylogenetic signal can be made for any set of cross-linguistic data and this can be built into statistical analysis, even in the presence of gaps and uncertainty in phylogenetic knowledge.
In two decades of quantitative development in historical linguistics, there has still been relatively limited consideration of the kinds of characters used for inferring linguistic histories. The phonotactic characters presented here can be extracted relatively simply and in large quantities from wordlists, even where a full descriptive grammar is not available. Here, we test only the degree to which patterns of variation in our data match our independent, pre-existing knowledge of the phylogenetic history of the Pama-Nyungan family, however, the results suggest that phonotactic data of this kind could be used where the phylogeny is less certain, either by incorporating phonotactic data into phylogenetic inference directly or by constraining parts of the tree where lexical data on its own returns some doubt.

S1. Pama-Nyungan reference phylogeny
Quantifying phylogenetic signal requires an independently-derived reference phylogeny as a yardstick. Our reference phylogeny is a 285-tip Pama-Nyungan phylogeny inferred by the second author. Figure S1 gives a 111-tip subset of this phylogeny, corresponding to the 111 doculects used in this study.
As discussed in the main paper, the reference phylogeny was constructed using Bayesian phylogenetic methods in the software BEAST2 (Bouckaert et al. 2014). Bayesian phylogenetic methods use a Markov Chain Monte Carlo (MCMC) procedure to efficiently search the hypothesis space of possible trees and return a large posterior sample of similarly credible alternatives (capturing phylogenetic uncertainty). The tree in Figure S1 above is a maximum clade credibility tree, which is a summation of the posterior sample where the likelihood of all nodes in the tree (in terms of how frequently a given node reappears across the posterior sample) is maximized.
For further details on the Pama-Nyungan phylogeny used as a reference phylogeny in this study, see . See also Bowern & Atkinson (2012), which infers a Pama-Nyungan phylogeny in the same way, using exactly the same evolutionary model parameters, but with an earlier iteration of the dataset containing fewer doculects. Additional discussion of the general process of constructing language phylogenies in BEAST2 can be found in , although a different evolutionary model is used.
The reference phylogeny was inferred using lexical cognate data, coded according to the principles of the Comparative Method. The cognate data used in the reference phylogeny is publicly available on Zenodo (Bowern 2018) and also as a subset of the 305-language dataset in Bouckaert, . This latter source also includes a Perl script for converting multistate cognate judgements into a binary matrix for use with BEAST2 phylogenetic software and will include information on underlying sources.  Figure S1. Pama-Nyungan reference phylogeny. Node labels are posterior probabilities, giving an indication of support for each node. Although there is no strict, conventional cut-off, clades with posterior values above 0.5 are considered supported and values above 0.8 are considered strongly supported (Bowern & Atkinson 2012, p.829).

S2. Wordlist sources
The 111 wordlists used in this study are contained within the Ausphonlex database, under development by Round (2017). All underlying wordlist data is available, either publicly in the CHIRILA database  or elsewhere in published or archived form. A list of original sources for all wordlists is presented below.

S3. Guide to code and data
The code and data used in this study are publicly accessible on Zenodo at http://doi.org/10.5281/zenodo. 3936353. Unzipping the file reveals a directory containing four subdirectories, /trees, /data, /R, and /results.
The /trees subdirectory contains two files: PNY10_285.(time).sum.tree, which is a Nexus format tree file for the Pama-Nyungan maximum clade credibility tree used as a reference tree throughout the study. PNY10_285.trees contains the full posterior sample of trees and was used to check the robustness of our results against phylogenetic uncertainty.
The /data subdirectory contains 9 comma-separated (csv) spreadsheets containing all the frequency data used in the study. In all cases, the first column lists language variety names and the first row lists characters (variables) for analysis. The biphone_binary spreadsheet contains binary permissibility data for the D test for phylogenetic signal. A '1' value indicates that the biphone occurs at least once in that language variety's wordlist. A '0' indicates that the biphone never appears in the language variety's wordlist. A missing value (represented by 'NA') is entered where one or both of the phonological segments in the biphone is not part of the language variety's phonological inventory and therefore would be impossible to observe. The biphone_fwd and biphone_bkwd spreadsheets give the forward and backward transition probabilities for each language variety. Once again, missing values occur where the language lacks entirely one of the segments in a particular biphone. Otherwise, as discussed in the main paper body, the frequencies given are the frequencies of the sequency xy, relativised over all instances of x (forward transition) or the frequencies of the sequency xy relativised over all instances of y (backward transition). The remaining spreadsheets give frequencies of transitions between natural sound classes. Natural classes are split into three categories, manner, place and major place. The format of the spreadsheet filenames is {class type}_{transition direction}_{file creation date}.csv. So, for example, the spreadsheet beginning with place_fwd gives the forward transition frequencies for transitions between places of articulation.
The /R subdirectory contains code used to perform the analysis for the study and create figures for the main text of the paper. The analysis.R script is written to run in R statistical software (R Core Team 2017). To run the analysis, the first step is to set the R working directory to the /R subdirectory. It is important to keep the file structure of the S2 directory intact, since the script requires access to the data, trees and results subdirectories. The first lines of the script load all its required packages. If any packages are missing from the machine, these will need to be installed. All packages used are standard packages available on the CRAN network (https://cran.r-project.org), and installation is straightforward using the install.packages("package-name") command in the R console. The script can be run from the R console using the command source("analysis.R"). It has been run succcessfully (approximately 45 minutes runtime) on a 2015 Macbook Pro with 8GB memory, with the following R session info: R version 3.6.2 (2019-12-12) Platform: x86_64-apple-darwin15. Finally, the create_figs.R script contains code used to produce figures for the main text body. The script saves each figure as a PDF file in a /fig subdirectory (you will need to create this directory first, if it is not present already). Figures are produced with the ggplot2 package, using the system of visualisation described by Wilkinson & Wills (2005).
The /results subdirectory contains original csv spreadsheets of results, generated as output from analysis.R and used in the study. Results from the D test for phylogenetic signal in binary phonotactic data are contained in the spreadsheet D_test_results_2020-06-06.csv. The biphone column lists the biphone character tested. Note that the underscore in each biphone label is purely to aid readability and carries no linguistic meaning (it is not intended to look like the environment of a generative phonological rule). The languages column gives the number of languages with non-missing values for which phylogenetic signal was tested for that particular character. count_0 and count_1 columns give the number of observed 0 values and 1 values respectively. D gives the observed D statistic for each character and the pval_0 and pval_1 columns give the uncorrected p values for the two null hypotheses that D = 0 and D = 1 respectively. The columns pval0_sig and pval1_sig give text descriptions interpreting the significance of the p values. If a p value is small, the character will be either significantly more clumped or significantly more dispersed than the null hypothesis. Otherwise, the character will be consistent with the null hypothesis, either the phylogenetic null hypothesis (D = 0) or randomness null hypothesis (D = 1). Finally, the result column gives an overall interpretation of the result of the D test and two accompanying p values. There are six possible categories a character may fall into: • (i) More clumped than the phylogenetic null hypothesis, listed as more clumped.
• (ii) consistent with the phylogenetic null hypothesis and more clumped than the random null hypothesis (i.e. there is significant phylogenetic signal), listed as phylogenetic.
• (iii) consistent with both null hypotheses, so a result cannot be determined either way, listed as indeterminate (neither H0 rejected).
• (v) consistent with the randomness null hypothesis and more dispersed than the phylogenetic null hypothesis, listed as random.
The sixth possible category a character can fall into, which is written into the analysis.R script, is: more dispersed than the randomness null hypothesis, listed as more dispersed. However, in this study, we find no characters that fall into this category and thus it does not appear in the spreadsheet of results.
Results files for the K tests are given in spreadsheets named according to the format K_{character type}_{transition direction}_results_{date of analysis}_.csv. There are four spreadsheets of results for biphone characters: Original forward and backward transition frequencies, and normalised forward and backward transition frequencies. For natural class-based characters, there are two spreadsheets corresponding to forward and backward transition frequencies for each of three class types: manner, place and major place.
The \results subdirectory also contains spreadsheets of results produced by wordlist_size_uncertainty.R. The results files obtained by taking every 2nd wordlist (when wordlists are ranked by size) are prefixed with Every2nd. The results files obtained by taking the middle 50% of wordlists are prefixed with IQR ('interquartile range'). The results of tree_uncertainty.R are saved as an .Rdata file which can be opened in R software. The file contains four list objects, each of length 100. The four lists contain results of K tests applied to (i) forward transitions between manner classes, (ii) backward transitions between manner classes, (iii) forward transitions between place classes and (iv) backward transitions between place classes. Each of the 100 elements in each list corresponds to a tree in the Pama-Nyungan posterior tree sample. Each element contains a data frame giving the results of K tests conducted on the applicable character set using a given tree in the posterior tree sample.