Volume 18, Issue 1
  • ISSN 2211-6834
  • E-ISSN: 2211-6842


This article describes efforts to collect, process, and automatically annotate a corpus of Spanish as spoken in Texas. It elaborates the protocols for developing the corpus and the procedures for automatic annotation, illustrating common pitfalls of language identification in bilingual corpora and potential methods for circumventing them. The benefits of a comparative corpus approach to contact varieties are illustrated by a case study of a putative verbal calque from the Spanish in Texas data. The case study demonstrates that the relative frequency of the verb is much higher in Texas than in its source Mexican variety and that the verb selects different complements in Texas than it does in other varieties. The article concludes with a discussion of how computational tools might be fruitfully exploited to resolve long-standing debates about language variation in contact settings.
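
The relative-frequency comparison mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' procedure: the function names `per_million` and `frequency_ratio` are hypothetical, and the counts and corpus sizes are invented placeholders rather than figures from the article. It assumes raw token counts are normalized to occurrences per million tokens so that corpora of different sizes can be compared.

```python
def per_million(count: int, corpus_size: int) -> float:
    """Normalize a raw count to occurrences per million tokens."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return count * 1_000_000 / corpus_size


def frequency_ratio(count_a: int, size_a: int, count_b: int, size_b: int) -> float:
    """Ratio of normalized frequencies: corpus A relative to corpus B."""
    freq_b = per_million(count_b, size_b)
    if freq_b == 0:
        raise ValueError("the item does not occur in corpus B; ratio is undefined")
    return per_million(count_a, size_a) / freq_b


# Hypothetical counts for illustration only; NOT data from the article.
contact_freq = per_million(count=120, corpus_size=500_000)
reference_freq = per_million(count=400, corpus_size=20_000_000)
ratio = frequency_ratio(120, 500_000, 400, 20_000_000)

print(f"contact variety: {contact_freq:.1f} per million tokens")
print(f"reference variety: {reference_freq:.1f} per million tokens")
print(f"ratio: {ratio:.1f}")
```

Normalizing before comparing matters because a small contact corpus and a large reference corpus would otherwise yield raw counts that are not directly comparable.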




