Abstract
This paper introduces the Hoosiers Arabic Ellipsis Corpus, a novel dataset targeting syntactic ellipsis in Arabic. Ellipsis poses a significant challenge to natural language processing (NLP) technologies. To address it, the corpus was built by using the Corpus Query Language (CQL) to extract ellipsis instances from the ArTenTen corpus. To the best of our knowledge, this is the first comprehensive dataset of its kind, filling a critical gap in resources for Arabic, which remains under-resourced in NLP studies. We evaluate the corpus through three computational experiments: detecting sentences with ellipsis, predicting the location of elided elements, and generating missing words using state-of-the-art large language models (LLMs). Results demonstrate that few-shot prompting significantly improves LLM performance, with Gemini 2.5 Pro achieving the highest accuracy in ellipsis detection (95.6%). However, LLMs struggle to precisely locate and reconstruct elided elements. The findings highlight the challenges of ellipsis processing in Arabic and point to the need for larger, more balanced datasets and further refinement of NLP models to handle structural inference.