The Aragón Institute for Engineering Research (I3A) at the University of Zaragoza hosted a scientific meeting to collect evaluation data in the Aragonese language, with a view to improving the proficiency of large language models (LLMs, such as ChatGPT or Gemini) in this language.
The aim of this ‘Datathon’, which took place last Friday the 13th, is to increase the digital presence of Aragonese and facilitate its survival in today’s technological ecosystem. The initiative follows the model successfully applied to other languages such as Basque, Catalan and Galician.
Around twenty people registered, thirteen of whom attended this first session.
The ‘Datathon’ is organised as part of Miguel López Otal’s doctoral thesis, supervised by Professor Jorge Gracia del Río; both are members of the Distributed Information Systems (SID) research group, and the project is advised by Juan Pablo Martínez, director of the Institute of Aragonese at the Aragonese Academy of Language and also a member of the I3A.
The data collected at this event will be used to test the capabilities of these artificial intelligence models in this Romance language and to seek ways to improve them. Although Aragonese is currently at serious risk of extinction according to UNESCO, it has a strongly committed and highly active community of speakers, whose role was fundamental to this event.

The volunteers who took part in the ‘Datathon’ proofread a set of over 10,000 sentences, automatically translated from Spanish into Aragonese using the Apertium tool, to check whether the translations were correct. Although this tool often produces accurate translations, it can make mistakes, so the proofreaders corrected any errors they found. The workflow thus combined machine translation with manual correction.
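The review pass described above can be sketched in a few lines of Python. This is only an illustration of the general pattern (keep the machine draft unless a human reviewer supplies a correction); the data layout and the placeholder strings are assumptions, not the project's actual format:

```python
# Hypothetical review pass over machine-translated sentence pairs:
# each pair holds a Spanish source and a machine-drafted Aragonese
# translation; a reviewer either accepts the draft or corrects it.

def review(pairs, corrections):
    """Return finalized (source, translation) pairs.

    `corrections` maps a pair's index to a human-corrected
    translation; pairs not in the map keep the machine draft.
    """
    return [
        (source, corrections.get(i, draft))
        for i, (source, draft) in enumerate(pairs)
    ]

# Placeholder data for illustration only.
pairs = [
    ("src1_es", "draft1_an"),
    ("src2_es", "draft2_an"),
]
final = review(pairs, {0: "corrected1_an"})
# Pair 0 takes the reviewer's correction; pair 1 keeps the draft.
```

The same accept-or-correct structure scales to the full sentence set, since only the minority of flawed drafts require a manual entry.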
The resulting datasets will be made openly available online to facilitate experimentation with the language. All of this forms part of an active effort to support Aragonese in today’s world of AI, where a shortage of training texts hinders language models’ ability to use the language competently, and encourages the search for alternative strategies. The compilation of these evaluation datasets will serve as a decisive step in this direction.