It is no surprise that technology has become a crucial part of our lives: most people own a mobile phone, tablet or computer. These devices keep us connected and allow us to create content and access services such as banking and e-commerce. Their integration into everyday life has also opened new lines of research into more secure access systems, such as the use of artificial intelligence techniques to recognise our face or voice.
These techniques are based on large neural networks that try to learn as our brains do, simulating our neurons and their trial-and-error learning process. "These techniques already work quite well when there is a lot of pre-prepared data for the system to learn who to allow access to. But even so, there are many challenges to be faced in this type of system," explains Victoria Mingote, a young researcher at the I3A who has just received the prize for the best doctoral thesis at the IberSpeech-2022 conference, held in Granada, which brings together research groups in speech and language technologies.
At the same forum, she also won the prize awarded by the Thematic Network on Speech Technologies (RTTH) for the best article published in the IEEE/ACM Transactions on Audio, Speech, and Language Processing.
Can a machine differentiate between the voices of different people?
If the usual large neural networks are used when there is little suitable data for the system, it becomes impossible to differentiate between several people talking. This is the area that Victoria Mingote explored in her doctoral thesis, "to find solutions adapted to these situations. This has allowed the development of techniques capable of differentiating quite well who are the people who are talking," says this young researcher from the ViVoLab group.
But in these years of work, Victoria Mingote has also studied the other extreme: what happens when we have too much data? "It is good to have a lot of data, yes, but only if it is properly prepared and controlled so that it can be used," she says.
Recognise voice and face at the same time
At the same time, the development of technology has led to the creation of a large amount of audiovisual content that is available on the Internet. "We need these videos to be tagged for certain applications so that we know exactly what information is in them". The doctoral thesis awarded at IberSpeech addresses this situation. It proposes voice and face recognition systems that avoid doing this tagging manually and help to analyse and catalogue audiovisual content more efficiently and automatically, so that it can be used easily.
As for the award for the best paper published in the IEEE/ACM Transactions on Audio, Speech, and Language Processing, this work follows the line of research of her thesis: the development of systems that verify people's identity based on physical features that are unique and non-transferable, such as their face or voice.
Victoria Mingote studied the Degree in Telecommunication Technologies and Services and the Master in Telecommunication Engineering at the University of Zaragoza. She completed her PhD in the I3A ViVoLab research group, whose main lines of work are speech, language and machine learning technologies.