Arquivo.pt presents itself as a way to help tools based on Artificial Intelligence (AI) perform better in Portuguese. This digital service of the Foundation for Science and Technology, developed through FCCN, offers the largest open-access collection of Portuguese-language textual data in Portugal, which researchers can use to train natural language processing (NLP) models.
The need for AI to interpret the complexities of the Portuguese language
Artificial Intelligence draws on several areas of knowledge, such as linguistics and computer science, and is built into technologies that people around the world use every day. When we search for information on the Internet, for example, and an answer is generated in a given language, that process relies on AI.
Natural language processing is the branch of artificial intelligence that helps computers understand, interpret and manipulate human language, and it is what allows machines to refine the responses they generate for users. However, most of these models have been developed for English, and far fewer for other languages such as Portuguese.
The more text in a language an NLP model is trained on, the better it can interpret that language's complexities. This only works if the training data is of good quality, and it is precisely here that Arquivo.pt, the digital service of the Foundation for Science and Technology, has emerged as a solution.
Arquivo.pt: the largest collection of textual data in Portuguese
Arquivo.pt is the largest open-access collection of Portuguese-language textual data in Portugal that researchers can use to train natural language processing models.
It holds more than 1 petabyte of content preserved since the 1990s, covering everything that can be found on web pages: not only text, but also images, audio files, video and various metadata, among other types of content in Portuguese.
The contents are accessible via the search interface and the Arquivo.pt APIs.
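As an illustration of programmatic access, the following is a minimal sketch of a full-text query in Python. It assumes the Arquivo.pt text search API is served at https://arquivo.pt/textsearch, that it accepts the parameters q, maxItems, from and to, and that it returns JSON with a response_items list whose entries carry title, originalURL, tstamp and linkToExtractedText fields. These names are assumptions to check against the current API documentation; the helper functions themselves are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Arquivo.pt full-text search API.
TEXTSEARCH_ENDPOINT = "https://arquivo.pt/textsearch"


def build_search_url(query, max_items=10, date_from=None, date_to=None):
    """Build a search URL; dates are assumed to use the YYYYMMDDHHMMSS form."""
    params = {"q": query, "maxItems": max_items}
    if date_from:
        params["from"] = date_from
    if date_to:
        params["to"] = date_to
    return f"{TEXTSEARCH_ENDPOINT}?{urlencode(params)}"


def extract_items(payload):
    """Keep only the fields useful for assembling a text corpus.

    Field names are assumptions about the response format; missing
    fields come back as None instead of raising an error.
    """
    return [
        {
            "title": item.get("title"),
            "url": item.get("originalURL"),
            "timestamp": item.get("tstamp"),
            "text_link": item.get("linkToExtractedText"),
        }
        for item in payload.get("response_items", [])
    ]


def search(query, **kwargs):
    """Run the query over the network and return the simplified items."""
    with urlopen(build_search_url(query, **kwargs), timeout=30) as response:
        return extract_items(json.load(response))


if __name__ == "__main__":
    # Example: archived pages mentioning "língua portuguesa" up to the end of 2005.
    for page in search("língua portuguesa", max_items=5, date_to="20051231235959"):
        print(page["timestamp"], page["title"], page["url"])
```

Separating URL construction and response parsing from the network call makes each part easy to check on its own, and the text_link field, if present, points to the extracted plain text that a corpus-building script would download next.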
GlórIA, a model for the Portuguese language
One of the projects that used Arquivo.pt to obtain large amounts of text is GlórIA, a large language model (LLM) focused on European Portuguese.
"Despite the abundance of LLMs for many high-resource languages, the availability of such models remains limited for European Portuguese," as Ricardo Lopes, João Magalhães and David Semedo, authors of the project and researchers at the Faculty of Science and Technology of NOVA University Lisbon, explain in their article GlórIA - A Generative and Open Large Language Model for Portuguese.
The model was trained on 35 billion tokens, the units of text that machines process, drawn from various sources. Arquivo.pt contributed a collection of 1.4 million archived news items and periodicals in Portuguese.