Arquivo.pt offers a way for tools based on Artificial Intelligence (AI) to perform better in Portuguese. This digital service of the Foundation for Science and Technology, developed through FCCN, is the largest openly accessible collection of Portuguese-language textual data in Portugal that researchers can use to train natural language processing (NLP) models.

The need for AI to interpret the complexities of the Portuguese language

Artificial Intelligence draws on several areas of knowledge, such as linguistics and computer science, and is present in technologies that people around the world use every day. When we search for information on the Internet, for example, and an answer is generated in a particular language, that process uses AI.

Natural language processing (NLP) is the branch of artificial intelligence that helps computers understand, interpret and manipulate human language, and it is what allows machines to refine the responses they generate for users. However, NLP models have mostly been developed for English, and far less for other languages such as Portuguese.

The more an NLP model is trained on a language, the better it can interpret that language's complexities. This is only possible with quality data, and it is precisely here that Arquivo.pt, the digital service of the Foundation for Science and Technology, has emerged as a solution.

Arquivo.pt: the largest collection of textual data in Portuguese

Arquivo.pt is the largest collection of Portuguese-language textual data in Portugal that is available in open access for researchers to train natural language processing models.

Arquivo.pt holds more than 1 petabyte of web content preserved since the 1990s. Because it archives everything found on web pages, it provides not only text but also images, audio files, video and various metadata, among other types of content in Portuguese.

The contents are accessible via the search interface and the Arquivo.pt APIs.
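As a rough illustration of API access, the sketch below builds a full-text search request and pulls links to extracted text out of a JSON response. The endpoint `https://arquivo.pt/textsearch`, the query parameters (`q`, `from`, `to`, `maxItems`) and the response fields (`response_items`, `title`, `linkToExtractedText`) are assumptions to be checked against the current Arquivo.pt API documentation. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Arquivo.pt full-text search API; verify in the official docs.
TEXTSEARCH_ENDPOINT = "https://arquivo.pt/textsearch"


def build_search_url(query, start=None, end=None, max_items=50):
    """Build a search URL. The parameter names used here are assumptions."""
    params = {"q": query, "maxItems": max_items}
    if start:
        params["from"] = start  # assumed: lower bound of the archive date range
    if end:
        params["to"] = end  # assumed: upper bound of the archive date range
    return f"{TEXTSEARCH_ENDPOINT}?{urlencode(params)}"


def fetch_results(url, timeout=30):
    """Download a search URL and decode the JSON body (performs a network call)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def extract_text_links(payload):
    """Return (title, extracted-text link) pairs from a decoded response.

    Items without the assumed "linkToExtractedText" field are skipped.
    """
    items = payload.get("response_items", [])
    return [
        (item.get("title", ""), item["linkToExtractedText"])
        for item in items
        if "linkToExtractedText" in item
    ]
```

A researcher assembling a corpus might call `fetch_results(build_search_url("língua portuguesa", start="1996", end="2000"))` and pass the result to `extract_text_links`, then download each linked text for cleaning and deduplication before training.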

GlórIA, a model for the Portuguese language

One of the projects that used Arquivo.pt to obtain large amounts of text is GlórIA, a large language model (LLM) focused on European Portuguese.

"Despite the abundance of LLMs for many high-resource languages, the availability of such models remains limited for European Portuguese," as Ricardo Lopes, João Magalhães and David Semedo, authors of the project and researchers at the Faculty of Science and Technology of NOVA University Lisbon, explain in their article GlórIA - A Generative and Open Large Language Model for Portuguese.

The model was trained on 35 billion tokens, the units of text that a model processes, drawn from various sources. Arquivo.pt contributed a collection of 1.4 million archived news items and periodicals in Portuguese.
