Created by Mark Davies.
Funded by the US National Endowment for the Humanities.
This addition to the Corpus do Português contains about one billion words of data from web pages in
four Portuguese-speaking
countries (Brazil, Portugal, Angola, and Mozambique). This corpus allows you to look at very recent Portuguese (the texts were collected in 2013-14) and to compare
the different dialects.
The new corpus is also much larger than the previous corpus -- more than 50
times as large for Modern Portuguese (one billion words, compared to just 20
million words from the 1900s in the original corpus). So where you might have
20-25 tokens with the original corpus, you might have 1,000 or more with the new corpus.
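The size comparison above can be checked with a little arithmetic. The sketch below is only illustrative: it assumes a word occurs at the same relative frequency in both corpora, which real data will not match exactly, and the function name and constants are hypothetical, with the sizes taken from the round figures quoted above.

```python
# Illustrative sketch: scale a token count linearly with corpus size.
# Assumption: the word has the same relative frequency in both corpora.

OLD_SIZE = 20_000_000      # words from the 1900s in the original corpus
NEW_SIZE = 1_000_000_000   # approximate size of the web-based corpus


def expected_tokens(tokens_in_old: int, old_size: int, new_size: int) -> float:
    """Return the token count expected in the new corpus under linear scaling."""
    return tokens_in_old * new_size / old_size


for count in (20, 25):
    scaled = round(expected_tokens(count, OLD_SIZE, NEW_SIZE))
    print(f"{count} tokens in the original corpus -> about {scaled} in the new corpus")
```

Under that assumption, 20-25 tokens in the original corpus correspond to roughly 1,000-1,250 tokens in the new corpus.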
In 2022, we added many new features to this corpus: 1) browsing and
searching the top 40,000 lemmas in the corpus, 2) detailed "word pages" with
information on each of these 40,000 words (including definitions, synonyms,
links to images and videos, frequency information by genre and country,
collocates, related topics, and concordance lines), 3) the ability to input and
analyze entire texts, find keywords in these texts, and then see detailed
information (#2) for each word, as well as the ability to highlight phrases in
your text and find related phrases in the corpus, and 4) extensive links to
external resources in the frequency and concordance displays.