The following is a summary of the composition of the corpus.
You can also download a file that lists all of
the ~57,000 texts in the corpus.
The corpus contains more than 45 million words in nearly
57,000 texts. About 20 million words are from the 1900s, 10 million from
the 1800s, and 15 million from the 1200s-1700s. For the 1900s,
there are roughly six million words each from fiction, from newspapers
and magazines, and from academic texts, and two million from
spoken texts. Within each of these four genres (and therefore overall), the texts
from the 1900s are divided about evenly between Portugal and
Brazil.
| # WORDS | CENTURY | COUNTRY | GENRE |
|---|---|---|---|
| Historical | | | |
| 550,968 | 1200s | Portugal | |
| 1,316,268 | 1300s | Portugal | |
| 2,875,653 | 1400s | Portugal | |
| 4,435,031 | 1500s | Portugal / Brazil | |
| 3,407,741 | 1600s | Portugal / Brazil | |
| 2,234,951 | 1700s | Portugal / Brazil | |
| 10,008,622 | 1800s | Portugal / Brazil | |
| Modern Portuguese: Genres / Countries | | | |
| 3,087,052 | 1900s | Portugal | Academic |
| 3,271,328 | 1900s | Portugal | Newspaper |
| 3,048,020 | 1900s | Portugal | Fiction |
| 1,100,303 | 1900s | Portugal | Spoken |
| 2,816,802 | 1900s | Brazil | Academic |
| 3,346,988 | 1900s | Brazil | Newspaper |
| 3,028,646 | 1900s | Brazil | Fiction |
| 1,078,586 | 1900s | Brazil | Spoken |
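
The rounded totals in the summary can be checked against the word counts in the table. The following is a minimal sketch of that arithmetic; the word counts are copied from the table, while the names `HISTORICAL`, `MODERN`, and `totals_by` are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict

# Word counts copied from the table above: (words, century).
HISTORICAL = [
    (550_968, "1200s"),
    (1_316_268, "1300s"),
    (2_875_653, "1400s"),
    (4_435_031, "1500s"),
    (3_407_741, "1600s"),
    (2_234_951, "1700s"),
    (10_008_622, "1800s"),
]

# Word counts for the 1900s: (words, country, genre).
MODERN = [
    (3_087_052, "Portugal", "Academic"),
    (3_271_328, "Portugal", "Newspaper"),
    (3_048_020, "Portugal", "Fiction"),
    (1_100_303, "Portugal", "Spoken"),
    (2_816_802, "Brazil", "Academic"),
    (3_346_988, "Brazil", "Newspaper"),
    (3_028_646, "Brazil", "Fiction"),
    (1_078_586, "Brazil", "Spoken"),
]


def totals_by(rows, key_index):
    """Sum the word counts (column 0) grouped by the column at key_index."""
    sums = defaultdict(int)
    for row in rows:
        sums[row[key_index]] += row[0]
    return dict(sums)


# Totals for the three periods named in the summary.
pre_1800 = sum(words for words, century in HISTORICAL if century != "1800s")
words_1800s = sum(words for words, century in HISTORICAL if century == "1800s")
words_1900s = sum(row[0] for row in MODERN)
total = pre_1800 + words_1800s + words_1900s

# Breakdown of the 1900s by country and by genre.
by_country = totals_by(MODERN, 1)
by_genre = totals_by(MODERN, 2)

if __name__ == "__main__":
    print(f"1200s-1700s: {pre_1800:,}")
    print(f"1800s:       {words_1800s:,}")
    print(f"1900s:       {words_1900s:,}")
    print(f"Total:       {total:,}")
    for country, words in by_country.items():
        print(f"1900s {country}: {words:,}")
    for genre, words in by_genre.items():
        print(f"1900s {genre}: {words:,}")
```

Running the script prints the exact sums behind the rounded figures, which makes it easy to see how close the Portugal and Brazil halves of the 1900s material are and how far each genre is from its rounded value.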