The corpus is composed of about one billion
words in more than one million web pages from about 85,000 websites in
four different Portuguese-speaking countries (more information).
You can download metadata (country, genre, URL, title, number of words, etc.)
for all 1.09 million web pages (ZIP file: 115 MB). See also some good examples of using the
corpus to look at differences between the dialects.
| Country | Code | General (may also include blogs): Words | General: Web pages | General: Web sites | (Only) Blogs: Words | Blogs: Web pages | Blogs: Web sites | Total: Words | Total: Web pages | Total: Web sites |
|---|---|---|---|---|---|---|---|---|---|---|
| Brazil | BR | 319,435,592 | 286,712 | 25,351 | 336,244,918 | 321,305 | 35,248 | 655,680,510 | 608,017 | 60,599 |
| Portugal | PT | 136,144,529 | 184,512 | 12,082 | 190,503,822 | 221,338 | 9,005 | 326,648,351 | 405,850 | 21,087 |
| Angola | AO | 17,877,399 | 19,178 | 1,240 | 17,255,595 | 21,233 | 418 | 35,132,994 | 40,411 | 1,658 |
| Mozambique | MZ | 16,936,743 | 19,236 | 1,065 | 15,070,829 | 17,910 | 404 | 32,007,572 | 37,146 | 1,469 |
| TOTAL | | 490,394,263 | 509,638 | 39,738 | 559,075,164 | 581,786 | 45,075 | 1,049,469,427 | 1,091,424 | 84,813 |
Notes on duplicate texts.
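
As a quick way to see how the corpus is split across the four countries, the word counts from the table can be turned into percentage shares. The following is only a minimal sketch: the numbers are copied by hand from the table, and the names `WORDS`, `REPORTED_TOTAL`, and `word_shares` are illustrative, not part of any corpus tool or of the metadata download.

```python
# Word counts per country, copied by hand from the table:
# (general, blogs only, total). Names here are illustrative only.
WORDS = {
    "BR": (319_435_592, 336_244_918, 655_680_510),
    "PT": (136_144_529, 190_503_822, 326_648_351),
    "AO": (17_877_399, 17_255_595, 35_132_994),
    "MZ": (16_936_743, 15_070_829, 32_007_572),
}

# Grand total of words as reported in the TOTAL row of the table.
REPORTED_TOTAL = 1_049_469_427


def word_shares(words):
    """Return each country's share of all words as a fraction between 0 and 1."""
    grand_total = sum(total for _, _, total in words.values())
    return {code: total / grand_total for code, (_, _, total) in words.items()}


# Sanity checks: general + blogs equals each country's total,
# and the country totals add up to the reported grand total.
for code, (general, blogs, total) in WORDS.items():
    assert general + blogs == total, code
assert sum(total for _, _, total in WORDS.values()) == REPORTED_TOTAL

shares = word_shares(WORDS)
for code, share in shares.items():
    print(f"{code}: {share:.1%}")
```

Run as written, the checks pass, and the shares show that Brazil and Portugal together account for well over 90% of the words, which is worth keeping in mind when comparing raw frequencies across dialects.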