The corpus contains about one billion words
in more than one million web pages from about 85,000 websites in
four Portuguese-speaking countries (more information).
You can download metadata (country, genre, URL, title, number of words, etc.)
for all 1.1 million web pages (ZIP file: 115 MB). See also some good examples of using the
corpus to look at differences between the dialects.
"General" columns may also include blogs; "Blogs" columns contain only blogs.

| Country | Code | General: Words | General: Web pages | General: Web sites | Blogs: Words | Blogs: Web pages | Blogs: Web sites | Total: Words | Total: Web pages | Total: Web sites |
|---|---|---|---|---|---|---|---|---|---|---|
| Brasil | BR | 347,834,509 | 291,592 | 25,535 | 365,213,219 | 327,093 | 35,345 | 713,047,728 | 618,685 | 60,880 |
| Portugal | PT | 171,208,029 | 186,772 | 12,127 | 214,924,924 | 225,129 | 9,022 | 386,132,953 | 411,901 | 21,149 |
| Angola | AO | 19,629,046 | 20,021 | 1,257 | 19,010,093 | 22,322 | 419 | 38,639,139 | 42,343 | 1,676 |
| Moçambique | MZ | 18,562,424 | 19,904 | 1,074 | 16,587,071 | 18,455 | 404 | 35,149,495 | 38,359 | 1,478 |
| TOTAL | | | | | | | | 1,172,969,315 | 1,111,288 | 85,183 |

Notes on duplicate texts.
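The figures in the table can be cross-checked with a few lines of Python. This is only a sketch: the numbers are copied by hand from the table, and the names `COUNTS`, `country_total`, `corpus_total`, and `word_share` are illustrative, not part of the corpus or its download. It shows that each country's General and Blogs figures add up to its Total, and that the four country totals add up to the TOTAL row.

```python
# Figures copied from the table: (words, web pages, web sites).
# The dictionary layout and names are assumptions made for this example.
COUNTS = {
    "BR": {"general": (347_834_509, 291_592, 25_535),
           "blogs": (365_213_219, 327_093, 35_345)},
    "PT": {"general": (171_208_029, 186_772, 12_127),
           "blogs": (214_924_924, 225_129, 9_022)},
    "AO": {"general": (19_629_046, 20_021, 1_257),
           "blogs": (19_010_093, 22_322, 419)},
    "MZ": {"general": (18_562_424, 19_904, 1_074),
           "blogs": (16_587_071, 18_455, 404)},
}


def country_total(code):
    """Add the General and Blogs figures for one country."""
    general = COUNTS[code]["general"]
    blogs = COUNTS[code]["blogs"]
    return tuple(g + b for g, b in zip(general, blogs))


def corpus_total():
    """Add the per-country totals column by column."""
    totals = [country_total(code) for code in COUNTS]
    return tuple(sum(column) for column in zip(*totals))


def word_share(code):
    """Fraction of all corpus words that come from one country."""
    return country_total(code)[0] / corpus_total()[0]


if __name__ == "__main__":
    print(corpus_total())  # (1172969315, 1111288, 85183)
    for code in COUNTS:
        print(code, f"{word_share(code):.1%}")
```

The share of words per country is useful when comparing dialects, because raw frequencies from Brasil and Portugal would otherwise dwarf those from Angola and Moçambique; normalizing by each country's word count makes the comparison fair.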