Although the Corpus do Português (Web / Dialects) has about one billion words
of data, there are much larger web-based corpora. For example,
Sketch Engine has a 3.9 billion word
corpus of Portuguese. Why not just use a corpus like this instead?
The reason is that size is not everything.
Once the corpus is created, it is annotated for part of speech and lemma (e.g.
disse, dizemos, and dirão are all forms of the lemma dizer).
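As a rough illustration of what lemmatization buys you, the sketch below counts tokens by lemma rather than by surface form. The `FORM_TO_LEMMA` table and the `lemma_counts` helper are hypothetical names invented for this example; a real corpus would derive the mapping from a tagger plus manual review, not from a hand-written dictionary.

```python
from collections import Counter

# Hand-written form -> lemma table (illustrative only; not taken from any real tagger).
FORM_TO_LEMMA = {"disse": "dizer", "dizemos": "dizer", "dirão": "dizer"}

def lemma_counts(tokens, form_to_lemma):
    """Count tokens by lemma; forms missing from the table fall back to the lowercased form."""
    return Counter(form_to_lemma.get(tok.lower(), tok.lower()) for tok in tokens)

counts = lemma_counts(["Disse", "dizemos", "dirão", "disse", "casa"], FORM_TO_LEMMA)
print(counts)
```

If the form-to-lemma mapping is wrong, every frequency count built on top of it is wrong as well, which is why the quality of the annotation matters so much.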
While it's easy to create a large corpus from the web for any language nowadays,
it's much harder to annotate it correctly and accurately. And without good
annotation, the corpus is almost unusable, at least for some purposes.
Correcting the corpus requires someone who actually knows Portuguese. Based on the accuracy of the Sketch Engine data, it
appears that nobody did. They simply blindly ran the tagger on the corpora and
then placed them online, with little or no attempt to fix things. Quick, but not
To see what types of problems have
resulted from the inaccurate tagging and lemmatization, take a look at
the following spreadsheet.
This spreadsheet shows words starting with s- in the
Sketch Engine corpus that have a frequency of between 1,000 and 2,000 tokens in
the corpus. (In other words, these are relatively frequent words.) The spreadsheet groups words by lemma and part of speech (noun,
verb, adjective). Potential "problem" words are highlighted in yellow
and (most problematic) orange.
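The selection described above (lemmas starting with s-, within a frequency band, grouped by part of speech) could be sketched roughly as follows. The rows and their counts are invented for illustration, and `select_band` is a hypothetical helper, not the procedure actually used to build the spreadsheet.

```python
from collections import defaultdict

# Hypothetical (lemma, part of speech, frequency) rows; all counts are invented.
ROWS = [
    ("saber", "verb", 1850),
    ("secalhar", "verb", 1200),
    ("sensei", "verb", 1040),
    ("semana", "noun", 1990),
    ("simples", "adjective", 1500),
    ("ser", "verb", 250000),  # too frequent for the band
    ("sapar", "verb", 150),   # too rare for the band
    ("tomar", "verb", 1500),  # does not start with s-
]

def select_band(rows, prefix="s", low=1000, high=2000):
    """Group lemmas by part of speech, keeping those that start with prefix and fall in [low, high]."""
    grouped = defaultdict(list)
    for lemma, pos, freq in rows:
        if lemma.startswith(prefix) and low <= freq <= high:
            grouped[pos].append(lemma)
    return {pos: sorted(lemmas) for pos, lemmas in grouped.items()}

band = select_band(ROWS)
for pos, lemmas in sorted(band.items()):
    print(pos, lemmas)
```

A listing like this is only the starting point: someone who knows the language still has to read through each group and decide which entries are real lemmas.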
If we look at just the verbs, we find that at least 46 of
these 68 frequent "verbs" aren't really verbs at all (and these
are supposedly common "verbs" -- occurring 1000 times or more). Some
of them are forms of Portuguese verbs (saíu, saimos, selecionaram,
sabiamos, sorocaba), but they are not actually lemmas (i.e. the
entry that one would find in a dictionary). Some of these at least
end in an -r, which would suggest that they might be Portuguese
verbs in some alternate universe (secalhar, sanduichar, sinistrar,
saír, siar, sapar, soccer), but they are not actually words in
this universe. And others clearly could never be verbs (at least in
Portuguese, the language of the corpus): sensei, sibutramina,
simpatica, sm, sabados, semiárido, sobrevivencia, simple, silver,
If we were to go further
down the list -- words that occur 100-200 times, for example -- we
would find that nearly 90% of all of the entries are problematic.
For proof of this, see the corresponding data
from Spanish, which covers a much larger frequency range, and
which used the same (FreeLing) tagger. We've done "probes" of the
Portuguese data at different frequency bands, and it is very similar
to what that page shows for Spanish. (Since that Spanish data was
downloaded, Sketch Engine has blocked people from downloading so
much data.) But even with these very frequent "verbs" (which occur
1000-2000 times each in the corpus), the data is extremely messy.
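To make the point concrete, here is a minimal sketch of the kind of automatic check one might run before manual review: flag any "verb" lemma that does not end like a Portuguese infinitive. The function name and the list of endings are assumptions made for this example, and the word list is a handful of the entries mentioned above. Note that the check is deliberately crude and cannot replace a human reviewer.

```python
def looks_like_infinitive(lemma):
    """Crude check: Portuguese infinitives end in -ar, -er, -ir, or -or/-ôr (e.g. compor, pôr)."""
    return lemma.lower().endswith(("ar", "er", "ir", "or", "ôr"))

# A few of the "verb" lemmas discussed above.
VERB_LEMMAS = [
    "saíu", "saimos", "selecionaram", "sensei",
    "sibutramina", "secalhar", "sanduichar", "soccer",
]

flagged = [w for w in VERB_LEMMAS if not looks_like_infinitive(w)]
passed = [w for w in VERB_LEMMAS if looks_like_infinitive(w)]

print("flagged for review:", flagged)
print("passed the suffix check:", passed)
```

The inflected forms and the obvious non-verbs are flagged, but secalhar, sanduichar, and soccer pass the suffix check even though none of them is a Portuguese verb. Catching entries like these requires someone who knows the language to look at them in context.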
If you're going to create word frequency
data or language learning tools
like we've done for English, you need to carefully review thousands upon
thousands of words -- looking at their context, fixing lemmas and part of
speech, etc. And you need to have at least a rudimentary knowledge of the
language you're working with. None of this was done for these larger
Portuguese corpora and so they are -- as we have mentioned -- almost
unusable for many purposes.
With our corpus, we are reviewing each and every lemma (for
the top 40,000 lemmas in the corpus), to make sure that the lemma and the part
of speech are correct. It's a lot of work, and it takes several months to
complete. But with such correction, we believe that we have the only large (>
one billion words) and reliable
corpus of Portuguese.