o corpus do português

The Corpus do Português released in 2016 (Web / Dialects: CdP:New) contains about one billion words, which is about 50 times as much data as the 1900s portion of the previous Corpus do Português (Historical / Genres: CdP:Old). As a result, it provides much richer data on a wide range of phenomena. The following are just a few examples.


There are 282 verbs that have a lemma frequency between 300 and 600 in CdP:New and that are also found in at least two of the three online dictionaries that we used to correct the lemma lists. The following table shows how many times these same verbs appear in CdP:Old. Of the 282 verbs, about 42% have ten tokens or fewer in CdP:Old, which is not enough to say anything reliable about them. Only 33 of the 282 (about 12%) have 50 tokens or more.

Frequency in CdP:Old (300-600 in CdP:New)   # verbs   % verbs   Examples
50 tokens or more                           33        12%       assoar, arremeter, crepitar
26-49 tokens                                45        16%       coalhar, arreganhar, fender
11-25 tokens                                87        31%       emparelhar, arrear, reincidir
1-10 tokens                                 106       38%       aplainar, encerar, solapar
0 tokens                                    12        4%        eletrizar, afobar, conflitar
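
The percentages quoted for these frequency bands can be checked with a few lines of Python. This is only a sketch: the counts are copied from the table, the total of 282 is the figure stated in the text, and the names `bins`, `TOTAL_VERBS`, and `share` are made up for illustration.

```python
# Bin counts copied from the table (verbs with 300-600 tokens in CdP:New).
bins = [
    ("50 or more", 33),
    ("26-49", 45),
    ("11-25", 87),
    ("1-10", 106),
    ("0", 12),
]

# Total stated in the text. Note that the bin counts as printed sum to 283,
# so this sketch simply divides by the stated total.
TOTAL_VERBS = 282


def share(count, total=TOTAL_VERBS):
    """Return count as a whole-number percentage of total."""
    return round(100 * count / total)


shares = {label: share(n) for label, n in bins}

# "Ten tokens or fewer" combines the 1-10 band with the 0 band.
by_label = dict(bins)
ten_or_fewer = share(by_label["1-10"] + by_label["0"])

for label, n in bins:
    print(f"{label:>10} tokens: {n:3d} verbs ({shares[label]}%)")
print(f"ten tokens or fewer: {ten_or_fewer}%")
```

Combining the two lowest bands is what gives the "about 42%" figure for verbs with ten tokens or fewer.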


Without enough tokens of a given word, there are too few collocates ("nearby words") to say much about its meaning and usage. For example, we have chosen (almost at random) a verb, a noun, an adjective, and an adverb from CdP:New, to show how many different collocates occur with each word in CdP:New and in CdP:Old. A collocate is counted if its lemma occurs at least three times within four words to the left or four words to the right of the node word. (You might need to manually reset the SEC 1 value to just the 1900s for CdP:Old to get the correct type count.) As we see, CdP:New provides much better data for examining the meaning and usage of words.

lemma (PoS node : collocate)       CdP:New   CdP:Old
frigir (VERB : NOUN)               540       1
faceta (NOUN : NOUN)               434       2
interpessoal (ADJ : NOUN)          453       3
inconscientemente (ADV : VERB)     404       7
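
The collocate counts above follow a simple rule: a window of four words on either side of the node, a part-of-speech filter on the collocate, and a minimum of three co-occurrences per collocate lemma. The sketch below illustrates that rule on a made-up token list. It is not the corpus interface's actual implementation, and the function name `collocates`, the `(lemma, pos)` token format, and the toy data are all assumptions.

```python
from collections import Counter


def collocates(tokens, node, collocate_pos, span=4, min_freq=3):
    """Return {collocate lemma: count} for lemmas with the given PoS that
    occur within `span` tokens of `node` at least `min_freq` times.

    `tokens` is a list of (lemma, pos) pairs (a hypothetical format).
    """
    counts = Counter()
    for i, (lemma, _pos) in enumerate(tokens):
        if lemma != node:
            continue
        # Up to `span` tokens to the left and to the right of the node.
        window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
        counts.update(l for l, p in window if p == collocate_pos)
    return {l: c for l, c in counts.items() if c >= min_freq}


# Invented toy data: "ovo" recurs near "frigir", "peixe" appears only once.
toy = [("frigir", "VERB"), ("o", "DET"), ("ovo", "NOUN")] * 3
toy += [("frigir", "VERB"), ("o", "DET"), ("peixe", "NOUN")]

result = collocates(toy, node="frigir", collocate_pos="NOUN")
print(sorted(result))  # collocate types that pass the threshold
print(len(result))     # the "number of different collocates"
```

With only a handful of tokens for a word, almost no collocate reaches the threshold of three, which is why the CdP:Old column stays in single digits.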


Because CdP:New is about 50 times as large as the 1900s portion of CdP:Old, it provides many more tokens for lower-frequency syntactic constructions. The following table shows the number of tokens in the two corpora for several constructions. (You might need to manually reset the SEC 1 value to just the 1900s for CdP:Old to get the correct token count.)

CdP:New  CdP:Old  search string                      explanation                                                                      example(s)
805      3        parecem|pareciam que [v*3p*]       "Split subject raising" (see #59 and #60)                                        parecem que querem causar um conflito
354      9        os|as [fazer] [v*] o|os|as|um|uma  Accusative case for 3PL agent in causative construction (see #67, #68, and #71)  não as faz perder o entusiasmo
481      21       sem lhes [v*]                      Pre-verbal clitic (see #62); here just with sem and lhes                         sem lhes dar tempo de refletirem
7151     175      estava* sendo [vps*]               Progressive + passive (just with estava / estavam)                               o Bitcoin estava sendo usado por criminosos
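
To relate these raw counts to corpus size, the sketch below computes the raw ratio between the two corpora and a per-million-word rate for each search. The corpus sizes are approximations taken from the text (about one billion words for CdP:New, and one fiftieth of that for the 1900s portion of CdP:Old), so the rates are rough. The names `SIZE_NEW`, `SIZE_OLD_1900S`, and `per_million` are made up for illustration.

```python
# Approximate sizes implied by the text (assumptions, not exact word counts).
SIZE_NEW = 1_000_000_000        # "about one billion words"
SIZE_OLD_1900S = SIZE_NEW // 50  # "about 50 times" smaller: roughly 20 million

# Token counts copied from the table: (CdP:New, CdP:Old).
counts = {
    "parecem|pareciam que [v*3p*]": (805, 3),
    "os|as [fazer] [v*] o|os|as|um|uma": (354, 9),
    "sem lhes [v*]": (481, 21),
    "estava* sendo [vps*]": (7151, 175),
}


def per_million(tokens, corpus_size):
    """Normalize a raw token count to a rate per million words."""
    return tokens / corpus_size * 1_000_000


rows = {}
for query, (new, old) in counts.items():
    rows[query] = (
        new / old,                         # raw ratio New : Old
        per_million(new, SIZE_NEW),        # rate per million in CdP:New
        per_million(old, SIZE_OLD_1900S),  # rate per million in CdP:Old
    )

for query, (ratio, pm_new, pm_old) in rows.items():
    print(f"{query:<35} {ratio:6.0f}x {pm_new:7.3f} {pm_old:7.3f}")
```

The per-million rates are shown only for orientation. The point of the table is the raw number of examples available for study, and three or nine tokens are too few to analyze a construction in any detail.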