o corpus do português

The Corpus do Português released in 2016 (Web / Dialects: CdP:New) contains about one billion words, roughly 50 times as much data as the 1900s portion of the previous Corpus do Português (Historical / Genres: CdP:Old). As a result, it provides much richer data on a wide range of phenomena. The following are just a few examples.

Lexical

There are 282 verbs that have a lemma frequency between 300 and 600 in CdP:New and that are also found in at least two of the three online dictionaries that we used to correct the lemma lists. The following table shows how many times these same verbs appear in CdP:Old. Of the 282 verbs, about 42% have ten or fewer tokens in CdP:Old, which is too few to support reliable claims about them. Only 33 of the 282 (about 12%) have 50 tokens or more.

Frequency CdP:Old (300-600 in CdP:New)	# verbs	% verbs	Examples
50 tokens or more	33	12%	assoar, arremeter, crepitar
26-49 tokens	45	16%	coalhar, arreganhar, fender
11-25 tokens	87	31%	emparelhar, arrear, reincidir
1-10 tokens	106	38%	aplainar, encerar, solapar
0 tokens	12	4%	eletrizar, afobar, conflitar
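
The percentages quoted above can be checked with a few lines of arithmetic. The following is only a minimal sketch: the counts are copied from the table, the names `bins`, `TOTAL_VERBS`, and `share` are illustrative, and the denominator is assumed to be the total of 282 verbs stated in the text.

```python
# Counts copied from the table above: verbs with a lemma frequency of
# 300-600 in CdP:New, binned by their token frequency in CdP:Old.
bins = {
    "50 or more": 33,
    "26-49": 45,
    "11-25": 87,
    "1-10": 106,
    "0": 12,
}

# Assumption: percentages use the total stated in the text (282).
# Note that the bin counts as listed sum to 283.
TOTAL_VERBS = 282


def share(count, total=TOTAL_VERBS):
    """Return the percentage of verbs in a bin, rounded to a whole number."""
    return round(100 * count / total)


# "Ten tokens or fewer" combines the 1-10 bin and the 0 bin.
ten_or_fewer = bins["1-10"] + bins["0"]

for label, count in bins.items():
    print(f"{label:>10} tokens: {count:3d} verbs ({share(count)}%)")
print(f"ten or fewer tokens: {ten_or_fewer} verbs ({share(ten_or_fewer)}%)")
```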

Semantic

Without enough tokens of a given word, it is impossible to look at collocates ("nearby words") and say much about the word's meaning and usage. For example, we have chosen (almost at random) a verb, noun, adjective, and adverb from CdP:New to show how many different collocates occur with each word in CdP:New and CdP:Old. A collocate is counted if it occurs at least three times as a lemma within four words to the left or right of the node word. (You might need to manually reset the SEC 1 value to just the 1900s for CdP:Old to get the correct type count.) As the table shows, CdP:New provides much better data for examining the meaning and usage of words.

lemma (PoS node:collocate)	CdP:New	CdP:Old
frigir (VERB : NOUN)	540	1
faceta (NOUN : NOUN)	434	2
interpessoal (ADJ : NOUN)	453	3
inconscientemente (ADV : VERB)	404	7
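
To make the counting criterion concrete, here is a rough sketch of how collocate types could be counted from lemmatized, PoS-tagged text. It is not the corpus interface's own implementation: the function `collocate_types`, the `(lemma, pos)` token format, the tag labels, and the toy sentences are all assumptions made for illustration.

```python
from collections import Counter


def collocate_types(sentences, node, collocate_pos=None, span=4, min_freq=3):
    """Return the set of collocate lemmas that occur at least `min_freq`
    times within `span` tokens to the left or right of `node`.

    `sentences` is a list of sentences, each a list of (lemma, pos) pairs.
    If `collocate_pos` is given, only collocates with that tag are counted.
    """
    counts = Counter()
    for tokens in sentences:
        for i, (lemma, _pos) in enumerate(tokens):
            if lemma != node:
                continue
            # Window of up to `span` tokens on each side of the node word.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            for coll_lemma, coll_pos in window:
                if collocate_pos is None or coll_pos == collocate_pos:
                    counts[coll_lemma] += 1
    return {lemma for lemma, n in counts.items() if n >= min_freq}


# Toy data (invented for this example, not taken from either corpus).
toy = [
    [("frigir", "VERB"), ("o", "DET"), ("ovo", "NOUN")],
    [("frigir", "VERB"), ("ovo", "NOUN"), ("em", "PREP"), ("óleo", "NOUN")],
    [("ela", "PRON"), ("frigir", "VERB"), ("um", "DET"), ("ovo", "NOUN")],
    [("frigir", "VERB"), ("peixe", "NOUN")],
]

# Noun collocates of "frigir" that occur at least three times.
print(collocate_types(toy, "frigir", collocate_pos="NOUN"))
```

With only a handful of tokens for the node word, very few collocates can reach the threshold of three, which is why the CdP:Old counts in the table are so low.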

Syntactic

Because CdP:New is about 50 times as large as the 1900s portion of CdP:Old, it provides many more tokens for lower-frequency syntactic constructions. The following table shows the number of tokens in the two corpora for several constructions. (You might need to manually reset the SEC 1 value to just the 1900s for CdP:Old to get the correct type count.)

CdP:New	CdP:Old	search string	explanation	example(s)
805	3	parecem|pareciam que [v3p]	"Split subject raising" (see #59 and #60)	parecem que querem causar um conflito
354	9	os|as [fazer] [v*] o|os|as|um|uma	Accusative case for 3PL agent in causative construction (see #67, #68, and #71)	não as faz perder o entusiasmo
481	21	sem lhes [v*]	Pre-verbal clitic (see #62); here just with sem and lhes	sem lhes dar tempo de refletirem
7151	175	estava* sendo [vps*]	Progressive + passive (just with estava / estavam)	o Bitcoin estava sendo usado por criminosos
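
As a quick way to compare the two columns, the sketch below computes how many times more tokens CdP:New yields for each construction. The counts are copied from the table; the names `constructions` and `ratios` are illustrative only.

```python
# Token counts copied from the table above: (CdP:New, CdP:Old).
constructions = {
    "parecem|pareciam que [v3p]": (805, 3),
    "os|as [fazer] [v*] o|os|as|um|uma": (354, 9),
    "sem lhes [v*]": (481, 21),
    "estava* sendo [vps*]": (7151, 175),
}

# Ratio of CdP:New tokens to CdP:Old tokens for each search string.
ratios = {
    search: new / old
    for search, (new, old) in constructions.items()
}

for search, ratio in ratios.items():
    new, old = constructions[search]
    print(f"{search:<36} {new:>5} vs {old:>3}  ({ratio:.1f}x)")
```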