The Corpus do Português released in 2016 (Web / Dialects: CdP:New) contains about one billion words of data, about 50 times as much as the 1900s portion of the previous Corpus do Português (Historical / Genres: CdP:Old). As a result, it provides much richer data on a wide range of phenomena. The following are just a few examples.
Lexical
There are 282 verbs with a lemma frequency between 300 and 600 in CdP:New that are also found in at least two of the three online dictionaries that we used to correct the lemma lists. The following shows how many times these same verbs appear in CdP:Old. Of the 282 verbs, about 42% have ten tokens or fewer in CdP:Old, which is not enough to say anything meaningful about them, and only 33 of 282 (about 12%) have 50 tokens or more.
Frequency in CdP:Old (300-600 in CdP:New) | # verbs | % verbs | Examples |
50 tokens or more | 33 | 12% | assoar, arremeter, crepitar |
26-49 tokens | 45 | 16% | coalhar, arreganhar, fender |
11-25 tokens | 87 | 31% | emparelhar, arrear, reincidir |
1-10 tokens | 106 | 38% | aplainar, encerar, solapar |
0 tokens | 12 | 4% | eletrizar, afobar, conflitar |
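The percentages in the table can be re-derived from the verb counts. The following is a minimal sketch, not the authors' own procedure: the names `TOTAL_VERBS`, `BINS`, and `share` are illustrative, and the denominator is the stated total of 282 verbs. Note that the bin counts as printed add up to 283, one more than the stated total; the sketch keeps 282 because that is the figure the rounded percentages agree with.

```python
# Stated number of verbs with lemma frequency 300-600 in CdP:New.
TOTAL_VERBS = 282

# (frequency band in CdP:Old, number of verbs), copied from the table.
BINS = [
    ("50 tokens or more", 33),
    ("26-49 tokens", 45),
    ("11-25 tokens", 87),
    ("1-10 tokens", 106),
    ("0 tokens", 12),
]


def share(count, total=TOTAL_VERBS):
    """Return count as a whole-number percentage of total."""
    return round(100 * count / total)


for label, count in BINS:
    print(f"{label}: {count} verbs ({share(count)}%)")

# Verbs with ten tokens or fewer: the "1-10 tokens" and "0 tokens" bands.
sparse = 106 + 12
print(f"ten tokens or fewer: {sparse} verbs ({share(sparse)}%)")
```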
Semantic
Without enough tokens of a given word, it is impossible to look at collocates ("nearby words") and say much about the word's meaning and usage. For example, we chose (almost at random) a verb, noun, adjective, and adverb from CdP:New to show how many different collocates occur with each word in CdP:New and CdP:Old. A collocate is counted if it occurs at least three times as a lemma within four words to the left or four words to the right of the node word. (You might need to manually reset the SEC 1 value to just the 1900s for CdP:Old to get the correct type count.) As the following counts show, CdP:New provides much better data for examining the meaning and usage of words.
lemma (PoS node : collocate) | CdP:New | CdP:Old |
frigir (VERB : NOUN) | 540 | 1 |
faceta (NOUN : NOUN) | 434 | 2 |
interpessoal (ADJ : NOUN) | 453 | 3 |
inconscientemente (ADV : VERB) | 404 | 7 |
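The collocate counts above follow a simple rule: collocates of a given part of speech, within four words on either side of the node word, occurring at least three times as a lemma. The sketch below illustrates that rule on a toy token list. It is not the corpus interface's implementation; the function name `collocate_types`, the `(lemma, pos)` tuple format, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def collocate_types(tokens, node_lemma, collocate_pos, window=4, min_freq=3):
    """Count collocate lemmas near a node word.

    tokens: list of (lemma, pos) pairs in running-text order (assumed format).
    Returns a dict of collocate lemma -> frequency, keeping only lemmas of
    the requested part of speech that occur at least min_freq times within
    `window` words to the left or right of any occurrence of node_lemma.
    """
    counts = Counter()
    for i, (lemma, _pos) in enumerate(tokens):
        if lemma != node_lemma:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j == i:
                continue  # skip the node word itself
            c_lemma, c_pos = tokens[j]
            if c_pos == collocate_pos:
                counts[c_lemma] += 1
    return {lemma: n for lemma, n in counts.items() if n >= min_freq}


# Toy data (hypothetical): "peixe" appears only once, so it is filtered out.
toy = [("frigir", "VERB"), ("ovo", "NOUN"), ("em", "ADP"), ("óleo", "NOUN")] * 3
toy += [("frigir", "VERB"), ("peixe", "NOUN")]

result = collocate_types(toy, "frigir", "NOUN")
print(len(result), "collocate types:", sorted(result))
```

The number of keys in the returned dictionary corresponds to the type count reported in the table; with very few tokens of the node word, almost no collocate can reach the minimum frequency.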
Syntactic
Because CdP:New is about 50 times as large as the 1900s portion of CdP:Old, it provides many more tokens for lower-frequency syntactic constructions. The following shows the number of tokens in the two corpora for a number of different constructions. (You might need to manually reset the SEC 1 value to just the 1900s for CdP:Old to get the correct token count.)
CdP:New | CdP:Old | search string | explanation | example(s) |
805 | 3 | parecem|pareciam que [v*3p*] | "Split subject raising" (see #59 and #60) | parecem que querem causar um conflito |
354 | 9 | os|as [fazer] [v*] o|os|as|um|uma | Accusative case for 3PL agent in causative construction (see #67, #68, and #71) | não as faz perder o entusiasmo |
481 | 21 | sem lhes [v*] | Pre-verbal clitic (see #62); here just with sem and lhes | sem lhes dar tempo de refletirem |
7151 | 175 | estava* sendo [vps*] | Progressive + passive (just with estava / estavam) | o Bitcoin estava sendo usado por criminosos |
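To see how many more tokens each construction has in CdP:New, one can divide the two counts in each row. This is a minimal sketch using only the figures in the table; the names `ROWS` and `token_ratio` are illustrative, and the ratios are of raw token counts, not normalized frequencies.

```python
# (search string, tokens in CdP:New, tokens in CdP:Old), copied from the table.
ROWS = [
    ("parecem|pareciam que [v*3p*]", 805, 3),
    ("os|as [fazer] [v*] o|os|as|um|uma", 354, 9),
    ("sem lhes [v*]", 481, 21),
    ("estava* sendo [vps*]", 7151, 175),
]


def token_ratio(new, old):
    """Return how many times more raw tokens CdP:New has than CdP:Old."""
    return new / old


for query, new, old in ROWS:
    print(f"{query}: {new} vs {old} tokens ({token_ratio(new, old):.1f}x)")
```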