Thanks for helping with the correction of the Portuguese lemmas and part of speech for the new one billion word addition to the Corpus do Português. We're grateful for your help, and we hope to make it worth your while.

Here's the problem. The corpus has been tagged for part of speech (e.g. casa = noun, fazem = verb form, etc.) and it has been lemmatized (e.g. faço, fizeram, and fizemos are all forms of the lemma fazer). But there are still problems. We've compared the tagging and lemmatization to three online dictionaries of Portuguese to find the words that might be errors, and there are about 15,000 that we need to look at more carefully. This is where you can help.

Just a quick note: Three people will look at each word, and we'll go with what two of the three people have chosen. So don't worry that all of the pressure is on you to do things perfectly right.
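
For the curious, the "two out of three" rule can be sketched as below. This is only an illustration: the function name and the label strings are made up for the example, and returning None when all three people disagree is an assumption, since these instructions don't say what happens in that case.

```python
from collections import Counter
from typing import List, Optional


def majority_choice(choices: List[str]) -> Optional[str]:
    """Return the label picked by at least two reviewers, or None if there is none.

    Hypothetical helper: the real system's handling of a three-way
    disagreement is not described in these instructions.
    """
    if not choices:
        return None
    # most_common(1) gives the single most frequent label and its count
    label, count = Counter(choices).most_common(1)[0]
    return label if count >= 2 else None


# Two of the three reviewers chose "OK", so that is the result.
result = majority_choice(["OK", "OK", "BAD"])
```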


The first thing you'll do is to enter your email address (A in the form at the left). (There's no need to do that right now; we're just in "demo" mode on this page.) You will then see five words that have been assigned to you (C). Once you're done with these five, you can get five more (D). When you're done, log out (B). You can also see a list of the last 100 words that you've done (E).


Usually you will click on the lemma (column 1) to see the individual words (e.g. facto = facto, factos). If you do that, then you can change the lemma (see the next section). You would do one of the actions in this section if you want to mark the lemmas as "good", "bad", etc without even looking at the individual word forms.

  • OK: If you're 100% sure -- even without seeing the forms -- that the word is good (as in #6: regulação) then click OK. On the lower frequency words (e.g. ossário #8), you'll probably want to click on the word and see the forms above, just to make sure.

  • BAD: If you're 100% sure that it can't be a word, click BAD. (But don't use this very often, since the lemma should often be corrected to another lemma instead.) #7: lev might be an example of that (but again, usually you want to at least see the forms).

  • ALMOST: Often the lemma is "almost" right, and you just need to change it a little bit. For example, ideia (#1) should really be idéia and caractere (#5) should be caráter. In both of these cases, click on the word (1) and then enter the correct lemma (we'll discuss this more below). There will be lots of cases of words that are missing a diacritic (e.g. ideia for idéia), since about 50% of the corpus is from blogs, where people leave out diacritics and misspell words.

  • SPELLING: Related to this, there are words that used to have two forms (one in Brazil, one in Portugal), but which have just one form with the 1990 spelling reform. Facto (#2; should be fato) would be an example of this.

  • ENGLISH (FOREIGN): Look at web (#3). Is this actually a word in Portuguese now, or is it just English? In other words, would you expect to see it in a new dictionary of Portuguese? If so, click OK. If it seems like it still "isn't part" of Portuguese (and you wouldn't find it in a dictionary), click ENG (English). Most of these are from English, but use this for other languages too. Some of these will be really "tricky" to decide.

  • ADJECTIVE: Look at social (#4). Is this an adjective, or a noun, or could it be both? You'll probably want to click on the word to see the forms, and from there you can click to see the word in context (100 sample entries). Don't worry too much about the noun/adjective distinction -- it's often almost impossible to know. For example, is católica in Maria é católica a noun or an adjective? Select this only when it's very clear that nearly all of the entries are adjectives and couldn't be nouns.


Usually, you do want to look at the forms of the lemma, in order to change the lemma. In order to do that, just click on the word. For example, click on caractere in the form to the left and you'll see the forms of the "word" in the form above. Here's an explanation of what you'll see and what you do in that form.

Note that you can click on a word [G] to see up to 100 sample tokens from the corpus, which will help you see how the word is used (and then click on "Back to instructions" on that page to come back to these instructions).

To change the lemma

  • A. Enter the correct lemma here (in this case caráter). Please make sure you use diacritics, when necessary.

  • B. Enter the correct part of speech here (noun, verb, adjective, adverb, x = anything else)

  • F. Choose individual forms that your lemma (A) will apply to. If you don't select any word forms, nothing will change.

  • E. Select or de-select all of the forms below (so that you don't have to individually click on F for each form).

  • After entering the correct lemma and/or part of speech, click SUBMIT.
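
If it helps to picture what SUBMIT does, here is a rough sketch of the steps above. It is not the actual code behind the form: the WordForm class, the submit function, and the field names are all hypothetical. It only illustrates that the new lemma and/or part of speech are applied to the selected forms (F), and that nothing changes when no forms are selected.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WordForm:
    """One word form in the list (hypothetical structure for illustration)."""
    form: str
    lemma: str
    pos: str
    selected: bool = False  # corresponds to checkbox F


def submit(forms: List[WordForm], new_lemma: str = "", new_pos: str = "") -> int:
    """Apply the entered lemma (A) and/or part of speech (B) to selected forms.

    Returns the number of forms that were updated.
    """
    changed = 0
    for f in forms:
        if not f.selected:
            continue  # unselected forms are left untouched
        if new_lemma:
            f.lemma = new_lemma
        if new_pos:
            f.pos = new_pos
        changed += 1
    return changed


forms = [
    WordForm("caractere", "caractere", "noun"),
    WordForm("caracteres", "caractere", "noun"),
]

# Nothing is selected yet, so submitting changes nothing.
unchanged_count = submit(forms, new_lemma="caráter")

# "Select all" (E), then submit again: both forms get the new lemma.
for f in forms:
    f.selected = True
changed_count = submit(forms, new_lemma="caráter")
```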

Mark a single word form as ADJ or bad

  • H. Mark this one form as ADJective. This is a shortcut to selecting the one word and then entering "J" in (B) above. Usually you will then mark the remaining forms as OK or BAD (C or D).

  • I. Mark this one form as BAD. Usually you will then mark the remaining forms as OK or BAD (C or D).

Mark all remaining word forms as "good" or "bad"

  • C. After you've made some changes to individual forms (see the two sections above), you can click here to mark all of the remaining forms shown below as "OK". If you haven't made any changes to individual forms, it's the same as choosing OK in the form to the left.

  • D. After you've made some changes to individual forms (see the two sections above), you can click here to mark all of the remaining forms shown below as "BAD". If you haven't made any changes to individual forms, it's the same as choosing BAD in the form to the left.
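
As a rough picture of what C and D do, the sketch below marks every form that has not yet been handled and leaves the others alone. The function name, the dictionary layout, and the use of None for "not yet marked" are assumptions made for this example, not a description of the real system.

```python
from typing import Dict, Optional


def mark_remaining(statuses: Dict[str, Optional[str]], label: str) -> Dict[str, str]:
    """Give `label` ("OK" or "BAD") to every form that has no status yet.

    Forms that were already changed individually keep their status.
    """
    return {
        form: (status if status is not None else label)
        for form, status in statuses.items()
    }


# One form was already marked as an adjective; the other is still unmarked.
before = {"sociais": "ADJ", "social": None}
after = mark_remaining(before, "OK")
```

If no form has been marked individually, every form receives the same label, which matches choosing OK or BAD in the form to the left.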

If you have any questions, feel free to contact us before starting on some words. Thanks!