When we collected the one million web pages, we relied on Google's identification of the country for each web page. That identification is less obvious when, for example, the page is on a .COM site (e.g. www.felicidade.com), and one might wonder how Google knew which country such a page was from.

To test how well Google did, we looked up a number of words and constructions from web pages that focused on dialects of Portuguese (example), where a particular word or construction was supposedly more common in a given country or region. The fact that the following words and phrases do appear much more frequently in the expected country suggests that Google's categorization is quite good.

Lexical
 

English               Brazil          Portugal
train                 trem            comboio
bus                   ônibus          autocarro
ice cream             sorvete         gelado
cup                   xícara          chávena
sports                esporte         desporto
goal (futebol)        gol             golo
cell/mobile (phone)   celular         telemóvel
bathroom              banheiro        sala de banhos
truck                 caminhão        camião
jacket                jaqueta         blusão
breakfast             café da manhã   pequeno-almoço
cancer                câncer          cancro
slum                  favela          bairro de lata
"cool"                legal           fixe
swimsuit              maiô            fato de banho
(electronic) screen   tela            ecrã
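
The comparison behind a table like the one above can be sketched in a few lines of Python. This is only an illustration of the idea, not the procedure actually used for the corpus: the function names are hypothetical, and the counts and sub-corpus sizes are placeholder numbers chosen for easy arithmetic, not real corpus figures. Because the country sub-corpora differ in size, raw counts are first normalized to a rate per million words.

```python
def per_million(count, corpus_size):
    """Normalize a raw count to occurrences per million words."""
    return count * 1_000_000 / corpus_size


def preferred_country(counts, sizes):
    """Return the country with the highest normalized rate, plus all rates."""
    rates = {country: per_million(counts[country], sizes[country])
             for country in counts}
    best = max(rates, key=rates.get)
    return best, rates


# Placeholder values only (NOT real corpus data): sub-corpus sizes in words
# and raw hit counts for a single word in each country's pages.
sizes = {"Brazil": 2_000_000, "Portugal": 1_000_000}
counts = {"Brazil": 400, "Portugal": 10}

best, rates = preferred_country(counts, sizes)
ratio = rates["Brazil"] / rates["Portugal"]
print(best, rates, ratio)
```

If the pages that Google assigned to a country really come from that country, a word such as trem should show a much higher rate in the Brazilian pages than in the Portuguese ones, and a word such as comboio should show the reverse.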

While the contrasts above focused on Brazil and Portugal, the corpus can of course be used to compare Angola and Mozambique to the other dialects. For example, the following are words that are more common in Angola:

       cacimba (well), garina (girl), jinguba (peanut), muceque (slum quarter), and cubata (house)

And the following are more common in Mozambique (the first two are fairly common in Angola as well):

       machimbombo (bus), mata-bicho (breakfast), madala (person of high esteem or status), and machamba (agricultural land).

Syntactic and morphological

Of course the corpus can be used to look at syntactic and morphological differences between dialects as well. The following are just a few examples of differences between Brazil and Portugal:
 
Brazil                     Explanation                                Portugal              Explanation
eu iria VERB               "analytic" conditional
estava* [vpp*]             progressive construction                   estava* a [vr*]       progressive construction
eu [po*] VERB              pre-verbal clitic (with finite verb)       eu *-_v [po*]         post-verbal clitic (with finite verb)
e me|lhe disse             pre-verbal clitic (with finite verb)       eu *-_v [po*]         post-verbal clitic (with finite verb)
para|pra DAR lhe           post-verbal clitic (with nonfinite verb)   para|pra lhe DAR      pre-verbal clitic (with nonfinite verb)
para|pra fazer- lo         post-verbal clitic (with nonfinite verb)   para|pra o fazer      pre-verbal clitic (with nonfinite verb)
eu VERB você               você instead of clitic
eu VERB para|pra ela|ele   stressed pronoun instead of clitic
DAR para|pra mim           stressed pronoun instead of clitic
                                                                      me os|as|a|o [vis*]   contracted pronouns *
* Note: the two clitics are separate words in the corpus, but would be a single contracted form in the actual text, e.g. mo disse.
Also note that many of the Brazilian examples of this construction are from archives of older literature.
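
To make the query notation in the table above more concrete, here is a toy Python matcher. It is a sketch under stated assumptions and not the corpus's actual search engine. It assumes that | separates alternatives, that * is a wildcard, and that a bracketed slot such as [vr*] is matched against a part-of-speech tag. It does not handle uppercase slots such as VERB or DAR. The function names are hypothetical, the sample sentences are made up, and the tags "prep" and "pd" are invented for illustration.

```python
from fnmatch import fnmatchcase


def token_matches(slot, word, tag):
    """Check one query slot against one (word, tag) token."""
    if slot.startswith("[") and slot.endswith("]"):
        # Bracketed slot: compare the pattern with the part-of-speech tag.
        return fnmatchcase(tag, slot[1:-1])
    # Plain slot: any of the |-separated word patterns may match.
    return any(fnmatchcase(word.lower(), alt.lower())
               for alt in slot.split("|"))


def find_matches(query, tokens):
    """Return every run of consecutive tokens that matches the query."""
    slots = query.split()
    hits = []
    for start in range(len(tokens) - len(slots) + 1):
        window = tokens[start:start + len(slots)]
        if all(token_matches(s, w, t) for s, (w, t) in zip(slots, window)):
            hits.append(" ".join(w for w, _ in window))
    return hits


# Made-up tagged sentences; the tag labels are assumptions for this sketch.
pt_tokens = [("ele", "po"), ("estava", "vis"), ("a", "prep"),
             ("fazer", "vr"), ("isso", "pd")]
br_tokens = [("ele", "po"), ("estava", "vis"),
             ("fazendo", "vpp"), ("isso", "pd")]

pt_hits = find_matches("estava* a [vr*]", pt_tokens)
br_hits = find_matches("estava* [vpp*]", br_tokens)
cross_hits = find_matches("estava* a [vr*]", br_tokens)
print(pt_hits, br_hits, cross_hits)
```

Counting such matches separately in the pages assigned to each country, and normalizing by sub-corpus size, gives the kind of contrast between the Brazil and Portugal columns that the table summarizes.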