The corpus contains one billion words in one million web pages from four Portuguese-speaking countries. The web pages were collected in late 2015, using the following process:
1. The list of web pages was created by running hundreds of high-frequency n-grams from the Corpus do Português (e.g. e que é, não é um) against Google to generate essentially "random" web pages (presumably there are no AdSense entries or meaningful page rankings for a phrase like não é um).
2. We repeated this process for each of the four countries (Brazil, Portugal, Angola, and Mozambique), limiting each set of searches to one country with the [Region] function of Google "Advanced Search". The question, of course, is how well Google knows which country a page comes from when it is not marked by a country-code top-level domain (e.g. .br for Brazil).
As Google explains, "we'll rely largely on the site's (1) country domain (.ca, .de, etc.). If an international domain (.com, .org, .eu, etc.) has been used, we'll rely on several signals, including (2) IP address, (3) location information on the page, (4) links to the page, and (5) any relevant information from Google Places."
For example, for a .com address (where the domain does not indicate a country), Google will try to use the IP address (which shows where the server is physically located). But even if that fails, Google could still see that 95% of the visitors to the site come from Brazil, and that 95% of the links to that page are from Brazil (and remember that Google knows both of these things), and it would then guess that the site is probably from Brazil. This is not perfect, but it is very accurate, as the results from the dialect-oriented searches show.
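The kind of inference described above can be illustrated with a toy sketch. This is not Google's actual algorithm; the signal names and the simple averaging are invented for illustration.

```python
def guess_country(signals):
    """Pick the country with the largest combined share across signals.

    'signals' maps a signal name (e.g. visitor share, inlink share) to a
    dict of per-country proportions; we simply sum the shares per country
    and take the maximum.
    """
    totals = {}
    for shares in signals.values():
        for country, share in shares.items():
            totals[country] = totals.get(country, 0.0) + share
    return max(totals, key=totals.get)

# The example from the text: 95% of visitors and 95% of inlinks
# come from Brazil, so Brazil is the best guess.
guess = guess_country({
    "visitors": {"Brazil": 0.95, "Portugal": 0.05},
    "inlinks":  {"Brazil": 0.95, "Portugal": 0.05},
})
```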
3. In addition to the four sets of "General" Google searches (all web pages, one set per country), we repeated the process with Google "Blog" searches (using the Advanced Search [Region] option in both cases). The blog searches return only blogs, while the "General" searches also include some blogs.
4. We then downloaded all of the one
million unique web pages using
HTTrack.
5. After this, we ran all of the one million web pages through JusText to remove boilerplate material (e.g. headers, footers, sidebars). (Thanks to Michael Bean for helping to set this up.)
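JusText classifies blocks of a page using features such as link density and stopword density; a much-simplified, pure-Python sketch of that idea (not JusText itself — the thresholds and the tiny stopword list are invented) might look like:

```python
def is_boilerplate(text, link_chars, stopwords):
    """Very rough block classifier in the spirit of jusText.

    Real jusText uses calibrated thresholds on stopword density, link
    density, and block length; the numbers here are invented placeholders.
    """
    words = text.split()
    if not words:
        return True
    link_density = link_chars / max(len(text), 1)
    stop_density = sum(w.lower() in stopwords for w in words) / len(words)
    # Short, link-heavy, or stopword-poor blocks look like navigation chrome.
    return len(words) < 5 or link_density > 0.5 or stop_density < 0.2

# Tiny illustrative Portuguese stopword set.
STOPWORDS = {"e", "que", "é", "não", "um", "de", "a", "o"}
```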
6. Finally, we used n-gram matching to eliminate the remaining duplicate texts. Even more difficult was the removal of duplicate "snippets" of text that recur on multiple web pages (e.g. legal notices, or information about the creator of a blog or a newspaper columnist), which JusText did not eliminate. While there are undoubtedly still some duplicate texts, we are continuing to work on this.
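The n-gram matching in this final step can be sketched as word-shingle overlap: two documents are near-duplicates when the Jaccard similarity of their n-gram sets is high. The shingle size and threshold below are invented placeholders, not the values used for the corpus.

```python
def ngrams(text, n=5):
    """Set of word n-grams ('shingles') for one document."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def near_duplicates(a, b, n=5, threshold=0.5):
    """True when the Jaccard similarity of the two n-gram sets is high."""
    sa, sb = ngrams(a, n), ngrams(b, n)
    if not sa or not sb:
        return False
    overlap = len(sa & sb) / len(sa | sb)
    return overlap >= threshold
```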