Marco's Blog

All content personal opinions or work.

The Most Translated Pages on Wikipedia

2012-12-31 6 min read marco

There is a list on Meta, the wiki about Wikipedia, that contains 1,000 articles deemed important for a new language version of Wikipedia. So if you come up with some new language (not Klingon, please!), you should first translate these articles to kick-start your project.

I looked through the list and found it extremely arbitrary. It was pretty obvious it was a curated list that had someone’s opinion of the things that matter. So I decided to do my own footwork and determine what articles you should translate first.

To do so, I decided that what matters most is what other languages decide matters. And a language decides what matters by writing an article about it. And if a language other than the one in which you are reading the article has the article, too, there is going to be a link on the left that tells you about that.

The little image on the right shows what I mean: on every Wikipedia page there is a list of languages for which the article you are reading is available. Curiously, the list contains the names of the languages, although the link is to the exact article that is the translation.

Now the question: what are the articles with the most translations? The ones that are most translated are probably also the ones you want to translate into your new language, no?

After a brief investigation, it appeared that nobody had asked that question before. There are all sorts of lists of articles, but none that did what I wanted. Since Wikipedia is an open effort, I decided to look at what it would take to find the information on my own.

Wikipedia provides a service that allows you to download the entire world of wikis. It’s called dumps.wikimedia.org. If you go there, you will see all the different versions that you can download, plus all the languages for which there is a Wikipedia. Some of them are the regulars: English, German, French, and Dutch are the ones that have over 1,000,000 articles. Then there are the smaller languages, like Hebrew, Danish, or Korean. Then there are vanishing languages, like Hawaiian or Basque (hoping for a revival of all of them!). Finally, there are dead languages (like Latin) and made-up languages (like Esperanto).

For all of them, you can download the entire Wikipedia. No questions asked.

Each Wikipedia comes with a giant set of files that correspond to database tables. Apparently, in the Wikipedia database, each Wikipedia has its own database, and the wiki is made up of tables, one per file (or rather vice versa).

At first, I thought I’d download the whole thing. Turns out it would take several days. So I tried to be smart about it. I looked at all the files available and figured I really needed only the language links (which are stored in a separate table). At first I thought I’d do a really good job and download and cross-reference all the different language link tables. But then I realized that the English Wikipedia is so much larger than the other ones, if there is an article, it’s probably in the English Wikipedia, too. So I would get preliminary results from the English file alone that would approximate the total tally.

Good thing I did that, because loading the single language file alone took several hours. I could have probably done better, but I just told MySQL to slurp the whole thing.
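A full database import is not the only way to read the file. As a minimal sketch of an alternative (not what was done for this post), the dump can be streamed and parsed directly. It assumes the file is a gzipped MySQL dump whose rows appear as (ll_from,'ll_lang','ll_title') tuples inside INSERT INTO statements, with quotes escaped by a backslash; the names ROW, unescape, iter_langlinks and read_dump are made up for illustration.

```python
import gzip
import re

# One (ll_from,'ll_lang','ll_title') tuple. Quoted fields may contain
# backslash-escaped characters such as \' (assumed MySQL dump escaping).
ROW = re.compile(r"\((\d+),'((?:[^'\\]|\\.)*)','((?:[^'\\]|\\.)*)'\)")


def unescape(value):
    # Turn an escaped sequence such as \' back into the plain character.
    return re.sub(r"\\(.)", r"\1", value)


def iter_langlinks(lines):
    """Yield (page_id, lang, title) for every row found in the dump lines."""
    for line in lines:
        if not line.startswith("INSERT INTO"):
            continue  # skip schema definitions and comments
        for page_id, lang, title in ROW.findall(line):
            yield int(page_id), unescape(lang), unescape(title)


def read_dump(path):
    """Stream rows from a gzipped dump file without loading it whole."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        yield from iter_langlinks(handle)
```

Because rows are yielded one at a time, a tally can be built with a dictionary or collections.Counter keyed by page ID, with no database involved. Whether that would actually be faster than the MySQL import was not measured.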

The structure of the table is very simple. It contains an article ID (column ll_from), a language code like ‘de’ for German (column ll_lang), and the title of the page in that language (column ll_title). Of course, the link is simply built as ll_lang.wikipedia.org/wiki/ll_title (with URL encoding for the title).
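The link construction can be sketched in a few lines. This is only an illustration: the function name langlink_url is hypothetical, the https scheme is an assumption, and so is the replacement of spaces by underscores before encoding.

```python
from urllib.parse import quote


def langlink_url(ll_lang, ll_title):
    """Build an article URL from the ll_lang and ll_title of one row."""
    # Assumption: spaces in titles become underscores in the URL path.
    path = quote(ll_title.replace(" ", "_"))
    # Assumption: https; the post only gives host and path.
    return f"https://{ll_lang}.wikipedia.org/wiki/{path}"


print(langlink_url("de", "München"))
```

The call to quote percent-encodes the UTF-8 bytes of any non-ASCII character, so the German title above ends up as M%C3%BCnchen in the URL.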

Here was the only hiccup of the story: the English language link file contains all the titles of the pages in other languages. Just not in English. Fortunately, the second-largest Wikipedia is the German one. I speak German, so I decided to get the names of pages in German, figuring that an article that didn’t have a translation into German was highly unlikely to be a very popular article.

Next, I had to remove all the internal links – mostly translations of user homepages, which are treated like articles in Wikipedia. They are easy to find, as they all contain the character ‘:’ (colon), which is illegal in an article title.
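The colon rule is easy to express as a filter. A small sketch, using made-up sample rows in the (ll_from, ll_lang, ll_title) layout described above; the helper name is_article_title is hypothetical.

```python
def is_article_title(title):
    """Apply the post's heuristic: a colon marks a non-article page."""
    return ":" not in title


# Hypothetical sample rows: (ll_from, ll_lang, ll_title)
rows = [
    (12, "de", "Anarchismus"),
    (99, "de", "Benutzer:Beispiel"),  # a user page, to be dropped
    (12, "fr", "Anarchisme"),
]

articles = [row for row in rows if is_article_title(row[2])]
print(articles)
```

Only the two rows without a colon in the title survive the filter.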

That all took several hours. Then I decided to write a query that would do the trick and came up with this (clumsy thing):

select a.ll_from, count(*), b.ll_title from langlinks a, langlinks b where a.ll_from = b.ll_from and b.ll_lang = 'de' group by a.ll_from, b.ll_title order by count(*) desc

That ran for several minutes and then spit out the most amazing list.
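To see what the self-join computes, the same idea can be tried on a toy table. The sketch below uses Python's built-in sqlite3 module rather than MySQL, and the sample rows are invented; it assumes each page has at most one link per language, so joining on the German row does not inflate the count.

```python
import sqlite3


def top_translated(rows, name_lang="de", limit=10):
    """Count language links per page, labelled with the title in name_lang."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE langlinks (ll_from INTEGER, ll_lang TEXT, ll_title TEXT)"
    )
    conn.executemany("INSERT INTO langlinks VALUES (?, ?, ?)", rows)
    query = """
        SELECT a.ll_from, COUNT(*) AS n, b.ll_title
        FROM langlinks AS a
        JOIN langlinks AS b ON a.ll_from = b.ll_from
        WHERE b.ll_lang = ?
        GROUP BY a.ll_from, b.ll_title
        ORDER BY n DESC, a.ll_from
        LIMIT ?
    """
    result = conn.execute(query, (name_lang, limit)).fetchall()
    conn.close()
    return result


# Invented sample data: page 3 has no German link and drops out of the result.
sample = [
    (1, "de", "Russland"), (1, "fr", "Russie"), (1, "es", "Rusia"),
    (2, "de", "Europa"), (2, "fr", "Europe"),
    (3, "fr", "Curitiba"),
]
print(top_translated(sample))
```

On the toy rows, page 1 comes first with three links and page 2 second with two. Page 3 is missing because the join keeps only pages that have a German title, which is the same restriction the real query applies.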

Here are the top 10 entries:

  1. 260 translations, Russia
  2. 256 True Jesus Church
  3. 254 United States of America
  4. 253 (Empty)
  5. 249 Germany
  6. 249 Wikipedia
  7. 247 Europe
  8. 242 Curitiba
  9. 239 Africa
  10. 239 France

Now, looking at this list, I realize immediately that my original goal of creating a less biased list completely failed. While all in all the list makes sense, there are three entries that make no sense at all.

There is the fourth entry, which is empty. Apparently there is a page with an empty title that has 253 translations. Admittedly, that sounds like an easy page to translate.

Then there are the entries for the True Jesus Church, which is apparently an Evangelical Christian Church in China I wasn’t familiar with, and for Curitiba, a city in Brazil. I would venture a guess that someone did some serious astroturfing there.

(If you are interested in the complete list, comment at the bottom and I’ll add it to the download section.)

There was another interesting thing I noticed. The IDs of pages are not random. If you list them from lowest to highest, they reference articles with titles that are in roughly alphabetical order. After the alphabet follow the numbers. Then, after ID 36487 (Sparta, the first of the non-numeric ones), titles are not tied to IDs in a discernible way. This probably means that the current low page IDs are from an import of Wikipedia into a new database, with new pages (after Sparta) getting IDs assigned as they are created.

The page that has the lowest ID right now is 12, for Anarchism.