Taavi Väänänen: How to import a new Wikipedia language edition (in hard mode)

I created the latest Wikipedia language edition, the Toki Pona Wikipedia, last month.
Most wikis start their lives in the Wikimedia Incubator before the full wiki is
created. In this case, however, the community had been using a
completely external MediaWiki site to build the wiki before it was approved as
a “proper” Wikipedia wiki,[1] and now that external wiki needed to be imported
into the newly created Wikimedia-hosted wiki. (As far as I’m aware, the last and previously
only time an external wiki has been imported to a Wikimedia project was in 2013 when
Wikitravel was forked as Wikivoyage.)

Creating a Wikimedia wiki these days is actually pretty straightforward, at least
when compared to what it used to be like a couple of years ago. Today the process
mostly involves using a script to generate two configuration changes, one to add
the basic configuration for a wiki to operate and another to add the wiki to the
list of all wikis that exist, and then running a script to create the wiki database
between deploying those two configuration changes. And then you wait half an
hour while the script that tells all Wikidata client wikis about the new wiki runs on
one wiki at a time.

The primary technical challenge in importing a third-party wiki is that there’s
no SUL (single user login) making sure that a single username maps to the same account on both
wikis. This means that the usual strategy of using the functionality I wrote in
CentralAuth to manually create local accounts can’t be used as is, and so we
needed to come up with a new way of matching everyone’s contributions to their
existing Wikimedia accounts.

(Side note: While the user-facing interface tries to present a single “global”
user account that can be used on all public Wikimedia wikis, in reality the
account management layer in CentralAuth is mostly just a glue layer to link
together individual “local” accounts on each wiki that the user has ever
visited. These local accounts have independent user ID numbers: for example, I
am user #35938993 on the English Wikipedia but #4 on the new Toki Pona
Wikipedia. They are what most MediaWiki code interacts with, except for a few
features specifically designed with cross-wiki usage in mind. This distinction
is also still very much present and visible in the various administrative and
anti-abuse workflows.)

The approach we ended up choosing was to rewrite the dump file before importing,
so that a hypothetical account called $Name would become $Name~wikipesija.org
after the import.[2] We also created empty user accounts that would take
ownership of the edits to be imported so that we could use the standard account
management tools on them later on. MediaWiki supports importing contributions
without a local account to attribute them to, but it doesn’t seem to be possible
to convert an imported actor[3] to a regular user later on. We wanted
to keep that possibility open, even with the minor downside of creating a few hundred
users that will likely never be touched again.
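
To make the rewriting step more concrete, here is a minimal Python sketch of the idea. It is not the script that was actually used for the import. It assumes the dump follows the usual MediaWiki XML export layout, where each revision has a `<contributor>` element containing either a `<username>` or an `<ip>`. The function name `suffix_usernames` and the sample page are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Suffix described in the text above.
SUFFIX = "~wikipesija.org"


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def suffix_usernames(dump_xml: str, suffix: str = SUFFIX) -> str:
    """Return the dump with the suffix appended to every <username> element.

    Anonymous edits (<ip> elements) and page text are left untouched.
    """
    root = ET.fromstring(dump_xml)
    # Keep the export namespace as the default one, so the output does not
    # end up with generated 'ns0:' prefixes on every tag.
    if root.tag.startswith("{"):
        ET.register_namespace("", root.tag[1:].split("}", 1)[0])
    for element in root.iter():
        if local_name(element.tag) == "username" and element.text:
            # Skip names that already carry the suffix, so that running the
            # rewrite twice does not add it twice.
            if not element.text.endswith(suffix):
                element.text += suffix
    return ET.tostring(root, encoding="unicode")


# Hypothetical two-revision dump: one registered user, one anonymous edit.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">
  <page>
    <title>lipu open</title>
    <revision>
      <contributor><username>Name</username><id>7</id></contributor>
      <text>toki!</text>
    </revision>
    <revision>
      <contributor><ip>192.0.2.1</ip></contributor>
      <text>toki a!</text>
    </revision>
  </page>
</mediawiki>"""

rewritten = suffix_usernames(SAMPLE)
```

This sketch loads the whole dump into memory and only touches contributor names. A real dump would probably call for a streaming parser, and anything else that embeds usernames (user page titles, signatures in page text) would need separate handling.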

We also made two specific decisions: to add the username suffix to everyone, not
just to those names that conflicted with existing SUL accounts, and to deal with
renaming users who wanted their contributions linked to an existing SUL account
only after the import. Both reduced complexity, and thus risk, in the
import phase, which already had many more unknowns than the rest of the
process. They were also much better options ethically: suffixing all names
meant we would not imply that those people chose to be Wikimedians with those
specific usernames (when in reality it was us choosing to import those edits to
the Wikimedia universe), and doing renames using the standard MediaWiki account
management tooling meant that they produced the normal public log entries that
all other MediaWiki administrative actions create.

With all of the edits imported, the only major thing remaining was doing those merges
I mentioned earlier to attribute imported edits to people’s existing SUL accounts.
Thankfully, the local-account-based system actually makes this pretty simple. Usually
CentralAuth prevents renaming individual local accounts that are attached to a global
account, but that check can be bypassed with a maintenance script or a privileged
enough account. Renaming the user automatically detached it from the previous global
account, after which another maintenance script could be used to attach the
user to the correct global account.
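
The rename, detach, and attach sequence can be pictured with a small toy model in Python. This is only a sketch of the bookkeeping described above, not CentralAuth’s real code or API. The class and method names, the `tokwiki` identifier, and the user ID 42 are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class LocalAccount:
    wiki: str
    user_id: int                       # edits stay tied to this local ID
    name: str
    global_name: Optional[str] = None  # None means "not attached"


class ToyCentralAuth:
    """Toy glue layer: a global account is just a set of attached local ones."""

    def __init__(self) -> None:
        self.local: Dict[Tuple[str, str], LocalAccount] = {}

    def add_local(self, wiki: str, user_id: int, name: str,
                  attached: bool = False) -> LocalAccount:
        account = LocalAccount(wiki, user_id, name, name if attached else None)
        self.local[(wiki, name)] = account
        return account

    def rename_local(self, wiki: str, old: str, new: str,
                     force: bool = False) -> LocalAccount:
        account = self.local[(wiki, old)]
        # Normally an attached local account may not be renamed on its own.
        if account.global_name is not None and not force:
            raise PermissionError("attached to a global account; use force")
        if (wiki, new) in self.local:
            raise ValueError("target name already exists on this wiki")
        del self.local[(wiki, old)]
        account.name = new
        account.global_name = None     # the rename detaches the account
        self.local[(wiki, new)] = account
        return account

    def attach(self, wiki: str, name: str) -> LocalAccount:
        # Link the local account to the global account of the same name.
        account = self.local[(wiki, name)]
        account.global_name = name
        return account


auth = ToyCentralAuth()
auth.add_local("tokwiki", 42, "Name~wikipesija.org", attached=True)
merged = auth.rename_local("tokwiki", "Name~wikipesija.org", "Name", force=True)
detached_after_rename = merged.global_name is None
auth.attach("tokwiki", "Name")
```

The point of the model is that the local user ID never changes, so the imported edits follow the account through the rename and end up under the person’s existing global account.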


  1. That external site was a fork of a fork of the original Toki Pona Wikipedia
    that was closed in 2005. And because cool URIs don’t change,
    we made the URLs that the old Wikipedia was using work again. Try it: https://art-tokipona.wikipedia.org

  2. wikipesija.org was the domain where the old third-party wiki was hosted, and
    ~ was used as a separator character in usernames during the
    SUL finalization in the early
    2010s, so using it here felt appropriate as well.

  3. An actor is a MediaWiki term and a
    database table referring to anything that can do edits or logged actions. Usually an actor
    is a user account or an IP address, but an imported user name in a specific format can
    also be represented as an actor.