[[!meta title="Mirroring MediaWiki with Git-Mediawiki and gitolite"]]
[[!meta author="rohieb"]]
[[!meta license="CC-BY-SA 3.0"]]

From Murphy’s Law we can deduce that Internet failures always come when you
least expect them. In my case, the [Stratum 0 wiki][s0wiki] was offline for a
few minutes (only, thankfully!) when I really urgently(1!11) needed to look
something up there. If I only had an offline clone of the wiki…

[s0wiki]: https://stratum0.org/wiki/ "Stratum 0 wiki"

Enter: Git-Mediawiki
--------------------

Some time ago I had already discovered [Git-Mediawiki][], which lets you mirror
some or all pages of a MediaWiki instance to a local Git repository. It achieves
this by implementing the `mediawiki::` remote handler, which lets you configure
the URL of the remote MediaWiki instance as a Git remote, and loads the raw
revisions from the MediaWiki API every time you do a `git fetch`:

    $ git clone mediawiki::https://stratum0.org/mediawiki/
    Cloning into 'mediawiki'...
    Searching revisions...
    No previous mediawiki revision found, fetching from beginning.
    Fetching & writing export data by pages...
    Listing pages on remote wiki...
    6 pages found.
    page 1/78: Vorstand
      Found 2 revision(s).
    page 2/78: Atomuhr
      Found 15 revision(s).
    page 3/78: Corporate Identity
      Found 6 revision(s).
    page 4/78: Presse
      Found 2 revision(s).
    [...]
    1/804: Revision #738 of Presse
    2/804: Revision #3036 of Atomuhr
    3/804: Revision #3053 of Atomuhr
    4/804: Revision #3054 of Atomuhr
    [...]
    Checking connectivity... done.

Needless to say, this can take a very long time if you try to import a whole
wiki (say, Wikipedia (NO, DON’T ACTUALLY DO THIS! (or at least don’t tell them I
told you how))), but you [can also][gmw-partialimport] import only single pages
or pages from certain categories with the `-c remote.origin.pages=` and
`-c remote.origin.categories=` options to `git-clone`.

After the clone has finished, you can view the raw MediaWiki source files of the
pages you imported on your computer.
You can even edit them and push the changes back to the wiki if you
[configure your wiki user account][gmw-auth] in your Git config!

[Git-Mediawiki]: https://github.com/moy/Git-Mediawiki "Git-Mediawiki on GitHub"
[gmw-partialimport]: https://github.com/moy/Git-Mediawiki/wiki/User-manual#partial-import-of-a-wiki "Git-Mediawiki: Partial imports"
[gmw-auth]: https://github.com/moy/Git-Mediawiki/wiki/User-manual#authentication "Git-Mediawiki: Authentication"

Since I had already played around with Git-Mediawiki, I had a local mirror of
the Stratum 0 wiki on my laptop. Unfortunately, I had not pulled for a few
weeks, and the information I needed had only been added to the wiki some days
ago. So for the future, it would be nice to have an automatically synchronising
mirror… And not only one on my personal laptop, but also one for other
interested users, at least read-only.

Mirroring repositories with gitolite
------------------------------------

The best solution for me was a mirror repository on my own server, which was
already running [gitolite][], the popular Git hosting solution. I would simply
add a read-only repository in gitolite and let a cron job take care of the
automatic synchronisation.

[gitolite]: http://gitolite.com/gitolite/index.html "gitolite main page"

Creating the new repository was easy: you simply add a line to your
`gitolite.conf`, and when you push the changes, gitolite creates the repository
for you. Beyond that, I also wanted to configure the MediaWiki remote directly
in my repository setup, for which I needed to specify the corresponding `remote`
options for the Git configuration.
[The appropriate setting to allow this][gitolite-config-keys] is in
`.gitolite.rc` (gitolite’s main configuration file, which resides in the
gitolite base directory, `/home/git/` in my case): you simply add the Git config
options you want to set from `gitolite.conf` to the `$GL_GITCONFIG_KEYS`
variable.
Mine now looks like this:

    $GL_GITCONFIG_KEYS = "remote\.* gitweb\.owner gitweb\.description";

[gitolite-config-keys]: http://gitolite.com/gitolite/g2/rc.html#rcsecurity "configuring gitolite's advanced features -- the .gitolite.rc file: variables with a security impact"

Now I could easily add the corresponding options to my repository setup:

    repo stratum0-wiki
        config gitweb.description = "Read-only Git mirror of the Stratum 0 wiki"
        config remote.origin.url = "mediawiki::https://stratum0.org/mediawiki"
        config remote.origin.fetch = "+refs/heads/*:refs/remotes/origin/*"
        config remote.origin.fetchstrategy = "by_rev"
        RW+ = rohieb
        R = @all daemon gitweb

Note that I let Git-Mediawiki work with the `by_rev` fetch strategy, which
queries the MediaWiki API for all recent revisions rather than first looking for
changed pages and then fetching the revisions accordingly. This is more
efficient since I want to import every revision anyway. I also found out the
hard way (i.e. through print debugging) that adding the `remote.origin.fetch`
option is critical for Git-Mediawiki to work correctly.

Then I created a simple cron job for the `git` user (which owns all the gitolite
repositories) with `crontab -e` to update the mirror every 30 minutes:

    # m h dom mon dow command
    */30 * * * * /home/git/update-stratum0-mediawiki-mirror

The script which does all the work resides in
`/home/git/update-stratum0-mediawiki-mirror`:

[[!format sh <<EOF
[...]
git fetch 2>&1 | grep -i 'fatal\|error\|warn'
git update-ref refs/heads/master refs/mediawiki/origin/master
EOF]]

Note that we cannot simply `git-merge` the master branch here, because the
gitolite repository is a bare repo and `git-merge` needs a working tree.
Therefore, we only fetch new revisions from our MediaWiki remote (which fetches
to `refs/mediawiki/origin/master`), and update the master branch manually. Since
the mirror is read-only and there are no real merges to be done, this is
sufficient here.

So far, we have a fully working mirror.
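The fetch-then-`update-ref` step can be tried out without touching a real wiki.
What follows is only a minimal sketch, assuming a throwaway local repository
stands in for the MediaWiki remote; the temporary paths, the `demo` identity and
the explicit refspec are made up for illustration and are not taken from the
actual mirror script:

```shell
#!/bin/sh
# Sketch: advance master in a bare repository without a working tree.
# Assumption: a local dummy repository replaces the MediaWiki remote,
# and every path and name below is hypothetical.
set -e

tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Stand-in for the wiki: an ordinary repository with a single commit.
git init -q "$tmp/upstream"
git -C "$tmp/upstream" -c user.name=demo -c user.email=demo@example.org \
    commit -q --allow-empty -m "first revision"

# Stand-in for the gitolite mirror: a bare repository, so there is
# no working tree that `git merge` could operate on.
git init -q --bare "$tmp/mirror.git"
cd "$tmp/mirror.git"

# Fetch into the ref that the post says Git-Mediawiki fetches to.
git fetch -q "$tmp/upstream" "+HEAD:refs/mediawiki/origin/master"

# Point master at the fetched commit; this only rewrites a ref.
git update-ref refs/heads/master refs/mediawiki/origin/master

mirror_master=$(git rev-parse refs/heads/master)
fetched=$(git rev-parse refs/mediawiki/origin/master)
echo "master:  $mirror_master"
echo "fetched: $fetched"
```

Both lines should print the same commit ID. On the real mirror, Git-Mediawiki's
remote helper does the fetching instead, and only the `update-ref` line carries
over unchanged.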
But since the Stratum 0 wiki has grown to more than 7000 revisions to date, the
initial fetch would take a while. To reduce the load on the MediaWiki API, I
figured that I could reuse the existing repository on my laptop.

Re-using a previous Git-Mediawiki repo
--------------------------------------

So before activating the cron job, I pushed my existing repository to the
mirror:

    ~/stratum0-wiki$ git push rohieb.name master
    ~/stratum0-wiki$ git push rohieb.name refs/mediawiki/origin/master

A test run of the mirror script, however, was not happy with that and wanted to
fetch ALL THE revisions anyway. It took me another while to find out that, for
efficiency reasons, Git-Mediawiki stores the corresponding MediaWiki revisions
in [Git notes][git-notes] under `refs/notes/origin/mediawiki`. For example:

    $ git log --notes=refs/notes/origin/mediawiki
    commit 7e486fa8a463ebdd177e92689e45f756c05d232f
    Author: Daniel Bohrer
    Date:   Sat Mar 15 14:42:09 2014 +0000

        /* Talks am Freitag, 14. März 2014, 19:00 */ format, youtube-links

    Notes (origin/mediawiki):
        mediawiki_revision: 7444

    [...]

So after I also pushed `refs/notes/origin/mediawiki` to the mirror repo,
everything was fine and the cron job only fetched a small number of new
revisions.

[git-notes]: http://git-scm.com/docs/git-notes "git-notes(1) Manual Page"

Conclusion
----------

To conclude this post, I now have a working MediaWiki mirror for the Stratum 0
wiki, which uses a cron job and Git-Mediawiki to fetch new revisions every 30
minutes, integrated with gitolite for hosting. If you also want to keep track of
changes in the wiki and have an offline mirror for reference, feel free to pull
from [git://git.rohieb.name/stratum0-wiki.git][git-stratum0-wiki].

[git-stratum0-wiki]: http://git.rohieb.name/stratum0-wiki.git "gitweb: stratum0-wiki.git summary"

[[!tag Git gitify gitolite git-notes MediaWiki mirror]]