From: Roland Hieber
Date: Mon, 24 Mar 2014 02:12:03 +0000 (+0100)
Subject: new blag post: Mirroring MediaWiki with Git-Mediawiki and gitolite
X-Git-Url: https://git.rohieb.name/www-rohieb-name.git/commitdiff_plain/573fd641cca67524bfaec81d4a19c976b6092149?ds=sidebyside

new blag post: Mirroring MediaWiki with Git-Mediawiki and gitolite
---

diff --git a/blag/post/mirroring-mediawiki-with-git-mediawiki-and-gitolite.mdwn b/blag/post/mirroring-mediawiki-with-git-mediawiki-and-gitolite.mdwn
new file mode 100644
index 0000000..e8fea90
--- /dev/null
+++ b/blag/post/mirroring-mediawiki-with-git-mediawiki-and-gitolite.mdwn
@@ -0,0 +1,188 @@
+[[!meta title="Mirroring MediaWiki with Git-Mediawiki and gitolite"]]
+[[!meta author="rohieb"]]
+[[!meta license="CC-BY-SA 3.0"]]
+
+From Murphy’s Law we can deduce that Internet failures always come when you
+least expect them. In my case, the [Stratum 0 wiki][s0wiki] was offline for a
+few minutes (only, thankfully!) when I really urgently(1!11) needed to look
+something up there. If only I had an offline clone of the wiki…
+
+[s0wiki]: https://stratum0.org/wiki/ "Stratum 0 wiki"
+
+
+Enter: Git-Mediawiki
+--------------------
+
+I had previously discovered [Git-Mediawiki][], which lets you mirror certain
+or all pages of a MediaWiki instance to a local Git repository. It achieves
+this by implementing the `mediawiki::` remote handler, which lets you configure
+the URL of the remote MediaWiki instance as a Git remote, and loads the raw
+revisions from the MediaWiki API every time you do a `git fetch`:
+
+    $ git clone mediawiki::https://stratum0.org/mediawiki/
+    Cloning into 'mediawiki'...
+    Searching revisions...
+    No previous mediawiki revision found, fetching from beginning.
+    Fetching & writing export data by pages...
+    Listing pages on remote wiki...
+    6 pages found.
+    page 1/78: Vorstand
+    Found 2 revision(s).
+    page 2/78: Atomuhr
+    Found 15 revision(s).
+    page 3/78: Corporate Identity
+    Found 6 revision(s).
+    page 4/78: Presse
+    Found 2 revision(s).
+    [...]
+    1/804: Revision #738 of Presse
+    2/804: Revision #3036 of Atomuhr
+    3/804: Revision #3053 of Atomuhr
+    4/804: Revision #3054 of Atomuhr
+    [...]
+    Checking connectivity... done.
+
+Of course, this can take a very long time if you try to import a whole wiki
+(say, Wikipedia (NO, DON’T ACTUALLY DO THIS! (or at least don’t tell them I told
+you how))), but you [can also][gmw-partialimport] import only single pages or
+pages from certain categories with the `-c remote.origin.pages=` and
+`-c remote.origin.categories=` options to `git-clone`.
+
+After the clone has finished, you can view the raw MediaWiki source files of
+the imported pages on your computer. You can even edit them and push the
+changes back to the wiki if you [configure your wiki user account][gmw-auth] in
+your Git config!
+
+[Git-Mediawiki]: https://github.com/moy/Git-Mediawiki "Git-Mediawiki on GitHub"
+[gmw-partialimport]: https://github.com/moy/Git-Mediawiki/wiki/User-manual#partial-import-of-a-wiki "Git-Mediawiki: Partial imports"
+[gmw-auth]: https://github.com/moy/Git-Mediawiki/wiki/User-manual#authentication "Git-Mediawiki: Authentication"
+
+Since I had already played around with Git-Mediawiki, I had a local mirror of
+the Stratum 0 wiki on my laptop. Unfortunately, I had not pulled for a few
+weeks, and the information I needed had been added only a few days earlier. So
+for the future, it would be nice to have an automatically synchronising mirror…
+And not only one on my personal laptop, but also for other interested users,
+at least read-only.
+
+
+Mirroring repositories with gitolite
+-----------
+
+The best solution for me was a mirror repository on my own server, which was
+already running [gitolite][], the popular Git hosting solution. I would simply
+add a read-only repository in gitolite and let a cron job take care of
+automatic synchronisation.
+
+[gitolite]: http://gitolite.com/gitolite/index.html "gitolite main page"
+
+Creating the new repository was easy: you simply add a line to your
+`gitolite.conf`, and when you push the change, gitolite creates the repository
+for you. Furthermore, I also wanted to configure the MediaWiki remote directly
+in my repository setup, for which I needed to specify the corresponding `remote`
+options for the Git configuration. [The appropriate setting to allow
+this][gitolite-config-keys] is in `.gitolite.rc` (gitolite’s main configuration
+file, which resides in the gitolite base directory, `/home/git/` in my case):
+you can simply add the Git config options you want to set from `gitolite.conf`
+to the `$GL_GITCONFIG_KEYS` variable. Mine now looks like this:
+
+    $GL_GITCONFIG_KEYS = "remote\.* gitweb\.owner gitweb\.description";
+
+[gitolite-config-keys]: http://gitolite.com/gitolite/g2/rc.html#rcsecurity "configuring gitolite's advanced features -- the .gitolite.rc file: variables with a security impact"
+
+Now I could easily add the corresponding options to my repository setup:
+
+    repo stratum0-wiki
+        config gitweb.description = "Read-only Git mirror of the Stratum 0 wiki"
+        config remote.origin.url = "mediawiki::https://stratum0.org/mediawiki"
+        config remote.origin.fetch = "+refs/heads/*:refs/remotes/origin/*"
+        config remote.origin.fetchstrategy = "by_rev"
+        RW+ = rohieb
+        R = @all daemon gitweb
+
+Note that I let Git-Mediawiki work with the `by_rev` fetch strategy, which
+queries the MediaWiki API for all recent revisions rather than first looking for
+changed pages and then fetching the revisions accordingly. This is more
+efficient since I want to import every revision anyway. I also found out
+the hard way (i.e. through print debugging) that adding the
+`remote.origin.fetch` option is critical for Git-Mediawiki to work correctly.
+
+Then, a simple cron job for the `git` user (which owns all the gitolite
+repositories) was created with `crontab -e` to update the mirror every 30
+minutes:
+
+    # m h dom mon dow command
+    */30 * * * * /home/git/update-stratum0-mediawiki-mirror
+
+The script which does all the work resides in
+`/home/git/update-stratum0-mediawiki-mirror`:
+
+[[!format sh <<EOF
+[...] 2>&1 | grep -i 'fatal\|error\|warn'
+git update-ref refs/heads/master refs/mediawiki/origin/master
+EOF]]
+
+Note that we cannot simply `git-merge` the master branch here, because the
+gitolite repository is a bare repo and `git-merge` needs a working tree.
+Therefore, we only fetch new revisions from our MediaWiki remote (which fetches
+to `refs/mediawiki/origin/master`), and update the master branch manually. Since
+the mirror is read-only and there are no real merges to be done, this is
+sufficient here.
+
+So far, we have a fully working mirror. But since the Stratum 0 wiki has grown
+to more than 7000 revisions to date, the initial fetch would take a while. To
+reduce the load on the MediaWiki API, I figured that I could reuse my existing
+repository on my laptop.
+
+
+Re-using a previous Git-Mediawiki repo
+--------------
+
+So before activating the cron job, I pushed my existing repository to the mirror:
+
+    ~/stratum0-wiki$ git push rohieb.name master
+    ~/stratum0-wiki$ git push rohieb.name refs/mediawiki/origin/master
+
+A test run of the mirror script, however, was not happy with that and wanted to
+fetch ALL THE revisions anyway. So it took me another while to find out that for
+efficiency reasons, Git-Mediawiki stores the corresponding MediaWiki revisions
+in [Git notes][git-notes] under `refs/notes/origin/mediawiki`. For example:
+
+    $ git log --notes=refs/notes/origin/mediawiki
+    commit 7e486fa8a463ebdd177e92689e45f756c05d232f
+    Author: Daniel Bohrer
+    Date:   Sat Mar 15 14:42:09 2014 +0000
+
+        /* Talks am Freitag, 14. März 2014, 19:00 */ format, youtube-links
+
+    Notes (origin/mediawiki):
+        mediawiki_revision: 7444
+
+    [...]
+
+So after I also pushed `refs/notes/origin/mediawiki` to the mirror repo,
+everything was fine and the cron job only fetched a small number of new
+revisions.
+
+[git-notes]: http://git-scm.com/docs/git-notes "git-notes(1) Manual Page"
+
+
+Conclusion
+---------
+
+To conclude this post, I now have a working MediaWiki mirror for the Stratum 0
+wiki, which uses a cron job and Git-Mediawiki to fetch new revisions every 30
+minutes, integrated with gitolite for hosting. If you also want to keep track of
+changes in the wiki and have an offline mirror for reference, feel free to pull
+from [git://git.rohieb.name/stratum0-wiki.git][git-stratum0-wiki].
+
+[git-stratum0-wiki]: http://git.rohieb.name/stratum0-wiki.git "gitweb: stratum0-wiki.git summary"
+
+[[!tag Git gitify gitolite git-notes MediaWiki mirror]]
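
The `git update-ref` step from the mirror script can be tried out without a wiki. The following is a minimal sketch, assuming only a local Git installation: a throwaway repository in a temporary directory stands in for the MediaWiki remote, and the paths, the demo identity and the variable names are made up for illustration. It is not the script from the post, only the same technique in isolation.

```shell
#!/bin/sh
# Sketch: advance a branch in a bare repository without git-merge.
# Everything lives in a temporary directory; the "src" repository is a
# hypothetical stand-in for the MediaWiki remote.
set -eu

tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Create the stand-in source repository with a single empty commit.
git init -q "$tmp/src"
git -C "$tmp/src" -c user.name=demo -c user.email=demo@example.org \
    commit -q --allow-empty -m "first revision"

# Create the bare mirror and fetch into the ref named in the post.
git init -q --bare "$tmp/mirror.git"
git -C "$tmp/mirror.git" fetch -q "$tmp/src" \
    "+HEAD:refs/mediawiki/origin/master"

# A bare repository has no working tree, so point master at the fetched
# commit directly instead of merging.
git -C "$tmp/mirror.git" update-ref refs/heads/master \
    refs/mediawiki/origin/master

# Both commit IDs should be identical after the update.
src_head=$(git -C "$tmp/src" rev-parse HEAD)
mirror_head=$(git -C "$tmp/mirror.git" rev-parse refs/heads/master)
echo "source: $src_head"
echo "mirror: $mirror_head"
```

Because the mirror is never committed to directly, moving the branch pointer this way behaves like a fast-forward; a repository that also received local commits on master would need a real merge instead.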