[[!meta title="Mirroring MediaWiki with Git-Mediawiki and gitolite"]]
[[!meta author="rohieb"]]
[[!meta license="CC-BY-SA 3.0"]]

From Murphy’s Law we can deduce that Internet failures always come when you
least expect them. In my case, the [Stratum 0 wiki][s0wiki] was offline for a
few minutes (only, thankfully!) when I really urgently(1!11) needed to look
something up there. If only I had an offline clone of the wiki…

[s0wiki]: https://stratum0.org/wiki/ "Stratum 0 wiki"


Enter: Git-Mediawiki
--------------------

I had already discovered [Git-Mediawiki][], which lets you mirror some or all
pages of a MediaWiki instance to a local Git repository. It achieves
this by implementing the `mediawiki::` remote handler, which lets you configure
the URL of the remote MediaWiki instance as a Git remote, and loads the raw
revisions from the MediaWiki API every time you do a `git fetch`:

    $ git clone mediawiki::https://stratum0.org/mediawiki/
    Cloning into 'mediawiki'...
    Searching revisions...
    No previous mediawiki revision found, fetching from beginning.
    Fetching & writing export data by pages...
    Listing pages on remote wiki...
    6 pages found.
    page 1/78: Vorstand
    Found 2 revision(s).
    page 2/78: Atomuhr
    Found 15 revision(s).
    page 3/78: Corporate Identity
    Found 6 revision(s).
    page 4/78: Presse
    Found 2 revision(s).
    [...]
    1/804: Revision #738 of Presse
    2/804: Revision #3036 of Atomuhr
    3/804: Revision #3053 of Atomuhr
    4/804: Revision #3054 of Atomuhr
    [...]
    Checking connectivity... done.

Needless to say, this can take a very long time if you try to import a whole wiki
(say, Wikipedia (NO, DON’T ACTUALLY DO THIS! (or at least don’t tell them I told
you how))), but you [can also][gmw-partialimport] import only single pages or
pages from certain categories with the `-c remote.origin.pages=<page list>` and
`-c remote.origin.categories=<category list>` options to `git clone`.
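
As a rough sketch of such a partial import: the option name is the one
mentioned above, but the space-separated page list and the `partial_clone`
helper are assumptions made for illustration, so check the linked manual before
relying on them. The helper only prints the clone command for two of the pages
from the output above, and runs it for real only if you set `RUN=1`:

```sh
#!/bin/sh
# Sketch only: partial_clone is a hypothetical helper, not part of Git-Mediawiki.
wiki_url='mediawiki::https://stratum0.org/mediawiki/'

partial_clone() {
    # $1: page list (assumed to be space-separated), $2: target directory
    if [ "${RUN:-0}" = 1 ]; then
        git clone -c remote.origin.pages="$1" "$wiki_url" "$2"
    else
        # Dry run: only show the command that would be executed.
        printf "git clone -c remote.origin.pages='%s' %s %s\n" "$1" "$wiki_url" "$2"
    fi
}

partial_clone 'Atomuhr Presse' stratum0-partial
```

The dry-run default is deliberate: it lets you check the quoting of the page
list before any request is sent to the wiki.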

After the clone has finished, you can view the raw MediaWiki source of the
imported pages right on your computer. You can even edit them and push the
changes back to the wiki if you [configure your wiki user account][gmw-auth] in
your Git config!
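
A minimal sketch of that account setup, assuming the key name
`remote.origin.mwLogin` (as I remember it from the linked manual; please verify
it there). The throwaway repository and the user name `ExampleWikiUser` are
placeholders:

```sh
#!/bin/sh
# Sketch only: demo repository and user name are placeholders.
demo_repo=$(mktemp -d)
git init -q "$demo_repo"
cd "$demo_repo" || exit 1

# Key name as remembered from the Git-Mediawiki manual; verify before use.
git config remote.origin.mwLogin 'ExampleWikiUser'

# Show what got stored in the repository's config.
git config --get remote.origin.mwLogin
```

In a real clone you would run only the `git config` line inside the cloned
repository. The sketch sets just the user name, since a password stored this
way would sit in plain text in `.git/config`.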

[Git-Mediawiki]: https://github.com/moy/Git-Mediawiki "Git-Mediawiki on GitHub"
[gmw-partialimport]: https://github.com/moy/Git-Mediawiki/wiki/User-manual#partial-import-of-a-wiki "Git-Mediawiki: Partial imports"
[gmw-auth]: https://github.com/moy/Git-Mediawiki/wiki/User-manual#authentication "Git-Mediawiki: Authentication"

Since I had already played around with Git-Mediawiki, I had a local mirror of
the Stratum 0 wiki on my laptop. Unfortunately, I had not pulled for a few
weeks, and the information I needed was only added to the wiki some days ago. So
for the future, it would be nice to have an automatically synchronising mirror…
And not only one on my personal laptop, but also for other interested users,
at least read-only.


Mirroring repositories with gitolite
-----------

The best solution for me was a mirror repository on my own server, which was
already running [gitolite][], the popular Git hosting solution. I would simply
add a read-only repository in gitolite and let a cron job take care of
automatic synchronisation.

[gitolite]: http://gitolite.com/gitolite/index.html "gitolite main page"

Creating the new repository was easy: you simply add a line to your
`gitolite.conf`, and when you push the changes, gitolite creates the repository
for you. In addition, I wanted to configure the MediaWiki remote directly
in my repository setup, for which I needed to specify the corresponding `remote`
options for the Git configuration. [The appropriate setting to allow
this][gitolite-config-keys] is in `.gitolite.rc` (gitolite’s main configuration
file, which resides in the gitolite base directory, `/home/git/` in my case):
you simply add the Git config options you want to set from `gitolite.conf`
to the `$GL_GITCONFIG_KEYS` variable. Mine now looks like this:

    $GL_GITCONFIG_KEYS = "remote\.* gitweb\.owner gitweb\.description";

[gitolite-config-keys]: http://gitolite.com/gitolite/g2/rc.html#rcsecurity "configuring gitolite's advanced features -- the .gitolite.rc file: variables with a security impact"

Now I could easily add the corresponding options to my repository setup:

    repo stratum0-wiki
        config gitweb.description = "Read-only Git mirror of the Stratum 0 wiki"
        config remote.origin.url = "mediawiki::https://stratum0.org/mediawiki"
        config remote.origin.fetch = "+refs/heads/*:refs/remotes/origin/*"
        config remote.origin.fetchstrategy = "by_rev"
        RW+ = rohieb
        R   = @all daemon gitweb

Note that I let Git-Mediawiki work with the `by_rev` fetch strategy, which
queries the MediaWiki API for all recent revisions rather than first looking for
changed pages and then fetching the revisions accordingly. This is more
efficient since I want to import every revision anyway. I also found out
the hard way (i.e. through print debugging) that adding the
`remote.origin.fetch` option is critical for Git-Mediawiki to work correctly.
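
For readers without gitolite: as far as I understand, the `config` lines above
simply end up as plain Git config entries in the bare repository, so the same
setup can be sketched by hand. The temporary directory below is only a
stand-in for the gitolite-managed repository:

```sh
#!/bin/sh
# Sketch only: a throwaway bare repository stands in for the gitolite-managed one.
mirror=$(mktemp -d)/stratum0-wiki.git
git init -q --bare "$mirror"

# The same keys that the gitolite.conf snippet sets.
git --git-dir="$mirror" config remote.origin.url 'mediawiki::https://stratum0.org/mediawiki'
git --git-dir="$mirror" config remote.origin.fetch '+refs/heads/*:refs/remotes/origin/*'
git --git-dir="$mirror" config remote.origin.fetchstrategy by_rev

# List the remote settings that are now stored in the repository.
git --git-dir="$mirror" config --get-regexp '^remote\.origin\.'
```

Running the last command against the real mirror is also a quick way to check
that gitolite actually applied the options after a push to the admin repo.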

Then I created a simple cron job for the `git` user (which owns all the
gitolite repositories) with `crontab -e` to update the mirror every 30
minutes:

    # m h dom mon dow command
    */30 * * * * /home/git/update-stratum0-mediawiki-mirror

The script which does all the work resides in
`/home/git/update-stratum0-mediawiki-mirror`:
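
[[!format sh <<EOF
#!/bin/sh
# Update the bare mirror repository from the wiki; meant to be run from cron.
if [ "$(id -un)" != "git" ]; then
    echo "fatal: run as user 'git'." >&2
    exit 1
fi

# Stop right here if the repository is missing.
cd /home/git/git/stratum0-wiki.git/ || exit 1

# Only let lines through that look like problems.
git fetch 2>&1 | grep -iE 'fatal|error|warn'
# Point master at the freshly fetched MediaWiki ref.
git update-ref refs/heads/master refs/mediawiki/origin/master
EOF]]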

Note that we cannot simply `git-merge` the master branch here, because the
gitolite repository is a bare repo and `git-merge` needs a working tree.
Therefore, we only fetch new revisions from our MediaWiki remote (which fetches
to `refs/mediawiki/origin/master`), and update the master branch manually. Since
the mirror is read-only and there are no real merges to be done, this is
sufficient here.
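
The `update-ref` step can be tried out without any wiki. The following sketch
assumes nothing but a working `git`: it builds a throwaway bare repository,
fakes the result of a fetch with a dummy commit (identity and commit message
are placeholders), and then moves `master` the same way the script does:

```sh
#!/bin/sh
# Sketch only: throwaway bare repository, no wiki or network involved.
bare=$(mktemp -d)/demo.git
git init -q --bare "$bare"

# Identity for the dummy commit (placeholder values).
GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org
export GIT_AUTHOR_NAME GIT_AUTHOR_EMAIL GIT_COMMITTER_NAME GIT_COMMITTER_EMAIL

# Create an empty tree and a commit without needing a working tree,
# and pretend a fetch stored it under refs/mediawiki/origin/master.
tree=$(git --git-dir="$bare" hash-object -t tree -w /dev/null)
commit=$(echo 'dummy wiki revision' | git --git-dir="$bare" commit-tree "$tree")
git --git-dir="$bare" update-ref refs/mediawiki/origin/master "$commit"

# What the mirror script does instead of merging: move master to that commit.
git --git-dir="$bare" update-ref refs/heads/master refs/mediawiki/origin/master

# Both refs should now print the same commit ID.
git --git-dir="$bare" rev-parse refs/heads/master refs/mediawiki/origin/master
```

Note that `update-ref` overwrites `master` unconditionally, which is only
acceptable because nobody commits to the mirror directly.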

So far, we have a fully working mirror. But since the Stratum 0 wiki has grown
to more than 7000 revisions to date, the initial fetch would take a while. To
reduce the load on the MediaWiki API, I figured that I could reuse my existing
repository on my laptop.


Re-using a previous Git-Mediawiki repo
--------------

So before activating the cron job, I pushed my existing repository to the mirror:

    ~/stratum0-wiki$ git push rohieb.name master
    ~/stratum0-wiki$ git push rohieb.name refs/mediawiki/origin/master

A test run of the mirror script, however, was not happy with that and wanted to
fetch ALL THE revisions anyway. So it took me another while to find out that for
efficiency reasons, Git-Mediawiki stores the corresponding MediaWiki revisions
in [Git notes][git-notes] under `refs/notes/origin/mediawiki`. For example:

    $ git log --notes=refs/notes/origin/mediawiki
    commit 7e486fa8a463ebdd177e92689e45f756c05d232f
    Author: Daniel Bohrer <Daniel Bohrer@stratum0.org/mediawiki>
    Date:   Sat Mar 15 14:42:09 2014 +0000

        /* Talks am Freitag, 14. März 2014, 19:00 */ format, youtube-links

    Notes (origin/mediawiki):
        mediawiki_revision: 7444

    [...]

So after I also pushed `refs/notes/origin/mediawiki` to the mirror repo,
everything was fine and the cron job only fetched a small number of new
revisions.
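
To see that a notes ref travels like any other ref, here is a small local
sketch. The two throwaway repositories stand in for my laptop and the mirror,
and the commit and note are dummies modelled on the log output above:

```sh
#!/bin/sh
# Sketch only: two throwaway repositories stand in for laptop and mirror.
work=$(mktemp -d)
git init -q "$work/laptop"
git init -q --bare "$work/mirror.git"
cd "$work/laptop" || exit 1
git symbolic-ref HEAD refs/heads/master

# Identity for the dummy commit and the note (placeholder values).
GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org
export GIT_AUTHOR_NAME GIT_AUTHOR_EMAIL GIT_COMMITTER_NAME GIT_COMMITTER_EMAIL

# One dummy commit plus a note shaped like the ones Git-Mediawiki writes.
git commit -q --allow-empty -m 'dummy wiki revision'
git notes --ref=refs/notes/origin/mediawiki add -m 'mediawiki_revision: 7444' HEAD

# Push the branch and the notes ref to the mirror.
git push -q "$work/mirror.git" master refs/notes/origin/mediawiki

# Read the note back on the mirror side.
git --git-dir="$work/mirror.git" notes --ref=refs/notes/origin/mediawiki show master
```

With the real repository, only the `git push … refs/notes/origin/mediawiki`
part is needed, in the same form as the two pushes shown earlier.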

[git-notes]: http://git-scm.com/docs/git-notes "git-notes(1) Manual Page"


Conclusion
---------

To conclude this post, I now have a working MediaWiki mirror for the Stratum 0
wiki, which uses a cron job and Git-Mediawiki to fetch new revisions every 30
minutes, integrated with gitolite for hosting. If you also want to keep track of
changes in the wiki and have an offline mirror for reference, feel free to pull
from [git://git.rohieb.name/stratum0-wiki.git][git-stratum0-wiki].

[git-stratum0-wiki]: http://git.rohieb.name/stratum0-wiki.git "gitweb: stratum0-wiki.git summary"

[[!tag Git gitify gitolite git-notes MediaWiki mirror]]