[outages] Wikipedia

George Herbert george.herbert at gmail.com
Fri Aug 3 14:57:47 EDT 2012


On Fri, Aug 3, 2012 at 12:51 AM, Robert Brockway
<robert at timetraveller.org> wrote:
> On Thu, 2 Aug 2012, George Herbert wrote:
>
>> I reported it on their internal/external tech list.  I was seeing the
>> outage for about 5-8 min, and it has been back up for the last
>> 5-ish.
>
>
> Several times over the last few years I've seen WP outages which turned out
> to be bad config pushed into production and then quickly reverted.  A few
> were patches to the MediaWiki software, for example.
>
> I guess they don't have a preprod/UAT environment :)  While I can understand
> them not being able to simulate the scale, a small UAT environment to test
> config rationality wouldn't go astray.
>
> I hear Wikipedia has a monitoring system.  It involves alerts issued by
> millions of people around the world :)
>
> Cheers,
>
> Rob

I know some of the ops folks and have talked about ops stability on
and off with the deputy director and VP of technology of the Wikimedia
Foundation.  I haven't consulted for them professionally per se, but I
have some information about their operations.

They do have a preprod environment, but there are limitations to it,
and the systems management process is not perfect.  They have been
focused over the last couple of years on stability and disaster
recovery, but with the user growth they see and budget envelope, it's
hard to make huge leaps ahead on stability while growing.

Frankly, most of the large commercial environments I have seen were
run worse, all things considered...


-- 
-george william herbert
george.herbert at gmail.com


