WebPath and Wikipedia

The WebPath bug reports continue to roll in. For one, queries against *.wikipedia.* don’t work: you get something back, but it bears no resemblance to the page you were looking for. The cause is the W3C tidy service that I use to convert pages to XML; the (understandably overworked and understaffed) admins at the Wikimedia Foundation seem to have blocked it. It seems like more than a simple IP- or user-agent-based block. I’ve emailed them about it but haven’t heard back yet.

This highlights the limitation of having a single-source converter in the Platonic Web module of WebPath, so I turn to my readers: do you know of any other tidy servers? Or converters of non-tidy origin? For any of these to work, they need to return clean XML corresponding to the original page (as opposed to, say, wrapping it in big headers/footers or returning it ampersand-encoded). This seems like an outstanding need for the open source community.
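To make the requirement concrete, here is a rough sketch of what a multi-source setup could look like: try each converter in turn and accept the first result that parses as XML with an html root. This is not how WebPath currently works. The names `is_clean_xml` and `convert_with_fallback` are hypothetical, and each converter is assumed to be a plain callable that takes a URL and returns the converted page as a string.

```python
import xml.etree.ElementTree as ET


def is_clean_xml(text):
    """Return True if text is well-formed XML whose root element is html."""
    try:
        root = ET.fromstring(text)
    except (ET.ParseError, TypeError, ValueError):
        # Not well-formed (e.g. ampersand-encoded markup) or not a string.
        return False
    # Drop a namespace prefix such as {http://www.w3.org/1999/xhtml}.
    local_name = root.tag.rsplit("}", 1)[-1]
    return local_name.lower() == "html"


def convert_with_fallback(url, converters):
    """Try each converter in order; return the first clean XML result.

    converters is a list of callables (hypothetical interface): each takes
    a URL and returns a string, or raises if the service is unavailable.
    """
    for convert in converters:
        try:
            result = convert(url)
        except Exception:
            # Service blocked, down, or otherwise failing: move on.
            continue
        if is_clean_xml(result):
            return result
    raise RuntimeError("no converter returned clean XML for %s" % url)
```

With something like this in place, a blocked service would only cost one failed attempt before the next converter is tried, and a converter that wraps the page in its own envelope or escapes the markup would be rejected by the root-element check.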

Please comment below with ideas. Thanks! -m

UPDATE: I heard back from the Wikipedia admins. Although they were professional and as helpful as could be expected, they won’t be changing anything on their end. I’m still looking for more open source options.

2 Responses to “WebPath and Wikipedia”

  1. Dom

    FWIW, the code underlying the W3C Online tidy service is available as open source:
    http://dev.w3.org/cvsweb/2000/tidy-svc/

    So it should be relatively straightforward to set up other tidy services.

  2. M. David Peterson http://xmlhacker.com/

    I use SgmlReader[1] coupled with a custom static extension function written in C#[2]. Of course this will only work if you use WebPath via IronPython or set up a separate .NET-based service on the server for processing HTML-to-XML requests, but given the use of an external service anyway this may be just fine. (The same code runs perfectly via Mono.)

    [1] http://wiki.opengarden.org/Community/SgmlReader_1.7.2
    [2] http://nuxleus.googlecode.com/svn/trunk/nuxleus/Source/Xameleon/Function/HttpSgmlToXml.cs