WebPath and Wikipedia


The WebPath bug reports continue to roll in. For one, queries against *.wikipedia.* don’t work. You get something back, but it bears no resemblance to the page you were looking for. The problem lies with the W3C tidy service that I use: the (understandably overworked and understaffed) admins at the Wikimedia Foundation seem to have blocked it, and it looks like more than a simple IP or user-agent-based block. I’ve emailed them about it but haven’t heard back yet.

This highlights the limitation of relying on a single-source converter in the Platonic Web module of WebPath. So I turn to my readers: do you know of any other tidy servers? Or converters of a non-tidy origin? For any of these to work, they need to return clean XML corresponding to the original page (as opposed to, say, returning something wrapped in big headers/footers, or with the markup ampersand-encoded). This seems like an outstanding need for the open source community.
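To make the requirement concrete, here is a rough Python sketch of what a multi-source setup could look like: try several converters in order and accept the first one whose output parses as well-formed XML. This is not WebPath's actual code. The names `is_clean_xml` and `convert_with_fallback` are made up for illustration, and each converter is assumed to be a plain callable that takes HTML text and returns XML text. Note that the check only covers well-formedness (which does reject ampersand-encoded output); it can't tell whether the result really corresponds to the original page.

```python
import xml.etree.ElementTree as ET


def is_clean_xml(text):
    """Return True if text is a string that parses as well-formed XML."""
    if not isinstance(text, str):
        return False
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # Covers unclosed tags, and also ampersand-encoded markup such as
        # "&lt;html&gt;", which has no root element at all.
        return False


def convert_with_fallback(html, converters):
    """Try each (name, callable) pair in order.

    Returns (name, xml) for the first converter whose output is
    well-formed XML. Raises ValueError if none of them succeed.
    The converters themselves are hypothetical placeholders.
    """
    for name, convert in converters:
        try:
            xml = convert(html)
        except Exception:
            # A blocked or unreachable service just means: try the next one.
            continue
        if is_clean_xml(xml):
            return name, xml
    raise ValueError("no converter produced well-formed XML")
```

With something like this in place, a blocked tidy service would simply be skipped in favor of whichever converter comes next in the list, provided there is a second converter to list.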

Please comment below with ideas. Thanks! -m

UPDATE: I heard back from the Wikipedia admins. Although they were professional and as helpful as can be expected, they won’t be changing anything on their end. I’m still looking for more open source options.

2 Replies to “WebPath and Wikipedia”

  1. I use SgmlReader[1] coupled with a custom static extension function written in C#[2]. Of course, this will only work if you use WebPath via IronPython or set up a separate .NET-based service on the server for processing HTML-to-XML requests, but given that an external service is used anyway, this may be just fine. (The same code runs perfectly via Mono.)

    [1] http://wiki.opengarden.org/Community/SgmlReader_1.7.2
    [2] http://nuxleus.googlecode.com/svn/trunk/nuxleus/Source/Xameleon/Function/HttpSgmlToXml.cs
