WebPath and Wikipedia
The WebPath bug reports continue to roll in. For one, queries against *.wikipedia.* don’t seem to work. You get something back, but it bears no resemblance to the page you were looking for. The problem lies with the W3C tidy service that I use: the (understandably overworked and understaffed) admins at the Wikimedia Foundation seem to have blocked it. It seems like more than a simple IP or user-agent-based block. I’ve emailed them about it but haven’t heard back yet.
This highlights the limitation of relying on a single converter in the Platonic Web module of WebPath. So I turn to my readers: do you know of any other tidy servers? Or converters of a non-tidy origin? For any of these to work, they need to return clean XML corresponding to the original page (as opposed to, say, returning something wrapped in big headers/footers, or ampersand-encoded markup). This seems like an outstanding need for the open source community.
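To make the requirement concrete, here is a minimal sketch of what a multi-source fallback could look like. It is not WebPath's actual code: the names `is_clean_xml` and `convert_with_fallback` are hypothetical, and each converter is assumed to be a plain callable that takes HTML text and returns a string. The well-formedness check rejects ampersand-encoded output, but it will not catch a service that wraps the page in its own headers and footers.

```python
import xml.etree.ElementTree as ET


def is_clean_xml(text):
    """Return True if text parses as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


def convert_with_fallback(html, converters):
    """Try each converter in order; return the first well-formed XML result.

    `converters` is a list of hypothetical callables (one per tidy service
    or local library), each taking HTML text and returning a string.
    """
    for convert in converters:
        try:
            candidate = convert(html)
        except Exception:
            # A blocked or unreachable service should not stop the chain.
            continue
        if isinstance(candidate, str) and is_clean_xml(candidate):
            return candidate
    raise ValueError("no converter produced well-formed XML")
```

With something like this in place, a blocked service only costs one failed attempt, provided at least one other converter in the list still works.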
Please comment below with ideas. Thanks! -m
UPDATE: I heard back from the Wikipedia admins. They were professional and as helpful as could be expected, but they won’t be changing anything on their end. I’m still looking for more open source options.
2 Replies to “WebPath and Wikipedia”
FWIW, the code underlying the W3C Online tidy service is available as open source:
So it should be relatively straightforward to set up other tidy services.
I use SgmlReader coupled with a custom static extension function written in C#. Of course, this will only work if you use WebPath via IronPython, or set up a separate .NET-based service on the server to process HTML-to-XML requests, but given that an external service is in use anyway, this may be just fine. (The same code runs perfectly via Mono.)