Archive for the 'intentional web' Category

Saturday, April 13th, 2013

Semantics!

This week marked the MarkLogic World conference and, with it, some exciting news. Without formally “announcing” a new release, the company showed off a great deal of semantic technology in progress. Part of that came from me, on stage during the Wednesday technical keynote. I’ve been at MarkLogic five years next month, and the first piece of code I wrote there was an RDFa parser. This has been a long time coming.

It was an amazing experience. I was responsible for sifting through the huge amounts of public data–both in RDF formats and on public web pages–and writing the semantic code to pull everything together, culminating in those ten minutes on stage.

Picture this: just behind the big stage and the projected screens was a hive of impressive activity. I counted 8 A/V people backstage, plus 4 more at the back of the auditorium. The conference has reached a level of production values that wouldn’t be vastly different if it were a stadium affair. So in back there’s a curtained-off “green room” with some higher-grade snacks (think PowerBars and Red Bull) and a flatscreen that shows the stage. From back there you can’t see the projected slides or demos, but if you step just outside, you’re at the reverse side of the screen, larger-than-life. The narrow walkway leads to the “chute”, right up the steps onto the main stage. As David Gorbet went through the opening moments of his talk in fine form, I did some stretches and did everything I could think of to prepare myself.

Then he called me up and the music blasted out from the speakers. I had been playing through my mind all the nightmare scenarios–tripping on the stairs and falling on my face as I come onstage (etc.)–but none of that happened. I’ve done public speaking many times before so I had an idea what to expect, though on a stage like that the lights are so bright that it’s hard to see beyond about the third row. So despite the 300-400 people in the room, it didn’t even feel much different than addressing an intimate group of peers. It was fun. On with the demos:

The first showed our internal MarkMail cluster with a simple ‘infobox’ of the sort that all the search engines are doing these days. This was an icebreaker to talk about semantics and how it works–in this case, locating the concept of Hadoop in the database, and from there finding all the related labels, abstracts, people, projects, releases, and so on. During the construction of the demo, we uncovered some real-world facts about the author of the top-ranked message for the query, including a book he wrote. The net effect was that these additional facts made the results a lot more useful by providing a broader context for them.

The second demo showed improved recall–that is, finding things that would otherwise slip under the radar. The existing [from:IBM] query in MarkMail does a good job finding people who happen to have the letters i-b-m in their email address. The semantic query [affiliation:IBM], in contrast, knows about the concept of IBM, the concept of people, and the relationship of is-affiliated-with (technically foaf:affiliation) to run a query that more closely models how a person would ask the question: “people that work for IBM” as opposed to “people that have i-b-m in their email address”. Thus the results included folks posting from gmail accounts and other personal addresses, and the result set jumped from about 277k messages to 280k messages.
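Roughly, the semantic side of such a query boils down to a SPARQL pattern along these lines (a simplified sketch, not the production query; the prefix binding and the foaf:name lookup are assumptions for illustration):

PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# Find people affiliated with IBM, not merely addresses containing "ibm"
SELECT ?person
WHERE {
  ?person a foaf:Person .
  ?person foaf:affiliation ?org .
  ?org foaf:name "IBM" .
}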

At this point, a pause to talk about the architecture underlying the technology. It turns out that a system that already supports shared-nothing scale-out, full ACID transactions, multiple HA/DR options, and a robust security model is a good starting point for building semantic capabilities. (I got so excited at this point that I forgot to use the clicker for a few beats and had to quickly catch up the slides.) Then: SPARQL code on the screen.

Then the third demo, a classic semantic app with a twist. Pulling together triples from several different public vocabularies, we answered the question of “find a Hadoop expert” with each row of the results representing not a document, as in MarkMail results, but an actual person. We showed location data (which was actually randomized to avoid privacy concerns) and aggregate cost-of-living data for each city. When we added in a search term, we drew histograms of MarkMail message traffic over time and skipped over the result that had no messages. The audience was entranced.

This is exciting work. I had several folks come up to me afterwards with words to the effect that they hadn’t realized it before, but boy, do they ever need semantics. I can’t think of a better barometer for a technical keynote. So back to work I go. There’s a lot to do.

Thanking by name is dangerous, because inevitably people get left out, but I would like to shout out to David Gorbet who ran the keynote, John Snelson who’s a co-conspirator in the development effort, Eric Bloch who helped with the MarkMail code more than anyone will ever know, Denis Shehan who was instrumental in wrangling the cloud and data, and Stephen Buxton who patiently and repeatedly offered feedback that helped sharpen the message.

I’ll post a pointer to the video when it’s available. -m

Sunday, March 31st, 2013

Introducing node-node:node.node

Naming is hard to do well, almost as hard as designing good software in the first place. Take, for instance, the term ‘node’, which depending on the context can mean:

  1. A fundamental unit of the DOM (Document Object Model) used in creating rich HTML5 applications.
  2. A basic unit of the Semantic Web–a thing you can say stuff about. Some nodes are even unlabeled, and hence ‘blank nodes’.
  3. In operations, a node means, roughly, a machine on the network, e.g. a “sixteen-node cluster”.
  4. A software library for event-driven, asynchronous development with JavaScript.

I find myself at the forefront of a growing chorus of software architects and API designers who are fed up with this overloading of a perfectly good term. So I’m happy today to announce node-node:node.node.

The system is still in pre-alpha, but it solves all of the most pressing problems that software developers routinely run into. In this framework, every node represents a node, for the ultimate in scalable distributed document storage. In addition, every node serves as a node, which provides just enough context to make open-world-assumption metadata assertions at node-node-level granularity. Using the power of Node, every node modeled as a node has instant access to other node-node:nodes. The network really is the computer. You may never write a program the old way again. Follow my progress on Sourceforge, the latest and most cutting-edge social code-sharing site. -m

Wednesday, January 26th, 2011

Explosive growth of RDFa

Some great data from my one-time colleague Peter Mika. Based on data culled from 12 billion web pages, RDFa appears on 3.5 percent of them, even after discounting “trivial” uses of it. Just look at how much that dark blue bar shot up since the last measurement, some 18 months earlier.

Also of note: eRDF has dropped off the map. hAtom and hReview are continuing their climb.

-m

Sunday, October 24th, 2010

Geek Thoughts: statistical argument against link shortener sustainability

I’ve seen lots of discussion for and against link shorteners, but not specifically this line of argument:

Let me grab a random shortened link from Twitter. Don’t go away, I’ll be right back.

http://bit.ly/b1fYi1

OK, that’s six characters in the domain, a slash, and six more characters. 50 years from now, if bit.ly is still in operation, the URLspace will be rather more crowded, and the part after the slash might be eight or nine characters. This is a significant cliff, since most people have trouble holding more than 6 or 7 things in their head at a time. Thus, one could conclude that 50 years from now, newly minted bit.ly URLs will be less fashionable than those from newer link-shortening services, particularly if more short TLDs come online, which seems likely. In that scenario, fewer and fewer people will use bit.ly, and it will become a resource-pit as costs go up (for more database storage, among other things) while usage drops, an economic trend that has only one eventual outcome: breaking all the external links that rely on this service.
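To put rough numbers on the crowding (assuming case-sensitive alphanumeric slugs, 62 possible characters per position):

62^6 = 56,800,235,584          (about 57 billion six-character slugs)
62^8 = 218,340,105,584,896     (about 218 trillion at eight characters)

Each added character multiplies the space by 62, but it also pushes the slug past the comfortable limits of short-term memory.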

I’ve been picking on bit.ly here, but the same principle applies to any shortener service. In fact, the more popular, the more quickly the URLspace will fill.

The moral: don’t use link shorteners for anything that needs to be more durable than something you’d scribble on a scrap of paper at your desk.

More collected Geek Thoughts at http://geekthoughts.info.

Thursday, September 2nd, 2010

Is XForms really MVC?

This epic posting on MVC helped me better understand the pattern, and all the variants that have flowed outward from the original design. One interesting observation is that the earlier designs used Views primarily as output-only, and Controllers primarily as input-only, and as a consequence the Controller was the one true path for getting data into the Model.

But with browser forms, input and output are tightly intermingled. The View takes care of input and output. Something else has primary responsibility for mediating the data flow to and from the model–and that something has been called a Presenter. This yields the MVP pattern.

The terminology gets confusing quickly, but roughly:

XForms Instance == MVP Model

XForms Model == MVP Presenter

XForms User Interface == MVP View
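In markup terms, a minimal sketch of how the pieces line up (the role comments are my gloss):

<xf:model>                                <!-- MVP Presenter -->
  <xf:instance>                           <!-- MVP Model: the data -->
    <data xmlns=""><temp>42</temp></data>
  </xf:instance>
  <xf:bind nodeset="temp" type="xsd:integer"/>
</xf:model>

<xf:input ref="temp">                     <!-- MVP View: input AND output -->
  <xf:label>Temperature</xf:label>
</xf:input>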

It’s not wrong to associate XForms with MVC–the term has become so blurry that it’s easy to lump variants like MVP into the same bucket. But to the extent that it makes sense to talk about more specific patterns, maybe we should be calling the XForms design pattern MVP instead of MVC. Comments? Criticism? Fire away below. -m

Wednesday, July 7th, 2010

Grokking Selenium

As the world of web apps gets more framework-y, I need to get up to speed on contemporary automation testing tools. One of the most popular ones right now is the open source Selenium project. From the look of it, that project is going through an awkward adolescent phase. For example:

  • Selenium IDE lets you record tests in a number of languages, but only HTML ones can be played back. For someone using only Selenium IDE, it’s a confusing array of choices for no apparent reason.
  • Selenium RC has bindings for lots of different languages but not for the HTML tests that are most useful in Selenium IDE. (Why not include the ability to simply play through an entire recorded script in one call, instead of fine-grained commands like selenium.key_press(input_id, 110), etc.? See the sketch after this list.)
  • The list of projects prominently mentions Selenium Core (a JavaScript implementation), but when you click through to the documentation, it’s not mentioned. Elsewhere on the site it’s spoken of in deprecating terms.
  • If you look at the developer wiki, all the recent attention is on Web Drivers, a new architecture for remote-controlling browsers, but those aren’t mentioned in the docs (yet) either.
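To illustrate the fine-grained style, here’s a sketch using the Python bindings (assuming a selenium-server already running on the default port; the page and locators are made up for illustration):

# Sketch only: driving a browser through Selenium RC, one command per call
from selenium import selenium

s = selenium("localhost", 4444, "*firefox", "http://example.com/")
s.start()                          # launches the browser session
s.open("/login")
s.type("username", "guest")
s.click("submit")
s.wait_for_page_to_load("30000")   # timeout in milliseconds
s.stop()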

So yeah, right now it’s awkward and confusing. The underlying architecture of the project is undergoing a tectonic shift, something that would never see the public light of day in a proprietary project. In the end it will come out leaner and meaner. What the project needs in the short term is more help from fresh outsiders who can visualize the desirable end state and help the ramped-up and productive developers on the project get there.

By the way, if this kind of problem seems interesting to you, let me know. We’re hiring. If you have any tips for getting up to speed in Selenium, comment below.

-m

Wednesday, June 9th, 2010

“Google syntax” for semantic queries?

Thought experiment: are there any commonly expressed semantic queries–the kind of queries you’d run over a triple store, or perhaps a SearchMonkey-annotated web site–that are expressible in the common type-in-a-searchbox query grammar?

As a refresher, here are some things that Google and other search engines can handle. The square brackets represent the search box into which the queries are typed, not part of the queries themselves.

[term]

[term -butnotthis]

[term1 OR term2]

["phrase term"]

[term1 OR term2 -“but not this” site:dubinko.info filetype:html]

So what kind of semantic queries would be usefully expressed in a similar way, avoiding SPARQL and the like? For example, maybe [by:"Micah Dubinko"] could map to a document containing a triple like <this document> <dc:author> “Micah Dubinko”. What other kinds of graph queries are interesting, common, and simple to express like this? Comments welcome.
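Under the hood, that hypothetical [by:"Micah Dubinko"] syntax might desugar into something like this (sticking with the dc:author predicate from the example, which stands in for whatever vocabulary a real system would choose):

# Hypothetical desugaring of [by:"Micah Dubinko"]
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT ?doc
WHERE {
  ?doc dc:author "Micah Dubinko" .
}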

-m

Sunday, May 30th, 2010

Balisage contest: solving the wikiml problem

I wish I could say I had something to do with the planning of this: part of Balisage 2010 is a contest to “encourage markup experts to review and to research the current state of wiki markup languages and to generate a proposal that serves to de-babelize the current state of affairs for the long haul.”  To enter, you must propose a set of concrete steps (organizational, social, and/or technological) that will enable wiki content interchange, a real WYSIWYG editor, and/or wiki syntax standardization.

This pushes all of my buttons. It’s got structured documents, Web, parser geekery, writing, engineering, and standards. There’s a bunch of open source prior art, including PyXMLWiki, which I adapted from some fantastic earlier work from Rick Jelliffe.

Sadly, MarkLogic employees aren’t eligible to enter. Get your write-up done by July 15 and sent to balisage-2010-contest at marklogic dot com. The winner will be announced at Balisage and will take home some serious prize winnings, and also will be strongly encouraged (but not required) to give a brief summary (~10 minutes) of their winning entry.

Can’t wait to see what comes out of this. -m

Friday, March 5th, 2010

A Hyperlink Offering revisited

The xml-dev mailing list has been discussing XLink 1.1, which after a long quiet period popped up as a “Proposed Recommendation”, which means that a largely procedural vote is all that stands between the document and full W3C Recommendation status. (The previous two revisions of the document date to 2008 and 2006, respectively.)

In 2005 I called continued development of XLink a “reanimated spectre”. But even earlier, in 2002, I wrote one of the rare fiction pieces on xml.com, A Hyperlink Offering, which, using the format of a Carrollian dialog between Tortoise and Achilles, explained a few of the problems with the XLink specification. It ended with this:

What if the W3C pushed for Working Groups to use a future XLink, just not XLink 1.0?

Indeed, this version has minor improvements. In particular, “simple” links are simpler now–you can drop an xlink:href attribute where you please and it’s now legit. The spec used to REQUIRE additional xlink:type=”simple” attributes all over the place. But it’s still awkward to use for multi-ended links, and now even farther away from the mainstream hyperlinking aspects of HTML5, which for all of its faults, embodies the grossly predominant description of linking on the web.
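Concretely, the simplification to simple links looks like this (the citation element is my own illustration):

<!-- XLink 1.0: xlink:type="simple" required -->
<citation xlink:type="simple" xlink:href="http://example.org/paper"/>

<!-- XLink 1.1: a bare xlink:href is now legit -->
<citation xlink:href="http://example.org/paper"/>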

So in many ways, my longstanding disappointment with XLink is that it only ever became a tiny sliver of what it could have been. Dashed visions of Xanadu dance through my head. -m

Sunday, November 22nd, 2009

How Xanadu Works: technical overview

One particular conversation I’ve overheard several times, often in the context of web and standards development, has always intrigued me. It goes something like this:

You know, Ted Nelson’s hypertext system from the ’60s had unbreakable, two-way links. It was elegant. But then along came Tim Berners-Lee and HTML, with its crappy, one-way, breakable links, and it took over the world.

The general moral of the story is usually about avoiding over-thinking problems and striving for simplicity. This has been rolling around in the back of my mind ever since the first time I heard the story. Is it an accurate assessment of reality? And how exactly did Nelson’s system, called Xanadu (R), manage the trick of unbreakable super-links? Even if the web ended up going in a different direction, there still might be lessons to learn for the current generation of people building things that run (and run on) the web.

Nelson’s book Literary Machines describes the system in some detail, but it’s hard to come by in the usual channels like Amazon, or even local bookstores. One place does have it, and for a reasonable price too: Eastgate Systems. [Disclosure: I bought mine from there for full price. I'm not getting anything for writing this post on my blog.] The book has a versioning notation, with 93.1 being the most recent, describing the “1993 design” of the software.

Pause for a moment and think about the history here. 1993 is 16 years ago as I write this, about the same span of time between Vannevar Bush’s groundbreaking 1945 article As We May Think (reprinted in full in Literary Machines) and Nelson’s initial work in 1960 on what would become the Xanadu project. As far as software projects go, this one has some serious history.

So how does it work? The basic concepts, in no particular order, are:

  • A heavier-weight publishing process: Other than inaccessible “privashed” (as opposed to “pub”lished) documents, once published, documents are forever, and can’t be deleted except in extraordinary circumstances and with some kind of waiting period.
  • All documents have a specific owner, are royalty-bearing, and work through a micropayment system. Anyone can quote, transclude, or modify any amount of anything, with the payments sorting themselves out accordingly.
  • Software called a “front end” (today we’d call it a “browser”) works on behalf of the user to navigate the network and render documents.
  • Published documents can be updated at will, in which case unchanged pieces can remain unchanged, with inserted and deleted sections in between. Thus, across the history of a document, there are implicit links forward and backward in time through all the various editions and alternatives.
  • In general, links can jump to a new location in the docuverse or transclude part of a remote document into another, among many other configurations, including multi-ended links; they are granular to the character level, and attached to particular characters.
  • Document and network addressing are accomplished through a clever numbering system (somewhat reminiscent of organic versioning, but in a way infinitely extensible on multiple axes). These addresses, called tumblers, represent a Node+User+Document+Subdocument, and a minor variant to the syntax can express ranges between two points therein.
  • The system uses its own protocol called FEBE (Front End Back End), which defines several verbs, including, on page 4/61: RETRIEVEV (like HTTP GET), DELETEVSPAN, MAKELINK, FINDNUMOFLINKSTOTHREE, FINDLINKSFROMTOTHREE, and FINDDOCSCONTAINING. [Note that “three” in this context is an unusual notation for a link type.] Maybe 10 more verbs are defined in total.

A few common themes emerge. One is the grandiose scope: this really is intended as a system to encompass all of literature past, present, and future, and to thereby create a culture of intellect and reshape civilization. “We think that anyone who actually understands the problems will recognize our approach as the unique solution.” (italics from original, 1993 preface)

Another theme is simple solutions to incredibly difficult problems. So the basic solution to unbreakable links is to never change documents. Sometimes these solutions work brilliantly, sometimes they fall short, and many times they end up somewhere in between. In terms of sheer vision, nobody else has come close to inspiring as many people working on the web. Descriptions of what today we’d call a browser would sound familiar, if a bit abstract, even to casual users of Firefox or IE.

Nothing like REST seems to have occurred to Nelson or his associates. It’s unclear how widely deployed Xanadu prototypes ever were, or how many nodes were ever online at any point. The set of verbs in the FEBE protocol reads like what a competent engineer would come up with. The benefits of REST, in particular of minimizing verbs and maximizing nouns, are non-obvious without a significant amount of web-scale experience.

Likewise, Creative Commons seems like something the designers never contemplated. “Ancient documents, no longer having a current owner, are considered to be owned by the system–or preferably by some high-minded literary body that oversees their royalties.” (page 2/29) While this sounds eerily like the Google Books settlement, it misses the implications of truly free-as-in-beer content, and equally misses the power of free-as-in-freedom documents. In terms of social impact there’s a huge difference between something that costs $0 and $0.000001.

In this system anyone can include any amount of any published document into their own without special permission. In a world where people writing Harry Potter Lexicons are getting sued by the copyright industry, it’s hard to imagine this coming to pass without kicking and screaming, but it is a nice world to think about. Anyway, in Xanadu per-byte royalties work themselves out according to the proportion of original vs. transcluded bytes.

Where is Google in this picture? “Two system directories, maintained by the system itself, are anticipated: author and title, no more” (page 2/49). For additional directories or search engines, it’s not clear how that would work: is a search-results page a published or privashed document? Does every possible older version of every result page stick around in the system? (If not, links to and from it might break.) It’s part of a bigger question about how to represent and handle dynamic documents in the system.

On privacy: “The network will not, may not monitor what is written in private documents.” (page 2/59) A whole section in chapter 3 deals with these kinds of issues, as does Computer Lib, another of Nelson’s works.

He was early to recognize the framing problem: how, in a tangle of interlinked documents, to make sense of what’s there, and to discern between useful and extraneous chunks. Nelson admits to no general solution, but points at some promising directions, one of which is link typing–the more information there is on individual links, the more handles there are for making sense of the tangle. Some tentative link types include title, author, supersession, correction, comment, counterpart, translation, heading, paragraph, quote, footnote, jump-link, modal jump-link, suggested threading, expansion, citation, alternative version, comment, certification, and mail.

At several points, Nelson mentions algorithmic work that makes the system possible. Page 1/36 states “Our enfilade data structures and methods effectively refute Donald Knuth’s list of desirable features that he says you can’t have all at once (in his book Fundamental Algorithms: Sorting and Searching)”. I’m curious if anyone knows more about this, or if Knuth ever got to know enough details to verify that claim, or revise his own.

So was the opening anecdote a valid description of reality? I have to say no, it’s not that simple. Nelson rightly calls the web a shallow imitation of his grand ideas, but those ideas are–in some ways literally–from a different world. It’s not a question of “if only things had unfolded a bit differently…”. To put it even more strongly, a system with that kind of scope cannot be designed all at once; in order to be embraced by the real world, it has to be developed with a feedback loop to the real world. This in no way diminishes the value and influence of big ideas or the place that Roarkian stick-to-your-gunnedness has in our world, industry, and society. We may have gotten ourselves into a mess with the architecture of the present web, but even so, Nelson’s vision will keep us aspiring toward something better.

I intend to return to this posting and update it for accuracy as my understanding improves. Some additional topics to maybe address: a more detailed linking example (page 2/45), comparing XLink to Xanadu, comparing URIs and tumblers, and the bizarre (and yet oddly familiar, if you’ve ever been inside a FedEx Kinkos) notion of “SilverStands”.

For more on Nelson, there is the epic writeup in Wired. YouTube has some good stuff too.

Comments are welcome. -m

Xanadu is a registered trademark, here used for specific identifying purpose.

Tuesday, September 22nd, 2009

XForms Developer Zone

Another XForms site launched this week. This one seems pretty close to what I would like XForms Institute to become, if I had an extra 10 hours per week. -m

Wednesday, July 29th, 2009

Object-Oriented CSS

I enjoyed Nicole Sullivan‘s talk at the BayJax Meetup on Object-Oriented CSS, something I hadn’t run into before. Adding predictability to CSS development seems like a huge win. I need to wrap my head around it better. Anyone with experience using this technique care to comment? -m

Friday, July 24th, 2009

Java-style namespaces for markup

I’m noodling around with requirements and exploring existing work toward a solution for “decentralized extensibility” on xml-dev, particularly for HTML. The notion of “Java-style” syntax, with reverse-DNS names and all, has come up many times in the context of these kinds of discussions, but AFAICT has never been fully fleshed out. This is ongoing, slowly, in available time–which has been a post or two per week. (In case there is any doubt, this is a spare-time effort not connected with my employer.)

Check it out and add your knowledge to the thread. -m

Thursday, July 2nd, 2009

And then there were one…

On May 8 I wrote:

it’s time for the W3C to show some tough love and force the two (X)HTML Working Groups together.

On July 2, the W3C wrote:

Today the Director announces that when the XHTML 2 Working Group charter expires as scheduled at the end of 2009, the charter will not be renewed. By doing so, and by increasing resources in the Working Group, W3C hopes to accelerate the progress of HTML 5 and clarify W3C’s position regarding the future of HTML.

The real test is whether the single HTML Working Group can be held to the standard of other Working Groups, and whether it can recruit some much-needed editorial help from some of the displaced XHTML 2 gang. -m

Tuesday, June 23rd, 2009

RDFa List Apart

A great introduction article. Maybe it’s just the crowd I hang with, but RDFa looks like it’s moving from trendy to serious tooling. -m

Friday, June 19th, 2009

VoCamp Wrap-up

I spent 2 days at the Yahoo! campus at a VoCamp event, my first. Initially, I was dismayed at the schedule. Spend all the time the first day figuring out why everybody came? It seemed inefficient. But having gone through it, the process seems productive, exactly the way that completely decentralized groups need to get things done. Peter Mika did a great job moderating.

Attendees numbered about 35, and came from widely varying backgrounds from librarian to linguist to professor to student to CTO, though uniformly geeky. With SemTech this week, the timing was right, and the number of international attendees was impressive.

In community development, nothing gets completely decided just because a few people met. But progress happens. The first day was largely exploratory, but also covered plenary topics that nearly everyone was interested in. Namely:

  • Finding, choosing, and knowing when to create vocabularies
  • Mapping from one vocabulary to another
  • RDBMS to RDF mapping

Much of the shared understanding of these discussions is captured on various wiki pages connected to the one at the top of this article.

For day 2, we split into smaller working groups with more focused topics. I sat in on a discussion of Common Tag (which still feels too complex to me, but does fulfill a richer use case than rel-tag). Next, some vocabulary design, planning a microformat (and eventual RDF vocab) to represent code documentation: classes, functions, parameters, and the like. Tantek Çelik espoused the “scientific method” of vocab design: would a separate group, in similar circumstances, come up with the same design? If the answer is ‘yes’, then you probably designed it right. The way to make that happen is to focus on the basics, keeping everything as simple as possible. If any important features are missed, you will find out quickly. The experience of getting the simple thing out the door will provide the education needed to make the more complicated follow-on version a success.

From the wrap-up: if you are designing a vocabulary, the most useful thing you can do is NOT to unleash a fully-formed proposal on the world, but rather to capture the discussion around it. What were the initial use cases? What are people currently doing? What design goals were explicitly left off the table, deferred to a future version, or immediately shot down? It’s better to capture multiple proposals, even if fragmentary, and let lots of people look them over and gravitate toward the best design.

Lastly, some cool things overheard:

“Relational databases? We call those ‘legacy’.”

“The socially-accepted schema is fairly consistent.”

“It’s just a map, it’s not the territory.”

-m

Tuesday, May 12th, 2009

Google Rich Snippets powered by RDFa

The new feature called rich snippets shows that SearchMonkey has caught the eye of the 800 pound gorilla. Many of the same microformats and RDF vocabularies are supported. It seems increasingly inevitable that RDFa will catch on, no matter what the HTML5 group thinks. -m

Friday, May 8th, 2009

HTML: The Markup Language marks a new beginning

If you haven’t already, check out HTML: The Markup Language. Besides being a cool new recursive acronym for HTML, it is a reasonably sane document. Also worth a look: Differences between HTML4 and HTML5. Many of the ideas from XHTML 2 (of which I was an editor at one point) are there.

I think it’s time for the W3C to show some tough love and force the two (X)HTML Working Groups together.

A while ago, I argued that the existence of both Flickr and Yahoo! Photos was an effective two-pronged strategy. Look how that worked out–Y! Photos is permanently shuttered. While there were benefits, including a broader potential reach, in aggregate the benefits didn’t amount to more than the immense cost of having two parallel efforts. Same here. -m

Sunday, May 3rd, 2009

Playing with Wolfram Alpha

I’ve been experimenting with the preview version of Wolfram Alpha. It’s not like any current search engine because it’s not a search engine at all. Others have already written more eloquent things about it.

The key feature of it is that it doesn’t just find information, it infers it on the fly. Take for example the query

next solar eclipse in Sunnyvale

AFAIK, nobody has ever written a regular web page describing this important (to me) topic. Try it in Yahoo! or Google and see for yourself. There are a few potentially interesting links based on the abstracts, but they turn out to be spammy. Wolfram Alpha figures out that I’m talking about the combination of a concept (“solar eclipse”) and a place (“Sunnyvale, CA”, but with an offer to switch to Sunnyvale, TX) and combines the two. The result is a simple answer–4:52 pm PDT | Sunday, May 20, 2012 (3.049 years from now). Hey, that’s sooner than I thought! Besides the date, there are many related facts and a cool map.

This is in contrast to SearchMonkey, which I helped create, in two main areas:

  1. Wolfram Alpha uses metadata to produce the result, then renders it through a set of pre-arranged renderers. The response is facts, not web pages.
  2. SearchMonkey focuses on sites providing their own metadata, while Wolfram Alpha focuses on hand-curation.

Search engines have been striving to do a better job at fact-queries. Wolfram Alpha shows that an approach disjoint from finding web pages in an index can be hugely useful.

The engineers working on this have a sense of humor too. The query

1.21GW

returns a page that includes the text “power required to operate the flux capacitor in the DeLorean DMC-12 time machine” as well as a useful comparison (~ 0.1 x the power of space shuttle at launch).

Yahoo! and Google do various kinds of internal “query rewriting”, but usually don’t let you know other than in the broadest terms (“did you mean …”). Wolfram Alpha shows a diagram of what it understood the query to be. The diagrams make it evident that something like the RDF model is in use, but without peeking under the hood, it’s hard to say something definitive.

One thing I wonder about is whether Wolfram Alpha creates a dynamic (as was a major goal of SearchMonkey) of giving web authors a reason to put more metadata in their sites–a killer app, if you will. It’s not clear at this early date how much web crawling or site metadata extraction (say, RDFa) plays into the curation process.

In any case Wolfram Alpha is something to watch. It’s set to launch publicly this month. -m

Sunday, March 8th, 2009

Wolfram Alpha

The remarkable (and prolific) Stephen Wolfram has an idea called Wolfram Alpha. People used to assume the “Star Trek” model of computers:

that one would be able to ask a computer any factual question, and have it compute the answer.

Which has proved to be quite distant from reality. Instead

But armed with Mathematica and NKS [A New Kind of Science] I realized there’s another way: explicitly implement methods and models, as algorithms, and explicitly curate all data so that it is immediately computable.

It’s not easy to do this. Every different kind of method and model—and data—has its own special features and character. But with a mixture of Mathematica and NKS automation, and a lot of human experts, I’m happy to say that we’ve gotten a very long way.

I’m still a SearchMonkey guy at heart, so I wonder how much Wolfram’s team is familiar with existing Semantic Web research and practice–because at a high level this seems very much like RDF with suitable queries thereupon. If that’s a good characterization, that’s A Good Thing, since practical application has been one of SemWeb’s weak spots.

-m

Saturday, January 10th, 2009

Defining the Prime RDFa use case (without mentioning RDFa)

At least, that’s how I’ve summarized John Allsopp’s article on HTML5 semantics. -m

Friday, December 19th, 2008

XSLTForms looks promising

Implementing client-side forms libraries is, and has been, all the rage. I’ve seen Mozquito Factory do amazing things in Netscape 4, Technical Pursuits TIBET on the perpetual verge of release, UGO, and others. In a more recent time scale, Ubiquity XForms impresses me and many others, and it has the right combination of funding and willing developers.

From a comment on my recent posting about Ubiquity XForms, I was pleased to learn about XSLTforms, a rebirth of AjaxForms, which I thought well of two years ago until its developer mysteriously left the project. But Software Libre lives on, and a new developer has taken over, this time using client-side XSLT instead of server-side Java to do the first pass of processing. Given the strong foundation, the project has come a long way in a short time, and already runs against a wide array of non-trivial examples. Check it out.

I’d like to hear what others think about this project. -m

Tuesday, December 9th, 2008

XML 2008 liveblog: Using RDFa for Government Information

Mark Birbeck, Web Backplane.

Problem statement: You shouldn’t have to “scrape” government sites.

Solution: RDFa

<div typeof="arg:Vacancy">
  Job title: <span property="dc:title">Assistant Officer</span>
  Description: <span property="dc:description">To analyse... </span>
</div>

This resolves to two full RDF triples. No separate feeds, uses existing publishing systems. Two of the most ambitious RDFa projects are taking place in the UK. Flexible arrangements possible.
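Spelled out (my reconstruction, with a blank node as subject; the typeof attribute also contributes the rdf:type triple):

_:vacancy rdf:type arg:Vacancy .
_:vacancy dc:title "Assistant Officer" .
_:vacancy dc:description "To analyse... " .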

Steps: 1. Create vocabulary. 2. Create demo. 3. Evangelize.

Vocabulary under Google Code: Argot Hub. Reuse terms (dc:title, foaf:name) where possible, developed in public.

Demos: Yahoo! SearchMonkey, (good for helping not-so-technical people to “get it”) then a Drupal hosted one (a little more control).

Next level: a new server that aggregates specific info (like all job openings for Electricians), including geocoding. Ubiquity RDFa helps here.

Evangelizing: Detailed tutorials. Drupal code will go open source. More opportunities with companies currently screen-scraping. More info @ rdfa.info.

Q&A: Asking about predicate overloading (dc:title). A general SemWeb issue. Context helps. Is RDFa tied to HTML? No, SearchMonkey itself uses RDFa–it’s just attributes.

-m

Tuesday, December 9th, 2008

XML 2008 liveblog: Sentiment Analysis in Open Source Information for the US Government

Ronald Reck, SAP; Kenneth Sall, SAIC

“I wish I knew when people were saying bad things about me.” Sentiment analysis. Kapow used initially. From 800k news articles (from 1996 and 1997), extracted 450M RDF assertions. The 13 Reuters standard metadata elements not used in this case. Used Redland for heavy RDF lifting. Inxight ThingFinder (commercial) for entity extraction, supplemented with enumerated lists (Bush Cabinet, Intelligence Agencies, negative adjectives, positive admire verbs, etc.). End result was RDF/XML.

(Kenneth takes the mic) SPARQL Sentiment Query Web UI. Heavy SPARQL ahead… Redland hasn’t implemented the UNION operator yet, making the examples more convoluted.

PREFIX sap: <http://iama.rrecktek.com/ont/sap#>
SELECT ?ent ?type ?name
WHERE {
?ent sap:Method "Name Catalog" .
?ent sap:Type ?type .
?ent sap:Name ?name
}

Difficult learning curve. Need ability to do substring from entity URI -> article URI.

Next steps: current news stories. Leverage existing metadata. RDF at the sentence level. Improve name catalogs. Use rule-based pattern matching engine. Slides.

-m

Monday, December 8th, 2008

XML 2008 liveblog: Ubiquity XForms

I will talk about one or more sessions from XML 2008 here.

Mark Birbeck of Web Backplane talking about Ubiquity XForms.

Browsers are slow to adopt new standards. Ajax libraries have attempted to work around this. Lots of experimentation, which is both good and bad, but at least it has legitimized extensions to browsers. JavaScript is the assembly language of the web.

Ubiquity XForms is part of a library which will also include RDFa and SMIL. Initially based on YUI, but in theory should be adaptable to other libraries like jQuery.

Declarative: tools for creation and validation. Easier to read. Ajax libraries are approaching the level of being their own language anyway, so might as well take advantage of a standard.

Example: setting the “inner value” of a span: <span value="now()"></span>.

Script can do this easily: onclick="this.innerHTML = Date().toLocaleString();" But that crosses the line from semantics to specific behavior. The previous one is exactly how xforms:output works.

Another example: tooltips. Breaks down to onmouseover and onmouseout event handlers, show and hide. A jQuery-like approach can search the document for all tooltip elements and add the needed handlers, avoiding explicit behavioral code. This is the essence of Ubiquity XForms (and in fact of XForms itself).

Patterns like these compose under XForms. A button (xf:trigger) or any form control can easily have a tooltip (xf:hint). These are all regular elements, stylable with CSS, accessible via DOM, and so forth. Specific events (like xforms-hint) fire at the appropriate moments, and a spreadsheet-like engine updates interdependencies.
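For example (a minimal sketch):

<xf:trigger>
  <xf:label>Save</xf:label>
  <xf:hint>Saves the current document</xf:hint>
</xf:trigger>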

Question: Is this client-side? A: Yes, all running within Firefox. The entire presentation is one XForms document.

Demo: a range control with class=”geolocation” that displays as a map w/ Google Maps integration. The Ubiquity XForms library contains many such extensibility points.

Summary: Why? Simple, declarative. Not a programming language. Speeds up development. Validatable. Link: ubiquity.googlecode.com.

Q&A: Rich text? Not yet, but not hard (especially with YUI). Formally XForms compliant? Very nearly 1.1 conforming.

-m

Thursday, October 30th, 2008

XiX (XForms in XQuery)

I’m pondering implementing the computational parts of the XForms Model in XQuery. Doing so in a largely functional environment poses some challenges, though. Has anybody tackled this before? How about in any functional language, including ML, Haskell, Scheme, XSLT, or careful Python?
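To make the challenge concrete, here’s the flavor of it in XQuery (a toy sketch; a real implementation needs a dependency graph over arbitrary bind expressions):

(: One functional "recalculation" pass: since nodes can't be mutated
   in place, rebuild the instance with the computed value filled in. :)
declare function local:recalc($data as element(data)) as element(data) {
  <data>
    { $data/price, $data/qty }
    <total>{ $data/price * $data/qty }</total>
  </data>
};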

I borrowed the book Purely Functional Data Structures from a friend–this looks to be a good start. What else is out there? Comment below. -m

Thursday, October 23rd, 2008

RDFa is a Recommendation

Haven’t mentioned here that RDFa is a W3C Recommendation. I’m thrilled that something that I’ve been thinking about for a while is ready for prime time.

Also, as of this writing the first page of results at Google still prominently links to a terribly outdated draft of the spec. The first page of results at Yahoo! nails it. Just sayin’.

-m

Thursday, August 7th, 2008

Great comment on the eRDF 1.1 discussion

On the eRDF discussion posting, Toby Inkster, an implementer of eRDF, talks about why it’s bad to steal the id attribute, and why RDFa is better suited for general purpose metadata. Worth a read. -m

Monday, July 28th, 2008

eRDF 1.1 Proposal Discussion

The W3C RDFa specification is now in Candidate Recommendation phase, with an explicit call for implementations (of which there are several). Momentum for RDFa is steadily building. What about eRDF, which favors the existing HTML syntax over new attributes?

There’s still a place for a simpler syntactic approach to embedding RDF in HTML, as evidenced by projects like Yahoo! SearchMonkey. And eRDF is still the only game in town when it comes to annotating RDF within HTML-without-the-X.

One thing the RDFa folks did was define src as a subject-bearing node, rather than an object. At first I didn’t like this inversion, but the more I worked with it, the more it made sense. When you have an image, which can’t have children in (X)HTML, it’s very often useful to use the src URL as the subject, with a predicate of perhaps cc:license.

So I propose one single change to eRDF 1.1. Well, actually several changes, since one thing leads to another. The first is to specify that you are using a different version of eRDF. A new profile string of:

"http://purl.org/NET/erdf11/profile"

The next is changing the meaning of a src value to be a subject, not an object. Perhaps swapping the subject and object. Many existing uses of eRDF involving src already involve properties with readily available inverses. For example:

<!-- eRDF 1.0 -->
<img class="foaf.depiction" src="http://example.org/picture" />

<!-- eRDF 1.1 -->
<img src="http://example.org/picture" class="foaf.depicts" />
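In triple terms (with <> standing for the page containing the markup), the inversion amounts to:

# eRDF 1.0: the page is the subject, the src URL the object
<> foaf:depiction <http://example.org/picture> .

# eRDF 1.1: the src URL becomes the subject
<http://example.org/picture> foaf:depicts <> .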

With the inherent limitations of existing syntax, the use case of having a full image URL and a license URL won’t happen. But XHTML2, as well as an HTML5 proposal, suggests that adding href to many more elements might come to pass. In which case this possibility opens:

<img src="http://example.org/picture" class="cc.license"
href="http://creativecommons.org/licenses/by/2.0/" />

Comments? -m

Monday, July 21st, 2008

Review: Web 2.0: A Strategy Guide

Actually, instead of a review, let me quote the opening testimonial from the inside-front cover.

Competing globally with dynamic capabilities is the top priority of multinational executives and managers everywhere. Rethinking strategy in a highly networked world is the big challenge. How can your company navigate successfully in this turbulent, highly networked and socially connected environment? …

If this does it for you, I couldn’t recommend this book more highly. -m