Archive for the 'search' Category

Sunday, August 22nd, 2010

Eulogy for SearchMonkey

This is indeed a sad day for all of us, for on October 1, a great app will be gone. Though we hardly had enough time during his short life to get to know him, like the grass that withers and fades, this monkey will finish his earthly course.

Updated SearchMonkey logo

Photo by Micah

I know he left many things undone, for example only enhancing 60% of the delivered result pages. He never got a chance to finish his life’s ambition of promoting RDFa and microformats to the masses or to be the killer app of the (lower-case) semantic web. You could say he will live on as “some of this structured data processing will be supported natively by the Microsoft platform”. Part of the monkey we loved will live on as enhanced results continue to flow forth from the Yahoo/Bing alliance.

The SearchMonkey Alumni group on LinkedIn is filled with wonderful mourners. Micah Alpern wrote there:

I miss the team, the songs, and the aspiration to solve a hard problem. Everything else is just code.

Isaac Asimov was reported to have said “If my doctor told me I had only six minutes to live, I wouldn’t brood. I’d type a little faster.” Today we can identify with that sentiment. Keep typing.


Saturday, June 12th, 2010

Command Lines on the frontier of user interface

This came from a comment on the prior post, and it’s worth a shout of its own. Don Norman on the importance of command lines, including the ubiquitous search box, in modern UI. -m

Wednesday, June 9th, 2010

“Google syntax” for semantic queries?

Thought experiment: are there any commonly-expressed semantic queries–the kind of queries you’d run over a triple store, or perhaps a SearchMonkey-annotated web site–expressible in common type-in-a-searchbox query grammar?

As a refresher, here’s some things that Google and other search engines can handle. The square brackets represent the search box into which the queries are typed, not part of the queries themselves.


[term -butnotthis]

[term1 OR term2]

[“phrase term”]

[term1 OR term2 -“but not this” filetype:html]

So what kind of semantic queries would be usefully expressed in a similar way, avoiding SPARQL and the like? For example, maybe [by:”Micah Dubinko”] could map to a document containing a triple like <this document> <dc:author> “Micah Dubinko”. What other kinds of graph queries are interesting, common, and simple to express like this? Comments welcome.
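To make the thought experiment a bit more concrete, here is a minimal sketch in Python of how a search-box query might be split into plain keywords, exclusions, and triple patterns. Everything named here (parse_query, PREFIX_TO_PREDICATE, the ?doc placeholder) is made up for illustration; only the by: to dc:author mapping comes from the example above. It uses straight quotes, ignores OR, and is in no way how any real search engine parses queries.

```python
import re

# Hypothetical mapping from search-box prefixes to RDF predicates.
# Only "by" -> dc:author comes from the post; anything else is up to you.
PREFIX_TO_PREDICATE = {"by": "dc:author"}

# One token: optional "-", optional "prefix:", then a quoted phrase or a bare word.
TOKEN = re.compile(r'(-)?(?:(\w+):)?(?:"([^"]*)"|(\S+))')


def parse_query(query):
    """Split a query into keyword terms, exclusions, and triple patterns."""
    result = {"terms": [], "excluded": [], "triples": []}
    for negated, prefix, phrase, word in TOKEN.findall(query):
        value = phrase or word
        if not value:
            continue  # an empty quoted phrase carries no information
        if prefix in PREFIX_TO_PREDICATE and not negated:
            # A known prefix becomes a pattern over an unnamed document.
            result["triples"].append(("?doc", PREFIX_TO_PREDICATE[prefix], value))
        elif negated:
            result["excluded"].append(f"{prefix}:{value}" if prefix else value)
        else:
            # Unknown prefixes (such as filetype:) stay ordinary keywords.
            result["terms"].append(f"{prefix}:{value}" if prefix else value)
    return result


parsed = parse_query('semantic -spam by:"Micah Dubinko"')
print(parsed["triples"])
```

The nice property of this shape is that a query with no recognized prefixes degrades to an ordinary keyword search, so the semantic part is purely additive.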


Wednesday, October 7th, 2009

US Federal Register in XML

Fed Thread is a front end for the newly XMLified Federal Register. Why is this a big deal? It’s a daily publication of the goings-on of the US government. It’s a primary source for all kinds of things that normally only get rehashed through news organizations. And it is bulky–nobody can read through it on a regular basis. A yearly subscription (printed) would cost nearly $1000 and fill over 80,000 pages.

Having it in XML enables all kinds of searching, syndication, and annotation via flexible front ends like this one. Yay for transparency. -m

Tuesday, May 12th, 2009

Google Rich Snippets powered by RDFa

The new feature called rich snippets shows that SearchMonkey has caught the eye of the 800 pound gorilla. Many of the same microformats and RDF vocabularies are supported. It seems increasingly inevitable that RDFa will catch on, no matter what the HTML5 group thinks. -m

Sunday, May 3rd, 2009

Playing with Wolfram Alpha

I’ve been experimenting with the preview version of Wolfram Alpha. It’s not like any current search engine because it’s not a search engine at all. Others have already written more eloquent things about it.

The key feature of it is that it doesn’t just find information, it infers it on the fly. Take for example the query

next solar eclipse in Sunnyvale

AFAIK, nobody has ever written a regular web page describing this important (to me) topic. Try it in Yahoo! or Google and see for yourself. There are a few potentially interesting links based on the abstracts, but they turn out to be spammy. Wolfram Alpha figures out that I’m talking about the combination of a concept (“solar eclipse”) and a place (“Sunnyvale, CA”, but with an offer to switch to Sunnyvale, TX) and combines the two. The result is a simple answer–4:52 pm PDT | Sunday, May 20, 2012 (3.049 years from now). Hey, that’s sooner than I thought! Besides the date, there are many related facts and a cool map.

This is in contrast to SearchMonkey, which I helped create, in two main areas:

  1. Wolfram Alpha uses metadata to produce the result, then renders it through a set of pre-arranged renderers. The response is facts, not web pages.
  2. SearchMonkey focuses on sites providing their own metadata, while Wolfram Alpha focuses on hand-curation.

Search engines have been striving to do a better job at fact-queries. Wolfram Alpha shows that an approach disjoint from finding web pages in an index can be hugely useful.

The engineers working on this have a sense of humor too. The query


returns a page that includes the text “power required to operate the flux capacitor in the DeLorean DMC-12 time machine” as well as a useful comparison (~ 0.1 x the power of space shuttle at launch).

Yahoo! and Google do various kinds of internal “query rewriting”, but usually don’t let you know other than in the broadest terms (“did you mean …”). Wolfram Alpha shows a diagram of what it understood the query to be. The diagrams make it evident that something like the RDF model is in use, but without peeking under the hood, it’s hard to say something definitive.

One thing I wonder about is whether Wolfram Alpha creates the dynamic (a major goal of SearchMonkey) of giving web authors a reason to put more metadata in their sites–a killer app if you will. It’s not clear at this early date how much web crawling or site metadata extraction (say RDFa) plays into the curation process.

In any case Wolfram Alpha is something to watch. It’s set to launch publicly this month. -m

Friday, October 24th, 2008

Online etymology database

I’ve been playing lately with this site, and it’s a fantastic resource. The word carboy probably comes from Persian qarabah “large flagon.” Who knew? -m

Thursday, May 1st, 2008


Today happens to mark the 6th anniversary of my blog. To celebrate going into year seven I’m refocusing it, including a new name: Micahpedia.

Blogging is an important skill, a subset of the overall skill of managing your online persona, so it’s worth devoting some attention to. The ego-burst doesn’t hurt either. My concrete goal is to get in the top 10 search results for the query [Micah], though I face some stiff competition including the prophet.

From an SEO perspective, “Push Button Paradise” wasn’t the greatest choice of name. It suffers from the common SEO mistake of being an excessively clever and/or cute reflection of what I happened to be working on at the moment, namely XForms. If you see the old name standalone, or in a blogroll, or in an RSS reader, you still don’t have much of an idea what it’s about or who’s behind it. True I get pretty good ranking on the exact phrase, but nobody searches for that…

I will continue SEO tweaks on this site as time goes on and welcome any advice from any of my 7 readers.

In short, Micahpedia is about what I’m reading, writing, thinking about, and working on. I have plenty to say about these things. :-) The best is yet to come. -m

Sunday, April 27th, 2008

Is there an inverse to the Innovator’s Dilemma?

Roughly speaking, the innovator’s dilemma happens when a product progressively gets more and more advanced features, to the point that it misses out (by listening to customers) on an entire new opportunity. At that point, a simpler, competing product can come into play and make large gains.

But what happens when a company is generally aware of the Innovator’s Dilemma and tries to compensate? It seems like second order effects might come into their own. A product widely known for being (and remaining) minimalist is exposed to attacks from deliberate enhancements and related complexification of competitive products. As the market gets more mature, the steadfastly-simple market leader gets left behind. In a sense, it’s a role reversal from what Clayton Christensen describes. But can it work out the same in the end? Please comment. -m

Thursday, March 13th, 2008

The (lowercase) semantic web goes mainstream

So today Yahoo! announced a major facet of what I’ve been working on lately: making the web more meaningful. Lots of fantastic coverage, including TechCrunch and ReadWriteWeb (and others, please link in the comments), and supportive responses and blog posts across the board. It’s been a while since I’ve felt this good about being a Yahoo.

So what exactly is it?

A few months ago I went through the pages on this very blog and added hAtom markup. As a result of this change…well, nothing happened. I had a good experience learning about exactly what is involved in retrofitting an existing site with microformats, but I didn’t get any tangible benefit. With the “SearchMonkey” platform, any site using microformats, or RDFa or eRDF, is exposed to developers who can enhance search results. An enhanced result won’t directly make my site rank higher in search, but it will most certainly make it prone to more clicks, and ultimately more readership, more inlinks, and better organic ranking.
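For a rough idea of what a developer can do with that markup, here is a small Python sketch that pulls a few hAtom fields out of a page using only the standard library. The class names (hentry, entry-title, published, author) are real hAtom names, but the HAtomParser class, the field mapping, and the sample markup are all made up for illustration. This is not how SearchMonkey works internally, and it assumes tidy markup with explicit closing tags and hAtom fields that sit inside an hentry.

```python
from html.parser import HTMLParser

# hAtom class names mapped to the (hypothetical) field names used below.
HATOM_FIELDS = {"entry-title": "title", "published": "published", "author": "author"}

# Elements that never get a closing tag, so they must not touch the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class HAtomParser(HTMLParser):
    """Collect the text of elements carrying a few hAtom class names."""

    def __init__(self):
        super().__init__()
        self.entries = []  # one dict per hentry found
        self._stack = []   # field name (or None) for each open element

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        if "hentry" in classes:
            self.entries.append({})
        field = next((HATOM_FIELDS[c] for c in classes if c in HATOM_FIELDS), None)
        self._stack.append(field)

    def handle_endtag(self, tag):
        if tag not in VOID and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Text belongs to the nearest enclosing element that names a field.
        field = next((f for f in reversed(self._stack) if f), None)
        text = data.strip()
        if field and text and self.entries:
            entry = self.entries[-1]
            entry[field] = (entry.get(field, "") + " " + text).strip()


sample = (
    '<div class="hentry">'
    '<h2 class="entry-title">Search On</h2>'
    '<abbr class="published" title="2007-06-03">June 3rd, 2007</abbr> by '
    '<span class="author vcard">Micah</span>'
    '</div>'
)
parser = HAtomParser()
parser.feed(sample)
print(parser.entries)
```

The point is that once the markup is there, the extraction side is short: a title, a date, and an author are enough raw material for a richer search result.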

How about some questions and answers:

Q: Is this Tim Berners-Lee‘s vision of the Semantic Web finally getting fulfilled?

A: No.

Q: Does this presuppose everybody rushing to change their sites to include microformats, RDF, etc?

A: No. After all, there is a developer platform. Naturally, developers will have an easier time with sites that use official and community standards for structuring data, but there is no obligation for any site to make changes in order to participate and benefit.

Q: Why would a site want to expose all its precious data in an easily-extractable way?

A: Because within a healthy ecosystem it results in a measurable increase in traffic and customer satisfaction. Data on the public web is already extractable, given enough eyeballs. An openness strategy pays off (of which SearchMonkey is an existence proof).

Q: What about metacrap? We can never trust sites to provide honest metadata.

A: The system does have significant spam deterrents built in, of which I won’t say more. But perhaps more importantly, the plugin nature of the platform uses the power of the community to shape itself. A spammy plugin won’t get installed by users. A site that mixes in fraudulent RDFa metadata with real content will get exposed as fraudulent, and users will abandon ship.

Q: Didn’t prove that having a better user interface doesn’t help gain search market share?

A: Perhaps. But this isn’t about user interface–it’s about data (which enables a much better interface.)

Q: Won’t (Google|Microsoft|some startup) just immediately clone this idea and take advantage of all the new metadata out there?

A: I’m sure these guys will have some kind of response, and it’s true that a rising tide lifts all boats. But I don’t see anyone else cloning this exactly. The way it’s implemented has a distinctly Yahoo! appeal to it. Nobody has cloned Yahoo! Answers yet, either. In some ways, this is a return to roots, since Yahoo! started off as a human-guided directory. SearchMonkey is similar, except a much broader group of people can now participate. And there are some specific human, technical and financial reasons why as well, but I suggest inviting me out for beers if you want specifics. :-)

Disclaimer: as always, I’m not speaking for my employer. See the standard disclaimer. -m

Update: more Q and A

Q: How is SearchMonkey related to the recently announced Yahoo! Microsearch?

A: In brief, Microsearch is a research project (and a very cool one) with far-reaching goals, while SearchMonkey is targeted as imminently shipping software. I frequently talk to and compare notes with Peter Mika, the lead researcher for Microsearch.

Tuesday, February 26th, 2008

Yahoo! Announces Open Search Platform

As spotted on TechCrunch, full article. This is a game-changer folks. Check out the comments attached to the article. -m

Friday, December 21st, 2007

XML 2007 buzz: Hadoop

OK, the majority of the buzz came from my talk, where I strongly encouraged folks to take a look at Hadoop. This article seems to be saying much the same things. If you’re curious about the future of distributed computation and storage, it’s worth a look. -m

Sunday, June 3rd, 2007

Search On

The approximately seven readers of this blog have probably already heard this, but just in case: I have a new role at Yahoo!–working on next generation search.

Lots of details are still falling into place. For now I describe it as: “Imagining, specifying, prototyping, developing, and evangelizing next-generation web search experiences leveraging the full and unique capabilities available within Yahoo!”

In many ways, this is a logical stepping stone after oneSearch, and I’ll be dealing with lowercase semantic web issues more now. Expect the focus of this blog to shift accordingly (though I’m still interested in mobile and will make note of important happenings.)

Search On! -m