Archive for the 'microformats' Category

Wednesday, January 26th, 2011

Explosive growth of RDFa

Some great data from my one-time colleague Peter Mika. Based on data culled from 12 billion web pages, RDFa is on 3.5 percent of them, even after discounting “trivial” uses of it. Just look at how much that dark blue bar shot up since the last measurement, some 18 months earlier.

Also of note: eRDF has dropped off the map. hAtom and hReview are continuing their climb.

-m

Sunday, August 22nd, 2010

Eulogy for SearchMonkey

This is indeed a sad day for all of us, for on October 1, a great app will be gone. Though we hardly had enough time during his short life to get to know him, like the grass that withers and fades, this monkey will finish his earthly course.

Updated SearchMonkey logo

Photo by Micah

I know he left many things undone, for example only enhancing 60% of the delivered result pages. He never got a chance to finish his life’s ambition of promoting RDFa and microformats to the masses or to be the killer app of the (lower-case) semantic web. You could say he will live on as “some of this structured data processing will be supported natively by the Microsoft platform”. Part of the monkey we loved will live on as enhanced results continue to flow forth from the Yahoo/Bing alliance.

The SearchMonkey Alumni group on LinkedIn is filled with wonderful mourners. Micah Alpern wrote there

I miss the team, the songs, and the aspiration to solve a hard problem. Everything else is just code.

Isaac Asimov was reported to have said “If my doctor told me I had only six minutes to live, I wouldn’t brood. I’d type a little faster.” Today we can identify with that sentiment. Keep typing.

-m

Tuesday, May 12th, 2009

Google Rich Snippets powered by RDFa

The new feature called rich snippets shows that SearchMonkey has caught the eye of the 800 pound gorilla. Many of the same microformats and RDF vocabularies are supported. It seems increasingly inevitable that RDFa will catch on, no matter what the HTML5 group thinks. -m

Friday, June 20th, 2008

RDFa is a Candidate Recommendation

The result of tons of work by lots of smart people. Go forth and implement. And I need to put in a plug for Metadata for Grandma which (indirectly, as it turned out) influenced the spec. RDFa is already a big deal, used in places like SearchMonkey. The subset of RDFa used by SearchMonkey is 100% conforming to the CR.

I’ll have more thoughts and perhaps implementation notes on this later. -m

Thursday, June 5th, 2008

Microformat search done right

From the Yahoo! Developer blog, new search keywords you can use to home in on indexed microformats.

For example, to see every hAtom-bearing page that mentions ‘dubinko’ use the query [searchmonkeyid:com.yahoo.uf.hatom dubinko]. Works similarly for hCard, hCalendar, hReview, and XFN. I’m sure more are coming soon too. -m

Saturday, May 17th, 2008

Are microformats right for your site?

Yeah, more than ever before. See my article on Yahoo! developer net. The stuff I talk about here is currently live in the indexer. -m

Thursday, March 13th, 2008

The (lowercase) semantic web goes mainstream

So today Yahoo! announced a major facet of what I’ve been working on lately: making the web more meaningful. Lots of fantastic coverage, including TechCrunch and ReadWriteWeb (and others, please link in the comments), and supportive responses and blog posts across the board. It’s been a while since I’ve felt this good about being a Yahoo.

So what exactly is it?

A few months ago I went through the pages on this very blog and added hAtom markup. As a result of this change…well, nothing happened. I had a good experience learning about exactly what is involved in retrofitting an existing site with microformats, but I didn’t get any tangible benefit. With the “SearchMonkey” platform, any site using microformats, or RDFa or eRDF, is exposed to developers who can enhance search results. An enhanced result won’t directly make my site rank higher in search, but it will most certainly make it prone to more clicks, and ultimately more readership, more inlinks, and better organic ranking.

How about some questions and answers:

Q: Is this Tim Berners-Lee‘s vision of the Semantic Web finally getting fulfilled?

A: No.

Q: Does this presuppose everybody rushing to change their sites to include microformats, RDF, etc?

A: No. After all, there is a developer platform. Naturally, developers will have an easier time with sites that use official and community standards for structuring data, but there is no obligation for any site to make changes in order to participate and benefit.

Q: Why would a site want to expose all its precious data in an easily-extractable way?

A: Because within a healthy ecosystem it results in a measurable increase in traffic and customer satisfaction. Data on the public web is already extractable, given enough eyeballs. An openness strategy pays off (of which SearchMonkey is an existence proof).

Q: What about metacrap? We can never trust sites to provide honest metadata.

A: The system does have significant spam deterrents built in, of which I won’t say more. But perhaps more importantly, the plugin nature of the platform uses the power of the community to shape itself. A spammy plugin won’t get installed by users. A site that mixes in fraudulent RDFa metadata with real content will get exposed as fraudulent, and users will abandon ship.

Q: Didn’t ask.com prove that having a better user interface doesn’t help gain search market share?

A: Perhaps. But this isn’t about user interface–it’s about data (which enables a much better interface.)

Q: Won’t (Google|Microsoft|some startup) just immediately clone this idea and take advantage of all the new metadata out there?

A: I’m sure these guys will have some kind of response, and it’s true that a rising tide lifts all boats. But I don’t see anyone else cloning this exactly. The way it’s implemented has a distinctly Yahoo! appeal to it. Nobody has cloned Yahoo! Answers yet, either. In some ways, this is a return to roots, since Yahoo! started off as a human-guided directory. SearchMonkey is similar, except a much broader group of people can now participate. And there are some specific human, technical and financial reasons why as well, but I suggest inviting me out for beers if you want specifics. :-)

Disclaimer: as always, I’m not speaking for my employer. See the standard disclaimer. -m

Update: more Q and A

Q: How is SearchMonkey related to the recently announced Yahoo! Microsearch?

A: In brief, Microsearch is a research project (and a very cool one) with far-reaching goals, while SearchMonkey is targeted as imminently shipping software. I frequently talk to and compare notes with Peter Mika, the lead researcher for Microsearch.

Thursday, March 6th, 2008

microformat search at Yahoo!

Somehow I missed this posting and the underlying news that a Y Research project has a nice public demo of semantic search, driven by RDF, RDFa, and microformats. Still a rough sketch of a full solution, with multiple-second access times. But I particularly like the query for renaissance faire. -m

Monday, October 22nd, 2007

Is there fertile ground between RDFa and GRDDL?

The more I look at RDFa, the more I like it. But still it doesn’t help with the pain-point of namespaces, specifically of unmemorable URLs all over the place and qnames (or CURIEs) in content.

Does GRDDL offer a way out? Could, for instance, the namespace name for Dublin Core metadata be assigned to the prefix “dc:” in an external file, linked via transformation to the document in question? Then it would be simpler, from a producer or consumer viewpoint, to simply use names like “dc:title” with no problems or ambiguity.
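A rough sketch of the idea in Python: the prefix-to-namespace mapping lives in a separate table, standing in for whatever an externally linked file would supply, and names like “dc:title” are expanded against it. The names PREFIXES and expand are hypothetical, made up for illustration; nothing here is defined by GRDDL or RDFa.

```python
# Hypothetical external prefix table. In the idea above, this would come
# from a file linked to the document rather than from the document itself.
PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
}


def expand(name, prefixes=PREFIXES):
    """Expand a prefixed name such as 'dc:title' into a full URI."""
    prefix, sep, local = name.partition(":")
    if not sep or prefix not in prefixes:
        raise KeyError("no mapping for the prefix in %r" % name)
    return prefixes[prefix] + local


print(expand("dc:title"))
```

The document author only ever types the short name; the unmemorable URL appears once, in the shared table.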

This could be especially useful now that discussions are reopening around XML in HTML.

As usual, comments welcome. -m

Friday, October 5th, 2007

Playing with microformats

I’ll be doing some experimenting around here over maybe the next week or two. Specifically, setting up hAtom within these pages. Watch for falling debris and report any unusual observations. -m

Sunday, June 24th, 2007

At that moment, I knew my business was Machine Ready

I fell asleep one night while reading Ray Kurzweil, and had this crazy dream where the internet called me up (over VOIP, naturally) to complain that none of my web pages made sense. Par for the course, I thought at first. But then I told the internet a few things, to let me worry about my own domain of concern; he/she/it grappled with a response when a loud noise awoke me–my chirping alarm clock. I reached over to pound the Snooze button, but I stopped when my eyes focused on the display, which read in segmented LED letters: I rtFm. -m

Tuesday, January 23rd, 2007

On language design…

A semi-random thought that occurred to me.

One marker of a well-designed markup language is that it looks to the future. This doesn’t mean it’s an amorphous blob of abstract indirections mapped to tags. It can (and arguably should) be concrete and solid, but designed in such a way that keeps bigger things in mind.

HTML and XHTML are, I suppose, canonical examples of this, giving birth to microformats and many other uses outside of a browser. -m

Monday, December 4th, 2006

UC Berkeley – what I talked about

Last week, I visited Erik Wilde, Bob Glushko, and students up at Cal. No major announcements, just some sharpening of discussion points.

Since this was my first visit to Berkeley, I finally got to tell the joke “thank you for your OS”. Maybe you had to be there.

The intentional web is a formalism for describing “why the font tag is evil”. I often work with 3rd party integration languages, and the markup design is, without exception, crap. I hypothesize that the reason for this is jumping into solution-space before fully understanding problem-space. This seems to apply to lots more than just font tags; I lumped in WML and about half the world’s ajax sites for good measure.

Microformats are a formalism for describing “why creating a new markup language for my CD collection” is evil. Could XForms have been done as a microformat? No, microformats require a strong intentional foundation language, and HTML forms ain’t it. Is the proposed W3C approach an instance of “a deadly two-pronged attack”, a la Yahoo! Photos + Flickr? We’ll see. It does seem like that road leads to a namespace apocalypse, highlighting the fundamental difficulty namespaces foist on attempts to usably extend HTML and XHTML at the same time. A namespace apocalypse may not be a bad thing.

On namespaces, I went over most of the points from my recent article. I won’t rehash that here.

What are some practical and implementation issues around XForms or the lack thereof? Focusing on mobile, as reason #1 I gave the lack of commercial-grade Java browsers, discussed here previously. The state of mobile browsers is appalling, other than Opera and S60. Terms like “model” and “field” are troublesome, because they confuse the problem domain (the real world) and the solution domain (the computer). Browser vendors have been too inwardly-focused, both now and during the first attempt at salvaging HTML forms, leading to a premature jump into solution-space. But perhaps XForms dwelled for too long in the problem space…

Maybe I’ve mellowed some, but increasingly I’m able to look at both sides of issues. A useful skill for Information School students, wouldn’t you agree? -m

Monday, October 23rd, 2006

Yahoo! Answers Mobile

Just ran into this. Nice! Mobile mashups are getting some serious momentum.

To elaborate on my previous comments a bit, the concept of what people find usable differs between sitting at a desktop and sitting/standing/running/driving with mobile in hand. Desktop sites aren’t optimized for these kinds of use patterns. Ergo, fertile ground for lots of mashups. You were getting tired of the Maps API + X formula anyway, right? ;-) -m

Tuesday, July 18th, 2006

The right way to do Ajax is declaratively

Write up by Duncan Cragg. More and more momentum is building for this meme. -m

Sunday, July 16th, 2006

Microformats: inline annotation vs. binding

Hey readers, help me guide my scattered thought processes.
I’ve been thinking lately about microformats, which are typically characterized by inline annotation through existing class attributes in XHTML. You put the rel=”self” or whatever right into the document, on the element you’re talking about.
Another approach, that used by CSS itself, is to keep all the extra information bunched together in a different place. There’s all kinds of phrases, usually beginning with “separation of” to describe this pattern. And to do so requires a specific way to connect the external information, typically called a binding. For CSS, it’s Selectors.
OK, so far so good. Except that it’s possible, and common in some cases, to have style attributes, taking CSS in the inline annotation direction. The lines are blurrier than they might seem at first.
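To make the binding idea concrete, here is a minimal sketch in Python, assuming a simplified, namespace-free snippet of markup and the limited XPath subset in the standard library’s ElementTree. The BINDINGS table and apply_bindings function are hypothetical names; the class names are the usual hAtom ones. This is not something any microformat spec defines.

```python
import xml.etree.ElementTree as ET

# Third-party markup we do not control (simplified, no XHTML namespace).
SOURCE = """<div>
  <div id="post"><h2>Hello</h2><p>Body text</p></div>
</div>"""

# Hypothetical binding table: ElementTree path -> class name to attach.
# This plays the role that Selectors play for CSS.
BINDINGS = [
    (".//div[@id='post']", "hentry"),
    (".//div[@id='post']/h2", "entry-title"),
    (".//div[@id='post']/p", "entry-content"),
]


def apply_bindings(root, bindings):
    """Add each bound class to every element its path selects."""
    for path, cssclass in bindings:
        for el in root.findall(path):
            tokens = el.get("class", "").split()
            if cssclass not in tokens:
                tokens.append(cssclass)
            el.set("class", " ".join(tokens))
    return root


root = apply_bindings(ET.fromstring(SOURCE), BINDINGS)
print(ET.tostring(root, encoding="unicode"))
```

The source markup stays untouched on the publisher’s side; the annotations are layered on by whoever holds the binding table.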

So, Yahoo! has started publicly supporting microformats, which is great, because they are the ones generating the pages. What if you want to make a microformat out of 3rd party XHTML without touching it?
Here’s my questions. Feel free to comment below. I’m travelling, but I’ll try to moderate asap.

  1. In the specific case of CSS, how do you decide inline vs. binding?
  2. Are any microformat efforts currently looking at a binding approach vs. inline annotation? Which ones?
  3. What general principles should readers keep in mind for this discussion?

Thanks! -m

Tuesday, June 27th, 2006

rel=’nofollow’ IS a failure

My earlier nofollow post is now officially the most-spammed blog posting I’ve ever written. All this despite a moderation system–the spammers are getting zero benefit from all this. Deterrent techniques are not working; there will always be some small percentage of “unprotected” sites that the bad guys are happy to exploit.

Adding insult, even after I moderate posts, the links still have nofollow applied (by default in WordPress). Later, I’m going to post some analysis on how and why nofollow fails. If you have any ideas, post them in comments below. -m

Tuesday, June 6th, 2006

Microformat validation with Python and XPath

Python+XPath is a surprisingly powerful combination for doing all kinds of arbitrary validation tasks. I should know. I’ve recently figured out a few things that make it even better.

Line numbers in error messages. Libxml2 docs aren’t exactly forthcoming in this area. It’s pretty easy to register an error callback, but maddeningly it doesn’t include line numbers (except when piping errors directly to stdout, as several examples show). The C APIs have a whole notion of Structured Error Handling, which doesn’t seem to come across to the Python bindings. Getting the line number of a node is also straightforward, but I couldn’t figure out how to get the line of an error. Fortunately, the answer is simple:

import libxml2

e = libxml2.lastError()
print(e.line())  # in contrast to node.lineNo()...

Checking for a class. Another common task in validating microformats is checking whether an element has a certain class applied to it. Since the class attribute takes a space-separated list of class values, this is harder than string search–you really need a tokenizer. Again, Python Libxml2 comes through. It’s reasonably simple to write an XPath extension function in Python:

def hasClass(ctx, content, cssclass):
    # Returns "1" if cssclass is one of the space-separated tokens
    # in content, and "" otherwise.
    rc = ""
    if isinstance(content, str):
        tokens = content.split()
        for token in tokens:
            if token == cssclass:
                rc = "1"
    return rc

# register the function on an XPath context (ctx)
ctx.registerXPathFunction("hasClass", "http://some.uri", hasClass)

Props to Kimbro Staken who did all the initial hard work. Comments? -m
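For anyone without libxml2 handy, the same token check can be sketched with only the Python standard library. This swaps in ElementTree for libxml2 and a plain helper for the XPath extension function; has_class is a hypothetical name, and the sample markup is made up to show why a substring search gives the wrong answer.

```python
import xml.etree.ElementTree as ET


def has_class(element, cssclass):
    """True if cssclass is one of the element's space-separated class tokens."""
    return cssclass in element.get("class", "").split()


doc = ET.fromstring(
    '<div>'
    '<div class="post hentry">real entry</div>'
    '<div class="hentry-like">not an entry</div>'
    '</div>'
)

# A naive substring test for "hentry" would also match the second div.
entries = [el for el in doc.iter("div") if has_class(el, "hentry")]
print([el.text for el in entries])
```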

Wednesday, May 31st, 2006

Visualizing Tags Over Time

Check out the presentation page, with a link to the paper. Because someone asked, my name got top billing due to the prestigious “alphabetical” reference system. -m

Tuesday, May 16th, 2006

Come Yahoo! with me

The following is a blatant job posting. If you’re not into that kind of thing, feel free to skip.

In Yahoo! Mobile, we’re working on an amazing project which, unfortunately, I can’t say much about just yet. We’re growing, and we need some more talent. All of the following are in Sunnyvale, CA and have the full benefits package. Relocation is always a possibility for the right candidate.

Web Guru/Developer: If you dream in semantic XHTML and prefer command line tools to read and write web pages, this is the job for you.

Release Engineer: On the other hand, if you dream about virtual IPs and consider Apache config files a second language, you’d be happy in this challenging position.

There’s more openings than these; I’m just highlighting a few here. If you’re interested, or just looking for more details, email me. If you’ll be at WWW next week, you could also look me up in person. -m

Thursday, May 11th, 2006

Is rel=’nofollow’ a failure?

The argument behind rel=”nofollow” was that spammers were trying to game the system to get link credibility for their sites. Having a way to flag links that haven’t been human-reviewed so that they don’t count toward PageRank (and similar algorithms) would remove that incentive, and spammers would go away.
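As a rough illustration of the mechanism, here is how a link-counting crawler might honor the flag, sketched with Python’s standard html.parser. The LinkCollector class and the sample URLs are made up for this example; it is not a description of how any actual search engine treats nofollow.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect hrefs, separating links flagged rel="nofollow" from the rest."""

    def __init__(self):
        super().__init__()
        self.counted = []   # links that would count toward ranking
        self.ignored = []   # links flagged nofollow

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if href is None:
            return
        # rel is a space-separated token list, so tokenize before checking.
        rel_tokens = (attrs.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.ignored.append(href)
        else:
            self.counted.append(href)


page = (
    '<a href="http://example.org/post">reviewed link</a>'
    '<a rel="nofollow" href="http://example.com/spam">comment link</a>'
)
collector = LinkCollector()
collector.feed(page)
print(collector.counted, collector.ignored)
```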

Fat chance. You haven’t noticed it here because of moderation (a similar way of enforcing human-review and removing incentive to game the system), but the spammers have already been hammering this site. A 3-second check would show that their efforts aren’t working, but if someone has a bot slamming thousands or millions of sites, apparently it’s not even economical for them to direct their spew–they keep on trying.

Even as the disincentives pile on, levels of spam activity increase. Nofollow doesn’t work, because it doesn’t live at the same level where the problem occurs. Comments? -m
