Archive for the 'metadata' Category

Wednesday, January 26th, 2011

Explosive growth of RDFa

Some great data from my one-time colleague Peter Mika. Based on data culled from 12 billion web pages, RDFa is on 3.5 percent of them, even after discounting “trivial” uses of it. Just look at how much that dark blue bar shot up since the last measurement, some 18 months earlier.

Also of note: eRDF has dropped off the map. hAtom and hReview are continuing their climb.

-m

Thursday, September 9th, 2010

FCC opens its databases

Good news for big data fans. The FCC has released APIs to several large databases involving broadband statistics, spectrum licenses, and some related topics. I haven’t had a chance for a close look yet, perhaps we can do that together. Link. -m

Sunday, August 22nd, 2010

Eulogy for SearchMonkey

This is indeed a sad day for all of us, for on October 1, a great app will be gone. Though we hardly had enough time during his short life to get to know him, like the grass that withers and fades, this monkey will finish his earthly course.

Updated SearchMonkey logo

Photo by Micah

I know he left many things undone, for example only enhancing 60% of the delivered result pages. He never got a chance to finish his life’s ambition of promoting RDFa and microformats to the masses or to be the killer app of the (lower-case) semantic web. You could say he will live on as “some of this structured data processing will be supported natively by the Microsoft platform”. Part of the monkey we loved will live on as enhanced results continue to flow forth from the Yahoo/Bing alliance.

The SearchMonkey Alumni group on LinkedIn is filled with wonderful mourners. Micah Alpern wrote there

I miss the team, the songs, and the aspiration to solve a hard problem. Everything else is just code.

Isaac Asimov was reported to have said “If my doctor told me I had only six minutes to live, I wouldn’t brood. I’d type a little faster.” Today we can identify with that sentiment. Keep typing.

-m

Wednesday, June 9th, 2010

“Google syntax” for semantic queries?

Thought experiment: are there any commonly-expressed semantic queries–the kind of queries you’d run over a triple store, or perhaps a SearchMonkey-annotated web site–expressible in common type-in-a-searchbox query grammar?

As a refresher, here are some things that Google and other search engines can handle. The square brackets represent the search box into which the queries are typed, not part of the queries themselves.

[term]

[term -butnotthis]

[term1 OR term2]

[“phrase term”]

[term1 OR term2 -“but not this” site:dubinko.info filetype:html]

So what kind of semantic queries would be usefully expressed in a similar way, avoiding SPARQL and the like? For example, maybe [by:”Micah Dubinko”] could map to a document containing a triple like <this document> <dc:author> “Micah Dubinko”. What other kinds of graph queries are interesting, common, and simple to express like this? Comments welcome.
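As a thought-experiment sketch, here's one way such operators might be translated into SPARQL-style triple patterns. The operator names and the operator-to-predicate mapping are hypothetical, just following the by: example above:

```python
import re

# Hypothetical mapping from search-box operators to RDF predicates.
# These names are illustrative, not an established syntax.
OPERATOR_PREDICATES = {
    "by": "dc:author",
    "about": "dc:subject",
}

def query_to_patterns(query):
    """Translate e.g. by:"Micah Dubinko" into SPARQL-ish triple patterns."""
    patterns = []
    for op, value in re.findall(r'(\w+):"([^"]+)"', query):
        predicate = OPERATOR_PREDICATES.get(op)
        if predicate:
            patterns.append(f'?doc {predicate} "{value}" .')
    return patterns

print(query_to_patterns('by:"Micah Dubinko"'))
# → ['?doc dc:author "Micah Dubinko" .']
```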

-m

Sunday, May 30th, 2010

Balisage contest: solving the wikiml problem

I wish I could say I had something to do with the planning of this: part of Balisage 2010 is a contest to “encourage markup experts to review and to research the current state of wiki markup languages and to generate a proposal that serves to de-babelize the current state of affairs for the long haul.”  To enter, you must propose a set of concrete steps (organizational, social, and/or technological) that will enable wiki content interchange, a real WYSIWYG editor, and/or wiki syntax standardization.

This pushes all of my buttons. It’s got structured documents, Web, parser geekery, writing, engineering, and standards. There’s a bunch of open source prior art, including PyXMLWiki, which I adapted from some fantastic earlier work from Rick Jelliffe.

Sadly, MarkLogic employees aren’t eligible to enter. Get your write-up done by July 15 and sent to balisage-2010-contest at marklogic dot com. The winner will be announced at Balisage and will take home some serious prize winnings, and also will be strongly encouraged (but not required) to give a brief summary (~10 minutes) of their winning entry.

Can’t wait to see what comes out of this. -m

Thursday, November 5th, 2009

Metadata FTW

Link credit goes to Joho.

This looks pretty significant. The AZ Supreme Court ruled that document metadata must be disclosed under existing public records law. This may start a chain reaction with other states following suit. With the movement toward open data including data.gov and the Federal Register, this fits in well. Quite often metadata including creation date and author and the like make for much better searching and faceting. -m

Monday, October 12th, 2009

Speaking at Northern Virginia Mark Logic User Group Oct 27

Come learn more about Mark Logic and get a behind-the-scenes look at the new Application Builder. I’ll be speaking at the NOVA MUG (Northern Virginia Mark Logic User Group) on October 27. This turns out to be pretty close to the big Semantic Web conference, so I’ll stick my head in there too. Stop by and look me up!

Details at the developer site.

-m

Wednesday, September 16th, 2009

Billion triples challenge

I had been asking around earlier for large RDF datasets. Here’s one. Looks like a great contest to build an app around this, but unfortunately, the deadline looks like it’s soonish (1 Oct).

What is it?

The major part of the dataset was crawled during February/March 2009 based on datasets provided by Falcon-S, Sindice, Swoogle, SWSE, and Watson using the MultiCrawler/SWSE framework. To ensure wide coverage, we also included a (bounded) breadth-first crawl of depth 50 starting from http://www.w3.org/People/Berners-Lee/card.

The downloaded content was parsed using the Redland toolkit with rdfxml, rss-tag-soup, rdfa parsers. We rewrote blank node identifiers to include the data source in order to provide unique blank nodes for each data source, and appended the data source to the output file. The data is encoded in NQuads format and split into chunks of 10m statements each.
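A rough Python sketch of the blank-node rewriting described above. The label scheme is my guess, not the crawl's actual one, and the whitespace split assumes no spaces inside literals:

```python
def rewrite_bnodes(nquad_line, source_id):
    """Prefix each blank node label with its data source, so labels
    stay unique when quads from many sources are merged.
    Naive tokenization: assumes no whitespace inside literals."""
    out = []
    for token in nquad_line.split():
        if token.startswith("_:"):
            token = f"_:{source_id}x{token[2:]}"   # e.g. _:b1 -> _:src42xb1
        out.append(token)
    return " ".join(out)

line = '_:b1 <http://xmlns.com/foaf/0.1/name> "Tim" <http://example.org/> .'
print(rewrite_bnodes(line, "src42"))
# → _:src42xb1 <http://xmlns.com/foaf/0.1/name> "Tim" <http://example.org/> .
```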

The page includes some fairly detailed statistics on the data breakdown. Cool. -m

Tuesday, June 23rd, 2009

RDFa List Apart

A great introduction article. Maybe it’s just the crowd I hang with, but RDFa looks like it’s moving from trendy to serious tooling. -m

Friday, June 19th, 2009

VoCamp Wrap-up

I spent 2 days at the Yahoo! campus at a VoCamp event, my first. Initially, I was dismayed at the schedule. Spend all the time the first day figuring out why everybody came? It seemed inefficient. But having gone through it, the process seems productive, exactly the way that completely decentralized groups need to get things done. Peter Mika did a great job moderating.

Attendees numbered about 35, and came from widely varying backgrounds from librarian to linguist to professor to student to CTO, though uniformly geeky. With SemTech this week, the timing was right, and the number of international attendees was impressive.

In community development, nothing gets completely decided just because a few people met. But progress happens. The first day was largely exploratory, but also covered plenary topics that nearly everyone was interested in. Namely:

  • Finding, choosing, and knowing when to create vocabularies
  • Mapping from one vocabulary to another
  • RDBMS to RDF mapping

Much of the shared understanding of these discussions is captured on various wiki pages connected to the one at the top of this article.

For day 2, we split into smaller working groups with more focused topics. I sat in on a discussion of Common Tag (which still feels too complex to me, but does fulfill a richer use case than rel-tag). Next, some vocabulary design, planning a microformat (and eventual RDF vocab) to represent code documentation: classes, functions, parameters, and the like. Tantek Çelik espoused the “scientific method” of vocab design: would a separate group, in similar circumstances, come up with the same design? If the answer is ‘yes’, then you probably designed it right. The way to make that happen is to focus on the basics, keeping everything as simple as possible. If any important features are missed, you will find out quickly. The experience of getting the simple thing out the door will provide the education needed to make the more complicated follow-on version a success.

From the wrap-up: if you are designing a vocabulary, the most useful thing you can do is NOT to unleash a fully-formed proposal on the world, but rather to capture the discussion around it. What were the initial use cases? What are people currently doing? What design goals were explicitly left off the table, or deferred to a future version, or immediately shot down? It’s better to capture multiple proposals, even if fragmentary, and let lots of people look them over and gravitate toward the best design.

Lastly, some cool things overheard:

“Relational databases? We call those ‘legacy’.”

“The socially-accepted schema is fairly consistent.”

“It’s just a map, it’s not the territory.”

-m

Friday, May 15th, 2009

A nugget from _A Canticle for Leibowitz_

This brilliant bit is almost a throwaway paragraph on page 304, near the end.

[Two men in a satirical dialog] managed only to demonstrate that the mathematical limit of an infinite sequence of “doubting the certainty with which something doubted is known to be unknowable when the ‘something doubted’ is still a preceding statement ‘unknowability’ of something doubted,” that the limit of this process at infinity can only be equivalent to a statement of absolute certainty, even though phrased as an infinite series of negations of certainty.

It’s not like the whole book is like this…far from it. But it is chock full of little gems.

-m

Tuesday, May 12th, 2009

Google Rich Snippets powered by RDFa

The new feature called rich snippets shows that SearchMonkey has caught the eye of the 800 pound gorilla. Many of the same microformats and RDF vocabularies are supported. It seems increasingly inevitable that RDFa will catch on, no matter what the HTML5 group thinks. -m

Sunday, May 3rd, 2009

Playing with Wolfram Alpha

I’ve been experimenting with the preview version of Wolfram Alpha. It’s not like any current search engine because it’s not a search engine at all. Others have already written more eloquent things about it.

The key feature of it is that it doesn’t just find information; it infers it on the fly. Take for example the query

next solar eclipse in Sunnyvale

AFAIK, nobody has ever written a regular web page describing this important (to me) topic. Try it in Yahoo! or Google and see for yourself. There are a few potentially interesting links based on the abstracts, but they turn out to be spammy. Wolfram Alpha figures out that I’m talking about the combination of a concept (“solar eclipse”) and a place (“Sunnyvale, CA”, but with an offer to switch to Sunnyvale, TX) and combines the two. The result is a simple answer–4:52 pm PDT | Sunday, May 20, 2012 (3.049 years from now). Hey, that’s sooner than I thought! Besides the date, there are many related facts and a cool map.

This is in contrast to SearchMonkey, which I helped create, in two main areas:

  1. Wolfram Alpha uses metadata to produce the result, then renders it through a set of pre-arranged renderers. The response is facts, not web pages.
  2. SearchMonkey focuses on sites providing their own metadata, while Wolfram Alpha focuses on hand-curation.

Search engines have been striving to do a better job at fact-queries. Wolfram’s approach shows that a system disjoint from finding web pages in an index can be hugely useful.

The engineers working on this have a sense of humor too. The query

1.21GW

returns a page that includes the text “power required to operate the flux capacitor in the DeLorean DMC-12 time machine” as well as a useful comparison (~ 0.1 x the power of space shuttle at launch).

Yahoo! and Google do various kinds of internal “query rewriting”, but usually don’t let you know other than in the broadest terms (“did you mean …”). Wolfram Alpha shows a diagram of what it understood the query to be. The diagrams make it evident that something like the RDF model is in use, but without peeking under the hood, it’s hard to say something definitive.

One thing I wonder about is whether Wolfram Alpha creates a dynamic (as was a major goal of SearchMonkey) of giving web authors a reason to put more metadata in their sites–a killer app if you will. It’s not clear at this early date how much web crawling or site metadata extraction (say RDFa) plays into the curation process.

In any case Wolfram Alpha is something to watch. It’s set to launch publicly this month. -m

Sunday, March 8th, 2009

Wolfram Alpha

The remarkable (and prolific) Stephen Wolfram has an idea called Wolfram Alpha. People used to assume the “Star Trek” model of computers:

that one would be able to ask a computer any factual question, and have it compute the answer.

Which has proved to be quite distant from reality. Instead:

But armed with Mathematica and NKS [A New Kind of Science] I realized there’s another way: explicitly implement methods and models, as algorithms, and explicitly curate all data so that it is immediately computable.

It’s not easy to do this. Every different kind of method and model—and data—has its own special features and character. But with a mixture of Mathematica and NKS automation, and a lot of human experts, I’m happy to say that we’ve gotten a very long way.

I’m still a SearchMonkey guy at heart, so I wonder how much Wolfram’s team is familiar with existing Semantic Web research and practice–because at a high level this seems very much like RDF with suitable queries thereupon. If that’s a good characterization, that’s A Good Thing, since practical application has been one of SemWeb’s weak spots.

-m

Saturday, January 10th, 2009

Defining the Prime RDFa use case (without mentioning RDFa)

At least, that’s how I’ve summarized John Allsopp’s article on HTML5 semantics. -m

Tuesday, December 30th, 2008

RDFa parser in XQuery now open source

After a delay, the code to my RDFa parser in XQuery is now available under an Apache license. Go get it. This is some of the earliest XQuery code I ever wrote, so go easy on me. It follows the earlier work on a functional definition of RDFa. And feel free to send in patches. -m

Tuesday, December 9th, 2008

XML 2008 liveblog: Using RDFa for Government Information

Mark Birbeck, Web Backplane.

Problem statement: You shouldn’t have to “scrape” government sites.

Solution: RDFa

<div typeof="arg:Vacancy">
  Job title: <span property="dc:title">Assistant Officer</span>
  Description: <span property="dc:description">To analyse... </span>
</div>

This resolves to two full RDF triples. No separate feeds, uses existing publishing systems. Two of the most ambitious RDFa projects are taking place in the UK. Flexible arrangements possible.
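As a sketch, the resulting triples would look roughly like this in Turtle (prefix declarations omitted, blank node label arbitrary; the typeof attribute also types the node):

```turtle
_:vacancy a arg:Vacancy .
_:vacancy dc:title "Assistant Officer" .
_:vacancy dc:description "To analyse... " .
```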

Steps: 1. Create vocabulary. 2. Create demo. 3. Evangelize.

Vocabulary under Google Code: Argot Hub. Reuse terms (dc:title, foaf:name) where possible, developed in public.

Demos: Yahoo! SearchMonkey, (good for helping not-so-technical people to “get it”) then a Drupal hosted one (a little more control).

Next level, a new server that aggregates specific info (like all job openings for Electricians), including geocoding. Ubiquity RDFa helps here.

Evangelizing: Detailed tutorials. Drupal code will go open source. More opportunities with companies currently screen-scraping. More info @ rdfa.info.

Q&A: Asking about predicate overloading (dc:title). A general SemWeb issue. Context helps. Is RDFa tied to HTML? No, SearchMonkey itself uses RDFa–it’s just attributes.

-m

Tuesday, December 9th, 2008

XML 2008 liveblog: Sentiment Analysis in Open Source Information for the US Government

Ronald Reck, SAP; Kenneth Sall, SAIC

“I wish I knew when people were saying bad things about me.” Sentiment analysis. Kapow used initially. From 800k news articles (from 1996 and 1997), extracted 450M RDF assertions. The 13 Reuters standard metadata elements not used in this case. Used Redland for heavy RDF lifting. Inxight ThingFinder (commercial) for entity extraction, supplemented with enumerated lists (Bush Cabinet, Intelligence Agencies, negative adjectives, positive admire verbs, etc.) End result was RDF/XML.

(Kenneth takes the mic) SPARQL Sentiment Query Web UI. Heavy SPARQL ahead… Redland hasn’t implemented the UNION operator yet, making the examples more convoluted.

PREFIX sap: <http://iama.rrecktek.com/ont/sap#>
SELECT ?ent ?type ?name
WHERE {
  ?ent sap:Method "Name Catalog" .
  ?ent sap:Type ?type .
  ?ent sap:Name ?name
}

Difficult learning curve. Need ability to do substring from entity URI -> article URI.

Next steps: current news stories. Leverage existing metadata. RDF at the sentence level. Improve name catalogs. Use rule-based pattern matching engine. Slides.

-m

Friday, October 24th, 2008

Online etymology database

I’ve been playing lately with this site, and it’s a fantastic resource. The word carboy probably comes from Persian qarabah “large flagon.” Who knew? -m

Saturday, August 23rd, 2008

MarkLogic RDFa parser

This post will be continuously updated to contain the most recent details about an XQuery 1.0 RDFa parser I wrote for Mark Logic. It follows the Functional RDFa pattern.

At present there is little to say, but eventually code and more will be available. Stay tuned.

-m

Friday, August 8th, 2008

It would be awesome if somebody…

It would be awesome if someone made a site that catalogued all the common mis-encodings. Even in 2008, I see these things all over the web–mangled quotation marks, apostrophes, em-dashes. I’d love to see a pictorial guide.

curly apostrophe looks like ?’ – original encoding=_________ mislabeled as __________ .

That sort of thing. Surely somebody has done this already, right? -m
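For what it's worth, one of the commonest cases can be demonstrated in a few lines of Python: UTF-8 bytes re-decoded as Windows-1252, which turns a curly apostrophe into the classic three-character mess:

```python
# A curly apostrophe (RIGHT SINGLE QUOTATION MARK), mangled by writing
# UTF-8 bytes and then reading them back as Windows-1252.
curly = "\u2019"
mangled = curly.encode("utf-8").decode("cp1252")
print(mangled)  # → â€™

# The repair, once you know which pair of encodings was involved:
repaired = mangled.encode("cp1252").decode("utf-8")
print(repaired == curly)  # → True
```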

Thursday, August 7th, 2008

Great comment on the eRDF 1.1 discussion

On the eRDF discussion posting, Toby Inkster, an implementer of eRDF, talks about why it’s bad to steal the id attribute, and why RDFa is better suited for general purpose metadata. Worth a read. -m

Monday, August 4th, 2008

Implementing RDFa in XQuery

Through the weekend I put most of the final touches on an implementation of RDFa in XQuery. The implementation is based on the functional specification of RDFa, an offshoot of the excellent work coming out of the W3C task force.

The spec contains a procedural description of the parsing algorithm, and several have successfully followed it to arrive at a conforming implementation. But you would have a tough time explaining RDFa to someone that way. The functional description sort of fell out of the way I described RDFa to people.

“When you see an element with XXXX, you generate a triple, using SSSS as the subject, PPPP as the predicate, and OOOO as the object.”

Which arguably is the more natural way to express the algorithm for functional languages like XQuery or XSLT. Fill in the right blanks and you pretty much have it. In practice, it’s somewhat more complicated, but not nearly so much as with other W3C specs.
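Filling in the blanks of that sentence, here's a toy functional sketch in Python. This is a drastic simplification of real RDFa processing (no CURIE resolution, chaining, or subject inheritance), just the shape of the rule:

```python
def element_triples(elem, current_subject):
    """'When you see an element with @property, you generate a triple,
    using the in-scope subject as subject, @property as predicate,
    and the element's text as object.' One rule, no recursion."""
    attrs, text = elem
    if "property" in attrs:
        subject = attrs.get("about", current_subject)
        return [(subject, attrs["property"], text)]
    return []

elem = ({"property": "dc:title"}, "Assistant Officer")
print(element_triples(elem, "<doc>"))
# → [('<doc>', 'dc:title', 'Assistant Officer')]
```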

I hope to make the code available soon. You’ll hear about it first here.

I’ll write more when I’m not exhausted. :-) -m

Monday, July 28th, 2008

eRDF 1.1 Proposal Discussion

The W3C RDFa specification is now in Candidate Recommendation phase, with an explicit call for implementations (of which there are several). Momentum for RDFa is steadily building. What about eRDF, which favors the existing HTML syntax over new attributes?

There’s still a place for a simpler syntactic approach to embedding RDF in HTML, as evidenced by projects like Yahoo! SearchMonkey. And eRDF is still the only game in town when it comes to annotating RDF within HTML-without-the-X.

One thing the RDFa folks did was define src as a subject-bearing node, rather than an object. At first I didn’t like this inversion, but the more I worked with it, the more it made sense. When you have an image, which can’t have children in (X)HTML, it’s very often useful to use the src URL as the subject, with a predicate of perhaps cc:license.

So I propose one single change to eRDF 1.1. Well, actually several changes, since one thing leads to another. The first is to specify that you are using a different version of eRDF. A new profile string of:

"http://purl.org/NET/erdf11/profile"

The next is changing the meaning of a src value to be a subject, not an object. Perhaps swapping the subject and object. Many existing uses of eRDF involving src already involve properties with readily available inverses. For example:

<!-- eRDF 1.0 -->
<img class="foaf.depiction" src="http://example.org/picture" />

<!-- eRDF 1.1 -->
<img src="http://example.org/picture" class="foaf.depicts" />

With the inherent limitations of existing syntax, the use case of having a full image URL and a license URL won’t happen. But XHTML2 as well as an HTML5 proposal suggest that adding href to many elements might come to pass. In which case this possibility opens:

<img src="http://example.org/picture" class="cc.license"
href="http://creativecommons.org/licenses/by/2.0/" />

Comments? -m

Thursday, July 3rd, 2008

Yahoo! now indexes RDFa

I haven’t seen an announcement about this, but try the following query on Yahoo Search: [searchmonkeyid:com.yahoo.rdf.rdfa] (link). It shows documents containing RDFa, with Digg at the top. Since this is a Searchmonkey ID, it’s also usable in Searchmonkey to actually extract the metadata and use it to customize search results.

Does your site use RDFa yet? -m

Friday, June 20th, 2008

RDFa is a Candidate Recommendation

The result of tons of work by lots of smart people. Go forth and implement. And I need to put in a plug for Metadata for Grandma which (indirectly, as it turned out) influenced the spec. RDFa is already a big deal, used in places like SearchMonkey. The subset of RDFa used by SearchMonkey is 100% conforming to the CR.

I’ll have more thoughts and perhaps implementation notes on this later. -m

Wednesday, May 14th, 2008

Reminder: SearchMonkey developer launch party Thursday

Reminder: Thursday evening at Yahoo! Sunnyvale headquarters is the launch party for the developer-facing side of SearchMonkey. In case you haven’t been paying attention, SearchMonkey is a new platform that lets developers craft their own awesomized search results. If you’re interested in SEO or general lowercase semantic web tools, you’ll love it. Meet me there. Upcoming link. Party starts at 5:30. -m

Update: The developer tool is live. Rasmus has a nice walkthrough.

Monday, April 28th, 2008

SearchMonkey in private beta

I haven’t mentioned it yet, but SearchMonkey (now an official name, not just a project name) is in external limited beta. Keep an eye on ysearchblog, lots more technical content is on the way. -m

Thursday, March 13th, 2008

The (lowercase) semantic web goes mainstream

So today Yahoo! announced a major facet of what I’ve been working on lately: making the web more meaningful. Lots of fantastic coverage, including TechCrunch and ReadWriteWeb (and others, please link in the comments), and supportive responses and blog posts across the board. It’s been a while since I’ve felt this good about being a Yahoo.

So what exactly is it?

A few months ago I went through the pages on this very blog and added hAtom markup. As a result of this change…well, nothing happened. I had a good experience learning about exactly what is involved in retrofitting an existing site with microformats, but I didn’t get any tangible benefit. With the “SearchMonkey” platform, any site using microformats, or RDFa or eRDF, is exposed to developers who can enhance search results. An enhanced result won’t directly make my site rank higher in search, but it will most certainly make it prone to more clicks, and ultimately more readership, more inlinks, and better organic ranking.

How about some questions and answers:

Q: Is this Tim Berners-Lee‘s vision of the Semantic Web finally getting fulfilled?

A: No.

Q: Does this presuppose everybody rushing to change their sites to include microformats, RDF, etc?

A: No. After all, there is a developer platform. Naturally, developers will have an easier time with sites that use official and community standards for structuring data, but there is no obligation for any site to make changes in order to participate and benefit.

Q: Why would a site want to expose all its precious data in an easily-extractable way?

A: Because within a healthy ecosystem it results in a measurable increase in traffic and customer satisfaction. Data on the public web is already extractable, given enough eyeballs. An openness strategy pays off (of which SearchMonkey is an existence proof).

Q: What about metacrap? We can never trust sites to provide honest metadata.

A: The system does have significant spam deterrents built in, of which I won’t say more. But perhaps more importantly, the plugin nature of the platform uses the power of the community to shape itself. A spammy plugin won’t get installed by users. A site that mixes in fraudulent RDFa metadata with real content will get exposed as fraudulent, and users will abandon ship.

Q: Didn’t ask.com prove that having a better user interface doesn’t help gain search market share?

A: Perhaps. But this isn’t about user interface–it’s about data (which enables a much better interface).

Q: Won’t (Google|Microsoft|some startup) just immediately clone this idea and take advantage of all the new metadata out there?

A: I’m sure these guys will have some kind of response, and it’s true that a rising tide lifts all boats. But I don’t see anyone else cloning this exactly. The way it’s implemented has a distinctly Yahoo! appeal to it. Nobody has cloned Yahoo! Answers yet, either. In some ways, this is a return to roots, since Yahoo! started off as a human-guided directory. SearchMonkey is similar, except a much broader group of people can now participate. And there are some specific human, technical and financial reasons why as well, but I suggest inviting me out for beers if you want specifics. :-)

Disclaimer: as always, I’m not speaking for my employer. See the standard disclaimer. -m

Update: more Q and A

Q: How is SearchMonkey related to the recently announced Yahoo! Microsearch?

A: In brief, Microsearch is a research project (and a very cool one) with far-reaching goals, while SearchMonkey is targeted as imminently shipping software. I frequently talk to and compare notes with Peter Mika, the lead researcher for Microsearch.

Monday, March 10th, 2008

Dear readers…

You are awesome. Just sayin’. -m
