Billion triples challenge

I had been asking around earlier for large RDF datasets, and here’s one. It looks like a great contest to build an app for, but unfortunately the deadline is soon (1 Oct).

What is it? From the challenge page:

The major part of the dataset was crawled during February/March 2009 based on datasets provided by Falcon-S, Sindice, Swoogle, SWSE, and Watson using the MultiCrawler/SWSE framework. To ensure wide coverage, we also included a (bounded) breadth-first crawl of depth 50 starting from http://www.w3.org/People/Berners-Lee/card.

The downloaded content was parsed using the Redland toolkit with rdfxml, rss-tag-soup, rdfa parsers. We rewrote blank node identifiers to include the data source in order to provide unique blank nodes for each data source, and appended the data source to the output file. The data is encoded in NQuads format and split into chunks of 10m statements each.
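To make the blank node rewriting concrete, here is a rough Python sketch of turning one triple into a quad tagged with its source. The challenge page doesn't say how the identifiers were actually rewritten, so the hashing scheme and the function name `to_quad` are my own assumptions, not what the crawler does.

```python
import hashlib


def to_quad(triple_line: str, source_url: str) -> str:
    """Turn one N-Triples-style line into an N-Quads-style line.

    Assumption: blank nodes are made unique per source by appending a short
    hash of the source URL. The real dataset may use a different scheme.
    """
    # Short, stable tag derived from the source URL (hypothetical scheme).
    tag = hashlib.sha1(source_url.encode("utf-8")).hexdigest()[:8]

    statement = triple_line.strip()
    if not statement.endswith("."):
        raise ValueError("expected a statement ending in '.'")

    # Subject and predicate never contain whitespace; the object may
    # (string literals), so split at most twice and keep the rest intact.
    subject, predicate, obj = statement[:-1].strip().split(None, 2)

    def rename(term: str) -> str:
        # Only terms that are blank nodes get the per-source suffix.
        return f"{term}x{tag}" if term.startswith("_:") else term

    # The data source goes last, as the fourth element of the quad.
    return f"{rename(subject)} {predicate} {rename(obj)} <{source_url}> ."
```

With this, the same label `_:b1` coming from two different documents ends up as two different blank nodes, which is the point of the rewriting step. Splitting the output into chunks would then just be a matter of starting a new file every 10m lines.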

The page includes some fairly detailed statistics on the data breakdown. Cool. -m
