Archive for January, 2016

Monday, January 11th, 2016

Re-creating the Semantics Demo

(From the archives: I wrote this over 2 years ago, but never hit publish. At last, the tale can be told!)

If you haven’t seen it, the keynote at MarkLogic World 2013 is worth a look. I was on stage demonstrating new Semantics features built into MarkLogic Server. Two of the three demos were based on MarkMail, a database of some 60 million messages, with enhanced search capabilities driven by semantics. (The third demo was a built-from-the-ground-up semantic application.)

Since then, several folks have asked about the code behind the demo. What I showed was a fully operational MarkMail instance, including millions of email messages. This was understandably quite expensive to keep up on AWS, and it went away shortly after the keynote. A huge part of the demo was showing operation at scale, but reading between the lines, what folks are more interested in is something more portable–a way to see the code in operation and play with it without having to stand up an entire cluster or go through a lengthy setup procedure.

Space won’t allow for a full semantics tutorial here. For that, a good resource is this free tutorial from MarkLogic University.

So, in this posting, let’s recreate something on a similar level using built-in features. We’ll use the Oscars sample application that ships with the product. To get started, create an Application Builder sample project and deploy it. We’ll call the relevant database names ‘oscar’ and ‘oscar-modules’ throughout. Since Application Builder ships with only a small amount of data, you may also want to run the sample Information Studio collector that will fetch the rest of the dataset.

Before we can query, we need to actually turn on the semantics index. The easiest place to do this is on the page at http://localhost:8000/appservices/. Select the oscar database and hit configure. On the page that comes up, tick the box for Semantics and wait for the yellow flash.

Semantic Data

This wouldn’t be much of a semantic application without triple data. Entire books have been written on this kind of data modeling, but one huge advantage of semantics is that there’s lots of data already set up and ready to go. We’ll use dbpedia. The most recent release as of this writing is version 3.9.

From there we’ll grab data that looks relevant to the Oscar application: anything about people and/or movies, picking and choosing from things most likely to have relevant facts:

In all, roughly 38 million triples: not even enough to make MarkLogic break a sweat, but still a large enough chunk to be inconvenient to download. Fortunately, the oscar data and dbpedia ultimately derive from the same source, Wikipedia itself. Since the oscar data preserved URLs, it was straightforward to extract all triples that had a matching subject, once prefixed with “http://dbpedia.org/resource/”.
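The subject-matching step described above can be sketched roughly as follows. This is not the script actually used for the extraction. It assumes the oscar documents carry Wikipedia URLs of the form http://en.wikipedia.org/wiki/Title and that the dbpedia dump is read as N-Triples lines; the function names are hypothetical.

```javascript
// Assumption: oscar documents keep Wikipedia URLs such as
// http://en.wikipedia.org/wiki/Orson_Welles (hypothetical shape).
var DBPEDIA_PREFIX = "http://dbpedia.org/resource/";

// Map a Wikipedia page URL to the matching dbpedia subject URI
// by reusing the last path segment (the page title).
function toDbpediaSubject(wikipediaUrl) {
  var title = wikipediaUrl.substring(wikipediaUrl.lastIndexOf("/") + 1);
  return DBPEDIA_PREFIX + title;
}

// Keep only the N-Triples lines whose subject is one of the wanted URIs.
function filterTriples(ntriplesLines, wikipediaUrls) {
  var wanted = {};
  wikipediaUrls.forEach(function (u) {
    wanted[toDbpediaSubject(u)] = true;
  });
  return ntriplesLines.filter(function (line) {
    var m = /^<([^>]+)>/.exec(line); // subject is the first <...> token
    return m !== null && wanted[m[1]] === true;
  });
}
```

For 38 million triples you would stream the dump line by line rather than hold it in an array, but the matching logic stays the same.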

I extracted all these triples. Grab them from here and put them somewhere on your local system.

Then simply load these triples via query console. Point the target database to ‘oscar’ and run this:

import module namespace sem="http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";
sem:rdf-load("/path/to/oscartrips.ttl")

Infopanel widget

An ‘infopanel’ is the part of the MLW demo that showed the Hadoop logo, committers, downloads, and other facts about the current query. The default oscar app already has something like this: widgets. Let’s create a new widget type that looks up and displays facts about the current query. To start, if you haven’t already, build the example application in App Builder. There’s some excellent documentation that walks through this process.

Put on your Front End Dev hat and let’s build a widget. All the code we will use and modify is in the oscar-modules database, so either hook up a WebDav server or copy the files out to your filesystem to work on them. Back in AppBuilder on the Assemble page, click the small X at the upper-right corner of the pie chart widget. This will clear space for the widget we’re about to create, specifically in the div <div id="widget-2" class="widget widget-slot">.

The way to do this is to modify the file application/custom/app-config.js. All changes to files in the custom/ directory will survive a redeployment in AppBuilder, which means your changes will be safe, even if you need to go back and change things in Application Builder.

// Callback invoked with the current search state; for now, just echo the query.
function infocb(dat) {
  $("#widget-2").html("<h2>Infopanel</h2><p>The query is " +
      JSON.stringify(dat.query) + "</p>");
}
var infopanel = ML.createWidget($("#widget-2"), infocb, null, null);

This gives us the bare minimum possible widget. Now all that’s left is to add semantics.

Hooking up the Infopanel query

We need a semantic query, the shape of which is: “starting with a string, find the matching concept, and from that concept return lots of facts to sift through later”.

And we have everything we need at hand with MarkLogic 7. The REST endpoint, already part of the deployed app, includes a SPARQL endpoint. So we need to make the new widget fire off a semantic query in the SPARQL language, then render the results into the widget. One nice thing about the triples in use here is that they consistently use the foaf:name property to map between a concept and its string label. So pulling all the triples based on a string-named topic works like this. Again, we’ll use Query Console to experiment:

import module namespace sem = "http://marklogic.com/semantics"
    at "/MarkLogic/semantics.xqy";
let $str := "Zorba the Greek"
let $sparql := "
prefix foaf: <http://xmlns.com/foaf/0.1/>
construct { ?topic ?p ?o }
where
{ ?topic foaf:name $str .
?topic ?p ?o . }
"
return sem:sparql($sparql, map:entry("str", $str))

Here, of course, to make this Query Console runnable we are passing in a hard-coded string (“Zorba the Greek”) but in the infopanel this will come from the query.
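Because the infopanel text comes from whatever the user typed, it is worth escaping it before splicing it into a SPARQL string literal. Here is a minimal sketch of that idea. The helper names are hypothetical, and the "@en" language tag is an assumption that the foaf:name literals in the loaded data are tagged as English.

```javascript
// Escape characters that would break out of a double-quoted SPARQL literal.
function escapeSparqlString(text) {
  return String(text)
    .replace(/\\/g, "\\\\")  // backslash first, so later escapes are not doubled
    .replace(/"/g, '\\"')
    .replace(/\n/g, "\\n")
    .replace(/\r/g, "\\r");
}

// Build the "all triples for the topic with this name" query as a string.
function buildInfopanelQuery(text) {
  return 'prefix foaf: <http://xmlns.com/foaf/0.1/> ' +
    'construct { ?topic ?p ?o } ' +
    'where { ?topic foaf:name "' + escapeSparqlString(text) + '"@en . ' +
    '?topic ?p ?o . }';
}
```

Calling buildInfopanelQuery("Zorba the Greek") yields the same query shape as the Query Console example, with the string inlined rather than bound as a variable.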

Of course, deciding what parts of the query to use could be quite an involved process. For example, if the query included [decade:1980s] you can imagine all kinds of interesting semantic queries that might produce useful and interesting results. But to keep things simple, we will look for only a single word query, which includes quoted phrases like “Orson Welles”. Also in the name of simplicity, the code sample will only use a few possible predicates. Choosing which predicates to use, and in what order to display them, is a big part of making an infopanel useful.
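As an aside, one simple way to make the choice and ordering of predicates explicit is a priority list. The sketch below is only an illustration: the list contents and the function name are assumptions, and a real infopanel would likely tune both per data set.

```javascript
// Hypothetical priority list: earlier entries are displayed first.
var PREDICATE_ORDER = [
  "http://xmlns.com/foaf/0.1/name",
  "http://purl.org/dc/elements/1.1/description",
  "http://dbpedia.org/ontology/thumbnail"
];

// Given the predicate map for one subject (predicate URI -> array of
// objects with a "value" field), return [predicate, firstValue] pairs
// in display order, skipping predicates that are absent.
function orderedFacts(predicateMap) {
  return PREDICATE_ORDER
    .filter(function (p) {
      return Array.isArray(predicateMap[p]) && predicateMap[p].length > 0;
    })
    .map(function (p) {
      return [p, predicateMap[p][0].value];
    });
}
```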

Here’s the code. Put this in application/custom/app-config.js:

function infocb(dat) {
  // Pull the text of a single word-query out of the structured query, if any.
  var qtxt = dat.query && dat.query["word-query"] &&
        dat.query["word-query"][0] && dat.query["word-query"][0].text &&
        dat.query["word-query"][0].text._value;
  if (qtxt) {
    // Escape backslashes and quotes so the text stays inside the SPARQL literal.
    var lit = String(qtxt).replace(/\\/g, "\\\\").replace(/"/g, '\\"');
    $.ajax({
      url: "/v1/graphs/sparql",
      accepts: { json:"application/rdf+json" },
      dataType: "json",
      data: {query:
        'prefix foaf: <http://xmlns.com/foaf/0.1/> ' +
        'construct { ?topic ?p ?o } ' +
        'where ' +
        '{ ?topic foaf:name "' + lit + '"@en . ' +
        '?topic ?p ?o . }'
      },
      success: function(data) {
        // RDF/JSON is keyed by subject; use the first subject returned.
        var subj = Object.keys(data)[0]; // ECMAScript 5th ed, IE9+
        var ptitle = "http://xmlns.com/foaf/0.1/name";
        var pdesc = "http://purl.org/dc/elements/1.1/description";
        var pthumb = "http://dbpedia.org/ontology/thumbnail";
        var title = "-";
        var desc = "";
        var thumb = "";
        if (data[subj]) {
          if (data[subj][ptitle]) {
            title = data[subj][ptitle][0].value;
          }
          if (data[subj][pdesc]) {
            desc = "<p>" + data[subj][pdesc][0].value + "</p>";
          }
          if (data[subj][pthumb]) {
            thumb = "<img style='width:150px; height:150px' src='" +
                data[subj][pthumb][0].value + "'/>";
          }
        }
        $("#widget-2").html("<h2>" + title + "</h2>" + desc + thumb);
      }
    });
  } else {
    $("#widget-2").html("no data");
  }
}

var infopanel = ML.createWidget($("#widget-2"), infocb, null, null);

This works by crafting a SPARQL query and sending it off to the server. The response comes back in RDF/JSON format, with the subject as a root object in the JSON, and each predicate against that subject as a sub-object. The code looks through the predicates and picks out interesting information for the infopanel, formatting it as HTML.
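To make the shape of that response concrete, here is a small sketch of an RDF/JSON payload and a helper that reads from it. The sample values (including the example.org thumbnail URL) are made up for illustration, and firstValue is a hypothetical helper, not part of any MarkLogic API.

```javascript
// Illustrative RDF/JSON payload: subject -> predicate -> array of objects.
// The values here are invented for the example.
var sample = {
  "http://dbpedia.org/resource/Zorba_the_Greek_(film)": {
    "http://xmlns.com/foaf/0.1/name": [
      { "value": "Zorba the Greek", "type": "literal", "lang": "en" }
    ],
    "http://dbpedia.org/ontology/thumbnail": [
      { "value": "http://example.org/zorba.jpg", "type": "uri" }
    ]
  }
};

// Return the first value of a predicate for a subject, or a fallback
// when the subject or predicate is missing.
function firstValue(rdfJson, subject, predicate, fallback) {
  var preds = rdfJson[subject];
  var objs = preds && preds[predicate];
  return (objs && objs.length > 0) ? objs[0].value : fallback;
}

var subject = Object.keys(sample)[0];
var title = firstValue(sample, subject, "http://xmlns.com/foaf/0.1/name", "-");
```

A helper like this keeps the missing-data checks in one place, which matters when many subjects lack a description or thumbnail.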

I noted in working on this that many of the images referenced in the dbpedia image dataset actually return 404 on the web. If you are not seeing thumbnail images for some queries, this may be why. An infopanel implementation can only be as helpful as the data underneath. If anyone knows of more recent data than the official dbpedia 3.9 data, do let me know.

Where to go from here

I hope this provides a base upon which many developers can play and experiment. Any kind of app, but especially a semantic app, comes about through an iterative process. There’s a lot of room for expansion in these techniques. Algorithms to select and present semantic data can get quite involved; this only scratches the surface.

The other gem in here is the widget framework, which has actually been part of all Application Builder apps since MarkLogic 6. Having that technology as a backdrop made it far easier to zoom in and focus on the semantic technology. Try it out, and let me know in the comments how it works for you.

Monday, January 11th, 2016

Geektastic Things

I am trying something new with the GeekThoughts domain. Instead of pointing to my blog, it’s pointing at some cool geeky things on a CMS that’s easier to update. Won’t you check it out?

geekthoughts.info

Saturday, January 2nd, 2016

Pixel editor

I did a thing.

I am experimenting with machine learning and neural networks. To do so, I need a real-world dataset to play with. For starters, I am using a 5×7 pixel array, as common in many DIY projects, representing digits 0-9. Please help me by drawing a picture of the randomly-selected digit below. All data will be released to the public domain.

Can you help me? Take a few seconds and draw a picture of a number by clicking on pixels to toggle their value.

Email me with any questions or suggestions.




Inspiration

Data collection by Google Custom Forms.

Machine learning datasets FTW! Stay tuned for code & results.