Archive for the 'commercialism' Category

Monday, January 11th, 2016

Re-creating the Semantics Demo

(From the archives: I wrote this over 2 years ago, but never hit publish. At last, the tale can be told!)

If you haven’t seen it, the keynote at MarkLogic World 2013 is worth a look. I was on stage demonstrating new Semantics features built into MarkLogic server. Two of the three demos were based on MarkMail, a database of some 60 million messages, with enhanced search capabilities driven by semantics. (The third demo was a built-from-the-ground-up semantic application).

Since then, several folks have asked about the code behind the demo. What I showed was a fully operational MarkMail instance, including millions of email messages. This was understandably quite expensive to keep up on AWS, and it went away shortly after the keynote. A huge part of the demo was showing operation at scale, but reading between the lines, what folks are more interested in is something more portable–a way to see the code in operation and play with it without having to stand up an entire cluster or go through a lengthy setup procedure.

Space won’t allow for a full semantics tutorial here. For that, a good resource is this free tutorial from MarkLogic University.

So, in this posting, let’s recreate something on a similar level using built-in features. We’ll use the Oscars sample application that ships with the product. To get started, create an Application Builder sample project and deploy it. We’ll call the relevant database names ‘oscar’ and ‘oscar-modules’ throughout. Since Application Builder ships with only a small amount of data, you may also want to run the sample Information Studio collector that will fetch the rest of the dataset.

Before we can query, we need to actually turn on the semantics index. The easiest place to do this is on the page at http://localhost:8000/appservices/. Select the oscar database and hit configure. On the page that comes up, tick the box for Semantics and wait for the yellow flash.
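If you’d rather script that setting (handy for repeatable setup), the Admin API can do the same thing. Here’s a minimal sketch, assuming MarkLogic 7’s admin:database-set-triple-index function; run it in Query Console as an admin user:

import module namespace admin = "http://marklogic.com/xdmp/admin"
    at "/MarkLogic/admin.xqy";
let $config := admin:get-configuration()
(: flip on the triple index for the oscar database :)
let $config := admin:database-set-triple-index(
    $config, xdmp:database("oscar"), fn:true())
return admin:save-configuration($config)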

Semantic Data

This wouldn’t be much of a semantic application without triple data. Entire books have been written on this kind of data modeling, but one huge advantage of semantics is that there’s lots of data already set up and ready to go. We’ll use dbpedia. The most recent release as of this writing is version 3.9.

From there we’ll grab data that looks relevant to the Oscar application: anything about people and/or movies, picking and choosing from the things most likely to have relevant facts.

In all, a bit over 38 million triples–not even enough to make MarkLogic break a sweat, but still a large enough chunk to be inconvenient to download. The oscar data and dbpedia ultimately derive from the same source–Wikipedia itself–and since the oscar data preserved those URLs, it was straightforward to extract all triples that had a matching subject, once prefixed with “http://dbpedia.org/resource/”.
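For the curious, the subject-matching step is easy to sketch in XQuery. This assumes the oscar document URIs end with the Wikipedia-style resource name (an assumption; adjust the tokenize step to your actual URI scheme) and that the URI lexicon is enabled:

(: sketch: derive candidate dbpedia subject IRIs from oscar document URIs :)
for $uri in cts:uris()
let $name := fn:tokenize($uri, "/")[fn:last()]
return sem:iri(fn:concat("http://dbpedia.org/resource/", $name))

Matching these IRIs against the subjects in the dbpedia dump yields the extract described above.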

I extracted all these triples: grab them from here and put the file somewhere on your local system.

Then simply load these triples via Query Console. Point the target database to ‘oscar’ and run this:

import module namespace sem="http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";
sem:rdf-load("/path/to/oscartrips.ttl")
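Once the load completes, a quick sanity check shows the triples are in place (a sketch; your exact count may vary):

import module namespace sem="http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";
sem:sparql("SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o }")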

Infopanel widget

An ‘infopanel’ is the part of the MLW demo that showed the Hadoop logo, committers, downloads, and other facts about the current query. The default oscar app already has something like this: widgets. Let’s create a new widget type that looks up and displays facts about the current query. To start, if you haven’t already, build the example application in App Builder. There’s some excellent documentation that walks through this process.

Put on your Front End Dev hat and let’s build a widget. All the code we will use and modify is in the oscar-modules database, so either hook up a WebDAV server or copy the files out to your filesystem to work on them. Back in App Builder on the Assemble page, click the small X at the upper-right corner of the pie chart widget. This will clear space for the widget we’re about to create, specifically in the div <div id="widget-2" class="widget widget-slot">.

The way to do this is to modify the file application/custom/app-config.js. All changes to files in the custom/ directory will survive a redeployment in App Builder, which means your changes will be safe, even if you need to go back and change things in Application Builder.

// Widget callback: runs with the current query state whenever it changes
function infocb(dat) {
  $("#widget-2").html("<h2>Infopanel</h2><p>The query is " +
      JSON.stringify(dat.query) + "</p>");
}
var infopanel = ML.createWidget($("#widget-2"), infocb, null, null);

This gives us the bare minimum possible widget. Now all that’s left is to add semantics.

Hooking up the Infopanel query

We need a semantic query, the shape of which is: “starting with a string, find the matching concept, and from that concept return lots of facts to sift through later”.

And we have everything we need at hand with MarkLogic 7. The REST endpoint, already part of the deployed app, includes a SPARQL endpoint. So we need to make the new widget fire off a semantic query in the SPARQL language, then render the results into the widget. One nice thing about the triples in use here is that they consistently use the foaf:name property to map between a concept and its string label. So pulling all the triples based on a string-named topic works like this. Again, we’ll use Query Console to experiment:

import module namespace sem = "http://marklogic.com/semantics"
    at "/MarkLogic/semantics.xqy";
let $str := "Zorba the Greek"
let $sparql := "
prefix foaf: <http://xmlns.com/foaf/0.1/>
construct { ?topic ?p ?o }
where
{ ?topic foaf:name $str .
?topic ?p ?o . }
"
return sem:sparql($sparql, map:entry("str", $str))

Here, of course, to make this Query Console runnable we are passing in a hard-coded string (“Zorba the Greek”) but in the infopanel this will come from the query.

Of course, deciding what parts of the query to use could be quite an involved process. For example, if the query included [decade:1980s] you can imagine all kinds of interesting semantic queries that might produce useful and interesting results. But to keep things simple, we will look for only a single-word query, which includes quoted phrases like “Orson Welles”. Also in the name of simplicity, the code sample will only use a few possible predicates. Choosing which predicates to use, and in what order to display them, is a big part of making an infopanel useful.

Here’s the code. Put this in application/custom/app-config.js:

function infocb(dat) {
  var qtxt = dat.query && dat.query["word-query"] &&
        dat.query["word-query"][0] && dat.query["word-query"][0].text &&
        dat.query["word-query"][0].text._value
  if (qtxt) {
    $.ajax({
      url: "/v1/graphs/sparql",
      accepts: { json:"application/rdf+json" },
      dataType: "json",
      data: {query:
        'prefix foaf: <http://xmlns.com/foaf/0.1/> ' +
        'construct { ?topic ?p ?o } ' +
        'where ' +
        '{ ?topic foaf:name "' + qtxt + '"@en . ' +
        '?topic ?p ?o . }'
      },
      success: function(data) {
        var subj = Object.keys(data)[0]; // first (only) subject; ECMAScript 5th ed, IE9+
        var ptitle = "http://xmlns.com/foaf/0.1/name";
        var pdesc = "http://purl.org/dc/elements/1.1/description";
        var pthumb = "http://dbpedia.org/ontology/thumbnail";
        var title = "-";
        var desc = "";
        var thumb = "";
        if (data[subj]) {
          if (data[subj][ptitle]) {
            title = data[subj][ptitle][0].value;
          }
          if (data[subj][pdesc]) {
            desc = "<p>" + data[subj][pdesc][0].value + "</p>";
          }
          if (data[subj][pthumb]) {
            thumb = "<img style='width:150px; height:150px' src='" +
                data[subj][pthumb][0].value + "'/>";
          }
        }
        $("#widget-2").html("<h2>" + title + "</h2>" + desc + thumb );
      }
   });
  } else { $("#widget-2").html("no data")} 
};

var infopanel = ML.createWidget($("#widget-2"), infocb, null, null);

This works by crafting a SPARQL query and sending it off to the server. The response comes back in RDF/JSON format, with the subject as a root object in the JSON, and each predicate against that subject as a sub-object. The code looks through the predicates and picks out interesting information for the infopanel, formatting it as HTML.
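For reference, here’s roughly the shape of the RDF/JSON response the callback walks (an illustrative sketch, not actual dbpedia output):

{
  "http://dbpedia.org/resource/Zorba_the_Greek": {
    "http://xmlns.com/foaf/0.1/name": [
      { "type": "literal", "value": "Zorba the Greek", "lang": "en" }
    ],
    "http://dbpedia.org/ontology/thumbnail": [
      { "type": "uri", "value": "http://example.org/thumb/zorba.jpg" }
    ]
  }
}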

I noted in working on this that many of the images referenced in the dbpedia image dataset actually return 404 on the web. If you are not seeing thumbnail images for some queries, this may be why. An infopanel implementation can only be as helpful as the data underneath. If anyone knows of more recent data than the official dbpedia 3.9 data, do let me know.

Where to go from here

I hope this provides a base upon which many developers can play and experiment. Any kind of app, but especially a semantic app, comes about through an iterative process. There’s a lot of room for expansion in these techniques. Algorithms to select and present semantic data can get quite involved; this only scratches the surface.

The other gem in here is the widget framework, which has actually been part of all Application Builder apps since MarkLogic 6. Having that technology as a backdrop made it far easier to zoom in and focus on the semantic technology. Try it out, and let me know in the comments how it works for you.

Monday, January 11th, 2016

Geektastic Things

I am trying something new with the GeekThoughts domain. Instead of pointing to my blog, it’s pointing at some cool geeky things on a CMS that’s easier to update. Won’t you check it out?

geekthoughts.info

Monday, May 20th, 2013

Five years at MarkLogic

This past weekend marked my five-year anniversary at MarkLogic. It’s been a fun ride, and I’m proud of how much I’ve accomplished.

It was the technology that originally caught my interest: I saw the MarkMail demo at an XML conference, and one thing led to another. The company was looking to expand the product beyond the core database–they had plans for something called a “utility layer”, though in reality it was neither a utility nor a separate layer. It started with Search API, though the very first piece of code I wrote was an RDFa parser.

But what’s really held my interest for these years is a truly unmatched set of peers. This place is brimming with brilliant minds, and that keeps me smiling every day on my way in to work.

Which leads my thoughts back to semantics again. This push in a new direction echoes the events that originally brought me on board. This is going to be huge, and will move the company in a new direction. Stay tuned. -m

Saturday, April 13th, 2013

Semantics!

This week marked the MarkLogic World conference and with it some exciting news. Without formally “announcing” a new release, the company showed off a great deal of semantic technology in-progress. Part of that came from me, on stage during the Wednesday technical keynote. I’ve been at MarkLogic five years next month, and the first piece of code I wrote there was an RDFa parser. This has been a long time coming.

It was an amazing experience. I was responsible for sifting through the huge amounts of public data–both in RDF formats and on public web pages–and writing the semantic code to pull everything together, culminating in those ten minutes on stage.

Picture this: just behind the big stage and the projected screens was a hive of impressive activity. I counted 8 A/V people backstage, plus 4 more at the back of the auditorium. The conference has reached a level of production values that wouldn’t be vastly different if it were a stadium affair. So in back there’s a curtained-off “green room” with some higher-grade snacks (think PowerBars and Red Bull) with a flatscreen that shows the stage. From back there you can’t see the projected slides or demos, but if you step just outside, you’re at the reverse side of the screen, larger-than-life. The narrow walkway leads to the “chute”, right up the steps onto the main stage. As David Gorbet went through the opening moments of his talk in fine form, I did some stretches and did everything I could think of to prepare myself.

Then he called me up and the music blasted out from the speakers. I had been playing through my mind all the nightmare scenarios–tripping on the stairs and falling on my face as I come onstage (etc.)–but none of that happened. I’ve done public speaking many times before so I had an idea what to expect, though on a stage like that the lights are so bright that it’s hard to see beyond about the third row. So despite the 300-400 people in the room, it didn’t even feel much different than addressing an intimate group of peers. It was fun. On with the demos:

The first showed our internal MarkMail cluster with a simple ‘infobox’ of the sort that all the search engines are doing these days. This was an icebreaker to talk about semantics and how it works–in this case locate the concept of Hadoop in the database, and from there find all the related labels, abstracts, people, projects, releases, and so on. During the construction of the demo, we uncovered some real world facts about the author of the top-ranked message for the query, including a book he wrote. The net effect was that these additional facts made the results a lot more useful by providing a broader context for them.

The second demo showed improved recall–that is, finding things that would otherwise slip under the radar. The existing [from:IBM] query in MarkMail does a good job finding people that happen to have the letters i-b-m in their email address. The semantic query [affiliation:IBM], in contrast, knows about the concept of IBM, the concept of people, and the relationship of is-affiliated-with (technically foaf:affiliation) to run a query that more closely models how a person would ask the question: “people that work for IBM” as opposed to “people that have i-b-m in their email address”. Thus the results included folks posting from gmail accounts and other personal addresses, and the result set jumped from about 277k messages to 280k messages.

At this point, a pause to talk about the architecture underlying the technology. It turns out that a system that already supports shared-nothing scale out, full ACID transactions, multiple HA/DR options, and a robust security model is a good starting point for building semantic capabilities. (I got so excited at this point that I forgot to use the clicker for a few beats and had to quickly catch up the slides.) SPARQL code on the screen.

Then the third demo, a classic semantic app with a twist. Pulling together triples from several different public vocabularies, we answered the question of “find a Hadoop expert” with each row of the results representing not a document, as in MarkMail results, but an actual person. We showed location data (which was actually randomized to avoid privacy concerns) and aggregate cost-of-living data for each city. When we added in a search term, we drew histograms of MarkMail message traffic over time and skipped over the result that had no messages. The audience was entranced.

This is exciting work. I had several folks come up to me afterwards with words to the effect of they hadn’t realized it before, but boy do they ever need semantics. I can’t think of a better barometer for a technical keynote. So back to work I go. There’s a lot to do.

Thanking by name is dangerous, because inevitably people get left out, but I would like to shout out to David Gorbet who ran the keynote, John Snelson who’s a co-conspirator in the development effort, Eric Bloch who helped with the MarkMail code more than anyone will ever know, Denis Shehan who was instrumental in wrangling the cloud and data, and Stephen Buxton who patiently and repeatedly offered feedback that helped sharpen the message.

I’ll post a pointer to the video when it’s available. -m

Friday, March 1st, 2013

WFH

The valley is buzzing about Marissa’s edict putting the kibosh on Yahoos working from home. I don’t have any first-hand information, but apparently this applies somewhat even to one-day-a-week telecommuters. Some are saying Marissa’s making a mistake, but I don’t think so. She’s too smart for that. There’s no better way to get extra hours of work out of a motivated A-lister than letting them skip the commute, and I work regularly with several full-time telecommuters. It works out just fine.

This is a sign that Y is still infested with slackers. From what I’ve seen, a B-or-C-lister will ruthlessly take advantage of a WFH policy. If that dries up, they’ll move on.

If I’m right, the policy will indeed go into effect at Yahoo starting this summer, and after a respectable amount of time has passed (and the slackers leave) it will loosen up again. And Yahoo will be much stronger for it. Agree? -m

Tuesday, November 20th, 2012

Hedgehogs and Foxes

In Nate Silver’s new book, he mentions a classification system for experts, originally from Berkeley professor Philip Tetlock, along a spectrum of Fox <—> Hedgehog. (The nomenclature comes from an essay about Tolstoy.)

Hedgehogs are type A personalities who believe in Big Ideas. They are ideologues and go “all-in” on whatever they’re espousing. A great many pundits fall into this category.

Foxes are scrappy creatures who believe in a plethora of little ideas and in taking different approaches toward a problem, and are more tolerant of nuance, uncertainty, complexity, and dissent.

There are a lot of social situations (broadly construed) where hedgehogs seem to have the upper hand. Talking heads on TV are a huge example, but so are many fixtures in the tech world, Malcolm Gladwell, say. Most of the places I’ve worked at have at least a subtle hedgehog bias in hiring, promotions, and career development.

To some degree, I think this stems from a lack of self-awareness. Brash pundits come across better on the big screen; they grab your attention and take a bold stand for something–who wouldn’t like that? But if you pause and think about what they’re saying or (horror) go back and measure their predictions after-the-fact, they don’t look nearly so good. Foxes are better at getting things right.

It seems like we’ve just been through a phase of more-obnoxious-than-usual punditry, and I found this spectrum a useful way to look at things. How about you? Are you paying more attention to hedgehogs when you probably should be listening to the foxes?

-m

Monday, September 17th, 2012

MarkLogic 6 is here

MarkLogic 6 launched today, and it’s full of new and updated goodies. I spent some time designing the new Application Builder including the new Visualization Widgets. If you’ve used Application Builder in the past, you’ll be pleasantly surprised at the changes. It’s leaner and faster under the hood. I’d love to hear what people think of the new architecture, and how they’re using it in new and awesome ways.

If I had to pick out a common theme for the release, it’s all about expanding the appeal of the server to reach new audiences. The Java API makes working with the server feel like a native extension to the language, and the REST API makes it easy to extend the same to other languages.

XQuery support is stronger than ever. I liked Ryan Dew’s take on some of the smaller, but still useful features.

This wouldn’t be complete without thanking my teammates who really made this possible. I had the great pleasure of working with some top-notch front-end people recently, and it’s been a great experience. -m


Thursday, August 23rd, 2012

Super simple tokenizer in XQuery

A lexer might seem like one of the boringest pieces of code to write, but every language brings its own little wrinkles to the problem. Elegant solutions are more work, but also more rewarding.

There is, of course, a large body of work on table-driven approaches, several of them listed here (and bigger list), though XQuery seems to have been largely left out of the fun.

In MarkLogic Search API, we implemented a recursive tokenizer. Since a search string can contain quoted pieces which need to be carefully maintained, first we split (in the fn:tokenize sense, discarding matched delimiters) on the quote character, then iterate through the pieces. Odd-numbered pieces are chunks of tokens outside of any quoting, and even-numbered pieces are a single quoted string, to be preserved as-is. We recurse through the odd chunks, further breaking them down into individual tokens, as well as normalizing whitespace and a few other cleanup operations. This code is aggressively optimized, and it skips any searches for tokens known not to appear in the overall string. It also preserves the character offset positions of each token relative to the starting string, which gets used downstream, so this makes for some of the most complicated code in the Search API. But it’s blazingly fast.
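To make the outer quote-splitting step concrete, here’s a stripped-down sketch (hypothetical local function; the real Search API code does much more, including offset tracking):

(: sketch: odd pieces are unquoted runs, even pieces are quoted phrases :)
declare function local:toks($str as xs:string) as xs:string* {
  for $piece at $i in fn:tokenize($str, '"')
  return
    if ($i mod 2 eq 0)
    then $piece                                         (: quoted phrase: preserve as-is :)
    else fn:tokenize(fn:normalize-space($piece), " ")   (: break into words :)
};
local:toks('red "Orson Welles" green')
(: returns ("red", "Orson Welles", "green") :)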

When prototyping, it’s nice to have something simpler and more straightforward. So I came up with an approach using fn:analyze-string. This function, introduced in XSLT 2.0 and later ported to XQuery 3.0, takes a regular expression, and returns all of the target string, neatly divided into match and non-match portions. This is great, but difficult to apply across the entire string. For example, potential matches can have different meaning depending on where they fall (again, quoted strings as an example.) But if every regex starts with ^ which anchors the match to the front of the string, the problem simplifies to peeling off a single token from the front of the string. Keep doing this until there’s no string left.

This is a particularly nice approach when parsing a grammar that’s formally defined in EBNF. You can pretty much take the list of terminal expressions, port them to XQuery-style regexes, add a ^ in front of each, and roll.

Take SPARQL for example. It’s a reasonably rich grammar. The W3C draft spec has 35 productions for terminals. I sketched out some of the terminal rules (note these are simplified):

declare variable $spq:WS     := "^\s+";
declare variable $spq:QNAME  := "^[a-zA-Z][a-zA-Z0-9]*:[a-zA-Z][a-zA-Z0-9]*";
declare variable $spq:PREFIX := "^[a-zA-Z][a-zA-Z0-9]*:";
declare variable $spq:NAME   := "^[a-zA-Z][a-zA-Z0-9]*";
declare variable $spq:IRI    := "^<[^>]+>";
...

Then we go through the input string, see which of these expressions matches, and if so call analyze-string, add the matched portion as a token, and recurse on the non-matched portion. Note that we need to try longer matches first, so the rule for ‘prefix:qname’ comes before the rule for ‘prefix:’, which comes before the rule for ‘name’:

declare function spq:tokenize-recurse($in as xs:string, $tl as json:array) {
    if ($in eq "")
    then ()
    else spq:tokenize-recurse(
        switch(true())
        case matches($in, $spq:WS)     return spq:discard-tok($in, $spq:WS)
        case matches($in, $spq:QNAME)  return spq:peel($in, $spq:QNAME, $tl, "qname")
        case matches($in, $spq:PREFIX) return spq:peel($in, $spq:PREFIX, $tl, "prefix", 0, 1)
        case matches($in, $spq:NAME)   return spq:peel($in, $spq:NAME, $tl, "name")
        ...

Here, we’re co-opting a json:array mutable object as a convenient way to store tokens as we peel them off. There’s not actually any JSON involved here. The actual peeling looks like this:

declare function spq:peel(
    $in as xs:string,
    $regex as xs:string,
    $toklist as json:array,
    $type as xs:string,
    $triml, $trimr) {
    let $split := analyze-string($in, $regex)
    let $match := string($split/str:match)
    let $match := if ($triml gt 0) then substring($match, $triml + 1) else $match
    let $match := if ($trimr gt 0) then substring($match, 1, string-length($match) - $trimr) else $match
    let $_ := json:array-push($toklist, <searchdev:tok type="{$type}">{$match}</searchdev:tok>)
    let $result := string($split/str:non-match)
    return $result
};

Some productions, like an <iri> inside angle brackets, contain fixed delimiters which get trimmed off. Some productions, like whitespace, get thrown away. And that’s it. As it stands, it’s pretty close to a table-driven approach. It’s also more flexible than the recursive approach above–even for things like escaped quotes inside a string, if you can write a regex for it, you can lex it.
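For instance, a string literal with backslash-escaped quotes, a classic headache for hand-rolled tokenizers, needs only one more anchored regex (a sketch):

declare variable $spq:STRING := '^"([^"\\]|\\.)*"';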

Performance

But is it fast? Short answer is that I don’t know. A full performance analysis would take some time. But a few quick inspections shows that it’s not terrible, and certainly good enough for prototype work. I have no evidence for this, but I also suspect that it’s amenable to server-side optimization–inside the regular expression matching code, paths that involve start-anchored matches should be easy to identify and in many cases avoid work farther down the string. There’s plenty of room on the XQuery side for optimization as well.

If you’ve experimented with different lexing techniques, or are interested in more details of this approach, drop me a line in the comments. -m

Thursday, April 26th, 2012

MarkLogic World 2012

I’m getting ready to leave for MarkLogic World, May 1-3 in Washington, DC, and it’s shaping up to be one fabulous conference. I’ve always enjoyed the vibe at these events–it has a, well, cool-in-a-data-geeky-way thing going on (like the XML conference in the early 2000’s where I got to have lunch with James Clark, but that’s a different story). Lots of people with big data problems will be here, and I always enjoy talking to these kinds of people.

I’m speaking on Wednesday at 3:30 with Product Manager extraordinaire Justin Makeig about big data visualization. If you’ll be at the conference, come look me up. And if you won’t, well, forgive me if I need a few extra days to get back to any email you send this way.

Follow me on Twitter and look for the #MLW12 tag for live coverage.

-m

Sunday, April 15th, 2012

Actually using big data

I’ve been thinking a lot about big data, and two recent items nicely capture a slice of the discussion.

1) Alex Milowski recounting working with Big Weather Data. He concludes that ‘naive’ (as-is) data loading is a “doomed” approach. Even small amounts of friction add up at scale, so you should plan on doing some in-situ cleanup. He came up with a slick solution in MarkLogic–go read his post for details.

2) Chris Dixon on Making Large Datasets Useful. Typical approaches like machine learning only solve 80-90% of the problem. So you need to either live with errorful data, or invoke manual clean-up processes.

Both worth a read. There’s more to say, but I’m not ready to tip my hand on a paper I’m working on…

-m

Wednesday, February 1st, 2012

Googlebot submitting Flash forms

I’m sure this is old news by now, but here’s one more data point.

As it turns out, XForms Institute uses an old skool XForms engine written in Flash, dating approximately back to the era when Flash was necessary to do XForms-ey things in the browser. The feedback form for the site is, quite naturally, implemented in XForms. Submissions there ultimately make it into my inbox. Here’s what I see:

Tue Jan 31 12:19:22 2012 66.249.68.249 Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)

An iPhone running Flash? I doubt it. That’s quite an agent string! Organic versioning in the wild. -m

Sunday, January 15th, 2012

Five iOS keyboard tips you probably didn’t know

Check out these tips. The article talks about iPad, but they work on iPhone too, even an old 3G.

On one hand, it shows the intense amount of careful thought Apple puts into the user experience. But on the other hand, it highlights the discovery problem. I know people who have been using iOS since before it was called iOS, and still didn’t know about these. How do you put these kinds of finishing touches into a product and make sure the target audience can find out about them? -m

Thursday, December 8th, 2011

Resurgence of MVC in XQuery

There’s been an increasing amount of talk about MVC in XQuery, notably David Cassel’s great discussion and to an extent Kurt Cagle’s platform discussion that touched on forms interfaces. Lots of Smart People are thinking in this area, and that’s a good thing.

A while back I recorded my thoughts on what I called MET, or the Model Endpoint Template organizational pattern, as used in MarkLogic Application Builder. One difference between 2009 and now, though, is that browsers have distanced themselves even farther from XML, which tends to undercut the eliminate-the-impedance-mismatch argument. In particular, the forms model in HTML5 continues to prefer flat data, which to me indicates that models still play an important role in XQuery web apps.

So I envision the app lifecycle like this:

  1. The browser requests a particular page, say the one that lets you configure sorting options in the app you’re building
  2. An HTML page loads.
  3. Client-side script requests the project state from a designated endpoint, the server transforms the XML into a flat list, and delivers it as JSON (as an optimization, the server can package the initial data into the page delivered in the prior step)
  4. Standard form interaction and client-side scripting happens, including manipulation of repeating structures mediated by JavaScript
  5. A standard form submit happens (possibly via script), sending a flat list back to the server, which performs an update to the stored XML.
It’s pretty easy to envision data-mapping tools and libraries that help automate the construction of the transforms mentioned in steps 3 and 5.
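As a sketch of the step-3 transform (hypothetical element names, standing in for real project state), flattening is just a walk over the XML:

(: sketch: flatten sort options into a flat map, then serialize as JSON :)
declare function local:flatten($state as element(state)) as map:map {
  map:new(
    for $opt at $i in $state/sort-option
    return map:entry(fn:concat("sort-", $i), fn:string($opt/@field))
  )
};
xdmp:to-json(local:flatten(
  <state><sort-option field="year"/><sort-option field="title"/></state>))

The step-5 inverse walks the flat list and rebuilds the XML before the update.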

Another thing that’s changed is the emergence of XQuery plugin technology in MarkLogic. There’s a rapidly-growing library of reusable components, initially centered around Information Studio but soon to cover more ground. This is going to have a major impact on XQuery app designs as components of the app (think visualization widgets) can be seamlessly added to apps.

Endpoints still make a ton of sense for XQuery apps, and provide the additional advantage that you now have a testable, concern-separated data layer for your app. Other apps have a clean way to interop, and even command-line operation is possible with off-the-shelf tools like wget.

Lastly, Templates. Even if you use plugins for the functional core of your app, there’s still a lot of boilerplate stuff you’d not want to repeat. Something like Mustache.xq is a good fit for this.

Which is all good–but is it MVC? This organizational pattern (let’s call it MET 2.0) is a lot closer to it. Does MET need a controller? Probably. (MarkLogic now ships a pretty good one called rest:rewrite) Like MVC, MET separates the important essences of your application. XQuery will never be Ruby or Java, and its frameworks will never be Rails or Spring, but rather something uniquely poised to capture the expressive power of the language to build apps on top of unstructured and big data. -m

Tuesday, November 1st, 2011

5 things to know about MarkLogic 5

MarkLogic 5 is out today. Here’s five things beyond the official announcement that developers should know about it:

  1. If you found the CQ sample useful, you’ll love Query Console, which does everything CQ does and more (syntax highlighting!)
  2. Better Search API support for metadata: MarkLogic has always had support for storing metadata separately from documents. With new Search API support, it’s easy to set up, and it works great with databases of binary documents.
  3. The Hadoop connector, while not officially supported in this configuration, works on Mac. I know a lot of developers use Mac hardware. Once you get Hadoop itself set up (following rules like these), everything works great in my experience.
  4. “Fields” have gotten more general and more powerful. If you haven’t set aside named portions of your documents or metadata for special indexing and access, you should look in to this feature–it will rock your world.
  5. To better understand what your system is doing at any point in time, you can now use the built-in Monitoring Dashboard, which runs in-browser.
And let’s not leave out the Express license, which makes it easier to get started. Check it out.
-m

Monday, May 30th, 2011

Good to Great

One book that Ken Bado, the MarkLogic President and CEO, likes to talk about is Good to Great (subtitled why some companies make the leap… and others don’t), a result of many man-years of meticulous research.

There’s plenty to think about in this book. It talks about the qualities of a “level 5” executive: the best have a paradoxical mixture of personal humility and iron will. It talks about getting the right people on the bus, and only then deciding where the bus is going. It talks about a culture where surfacing brutal facts is the normal and expected behavior, resulting in both discipline and faith in the future. Perhaps the key point of the book is the venn diagram that depicts “great” companies as focusing on the intersection of passion, what they can be the best at in the world, and what drives their economic engine.

The structure of the book is based on 11 key companies that passed several rigorous metrics, including an at-least-15-year period of good financial performance, followed by a turning point and an at-least-15-year period of greatness, that is, returns well above the general and industry markets. (Perhaps unfairly, companies that were in the ‘great’ bucket continuously, with no periods of merely ‘good’ performance, were excluded).

Two of the companies in the list, Fannie Mae and Wells Fargo, raised the eyebrows of this fresh reader. Both of them have been prominently in the headlines in the last few years, and not in a good way. In particular the depictions of Wells Fargo struggling with deregulation in the 80s seem galling to read with the hindsight of going through the Great Recession. Circuit City, another of the good-to-great companies, declared bankruptcy in 2009. The book itself cautions about tough times at Gillette and Nucor in the Epilogue section.

I bring this out not to be negative, but to emphasize that this is a soft discipline, not science. If there are companies that have consistently beat the market from the 80s until today with no serious hiccups, that would be truly remarkable. But there are lots of hidden variables, the system is chaotic, and mere financial numbers are too shallow a measure of greatness. A company that can truly follow these principles will almost certainly do better than one that doesn’t. Just look at Yahoo for a negative example.

In particular, I’m thinking the three circles are a good way to approach life, though I sincerely hope an individual’s third circle isn’t about optimizing finances. What can you be the best in the world at, have passion for, and use to drive your personal satisfaction engine? Maybe that would be a good area to focus your limited resources on. -m

Thursday, February 17th, 2011

MarkLogic in the news

What’s that on your TV screen? Why, it’s MarkLogic, again.

Why President Obama Picked the Bay Area

And it’s true, we’re hiring big time. Maybe your resume should be in that pile… -m

Wednesday, January 5th, 2011

Why I am abandoning Yahoo! Mail (and why you should too)

This is a non-technical description of why Yahoo! Mail is unsafe to use in a public setting, and indeed at all. I will be pointing people at this page as I go through the long process of changing an address I’ve had for more than a decade.

What’s wrong with Yahoo Mail?

A lot of web addresses start with http://–that’s a signal that the “scheme” used to deliver the page to your browser is something called HTTP, which is a technical specification that turns out to be a really good way to move around web pages. As the page flows to the browser, it’s susceptible to eavesdropping, particularly over a wi-fi connection, and much more so in public, including the usual hotspots like coffee shops, but also workplaces and many home environments. It’s the virtual equivalent of a postcard. When you’re reading the news or checking traffic, it’s not a big deal if someone can sneak a glance at your page.

Some addresses start with https://–notice the extra ‘s’ which stands for “secure”. This means two things 1) that the web page being sent over is encrypted, and thus unavailable to eavesdroppers, and 2) that the people running the site had to obtain a certificate, which is a form of proof of their identity as an organization (that they’re not, say, Ukrainian phishers). Many years ago, serving pages over https was considered quite expensive in that servers needed much beefier processors to run all that encryption. Today, while it still requires extra computation, it’s not as big of a deal. Most off-the-shelf servers have plenty of extra power. To be fair, for a truly ginormous application with millions of users like Yahoo Mail, it is not a trivial thing to roll out. But it’s critically important.

First, to dispel a point of confusion, these days nearly every site, including Yahoo Mail, uses https for the login screen. This is the most critical time when encryption is needed, because otherwise you’d be sending your password on a postcard for anyone with even modest technical skills to peek at. So that’s good, but it’s no longer enough. Because sites are written so that you don’t have to reenter your password on every single new page, they use a tiny bit of information called a “cookie” in your browser to stay logged in. Cookies themselves are neither good nor bad, but if an eavesdropper gets a hold of one, they can control most of your account–everything that doesn’t require re-entering a password. In Yahoo Mail this includes reading any of your messages, sending mail on your behalf, or even deleting messages. Are you comfortable allowing strangers to do this?

As I mentioned earlier, new, more powerful tools have been out for months that automate the process of taking over accounts this way. Zero technical prowess is needed, only the ability to install a browser plug-in. If there are any web companies dealing in personal information for which this wasn’t an all-hands-on-deck security wake-up, they are grossly negligent. Indeed, other sites like Gmail work with https all-the-time. But still, in 2011, Yahoo Mail doesn’t. I have a soft spot for Yahoo as a former employer, and I want to keep liking them. Too bad they make it so difficult.

The deeper issue at stake is that if an issue this serious goes unfixed for months, how many lesser issues lurk in the site, and how long have they been around? The issue is trust, my friend, and Yahoo just overdrew their account. I’m leaving.

FAQ

Q: So what do you want Yahoo to do about this?  A: Well, they should fix their site for their millions of remaining users.

Q: What if they fix it tomorrow? Will you delete this message?  A: No. Since I no longer trust the site, I am leaving, even though it takes time to notify all the people who still send me mail, and no matter what other developments unfold in the meantime. This page will explain my actions.

Q: Do you really want everyone else to leave Yahoo Mail?  A: No, only those who care about their privacy.

Q: What’s your new email address?  A: I have a couple, but <my first name> @ <this domain> is a good general-purpose one.

I will continue to update this page as more information becomes available. -m

Saturday, December 4th, 2010

Yahoo Mail’s inexplicable, inexcusable lack of https support

Dear Yahoo,

What’s the deal? Shortly after FireSheep was announced on Oct 24, 2010, you should have had an emergency security all-hands meeting. You should have had an edict passed down from the “Paranoids” group to get secure or else. Maybe these things happened–I have no way of knowing.

But it is clear that it’s been 6 weeks and security hasn’t changed. It’s simply not possible to read Yahoo mail over https–try it and you get redirected straight back to an insecure channel. As such, anyone accessing Yahoo mail on a public network, say a coffee shop or a workplace, is vulnerable to having their private information read, forwarded, compromised, or deleted.

Wait, did I say 6 weeks?–SSL had apparently been rolled out for mail more than 2 years ago, but pulled back due to problems. Talk about failure to execute.

I feel like I missed an announcement. What’s the deal, Y? Show me that you care about your users. No excuses.

Sincerely,

-m

Sunday, August 22nd, 2010

Eulogy for SearchMonkey

This is indeed a sad day for all of us, for on October 1, a great app will be gone. Though we hardly had enough time during his short life to get to know him, like the grass that withers and fades, this monkey will finish his earthly course.

Updated SearchMonkey logo

Photo by Micah

I know he left many things undone, for example only enhancing 60% of the delivered result pages. He never got a chance to finish his life’s ambition of promoting RDFa and microformats to the masses or to be the killer app of the (lower-case) semantic web. You could say he will live on as “some of this structured data processing will be supported natively by the Microsoft platform”. Part of the monkey we loved will live on as enhanced results continue to flow forth from the Yahoo/Bing alliance.

The SearchMonkey Alumni group on LinkedIn is filled with wonderful mourners. Micah Alpern wrote there:

I miss the team, the songs, and the aspiration to solve a hard problem. Everything else is just code.

Isaac Asimov was reported to have said “If my doctor told me I had only six minutes to live, I wouldn’t brood. I’d type a little faster.” Today we can identify with that sentiment. Keep typing.

-m

Monday, July 26th, 2010

Microsoft’s new slogan

I wanted to say something snarky about Microsoft’s new slogan, but the comments on the linked article did a pretty good job already. Ahh snark, the unthinking-man’s eloquence. -m

Wednesday, July 7th, 2010

Grokking Selenium

As the world of web apps gets more framework-y, I need to get up to speed on contemporary automation testing tools. One of the most popular ones right now is the open source Selenium project. From the look of it, that project is going through an awkward adolescent phase. For example:

  • Selenium IDE lets you record tests in a number of languages, but only HTML ones can be played back. For someone using only Selenium IDE, it’s a confusing array of choices for no apparent reason.
  • Selenium RC has bindings for lots of different languages but not for the HTML tests that are most useful in Selenium IDE. (Why not include the ability to simply play through an entire recorded script in one call, instead of fine-grained commands like selenium.key_press(input_id, 110), etc.?)
  • The list of projects prominently mentions Selenium Core (a JavaScript implementation), but when you click through to the documentation, it’s not mentioned. Elsewhere on the site it’s spoken of in deprecating terms.
  • If you look at the developer wiki, all the recent attention is on Web Drivers, a new architecture for remote-controlling browsers, but those aren’t mentioned in the docs (yet) either.

So yeah, right now it’s awkward and confusing. The underlying architecture of the project is undergoing a tectonic shift, something that would never see public light of day in a proprietary project. In the end it will come out leaner and meaner. What the project needs in the short term is more help from fresh outsiders who can visualize the desirable end state and help the ramped-up and productive developers on the project get there.

By the way, if this kind of problem seems interesting to you, let me know. We’re hiring. If you have any tips for getting up to speed in Selenium, comment below.

-m

Wednesday, June 9th, 2010

“Google syntax” for semantic queries?

Thought experiment: are there any commonly-expressed semantic queries–the kind of queries you’d run over a triple store, or perhaps a SearchMonkey-annotated web site–expressible in common type-in-a-searchbox query grammar?

As a refresher, here’s some things that Google and other search engines can handle. The square brackets represent the search box into which the queries are typed, not part of the queries themselves.

[term]

[term -butnotthis]

[term1 OR term2]

[“phrase term”]

[term1 OR term2 -“but not this” site:dubinko.info filetype:html]

So what kind of semantic queries would be usefully expressed in a similar way, avoiding SPARQL and the like? For example, maybe [by:”Micah Dubinko”] could map to a document containing a triple like <this document> <dc:author> “Micah Dubinko”. What other kinds of graph queries are interesting, common, and simple to express like this? Comments welcome.
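To make the thought experiment concrete, here’s a sketch of how thin such a mapping layer could be in XQuery (hypothetical field names; a real implementation would need escaping and a fuller grammar):

declare function local:to-sparql($field as xs:string, $value as xs:string) as xs:string {
  (: sketch: map a [field:"value"] token onto a SPARQL SELECT :)
  let $pred :=
    if ($field eq "by") then "dc:author"
    else if ($field eq "about") then "dc:subject"
    else "?p"
  return fn:concat(
    'PREFIX dc: <http://purl.org/dc/elements/1.1/> ',
    'SELECT ?doc WHERE { ?doc ', $pred, ' "', $value, '" }')
};
local:to-sparql("by", "Micah Dubinko")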

-m

Sunday, May 30th, 2010

Balisage contest: solving the wikiml problem

I wish I could say I had something to do with the planning of this: part of Balisage 2010 is a contest to “encourage markup experts to review and to research the current state of wiki markup languages and to generate a proposal that serves to de-babelize the current state of affairs for the long haul.”  To enter, you must propose a set of concrete steps (organizational, social, and/or technological) that will enable wiki content interchange, a real WYSIWYG editor, and/or wiki syntax standardization.

This pushes all of my buttons. It’s got structured documents, Web, parser geekery, writing, engineering, and standards. There’s a bunch of open source prior art, including PyXMLWiki, which I adapted from some fantastic earlier work from Rick Jelliffe.

Sadly, MarkLogic employees aren’t eligible to enter. Get your write-up done by July 15 and sent to balisage-2010-contest at marklogic dot com. The winner will be announced at Balisage and will take home some serious prize winnings, and also will be strongly encouraged (but not required) to give a brief summary (~10 minutes) of their winning entry.

Can’t wait to see what comes out of this. -m

Friday, May 14th, 2010

Geek Thoughts: verbing facebook

Facebook (v): to deliberately create an impenetrable computer user interface for purposes of manipulating users.

More collected Geek Thoughts at http://geekthoughts.info.

Tuesday, May 11th, 2010

XProc is ready

Brief note: The W3C XProc specification, edited by my partner-in-crime Norm Walsh, has advanced to Recommendation status. Now go use it. -m

Thursday, April 29th, 2010

DMC = developer.marklogic.com

The new MarkLogic developer site is up, cleaner, better organized, and more social. Even cooler, it’s an XSLT-heavy application running on a pre-release version of MarkLogic. The new blog gives some of the details of the new site and transition.

So, if you’re already a MarkLogic developer, this is a great resource. And if you’re not, the site itself shows how fast and simple it is to put together a XSLT and XQuery-powered app. -m

Friday, April 2nd, 2010

Recalibrating expectations of XML performance

Working at MarkLogic has forced me to recalibrate my expectations around XML-related performance issues. Not to brag or anything, but it’s screaming fast. Conventional wisdom of avoiding // in paths doesn’t apply, since that’s the sort of thing the indexes are made to do, and that’s just the start. Single milliseconds are now a noteworthy amount of time for something showing up in the profiler.

This is what XML was supposed to be like. Now that XML has fallen off the hype cycle, we’re getting some serious work done. -m

Thursday, March 18th, 2010

Kindle for Mac scores low on usability

Here’s my first experience with Amazon’s new Kindle client for Mac: After digging up my password and logging in, I was presented with a bunch of books. I picked the last one I’d been reading. It downloaded slowly, without a progress bar, then dumped me on some page in the middle. Apparently my farthest-read location, but I honestly don’t remember.

A cute little graphic on the screen said I could use my scroll wheel. I’m on a laptop, so I tried the two-finger drag–the equivalent gesture sans mouse… and flipped some dozens of pages in half a second. Now, hopelessly lost, I searched for a ‘back’ button to no avail. Perversely, there is a prominent ‘back’ button, but disabled. Mocking me.

This feels rushed. I wonder what could be pushing Amazon to release something so unfinished? -m

Monday, March 1st, 2010

Newsweek should never have been free

Andrew Zolli argues in Newsweek that online content should never have been free. I’m probably not the first one to make this profound observation–but if it were not for the free online edition of Newsweek (and link aggregator sites like Digg) I wouldn’t have read a single word of Newsweek in years, nor would I be linking to it as my previous sentence does… Maybe Newsweek is OK with that. -m

Monday, February 22nd, 2010

Mark Logic User Conference 2010

Are you coming? Link. It starts on May 4 (Star Wars day!) at the InterContinental Hotel in San Francisco. Guest speakers include Chris Anderson, Editor-in-Chief of Wired, and Michelle Manafy, Editor-in-Chief of EContent magazine.

Early bird registration ends Feb 28. -m