Newest Post

September 28th, 2016

Quantified bronchitis

Lots of people are using Fitbit to get into better shape. But I haven’t heard of anyone using trackers to better measure what happens when you get sick.

As it turns out, I’ve had a nasty case of bronchitis over the last week. This is the sickest I’ve been in a while, and I find it fascinating to look at the data. I hope folks from Fitbit corporate and other interested individuals will be able to make use of the data.

First off, here’s what shortness of breath and fever do to you: they raise your heart rate.


My normal resting heart rate is in the low 60s. But as you can see here, it hardly dropped below 90, even while I was asleep. Fitbit considers a rate of 89 or above to be the “fat burn” zone, and I had an incredible 20 hours less two minutes in that zone. Interestingly, Fitbit considered my resting heart rate to be only 76 bpm. I think it uses some kind of rolling average over many days.


Here’s what the resting heart rate looked like:


And here’s the overall time in various zones.



The highest spike is the day we were just talking about. But days leading up to that had significantly raised heart rate too. After that, it dropped off fast, but the scale can be deceptive. That data point 3rd from the end is still more than four hours in the “fat burn” zone, which is a lot for lying in bed.

Next up, step counts:


I muddled through Wednesday not feeling well, but after that, activity crashed the way you’d expect when you’re spending entire days in bed.

Lastly calories:


The leftmost column is the Wednesday I muddled through, with a burn of 3,004 calories, which is pretty typical. But the next day, despite my step count dropping by 90%, Fitbit recorded a burn of almost 3,500 calories. True, this was with 14-and-a-half hours in the “fat burn” zone, but I don’t think it could make that much of a difference. This has to be a bug. Somehow, my set of physiological conditions triggered some defect in the algorithms that made it badly overestimate my burn.

[From what research I could find, running a fever of 102 indeed increases your basal metabolic rate. But not that much. It probably less-than-compensates for the decreased physical activity while sick.]

Friday, the day I had 20 hours in the “fat burn” zone, I recorded a more realistic 2438 calorie burn.

What do you think? If you have fitness tracker data from when you’re sick, I’d love to see it.

Thanks, -m

June 16th, 2016

Antennas and photons

In the previous article I described how antennas work in terms of EM waves. But EM radiation isn’t exactly a wave. Quantum aspects require modeling it as particles: photons. But I can’t really figure out how a photon traveling through space gets converted into an electron current in a wire. There are some cases where treating EM as waves really seems simpler.

But there are probably places where considering it as particles matters too. Like, maybe, the EM drive. Multiple independent tests have confirmed that this device, simply by bouncing microwaves around inside a specially-shaped resonator, produces thrust. Huh?

A new paper suggests a theoretical model where this doesn’t violate Newton’s 3rd law. But the explanation involves paired up out-of-phase photons. Are there existing technologies or experiments where this phenomenon takes place?

I wonder if it’s analogous in any way to how electrons pair up to manifest superconductivity… Would love to hear from the Physics crowd. Add your comment below.


May 8th, 2016

The antenna project

A layman’s description of how antennas work, plus some related experiments. Physics is strange when you think about it.

I’ve been working with electronics since I was about five. (Not an exaggeration. I “fixed” one of my two-battery-requiring cars with one good battery and some wire.) I sailed through high school electronics, and went on to get professional training in the field. But I’m still pretty mystified by radio waves.

So here’s my set of experiments and a written journal of my learning process. I’m going to figure out Radio, RF, antennas, wave propagation, and be able to explain it to a beginner. My goal is to create a device that harvests ambient RF from the air and illuminates (possibly intermittently) an LED.

First off, DC circuits. Most people have seen something like this, or can figure it out just by looking. (Schematic diagrams courtesy of the iCircuit simulator on OS X)


Due to a chemical reaction, the nine volt battery produces a constant potential of 9 volts. The longer side in the diagram is the + terminal. The squiggly line is a resistor, a device that slows the flow of electricity. The arrow is an LED, a light emitting diode. In this configuration, it would light up, though not terribly brightly. In a circuit like this, the wires can be treated as if they have no resistance at all, because compared to anything else in the circuit, their resistance effectively is zero. So measuring the voltage, say at the top of the battery terminal, is within a rounding error of the measurement on the left-hand side of the resistor.
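To put rough numbers on it (the post doesn’t give component values here; the 1 kΩ resistor from the next circuit and a typical ~2 V red-LED forward drop are assumed), Ohm’s law gives the current through the loop:

```python
v_supply = 9.0   # battery voltage, volts
v_led = 2.0      # typical red-LED forward drop, volts (assumed)
r = 1000.0       # current-limiting resistor, ohms (assumed)

# Ohm's law: the resistor sees the supply voltage minus the LED drop.
i = (v_supply - v_led) / r
print(i * 1000)  # 7.0 mA -- enough to light the LED, but not brightly
```

A few milliamps is visible; most indicator LEDs want 10–20 mA for full brightness, hence “not terribly brightly.”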

This circuit is static and simple. In particular, there’s no energy storage anywhere. Current is always proportional to voltage divided by resistance (Ohm’s Law). Compare with this circuit:


Same 9 volt supply and a limiting 1k resistor. But also two more branches on the right. The rightmost one is a capacitor. As the diagram indicates, it is essentially two metal plates separated by an insulator. This has the ability to store energy in the form of an electric field. Notice that the voltmeter still reads the full 9 volts, even though the switch in the lower-left is open. The battery has been disconnected from the circuit, but it’s still live. (This, by the way, is why old CRT-style TV sets can be dangerous to poke around inside, even after they’re unplugged.)

As further evidence that energy is being stored, when the circuit is first turned on, the voltmeter doesn’t instantly snap up to 9 volts. It gradually builds up, because the capacitor is charging.

The other branch in the circuit is an inductor, or a coil. This can store energy in the form of a magnetic field. (The 10 ohm resistor is to model the inevitable resistance in the windings of the coil. Without it, this simulator assumes a perfect conductor, which would throw things off a bit.)

What would happen if we closed the switch on the inductor circuit? Well, the energy from the capacitor would suddenly gush through the inductor, which would stretch out a magnetic field around the coil, like a rubber band. Then the magnetic field would collapse, sending a new burst of energy into the capacitor, and so forth. Current would slosh back and forth in the circuit until the losses through the resistor diminish the impulse. The circuit would ring like a tuning fork, though with the values here, the frequency is only around 5 cycles per second.
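That ring frequency comes from the standard LC resonance formula f = 1/(2π√(LC)). The post doesn’t list the inductor and capacitor values, so the ones below are assumptions picked to land near the ~5 Hz mentioned:

```python
import math

L = 1.0       # inductance in henries (assumed)
C = 1000e-6   # capacitance in farads (assumed)

# Resonant frequency of an LC tank circuit.
f = 1 / (2 * math.pi * math.sqrt(L * C))
print(round(f, 1))  # about 5 Hz
```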

Within the coils of the inductor, we can no longer assume that all voltages and currents will be equal, because it is interacting with the outside world through a magnetic field. Because there is more going on than simple DC, the simplifying assumption from the first circuit no longer holds. Voltage and current may not be directly proportional, because we have to account for stored energy.

So what in the world is a radio wave then? How is visible light, or an X-Ray, the same as an AM radio broadcast? We need more theory to get there.

Every wire that conducts electricity produces a magnetic field (and every moving magnetic field produces current in a conductor) like this:


(image: Wikimedia Commons)

If you put iron filings on paper, ran a wire through perpendicularly as shown here, and put a strong current through the wire, you’d get a nested donut pattern. (Incidentally, this is why inductors are coils. That shape makes all the magnetic donuts on every turn of the wire point the same way and sum together.) My electric toothbrush uses magnetic induction to charge. But this still isn’t a radio wave.

A closely related concept is an electric field. While the magnetic field is caused by current (flow), an electric field is caused by voltage (potential). Every charged particle exhibits a field like this:


(image: Wikimedia Commons)

When you combine an electric field with a magnetic field, amazing things can happen. Like in the circuit above where energy sloshed back and forth, electric and magnetic fields can support each other and propagate, that is, travel through space at the speed of light.

Probably the most straightforward math comes from a “dipole” antenna, which we can make by taking twin-lead wire and separating out the ends. This is most efficient when the whole T spans half a wavelength of the signal, a quarter wavelength per arm. (For example, the speed of light ~= 300M m/sec. So your 2.4 GHz wifi signal has a wavelength of ~ 300M / 2400M = 0.125 m, or a hair under 5 inches.)
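The arithmetic in that parenthetical, using a more precise speed of light:

```python
c = 299_792_458    # speed of light, m/s
freq = 2.4e9       # 2.4 GHz wifi signal, Hz

wavelength = c / freq          # ~0.125 m, a hair under 5 inches
half_wave = wavelength / 2     # total span of a half-wave dipole
quarter_wave = wavelength / 4  # length of each arm
print(round(wavelength, 4))
```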

If you drive this antenna at its resonant frequency, you could measure the current at various points along either side of the dipole, and you’d find different values. This isn’t a simple DC circuit! It took me a while to figure out how to visualize this, but you need to start with the current. At the extreme end of the wires, the current will always be zero. Think of a long, arterial one-way street. At the very end, there will always be no traffic, but the farther you get toward the main road, the more traffic there will be coming and going. You can visualize it like this. The black represents the physical wires of the antenna, and blue is a graph of the current at a given point along the wire.
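The standing-wave shape of that blue curve can be sketched numerically. A standard thin-antenna approximation (the function and variable names here are mine) puts the current at distance z from the center feed point at I₀·sin(k·(L/2 − |z|)): zero at the tips, maximal at the feed.

```python
import math

def dipole_current(z, length, wavelength, i0=1.0):
    """Approximate standing-wave current at position z (measured from
    the center feed point) on a thin dipole of the given total length."""
    k = 2 * math.pi / wavelength  # wavenumber
    return i0 * math.sin(k * (length / 2 - abs(z)))

wl = 0.125       # 2.4 GHz wavelength, meters
total = wl / 2   # half-wave dipole
print(dipole_current(total / 2, total, wl))  # at the very tip: 0.0
print(dipole_current(0.0, total, wl))        # at the feed point: 1.0
```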

Remember that current flow means there’s a magnetic donut circling the wire, proportional in strength to the amount of current flow.

This flow means that electric charge will accumulate, being strongest at the very ends of the wires. Which in turn means there will be an electric field that looks very much like our diagram above.


The applied signal, being a sine wave, drops to zero a quarter of a cycle later, and these electric fields, supported by the donut magnetic fields, slosh back and detach from the wire to become closed loops in space, and then proceed to propagate. The cycle repeats, switching polarity each time.


The wikipedia page for dipole antenna has some animated gifs which may or may not help you visualize this.

I welcome your feedback. Coming up next: some experiments. If you have suggestions, I’d love to hear them. -m

May 5th, 2016

Unsafe Java

Pop quiz. Why is the following Java 8 code unsafe? UPDATE: this code is fine, see comments. Still good to think about, though.

Entity e = new Entity();
e.setName("my new entity");
persistanceLayer.put(e);

To provide some context, Entity is a POJO representing something we want to store in a database. And persistanceLayer is an instance of a Data Access Object responsible for storing the object in the database. Don’t see it yet?

Here’s the function signature of that put method

<E extends AbstractEntity> CompletableFuture<E> put(E newEntity);

Ah, yes, Spring, which because of the @Async annotation will proxy this method and cause it to execute in a different thread. The object getting passed in is clearly mutable, and by my reading of the Java Memory Model, there’s no guarantee that the update that happened in the setName method will be visible to the other thread. In other words, there’s no guarantee the two threads’ ideas of “before-ness” will be the same; who knows what might have been reordered or cached?

I’m having trouble coming to terms with what seems like a gaping hole. Do you agree with this analysis? If so, how do you deal with it? Immutable object classes seem like an obvious choice, but will introduce lots of overhead in other parts of the code that need to manipulate the entities. A builder pattern gets awkward and verbose quickly, as it foists all that complexity onto the caller.

How do you handle situations like this? I’d love to hear it.

If you produce libraries, please include documentation on what is and isn’t guaranteed w.r.t. async calls. -m

April 24th, 2016

The physics of the impossibly tiny

According to Newtonian gravitation, the attraction between two bodies is proportional to the product of their masses and inversely proportional to the square of the distance between them. Einstein refined this somewhat, but as long as there aren’t crazy speeds or non-flattish spacetime involved, Newton’s formulation is accurate. As far as we know.

I read this interesting article, which spun off the thought experiment below. The EM Drive mechanism keeps confounding some Really Smart people who would genuinely like to discredit it. What’s going on here? I’m probably butchering the explanation, but the article posits that very small accelerations don’t behave in a completely smooth manner. In other words, they’re subject to quantum effects.

Here’s a thought experiment. What is the Newtonian gravitational attraction between this hydrogen atom in my little finger, and one in, say, the Andromeda galaxy? (Pedantically, one that was in the Andromeda galaxy 2.5 million years ago, from the Earth’s frame of reference)

I put this into Wolfram Alpha (which helpfully supplied a value for the big-G constant) and came up with an answer of:

1.419 × 10^-109 N (newtons)

That’s not a typo. There are 109 places to the right of the decimal.
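The order of magnitude is easy to check with rounded constants (the exact digits depend on the values Wolfram Alpha used; this rough pass lands within a factor of a few of the figure above):

```python
G = 6.674e-11    # gravitational constant, N*m^2/kg^2
m_h = 1.674e-27  # mass of a hydrogen atom, kg
ly = 9.461e15    # one light-year, meters
r = 2.5e6 * ly   # distance to Andromeda, ~2.5 million light-years

# Newton: F = G * m1 * m2 / r^2
F = G * m_h**2 / r**2
print(F)  # a few times 10^-109 newtons
```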

I don’t even have a good analogy for how slight that force is. A trillionth of a trillionth of the seismic energy of a dandelion seed landing, measured from a trillion km away?? Still too much, by a lot. It’s a reasonable question whether the universe even goes to that level of detail. Could such a slight force even be said to exist in a meaningful sense? In other words, could it be measured, even in principle? I’m no expert, but I doubt it. So maybe physics isn’t completely smooth down to that level.

Is there a limit to how small a force can be and still act like a force?

To accurately compute the complete gravitational effect on this hydrogen atom in my little finger, I’d have to take into account the ~10^80 particles in the observable universe, the vast majority of which make unmeasurably tiny contributions to the overall sum. That’s for a single instant. For a continuous number, I’d need to repeat this enormous calculation something like 10^43 times per second. So all you universe simulator writers out there, take note. Some simplification is probably warranted. :)

(A realistic simulator would only need to take into account the mass of the Earth, Moon, Sun, and for a few decimal places more, Jupiter. But I’m talking about laws running the universe, not engineering hacks.)

It seems there are still some quite interesting things left to discover in the universe. I keep going back to the chapter Surprises from the Real World in Lee Smolin’s Trouble With Physics. Exciting times ahead! -m

January 11th, 2016

Re-creating the Semantics Demo

(From the archives: I wrote this over 2 years ago, but never hit publish. At last, the tale can be told!)

If you haven’t seen it, the keynote at MarkLogic World 2013 is worth a look. I was on stage demonstrating new Semantics features built into MarkLogic server. Two of the three demos were based on MarkMail, a database of some 60 million messages, with enhanced search capabilities driven by semantics. (The third demo was a built-from-the-ground-up semantic application).

Since then, several folks have asked about the code behind the demo. What I showed was a fully operational MarkMail instance, including millions of email messages. This was understandably quite expensive to keep up on AWS, and it went away shortly after the keynote. A huge part of the demo was showing operation at scale, but reading between the lines, what folks are more interested in is something more portable–a way to see the code in operation and play with it without having to stand up an entire cluster or go through a lengthy setup procedure.

Space won’t allow for a full semantics tutorial here. For that, a good resource is this free tutorial from MarkLogic University.

So, in this posting, let’s recreate something on a similar level using built-in features. We’ll use the Oscars sample application that ships with the product. To get started, create an Application Builder sample project and deploy it. We’ll call the relevant database names ‘oscar’ and ‘oscar-modules’ throughout. Since Application Builder ships with only a small amount of data, you may also want to run the sample Information Studio collector that will fetch the rest of the dataset.

Before we can query, we need to actually turn on the semantics index. The easiest place to do this is on the page at http://localhost:8000/appservices/. Select the oscar database and hit configure. On the page that comes up, tick the box for Semantics and wait for the yellow flash.

Semantic Data

This wouldn’t be much of a semantic application without triple data. Entire books have been written on this kind of data modeling, but one huge advantage of semantics is that there’s lots of data already set up and ready to go. We’ll use dbpedia. The most recent release as of this writing is version 3.9.

From there we’ll grab data that looks relevant to the Oscar application: anything about people and/or movies, picking and choosing from things most likely to have relevant facts:

In all, a bit over 38 million triples–not even enough to make MarkLogic break a sweat, but still a large enough chunk to be inconvenient to download. The oscar data and dbpedia ultimately derive from the same source–Wikipedia itself–and since the oscar data preserved URLs, it was straightforward to extract all triples that had a matching subject, once prefixed with “”.

I extracted all these triples: Grab them from here and put it somewhere on your local system.

Then simply load these triples via query console. Point the target database to ‘oscar’ and run this:

import module namespace sem=""
  at "MarkLogic/semantics.xqy";

Infopanel widget

An ‘infopanel’ is the widget that, in the MLW demo, showed the Hadoop logo, committers, downloads, and other facts about the current query. The default oscar app already has something like this: widgets. Let’s create a new widget type that looks up and displays facts about the current query. To start, if you haven’t already, build the example application in App Builder. There’s some excellent documentation that walks through this process.

Put on your Front End Dev hat and let’s build a widget. All the code we will use and modify is in the oscar-modules database, so either hook up a WebDAV server or copy the files out to your filesystem to work on them. Back in AppBuilder on the Assemble page, click the small X at the upper-right corner of the pie chart widget. This will clear space for the widget we’re about to create, specifically in the div <div id="widget-2" class="widget widget-slot">.

The way to do this is to modify the file application/custom/app-config.js. All changes to files in the custom/ directory will survive a redeployment in AppBuilder, which means your changes will be safe, even if you need to go back and change things in Application Builder.

function infocb(dat) {
  $("#widget-2").html("<h2>Infopanel</h2><p>The query is " +
      JSON.stringify(dat.query) + "</p>");
}

var infopanel = ML.createWidget($("#widget-2"), infocb, null, null);

This gives us the bare minimum possible widget. Now all that’s left is to add semantics.

Hooking up the Infopanel query

We need a semantic query, the shape of which is: “starting with a string, find the matching concept, and from that concept return lots of facts to sift through later”.

And we have everything we need at hand with MarkLogic 7. The REST endpoint, already part of the deployed app, includes a SPARQL endpoint. So we need to make the new widget fire off a semantic query in the SPARQL language, then render the results into the widget. One nice thing about the triples in use here is that they consistently use the foaf:name property to map between a concept and its string label. So pulling all the triples based on a string-named topic works like this. Again, we’ll use Query Console to experiment:

import module namespace sem = ""
    at "/MarkLogic/semantics.xqy";
let $str := "Zorba the Greek"
let $sparql := "
prefix foaf: <>
construct { ?topic ?p ?o }
where
{ ?topic foaf:name $str .
?topic ?p ?o . }
"
return sem:sparql($sparql, map:entry("str", $str))

Here, of course, to make this Query Console runnable we are passing in a hard-coded string (“Zorba the Greek”) but in the infopanel this will come from the query.

Of course, deciding what parts of the query to use could be quite an involved process. For example, if the query included [decade:1980s] you can imagine all kinds of interesting semantic queries that might produce useful and interesting results. But to keep things simple, we will look for only a single-word query, which includes quoted phrases like “Orson Welles”. Also in the name of simplicity, the code sample will only use a few possible predicates. Choosing which predicates to use, and in what order to display them, is a big part of making an infopanel useful.

Here’s the code. Put this in config/app-config.js:

function infocb(dat) {
  var qtxt = dat.query && dat.query["word-query"] &&
        dat.query["word-query"][0] && dat.query["word-query"][0].text;
  if (qtxt) {
    $.ajax({
      url: "/v1/graphs/sparql",
      accepts: { json: "application/rdf+json" },
      dataType: "json",
      data: { query:
        'prefix foaf: <> ' +
        'construct { ?topic ?p ?o } ' +
        'where ' +
        '{ ?topic foaf:name "' + qtxt + '"@en . ' +
        '?topic ?p ?o . }'
      },
      success: function(data) {
        var subj = Object.keys(data); // ECMAScript 5th ed, IE9+
        var ptitle = "";
        var pdesc = "";
        var pthumb = "";
        var title = "-";
        var desc = "";
        var thumb = "";
        if (data[subj]) {
          if (data[subj][ptitle]) {
            title = data[subj][ptitle][0].value;
          }
          if (data[subj][pdesc]) {
            desc = "<p>" + data[subj][pdesc][0].value + "</p>";
          }
          if (data[subj][pthumb]) {
            thumb = "<img style='width:150px; height:150px' src='" +
                data[subj][pthumb][0].value + "'/>";
          }
        }
        $("#widget-2").html("<h2>" + title + "</h2>" + desc + thumb);
      }
    });
  } else { $("#widget-2").html("no data"); }
}

var infopanel = ML.createWidget($("#widget-2"), infocb, null, null);

This works by crafting a SPARQL query and sending it off to the server. The response comes back in RDF/JSON format, with the subject as a root object in the JSON, and each predicate against that subject as a sub-object. The code looks through the predicates and picks out interesting information for the infopanel, formatting it as HTML.

I noted in working on this that many of the images referenced in the dbpedia image dataset actually return 404 on the web. If you are not seeing thumbnail images for some queries, this may be why. An infopanel implementation can only be as helpful as the data underneath. If anyone knows of more recent data than the official dbpedia 3.9 data, do let me know.

Where to go from here

I hope this provides a base upon which many developers can play and experiment. Any kind of app, but especially a semantic app, comes about through an iterative process. There’s a lot of room for expansion in these techniques. Algorithms to select and present semantic data can get quite involved; this only scratches the surface.

The other gem in here is the widget framework, which has actually been part of all Application Builder apps since MarkLogic 6. Having that technology as a backdrop made it far easier to zoom in and focus on the semantic technology. Try it out, and let me know in the comments how it works for you.

January 11th, 2016

Geektastic Things

I am trying something new with the GeekThoughts domain. Instead of pointing to my blog, it’s pointing at some cool geeky things on a CMS that’s easier to update. Won’t you check it out?

January 2nd, 2016

Pixel editor

I did a thing.

I am experimenting with machine learning and neural networks. To do so, I need a real-world dataset to play with. For starters, I am using a 5×7 pixel array, as is common in many DIY projects, representing digits 0-9. Please help me by drawing a picture of the randomly-selected digit below. All data will be released to the public domain.

Can you help me? Take a few seconds and draw a picture of a number by clicking on pixels to toggle their value.

Email me with any questions or suggestions.


Data collection by Google Custom Forms.

Machine learning datasets FTW! Stay tuned for code & results.

October 12th, 2015

The Vector Sum theory of leadership

I’ve talked about this before, but I don’t think I’ve ever written it down. As more of my day-to-day involves leadership, I think about this stuff. To run an effective team, you need to think in vector sums.

As the great poet A. Yankovic once said, “Do vector calculus just for fun.” But this isn’t even calculus, just trig. :) Let’s see if we can avoid flashbacks to high school math. You can think of a vector as a value that has both a magnitude and a direction, like the wind blowing 10 mph to the NW.

Every team member is a vector. The magnitude is how much stuff that person can get done, and the direction is what they see as the end-point of their work. The goal. Unlike a wind report, the direction might consist of many different dimensions, but the basic principles still hold.

What happens when your team grows to two people?

Vectors can add. Geometrically, you can think of a vector as an arrow with a particular length and orientation. Adding is stacking two (or more) arrows tail-to-head. So two vectors opposed by 180 degrees will tend to cancel each other out. Two vectors pointing exactly the same direction reinforce.

But here’s the great thing: two vectors that mostly point in the same direction give almost as much benefit as if they were exactly aligned.

Example: Team member A is pulling NW at 10 mph. Team member B is pulling NE at 10 mph. If you treat this as a right triangle, you get approximately:

Team member A contributions: 7 mph N, 7 mph W

Team member B contributions: 7 mph N, 7 mph E

The E and W components fight against each other, and you end up with 14 mph due north. Even though the team members are pointing in very different directions, they are still around 70% effective in combination. Of course, the job of a manager is to establish and communicate goals effectively to better align team members and prevent rework. Herein lies efficiency.

For example, if my math is right, in the case where team member A is pulling NNW and B is pulling NNE, the efficiency jumps from 70% to 92%.
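Both of those percentages fall out of basic trig: two equal-magnitude contributors whose directions differ by an angle θ combine at cos(θ/2) of their raw sum. A quick sketch (`team_efficiency` is my name for it):

```python
import math

def team_efficiency(angle_between_deg):
    """Fraction of combined effort retained when two equal-magnitude
    contributors diverge by the given angle (0 degrees = aligned)."""
    return math.cos(math.radians(angle_between_deg) / 2)

print(round(team_efficiency(90), 2))  # NW vs. NE: 0.71
print(round(team_efficiency(45), 2))  # NNW vs. NNE: 0.92
```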

Many managers fall into the trap of micro-managing, which is missing the point. Hire smart people, make sure they’re pointed in the right direction and let them run ahead as fast as they can. Stay just enough ahead of them to remove obstacles before they encounter them. That’s a great leader.

Bonus epiphany

I recently realized that this principle also applies to the thoughts in your head. No, I’m not talking about an after-school-special multiple-personality situation. But thousands of thoughts rattle through your mind every day, and each one has a magnitude and a direction. You need to get them all pointing in the same direction, more-or-less, to be effective as a human being.

This is a restatement of a common concept that goes under many names such as “Personal brand.” Is everything you think about (and subsequently do) helping make you into who you want to be?

May 2nd, 2015

Before your standup, sit down

I’m running a Scrum project, and doing things a little differently than the classic method. I’ve done this at two different companies now, and it seems to work out well.

Before a traditional standup meeting with The Three Questions, I schedule 15 minutes of intentional downtime. Since this happens at the start of the day, it’s a chance to quietly plan out your day, and start on a thoughtful note.

I’ve often found myself “preparing” for a standup meeting by quickly scribbling down what I’ve done on a sticky note, so that when called upon, I can know what to say without having to think too hard. So why not formalize this? Work life is hectic enough. A few minutes of quiet reflection can make a big difference. Knowing what you need to work on is the first step to avoiding thrashing. Since we use a software tool to track sprint work, capturing the day’s list in this app is a fine thing to do during this time.

One caution: it’s easy for this to slip into extra-minutes-for-commute time, or just a regular standup meeting that starts 15 minutes later. To get the benefit, the team needs to be present (but not necessarily in the meeting room).

Has anyone else tried something like this? -m

March 18th, 2015

Fiction update

Quick update here: if you are reading this, you’d probably like this short story, named in honor of Dennis Ritchie, FREE and currently burning up the charts for 30 minute reads in Mystery, Thriller, & Suspense. Doing pretty well in Science Fiction and Cyberpunk as well. (link fixed)

Do a solid for readers everywhere and leave a review.

Coming soon is the prequel to this story. It will be free only to folks who jump on the author mailing list.

We now return to your regularly-scheduled nonfictional geekery.


February 5th, 2015

DIY fizzy yerba mate drink

Filed under need-to-try this: A homebrew version of a popular hacker drink called Club Mate.

I already have a carbonator cap and CO2 setup, as part of my beer brewing hardware.

One variation I would experiment with is cutting down on the sugar. Even though Club Mate isn’t very sweet, it still has a fair amount of sugar in it. Stevia could be a good alternative. -m

July 27th, 2014

Prime Number sieve in Scala

There are a number of sieve algorithms that can be used to list prime numbers up to a certain value. I came up with this implementation in Scala. I rather like it, as it makes no use of division or modulus, and only one (explicit) multiplication.

Despite being in Scala, it’s not in a functional style. It uses the awesome mutable BitSet data structure which is very efficient in space and time. It is intrinsically ordered and it allows an iterator, which makes jumping to the next known prime easy. Constructing a BitSet from a for comprehension is also easy with breakOut.

The basic approach is to start with a large BitSet filled with all odd numbers (and 2), then iterate through the BitSet, constructing a new BitSet containing numbers to be crossed off, which is easily done with the &~= (and-not) reassignment method. Since this is a logical bitwise operation, it’s blazing fast. This code takes longer to compile than run on my oldish MacBook Air.

import scala.collection.mutable.BitSet
import scala.collection.breakOut
println(new java.util.Date())
val top = 200000
val sieve = BitSet(2)
sieve |= (3 to top by 2).map(identity)(breakOut)
val iter = sieve.iteratorFrom(3)
while (iter.hasNext) {
  val n = iter.next
  sieve &~= (2 * n to top by n).map(identity)(breakOut)
}
println(sieve.toIndexedSeq(10000)) // 0-based
println(new java.util.Date())

As written here, it’s a solution to Euler #7, but it could be made even faster for more general use.

For example

  • I used a hard-coded top value (which is fine when you need to locate all primes up to n). For finding the nth prime, though, the top limit could be calculated
  • I could stop iterating at sqrt(top)
  • I could construct the removal BitSet starting at n*n rather than n*2
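The second and third optimizations can be sketched out directly (here in Python, for brevity; `primes_up_to` is my name for it):

```python
import math

def primes_up_to(top):
    """Sieve of Eratosthenes that stops iterating at sqrt(top) and
    starts crossing off each prime's multiples at n*n."""
    is_prime = bytearray([1]) * (top + 1)
    is_prime[0:2] = b"\x00\x00"  # 0 and 1 are not prime
    for n in range(2, math.isqrt(top) + 1):
        if is_prime[n]:
            # Multiples below n*n were already removed by smaller primes.
            is_prime[n * n : top + 1 : n] = bytes(len(range(n * n, top + 1, n)))
    return [i for i in range(top + 1) if is_prime[i]]

print(primes_up_to(200000)[10000])  # 104743, matching the Scala version
```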

I suspect that spending some time in the profiler could make this even faster. So take this as an example of the power of Scala, and a reminder that sometimes a non-FP solution can be valid too. Does anyone have a FP equivalent to this that doesn’t make my head hurt? :-)


March 14th, 2014

I am not a robot spammer

Based on the huge number of mail bounces I’ve been getting today, it looks like an unscrupulous somebody forged my return address on a bunch of mail. Perhaps you even sought out this blog based on the distinctive domain name.

Some subject lines in use:

it’s so nice to write to u

maybe your lady


It is me!

It wasn’t me. It’s all too easy to claim an email is from somebody. And putting an unsuspecting schmo through all this apparently makes the message 0.003% more likely to get through filters deliberately trying to block abusive behavior.

And don’t worry: I haven’t (yet) heard from any really irate people. It’s mostly automated bounces from when an email address on the spammer’s list no longer exists. Carry on. -m

February 18th, 2014


I can’t blog about secret projects I’m working on, so how about something completely different?

I’ve improved my fitness level substantially over the last five years. (On index cards, I have my daily weight and body fat percentage, according to the bathroom scale, back to November 2009.) Here are some things I’ve learned:

  • Moving counts. A lot. The difference between being completely sedentary and moving a bit (easy walks, standing desk, etc.) is the biggest leap. Everything after that is incremental.
  • Spending $99 on a Fitbit is the best health investment I’ve made, dollar-for-dollar, ever.
  • Expensive shoes don’t help much. My current main shoes were $40 online, and they’re just as good, if not better, than the $120 shoes from Roadrunner.
  • Pilates looks easy if you’ve never tried it.
  • Once you reach a certain level, you will plateau there unless you challenge yourself further.
  • Strength training is helpful for just about everything, even improving your running times.
  • Foam rollers are super useful for managing sore muscles and tendons. Highly recommended.
  • Boosting your VO2Max is painful–interval training is the gasping-for-air kind of torture many people think of when they hear the word ‘exercise’–but it’s also important if you want to improve your run times.
  • But you shouldn’t try to improve your run times or anything else unless you have specific bigger-picture goals in mind.
  • Seriously–sitting is terrible for you. Get a standing desk.

Invest in yourself. -m

October 29th, 2013

Skunklink a decade later

Alex Milowski asks on Twitter about my thoughts on Skunklink, now a decade old.

Linking has long been thought one of the cornerstones of the web, and thereby a key part of XML and related syntaxes. It’s also been frustratingly difficult to get right. XLink in particular once showed great promise, but when it came down to concrete syntax, didn’t get very far. My thinking at the time is still well-reflected in what is, to my knowledge, the only fiction ever published on A Hyperlink Offering. That story ends on a hopeful note, and a decade out, I’m still hoping.

For what it purports to do, Skunklink still seems like a good solution to me. It’s easy to explain. The notion of encoding the author’s intent, then letting devices work out the details, possibly with the aid of stylesheets and other such tools, is the right way to tackle this kind of a problem. Smaller specifications like Skunklink would be a welcome breath of fresh air.

But a bigger question lurks behind the scenes: that of requirements. Does the world need a vocabulary-independent linking mechanism? The empirical answer is clearly ‘no’ since existing approaches have not gained anything like widespread use, and only a few voices in the wilderness even see this as a problem. In fact, HTML5 has gone in quite the opposite direction, rejecting the notion of even a vocabulary-independent syntax, to say nothing of higher layers like intent. I have to admit this mystifies me.

That said, it seems like the attribute name ‘href’ has done pretty well in representing intended hyperlinks. The name ‘src’ not quite as well. I still consider it best practice to use these names instead of making something else up.

What do you think? -m


October 14th, 2013


If you’ve come here because of something you noticed in your HTTP access logs, read on.

Who is doing this? This is a personal project of Micah Dubinko. It is completely separate from anything related to any employer.

What is ASLbot? In the immediate future, ASLbot is no more than a personal research project. It consists of a web crawler, like Google, with an emphasis on sites centered around American Sign Language, and in particular reference materials relating to particular signs. At the moment, there is no publicly available search site, but I would like to set that up as time allows. My long term goal is to promote ASL as an effective means of communication while at the same time making it easier to research and learn about.

Will this affect my site? No. I have the crawl settings turned down very low, so that sites crawled have no discernible impact on performance. I also crawl very infrequently, as ASL dictionaries don’t tend to change terribly often. Once a search site is operating, you may notice an increase in traffic as more people are able to find and visit your site.

What do you intend to do with the crawled data? First off, this is a technology experiment. I’ve noticed that Google/Bing/Yahoo do only an “OK” job on queries like “asl sign for awesome” and think a dedicated site can do better. Once the basics are up, I’d like to do a lot more, but this will necessarily take a long time, as this is not my full-time work. For example, I would like to (possibly with manual input, especially from native signers) categorize signs by handshape, position, and movement in a manner similar to William Stokoe’s groundbreaking research on ASL linguistics. Keep in mind that this, if it happens at all, is far in the future—imagine someone searching for “M handshape shoulder” and getting a list of hits that link to existing ASL dictionaries.

Do you plan to charge money to access the site? Never.

Do you automatically download videos? No. Only web pages.

How do I make it stop? Think of it this way: Does your site appear in Google? If so, people will be searching and finding particular signs anyway, but without the aid of an ASL-positive web tool. But if you really want to, put an entry for “ASLbot” in your robots.txt file, which this crawler fully honors.
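For reference, a minimal opt-out entry would look like this (standard robots.txt syntax, matching the ASLbot user-agent named above):

```
User-agent: ASLbot
Disallow: /
```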

This is awesome, how do I help?  Or, I still have questions: Feel free to email me using the contact information listed on this site, or ( <my first name> @ <this> )

August 10th, 2013

XForms in 2013

This year’s Balisage conference was preceded by the international symposium on Native XML User Interfaces, which naturally enough centered around XForms.

As someone who’s written multiple articles surveying XForms implementations, I have to say that it’s fantastic to finally see one break out of the pack. Nearly every demo I saw in Montreal used XSLTForms if it used XForms at all. And yet, one participant I conversed with afterwards noted that very little that transpired at the symposium couldn’t have been done ten years ago.

It’s safe to say I have mixed emotions about XForms. On one hand, watching how poorly the browser makers have treated all things XML, I sometimes muse about what it would look like if we started fresh today. If we were starting anew, a namespace-free specification might be a possibility. But with XForms 2.0 around the corner, it’s probably more fruitful to muse about implementations. Even though XSLTForms is awesome, I still want more. :-)

  • A stronger JavaScript interface. It needs to be possible to incrementally retrofit an existing page using POHF (plain old HTML forms) toward using XForms in whole or in part. We need an obvious mapping from XForms internals to HTML form controls.
  • Better default UI. I still see InfoPath as the leader here. Things designed in that software just look fantastic, even if quickly tossed together.
  • Combining the previous two bullets, the UI needs to be more customizable, and customization needs to be easier. It needs to be utterly straightforward to make XForms parts of pages fit in with non-XForms parts of pages.
  • Rich text: despite several assertions during the week, XForms can actually handle mixed text, just not very well. One of the first demo apps (in the DENG engine–remember that?) was an HTML editor. The spec is explicitly designed in such a way as to allow new and exciting forms widgets, and a mixed-content widget would be A Big Deal, if done well.
  • Moar debugging tools

During the main conference, Michael Kay demonstrated Saxon-CE, an impressive tour-de-force in routing around the damage that is browser vendors’ attitudes toward XML. And though he didn’t make a big deal of it, it’s now available freely under an open source license. This just might change everything.

Curious about what others think here–I welcome your comments.


May 20th, 2013

Five years at MarkLogic

This past weekend marked my five-year anniversary at MarkLogic. It’s been a fun ride, and I’m proud of how much I’ve accomplished.

It was the technology that originally caught my interest: I saw the MarkMail demo at an XML conference, and one thing led to another. The company was looking to expand the product beyond the core database–they had plans for something called a “utility layer” though in reality it was neither a utility nor a separate layer. It started with Search API, though the very first piece of code I wrote was an RDFa parser.

But what’s really held my interest for these years is a truly unmatched set of peers. This place is brimming with brilliant minds, and that keeps me smiling every day on my way in to work.

Which leads my thoughts back to semantics again. This push in a new direction has a lot of echoes with the events that originally brought me on board. This is going to be huge, and will move the company in a new direction. Stay tuned. -m

April 13th, 2013


This week marked the MarkLogic World conference and with it some exciting news. Without formally “announcing” a new release, the company showed off a great deal of semantic technology in-progress. Part of that came from me, on stage during the Wednesday technical keynote. I’ve been at MarkLogic five years next month, and the first piece of code I wrote there was an RDFa parser. This has been a long time coming.

It was an amazing experience. I was responsible for sifting through the huge amounts of public data–both in RDF formats and on public web pages–and writing the semantic code to pull everything together, culminating in those ten minutes on stage.

Picture this: just behind the big stage and the projected screens was a hive of impressive activity. I counted 8 A/V people backstage, plus 4 more at the back of the auditorium. The conference has reached a level of production values that wouldn’t be vastly different if it were a stadium affair. So in back there’s a curtained-off “green room” with some higher-grade snacks (think PowerBars and Red Bull) with a flatscreen that shows the stage. From back there you can’t see the projected slides or demos, but if you step just outside, you’re at the reverse side of the screen, larger-than-life. The narrow walkway leads to the “chute”, right up the steps onto the main stage. As David Gorbet went through the opening moments of his talk in fine form, I did some stretches and did everything I could think of to prepare myself.

Then he called me up and the music blasted out from the speakers. I had been playing through my mind all the nightmare scenarios–tripping on the stairs and falling on my face as I come onstage (etc.)–but none of that happened. I’ve done public speaking many times before so I had an idea what to expect, though on a stage like that the lights are so bright that it’s hard to see beyond about the third row. So despite the 300-400 people in the room, it didn’t even feel much different than addressing an intimate group of peers. It was fun. On with the demos:

The first showed our internal MarkMail cluster with a simple ‘infobox’ of the sort that all the search engines are doing these days. This was an icebreaker to talk about semantics and how it works–in this case locate the concept of Hadoop in the database, and from there find all the related labels, abstracts, people, projects, releases, and so on. During the construction of the demo, we uncovered some real world facts about the author of the top-ranked message for the query, including a book he wrote. The net effect was that these additional facts made the results a lot more useful by providing a broader context for them.

The second demo showed improved recall–that is, finding things that would otherwise slip under the radar. The existing [from:IBM] query in MarkMail does a good job finding people that happen to have the letters i-b-m in their email address. The semantic query [affiliation:IBM], in contrast, knows about the concept of IBM, the concept of people, and the relationship of is-affiliated-with (technically foaf:affiliation) to run a query that more closely models how a person would ask the question: “people that work for IBM” as opposed to “people that have i-b-m in their email address”. Thus the results included folks posting from gmail accounts and other personal addresses, and the result set jumped from about 277k messages to 280k messages.

At this point, a pause to talk about the architecture underlying the technology. It turns out that a system that already supports shared-nothing scale out, full ACID transactions, multiple HA/DR options, and a robust security model is a good starting point for building semantic capabilities. (I got so excited at this point that I forgot to use the clicker for a few beats and had to quickly catch up the slides.) SPARQL code on the screen.

Then the third demo, a classic semantic app with a twist. Pulling together triples from several different public vocabularies, we answered the question of “find a Hadoop expert” with each row of the results representing not a document, as in MarkMail results, but an actual person. We showed location data (which was actually randomized to avoid privacy concerns) and aggregate cost-of-living data for each city. When we added in a search term, we drew histograms of MarkMail message traffic over time and skipped over the result that had no messages. The audience was entranced.

This is exciting work. I had several folks come up to me afterwards with words to the effect that they hadn’t realized it before, but boy do they ever need semantics. I can’t think of a better barometer for a technical keynote. So back to work I go. There’s a lot to do.

Thanking by name is dangerous, because inevitably people get left out, but I would like to shout out to David Gorbet who ran the keynote, John Snelson who’s a co-conspirator in the development effort, Eric Bloch who helped with the MarkMail code more than anyone will ever know, Denis Shehan who was instrumental in wrangling the cloud and data, and Stephen Buxton who patiently and repeatedly offered feedback that helped sharpen the message.

I’ll post a pointer to the video when it’s available. -m

March 31st, 2013

Introducing node-node:node.node

Naming is hard to do well, almost as hard as designing good software in the first place. Take, for instance, the term ‘node’, which depending on the context can mean:

  1. A fundamental unit of the DOM (Document Object Model) used in creating rich HTML5 applications.
  2. A basic unit of the Semantic Web–a thing you can say stuff about. Some nodes are even unlabeled, and hence ‘blank nodes’.
  3. In operations, a node means, roughly, a machine on the network. E.g. “sixteen-node cluster”
  4. A software library for event-driven, asynchronous development with JavaScript.

I find myself at the forefront of a growing chorus of software architects and API designers that are fed up with this overloading of a perfectly good term. So I’m happy today to announce node-node:node.node.

The system is still in pre-alpha, but it solves all of the most pressing problems that software developers routinely run in to. In this framework, every node represents a node, for the ultimate in scalable distributed document storage. In addition, every node additionally serves as a node, which provides just enough context to make open-world assumption metadata assertions at node-node-level granularity. Using the power of Node, every node modeled as a node has instant access to other node-node:nodes. The network really is the computer. You may never write a program the old way again. Follow my progress on Sourceforge, the latest and most cutting-edge social code-sharing site. -m

March 1st, 2013


The valley is buzzing about Marissa’s edict putting the kibosh on Yahoos working from home. I don’t have any first-hand information, but apparently this applies somewhat even to one-day-a-week telecommuters. Some are saying Marissa’s making a mistake, but I don’t think so. She’s too smart for that. There’s no better way to get extra hours of work out of a motivated A-lister than letting them skip the commute, and I work regularly with several full-time telecommuters. It works out just fine.

This is a sign that Y is still infested with slackers. From what I’ve seen, a B-or-C-lister will ruthlessly take advantage of a WFH policy. If that dries up, they’ll move on.

If I’m right, the policy will indeed go into effect at Yahoo starting this summer, and after a respectable amount of time has passed (and the slackers leave) it will loosen up again. And Yahoo will be much stronger for it. Agree? -m

February 18th, 2013


So I did it.

I stood up on a platform in front of a room of native signers, and delivered a (pre-prepared) five minute presentation without making a sound. In front of cameras, with my ugly face beamed out to multiple large screens.

That was stressful, though less so than many other public speaking engagements I’ve participated in. It was a different kind of stress. I’m sure I made all kinds of mistakes of which I wasn’t even aware. ASL books, videos, and web sites tend to focus on particular signs, and vocabulary is one important part of learning the language–but not the only part. A huge amount of the communication comes through facial expression, body shifting and language, and other “non-manual markers.” I’m learning, if slowly.

It’s also helping me in everyday situations, among hearing folks. I’m better able to express myself and I’ve picked up some new gestures (like non-dominant-hand indexing…more on that later), and I tend to, even if in the back of my mind, think about how you’d express such-and-such an idea in ASL, and having thought it through more, better express it in writing or speech.

It’s also helping to finally tame my inner introvert. When a fundamental part of communication involves displaying play-by-play emotions on your face (and indeed, entire body) it changes you. Better than acting lessons.

What have you done lately to push yourself out of your comfort zone? -m

December 31st, 2012

New Year’s Resolution

Holding steady at 1440 x 900.

Relevant. -m

December 25th, 2012


My journey into ASL continues. I’ve been reading Oliver Sacks’s _Seeing Voices_ and Harlan Lane, Robert Hoffmeister, and Ben Bahan’s _A Journey into the DEAF-WORLD_. In short, learning a language in your thirties is a whole different ballgame than learning as a toddler. There are a few different brain-plasticity cliffs you drop off, especially at around age 6 and again at age 12.

And I’m completely OK with this. I don’t expect to ever get confused for a native signer, which is fine with me. I do expect, however, to become a better communicator–to develop sufficient skill to be clearly understood in ASL. I prefer to think of it like someone with a suave British accent in America. You’d never mistake them for a native and yet they are a joy to converse with. In the right circumstances, they can even grab your attention more so than someone with a native accent.

This can only do good things for my spoken communication skills as well. It’s a lot like acting classes in some respects, which is a marked departure from my normally taciturn personality. This is encouraging me to quit holding everything inside quite so much, with encouraging results. If you see me walking a little taller, speaking a bit more emphatically, or better conveying emotion to get my point across, now you know what’s behind that. -m

December 8th, 2012


I’ve been learning a new language lately: American Sign Language aka ASL. Along with the language, I’ve picked up lots of new friends as part of a thriving culture. A big part of learning is through mistakes, and a big part of said culture is helpful bluntness. The combination of these can be a little rough on your ego sometimes.

Sometimes I notice that, when I’m corrected–say I make a sign incorrectly and my conversational partner demonstrates the correct way to do it–I often can’t tell any difference between what I was supposed to do and what my hands actually did. This kind of fundamental error in cognition seems to happen all the time with me. My helpful friends tell me that’s a good sign. (no pun intended)

A less-bruising kind of error is the “oops” kind–the instant you commit the error, you know you messed up. This, however, can sometimes throw you off if you get self-conscious about it. A third kind of error is when you know exactly what to do, but your physiology holds you back–for instance the ASL sign for either ‘6’ or ‘W’ (made the way most hearing people show a ‘3’ on their fingers; thumb holding down the pinky) is difficult for me to make without slowing way down. And to think, only 13 years ago I was playing keyboards in a little garage band. Guess I need some stretches. It’s good to loosen up.

In ASL, though, there’s a weird kind of middle ground. Sometimes people who don’t know Spanish kind of ‘fake it’ — “Yo no speako español” and the like, which has always come across to me as vaguely offensive. Being overly terrified of making a mistake is itself a fourth kind of mistake. ASL is remarkably flexible; even though it’s a complete language, it has aspects based on pantomime and sometimes “classifiers”, where your hands and fingers can stand in for people, vehicles, or many other things of particular shapes/sizes. I watch some very well-made ASL productions that have equally well-made English paragraphs alongside, and the ASL version uses all of these techniques and more. No word-for-word correspondence here: every time, I’m surprised by the versatility of the language. My theory is that for an earnest student, it’d be a lot harder to accidentally come across as offensive or mocking the language in ASL than in a spoken language. And thus, I’m probably committing the fourth kind of error too much.

It’s good to loosen up. -m

November 20th, 2012

Hedgehogs and Foxes

In Nate Silver’s new book, he mentions a classification system for experts, originally from Berkeley professor Philip Tetlock, along a spectrum of Fox <—> Hedgehog. (The nomenclature comes from Isaiah Berlin’s essay about Tolstoy.)

Hedgehogs are type A personalities who believe in Big Ideas. They are ideologues and go “all-in” on whatever they’re espousing. A great many pundits fall into this category.

Foxes are scrappy creatures who believe in a plethora of little ideas and in taking different approaches toward a problem, and are more tolerant of nuance, uncertainty, complexity, and dissent.

There are a lot of social situations (broadly construed) where hedgehogs seem to have the upper hand. Talking heads on TV are a huge example, but so are many fixtures in the tech world, Malcolm Gladwell, say. Most of the places I’ve worked at have at least a subtle hedgehog-bias toward hiring, promotions, and career development.

To some degree, I think this stems from a lack of self-awareness. Brash pundits come across better on the big screen; they grab your attention and take a bold stand for something–who wouldn’t like that? But if you pause and think about what they’re saying or (horror) go back and measure their predictions after the fact, they don’t look nearly so good. Foxes are better at getting things right.

It seems like we’ve just been through a phase of more-obnoxious-than-usual punditry, and I found this spectrum a useful way to look at things. How about you? Are you paying more attention to hedgehogs when you probably should be listening to the foxes?


September 28th, 2012

Virgil Matheson: mentor

I’ve mentioned Virgil Matheson in these pages a few times, but never made a full accounting. When I had my O’Reilly book published, I submitted a simple dedication in the manuscript:

for Virgil

But for whatever reason, it didn’t make it into the printed edition. This post is a small step toward letting the world know about someone important to me.

We first met in 1985 or thereabouts. One day while riding my bike through a back-alley, I stopped to look at an equipment rack set outside a spare garage. Virgil came out to give a get-off-my-lawn kind of speech, and somehow we ended up talking about electronics. This led to discussions about crystal radios, and on a subsequent visit we built one, with him explaining the principles of operation. Virgil, it turns out, was a retired teacher at the North Dakota State School of Science, where he taught AC theory and thermodynamics. I was going through some rough times, and Virgil ended up being a much-needed role model.

Around that time, I had attempted to build a Heathkit radio set, but couldn’t quite get it working. I brought it to Virgil, and we traced through the schematic diagrams, eventually getting it working. Along the way, Virgil introduced me to all kinds of electronic test equipment, including oscilloscopes and galvanometers that he had hand-wound in his younger days.

The next year, I needed a science project, and I had become fixated on Tesla Coils. Virgil had worked at Westinghouse (but not in overlap with the good N. Tesla) and found this project right up his alley. We used his wood lathe to turn a base for the coil, and a standard lathe to wind a primary and two perfectly-spaced secondary coils on PVC pipe, after which we sprayed them down with insulating paint. We built a high-voltage power supply out of a car battery, ignition coil, and relay-type regulator from the junkyard. The thing would turn out serious spark on the primary side, and at one point, I accidentally made contact with it, knocking me clear off the metal bench I was sitting on. We used a spark gap and high-voltage capacitors from old equipment to make a resonator, and got the coil working. It could light a fluorescent tube from my full arm-span away. It was a smash hit at the science fair, too.

For one so knowledgeable about the foundations of technology, he was awfully curmudgeonly about it. He bemoaned the day students started showing up in his class with hand calculators instead of slide rules. He would never answer the phone (but would speak on it, if you could get his brother to pick up).

We kept meeting on and off, and we would have epic discussions/debates about technology, thermodynamics, perpetual motion machines, higher mathematics, theology, building test equipment, and logic puzzles. He taught me, in short, how to think.

A non-exhaustive list of things he taught me:

  • How to build a crystal radio set
  • How to troubleshoot a conventional radio (hint–check for signal at the volume control–that will narrow down the problem to either the front-end or back-end)
  • How to compute resonant LC circuits
  • How to use a slide rule
  • How to pick locks
  • How to compute power factor and plot phasor diagrams for AC circuits
  • The value of good tools and how to care for them
  • How to build a Tesla Coil
  • How to debate
  • Respect for high voltage
  • The joy of back-issues of Scientific American
  • The trouble with Pascal’s wager
  • How to debunk perpetual motion claims
  • How (and why) to use a planimeter

On a recent vacation, I went to see Virgil again–now in his 90s. He’s still vigorous and feisty, though his memory is starting to slip a little. It was difficult to come to terms with the possibility that, given the frequency with which I make it to that part of the country, it may be the last time I see him. Since this is posted online, he’ll probably never see it. But if he could speak to each one of you, I think he’d offer advice something like this:

Cherish the people in your life. Treat every meeting as if it might be the one that sets you on a new course–one that you’ll look back at years later in wonder. Don’t worry what others think of you, and never stop learning.

Thank you, Virgil, for all you’ve given me. -m

September 17th, 2012

MarkLogic 6 is here

MarkLogic 6 launched today, and it’s full of new and updated goodies. I spent some time designing the new Application Builder including the new Visualization Widgets. If you’ve used Application Builder in the past, you’ll be pleasantly surprised at the changes. It’s leaner and faster under the hood. I’d love to hear what people think of the new architecture, and how they’re using it in new and awesome ways.

If I had to pick out a common theme for the release, it’s all about expanding the appeal of the server to reach new audiences. The Java API makes working with the server feel like a native extension to the language, and the REST API makes it easy to extend the same to other languages.

XQuery support is stronger than ever. I liked Ryan Dew’s take on some of the smaller, but still useful features.

This wouldn’t be complete without thanking my teammates who really made this possible. I had the great pleasure of working with some top-notch front-end people recently, and it’s been a great experience. -m


August 23rd, 2012

Super simple tokenizer in XQuery

A lexer might seem like one of the boringest pieces of code to write, but every language brings its own little wrinkles to the problem. Elegant solutions are more work, but also more rewarding.

There is, of course, a large body of work on table-driven approaches, several of them listed here (and bigger list), though XQuery seems to have been largely left out of the fun.

In the MarkLogic Search API, we implemented a recursive tokenizer. Since a search string can contain quoted pieces which need to be carefully maintained, first we split (in the fn:tokenize sense, discarding matched delimiters) on the quote character, then iterate through the pieces. Odd-numbered pieces are chunks of tokens outside of any quoting, and even-numbered pieces are a single quoted string, to be preserved as-is. We recurse through the odd chunks, further breaking them down into individual tokens, as well as normalizing whitespace and a few other cleanup operations. This code is aggressively optimized, and it skips searches for tokens known not to appear in the overall string. It also preserves the character offset positions of each token relative to the starting string, which gets used downstream, so this makes for some of the most complicated code in the Search API. But it’s blazingly fast.
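The quote-splitting step can be sketched like this (in Scala rather than the actual XQuery; note the indexing here is 0-based, so the odd-numbered unquoted pieces described above land at even indices):

```scala
// Split a search string on the quote character, keeping empty trailing
// pieces so positions are preserved. With 0-based indexing, even-indexed
// pieces fall outside any quoting (and would be tokenized further);
// odd-indexed pieces are quoted phrases, preserved as-is.
def splitQuotes(s: String): List[(String, Boolean)] =
  s.split("\"", -1).toList.zipWithIndex.map {
    case (piece, i) => (piece, i % 2 == 1) // true = quoted phrase
  }
```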

When prototyping, it’s nice to have something simpler and more straightforward. So I came up with an approach using fn:analyze-string. This function, introduced in XSLT 2.0 and later ported to XQuery 3.0, takes a regular expression and returns all of the target string, neatly divided into match and non-match portions. This is great, but difficult to apply across the entire string. For example, potential matches can have different meanings depending on where they fall (again, quoted strings are an example). But if every regex starts with ^, which anchors the match to the front of the string, the problem simplifies to peeling off a single token from the front of the string. Keep doing this until there’s no string left.

This is a particularly nice approach when parsing a grammar that’s formally defined in EBNF. You can pretty much take the list of terminal expressions, port them to XQuery-style regexes, add a ^ in front of each, and roll.
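To see the shape of this outside XQuery, here’s a minimal sketch of the peel-from-the-front idea in Scala (the token kinds and rules are simplified stand-ins, not the Search API code):

```scala
// Peel one token at a time off the front of the input. Each rule's regex
// is anchored with ^ so it can only ever match at the start of the
// remaining string. Longer/more specific rules must come first.
case class Tok(kind: String, text: String)

val rules = List(
  "ws"     -> "^\\s+".r,
  "qname"  -> "^[a-zA-Z][a-zA-Z0-9]*:[a-zA-Z][a-zA-Z0-9]*".r,
  "prefix" -> "^[a-zA-Z][a-zA-Z0-9]*:".r,
  "name"   -> "^[a-zA-Z][a-zA-Z0-9]*".r,
  "iri"    -> "^<[^>]+>".r
)

@annotation.tailrec
def tokenize(in: String, acc: List[Tok] = Nil): List[Tok] =
  if (in.isEmpty) acc.reverse
  else {
    val (kind, m) = rules.iterator
      .flatMap { case (k, re) => re.findPrefixOf(in).map(k -> _) }
      .next() // throws on input no rule matches
    val acc2 = if (kind == "ws") acc else Tok(kind, m) :: acc // discard whitespace
    tokenize(in.drop(m.length), acc2)
  }
```

The rule order does the same job as the prefix:qname-before-prefix ordering discussed below: the first rule whose anchored regex matches wins, so more specific patterns must be listed before their prefixes.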

Take SPARQL for example. It’s a reasonably rich grammar. The W3C draft spec has 35 productions for terminals. I sketched out some of the terminal rules (note these are simplified):

declare variable $spq:WS     := "^\s+";
declare variable $spq:QNAME  := "^[a-zA-Z][a-zA-Z0-9]*:[a-zA-Z][a-zA-Z0-9]*";
declare variable $spq:PREFIX := "^[a-zA-Z][a-zA-Z0-9]*:";
declare variable $spq:NAME   := "^[a-zA-Z][a-zA-Z0-9]*";
declare variable $spq:IRI    := "^<[^>]+>";

Then we go through the input string, seeing which of these expressions matches; if one does, we call analyze-string, add the matched portion as a token, and recurse on the non-matched portion. Note that we need to try longer matches first, so the rule for 'prefix:qname' comes before the rule for 'prefix:', which comes before the rule for 'name'.

declare function spq:tokenize-recurse($in as xs:string, $tl as json:array) {
    if ($in eq "")
    then ()
    else spq:tokenize-recurse(
        switch (true())
        case matches($in, $spq:WS)     return spq:discard-tok($in, $spq:WS)
        case matches($in, $spq:QNAME)  return spq:peel($in, $spq:QNAME, $tl, "qname", 0, 0)
        case matches($in, $spq:PREFIX) return spq:peel($in, $spq:PREFIX, $tl, "prefix", 0, 1)
        case matches($in, $spq:NAME)   return spq:peel($in, $spq:NAME, $tl, "name", 0, 0)
        case matches($in, $spq:IRI)    return spq:peel($in, $spq:IRI, $tl, "iri", 1, 1)
        default return error((), "cannot lex: " || $in),
        $tl)
};

Here, we're co-opting a json:array as a convenient mutable object for storing tokens as we peel them off; there's not actually any JSON involved. The actual peeling looks like this:

declare function spq:peel(
    $in as xs:string,
    $regex as xs:string,
    $toklist as json:array,
    $type as xs:string,
    $triml as xs:integer, $trimr as xs:integer) {
    let $split := analyze-string($in, $regex)
    let $match := string($split/str:match)
    let $match := if ($triml gt 0) then substring($match, $triml + 1) else $match
    let $match := if ($trimr gt 0) then substring($match, 1, string-length($match) - $trimr) else $match
    let $_ := json:array-push($toklist, <searchdev:tok type="{$type}">{$match}</searchdev:tok>)
    let $result := string($split/str:non-match)
    return $result
};

Some productions, like an <iri> inside angle brackets, contain fixed delimiters which get trimmed off. Some productions, like whitespace, get thrown away. And that's it. As it stands, it's pretty close to a table-driven approach. It's also more flexible than the recursive approach above: even for things like escaped quotes inside a string, if you can write a regex for it, you can lex it.
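To illustrate that last claim (again in Python, and this regex is my own example, not a SPARQL production): a front-anchored rule can handle backslash-escaped quotes inside a string literal, which a plain split-on-quotes tokenizer can't.

```python
import re

# A double-quoted string literal that may contain backslash-escaped
# quotes: any run of non-quote, non-backslash characters, or a
# backslash followed by any character, between the delimiters.
STRING = r'"(?:[^"\\]|\\.)*"'

m = re.match(STRING, r'"a \"quoted\" word" and the rest')
# The match stops at the first unescaped quote, leaving
# " and the rest" to be lexed by the next pass.
```

The escaped quotes stay inside the matched token, and the remainder of the string is untouched for subsequent rules.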


But is it fast? The short answer is that I don't know. A full performance analysis would take some time, but a few quick inspections show that it's not terrible, and certainly good enough for prototype work. I have no evidence for this, but I also suspect it's amenable to server-side optimization: inside the regular expression matching code, paths that involve start-anchored matches should be easy to identify, and in many cases they can avoid work farther down the string. There's plenty of room on the XQuery side for optimization as well.

If you’ve experimented with different lexing techniques, or are interested in more details of this approach, drop me a line in the comments. -m