Friday, December 23rd, 2016

Two free courses that improved my life

Check out this post at its new home on Writing Through The Fog.

Monday, January 11th, 2016

Geektastic Things

I am trying something new with the GeekThoughts domain. Instead of pointing to my blog, it’s pointing at some cool geeky things on a CMS that’s easier to update. Won’t you check it out?

Wednesday, March 18th, 2015

Fiction update

Quick update here: if you are reading this, you’d probably like this short story, named in honor of Dennis Ritchie, FREE and currently burning up the charts for 30 minute reads in Mystery, Thriller, & Suspense. Doing pretty well in Science Fiction and Cyberpunk as well. (link fixed)

Do a solid for readers everywhere and leave a review.

Coming soon is the prequel to this story. It will be free only to folks who jump on the author mailing list.

We now return to your regularly-scheduled nonfictional geekery.


Friday, March 14th, 2014

I am not a robot spammer

Based on the huge number of mail bounces I’ve been getting today, it looks like an unscrupulous somebody forged my return address on a bunch of mail. Perhaps you even sought out this blog based on the distinctive domain name.

Some subject lines in use:

it’s so nice to write to u

maybe your lady


It is me!

It wasn’t me. It’s all too easy to claim an email is from somebody. And putting an unsuspecting schmo through all this apparently makes the message 0.003% more likely to get through filters deliberately trying to block abusive behavior.

And don’t worry: I haven’t (yet) got any really irate people. It’s mostly automated bounces from when an email address on the spammer’s list no longer exists. Carry on. -m

Monday, October 14th, 2013


If you’ve come here because of something you noticed in your HTTP access logs, read on.

Who is doing this? This is a personal project of Micah Dubinko. It is completely separate from anything related to any employer.

What is ASLbot? In the immediate future, ASLbot is no more than a personal research project. It consists of a web crawler, like Google, with an emphasis on sites centered around American Sign Language, and in particular reference materials relating to particular signs. At the moment, there is no publicly available search site, but I would like to set that up as time allows. My long term goal is to promote ASL as an effective means of communication while at the same time making it easier to research and learn about.

Will this affect my site? No. I have the crawl settings turned down very low, so that sites crawled have no discernible impact on performance. I also crawl very infrequently, as ASL dictionaries don’t tend to change terribly often. Once a search site is operating, you may notice an increase in traffic as more people are able to find and visit your site.

What do you intend to do with the crawled data? First off, this is a technology experiment. I’ve noticed that Google/Bing/Yahoo do only an “OK” job on queries like “asl sign for awesome” and think a dedicated site can do better. Once the basics are up, I’d like to do a lot more, but this will necessarily take a long time, as this is not my full-time work. For example, I would like to (possibly with manual input, especially from native signers) categorize signs by handshape, position, and movement in a manner similar to William Stokoe‘s groundbreaking research on ASL linguistics. Keep in mind that this, if it happens at all, is far in the future—imagine someone searching for “M handshape shoulder” and getting a list of hits that link to existing ASL dictionaries.

Do you plan to charge money to access the site? Never.

Do you automatically download videos? No. Only web pages.

How do I make it stop? Think of it this way: Does your site appear in Google? If so, people will be searching and finding particular signs anyway, but without the aid of an ASL-positive web tool. But if you really want to, put an entry for “ASLbot” in your robots.txt file, which this crawler fully honors.
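For site owners who do want to opt out, here is a minimal sketch that uses Python's standard `urllib.robotparser` to check what such an entry does. The example URL and the second user agent are illustrative assumptions; only the “ASLbot” token comes from the text above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: block ASLbot everywhere, leave others alone.
ROBOTS_TXT = """\
User-agent: ASLbot
Disallow: /

User-agent: *
Disallow:
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com stands in for a real ASL dictionary site.
    print(allowed("ASLbot", "http://example.com/signs/awesome.html"))
    print(allowed("SomeOtherBot", "http://example.com/signs/awesome.html"))
```

With an entry like this in place, a crawler that honors robots.txt skips the site entirely, while other agents are unaffected.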

This is awesome, how do I help?  Or, I still have questions: Feel free to email me using the contact information listed on this site, or ( <my first name> @ <this> )

Saturday, April 13th, 2013


This week marked the MarkLogic World conference and with it some exciting news. Without formally “announcing” a new release, the company showed off a great deal of semantic technology in-progress. Part of that came from me, on stage during the Wednesday technical keynote. I’ve been at MarkLogic five years next month, and the first piece of code I wrote there was an RDFa parser. This has been a long time coming.

It was an amazing experience. I was responsible for sifting through the huge amounts of public data–both in RDF formats and on public web pages–and writing the semantic code to pull everything together, culminating in those ten minutes on stage.

Picture this: just behind the big stage and the projected screens was a hive of impressive activity. I counted 8 A/V people backstage, plus 4 more at the back of the auditorium. The conference has reached a level of production values that wouldn’t be vastly different if it were a stadium affair. So in back there’s a curtained-off “green room” with some higher-grade snacks (think PowerBars and Red Bull) and a flatscreen that shows the stage. From back there you can’t see the projected slides or demos, but if you step just outside, you’re at the reverse side of the screen, larger-than-life. The narrow walkway leads to the “chute”, right up the steps onto the main stage. As David Gorbet went through the opening moments of his talk in fine form, I did some stretches and did everything I could think of to prepare myself.

Then he called me up and the music blasted out from the speakers. I had been playing through my mind all the nightmare scenarios–tripping on the stairs and falling on my face as I come onstage (etc.)–but none of that happened. I’ve done public speaking many times before so I had an idea what to expect, though on a stage like that the lights are so bright that it’s hard to see beyond about the third row. So despite the 300-400 people in the room, it didn’t even feel much different than addressing an intimate group of peers. It was fun. On with the demos:

The first showed our internal MarkMail cluster with a simple ‘infobox’ of the sort that all the search engines are doing these days. This was an icebreaker to talk about semantics and how it works–in this case locate the concept of Hadoop in the database, and from there find all the related labels, abstracts, people, projects, releases, and so on. During the construction of the demo, we uncovered some real world facts about the author of the top-ranked message for the query, including a book he wrote. The net effect was that these additional facts made the results a lot more useful by providing a broader context for them.

The second demo showed improved recall–that is, finding things that would otherwise slip under the radar. The existing [from:IBM] query in MarkMail does a good job finding people that happen to have the letters i-b-m in their email address. The semantic query [affiliation:IBM] in contrast knows about the concept of IBM, the concept of people, and the relationship of is-affiliated-with (technically foaf:affiliation) to run a query that more closely models how a person would ask the question: “people that work for IBM” as opposed to “people that have i-b-m in their email address”. Thus the results included folks posting from gmail accounts and other personal addresses, and the result set jumped from about 277k messages to 280k messages.
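The difference between the two kinds of query can be sketched in a few lines of Python. This is a toy model with invented people and addresses, not MarkLogic or MarkMail code. The only name taken from the demo is the foaf:affiliation predicate, and using an email address as the triple subject is a simplifying assumption.

```python
# Toy data: messages keyed by sender address (all names are invented).
MESSAGES = [
    {"id": 1, "sender": "alice@us.ibm.com"},
    {"id": 2, "sender": "bob@gmail.com"},
    {"id": 3, "sender": "carol@example.org"},
]

# Triples of (subject, predicate, object); bob posts from a personal address.
TRIPLES = [
    ("alice@us.ibm.com", "foaf:affiliation", "IBM"),
    ("bob@gmail.com", "foaf:affiliation", "IBM"),
    ("carol@example.org", "foaf:affiliation", "Example Org"),
]


def from_query(term):
    """String match: messages whose sender address contains the term."""
    return [m["id"] for m in MESSAGES if term.lower() in m["sender"].lower()]


def affiliation_query(org):
    """Semantic match: messages whose sender has an affiliation triple for org."""
    people = {s for s, p, o in TRIPLES if p == "foaf:affiliation" and o == org}
    return [m["id"] for m in MESSAGES if m["sender"] in people]


if __name__ == "__main__":
    print(from_query("IBM"))         # only the sender with "ibm" in the address
    print(affiliation_query("IBM"))  # also the sender posting from gmail
```

The string query misses bob entirely, while the triple lookup finds him because the relationship is stored as data rather than inferred from the spelling of an address.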

At this point, a pause to talk about the architecture underlying the technology. It turns out that a system that already supports shared-nothing scale out, full ACID transactions, multiple HA/DR options, and a robust security model is a good starting point for building semantic capabilities. (I got so excited at this point that I forgot to use the clicker for a few beats and had to quickly catch up on the slides.) SPARQL code on the screen.

Then the third demo, a classic semantic app with a twist. Pulling together triples from several different public vocabularies, we answered the question of “find a Hadoop expert” with each row of the results representing not a document, as in MarkMail results, but an actual person. We showed location data (which was actually randomized to avoid privacy concerns) and aggregate cost-of-living data for each city. When we added in a search term, we drew histograms of MarkMail message traffic over time and skipped over the result that had no messages. The audience was entranced.

This is exciting work. I had several folks come up to me afterwards with words to the effect of they hadn’t realized it before, but boy do they ever need semantics. I can’t think of a better barometer for a technical keynote. So back to work I go. There’s a lot to do.

Thanking by name is dangerous, because inevitably people get left out, but I would like to shout out to David Gorbet who ran the keynote, John Snelson who’s a co-conspirator in the development effort, Eric Bloch who helped with the MarkMail code more than anyone will ever know, Denis Shehan who was instrumental in wrangling the cloud and data, and Stephen Buxton who patiently and repeatedly offered feedback that helped sharpen the message.

I’ll post a pointer to the video when it’s available. -m

Monday, December 31st, 2012

New Year’s Resolution

Holding steady at 1440 x 900.

Relevant. -m

Monday, September 17th, 2012

MarkLogic 6 is here

MarkLogic 6 launched today, and it’s full of new and updated goodies. I spent some time designing the new Application Builder including the new Visualization Widgets. If you’ve used Application Builder in the past, you’ll be pleasantly surprised at the changes. It’s leaner and faster under the hood. I’d love to hear what people think of the new architecture, and how they’re using it in new and awesome ways.

If I had to pick out a common theme for the release, it’s all about expanding the appeal of the server to reach new audiences. The Java API makes working with the server feel like a native extension to the language, and the REST API makes it easy to extend the same to other languages.

XQuery support is stronger than ever. I liked Ryan Dew’s take on some of the smaller, but still useful features.

This wouldn’t be complete without thanking my teammates who really made this possible. I had the great pleasure of working with some top-notch front-end people recently, and it’s been a great experience. -m


Tuesday, August 7th, 2012

Balisage Bound

I’m en route to Balisage 2012, though beset by multiple delays. The first leg of my flight was more than two hours delayed, which made the 90 minute transfer window…problematic. My rebooked flight, the next day (today, that is) is also delayed. Then through customs. Maybe all I’ll get out of Tuesday is Demo Jam. But I will make it.

I’m speaking on Thursday about exploring large XML datasets. Looking forward to it!


Thursday, April 26th, 2012

MarkLogic World 2012

I’m getting ready to leave for MarkLogic World, May 1-3 in Washington, DC, and it’s shaping up to be one fabulous conference. I’ve always enjoyed the vibe at these events–it has a, well, cool-in-a-data-geeky-way thing going on (like the XML conference in the early 2000’s where I got to have lunch with James Clark, but that’s a different story). Lots of people with big data problems will be here, and I always enjoy talking to these kinds of people.

I’m speaking on Wednesday at 3:30 with Product Manager extraordinaire Justin Makeig about big data visualization. If you’ll be at the conference, come look me up. And if you won’t, well, forgive me if I need a few extra days to get back to any email you send this way.

Follow me on Twitter and look for the #MLW12 tag for live coverage.


Sunday, April 15th, 2012

Actually using big data

I’ve been thinking a lot about big data, and two recent items nicely capture a slice of the discussion.

1) Alex Milowski recounting working with Big Weather Data. He concludes that ‘naive’ (as-is) data loading is a “doomed” approach. Even small amounts of friction add up at scale, so you should plan on doing some in-situ cleanup. He came up with a slick solution in MarkLogic–go read his post for details.

2) Chris Dixon on Making Large Datasets Useful. Typical approaches like machine learning only solve 80-90% of the problem. So you need to either live with errorful data, or invoke manual clean-up processes.

Both worth a read. There’s more to say, but I’m not ready to tip my hand on a paper I’m working on…


Wednesday, February 1st, 2012

Googlebot submitting Flash forms

I’m sure this is old news by now, but here’s one more data point.

As it turns out, XForms Institute uses an old skool XForms engine written in Flash, dating approximately back to the era when Flash was necessary to do XForms-ey things in the browser. The feedback form for the site is, quite naturally, implemented in XForms. Submissions there ultimately make it into my inbox. Here’s what I see:

Tue Jan 31 12:19:22 2012 Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; Googlebot-Mobile/2.1; +

An iPhone running Flash? I doubt it. That’s quite an agent string! Organic versioning in the wild. -m

Tuesday, November 1st, 2011

5 things to know about MarkLogic 5

MarkLogic 5 is out today. Here are five things beyond the official announcement that developers should know about it:

  1. If you found the CQ sample useful, you’ll love Query Console, which does everything CQ does and more (syntax highlighting!)
  2. Better Search API support for metadata: MarkLogic has always had support for storing metadata separately from documents. With new Search API support, it’s easy to set up, and it works great with databases of binary documents.
  3. The Hadoop connector, while not officially supported in this configuration, works on Mac. I know a lot of developers use Mac hardware. Once you get Hadoop itself set up (following rules like these), everything works great in my experience.
  4. “Fields” have gotten more general and more powerful. If you haven’t set aside named portions of your documents or metadata for special indexing and access, you should look into this feature–it will rock your world.
  5. To better understand what your system is doing at any point in time, you can now use the built-in Monitoring Dashboard, which runs in-browser.
And let’s not leave out the Express license, which makes it easier to get started. Check it out.

Thursday, June 9th, 2011

SkunkLink in Belorussian

The awesome thing about the internet is that you never know who’s reading your stuff. Case in point: during the depths of the hypertext linking standards discussions, after folks realized that XLink wasn’t going to work with HTML (not even with XML-flavored XHTML), all kinds of proposals flew around about what to do about it. One was my own SkunkLink, a “skunkworks” attempt to get people thinking in a certain direction.

An enthusiastic follower, Bohdan Zograf, has translated SkunkLink into Belorussian, available here. He’s also mentioned translating all of XForms Essentials, which I completely support, and is just the kind of thing I hoped would happen when I put the text under a liberal content license.

Awesome. -m

Sunday, May 1st, 2011

MarkLogic User Conference coverage

Running commentary on Twitter, but hurry, Twitter’s search infrastructure has the long-term memory of a fruit fly. Posts tagged with MLUC11 will soon be dropping off the search event horizon. -m

Thursday, March 31st, 2011

Follow me on Twitter

Not an April Fools joke. I’m now tweeting in professional capacity. I’ll talk about XML technologies, the web, and various and sundry geeky topics. -m

Thursday, February 3rd, 2011

We’ll always have Prague

Today I exchanged electrons with a major airline, which will ultimately result in them removing a certain amount of abstract currency units from my account.

In other words, see you all at XML Prague 2011. I’ve never been to this conference before, and each year I hear better and better things. Looking forward to it. -m

Friday, January 7th, 2011

XForms Training: Feb 14, 15 in Maryland

The remarkable C. M. Sperberg-McQueen is offering XForms training in Maryland (at Mulberry Technologies), Feb 14 & 15, 2011. This is a two-day hands-on introduction to XForms. Check it out. This is a great opportunity to learn more about XForms. -m

Wednesday, January 5th, 2011

Why I am abandoning Yahoo! Mail (and why you should too)

This is a non-technical description of why Yahoo! Mail is unsafe to use in a public setting, and indeed at all. I will be pointing people at this page as I go through the long process of changing an address I’ve had for more than a decade.

What’s wrong with Yahoo Mail?

A lot of web addresses start with http://–that’s a signal that the “scheme” used to deliver the page to your browser is something called HTTP, a technical specification that turns out to be a really good way to move web pages around. As the page flows to the browser, it’s susceptible to eavesdropping, particularly over a wi-fi connection, and much more so in public, including the usual hotspots like coffee shops, but also workplaces and many home environments. It’s the virtual equivalent of a postcard. When you’re reading the news or checking traffic, it’s not a big deal if someone can sneak a glance at your page.

Some addresses start with https://–notice the extra ‘s’ which stands for “secure”. This means two things: 1) that the web page being sent over is encrypted, and thus unavailable to eavesdroppers, and 2) that the people running the site had to obtain a certificate, which is a form of proof of their identity as an organization (that they’re not, say, Ukrainian phishers). Many years ago, serving pages over https was considered quite expensive in that servers needed much beefier processors to run all that encryption. Today, while it still requires extra computation, it’s not as big of a deal. Most off-the-shelf servers have plenty of extra power. To be fair, for a truly ginormous application with millions of users like Yahoo Mail, it is not a trivial thing to roll out. But it’s critically important.

First, to dispel a point of confusion, these days nearly every site, including Yahoo Mail, uses https for the login screen. This is the most critical time when encryption is needed, because otherwise you’d be sending your password on a postcard for anyone with even modest technical skills to peek at. So that’s good, but it’s no longer enough. Because sites are written so that you don’t have to reenter your password on every single new page, they use a tiny bit of information called a “cookie” in your browser to stay logged in. Cookies themselves are neither good nor bad, but if an eavesdropper gets a hold of one, they can control most of your account–everything that doesn’t require re-entering a password. In Yahoo Mail this includes reading any of your messages, sending mail on your behalf, or even deleting messages. Are you comfortable allowing strangers to do this?
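For the technically curious, here is a small sketch using Python's standard `http.cookies` module. It shows the `Secure` attribute, which tells a browser to send a cookie only over https, so it never travels on a “postcard”. The cookie name and value are placeholders, and this is an illustration of the general mechanism, not a description of how Yahoo's servers are built.

```python
from http.cookies import SimpleCookie


def session_cookie_header(session_id: str, secure: bool) -> str:
    """Build the value of a Set-Cookie header for a login session.

    "session" is a hypothetical cookie name chosen for this example.
    """
    cookie = SimpleCookie()
    cookie["session"] = session_id
    cookie["session"]["path"] = "/"
    if secure:
        # Secure: the browser sends the cookie only over https connections.
        cookie["session"]["secure"] = True
        # HttpOnly: page scripts cannot read the cookie.
        cookie["session"]["httponly"] = True
    return cookie["session"].OutputString()


if __name__ == "__main__":
    print(session_cookie_header("abc123", secure=False))
    print(session_cookie_header("abc123", secure=True))
```

A cookie without the `Secure` attribute goes out with every plain http request to the site, which is exactly what an eavesdropper on the same wi-fi network is waiting for.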

As I mentioned earlier, new, more powerful tools have been out for months that automate the process of taking over accounts this way. Zero technical prowess is needed, only the ability to install a browser plug-in. If there are any web companies dealing in personal information for which this wasn’t an all-hands-on-deck security wake-up, they are grossly negligent. Indeed, other sites like Gmail work with https all-the-time. But still, in 2011, Yahoo Mail doesn’t. I have a soft spot for Yahoo as a former employer, and I want to keep liking them. Too bad they make it so difficult.

The deeper issue at stake is that if this serious of an issue goes unfixed for months, how many lesser issues lurk in the site and have been around for months or years? The issue is trust, my friend, and Yahoo just overdrew their account. I’m leaving.


Q: So what do you want Yahoo to do about this?  A: Well, they should fix their site for their millions of remaining users.

Q: What if they fix it tomorrow? Will you delete this message?  A: No. Since I no longer trust the site, I am leaving, even though it takes time to notify all the people who still send me mail, and no matter what other developments unfold in the meantime. This page will explain my actions.

Q: Do you really want everyone else to leave Yahoo Mail?  A: No, only those who care about their privacy.

Q: What’s your new email address?  A: I have a couple, but <my first name> @ <this domain> is a good general-purpose one.

I will continue to update this page as more information becomes available. -m

Thursday, September 9th, 2010

FCC opens its databases

Good news for big data fans. The FCC has released APIs to several large databases involving broadband statistics, spectrum licenses, and some related topics. I haven’t had a chance for a close look yet, perhaps we can do that together. Link. -m

Sunday, August 22nd, 2010

Eulogy for SearchMonkey

This is indeed a sad day for all of us, for on October 1, a great app will be gone. Though we hardly had enough time during his short life to get to know him, like the grass that withers and fades, this monkey will finish his earthly course.

Updated SearchMonkey logo

Photo by Micah

I know he left many things undone, for example only enhancing 60% of the delivered result pages. He never got a chance to finish his life’s ambition of promoting RDFa and microformats to the masses or to be the killer app of the (lower-case) semantic web. You could say he will live on as “some of this structured data processing will be supported natively by the Microsoft platform”. Part of the monkey we loved will live on as enhanced results continue to flow forth from the Yahoo/Bing alliance.

The SearchMonkey Alumni group on LinkedIn is filled with wonderful mourners. Micah Alpern wrote there

I miss the team, the songs, and the aspiration to solve a hard problem. Everything else is just code.

Isaac Asimov was reported to have said “If my doctor told me I had only six minutes to live, I wouldn’t brood. I’d type a little faster.” Today we can identify with that sentiment. Keep typing.


Thursday, August 5th, 2010

Balisageurs: XML and JSON

At David Lee’s nocturne about XML and JSON round-tripping, several folks were talking about a site that listed several “off-the-shelf” conversion methods, but nobody could remember the site.

Late that night, with 15 minutes of battery remaining, I found it. The operative search term is XSLTJSON. -m

Saturday, July 31st, 2010

Balisage bound

See all you markup extremists in Montreal. Look me up. -m

Wednesday, July 21st, 2010

Meade Classe August 7

Join me for another Meade Classe at the Los Altos MoreFlavor brew shop.

Saturday, Aug. 7, 2010 2:00 – 4:00 pm

MoreFlavor 991 N. San Antonio Road

Los Altos, CA 94022

We will taste some meads, focusing on sensory evaluation, then walk through the steps of brewing up a batch. As usual, seating is limited, so email me to reserve a spot. To help the brew shop recover the costs of the honey, yeast, and light snacks available, a $10 donation will help make sure these events can continue.

On a personal note, I’ll be traveling back from a conference in Montreal the day before, so I might be a little jet lagged. Could get interesting. :-) -m

Thursday, July 15th, 2010

VP XIV bound

Thrilled, THRILLED to announce that I’ve been accepted to the 2010 Viable Paradise workshop. I sent in the first 8000 words of a manuscript that about half of the 7 readers of this blog have looked at. You know, the one that is Science Fiction–literally, fiction about science. So I’ll be spending some time in early October at Martha’s Vineyard studying at the feet of published authors and honing my craft.

Class size is limited, so I’ve been actively psyching myself down for the last month, not getting my hopes up too high. Then when the acceptance came, I had computer down time, and nearly exploded from holding the news in for 3 days. :-)

Ahhhh. I should say more, but I believe I may still be in shock. -m

P. S. OK, how about 25 great opening lines.

Sunday, June 20th, 2010

RDBMS Alternatives

For anyone trying to get up to speed on the technology side of non-traditional databases, including NoSQL concepts and not-your-father’s-XML, this webinar looks like a good start. Tuesday June 29, 2pm EST, 11am PST. -m

Sunday, May 30th, 2010

Balisage contest: solving the wikiml problem

I wish I could say I had something to do with the planning of this: part of Balisage 2010 is a contest to “encourage markup experts to review and to research the current state of wiki markup languages and to generate a proposal that serves to de-babelize the current state of affairs for the long haul.”  To enter, you must propose a set of concrete steps (organizational, social, and/or technological) that will enable wiki content interchange, a real WYSIWYG editor, and/or wiki syntax standardization.

This pushes all of my buttons. It’s got structured documents, Web, parser geekery, writing, engineering, and standards. There’s a bunch of open source prior art, including PyXMLWiki, which I adapted from some fantastic earlier work from Rick Jelliffe.

Sadly, MarkLogic employees aren’t eligible to enter. Get your write-up done by July 15 and sent to balisage-2010-contest at marklogic dot com. The winner will be announced at Balisage and will take home some serious prize winnings, and also will be strongly encouraged (but not required) to give a brief summary (~10 minutes) of their winning entry.

Can’t wait to see what comes out of this. -m

Tuesday, May 11th, 2010

XProc is ready

Brief note: The W3C XProc specification, edited by my partner-in-crime Norm Walsh, has advanced to Recommendation status. Now go use it. -m

Thursday, April 29th, 2010


The new MarkLogic developer site is up, cleaner, better organized, and more social. Even cooler, it’s an XSLT-heavy application running on a pre-release version of MarkLogic. The new blog gives some of the details of the new site and transition.

So, if you’re already a MarkLogic developer, this is a great resource. And if you’re not, the site itself shows how fast and simple it is to put together an XSLT and XQuery-powered app. -m

Monday, February 22nd, 2010

Mark Logic User Conference 2010

Are you coming? Link. It starts on May 4 (Star Wars day!) at the InterContinental Hotel in San Francisco. Guest speakers include Chris Anderson, Editor-in-Chief of Wired and Michelle Manafy, Editor-in-Chief of EContent magazine.

Early bird registration ends Feb 28. -m