Newest Post

April 26th, 2012

MarkLogic World 2012

I’m getting ready to leave for MarkLogic World, May 1-3 in Washington, DC, and it’s shaping up to be one fabulous conference. I’ve always enjoyed the vibe at these events–it has a, well, cool-in-a-data-geeky-way thing going on (like the XML conference in the early 2000s where I got to have lunch with James Clark, but that’s a different story). Lots of people with big data problems will be here, and I always enjoy talking to these kinds of people.

I’m speaking on Wednesday at 3:30 with Product Manager extraordinaire Justin Makeig about big data visualization. If you’ll be at the conference, come look me up. And if you won’t, well, forgive me if I need a few extra days to get back to any email you send this way.

Follow me on Twitter and look for the #MLW12 tag for live coverage.

-m

April 15th, 2012

Actually using big data

I’ve been thinking a lot about big data, and two recent items nicely capture a slice of the discussion.

1) Alex Milowski recounting working with Big Weather Data. He concludes that ‘naive’ (as-is) data loading is a “doomed” approach. Even small amounts of friction add up at scale, so you should plan on doing some in-situ cleanup. He came up with a slick solution in MarkLogic–go read his post for details.

2) Chris Dixon on Making Large Datasets Useful. Typical approaches like machine learning only solve 80-90% of the problem. So you need to either live with errorful data, or invoke manual clean-up processes.

Both worth a read. There’s more to say, but I’m not ready to tip my hand on a paper I’m working on…

-m

February 1st, 2012

Googlebot submitting Flash forms

I’m sure this is old news by now, but here’s one more data point.

As it turns out, XForms Institute uses an old skool XForms engine written in Flash, dating approximately back to the era when Flash was necessary to do XForms-ey things in the browser. The feedback form for the site is, quite naturally, implemented in XForms. Submissions there ultimately make it into my inbox. Here’s what I see:

Tue Jan 31 12:19:22 2012 66.249.68.249 Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)

An iPhone running Flash? I doubt it. That’s quite an agent string! Organic versioning in the wild. -m

January 15th, 2012

The ultimate breakfast smoothie

I’ve used this same recipe for three things: weight loss, after-exercise protein, and sore-teeth liquid diet. It’s great.

1 cup 2% milk

1 cup Dannon Fit & Light vanilla yogurt

1 scoop Syntha-6 protein powder (banana is great)

Mix.

This yields 450 calories with a whopping 39g of protein, 48g of carb (but only 30g of that simple sugars), 11g of fat, and 5g of fiber.

You could live off 3 or 4 of these a day. (and I have)

January 15th, 2012

Five iOS keyboard tips you probably didn’t know

Check out these tips. The article talks about iPad, but they work on iPhone too, even an old 3G.

On one hand, it shows the intense amount of careful thought Apple puts into the user experience. But on the other hand, it highlights the discovery problem. I know people who have been using iOS since before it was called iOS, and still didn’t know about these. How do you put these kinds of finishing touches into a product and make sure the target audience can find out about them? -m

January 14th, 2012

Call a Spade a Spade

A cautionary tale of language from Ted Nelson:

We might call a common or garden spade–

  • A personalized earth-moving equipment module
  • A mineralogical mini-transport
  • A personalized strategic tellurian command and control module
  • An air-to-ground interface contour adjustment probe
  • A leveraged tactile-feedback geomass delivery system
  • A man-machine energy-to-structure converter
  • A one-to-one individualized geophysical restructurizer
  • A portable unitized earth-work synthesis system
  • An entrenching tool
  • A zero-sum dirt level adjuster
  • A feedback-oriented contour management probe and digging system
  • A gradient disequilibrator
  • A mass distribution negentroprizer
  • (hey!) a dig-it-all system
  • An extra terrestrial transport mechanism

Spades, not words, should be used for shoveling. But words should help us unearth the truth.

–Computer Lib (1974), Theodor Nelson, p44

December 8th, 2011

Resurgence of MVC in XQuery

There’s been an increasing amount of talk about MVC in XQuery, notably David Cassel’s great discussion and to an extent Kurt Cagle’s platform discussion that touched on forms interfaces. Lots of Smart People are thinking in this area, and that’s a good thing.

A while back I recorded my thoughts on what I called MET, or the Model Endpoint Template organizational pattern, as used in MarkLogic Application Builder. One difference between 2009 and now, though, is that browsers have distanced themselves even farther from XML, which tends to undercut the eliminate-the-impedance-mismatch argument. In particular, the forms model in HTML5 continues to prefer flat data, which to me indicates that models still play an important role in XQuery web apps.

So I envision the app lifecycle like this:

  1. The browser requests a particular page, say the one that lets you configure sorting options in the app you’re building
  2. An HTML page loads.
  3. Client-side script requests the project state from a designated endpoint, the server transforms the XML into a flat list, and delivers it as JSON (as an optimization, the server can package the initial data into the page delivered in the prior step)
  4. Standard form interaction and client-side scripting happens, including manipulation of repeating structures mediated by JavaScript
  5. A standard form submit happens (possibly via script), sending a flat list back to the server, which performs an update to the stored XML.
It’s pretty easy to envision data-mapping tools and libraries that help automate the construction of the transforms mentioned in steps 3 and 5.
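
As a rough illustration of the transforms in steps 3 and 5, here is a minimal sketch in Python (not XQuery, and not how Application Builder actually does it). The sample project XML, the path-style keys, and the function names flatten and apply_update are all made up for the example, and it ignores attributes and mixed content.

```python
import json
import xml.etree.ElementTree as ET


def flatten(elem, path=None):
    """Step 3: turn an element tree into a flat {path: text} dict."""
    path = path or elem.tag
    children = list(elem)
    if not children:
        return {path: (elem.text or "").strip()}
    flat = {}
    seen = {}
    for child in children:
        # Number siblings so repeating structures get unique keys.
        seen[child.tag] = seen.get(child.tag, 0) + 1
        flat.update(flatten(child, f"{path}/{child.tag}[{seen[child.tag]}]"))
    return flat


def apply_update(root, flat):
    """Step 5: write submitted flat values back into the stored tree."""
    for path, value in flat.items():
        node = root
        for step in path.split("/")[1:]:  # skip the root tag itself
            tag, index = step[:-1].split("[")
            node = node.findall(tag)[int(index) - 1]
        node.text = value
    return root


# Hypothetical stored project state.
stored = ET.fromstring(
    "<project>"
    "<sort><field>date</field><order>descending</order></sort>"
    "<sort><field>title</field><order>ascending</order></sort>"
    "</project>"
)

payload = json.dumps(flatten(stored))  # what the endpoint would deliver as JSON
submitted = json.loads(payload)
submitted["project/sort[1]/order[1]"] = "ascending"  # the user edits the form
apply_update(stored, submitted)
```

The keys deliberately look like simple paths, so the flat list coming back from the form carries enough information to find the right node again.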

Another thing that’s changed is the emergence of XQuery plugin technology in MarkLogic. There’s a rapidly-growing library of reusable components, initially centered around Information Studio but soon to cover more ground. This is going to have a major impact on XQuery app designs as components of the app (think visualization widgets) can be seamlessly added to apps.

Endpoints still make a ton of sense for XQuery apps, and provide the additional advantage that you now have a testable, concern-separated data layer for your app. Other apps have a clean way to interop, and even command-line operation is possible with off-the-shelf-tools like wget.

Lastly, Templates. Even if you use plugins for the functional core of your app, there’s still a lot of boilerplate stuff you’d not want to repeat. Something like Mustache.xq is a good fit for this.

Which is all good–but is it MVC? This organizational pattern (let’s call it MET 2.0) is a lot closer to it. Does MET need a controller? Probably. (MarkLogic now ships a pretty good one called rest:rewrite) Like MVC, MET separates the important essences of your application. XQuery will never be Ruby or Java, and its frameworks will never be Rails or Spring, but rather something uniquely poised to capture the expressive power of the language to build apps on top of unstructured and big data. -m

November 1st, 2011

5 things to know about MarkLogic 5

MarkLogic 5 is out today. Here are five things beyond the official announcement that developers should know about it:

  1. If you found the CQ sample useful, you’ll love Query Console, which does everything CQ does and more (syntax highlighting!)
  2. Better Search API support for metadata: MarkLogic has always had support for storing metadata separately from documents. With new Search API support, it’s easy to set up, and it works great with databases of binary documents.
  3. The Hadoop connector, while not officially supported in this configuration, works on Mac. I know a lot of developers use Mac hardware. Once you get Hadoop itself set up (following rules like these), everything works great in my experience.
  4. “Fields” have gotten more general and more powerful. If you haven’t set aside named portions of your documents or metadata for special indexing and access, you should look into this feature–it will rock your world.
  5. To better understand what your system is doing at any point in time, you can now use the built-in Monitoring Dashboard, which runs in-browser.
And let’s not leave out the Express license, which makes it easier to get started. Check it out.
-m

September 29th, 2011

facebook Challenge results

Andromeda took the facebook Challenge, and found 52 separate requests in 24 hours that would have gone to the facebook mothership. Watch her blog for more updates. How about you?

If you look through these logs, pay particular attention to the referer field. This tells you on which site you were browsing when the data set out on its voyage toward facebook.
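
If you’d rather not eyeball the log, a few lines of Python can tally the referers for you. This is only a sketch: it assumes your local web server writes the common “combined” log format (where the referer is the second-to-last quoted field), and the function name and sample lines are made up for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Combined log format ends with: "request" status bytes "referer" "user-agent"
LINE = re.compile(r'"[^"]*" \d{3} \S+ "(?P<referer>[^"]*)" "[^"]*"$')


def referring_sites(lines):
    """Count intercepted requests per referring host."""
    sites = Counter()
    for line in lines:
        match = LINE.search(line.strip())
        if not match:
            continue  # not in combined format; skip it
        host = urlsplit(match.group("referer")).netloc or "(no referer)"
        sites[host] += 1
    return sites


# Made-up sample lines; in practice, pass in the open access log file instead.
sample = [
    '127.0.0.1 - - [28/Sep/2011:10:15:02 -0700] "GET /plugins/like.php HTTP/1.1" 404 209 "http://news.example.com/story" "Mozilla/5.0"',
    '127.0.0.1 - - [28/Sep/2011:10:15:09 -0700] "GET /plugins/like.php HTTP/1.1" 404 209 "http://news.example.com/other" "Mozilla/5.0"',
    '127.0.0.1 - - [28/Sep/2011:11:02:40 -0700] "GET /widgets/banner HTTP/1.1" 404 209 "-" "Mozilla/5.0"',
]

for host, count in referring_sites(sample).most_common():
    print(count, host)
```

Where the access log lives depends on your server and how you set it up, so the sketch takes any iterable of lines rather than guessing at a path.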

September 27th, 2011

Take the facebook Challenge

Worried about how much data facebook is collecting on you, even on 3rd party sites, even if you’re signed out? Try this for 24 hours:

  1. Find a file named ‘hosts’ on your computer. On Mac/Linux systems, it’s under /etc/. On Windows, it used to be under System32 somewhere, but who knows now. Stash a backup copy somewhere.
  2. Add the following on a new line:     127.0.0.1 www.facebook.com
  3. Configure a web server running on your local machine.
This will forcibly redirect all calls to facebook to local. At the end of 24 hours, take a look at your web server’s access log. Every line in there is something that would have gone to facebook. Every ‘like’ button, every little banner, all those things track your movements across the web, whether you are signed in to facebook or not. You’ll marvel at how many blank rectangles appear on sites you visit.
Bonus points: At the end of the 24 hours, don’t restore your hosts file.
Please post your facebook-free experiences here.
-m

July 5th, 2011

Geek Thoughts: how I take my tea

Having been recently accused of “vile” habits in regard to tea-drinking, I feel that I need to clear the air. :)

I’ve never been medically tested, but I am almost certainly a supertaster. (This explains, among other things, my aversion to most vegetables and my status as a nationally ranked beer judge.) I did go through the BBC test and some rough taste-bud-counting with blue dye and a mirror.

So I do not generally follow accepted wisdom with tea. To prepare tea, I get a nice glass of cold water and plunk in a tea bag. Same goes for other tea-like substances, such as yerba mate. The result is a much slower steeping process, where subtle flavors shift throughout the day and with different refills. Does it get bitter? While tannins are part of the tea flavor, you don’t get that intense, mouth-puckering astringency like you would hot-steeping tea for too long. It’s more gradual and interesting.

Different kinds of tea have different spectrums of flavor, as revealed over the course of a day. Earl Grey and green tea are particularly nice. Some interesting combinations are possible too, by combining two teas which reach their flavor peaks at different times.

I say keep an open mind, and don’t knock it if you haven’t tried it. :) -m

 

June 9th, 2011

SkunkLink in Belorussian

The awesome thing about the internet is that you never know who’s reading your stuff. Case in point: during the depths of the hypertext linking standards discussions, after folks realized that XLink wasn’t going to work with HTML (not even with XML-flavored XHTML), all kinds of proposals flew around about what to do about it. One was my own SkunkLink, a “skunkworks” attempt to get people thinking in a certain direction.

An enthusiastic follower, Bohdan Zograf, has translated SkunkLink into Belorussian, available here. He’s also mentioned translating all of XForms Essentials, which I completely support, and is just the kind of thing I hoped would happen when I put the text under a liberal content license.

Awesome. -m

May 30th, 2011

Good to Great

One book that Ken Bado, the MarkLogic President and CEO, likes to talk about is Good to Great, (subtitled why some companies make the leap… and others don’t), a result of many man-years of meticulous research.

There’s plenty to think about in this book. It talks about the qualities of a “level 5” executive: the best have a paradoxical mixture of personal humility and iron will. It talks about getting the right people on the bus, and only then deciding where the bus is going. It talks about a culture where brutal facts surfacing is the normal and expected behavior, resulting in a culture of both discipline and faith in the future. Perhaps the key point of the book is the Venn diagram that depicts “great” companies as focusing on the intersection of passion, what they can be the best at in the world, and what drives their economic engine.

The structure of the book is based on 11 key companies that passed several rigorous metrics, including an at-least-15-year period of good financial performance, followed by a turning point and an at-least-15-year period of greatness, that is, returns well above the general and industry markets. (Perhaps unfairly, companies that were in the ‘great’ bucket continuously, with no periods of merely ‘good’ performance, were excluded).

Two of the companies in the list, Fannie Mae and Wells Fargo, raised the eyebrows of this fresh reader. Both of them have been prominently in the headlines in the last few years, and not in a good way. In particular the depictions of Wells Fargo struggling with deregulation in the 80s seem galling to read with the hindsight of going through the Great Recession. Circuit City, another of the good-to-great companies, filed for bankruptcy in 2008 and was liquidated in 2009. The book itself cautions about tough times at Gillette and Nucor in the Epilogue section.

I bring this out not to be negative, but to emphasize that this is a soft discipline, not science. If there are companies that have consistently beaten the market from the 80s until today with no serious hiccups, that would be truly remarkable. But there are lots of hidden variables, the system is chaotic, and mere financial numbers are too shallow a yardstick by which to measure greatness. A company that can truly follow these principles will almost certainly do better than one that doesn’t. Just look at Yahoo for a negative example.

In particular, I’m thinking the three circles are a good way to approach life, though I sincerely hope an individual’s third circle isn’t about optimizing finances. What can you be the best in the world at, have passion for, and drive your personal satisfaction engine? Maybe that would be a good area to focus your limited resources on. -m

May 1st, 2011

MarkLogic User Conference coverage

Running commentary on Twitter, but hurry, Twitter’s search infrastructure has the long-term memory of a fruit fly. Posts tagged with MLUC11 will soon be dropping off the search event horizon. -m

March 31st, 2011

Follow me on Twitter

Not an April Fools joke. I’m now tweeting in professional capacity. I’ll talk about XML technologies, the web, and various and sundry geeky topics. -m

February 17th, 2011

MarkLogic in the news

What’s that on your TV screen? Why, it’s MarkLogic, again.

Why President Obama Picked the Bay Area

And it’s true, we’re hiring big time. Maybe your resume should be in that pile… -m

February 3rd, 2011

We’ll always have Prague

Today I exchanged electrons with a major airline, which will ultimately result in them removing a certain amount of abstract currency units from my account.

In other words, see you all at XML Prague 2011. I’ve never been to this conference before, and each year I hear better and better things. Looking forward to it. -m

January 26th, 2011

Explosive growth of RDFa

Some great data from my one-time colleague Peter Mika. Based on data culled from 12 billion web pages, RDFa is on 3.5 percent of them, even after discounting “trivial” uses of it. Just look at how much that dark blue bar shot up since the last measurement, some 18 months earlier.

Also of note: eRDF has dropped off the map. hAtom and hReview are continuing their climb.

-m

January 24th, 2011

Geek Thoughts: the miserable programmer paradox

I found this article interesting. The author posits:

“A good programmer will spend most of his time doing work that he hates, using tools and technologies that he also hates.”

While I disagree with many of his supporting arguments, I think the overall theme is pretty accurate. Working with software, the good parts seem to disappear, so what you spend most time on are the grotty bits. In fact, I’d go as far as calling disappearability one of the defining aspects of good code-level software tools & techniques.

More collected Geek Thoughts at http://GEEKTHOUGHTS.info.

January 7th, 2011

XForms Training: Feb 14, 15 in Maryland

The remarkable C. M. Sperberg-McQueen is offering XForms training in Maryland (at Mulberry Technologies), Feb 14 & 15, 2011. This is a two-day hands-on introduction to XForms. Check it out. This is a great opportunity to learn more about XForms. -m

January 5th, 2011

Why I am abandoning Yahoo! Mail (and why you should too)

This is a non-technical description of why Yahoo! Mail is unsafe to use in a public setting, and indeed at all. I will be pointing people at this page as I go through the long process of changing an address I’ve had for more than a decade.

What’s wrong with Yahoo Mail?

A lot of web addresses start with http://–that’s a signal that the “scheme” used to deliver the page to your browser is something called HTTP, which is a technical specification that turns out to be a really good way to move around web pages. As the page flows to the browser, it’s susceptible to eavesdropping, particularly over a wi-fi connection, and much more so in public, including the usual hotspots like coffee shops, but also workplaces and many home environments. It’s the virtual equivalent of a postcard. When you’re reading the news or checking traffic, it’s not a big deal if someone can sneak a glance at your page.

Some addresses start with https://–notice the extra ‘s’ which stands for “secure”. This means two things: 1) that the web page being sent over is encrypted, and thus unavailable to eavesdroppers, and 2) that the people running the site had to obtain a certificate, which is a form of proof of their identity as an organization (that they’re not, say, Ukrainian phishers). Many years ago, serving pages over https was considered quite expensive in that servers needed much beefier processors to run all that encryption. Today, while it still requires extra computation, it’s not as big of a deal. Most off-the-shelf servers have plenty of extra power. To be fair, for a truly ginormous application with millions of users like Yahoo Mail, it is not a trivial thing to roll out. But it’s critically important.

First, to dispel a point of confusion, these days nearly every site, including Yahoo Mail, uses https for the login screen. This is the most critical time when encryption is needed, because otherwise you’d be sending your password on a postcard for anyone with even modest technical skills to peek at. So that’s good, but it’s no longer enough. Because sites are written so that you don’t have to reenter your password on every single new page, they use a tiny bit of information called a “cookie” in your browser to stay logged in. Cookies themselves are neither good nor bad, but if an eavesdropper gets a hold of one, they can control most of your account–everything that doesn’t require re-entering a password. In Yahoo Mail this includes reading any of your messages, sending mail on your behalf, or even deleting messages. Are you comfortable allowing strangers to do this?

As I mentioned earlier, new, more powerful tools have been out for months that automate the process of taking over accounts this way. Zero technical prowess is needed, only the ability to install a browser plug-in. If there are any web companies dealing in personal information for which this wasn’t an all-hands-on-deck security wake-up, they are grossly negligent. Indeed, other sites like Gmail work with https all-the-time. But still, in 2011, Yahoo Mail doesn’t. I have a soft spot for Yahoo as a former employer, and I want to keep liking them. Too bad they make it so difficult.

The deeper issue at stake is that if an issue this serious goes unfixed for months, how many lesser issues lurk in the site and have been around for months or years? The issue is trust, my friend, and Yahoo just overdrew their account. I’m leaving.

FAQ

Q: So what do you want Yahoo to do about this?  A: Well, they should fix their site for their millions of remaining users.

Q: What if they fix it tomorrow? Will you delete this message?  A: No. Since I no longer trust the site, I am leaving, even though it takes time to notify all the people who still send me mail, and no matter what other developments unfold in the meantime. This page will explain my actions.

Q: Do you really want everyone else to leave Yahoo Mail?  A: No, only those who care about their privacy.

Q: What’s your new email address?  A: I have a couple, but <my first name> @ <this domain> is a good general-purpose one.

I will continue to update this page as more information becomes available. -m

December 4th, 2010

Yahoo Mail’s inexplicable, inexcusable lack of https support

Dear Yahoo,

What’s the deal? Shortly after FireSheep was announced on Oct 24, 2010, you should have had an emergency security all-hands meeting. You should have had an edict passed down from the “Paranoids” group to get secure or else. Maybe these things happened–I have no way of knowing.

But it is clear that it’s been 6 weeks and security hasn’t changed. It’s simply not possible to read Yahoo mail over https–try it and you get redirected straight back to an insecure channel. As such, anyone accessing Yahoo mail on a public network, say a coffee shop or a workplace, is vulnerable to having their private information read, forwarded, compromised, or deleted.

Wait, did I say 6 weeks?–SSL had apparently been rolled out for mail more than 2 years ago, but pulled back due to problems. Talk about failure to execute.

I feel like I missed an announcement. What’s the deal, Y? Show me that you care about your users. No excuses.

Sincerely,

-m

October 24th, 2010

Geek Thoughts: statistical argument against link shortener sustainability

I’ve seen lots of discussion for and against link shorteners, but not specifically this line of argument:

Let me grab a random shortened link from Twitter. Don’t go away, I’ll be right back.

http://bit.ly/b1fYi1

OK, that’s six characters in the domain, a slash, and six more characters. 50 years from now, if bit.ly is still in operation, the URLspace will be rather more crowded, and the part after the slash might be eight or nine characters. This is a significant cliff, since most people have trouble remembering more than 6 or 7 things in their head at a time. Thus, one could conclude that 50 years from now, newly minted bit.ly URLs will be less fashionable than those from newer link-shortening services, particularly if more short TLDs come online, which seems likely. In that scenario, fewer and fewer people will use bit.ly, and it will become a resource-pit as costs go up (for more database storage, among other things) while usage drops, an economic trend that has only one eventual outcome, leading to the breaking of all the external links relying on this service.
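
To put rough numbers on the crowding, here is a back-of-the-envelope sketch. It assumes the part after the slash draws from 62 characters (a-z, A-Z, 0-9), which is my guess from the sample link and not a documented fact about bit.ly; the function names are made up too. Under that assumption, six characters cover about 56.8 billion links.

```python
ALPHABET_SIZE = 62  # assumption: a-z, A-Z and 0-9


def capacity(length, alphabet=ALPHABET_SIZE):
    """How many distinct links fit in codes of exactly this length."""
    return alphabet ** length


def length_needed(total_links, alphabet=ALPHABET_SIZE):
    """Shortest code length whose capacity covers total_links."""
    length = 1
    while capacity(length, alphabet) < total_links:
        length += 1
    return length


for length in (6, 7, 8, 9):
    print(length, "characters:", capacity(length), "possible links")
```

Each extra character multiplies the space by 62, so the codes grow slowly, but they only ever grow, and that is the direction that matters for the argument.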

I’ve been picking on bit.ly here, but the same principle applies to any shortener service. In fact, the more popular, the more quickly the URLspace will fill.

The moral: don’t use link shorteners for anything that needs to be more durable than something you’d scribble on a scrap of paper at your desk.

More collected Geek Thoughts at http://geekthoughts.info.

October 24th, 2010

Passing the Turing Test

I want to write a program that uses TurKit to pass the Turing Test. Cheating, sure, but should be doable (other than time lag issues), right? -m

September 28th, 2010

Geek Thoughts: accomplishment

Whenever I undertake something big and challenging enough to be worthwhile, whether editing a W3C specification, running a more demanding distance, a new software project, or something else, I notice a similar trajectory of progress:

Ready to start: Full of adrenaline and excitement. Audacious goals seem readily reachable.

5-10% through: Whoa, this is difficult! And I’m only 1/10 or 1/20 of the way through? What was I thinking? It is important to ignore these thoughts.

One third point: Things seem to even out by this point. The hard slog presses on.

Halfway point: Wow, that’s halfway? Feels more like 90%!

Two-thirds point: Things are getting difficult. Should have treated this more like a marathon, less like a sprint.

90% point: There are two distinct kinds of endeavors from here. In what I call ‘type 1’ projects, the goalposts are strictly fixed, in which case a fresh burst of energy propels me through the glorious finish. But in a more sinister ‘type 2’ project, the finish line keeps receding away, as fast as or faster than I can approach. Depending on my level of stubbornness and anger, I will often finish anyway, just to spite the universe and the project masters, but at significant personal cost.

For anyone out there that has influence over large, ambitious projects, one of the most pivotal things you can do is make sure it is a type 1, not a type 2 project, as seen from the 90% line.

Finish.

More collected Geek Thoughts at http://geekthoughts.info.

September 9th, 2010

FCC opens its databases

Good news for big data fans. The FCC has released APIs to several large databases involving broadband statistics, spectrum licenses, and some related topics. I haven’t had a chance for a close look yet; perhaps we can do that together. Link. -m

September 2nd, 2010

Is XForms really MVC?

This epic posting on MVC helped me better understand the pattern, and all the variants that have flowed outward from the original design. One interesting observation is that the earlier designs used Views primarily as output-only, and Controllers primarily as input-only, and as a consequence the Controller was the one true path for getting data into the Model.

But with browser forms, input and output are tightly intermingled. The View takes care of input and output. Something else has primary responsibility for mediating the data flow to and from the model–and that something has been called a Presenter. This yields the MVP pattern.

The terminology gets confusing quickly, but roughly

XForms Instance == MVP Model

XForms Model == MVP Presenter

XForms User Interface == MVP View

It’s not wrong to associate XForms with MVC–the term has become so blurry that it’s easy to lump variants like MVP into the same bucket. But to the extent that it makes sense to talk about more specific patterns, maybe we should be calling the XForms design pattern MVP instead of MVC. Comments? Criticism? Fire away below. -m

August 22nd, 2010

Eulogy for SearchMonkey

This is indeed a sad day for all of us, for on October 1, a great app will be gone. Though we hardly had enough time during his short life to get to know him, like the grass that withers and fades, this monkey will finish his earthly course.

Updated SearchMonkey logo

Photo by Micah

I know he left many things undone, for example only enhancing 60% of the delivered result pages. He never got a chance to finish his life’s ambition of promoting RDFa and microformats to the masses or to be the killer app of the (lower-case) semantic web. You could say he will live on as “some of this structured data processing will be supported natively by the Microsoft platform”. Part of the monkey we loved will live on as enhanced results continue to flow forth from the Yahoo/Bing alliance.

The SearchMonkey Alumni group on LinkedIn is filled with wonderful mourners. Micah Alpern wrote there

I miss the team, the songs, and the aspiration to solve a hard problem. Everything else is just code.

Isaac Asimov was reported to have said “If my doctor told me I had only six minutes to live, I wouldn’t brood. I’d type a little faster.” Today we can identify with that sentiment. Keep typing.

-m

August 11th, 2010

Geek Thoughts: hard to find

Found this article interesting. Not too many hundreds of years ago, cutting-edge scientific research involved watching balls roll down ramps. Making fundamental discoveries seems to be slowing down, or at least getting harder. As a consequence, we should expect more big discoveries from the sciences where the relevant technology follows a Moore’s-Law-like exponential growth trajectory. There may be some hope yet for fundamental, game-changing discoveries in computer science.

Best of all, perhaps, is the word “scientometrics”.

August 5th, 2010

Balisageurs: XML and JSON

At David Lee’s nocturne about XML and JSON round-tripping, several folks were talking about a site that listed several “off-the-shelf” conversion methods, but nobody could remember the site.

Late that night, with 15 minutes of battery remaining, I found it. The operative search term is XSLTJSON. -m