I’ve been thinking a lot about big data, and two recent items nicely capture a slice of the discussion.
1) Alex Milowski recounts working with Big Weather Data. He concludes that ‘naive’ (as-is) data loading is a “doomed” approach. Even small amounts of friction add up at scale, so you should plan on doing some in-situ cleanup. He came up with a slick solution in MarkLogic; go read his post for the details.
2) Chris Dixon on Making Large Datasets Useful. Typical approaches like machine learning only solve 80–90% of the problem, so you need to either live with errorful data or invoke manual clean-up processes.
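The friction point is easy to see in miniature. Here's a small Python sketch (my own toy example, not Milowski's MarkLogic approach, and the station records are invented): at scale some fraction of records is always malformed, so an as-is bulk load aborts, while cleanup at ingest keeps what parses and quarantines the rest.

```python
import json

# Hypothetical raw weather records: at scale, a small fraction is always malformed.
raw_lines = [
    '{"station": "KSEA", "temp_c": 12.3}',
    '{"station": "KPDX", "temp_c": "n/a"}',   # friction: non-numeric value
    'not json at all',                         # friction: corrupt line
    '{"station": "KLAX", "temp_c": 19.1}',
]

def load_naive(lines):
    # "As-is" loading: one bad record aborts the whole batch.
    return [json.loads(line) for line in lines]

def load_with_cleanup(lines):
    # In-situ cleanup: parse what you can, normalize, quarantine the rest.
    good, quarantined = [], []
    for line in lines:
        try:
            rec = json.loads(line)
            rec["temp_c"] = float(rec["temp_c"])
            good.append(rec)
        except (ValueError, KeyError, TypeError):
            quarantined.append(line)
    return good, quarantined

good, bad = load_with_cleanup(raw_lines)
print(len(good), len(bad))  # 2 good records, 2 quarantined
```

The quarantine list is where Dixon's trade-off shows up: either those records stay errorful, or somebody (or something) cleans them up by hand.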
Both worth a read. There’s more to say, but I’m not ready to tip my hand on a paper I’m working on…