O’Reilly Radar/ Joseph Hellerstein
“There is a debate brewing among data systems cognoscenti as to the best way to do data analysis at this scale. The old guard in the Enterprise IT camp tends to favor relational databases and the SQL language, while the web upstarts have rallied around the MapReduce programming model popularized at Google, and cloned in open source as Apache Hadoop. Hadoop is in wide use at companies like Yahoo! and Facebook, and gets a lot of attention in tech blogs as the next big open source project. But if you mention Hadoop in a corporate IT shop you are often met with blank stares — SQL is ubiquitous in those environments. There is still a surprising disconnect between these developer communities, but I expect that to change over the next year or two.
We are at the beginning of what I call The Industrial Revolution of Data. We’re not quite there yet, since most of the digital information available today is still individually “handmade”: prose on web pages, data entered into forms, videos and music edited and uploaded to servers. But we are starting to see the rise of automatic data generation “factories” such as software logs, UPC scanners, RFID, GPS transceivers, video and audio feeds. These automated processes can stamp out data at volumes that will quickly dwarf the collective productivity of content authors worldwide. Meanwhile, disk capacities are growing exponentially, so the cost of archiving this data remains modest. And there are plenty of reasons to believe that this data has value in a wide variety of settings. The last step of the revolution is the commoditization of data analysis software, to serve a broad class of users.
To get a glimpse at what that software might look like, consider today’s high-end deployments. There are a few different solutions, but they typically share…”