The term ‘big data’ emerged in the mid-noughties and is typically understood to mean the modern set of techniques for storing, processing and analysing data that became necessary with the advent of Internet-scale, cloud-scale applications.
Big data has its origins in vast, ever-growing streams of data – think of the Twitter fire-hose of tweets, or the click-streams of activity being generated worldwide by the many millions of websites running Google Analytics – but what constitutes ‘big’ simply gets redefined all the time. The real differentiator between ‘traditional’ data methodologies and what’s today called ‘big data’ was never so much the scale as the overall architecture of how data is captured, stored and analysed.
Traditionally, an organisation seeking to get its data house in order started by attempting to design and organise everything up front: striving to anticipate the important questions you might ask of your data, then conforming, transforming, mapping, cleaning and loading it into a data warehouse.
The tooling and design patterns for such objectives evolved and matured, and so too did the design and implementation practices. For the structured, slow-changing core data at the heart of your organisation, I wouldn’t argue against the practice of a data warehouse – for many types of analytical need they still serve a huge purpose – but they have their limits and their challenges.
One such challenge, and it’s a significant one, is that they are designed and implemented (often at length and great expense) to answer a pre-determined set of questions, defined in advance – sometimes years before use – and they typically process and house that data on infrastructure that sits at the pricey end of the spectrum.
The process typically involves ETL – Extracting, Transforming and Loading – the data from its sources into standardised patterns optimised for reporting. But, by its very nature, doing this means you’ve potentially lost or inferred data along the way: you’ve perhaps pruned the fields you didn’t need to answer today’s questions, and you therefore risk not having that data to answer future questions you don’t yet know you’ll need to ask.
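A minimal sketch can make that loss concrete. The records, field names and transformation below are purely illustrative – not from any real pipeline – but they show how a transform step tuned to today’s report silently discards fields a future question might need:

```python
# Illustrative click-stream events as they arrive from the source system.
raw_events = [
    {"user": "alice", "page": "/home", "ts": "2016-03-01T09:00:00Z",
     "user_agent": "Mozilla/5.0", "referrer": "https://example.com"},
    {"user": "bob", "page": "/pricing", "ts": "2016-03-01T09:05:00Z",
     "user_agent": "curl/7.47", "referrer": None},
]

def transform(record):
    # Keep only the fields today's report needs. 'user_agent' and
    # 'referrer' are pruned here -- once loaded, the warehouse can no
    # longer answer future questions about browsers or traffic sources.
    return {
        "user": record["user"],
        "page": record["page"],
        "date": record["ts"][:10],  # truncate the timestamp to a day
    }

# The 'Load' step: only the transformed shape reaches the warehouse.
warehouse = [transform(r) for r in raw_events]
```

The pruning is the point: `warehouse` answers “which pages did each user visit, and when?” perfectly well, but the question “which browsers do our users run?” is now unanswerable from the loaded data.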
And so alternative approaches have evolved. Massively parallel schemes use low-cost hardware but optimise the processing so it can be run in parallel – using hundreds of ‘cheap’ servers to process the data and then shutting them down, rather than building a monolithic server and running it 24×7. Combine this with an approach of storing the raw data and transforming it only when you need to use it – the ‘Data Lake’ approach – allowing you to keep all the raw data in its native format today and to go back and mine it for further value later, and you’ve got a pattern that can store huge amounts of data, process it effectively and deliver insights and analytics today, whilst not compromising on what you might want to mine in the future.
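The ‘hundreds of cheap servers’ idea boils down to a scatter/gather pattern: split the data into chunks, process each chunk independently, then combine the partial results. The toy sketch below uses threads standing in for those servers, and an artificial sum as the workload – both are illustrative stand-ins, not a real cluster framework:

```python
from concurrent.futures import ThreadPoolExecutor

# Scatter: split the dataset into independent chunks, one per 'server'.
data = list(range(1_000))
chunks = [data[i:i + 100] for i in range(0, len(data), 100)]

def process(chunk):
    # Each worker computes a partial aggregate over its own chunk only;
    # no chunk depends on any other, which is what makes the scheme
    # embarrassingly parallel.
    return sum(chunk)

# In the cloud version the pool would be hundreds of cheap machines,
# spun up for the job and shut down afterwards.
with ThreadPoolExecutor(max_workers=10) as pool:
    partials = list(pool.map(process, chunks))

# Gather: combine the partial results into the final answer.
total = sum(partials)
```

The design choice worth noticing is that `process` sees only its own chunk – that isolation is what lets you trade one expensive always-on server for many cheap short-lived ones.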
Interestingly, this pattern doesn’t add up only for the Internet-scale data producers; it applies equally to regular business workloads, and to batch processing and analysis of even modest data volumes. And with ‘pay as you use’ cloud computing services such as Azure Data Lake making these services cheaper, simpler to get started with and easy to scale, they apply as much to ‘little data’ as to ‘big data’.
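The ‘store the raw data, transform it only when you need it’ idea works at any scale, which is why it suits little data too. The sketch below is a minimal, assumption-laden illustration: an in-memory buffer stands in for cheap object storage, and the record fields are invented. The key move is that ingestion stores events verbatim, and a schema is applied only at read time:

```python
import io
import json

# An in-memory buffer standing in for cheap blob/object storage.
lake = io.StringIO()

def ingest(record):
    # Store the raw event verbatim, one JSON document per line --
    # nothing pruned, nothing conformed, no schema imposed up front.
    lake.write(json.dumps(record) + "\n")

def query(transform):
    # Schema-on-read: a transformation is applied only at the moment
    # a question is actually asked, against the full raw data.
    lake.seek(0)
    return [transform(json.loads(line)) for line in lake]

ingest({"user": "alice", "page": "/home", "user_agent": "Mozilla/5.0"})
ingest({"user": "bob", "page": "/pricing", "user_agent": "curl/7.47"})

# Today's question: which pages did each user visit?
pages = query(lambda r: (r["user"], r["page"]))

# A future question, answerable only because the raw data was kept:
agents = query(lambda r: r["user_agent"])
```

Contrast this with the warehouse flow: the `user_agent` field was never thrown away, so the second query works even though nobody anticipated it at ingestion time.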
So is it time to drop the label ‘big’ and just get back to working with ‘data’?