When I was a teenager taking A-levels, it was relentlessly drummed into me to focus on “what’s the exam question”. The subtext was to make sure you really understood what the examiner was looking for evidence of, and then to provide that evidence in your answer, rather than glibly giving the answer without showing your workings and your thought process.
I hear this phrase used a lot in a work context too. Typically its intent is to ensure we’ve understood what the customer is looking for and what they’re trying to achieve, and to check how the pitch / presentation / demo we’re building addresses the question being asked or implied. It keeps us from telling the story we want to tell when we should be focusing on the problem we’re trying to help the customer with.
This trait is incredibly powerful in the field of data science. Understanding the question that’s being asked and framing the data in the right way to answer that question is one of the key facets of success.
Part of the challenge is that you often don’t know precisely which factors bear on the answer to that question. I’ll give an example from a talk I attended this week about a project to improve safety on remote construction projects by understanding the factors that contribute to accidents.
Rather than looking at point causes of individual accidents, the team modelled a wider perspective: broad sets of data were analysed and fed into a machine learning model that sought clusters of factors related to the outcome “accidents happened”. The critical question was shaped first, and the data and model were designed and deployed to answer it: “What are the conditions in which an accident happens?”
That means bringing a wide range of possible data together: for example, temperature, rain & wind data; worker swipe ins & outs to determine working duration, both that day and across multi-day shift patterns; the mix of personnel skills & experience on site; site busyness; specific work activities in the plan; and whether the project is ahead of, on track with or behind budget or schedule. These and a raft of other metrics are analysed, understood and, where appropriate, excluded, to help identify the conditions in which accidents happen and which factors are relevant to predicting them.
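To make the swipe-in example a little more concrete, here is a minimal Python sketch of how raw swipe records might be turned into two of the factors mentioned above: hours on site that day, and the number of consecutive days worked. The record layout, the worker IDs and every value in `SWIPES` are made up for illustration; nothing here comes from the project described in the talk.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical swipe records: (worker_id, swipe_in, swipe_out).
# All values are invented for illustration only.
SWIPES = [
    ("w1", "2024-03-04 07:00", "2024-03-04 17:30"),
    ("w1", "2024-03-05 07:00", "2024-03-05 19:00"),
    ("w1", "2024-03-06 07:00", "2024-03-06 18:00"),
    ("w2", "2024-03-05 08:00", "2024-03-05 16:00"),
]

FMT = "%Y-%m-%d %H:%M"


def hours_per_day(swipes):
    """Return {(worker_id, date): hours on site}, keyed by the swipe-in date."""
    totals = defaultdict(float)
    for worker, t_in, t_out in swipes:
        start = datetime.strptime(t_in, FMT)
        end = datetime.strptime(t_out, FMT)
        totals[(worker, start.date())] += (end - start).total_seconds() / 3600
    return dict(totals)


def consecutive_days(hours, worker, day):
    """Count how many days in a row, ending on `day`, the worker was on site."""
    streak = 0
    while (worker, day) in hours:
        streak += 1
        day -= timedelta(days=1)
    return streak


hours = hours_per_day(SWIPES)
```

Calling `consecutive_days(hours, "w1", datetime(2024, 3, 6).date())` on this toy data gives a three-day streak for worker `w1`. Whether a feature like this is actually relevant to accidents is exactly what the analysis has to establish.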
With all of this in place, indications can be given to help inform behaviours and procedures – “Due to windy conditions, today is a high-risk day. We’ve had reported accidents on the last 7 days like today” – and hence you can tailor safety measures accordingly.
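As a rough illustration of that kind of message, the sketch below counts how many past days “like today” had a reported accident. It is a deliberately simple stand-in, not the machine learning model from the talk: “like today” is reduced to a single windy / not windy flag, the 50 kph threshold is an assumption, and the `HISTORY` records are invented.

```python
# Hypothetical daily history; all values are invented for illustration.
HISTORY = [
    {"wind_kph": 55, "rain_mm": 0, "accident": True},
    {"wind_kph": 60, "rain_mm": 2, "accident": True},
    {"wind_kph": 10, "rain_mm": 0, "accident": False},
    {"wind_kph": 52, "rain_mm": 1, "accident": False},
    {"wind_kph": 15, "rain_mm": 12, "accident": False},
]

WINDY_KPH = 50  # assumed threshold for calling a day "windy"


def similar_day_summary(history, today, windy_kph=WINDY_KPH):
    """Return (number of past days like today, how many had an accident).

    "Like today" here only means the same windy / not windy flag.
    """
    is_windy = today["wind_kph"] >= windy_kph
    similar = [d for d in history if (d["wind_kph"] >= windy_kph) == is_windy]
    accidents = sum(d["accident"] for d in similar)
    return len(similar), accidents


def briefing(history, today):
    """Build a short, human-readable safety note for today."""
    n, k = similar_day_summary(history, today)
    if n == 0:
        return "No comparable days on record."
    return f"{k} of the last {n} days like today had a reported accident."


print(briefing(HISTORY, {"wind_kph": 58, "rain_mm": 0}))
```

A real system would compare days on many factors at once and let the model decide which ones matter, but the output has the same shape: a plain statement that people on site can act on.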
Many big data projects seek to “capture everything”, and where that data is later revisited, properly understood and used carefully and selectively, that can be powerful. In the construction illustration above, had they not kept staff swipe ins longer than needed for payroll purposes, some of the inferences based upon shift patterns might not have been available to correlate with accidents and other factors. The key is then to analyse carefully how the data you have may (or may not) relate to your question, and to ensure you’re actually answering the question you think you are.