Most CIOs can tell you exactly how many years their organizations have been storing data. Get a group of CIOs talking, and they’ll start swapping how far back their data stores go, just as baseball players might trade batting averages.
For many, the logic goes like this: If data is the new gold, then your IT organization should gather and store as much as it can, in hopes that someday artificial intelligence and machine learning can glean profitable insights from it.
But the reality is data is expensive. In a recent blog post, well-known author and marketer Seth Godin succinctly argues against the notion that more data is better. He lays out three principles that I think are worth thinking about from an IT and machine learning/AI perspective. He writes:
- “Don’t collect data unless it has a non-zero chance of changing your actions.”
- “Before you seek to collect data, consider the costs of processing that data.”
- “Acknowledge that data collected isn’t always accurate, and consider the costs of acting on data that’s incorrect.”
Godin seems interested in how having more data may actually hinder marketers from doing their best work. I think his principles are just as applicable to technologists working with big data. Let’s take a look at each of Godin’s three principles and how they apply to IT.
1. Don’t collect data unless it has a non-zero chance of changing your actions.
I’m often asked questions like “How much data should we be collecting?” and “How do we know when we have enough data (or the right data)?” There’s no single correct answer to these questions. You need as much data as you need: sometimes that means a lot of data, and sometimes it means a very small amount. You can’t know the answer at the beginning of any data-related project, but you can start small and see how any new data affects your models and systems.
[ Read the related article by Eric Brown: Getting started with AI in 2019: Prioritize data management. ]
As an example, I had a client who was building a model to forecast customer churn. They had all the “standard” inputs for a churn model in their industry, but they also had some inputs that seemed out of place: this client had scoured the web for all sorts of demographic datasets, trying to find an “edge” for the model. Adding another variable might take your accuracy from 86.7 percent to 86.8 percent, but is the cost of accessing, storing, and processing that data worth a 0.1-point increase in accuracy?
If you’re modeling flight dynamics for an airplane, that improvement might be worth striving for; if you’re modeling customer churn, you’re most likely wasting valuable resources for minimal return. Start small and make sure the data you collect is meaningful.
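To make “start small” concrete, here’s a minimal sketch of how you might test whether a candidate feature earns its keep before paying to collect it at scale. It uses scikit-learn on synthetic data; the churn inputs and the candidate demographic feature are hypothetical stand-ins, not my client’s actual model:

```python
# A minimal sketch of "start small": before paying to collect a new
# demographic feature at scale, test it on a sample and see whether it
# moves the needle. All data here is synthetic and all features are
# hypothetical stand-ins -- substitute your own churn dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 5_000  # a small sample is enough to estimate the incremental lift

# "Standard" churn inputs (stand-ins for tenure, spend, support tickets)
X_standard = rng.normal(size=(n, 3))
# Candidate demographic feature scraped from the web (here: pure noise)
x_candidate = rng.normal(size=(n, 1))
# Synthetic churn labels driven only by the standard inputs
y = (X_standard @ np.array([1.5, -1.0, 0.5]) + rng.normal(size=n) > 0).astype(int)

baseline = cross_val_score(LogisticRegression(), X_standard, y, cv=5).mean()
augmented = cross_val_score(
    LogisticRegression(), np.hstack([X_standard, x_candidate]), y, cv=5
).mean()

print(f"baseline accuracy: {baseline:.3f}")
print(f"with new feature:  {augmented:.3f}")
print(f"lift: {augmented - baseline:+.3f}")
```

If the lift is within the noise across cross-validation folds, that “edge” probably isn’t worth the collection, storage, and processing costs.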
2. Before you seek to collect data, consider the costs of processing that data.
I constantly hear from colleagues and clients that “storage is cheap” today, so it makes sense to store all the data you can and worry about what to do with it later. I’ve been guilty of this thinking myself, and I’ve found myself with terabytes of data that I had no idea what to do with, or why I’d collected it in the first place.
I’m all for experimenting with new data and new inputs to models, but you can experiment at low volume and low scale to see how a new dataset might help (or hurt) your efforts without collecting “all the data” and incurring unnecessary costs.
While storing and processing data might be relatively cheap compared to five, 10, or 20 years ago, those savings do not necessarily translate into overall cost savings when you’re storing terabytes of data that may never be used, or worse, that may be used without understanding its quality and history. Physically storing data may be cheap, but once you include the costs of interpreting, analyzing, and using that data, you may be surprised at just how expensive “cheap” can become.
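To see why, it helps to sketch the math. Here’s a back-of-envelope TCO calculation; every dollar figure and rate below is a hypothetical placeholder, so plug in your own cloud pricing and loaded labor rates:

```python
# Back-of-envelope TCO for "cheap" storage. Every figure below is a
# hypothetical placeholder -- the point is the shape of the math,
# not the numbers.
TB_STORED = 50
STORAGE_PER_TB_MONTH = 25        # $/TB-month, object storage (assumed)
QUERY_COST_PER_TB_SCANNED = 5    # $/TB scanned per analysis run (assumed)
SCANS_PER_MONTH = 20
HOURS_CLEANING_PER_MONTH = 40    # staff time spent on quality and lineage
LOADED_HOURLY_RATE = 100         # $/hour, fully loaded (assumed)

storage = TB_STORED * STORAGE_PER_TB_MONTH
processing = TB_STORED * QUERY_COST_PER_TB_SCANNED * SCANS_PER_MONTH
people = HOURS_CLEANING_PER_MONTH * LOADED_HOURLY_RATE

print(f"storage:    ${storage:>8,}/month")     # $1,250
print(f"processing: ${processing:>8,}/month")  # $5,000
print(f"people:     ${people:>8,}/month")      # $4,000
```

In this made-up scenario, the bytes are the smallest line item: the work around the data dominates the bill.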
3. Acknowledge that data collected isn’t always accurate, and consider the costs of acting on data that’s incorrect.
This principle might just be the most important of the three. Data isn’t always accurate: even with the best data management practices and systems, you will most likely run across data that is slightly inaccurate or just flat-out wrong.
If you have the budget to “store all the data,” have at it – just be sure you understand the total cost of ownership (TCO) for your data. You are not only paying to collect, store, and process data, but also paying to own data that may be completely wrong. At some point, you’ll need to put time into understanding whether that data is “good” data. So you’re going to have costs associated with quality management for your data on top of any collection, storage, and processing costs.
The costs of inaccurate data can be significant, so it's worth the effort (and the budget) to put proper data quality management practices in place to catch inaccuracies before they make it into production.
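What do those practices look like in code? Here’s a minimal sketch of a pre-production quality gate using pandas. The column names, thresholds, and file are all hypothetical, but the pattern carries over: state your assumptions about the data explicitly, and fail loudly when they break.

```python
# A minimal sketch of pre-production data quality gates. The column
# names and thresholds are hypothetical assumptions; adapt them to
# your own schema.
import pandas as pd

def validate_churn_inputs(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable problems found in the frame."""
    problems = []
    if df["customer_id"].duplicated().any():
        problems.append("duplicate customer_id values")
    if df["monthly_spend"].lt(0).any():
        problems.append("negative monthly_spend values")
    if df["tenure_months"].isna().mean() > 0.05:  # >5% missing (assumed threshold)
        problems.append("tenure_months is more than 5% missing")
    if not df["churned"].isin([0, 1]).all():
        problems.append("churned contains values other than 0/1")
    return problems

df = pd.read_csv("churn_inputs.csv")  # hypothetical input file
issues = validate_churn_inputs(df)
if issues:
    raise ValueError("data quality check failed: " + "; ".join(issues))
```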
Your data's real TCO
Seth’s three principles all revolve around good data management practices. If you have poor data management processes and systems in place, your TCO for data is going to be much higher than if you have quality data management and data governance processes.
Seth ends his article with these words:
“Strip away all insignificant digits.”
As a trained physicist, engineer, and data scientist, I find those hard words to read and agree with, because in theory every digit is significant. In the real world, though, very few areas need accuracy to the second decimal place (and sometimes even the first).
In most instances, a model accuracy of 86.8 percent is just as good as 86.9 percent. Sure, the improvement might mean a few thousand dollars a year more for your organization, but at what cost? If you’re spending more than that to collect, store, and process the data behind that small accuracy improvement, you’re not improving the bottom line.
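Here’s a quick way to see how many of those digits are actually significant. The standard error of an accuracy estimate on a held-out test set is roughly sqrt(p(1-p)/n); the test-set size below is an assumption for illustration:

```python
# How many accuracy digits are significant? The standard error of an
# accuracy estimate on a test set of size n is roughly
# sqrt(p * (1 - p) / n). The test-set size here is assumed.
import math

p = 0.868   # measured accuracy
n = 10_000  # test-set size (assumed)

se = math.sqrt(p * (1 - p) / n)
print(f"accuracy: {p:.1%} +/- {1.96 * se:.1%} (95% interval)")
# -> accuracy: 86.8% +/- 0.7% (95% interval)
```

On a 10,000-row test set, 86.8 percent versus 86.9 percent is well inside the noise: exactly the kind of insignificant digit Seth is telling us to strip away.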
[ Want lessons learned from CIOs applying AI? Get the new HBR Analytic Services report, An Executive's Guide to Real-World AI. ]