How to do data science without big data

You don't need a data warehouse to pursue important analytics initiatives. Consider these four priorities if you want to gain valuable insights with limited data

Too often, IT leaders park their data science initiatives until they can build a robust data engineering layer. They wait for a data warehouse to be available before planning data analytics projects, assuming that advanced analytics is essential for transformational business value and that large volumes of neatly organized data are a prerequisite for it.

Nothing could be further from the truth.

Here are four things to keep in mind if you don’t have big data but would like to pursue data science initiatives.


1. Business problems should determine the kind of analytics you need

About 80 percent of data science projects fail to deliver business outcomes, Gartner estimates. A key reason for this is that leaders do not pick the right business problems to be solved. Most data analytics projects are chosen based on available data, available skills, or available toolsets. These are recipes for failure; a data analytics project should never begin with either data or analytics.

The best way to start the data science journey is by reflecting on your organizational strategy. Identify the most important problems your target users want solved and validate whether addressing them will deliver the desired business impact. The business challenges you choose will dictate the analytics approach you should take and hence the data you need.

Not having data to begin with can even be an advantage: When you start with a clean slate, you’re not burdened by legacy baggage. Organizations with a longer legacy, by contrast, often struggle with expensive digital transformations.

Consider Moderna, which built a digital-first culture from its inception in 2010. It built a data and analytics platform in service of its business priorities that revolved around developing mRNA-based drugs. This targeted approach was instrumental in enabling Moderna to create the blueprint for the COVID-19 vaccine in just two days.

2. Your analytics approach dictates the data you source

Organizations can spend months building data warehouses only to find that the data they’ve collected isn’t good enough to perform the analytics they need. Machine-learning algorithms often need data of a particular type, volume, or granularity. Attempting to build a perfect data engineering layer without clarity on how it will be used is a wasted effort.

When you have visibility into the organizational strategy and the business problems to be solved, the next step is to finalize your analytics approach. Determine whether you need descriptive, diagnostic, or predictive analytics and how the insights will be used. This will clarify the data you should collect. If sourcing data is a challenge, phase the collection process so you can make iterative progress on the analytics solution.

For example, executives at a large computer manufacturer we worked with wanted to understand what drove customer satisfaction, so they set up a customer experience analytics program that started with direct feedback gathered through voice-of-customer surveys. Descriptive insights presented as data stories helped improve net promoter scores in the next survey cycle.

Over the next few quarters, they expanded their analytics to include social media feedback and competitor performance using sources such as Twitter, discussion forums, and double-blind market surveys. To analyze this data, they used advanced machine learning techniques. This solution helped deliver $50 million in incremental customer revenue annually.
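
To make the descriptive first phase of a program like this concrete, here is a minimal sketch of how a net promoter score might be computed and sliced from a small survey export. The file name, column names, and product-line grouping are hypothetical, not details of the actual program.

```python
# Minimal sketch: net promoter score (NPS) from a small voice-of-customer export.
# The file and column names below are illustrative assumptions.
import pandas as pd

df = pd.read_csv("voc_survey.csv")  # one row per response, 0-10 recommend score

def nps(scores: pd.Series) -> float:
    """NPS = % promoters (scores 9-10) minus % detractors (scores 0-6)."""
    promoters = (scores >= 9).mean()
    detractors = (scores <= 6).mean()
    return round((promoters - detractors) * 100, 1)

print("Overall NPS:", nps(df["recommend_score"]))
# Slicing by product line is often enough to turn one number into a data story
print(df.groupby("product_line")["recommend_score"].apply(nps))
```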


3. Data collection begins with easily available, small data

When we think of prerequisites for machine learning, big data often comes to mind. But it’s a misconception that you need large volumes of data to deliver transformational business value. Many leaders incorrectly assume that you must collect millions of data points to discover hidden business insights.

Once you have zeroed in on the objectives, business problems, and analytics approach, the next step is to pull together the data to be analyzed. Many business challenges can be solved with simple descriptive analytics on small spreadsheets of data. By reducing the entry barrier to a few hundred rows of data, you can manually pull together data from existing systems, digitize paper records, or set up simple processes to capture the data you need.
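
As a rough illustration, a first pass over a manually assembled extract can be as simple as the descriptive summary below; the file and column names are assumptions for the sketch, not a prescription.

```python
# Minimal sketch: descriptive analytics on a few hundred manually gathered rows.
# File and column names are hypothetical.
import pandas as pd

orders = pd.read_excel("manual_extract.xlsx")  # a few hundred rows is plenty

# Basic profiling per region: volume, average value, and on-time delivery rate
summary = (
    orders.groupby("region")
    .agg(order_count=("order_id", "count"),
         avg_order_value=("order_value", "mean"),
         on_time_rate=("delivered_on_time", "mean"))
    .sort_values("avg_order_value", ascending=False)
)
print(summary)
```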


In another example, a mattress manufacturer we worked with wanted to tap analytics to improve production yield. As a mid-sized enterprise early in its data journey, the company had a small data footprint consisting primarily of manually prepared spreadsheets. Rather than delay the pursuit of analytics, the company undertook a diagnostic analytics project to optimize yield.

It digitized the machine data available on paper, combined it with manually prepared data in a handful of spreadsheets, and then used simple statistical techniques to analyze hundreds of rows and identify levers for optimization. The recommendations, which ranged from optimizing production batches to tuning temperature and humidity, surfaced a potential yield improvement of 2.3 percent, translating into $400,000 in additional annual revenue.
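
As an illustration only, the "simple statistical techniques" step might look something like the sketch below, which ranks candidate levers by their correlation with yield. The spreadsheet layout and column names are assumptions, not details from the project.

```python
# Hedged sketch: rank candidate levers by correlation with production yield.
# Column names are illustrative assumptions.
import pandas as pd

runs = pd.read_csv("production_runs.csv")  # hundreds of rows, one per batch

levers = ["batch_size", "oven_temperature_c", "humidity_pct", "line_speed"]
correlations = (
    runs[levers]
    .corrwith(runs["yield_pct"])
    .sort_values(key=abs, ascending=False)
)
print(correlations)

# Inspect the top lever's setting in the best-performing batch
top_lever = correlations.index[0]
best_batch = runs.loc[runs["yield_pct"].idxmax()]
print(f"Highest-yield batch ran with {top_lever} = {best_batch[top_lever]}")
```

Correlation is not causation, but a ranking like this is usually enough to decide which settings deserve a controlled trial on the production line.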

4. Adopt an incremental approach to deliver transformational value from data

The key takeaway here is to avoid parking your data science initiatives just because you have limited data. The journey is never a strictly sequential process. Adopt a design thinking approach to identify the right business problems to solve. Embrace an agile methodology to design the right analytics approach for those challenges. Finally, work out an iterative process to source the data you need incrementally.

In most scenarios, it is impossible to anticipate every data source you will need up front. It is also a waste of resources to evaluate advanced analytics techniques when you are just beginning your data analytics journey; this leads to over-engineering and analysis paralysis. Remember that executing data analytics projects is itself what helps you build a robust data engineering roadmap.

This iterative process is more efficient and effective, and it will deliver transformational value. Build momentum by delivering quick wins through byte-sized data analytics solutions. Focus on user adoption to help convert insights into business decisions, and ultimately into return on investment.


Ganes Kesari is an entrepreneur, AI thought leader, author, and TEDx speaker. He co-founded Gramener, where he heads Data Science Advisory and Innovation. He advises executives of large organizations on data-driven decision making.