As a data scientist, when I meet with new clients, I like to crowdsource the question: What do you think data science is?
At first, people hesitate to respond, thinking that the “right” answer is something complicated and involves mathematics beyond their capabilities. But then, as we break it down, a few bold souls will voice their definitions. As you might imagine, the responses vary, depending not only on people’s actual familiarity with data science, but also on their backgrounds, their needs, and their responsibilities.
For me, data science is simply a scientific way of approaching a business problem. And because our approach is science-based, we follow a defined process. You may remember, from grade school, learning about the “scientific method,” a centuries-old, step-by-step system for experimentation involving observation, research, forming and testing a hypothesis, recording results, drawing conclusions, and then replication. These steps have direct parallels in how we approach data science at TKXS, as well as in other aspects of our business, including:
- Questioning. What is the business problem? At this step -- the equivalent of the scientific method’s “observation” step -- we don’t know. So we spend a good bit of time up front asking questions and getting to know and understand your business. It is not just our data science team: our executives, our sales team, and our technical solutions owners are all looking at the client’s objectives and refining them based on what we learn. By keeping an open mind -- and listening ears -- we acquire a genuine understanding of the business problem, which is, of course, key to solving it.
- Gathering. You are probably familiar with the feeling: “data overload, but information deprived.” Most of our clients have plenty of data. They just can’t find it or use it, so it loses its value. Our team is accustomed to collecting and scraping data from multiple sources, and we identify which of those sources will be best for answering the questions identified in Step One.
- Scrubbing. It’s one thing to have data, but it’s another to be able to use it. At this stage, our team cleans the data, eliminating misspellings and duplicates, dealing with missing values, etc., based on the questions being asked (Step One) and the data provided (Step Two). The end result is the building block of our project: usable data.
- Analyzing. This is where we begin delving into the data and understanding what we have to work with. In this initial pass, we start seeing the big picture. Trends, averages, ranges, etc. begin emerging, and we start thinking about ways to boil it down into something workable.
- Modeling. This is what most people think of when they think of data science -- algorithms. As we begin identifying patterns, we start building models, then training, tuning and testing them to identify the best performer. This is the pure development phase, involving software and coding, with multiple iterations to find the model that answers the problem with the highest accuracy. This is where we, as data scientists, reach into our tool bag and select the proper tool for the problem at hand. This is when the data -- gathered, scrubbed, analyzed and modeled -- really begins to sing.
- Communicating. As you might imagine, data findings can be complicated and challenging to communicate. At this step, we create easy-to-understand dashboards, visuals and reports to share with our stakeholders. We tell the story of the data, often pulling in the talents of our UX team to create the pictures that are worth a thousand words. Communication is critical for buy-in. Without it, our models would not be of value to our customers. So when we present to our customers, we make sure we give them the “prescription” from our analysis: the action they need to take to improve their business.
- Deploying and Maintaining. After sharing our findings, we’re ready to deploy our model, the equivalent of the scientific method’s “replication.” We’ve got teams to monitor and maintain performance, to identify opportunities to optimize our models for new scenarios, and to ensure that the basics are covered: the servers are up, jobs run on time, and the data is available.
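For readers who like to see things in code, here is a minimal sketch of what the scrubbing, analyzing and modeling steps might look like on a toy scale. It is an illustration only, not the tooling we use at TKXS: the records, the field names (`region`, `ad_spend`, `sales`), the spelling table and every function name are hypothetical, and it uses only the Python standard library. Dropping rows with missing values is just one of several ways to deal with them.

```python
from statistics import mean

# Hypothetical raw records, invented purely for illustration.
RAW_ROWS = [
    {"region": "North", "ad_spend": 10.0, "sales": 25.0},
    {"region": "north ", "ad_spend": 10.0, "sales": 25.0},  # duplicate once normalized
    {"region": "Nroth", "ad_spend": 20.0, "sales": 45.0},   # misspelling
    {"region": "South", "ad_spend": 30.0, "sales": None},   # missing value
    {"region": "South", "ad_spend": 40.0, "sales": 85.0},
    {"region": "East", "ad_spend": 50.0, "sales": 105.0},
    {"region": "East", "ad_spend": 60.0, "sales": 125.0},
]

# Assumed lookup table of known misspellings.
SPELLING_FIXES = {"nroth": "north"}


def scrub(rows):
    """Scrubbing: normalize spellings, drop missing values and duplicates."""
    cleaned, seen = [], set()
    for row in rows:
        if row["ad_spend"] is None or row["sales"] is None:
            continue  # simplest option: skip incomplete rows
        region = row["region"].strip().lower()
        region = SPELLING_FIXES.get(region, region)
        key = (region, row["ad_spend"], row["sales"])
        if key in seen:
            continue  # exact duplicate of a row we already kept
        seen.add(key)
        cleaned.append(
            {"region": region, "ad_spend": row["ad_spend"], "sales": row["sales"]}
        )
    return cleaned


def summarize(rows, field):
    """Analyzing: a first look at the average and range of one field."""
    values = [row[field] for row in rows]
    return {"mean": mean(values), "min": min(values), "max": max(values)}


def fit_baseline(xs, ys):
    """Candidate model 1: always predict the average of the training targets."""
    average = mean(ys)
    return lambda x: average


def fit_line(xs, ys):
    """Candidate model 2: ordinary least-squares straight line."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return lambda x: slope * x + intercept


def compare_models(train, test):
    """Modeling: fit each candidate on train, score it on held-out test rows."""
    xs = [row["ad_spend"] for row in train]
    ys = [row["sales"] for row in train]
    candidates = {"baseline": fit_baseline(xs, ys), "linear": fit_line(xs, ys)}
    return {
        name: mean(abs(model(row["ad_spend"]) - row["sales"]) for row in test)
        for name, model in candidates.items()
    }


if __name__ == "__main__":
    cleaned = scrub(RAW_ROWS)
    print(summarize(cleaned, "sales"))
    errors = compare_models(train=cleaned[:3], test=cleaned[3:])
    print(min(errors, key=errors.get), errors)
```

The point of the sketch is the shape of the work, not the specific techniques: clean first, look at the big picture second, and only then compare candidate models on data they have not seen. A real project would typically have far more data, more candidate models, and a more careful train/test split.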
When you break it down this way -- questioning, gathering, scrubbing, analyzing, modeling, communicating, deploying and maintaining -- it’s easy to see that “data science” is simply a process. Once you understand what you’re looking for, you use data to prove your point. So once again, here’s the question: What do you think data science is?