What is Data Science?
Back in 2012, when Harvard Business Review (HBR) described data scientist as the sexiest job of the 21st century, data science became a buzzword.
As Google Trends data suggests, more and more people are getting interested in data science. Yet when I ask data science newbies the simple question “What is data science?”, most replies are not satisfactory. So, to close this knowledge gap, we have created this article.
In this extensive article, you will get a complete overview of data science: what it is, what it isn’t, its applications, the skills required to become a data scientist and much more.
To skip directly to a specific section of the article, you can use the following links:
Part 1: What is data science?
Part 2: How does data science work? The Data Science Process
Part 3: Applications of Data Science
Part 4: Not All Data-Related Jobs Fall Under Data Science
Part 5: Skills Needed to Become a Data Scientist
Part 6: What does the future hold for data science?
Part 1: What is Data Science?
Before defining data science, let’s get familiar with three terms – data, information and insight. These three terms seem similar; however, they are not quite the same.
- Data is a raw, unorganized set of facts. What we call data is actually raw data. It contains information; however, that information is not readily available.
- Information is what you get when you analyze data so that it provides some understanding of what the data says. The term we use is information extraction (from data).
- Insight is gained by analyzing data and information to understand what is going on in a particular situation. This can be used to make better decisions.
Suppose you are analyzing a department store and are interested in its sales data.
- If you keep a record of every sale made in a month, that record is data.
- From the data, you can find out how many times each item was sold. This is information.
- Suppose your task is to increase the sales amount. If you find out that replacing the 10 least-sold items with new items will increase sales, that is insight.
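To make the three terms concrete, here is a minimal sketch of the department store example. The item names and sales records are made up for illustration, and the last step only lists candidates for replacement; whether replacing them actually increases sales is the part that still needs analysis.

```python
from collections import Counter

# Data: a raw record of sales, one entry per sale (hypothetical items).
sales = ["soap", "rice", "soap", "pen", "rice", "soap", "tea", "rice", "soap"]

# Information: how many times each item was sold.
units_sold = Counter(sales)


def least_sold(counts, n):
    """Return the n items with the fewest sales, lowest first (ties by name)."""
    return sorted(counts, key=lambda item: (counts[item], item))[:n]


# A step towards insight: items you might consider replacing.
candidates = least_sold(units_sold, 2)

print(dict(units_sold))
print(candidates)
```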
Data science is all about turning raw data into insights to make better decisions.
Don’t forget the word “science” in “data science”. The label is not completely correct, but it’s not completely incorrect either.
This is what I mean.
Science is a systematic and logical approach to understanding how things work. The insight you draw from a dataset may be statistically and mathematically correct; however, if the insight cannot be logically explained, it can still be wrong. Let me give you an example.
There is a popular correlation example involving shark attacks and ice cream sales. In a survey, it was found that when ice cream sales went up, shark attacks increased as well. Does that mean ice cream sales cause shark attacks? Well, they don’t.
In this case, ice cream sales and shark attacks were both up because the temperature was up. If you decided, based on this insight, to remove all ice cream stalls near the beach, that would be plain wrong.
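Here is a small simulation of that situation. All numbers are synthetic assumptions (the temperature range, the coefficients and the noise levels are invented), so treat it as a sketch of the idea rather than real survey data. Ice cream sales and shark attacks are both generated from temperature only, yet they end up strongly correlated with each other. Once temperature is controlled for with a first-order partial correlation, the relationship mostly disappears.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def partial_corr(xs, ys, zs):
    """Correlation between xs and ys after controlling for zs."""
    r_xy, r_xz, r_yz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


rng = random.Random(42)  # fixed seed so the run is repeatable

# Synthetic daily values: both series depend on temperature, not on each other.
temperature = [rng.uniform(15, 35) for _ in range(200)]
ice_cream = [20 * t + rng.gauss(0, 40) for t in temperature]
shark_attacks = [0.3 * t + rng.gauss(0, 1.5) for t in temperature]

r_raw = pearson(ice_cream, shark_attacks)
r_partial = partial_corr(ice_cream, shark_attacks, temperature)

print(f"ice cream vs shark attacks: {r_raw:.2f}")
print(f"same, controlling for temperature: {r_partial:.2f}")
```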
We like to think of data science as more of an art. You take raw, unmanaged data and generate meaningful insight from it to solve analytically complex problems.
But how? We will discuss that in the data science process. 😉
Part 2: How does data science work? The Data Science Process
Data science is a multidisciplinary field. It involves a systematic blend of scientific methods, processes, algorithm development and technology to extract insight from data.
But how do all these areas fit together?
To understand that, you need to understand the data science process; loosely speaking, it’s a series of steps you need to take to complete a data science task. The process stays the same no matter what the raw data or the data science task is.
Step 1: Ask questions to frame the problem
You begin by asking questions to find the problem. Let’s say your job is to optimize a company’s sales funnel. It’s a pretty common task.
A sales funnel is a customer’s journey from awareness (knowing that a solution exists) to action (buying the product). The stages are Awareness -> Interest -> Decision -> Action.
Let’s take an example. You have sore feet and are searching for ways to cure them.
- You found out from a blog post that a good pair of shoes can relieve sore feet. That is awareness.
- You contacted the company that wrote the blog post to ask more about shoes. That is interest.
- The same company is offering a 10% discount on shoes. Because of it, you decided to buy shoes from that company. That is decision.
- Finally, you bought them. That is action.
If 1000 visitors read the blog post, only a fraction of them, let’s say 150, will engage and ask questions about shoes. Among them, maybe 30 will decide to buy at some point and probably 10 or so will actually buy the shoes.
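The numbers above translate directly into conversion rates. This short sketch uses the example counts from the text (1000, 150, 30 and 10); the stage names follow the funnel, while the function name is just an illustrative choice.

```python
# Visitor counts at each funnel stage, taken from the example above.
funnel = [("awareness", 1000), ("interest", 150), ("decision", 30), ("action", 10)]


def stage_conversion(stages):
    """Return (from_stage, to_stage, rate) for each pair of consecutive stages."""
    return [
        (name_a, name_b, count_b / count_a)
        for (name_a, count_a), (name_b, count_b) in zip(stages, stages[1:])
    ]


rates = stage_conversion(funnel)
overall = funnel[-1][1] / funnel[0][1]  # buyers divided by blog readers

for name_a, name_b, rate in rates:
    print(f"{name_a} -> {name_b}: {rate:.1%}")
print(f"overall: {overall:.1%}")
```

With these counts, 15% of readers show interest, 20% of those decide to buy, about a third of those actually buy, and the funnel as a whole converts 1% of readers.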
Depending on the situation, you might want to improve every phase of the sales funnel, only some of them, or just one. Now you start asking questions:
- Which phase of the funnel is not performing well?
- How can we identify customers who are more likely to buy?
- Can we take a better approach to engage with customers so that they are more likely to take action?
- How is the discount on shoes affecting return on investment and possible future purchases?
After discussion with the marketing team, you decide to focus on just one problem: “How can we identify customers who are more likely to buy?”
Step 2: Get the Data
Once you know what problem you have to solve, it’s time to collect the right data.
In this case, you might want to collect data such as age, gender, previous customers’ transaction history and so on.
This data may come from a variety of sources. For example, data on previous customers may be available in the company’s Customer Relationship Management (CRM) software. Web traffic data is available in web analytics software like Google Analytics. You can collect feedback from visitors in real time by displaying a feedback form. If you think the available data is not sufficient to generate meaningful insight, you may need to collect new data.
The data you have collected here is “raw data”.
Step 3: Explore the Data
“80% of data scientists’ time is spent in finding, cleaning and organizing data.”
More often than not, the raw data you have collected contains anomalies. Before you can remove them, you need to understand every attribute of the dataset.
Then, you begin answering questions like these:
- Are there missing fields in the data? If these missing fields affect the insight, how can you fill them in or remove the affected records?
- Are there invalid values? For example, NaN (Not a Number) where there should be a number. How can you fix them?
- Are all date values in a uniform format and the same time zone?
- If there are multiple datasets, does merging them make sense? If yes, how should you merge them?
After this phase, the data is ready for analysis.
Remember, you cannot get correct insight from wrong data. And, getting incorrect insight is worse than having no insight at all.
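Some of the checks from this step can be sketched in a few lines. The records, field names and timestamps below are hypothetical, and dropping incomplete records is only one possible choice (filling in a sensible value is another); the sketch simply shows missing values, NaN values and mixed time zones being handled.

```python
import math
from datetime import datetime, timezone

# Hypothetical raw records; field names and values are made up for illustration.
raw = [
    {"customer": "a1", "age": 34, "amount": 59.0, "ts": "2019-03-01T10:00:00+05:45"},
    {"customer": "a2", "age": None, "amount": 20.0, "ts": "2019-03-01T04:15:00+00:00"},
    {"customer": "a3", "age": 27, "amount": float("nan"), "ts": "2019-03-02T08:00:00+00:00"},
    {"customer": "a4", "age": 41, "amount": 35.5, "ts": "2019-03-02T09:30:00-05:00"},
]


def is_missing(value):
    """True for None and for a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def clean(records):
    """Drop records with missing or NaN fields and convert timestamps to UTC."""
    cleaned = []
    for rec in records:
        if any(is_missing(value) for value in rec.values()):
            continue  # assumption: incomplete records are dropped, not repaired
        ts_utc = datetime.fromisoformat(rec["ts"]).astimezone(timezone.utc)
        cleaned.append({**rec, "ts": ts_utc})
    return cleaned


cleaned = clean(raw)
for rec in cleaned:
    print(rec["customer"], rec["ts"].isoformat())
```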
Step 4: Model the Data
Once you have explored the data, you can analyze and visualize it to get information. However, this only provides hints and hypotheses. A data scientist needs to provide justification for these patterns. This can be done by fitting the data to a model.
Data modeling is just another way of saying that we want to find out exactly IF and TO WHAT EXTENT two or more attributes are related to each other. It helps us confirm the hypotheses we have about the data.
A model is a simplified way to approximate data in a mathematically formalized form (an equation). And based on the model, you should be able to make predictions if needed.
That’s where fancy stuff like machine learning and algorithm development, and some non-fancy stuff like statistics and probability, come in.
Going back to our sales funnel, after creating an accurate model, we should be able to predict (with good accuracy) which customers are more likely to buy. The prediction can be specific, for example: males in the 20 to 30 age group living in Washington.
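As a very small illustration of “IF and TO WHAT EXTENT” two attributes are related, here is a sketch that fits a straight line with ordinary least squares. It is one of the simplest possible models, not the one you would necessarily use for the sales funnel problem, and the data (site visits versus amount spent) is invented for the example.

```python
import math


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept; also returns Pearson's r."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Hypothetical data: number of site visits per customer and amount spent.
visits = [1, 2, 3, 4, 5]
spend = [12.0, 19.0, 33.0, 38.0, 53.0]

slope, intercept, r = fit_line(visits, spend)


def predict(x):
    """Use the fitted equation to predict spend for a given number of visits."""
    return slope * x + intercept


print(f"spend = {slope:.2f} * visits + {intercept:.2f}  (r = {r:.3f})")
print(f"predicted spend for 6 visits: {predict(6):.1f}")
```

The correlation coefficient r answers the “if” (how strongly the two attributes move together), the slope answers the “to what extent”, and the fitted equation is what lets you make a prediction for a value you have not seen.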
Step 5: Communicate Results
You can easily work with the result you got from the model. After all, you are a data scientist. However, not everyone can understand and use it.
Communication is an underrated but important part of data science. Your results should make sense to other team members. The phrase people often use is “storytelling”.
You need to effectively communicate the answer to the question: “How can we identify customers who are more likely to buy?”
One way to communicate better is by using visuals. People often grasp visual information more readily than numbers and words.
Part 3: Applications of Data Science
According to a report from IBM in 2013, “90% of the data in the world today has been created in the last two years”. It is estimated that the total amount of data will reach 44 zettabytes by 2020. By the way, 1 zettabyte = 1 trillion gigabytes.
You might have heard the term “Big Data”. It refers to a large volume of data which is difficult to process using traditional database and software techniques.
It’s not important how much data is generated; what’s more important is what you do with the data. There is an ongoing misconception that data science simply means dealing with large amounts of data. Well, it’s certainly true that data science deals with big data.
However, it’s not limited to working with big data. Data science can be applied to smaller datasets to some extent. In fact, people from various fields such as journalism, political science, sociology, marketing etc. are looking for ways to solve important problems using data.
Let’s see a few examples of how and where data science is used.
Used correctly, data science can be a game changer for CMOs looking to develop successful marketing strategies. Below are six ways to optimize your marketing through data science.
- Optimizing your budget
- Marketing only where it matters, based on budget and requirements
- Creating strategies for both old and new customers
- Prioritizing and aligning your advertisements
- Planning ahead for high seasons, which are prime opportunities for a marketing push
- Analyzing social data
If you have ever used Amazon, it’s hard to miss the recommended items shown to you because they are so relevant. But how does Amazon know which item to show to a user among hundreds of millions of items? You guessed it: by using data science.
Based on a user’s past purchases, the items they have liked and rated, how their purchases compare to similar purchases by other customers, and various other factors, they are able to recommend the best items. This not only improves the customer experience but also drives more sales.
This is just one example. Many big companies like Facebook, Netflix, Twitter and Google use recommendation systems to improve user experience and drive more sales.
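To give a feel for the idea, here is a toy “customers who bought this also bought” recommender. It is a simple co-occurrence heuristic written for illustration and is not how Amazon or any of the companies above actually build their systems; the customers and purchase histories are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories: customer -> set of items bought.
purchases = {
    "ann": {"shoes", "socks", "insoles"},
    "bob": {"shoes", "socks"},
    "cat": {"shoes", "insoles", "laces"},
    "dan": {"socks", "hat"},
}


def co_purchase_counts(histories):
    """Count how many customers bought each pair of items together."""
    counts = Counter()
    for items in histories.values():
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1  # store both directions for easy lookup
    return counts


def recommend(customer, histories, top_n=2):
    """Rank items the customer has not bought by how often they were
    bought together with items the customer already owns."""
    owned = histories[customer]
    scores = Counter()
    for (a, b), count in co_purchase_counts(histories).items():
        if a in owned and b not in owned:
            scores[b] += count
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:top_n]


print(recommend("bob", purchases))
print(recommend("dan", purchases))
```

Real recommendation systems also use ratings, browsing behavior and many other signals, but the core intuition of comparing a user’s purchases with similar purchases by other customers is the same.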
Data science is being used in healthcare to:
- monitor and manage health problems
- improve diagnostic accuracy and efficiency
- optimize hospital and clinic performance
- prevent unnecessary ER visits
The use of data science in healthcare is not only reducing healthcare costs but also saving lives.
Part 4: Not All Data-Related Jobs Fall Under Data Science
“Data scientist”, “Data analyst” and “Data engineer” all work with data, and some of their tasks overlap. However, they are not the same.
Data engineers focus on developing, testing, maintaining and optimizing database architectures and large-scale processing systems. They also work on discovering opportunities for data acquisition and develop pipelines (sets of data processing elements connected in series) that transform data into formats that data scientists and analysts can use.
Data engineers work in the backend, far away from analysis, and their importance may not be obvious. But without data engineering, there is no data for analysis. I like to call data engineers “dark knights” as they are silent guardians and watchful protectors (of data).
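The phrase “data processing elements connected in series” can be illustrated with a tiny sketch in which each stage is a plain function and the output of one stage is the input of the next. The input format (“customer,amount” text lines) and all function names are assumptions made for this example; production pipelines are built with dedicated tools and are far more involved.

```python
from functools import reduce


def parse(lines):
    """Split raw 'customer,amount' lines into (customer, amount_text) pairs."""
    rows = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        customer, _, amount = line.strip().partition(",")
        rows.append((customer, amount))
    return rows


def to_numbers(rows):
    """Convert amounts to float, skipping rows that cannot be parsed."""
    out = []
    for customer, amount in rows:
        try:
            out.append((customer, float(amount)))
        except ValueError:
            continue
    return out


def total_per_customer(rows):
    """Aggregate the amounts into one total per customer."""
    totals = {}
    for customer, amount in rows:
        totals[customer] = totals.get(customer, 0.0) + amount
    return totals


def run_pipeline(data, stages):
    """Feed the output of each stage into the next one."""
    return reduce(lambda acc, stage: stage(acc), stages, data)


raw_lines = ["a1,10.5", "a2,7", "", "a1,4.5", "a3,n/a"]
result = run_pipeline(raw_lines, [parse, to_numbers, total_per_customer])
print(result)
```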
A data analyst extracts datasets from the pipelines that data engineers build and runs statistical calculations of varying complexity to solve problems given by the business team.
Common tasks of a data analyst include:
- cleaning and organizing data
- applying statistical calculations to find the big picture and interesting trends in the data
- creating visualizations and dashboards that can be used to make business decisions
A data analyst is not expected to create statistical models and algorithms. Nor are they expected to formulate, on their own, a question that could help the business. That’s the task of a data scientist.
A data scientist is a specialist who creates statistical methods and algorithms to make predictions that answer business questions.
Let’s say an ecommerce website wants to recommend better products to its visitors. A data scientist starts by asking several questions to frame the problem. They choose one of them (for that, they need to know the business) and provide insight on it. The insight could be something like “Showing 2 fewer items can increase sales by 5% with 90% certainty”. For that, they may also use a data analyst’s findings and research.
We explained the tasks of a data scientist in Part 2 (The Data Science Process). Here is just an overview:
- Ask an interesting question to frame the business problem
- Get the data, then clean and organize it
- Explore data to find anomalies
- Model the data, create algorithms and make predictions
- Communicate and visualize results
Part 5: Skills Needed to Become a Data Scientist
The responsibilities of a data scientist may differ from company to company, and the skill set required to fulfil them differs as well. That being said, there are some skills all data scientists are likely to have.
After talking with my team, I have created a list of technical and non-technical skills that you need to master if you want to make a strong case for yourself as a data scientist. Although education is a huge factor and most data scientists (70%-75%) have a PhD or a master’s degree, we are not including it as a skill. That’s because I personally know a few people from a non-technical background who are doing an awesome job as data scientists.
Statistics and Mathematics – One of the most important and most common tasks of a data scientist is to create statistical models for data. For that, they need to have knowledge of probability, regression, multivariable calculus, linear algebra and so on.
Programming – It is the most fundamental skill a data scientist is expected to have. Some of the commonly used languages in the industry are Python, R, SAS, and SQL.
Data in general – After you figure out what business problem to solve, most of what you do is related to data. You need to figure out what data to use, how you can extract it and what each individual parameter means. You spend a lot of time wrangling and exploring data. Then you create a model based on it and visualize the insight from it.
A strong business acumen – A data scientist must have a solid understanding of the industry they are working in. The very first step of the data science process is to “ask an interesting question to frame the business problem”. Unless you know about the business, there is no way you can proceed.
Strong communication skills – The findings you get from the raw data may be too technical for other people in the industry. If the industry cannot use your findings, what value do they provide? You must be able to communicate your findings in such a way that non-technical people can understand and use them; your insight should tell the “story”.
Data scientist is a demanding job. But don’t be overwhelmed; no one is born knowing everything. If you have passion, you will find a way to learn it. The more important question is: “Is data science for you?” Well, you never know unless you try.
Part 6: What does the future hold for data science?
If you remember, we started this article with a phrase from Harvard Business Review describing data scientist as “the sexiest job of the 21st century”. In my honest opinion, it’s a very bold statement. We are not even at the end of the first quarter of this century.
There are many jobs that didn’t exist before this century, like mobile app developer, YouTube content creator and social media manager. And there are many jobs that may not exist in the future. For example, taxi and Uber drivers may not exist in the future because of self-driving cars. Automation is the primary reason why many of the jobs that exist now may not exist in the future.
There is no doubt that businesses need insights to make decisions. However, what if the data science process is automated? If it’s fully automated, there may not be a data science job. But can it be fully automated?
It’s a tough question to answer. Some experts believe that it can be automated. However, many of us believe that it can’t be fully automated anytime soon. How can a computer translate a business problem into a data problem, and a result into a business strategy? That requires human involvement. There may also be a lot of gray area between what’s wrong and what’s right, and that requires human intervention too.
While I firmly believe that some tasks of a data scientist can be automated, I also firmly believe that the job is unlikely to be fully automated. By the end of the 21st century, it probably will not be the sexiest job, but it’s likely to still be there (with some major modifications).