Data science aims to take data from some domain and produce a high-level description or model of it that can be applied practically to solve a particular challenge in that domain. How much knowledge about the domain does the data scientist need in order to do a good job? This article explores that question.
Before starting a data science project, someone must define (a) the precise domain on which to focus, (b) the particular challenge to solve, (c) the data to use, and (d) the manner in which to deliver the answer to the beneficiary. None of these four aspects is data science in itself, but each has a significant impact on both the data science and the usefulness of the entire effort. Let’s call these aspects the framework of the project.
During the data science work itself, the data must be assessed for its quality along four dimensions:
* Precision: How much uncertainty is in a value?
* Accuracy: How much deviation from reality is there?
* Representativeness: Does the dataset reflect all relevant aspects of the domain?
* Significance: Does the dataset reflect every important behavior/dynamic in the domain?
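The first two criteria can be put into numbers when a trusted reference value is available. The Python sketch below is only an illustration, not a prescribed method. It assumes repeated readings of a single quantity, uses the sample standard deviation as a stand-in for precision and the mean offset from the reference as a stand-in for accuracy. The function name, the readings and the reference value are all invented for the example.

```python
import statistics


def precision_and_accuracy(readings, reference):
    """Return (spread, bias) for repeated readings of one quantity.

    spread: sample standard deviation of the readings (a proxy for precision).
    bias:   mean of the readings minus the reference (a proxy for accuracy).
    """
    spread = statistics.stdev(readings)
    bias = statistics.mean(readings) - reference
    return spread, bias


# Hypothetical sensor readings; the true value is assumed to be 100.0.
readings = [101.9, 102.1, 102.0, 101.8, 102.2]
spread, bias = precision_and_accuracy(readings, reference=100.0)

print(f"spread (precision proxy): {spread:.3f}")
print(f"bias (accuracy proxy):    {bias:.3f}")
```

In this made-up case the readings agree closely with each other but sit about two units above the reference, so the data is fairly precise yet not accurate. Representativeness and significance are not computed here; judging them arguably requires knowing what the domain looks like beyond the dataset.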
In seeking a high-level description of the data, whether as a formulaic model or in some other form, it is practically expedient to follow existing descriptions, which may exist only in textual, experiential or social forms, i.e. in forms inaccessible to structured analysis. In real projects, data science often finds only conclusions that are trivial to domain experts, or finds no significant conclusion at all. Incorporating existing descriptions prevents the first outcome and makes the second apparent much earlier in the effort.
It thus becomes clear that domain knowledge is important in both the framework and the body of a data science project. Domain knowledge makes the project faster, cheaper and more likely to yield a useful answer.
The Elephant in the Room
The famous elephant parable illustrates this situation well. Several blind persons who have never encountered an elephant are asked to touch one and describe it. Each description is a good one given the experience of the person giving it, but all of them are far from the actual truth because each person is missing data.
This problem could have been avoided with more data or with some contextual information derived from existing elephantine descriptions. Moreover, the effort might have been better guided had it been clear what the description would be used for.
Domain Experts and Data Scientists
A domain expert has usually become an expert through both education and experience in that domain. Both imply a significant amount of time spent in the domain. As most domains in the commercial world are not freely accessible to the public, this expertise usually entails a professional career in the domain. This expert is a person who could define the framework for a data science project, since they would know what the current challenges are and how to answer them in a way that is practically useful given the state of the domain as it is today. The expert can judge what data is available and how good it is. The expert can use and apply the deliverables of a data science project in the real world. Most importantly, this person can communicate with the intended users of the project’s outcome. This communication is crucial, since many projects end up being shelved because the conclusions are either not actionable or not acted upon.
A data scientist is an expert in the analysis of data. Becoming such an expert also requires a significant amount of time spent in education and in gaining experience. Additionally, the field of data science is developing quickly, so a data scientist must spend some time keeping up with innovations. This person decides which of the many available analysis methods to use in a given project and how to parametrize them. This person is familiar with the tools of the trade (usually software) and can use them effectively. The data scientist evaluates model quality and goodness-of-fit. The data scientist can also handle communication with technical persons such as mathematicians, computer scientists and software developers.
The expectation that a single individual would be capable of both roles is unrealistic in most practical cases. The time required alone, both to acquire competence in the past and to maintain it regularly, prohibits dual expertise. In some areas it might be possible for a data scientist to learn enough about the domain to make a good model, but he or she would still need assistance in defining the challenge and communicating with users, both of which are highly non-trivial. It may also be possible for a domain expert to learn enough data science to make a reasonable model, but probably only when standardized tools are good enough for the job.
In conclusion, I strongly advocate involving two separate people. This arrangement saves a lot of time and will generally lead to a much better outcome.
Interaction between Domain Expert and Data Scientist
If there are two individuals, they can get excellent results quickly through good communication. The domain expert (DE) defines the task, and the data scientist (DS) chooses and configures the right toolset to solve it. The DE chooses the representative, significant and available data, and the DS processes it. The DS builds the tool for the task given the data, which might involve programming, and the DE uses the tool to address the challenge.
After obtaining a model, the DS can spot over-fitting, where the model has so many parameters that it effectively memorizes the data, reproducing the training data very well but generalizing poorly. The DE can spot under-fitting, where the model provides too little accuracy or precision to be useful in the real-world application. The DS can isolate the crucial data in the dataset needed to make a good model; frequently this crucial data is a small subset of all the available data. The DE then acts on the conclusions by communicating with the users of the project and making changes in the world.
The DS approaches the project in an unbiased way, looking at data just as data. The DE approaches the project with substantial bias, because the data has significant meaning to the DE, who thus comes with pre-formed hypotheses about what the model should look like. It is important to note that bias, in this context, is not necessarily a detriment to the effort.
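The over-fitting and under-fitting described above can be made concrete by comparing a model's error on the data it was trained on with its error on data it has not seen. The Python sketch below is a toy illustration under stated assumptions, not a recommended workflow. The process y = 2x + 1 with unit noise is invented, the "memorizer" (a nearest-neighbour lookup that stores every training point) stands in for a model with too many parameters, and the constant model stands in for one that is too simple. All function names are hypothetical.

```python
import random


def make_data(n, rng):
    """Noisy samples of the invented process y = 2x + 1."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


def fit_constant(xs, ys):
    """Under-fitting candidate: ignore x and always predict the mean of y."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Two-parameter least-squares straight line."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def fit_memorizer(xs, ys):
    """Over-fitting candidate: store every point, answer with the nearest one."""
    points = list(zip(xs, ys))
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]


def mse(model, xs, ys):
    """Mean squared error of a model on the given data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)  # fixed seed so the run is repeatable
x_train, y_train = make_data(200, rng)
x_test, y_test = make_data(200, rng)

results = {}
candidates = [("constant", fit_constant), ("line", fit_line), ("memorizer", fit_memorizer)]
for name, fit in candidates:
    model = fit(x_train, y_train)
    results[name] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(f"{name:10s} train MSE = {results[name][0]:7.3f}   test MSE = {results[name][1]:7.3f}")
```

The memorizer reproduces the training data exactly but does worse than the straight line on unseen data, which is the gap a DS would look for. The constant model is poor on both sets. Whether the straight line's remaining error is small enough for the real-world application is a question the numbers alone do not answer; that judgment belongs to the DE.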
How much Data Science can we do with how much Domain Knowledge?
It is instructive to think about the potential of the outcome if we combine a certain amount of domain knowledge with a certain level of data science capability. The perspective below is my personal opinion but it is probably a reasonable reflection of what is possible today.
First, for the sake of this discussion, let’s divide domain knowledge (DK) into four levels: (1) Awareness is a basic level at which we are aware of the nature of the domain, (2) Foundation is knowing what the elements in the domain do, equivalent to a theoretical education, (3) Skill is having practical experience in the domain, (4) Advanced is the level at which there is little left to learn and where one provides skill and knowledge to other people, i.e. this person is considered a domain expert.
Similarly, we can divide data science (DS) into five levels: (1) Data is where we have a table or database of numbers. We can do little more than draw diagrams with the data. (2) Information is where we derive descriptive statistics about the available data, such as correlations and clusters. (3) Knowledge is when we have some static models. (4) Understanding is when we have dynamic models. The distinction between static and dynamic models is whether or not the model incorporates the all-important variable of time. A static model makes a statement about how one part of the process affects another, while a dynamic model additionally makes statements about how the past affects the future. (5) Wisdom occurs when we have both dynamic models and pattern recognition, so that we know what will happen, when it will happen and what it is.
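The distinction between static and dynamic models can be shown with a small Python sketch. It is an illustration only: the first-order lag form of the dynamic model and the `gain` and `memory` parameters are assumptions chosen for simplicity, not something prescribed by the levels above. The static model maps the current input directly to an output, while the dynamic model also carries the previous output forward.

```python
def static_model(x, gain=2.0):
    """Static: the output depends only on the current input."""
    return gain * x


def dynamic_model(inputs, gain=2.0, memory=0.5, y0=0.0):
    """Dynamic: each output also depends on the previous output.

    y[t] = memory * y[t-1] + (1 - memory) * gain * x[t]
    """
    outputs = []
    y = y0
    for x in inputs:
        y = memory * y + (1.0 - memory) * gain * x
        outputs.append(y)
    return outputs


# A step input: off for one time step, then switched on and left on.
inputs = [0.0, 1.0, 1.0, 1.0, 1.0]

static_out = [static_model(x) for x in inputs]
dynamic_out = dynamic_model(inputs)

print("static: ", static_out)   # [0.0, 2.0, 2.0, 2.0, 2.0]
print("dynamic:", dynamic_out)  # [0.0, 1.0, 1.5, 1.75, 1.875]
```

Both models agree on where the output ends up, but only the dynamic one says how long it takes to get there, which is the statement about how the past affects the future.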
In the diagram, I present several technologies that exist today ordered by the level of data science they represent and the amount of domain knowledge necessary to create them. There are no technologies in the upper half of the diagram because one cannot make such advanced data science with so little domain knowledge.
In conclusion, data science needs domain knowledge. Since it is unreasonable to expect any one person to fulfill both roles, it is necessary to look for a team effort.
My thanks go to Saeed Mubarak, the chair of the Digital Energy Technical Section (DETS) of the Society of Petroleum Engineers (SPE), for starting a lively discussion on the DETS LinkedIn page. That discussion prompted me to write this article, which is based on the responses there and on my own views. Thank you also to the many people who participated.