Data analysis is not a conclusion for data scientists

 October 12, 2020

Yoshiharu Maeno
Professor, School of Interdisciplinary Mathematical Sciences,
Meiji University
 

Data science has been attracting a lot of attention in recent years because solving problems in many fields requires the effective use of data. However, simply analyzing the data will not solve problems as expected. So, what do we need?

In the 19th century, there was a physician who brought a cholera epidemic under control

In March 2020, the WHO (World Health Organization) declared the outbreak of the novel coronavirus disease (COVID-19) a pandemic and announced its "concern about the lack of measures". The situation is serious, but it is not the first time that humans have been attacked by an unknown virus or pathogen.

There was a global pandemic of cholera in 1852. We today can learn a great deal from the work of a British physician who brought an epidemic under control at that time.

At that time, a cholera epidemic also broke out in London, England. Vibrio cholerae was not yet known, and physicians believed that the disease, which is accompanied by severe diarrhea, spread through the air, and they treated patients accordingly.

In 1854, however, Dr. John Snow grew skeptical of this theory. Starting from the data collected by the city government, he visited residents himself and carefully examined how they lived: what the people who died of the disease had in common and how they differed, and what the houses where no one died had in common and how they differed.

As a result, he came up with the theory that the disease was not airborne but waterborne.

He mapped the houses of the people who had died and found a common well near those houses.

Later, it was revealed that drinking water contaminated by sewage was the cause of the disease, and the cholera epidemic was brought under control by closing the wells.
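
The mapping step described above can be illustrated with a minimal sketch. The coordinates and well names below are entirely hypothetical and are not Snow's historical data, and straight-line distance is an assumption made for simplicity. The sketch only shows the idea of assigning each recorded death to its nearest well and counting the results.

```python
from collections import Counter
from math import hypot

# Hypothetical well positions (x, y) on an arbitrary grid; not historical data.
wells = {"well_A": (0.0, 0.0), "well_B": (5.0, 1.0), "well_C": (2.0, 6.0)}

# Hypothetical locations of houses where a death was recorded.
deaths = [(0.5, 0.2), (1.0, -0.5), (-0.3, 0.8), (0.2, 1.1), (4.6, 1.2), (2.1, 5.5)]


def nearest_well(house, wells):
    """Return the name of the well closest to the house (straight-line distance)."""
    return min(
        wells,
        key=lambda name: hypot(house[0] - wells[name][0], house[1] - wells[name][1]),
    )


def deaths_per_well(deaths, wells):
    """Count how many recorded deaths lie nearest to each well."""
    return Counter(nearest_well(house, wells) for house in deaths)


counts = deaths_per_well(deaths, wells)
for name, n in counts.most_common():
    print(name, n)
```

A count like this only points to a candidate. As the story shows, it took visits to the residents and knowledge of medicine to turn the pattern into a hypothesis about water.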

John Snow is now known as the father of epidemiology.

Data tells us a lot. However, data also reduces each death to a number. No matter how much we aggregate the data and calculate, the result remains a statistic and may not reflect reality.

Guided by the data, John Snow went into the field and experienced the reality there. That is where water gave him a clue. He then formulated a hypothesis and examined the data further to find a way to bring the epidemic under control.

The methods John Snow used in the 19th century are now outdated. However, his process of searching for clues to a problem, formulating hypotheses from those clues, and thinking more deeply applies equally to modern data science.

Some results cannot be obtained by simply analyzing the data

Over the past 10 years, advances in ICT have led to enormous amounts of high-precision data and high-speed computational algorithms. These are the driving forces behind the rise of data science in all aspects of business.

For example, data and algorithms support product recommendations in online shopping and sales forecasts for convenience stores.

However, solving socio-economic problems is a little different from these business applications. Such problems cannot always be neatly categorized and solved.

New, highly specific problems cannot be solved by algorithms that apply common structures learned from past data.

So, what should be done? We need not only the ability to make full use of data and algorithms but also a data science that understands reality deeply and precisely.

Suppose John Snow had stopped at concluding, from the data on deaths in the city, that the number of deaths would keep increasing, or that people should be restricted from going out because the disease was airborne, as was believed at the time. He would then not have discovered that the well water was the source of infection for an entirely new disease that was right in front of him.

Of course, that would not have led to infection control, that is, to a solution to the problem.

Why was he able to do this? As a physician, he had medical knowledge, substantial experience in examining patients, and insights gained through diagnosis, and he actively pursued the reality.

This is precisely the skill set required of data scientists today.

Rather than assuming that a single analysis of the data has produced the result, repeatedly examining and discussing how and why helps us formulate hypotheses, perform further data analysis, and collect new data to examine.

Such activities are the starting point for approaching solutions to new problems. In other words, what matters is asking "Why" after analyzing the data, and having the ideas and concepts to formulate a hypothesis even when it seems to contradict accepted science.

Things necessary for generating ideas from data analysis

So, how can we generate ideas and concepts to solve problems?

First, we need knowledge. John Snow, who was involved in medical issues, had medical knowledge.

Likewise, if we lack knowledge of the fields we are involved in, such as social and economic issues, and are unwilling to acquire it, we will not come up with any ideas or concepts.

Of course, in a complex modern society, it is difficult for one individual to have all the knowledge of every field needed to solve problems while also making full use of data analysis and algorithms. Cooperation with experts in those fields is therefore necessary, and communication skills are important as well.

For example, we need at least enough knowledge to understand the experts' terminology, and it is also important to learn that terminology and to be able to express our own thoughts accurately.

In other words, to be a good data scientist, it is important to be interested in or concerned about society, the economy, and people. That interest leads to our own actions and experiences, which lead to a better understanding of the surrounding environment, and then to real knowledge and expertise.

What we acquire in this way may be what is called culture, comprehensive capability, or resourcefulness. It is the source of inspiration for ideas and concepts.

At present, using new algorithms and being familiar with analytical methods are regarded as the skills of a data scientist, but over time AI will be able to perform these tasks automatically. In other words, they will become tasks that humans do not have to do.

This is exactly why there will be opportunities for data scientists to play active roles after that point.

One more thing can be learned from John Snow's work: the attitude of learning by trial and error, freely adjusting factors such as the scale at which data is analyzed.

He moved freely beyond a single point of view, at times scrutinizing the information on each deceased person and at other times omitting the details and focusing on the positional relationship between the well and the deceased.

This means, for example, that if we look at data and information from a distance, we can see the whole picture without the details, and if we look at them more closely, we can see details that figures alone do not show.

Moving flexibly between these viewpoints helps new ideas emerge and reveals relationships between things that seemed to be unrelated.

In fact, computer software that can change the scale at which we view data already exists. Since such tools are available, how effectively they are used depends on the culture and comprehensive capability of the user.
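
As a minimal sketch of what changing the scale can mean in practice, the example below counts the same set of points on a coarse grid and on a fine grid. The points and cell sizes are hypothetical and chosen only for illustration, and square grid binning is just one assumed way of changing the scale.

```python
from collections import Counter

# Hypothetical point records (x, y); not real data.
points = [(0.5, 0.2), (1.0, 0.5), (0.3, 0.8), (0.2, 1.1), (4.6, 1.2), (2.1, 5.5)]


def bin_counts(points, cell_size):
    """Count points per square grid cell whose side length is cell_size."""
    return Counter((int(x // cell_size), int(y // cell_size)) for x, y in points)


# Viewed from a distance: large cells show where the records concentrate.
coarse = bin_counts(points, cell_size=4.0)

# Viewed up close: small cells separate records that the coarse view merged.
fine = bin_counts(points, cell_size=1.0)

print("coarse:", dict(coarse))
print("fine:", dict(fine))
```

Neither view is the correct one. Deciding which scale to look at, and when to switch, remains the user's judgment.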

Mapping the locations of deceased people must have existed as an analytical method before John Snow's time. However, John Snow was the first to use a map to work out the positional relationship between the deceased and a well.

The new coronavirus pandemic has killed more than 100,000 people in the United States. This is a tragedy. However, can we imagine from this figure how much misfortune there has been in American society?

On May 24, The New York Times published an article that briefly introduced individual people who had passed away, carefully examining their lives, how they contributed to society, and how they affected the people around them.

What has society lost? The article's attempt to show us a reality that cannot be seen from the figure of 100,000 alone seems, like John Snow's work 170 years ago, to reflect what data science is all about.



* The information contained herein is current as of September 2020.
* The contents of articles on Meiji.net are based on the personal ideas and opinions of the author and do not indicate the official opinion of Meiji University.
* I work to achieve SDGs related to the educational and research themes that I am currently engaged in.


Yoshiharu Maeno
Professor, School of Interdisciplinary Mathematical Sciences, Meiji University
 
Research fields:
Complex system theory, stochastic process theory

Research themes:
Theory and data analysis for understanding complex interrelationships in a socioeconomic system
[Keywords] complex system, reaction-diffusion system, stochastic process, statistical inference, pandemic, financial crisis, fake news

Main books and papers:
◆ "Detecting a trend change in cross-border epidemic transmission" (Physica A: Statistical Mechanics and its Applications 457 (2016), pp. 73-81)
◆ "Impact of credit default swaps on financial contagion" (co-authored, 2014 IEEE Conference on Computational Intelligence for Financial Engineering & Economics, London)
◆ "Optimal portfolio for a robust financial system" (co-authored, 2013 IEEE Conference on Computational Intelligence for Financial Engineering & Economics, Singapore)
◆ "Discovery of a missing disease spreader" (Physica A: Statistical Mechanics and its Applications 390 (2011), pp. 3412-3426)

Copyright © 2015 Meiji University. All Rights Reserved.