Technologies and Data Science

From machine learning to data science

The use of machine learning technologies has become a decisive factor for competitiveness in many areas of science and industry. Data science also includes data management and close cooperation with the fields that want to use data.

The ability to learn has been considered a basic cognitive ability in Artificial Intelligence from the very beginning. The triumphant success of machine learning was sparked by the first self-learning search engine: Google. The learning aptitude of autonomous vehicles became a focus of attention in 2005 when Stanford University’s Stanley car won the DARPA Grand Challenge. In 2016, the defeat of a human player at the game of Go by Google’s AI system roused considerable public interest. Somewhat less spectacular – but nevertheless influential in our private lives and the world of work – are machine learning in the sciences and self-learning systems in marketing, production, transport and logistics.

The evolution from machine learning to data science took place in three steps. First, carefully compiled data sets were analysed so that knowledge-based systems could acquire knowledge automatically. The learned rules were easy to understand and could be assessed by experts. The results include medical applications for evidence-based treatment and risk prediction.

The second step, the era of data mining, involved the analysis of available databases. Focus shifted to processing the data, with automatic optimization of the characteristics (features) and of the selected data sample. Analysis and data management were closely linked. Successful applications include customer management, direct marketing and recommender systems. Sensor data from manufacturing are used for anomaly detection and quality control. Large medical genetics data sets contain hundreds of thousands of characteristics for just a few cases (patients), yet can be used to predict the success of certain treatments or to justify them.

In the third step, the age of Big Data, terabyte-scale data streams and often widely dispersed sensor data from the Internet of Things are the source of learning, and the learned models are applied in real time. Data is stored, aggregated and distributed in a variety of architectures. Data science encompasses the entire process of management, curation, cleansing and analysis of data as well as storage, validation and application of the learned models.

Many methods have already been tested, but a few research questions remain open. The large volumes of data on the one hand and the small data logging devices on the other mean that machine learning and data management face certain constraints: storage memory, energy consumption and even computing capacity are limited. The interaction between modern hardware and new storage and analysis algorithms is an exciting area of research.
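To make the memory constraint concrete, here is a minimal sketch of how a small logging device might flag unusual sensor readings without storing the data stream. It is an illustration only, not a method named in this article: it keeps a running mean and variance using Welford's online algorithm, so memory use stays constant however many readings arrive. The class name, the threshold of three standard deviations and the warm-up length are assumptions chosen for the example.

```python
import math


class StreamingAnomalyDetector:
    """Flags readings far from the running mean, using constant memory.

    Hypothetical example class: only three numbers (count, mean, m2)
    are stored, regardless of how long the stream is.
    """

    def __init__(self, threshold=3.0, warmup=10):
        self.threshold = threshold  # allowed deviation, in standard deviations (assumed value)
        self.warmup = warmup        # readings to observe before flagging anything (assumed value)
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0               # running sum of squared deviations from the mean

    def update(self, value):
        """Return True if the reading looks anomalous, then fold it into the statistics."""
        is_anomaly = False
        if self.count >= self.warmup:
            std = math.sqrt(self.m2 / (self.count - 1))  # sample standard deviation so far
            if std > 0 and abs(value - self.mean) > self.threshold * std:
                is_anomaly = True

        # Welford's online update of mean and squared deviations.
        # Note: for simplicity, flagged readings are also folded in.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)
        return is_anomaly
```

A device would call `update(reading)` once per measurement and react whenever it returns `True`. A real deployment would probably need more than this, for example handling of drift or of readings that should not influence the statistics, which the sketch deliberately leaves out.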

Data science involves interaction with users in disciplines such as physics, biology or medicine. They must be able to carry out explorative and interactive data analysis without any knowledge of machine learning. The challenge consists of using the data to create models which can be readily understood and validated and linked with existing knowledge – and to do so for the benefit of society.

These issues are the focus of Working Group 1 headed by Ms Katharina Morik (TU Dortmund) and Mr Volker Markl (TU Berlin) of the Plattform Lernende Systeme.