Prof. Dr. Volker Markl on huge data volumes and how to cope with them
"A software project that was a vision which was then implemented together with colleagues and PhD students; a project that you hand over from basic research into the open source community and to real users; it's like watching your own child grow up". What others dream of has become reality for Prof. Dr. Volker Markl, Head of the Department of Database Systems and Information Management at the TU Berlin. Over the past ten years, the software solution "Apache Flink", for which a team of students and scientists developed the first prototype under his leadership as part of the Stratosphere Project at the TU Berlin, has become a leading system worldwide for processing huge data streams - "Big Data". In the meantime, large international companies rely on the highly flexible, scalable and expandable Stream Processing Framework, whose prototype has already won the Humboldt Innovation Award and in which a community of more than 21,000 members and more than 400 code contributors are involved. For Prof. Dr. Markl, the work has only just begun: as head of the research group "Intelligent Analysis of Mass Data - Smart Data" at the German Research Center for Artificial Intelligence (DFKI), Director of the Berlin Big Data Center and Co-Director of the Federal Center for Machine Learning (BZML), he would like to provide the experts of today and tomorrow with additional tools to enable them to use data efficiently.
Prof. Dr. Markl, you call data the "production factor of the 21st century". What do you mean by that?
Many people compare data with oil. Just as new products such as nylon or gasoline were created from oil, we can "refine" new knowledge and new things from data if we develop the right programming tools. It is conceivable, for example, that we could succeed in transforming the thoughts of a person who cannot speak into spoken language, just as we are already able to control a computer with thoughts today. Or take the self-driving car, also a product of digitisation. It will fundamentally change the German car industry, and the sector must be careful not to fall behind. After all, Germany's prosperity depends to a very large extent on this industry, so digitisation also secures our prosperity. However, I do not want to talk so much about oil; I rather see data as a production factor, comparable to a breeding ground on which new things grow. Like humus, data are not destroyed when something new is created from them, and they have to be maintained, cleaned and integrated so that valuable applications can be built on them.
Provided the data are used properly to generate all these innovations. Is this easier said than done?
That's right; the data generated today are no longer comparable to those of 30 or 40 years ago. We computer scientists speak of the three big Vs that characterize data today: "Volume", "Velocity" and "Variety". The amounts of data are huge, they are produced at tremendous speed and have to be evaluated in real time, and they are extremely heterogeneous. A modern car is equipped with around 200 sensors; 1.3 gigabytes of sensor data are sent from the vehicle every hour, and a large German car manufacturer receives 30 gigabytes of data from its cars every day. This is a real data explosion. At the same time, data analyses in the fields of statistics and machine learning are becoming increasingly complex. The data scientist, i.e. the expert who has to master these data, needs very extensive knowledge, which is why I like to speak of the "goose that lays golden eggs". There aren't many of them. The extensive training of data scientists at our universities must therefore be accompanied by research into tools that facilitate their work.
As head of the BMBF-funded competence centre "Berlin Big Data Center" (BBDC), you are responsible, among other things, for the development of tools that simplify the processing and handling of such huge amounts of data. Your research projects have resulted in many tools: Emma as a programming interface, Myriad for data generation and PEEL for performance analysis, to name but a few. Apache Flink, which grew out of a basic research project at TU Berlin ten years ago, is probably the best known worldwide. What is so special about this solution?
Apache Flink can be used when the computing power of a single computer is not sufficient to analyse data streams. It works on the principle of divide and conquer: the data set is distributed across any number of computers in a cluster, and each individual computer then only has to analyse a part of the total amount. Apache Flink also coordinates the subsequent work, because if I want a question answered with big data, it must be ensured that Flink knows at all times which computer holds which part of the data and how the partial results are combined into the final answer.
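To make the divide-and-conquer idea concrete, here is a minimal sketch (not from the interview) of a streaming word count written against Flink's DataStream API in Java. The socket source on localhost:9999 and the parallelism of four are illustrative assumptions: keyBy partitions the stream by key across the parallel workers, and each worker then aggregates only its own partition.

    import org.apache.flink.api.common.functions.FlatMapFunction;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.util.Collector;

    public class StreamingWordCount {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setParallelism(4); // illustrative: spread the work over four parallel workers

            // Assumed source: lines of text arriving on a local socket (e.g. started with `nc -lk 9999`)
            DataStream<String> lines = env.socketTextStream("localhost", 9999);

            lines
                .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                        for (String word : line.toLowerCase().split("\\W+")) {
                            if (!word.isEmpty()) {
                                out.collect(Tuple2.of(word, 1)); // emit (word, 1) for each word
                            }
                        }
                    }
                })
                .keyBy(t -> t.f0) // "divide": partition the stream by word across the workers
                .sum(1)           // "conquer": each worker sums the counts for its own partition
                .print();         // emit running counts as results arrive

            env.execute("Streaming WordCount");
        }
    }

Because all records with the same key end up on the same worker, each partial count is complete for its own keys, and the Flink runtime keeps track of which worker holds which partition, which is exactly the coordination role described above.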
The Berlin Center for Machine Learning (BZML), a second centre of excellence within the scope of AI research, has now joined the BBDC at TU Berlin. What role does Berlin play here?
In Berlin, outstanding research in the fields of data science, big data, data management, data analysis and machine learning, as well as AI as a whole, is being carried out at various scientific institutions. This is one of Berlin's strengths. The good cooperation between the BBDC and the BZML is bringing about a close integration of the previously separate areas of data management and machine learning. Continued funding of basic research in these two areas remains extremely important, as science has to keep pushing the boundaries of what is technologically possible. In particular, businesses today gain immense competitive advantages when modern methods of data analysis and machine learning can be applied to big data. Big data and machine learning are thus the technological cornerstones of data science and of applications of modern artificial intelligence. Berlin holds a leading international position in this field, backed up by the BBDC and the BZML in basic research, by the related research in mathematics in the Excellence Cluster Math+, and by basic AI research in the Excellence Cluster Science of Intelligence. This is complemented by institutes working on the social effects of AI, such as the Weizenbaum Institute, by communication platforms such as the Smart Data Forum, by application-oriented research in the ECDF, and by technology transfer through institutes such as the DFKI, the German Research Center for Artificial Intelligence, and Fraunhofer, to name just a few. There are also leading companies with a strong focus on AI, such as Amazon, SAP, Google and Siemens, as well as an exciting start-up scene. Overall, Berlin offers a unique ecosystem for research and technology transfer in artificial intelligence, with cutting-edge research in data science, especially in the essential foundations of data management and machine learning.
These strengths are to be expanded further, as Germany's capital aims to become a digital hotspot. What do you think is needed for this?
I will answer with a look back. In the late 1990s, two young men in Silicon Valley developed an algorithm that ultimately became the basis for Google. They did not yet know what business idea could be made from it, but they got money for it, a very large amount of money. What I want to say is that, on the one hand, we must become more technology-driven and more willing to take risks. On the other hand, we should take the bottom-up mentality of Silicon Valley as an example: we should give massive support to people who have a new idea for a technology and help them develop business ideas from it. Inventors are often not businesspeople, so we need technology-oriented, risk-taking and visionary business angels who understand technologies and their potential. Although there are more of them in Berlin than in many other places in Europe, there are unfortunately still too few. In many graduates of economics or similar fields, I miss the technical depth and vision that I encountered in Silicon Valley. At the same time, we need to massively expand cutting-edge research at universities; in the future-oriented areas of data management and machine learning, there are not yet enough chairs and research groups in Berlin in proportion to the needs of business, science and society. Only in this way will we be able to train enough experts and generate the innovations needed to translate the opportunities offered by AI into economic, scientific and social success. And we need sustainable funding to keep research on big data and machine learning in Berlin permanently at world level.