Session 1. Machine learning - Chair: Bart Buelens
Machine learning algorithms detect patterns in data and use these to predict missing data. Data can be missing because it was not collected or observed, or simply because the prediction is about the future. Machine learning algorithms do not aim to model underlying real-world systems explicitly; rather, they employ computational techniques to achieve optimal predictive accuracy. Consequently, they are often described as black-box systems that lack transparency. This session addresses issues related to decision making when machine learning is involved.
Bart Buelens, Senior Data Scientist, Flemish Institute for Technological Research (VITO), Belgium
A machine that learns by itself is often seen as a form of artificial intelligence. Nowadays, applications of machine learning are widespread: from recommender systems to credit card fraud detection and navigation apps. A bird's eye view of the field of machine learning is given, with an emphasis on applications where algorithmic results are used for decision making. Machine learning results are considered in terms of bias and variance, highlighting the importance of appropriate uncertainty quantification. The talk is illustrated throughout with successful as well as failed examples of machine learning for decision making.
Joep Burger, Team Methodology Heerlen, Statistics Netherlands
The use of machine learning in official statistics: two case studies
Driven by the increasing availability of big and complex data such as images and text, machine learning (ML) is becoming a popular addition to the statistician's toolbox. Two case studies on the use of ML in official statistics will be presented. In the first case study, we try to predict a person's propensity to move from their digital footprint in two decades of register data, comparing logistic regression with a random forest. In the second case study, we explore the possibility of deriving statistical information such as poverty from aerial or satellite images, using a convolutional neural network.
Chang Sun, Ph.D. Candidate at Maastricht University Institute of Data Science, Netherlands
Use a secure environment to analyze personal data from multiple sources in a privacy-preserving manner.
With current developments in data science, such as machine learning and data mining technologies, an increasing amount of data is collected and analyzed separately by a variety of data parties. However, training a machine learning model on a single data source has a major drawback: it may lead to incomplete or incorrect knowledge discovery, which can confuse or mislead society. To tackle this problem, Chang Sun and her colleagues developed a secure infrastructure to analyze personal data from multiple sources in a privacy-preserving manner. As a use case, they applied the infrastructure at CBS and De Maastricht Studie to study how socio-economic factors affect people with diabetes. This infrastructure enables statistics offices to discover more potential social issues and make greater use of data by collaborating with other data sources.
Session 2. Natural Language Processing - Chair: Piet Daas
Natural language processing investigates how large amounts of natural language data can be processed and analyzed by computers. Some examples: data from social media are used to measure the number of messages and the sentiment with regard to certain subjects. The usefulness of this sentiment analysis has, for example, been demonstrated in the context of consumer confidence. Web scraping, where data is extracted from websites, is used in several research areas, including job vacancy statistics.
Piet Daas, senior methodologist and CBS big data specialist, professor by special appointment of Big Data in Official Statistics at Eindhoven University of Technology, Netherlands
Natural Language Processing
Converting text to a form that can be interpreted by a machine has challenged researchers in various disciplines since this field of research began in the 1950s. In recent years, more and more applications have become available that many of us use on a daily basis, such as spam filters, search engines, and Siri/Alexa/Google Assistant. In this presentation, the focus is on extracting information from text. First, an overview is given of the ways in which this can be achieved. Subsequently, a number of examples are given to show how text can be (successfully) used in an official statistical context.
Martina Hahn, Head of Methodology and Innovation in Official Statistics, Eurostat
The Web Intelligence Hub – use and analysis of web scraped data for different statistical domains.
In the context of implementing the Trusted Smart Statistics paradigm, Eurostat, together with Cedefop, the European Agency for vocational training, and the ESS, will develop a Web Intelligence Hub (WIH). The WIH aims to provide the ESS with the key building blocks for harvesting information from the web. The Hub will implement and maintain a portfolio of text processing and analytic services at various levels (e.g. text parsing, mining, classification, interpretation). It will build on developments of the web scraping projects of the ESSnets Big Data and on the Cedefop project, which uses online job advertisements to extract information on skills demand in Europe. Activities will initially focus on establishing a modular system for scraping and analysing online job advertisements and will gradually be extended to other information domains, such as information on enterprises or information relevant for ICT statistics.
Paul Keuren, Statistical Researcher / Software Engineer at Centraal Bureau voor de Statistiek, CBS
Adjusting text-analysis to source motive
Text sources/suppliers obtain textual data from multiple sources. For this presentation, two separate sources (Chamber of Commerce data and web-scraped data) are considered and compared. The Chamber of Commerce data is investigated further to demonstrate what quick wins it can deliver.
Session 3. Images and visualisation - Chair: Edwin de Jonge
This session involves two aspects of the use of images: the use of images as a data source, and the visualisation of data for a wide audience. The basic data for data science can consist of images such as satellite images or images from Google Street View, which poses new challenges. Furthermore, there is the rapidly growing field of data visualisation, where abstract information is being made available more efficiently than ever before. This session will give an overview of some projects where images are used as data, as well as in data visualisation and dashboard displays that translate abstract data into user-friendly information.
Edwin de Jonge, statistical consultant, methodologist at Statistics Netherlands
Images and visualization
Chris Bonham, Senior Data Scientist at the Data Science Campus, Office for National Statistics, UK
Remote sensing and machine learning to identify vegetation in urban residential gardens
Given their environmental and emotional benefits, identifying and understanding the features of urban green spaces is becoming increasingly important. Current approaches often assume residential gardens are almost exclusively covered by natural vegetation and do not take into account urban areas such as steps, patios and paths. The Data Science Campus and Ordnance Survey (OS) have used remote sensing and machine learning techniques to improve upon the current approach used within the Office for National Statistics to identify the proportion of vegetation in UK residential gardens. A test library of labelled images was created by randomly sampling 100 images from Bristol and Cardiff and classifying them independently to provide a ground truth. Application of several algorithms to the labelled data indicated sensitivity to the presence of shadows. Consequently, a neural network classifier was developed specifically to be insensitive to the effects of shadow. Results support the conclusion that a neural network can classify vegetation more accurately and is less susceptible to the effect of shadows than the other algorithms. Additional information can be found at: https://datasciencecampus.ons.gov.uk/projects/green-spaces-in-residential-gardens/
Karim Douïeb, data scientist and data visualisation designer, co-founder of Jetpack.AI
Why are official statistics key to understanding social issues?
This talk will illustrate how openly available socio-demographic data about Belgium have been crucial to the visual exploration carried out in two studies. The first is intended to raise awareness of the immigration situation in Brussels and of the challenges lying ahead. The second is about a potential health crisis related to the consumption of opioids in Belgium.
Session 4. Preconditions for effective data science deployment - Chair: Johan Van der Valk and Sofie De Broe
The usefulness of data science for supporting decisions doesn't just depend on statistical and technical standards. Several ethical and organisational issues determine important preconditions for delivering good data science. First of all, there are significant debates around the ethics and privacy dimensions of the growing data science field, which must be taken into account when techniques are applied to real-life data. Secondly, developing data science methods for official statistics requires active collaboration between NSIs and international institutions such as the UN and Eurostat, given the global nature of many of the new data sources and policy decisions. Thirdly, providers of Big Data are often private companies, which raises the question of how to organise a sustainable relationship with them. Finally, partnerships are set up with universities and companies to optimise the use of these promising techniques for the development, production and quality improvement of official statistics. This is also a challenge.
Johan Van der Valk, coordinator cross-border statistics, and Sofie De Broe, Scientific Director of the Center for Big Data Statistics, both CBS, Statistics Netherlands
Preconditions for effective "data science" deployment.
This presentation elaborates on the non-methodological challenges for successful application of big data in official statistics. Applying big data in official statistics requires specific conditions that differ from those for the production of traditional statistics. Important elements are: questioning existing statistics, stimulating co-creation with external and international partners, and allowing the development and implementation of new statistical products. To achieve sustainable results, collaboration with the outside world of other data producers, data providers and data users is essential. This requires a change of culture and attitude and a specific data ecosystem. We will present some examples to illustrate our views.
Jasmine Grimsley, Senior Data Scientist, Data Science Campus, Office for National Statistics, UK
Ethical maintenance of AI systems.
With the increasing adoption of AI in all aspects of our lives, it has become critically important to be confident that AI systems perform in an ethical way over time. There are ethical frameworks in place to ensure a prototype is fair, unbiased and effective. This discussion will explore how AI systems have the potential to depart from an ethical ideal over time. This identifies a need for maintenance programs to ensure that, over time, AI tools perform in a safe, reliable, timely, and trustworthy manner. Issues explored will include accurate and unbiased performance evaluation and effective maintenance of systems as their working environment evolves. This evolution may include changing societal values, changing populations, new and unforeseen kinds of data, and policy and other changes.
Marc Ponsen, PhD in the area of artificial intelligence and data scientist, and Bob van de Berg, product developer, both CBS, Statistics Netherlands
Big data ontology enrichment for cross-border job placements and labour market statistics (CBS)
CBS, VDAB and UWV work together to create a cross-border ontology for the labour market, based on the existing 'Competent' ontology developed by VDAB. This ontology will be enriched with cross-border skills and occupations derived from millions of Dutch and Flemish vacancy texts. It will be the basis for creating new statistics on cross-border demand and supply in the Flemish and Dutch labour markets.