Data analytics in the era of large-scale machine learning

Date

Tuesday 23 - Wednesday 24 May 2023

Location - The Cyprus Institute and via Zoom

This event is part of the EuroCC2 project and the National Competence Center activities, in collaboration with the Greek National Competence Center.

Pre-requisites

Attendees should be familiar with at least one programming language, such as C/C++, Fortran, Python, or R.

Requirements

All attendees will need their own desktop or laptop with the following software installed:

Web browser - e.g. Firefox or Chrome
PDF viewer - e.g. Firefox, Adobe Acrobat
SSH client - e.g. Terminal on macOS or Linux, PuTTY on Windows

Participation and Registration

This will be a hybrid event: participants can attend on-site or via Zoom.

Feedback form

Day 1 feedback form: please complete it at the following link:
https://forms.gle/5WvJZryAqhSEt7CE9

Day 2 feedback form: please complete it at the following link:
https://forms.gle/9kJmbSDvMGirLTuv9

Git Repository

The Git repository with all material from the training event, including presentations and code, will soon be made available.

Agenda

Videos of all sessions listed below and the presented material can be found on the following website: https://eurocc.cyi.ac.cy/data-analytics-in-the-era-of-large-scale-machine-learning/

Tuesday 23 May 2023

09:45 - 10:00: Introduction.
10:00 - 11:15: Large-scale generative models for language and vision (including LLMs): How they work – and what we still do not know about them. Speakers: Professor Constantine Dovrolis and Dr. Mihalis Nicolaou.
11:15 - 11:30: Break
11:30 - 13:00: PyTorch Neural Networks: Running on CPUs and GPUs. Speaker: Dr. Pantelis Georgiades.
13:00 - 13:30: Lunch Break
13:30 - 14:30: Research Seminar: “Tensorization and uncertainty quantification in machine learning”. Speaker: Dr. Yinchong Yang, Siemens AG.
14:30 - 15:00: Break
15:00 - 16:30: Parallel computing techniques for scaling hyperparameter tuning of Gradient Boosted Trees and Deep Learning. Speaker: Dr. Nikos Bakas.

Wednesday 24 May 2023

11:00 - 12:30: Efficient Data Cleaning and Pre-processing Techniques for Robust Machine Learning. Speaker: Dr. Charalambos Chrysostomou.
12:30 - 13:15: Lunch Break
13:15 - 14:45: GPU CUDA Programming - Session 1. Speaker: Dr. Giannis Koutsou.
14:45 - 15:00: Break
15:00 - 16:30: GPU CUDA Programming - Session 2. Speaker: Dr. Giannis Koutsou.

Large-scale generative models for language and vision (including LLMs): How they work – and what we still do not know about them

Speakers: Professor Constantine Dovrolis and Dr. Mihalis Nicolaou

Description: This research talk provides a comprehensive overview of large-scale generative models in machine learning, such as generative adversarial networks (GANs), transformers, and large language models (LLMs), focusing on key technologies such as ChatGPT, BERT, and Stable Diffusion. We will discuss the mathematical underpinnings of these models, including attention mechanisms, self-attention, and positional encoding. An examination of the deep neural network architectures used, such as the multi-layered transformer architecture, will offer insight into their impact on natural language processing and other fields.

The presentation will also cover the training and fine-tuning processes of these advanced models, highlighting how they enable a wide range of applications across diverse domains. Furthermore, we will address the limitations and open questions surrounding these technologies, including their interpretability, potential biases, energy consumption, and the development of more efficient and robust models. By offering a holistic understanding of the current state of transformers and large language models, this talk aims to encourage further research and innovation in the field.
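As a rough illustration of the attention mechanism named in the description, the sketch below computes scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V, using only the Python standard library. It is not taken from the talk: the function names and the toy token vectors are assumptions made for illustration, and the learned projection matrices that a real transformer applies to obtain queries, keys, and values are omitted (the raw token vectors are used directly).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for softmax(Q K^T / sqrt(d_k)) V.

    queries, keys and values are lists of float vectors; every key has
    the same length d_k as every query.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        row = softmax(scores)
        weights.append(row)
        # Each output is a weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(row, values))
            for j in range(len(values[0]))
        ])
    return outputs, weights


# Self-attention: queries, keys and values all come from the same
# (hypothetical) toy sequence of three 2-dimensional token vectors.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(tokens, tokens, tokens)
```

Each row of `weights` sums to one, so every output vector is a convex combination of the value vectors; this is the sense in which each token "attends" to the others.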

PyTorch Neural Networks: Running on CPUs and GPUs

Speaker: Dr. Pantelis Georgiades

Prerequisites: Trainees should be comfortable with the Python programming language.

Description: In this session we will present a simple introduction to neural networks and work through a classification problem using the PyTorch framework in Python, running on both CPUs and GPUs. PyTorch is a deep learning framework developed by Meta that offers a fast and flexible set of tools to develop and deploy deep neural network models on both CPUs and GPUs. The example will be presented in an interactive Jupyter Notebook, and the trainees will have the opportunity to become familiar with the workflow and implementation of a Data Science project using state-of-the-art deep learning libraries.

Research Seminar: Tensorization and uncertainty quantification in machine learning

Speaker: Dr. Yinchong Yang, Siemens AG

Biography: Yinchong Yang holds a master's degree in statistics and a PhD in computer science from the Ludwig Maximilian University of Munich. As a senior key expert of robust AI at Siemens, he conducts research in the quantification and certification of robustness and uncertainty for industrial-grade AI. He is also interested in tensor decomposition methods in machine learning, such as tensorized neural networks and relational learning from tensor data.

Abstract

Modern deep neural networks, which consist of large weight matrices, are often prone to over-parameterization and can be computationally expensive to train and store. Tensorizing and decomposing these weight matrices has emerged as an effective solution to this problem, since it allows neural networks to represent large matrices with significantly fewer parameters. This technology has been applied in various neural network architectures and use cases, making it an interesting topic of research. This talk will include a brief introduction to the basic idea, some hands-on tutorials on how to implement such models with very little code, and an overview of related publications.
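To give a feel for the parameter savings mentioned above, the following sketch counts the parameters of a dense weight matrix and of the same matrix stored in tensor-train (TT) format, where core k has shape (r_k, m_k, n_k, r_{k+1}). The abstract does not say which decomposition the talk uses, so the TT format is an assumption here, as are the function names, the mode sizes, and the ranks.

```python
from math import prod


def dense_params(row_modes, col_modes):
    """Number of entries in the full (prod(row_modes) x prod(col_modes)) matrix."""
    return prod(row_modes) * prod(col_modes)


def tt_matrix_params(row_modes, col_modes, ranks):
    """Number of parameters of a tensor-train (TT) matrix.

    Core k has shape (ranks[k], row_modes[k], col_modes[k], ranks[k + 1]);
    the two boundary ranks must equal 1.
    """
    if len(row_modes) != len(col_modes) or len(ranks) != len(row_modes) + 1:
        raise ValueError("need one (row, col) mode pair per core and len(modes) + 1 ranks")
    if ranks[0] != 1 or ranks[-1] != 1:
        raise ValueError("boundary TT ranks must be 1")
    return sum(
        ranks[k] * row_modes[k] * col_modes[k] * ranks[k + 1]
        for k in range(len(row_modes))
    )


# Hypothetical 1024 x 1024 layer, reshaped with 1024 = 4 * 8 * 8 * 4
# on both the row and the column side, and all internal TT ranks set to 8.
row_modes = [4, 8, 8, 4]
col_modes = [4, 8, 8, 4]
ranks = [1, 8, 8, 8, 1]

dense = dense_params(row_modes, col_modes)
compressed = tt_matrix_params(row_modes, col_modes, ranks)
print(dense, compressed, round(dense / compressed, 1))
```

With these illustrative shapes the dense layer stores 1,048,576 weights while the four TT cores together store 8,448. How much accuracy such a compressed layer retains depends on the task and the chosen ranks, which this count does not address.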

Deep neural networks have achieved impressive results in a wide range of machine learning tasks, but accurately quantifying the uncertainty of these models remains a significant challenge. Gaussian processes, on the other hand, provide a principled approach to modeling uncertainty. However, they often struggle to scale to large amounts of training data. This talk will first introduce the fundamental concepts behind the latest research on scalable Gaussian process models. Second, we will discuss two recent publications that demonstrate how to combine scalable Gaussian processes with representation and deep learning. References to programming frameworks will also be included.

Parallel computing techniques for scaling hyperparameter tuning of Gradient Boosted Trees and Deep Learning

Speaker: Dr. Nikos Bakas

Description: The presentation discusses hyperparameter tuning in machine learning model development when models are trained on supercomputers. We will present parallelization techniques using XGBoost and PyTorch on large-scale supercomputers, aiming to scale up performance in terms of computing time and accuracy. Computational bottlenecks during hyperparameter tuning, and the impact of multiprocessing on CPU utilization, will be presented, along with a cross-validation algorithm for efficient exploration of the hyperparameter optimization search space. The usage of XGBoost and PyTorch in a multiprocessing setting on powerful CPUs will be demonstrated, as well as insights on handling multiple OpenMP runtimes. Scaling-up results from applying the parallelization techniques on supercomputers will be presented, analyzing the impact of increasing the number of threads on hyperparameter optimization and the resulting reduction in tuning time.
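The general idea of evaluating hyperparameter candidates in parallel can be sketched with the standard library alone. This is not the speaker's method: `validation_loss` is a hypothetical stand-in for training a model (for example with XGBoost or PyTorch) and returning its validation loss, and the grid values are made up. The search function accepts any `concurrent.futures` executor; a thread pool is used in the demonstration only to keep the sketch runnable in any environment.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def validation_loss(params):
    """Hypothetical stand-in for training a model and returning its validation loss."""
    learning_rate, max_depth = params
    return (learning_rate - 0.1) ** 2 + 0.01 * (max_depth - 6) ** 2


def grid_search(executor, learning_rates, max_depths):
    """Evaluate every grid point through `executor`; return (best_params, best_loss)."""
    grid = list(product(learning_rates, max_depths))
    # Executor.map returns results in the same order as the inputs.
    losses = list(executor.map(validation_loss, grid))
    best = min(range(len(grid)), key=losses.__getitem__)
    return grid[best], losses[best]


# Demonstration with four workers on a small, made-up grid.
with ThreadPoolExecutor(max_workers=4) as pool:
    best_params, best_loss = grid_search(pool, [0.01, 0.1, 0.3], [3, 6, 9])
print(best_params, best_loss)
```

For CPU-bound training, a `ProcessPoolExecutor` created under an `if __name__ == "__main__":` guard could be passed instead, since `grid_search` relies only on the shared `map` interface and `validation_loss` is a module-level function that can be pickled.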

Efficient Data Cleaning and Pre-processing Techniques for Robust Machine Learning

Speaker: Dr. Charalambos Chrysostomou

Description: In this session, we will explore various data cleaning and pre-processing techniques that can enhance data quality and improve the performance of machine learning models. The session will cover handling missing values, outlier detection, data transformation, feature scaling, and encoding categorical variables. By applying these techniques, participants will learn how to create robust and high-performing machine learning models. The examples will be presented using Python and popular data processing libraries such as Pandas and Scikit-learn. Attendees will have the opportunity to become familiar with the workflow and implementation of data cleaning and pre-processing techniques.
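Three of the steps listed above (handling missing values, feature scaling, and encoding categorical variables) can be sketched as follows. The session itself uses Pandas and Scikit-learn; this sketch deliberately uses only the standard library so that it is self-contained, and the helper names and the toy data are assumptions made for illustration.

```python
from statistics import mean, median, pstdev


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def standardize(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def one_hot(labels):
    """Encode labels as 0/1 indicator rows, with columns in sorted category order."""
    categories = sorted(set(labels))
    rows = [[1 if label == c else 0 for c in categories] for label in labels]
    return categories, rows


# Made-up toy columns: one numeric feature with a gap, one categorical feature.
ages = [20.0, None, 40.0, 30.0]
cities = ["Nicosia", "Athens", "Nicosia", "Limassol"]

filled = impute_median(ages)           # the gap becomes 30.0, the median of 20, 30, 40
scaled = standardize(filled)           # zero mean, unit standard deviation
categories, encoded = one_hot(cities)  # columns: Athens, Limassol, Nicosia
```

In practice, Scikit-learn classes such as `SimpleImputer`, `StandardScaler`, and `OneHotEncoder` cover the same steps and remember the statistics fitted on the training data, so that identical transformations can be applied to new data.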

GPU CUDA Programming

Speaker: Dr. Giannis Koutsou

Prerequisites: Trainees should be comfortable programming in C.

Description: This session introduces the GPU programming model, and CUDA in particular. The hands-on component will begin with a step-by-step tutorial on how to write your first GPU program using CUDA, and continue with examples that demonstrate how data layout, use of shared memory, and GPU thread distribution affect GPU performance.
