PASADENA workshop

Prediction and Analysis of Structured AnD hEterogeNeous dAta

Program

9:00 Welcome coffee

9:30 - 10:20 Generalized Concomitant Multi-Task Lasso for sparse multimodal regression
Joseph Salmon (Université de Montpellier) - Slides.

10:20 - 10:50 Coffee break

10:50 - 11:40 Variational inference for Poisson lognormal models: application to multivariate analysis of count data
Julien Chiquet (AgroParisTech, INRA MIA Paris) - Slides.

11:40 - 12:30 Natural Language Processing for social computing: from opinion mining to human-agent interaction
Chloé Clavel (LTCI, Telecom-ParisTech) - Slides.

12:30 - 14:00 Lunch (free buffet)

14:00 - 14:50 Infinite Task Learning in RKHSs
Romain Brault (Thalès)

14:50 - 15:40 Simulating Alzheimer’s disease progression with personalised digital brain models
Stanley Durrleman (Aramis Lab, ICM)

15:40 - 16:10 Coffee break

16:10 - 17:00 Structured feature selection in high dimension for precision medicine
Chloé-Agathe Azencott (CBIO, Mines ParisTech, Institut Curie, INSERM) - Slides.

17:00 End

Abstracts

Structured feature selection in high dimension for precision medicine

Chloé-Agathe Azencott (CBIO, Mines ParisTech, Institut Curie, INSERM)
Many problems in genomics require the ability to identify relevant features in data sets containing orders of magnitude more features than samples. One such example is genome-wide association studies (GWAS), in which hundreds of thousands of single nucleotide polymorphisms are measured for orders of magnitude fewer samples.

This setup poses statistical and computational challenges, and traditional feature selection methods fall short. In my talk, I will present how prior knowledge of the structure of the features can help tackle this problem. In the first part of the talk, I will describe how to integrate additional information on the structure of the features, such as a biological network, to constrain the feature selection procedure. In the second part, I will discuss how to account for the linkage disequilibrium structure of the genome when searching for synergistic (or epistatic) effects between features.
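To give a flavour of what network-constrained feature selection can look like, here is a minimal Python sketch. It is an illustration under my own assumptions, not the speaker's actual method: an l1 penalty enforces sparsity while a graph-Laplacian penalty encourages features connected in a biological network to receive similar weights, and the resulting convex problem is solved by proximal gradient descent (ISTA). All names and parameter values are illustrative.

    # Minimal sketch (illustrative, not the speaker's method):
    # minimize 0.5*||y - Xw||^2 + lam*||w||_1 + 0.5*mu*w'Lw,
    # where L is the graph Laplacian of the feature network.
    import numpy as np

    def soft_threshold(v, t):
        """Proximal operator of t * ||.||_1."""
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def network_lasso(X, y, L, lam=0.1, mu=0.1, n_iter=500):
        """X: (n, p) design, y: (n,) response, L: (p, p) feature-graph Laplacian."""
        w = np.zeros(X.shape[1])
        # Step size from the Lipschitz constant of the smooth part.
        step = 1.0 / (np.linalg.norm(X, 2) ** 2 + mu * np.linalg.norm(L, 2))
        for _ in range(n_iter):
            grad = X.T @ (X @ w - y) + mu * (L @ w)  # gradient of the smooth part
            w = soft_threshold(w - step * grad, step * lam)
        return w  # sparse, network-smooth weights; nonzeros = selected features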

Infinite Task Learning in RKHSs

Romain Brault (Thalès)
Machine learning has witnessed tremendous success in solving tasks depending on a single hyperparameter. When considering simultaneously a finite number of tasks, multi-task learning makes it possible to account for the similarities of the tasks via appropriate regularizers. A step further consists of learning a continuum of tasks for various loss functions. A promising approach, called Parametric Task Learning, has paved the way in the continuum setting for affine models and piecewise-linear loss functions. In this work, we introduce a novel approach called Infinite Task Learning, whose goal is to learn a function whose output is itself a function over the hyperparameter space. We leverage tools from operator-valued kernels and the associated vector-valued RKHSs, which provide explicit control over the role of the hyperparameters and also allow us to consider new types of constraints. We provide generalization guarantees for the suggested scheme and illustrate its efficiency in cost-sensitive classification, quantile regression and density level set estimation.
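To make the continuum idea concrete, here is a toy Python sketch under my own assumptions, not the authors' construction: instead of fitting one quantile regressor per level t, a single function f(x, t) is learned for all t in (0, 1) by sampling the hyperparameter at each stochastic-gradient step and minimizing the pinball loss. The talk's approach works in vector-valued RKHSs with operator-valued kernels; the simple joint feature map below is only a stand-in.

    # Toy sketch (my simplification): learn one model over the whole
    # continuum of quantile levels t by sampling t at each SGD step.
    import numpy as np

    rng = np.random.default_rng(0)

    def features(x, t):
        """Illustrative joint features of input x and quantile level t."""
        return np.array([1.0, x, t, x * t, t ** 2])

    def pinball_grad(r, t):
        """Gradient of the pinball loss l_t(r) = max(t*r, (t-1)*r) w.r.t. r."""
        return np.where(r >= 0, t, t - 1.0)

    # Synthetic heteroscedastic data: the noise grows with x.
    x = rng.uniform(0, 1, 500)
    y = np.sin(2 * np.pi * x) + (0.1 + 0.5 * x) * rng.normal(size=500)

    w = np.zeros(5)
    lr = 0.05
    for _ in range(20000):
        i = rng.integers(len(x))
        t = rng.uniform(0.05, 0.95)          # sample a "task" (quantile level)
        phi = features(x[i], t)
        r = y[i] - phi @ w                   # residual
        w += lr * pinball_grad(r, t) * phi   # descent step (note dr/dw = -phi)

    # One model now answers *any* quantile query:
    print(features(0.5, 0.1) @ w, features(0.5, 0.9) @ w)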

Variational inference for Poisson lognormal models: application to multivariate analysis of count data

Julien Chiquet (AgroParisTech, INRA MIA Paris)
Many application domains such as ecology or genomics have to deal with multivariate count data. A typical example is the joint observation of the respective abundances of a set of species in a series of sites, with the aim of understanding the co-variations between these species. The Gaussian setting provides a canonical way to model such dependencies, but it does not apply to counts in general. We adopt here the Poisson lognormal (PLN) model, which is attractive since it describes multivariate count data with a Poisson emission law while all the dependencies are kept in a hidden multivariate Gaussian layer. Usual maximum-likelihood inference raises some issues in PLN; we show how to circumvent them by means of a variational algorithm to which gradient descent readily applies. We then derive several variants of our algorithm to apply PLN to PCA, LDA and sparse covariance inference on multivariate count data. We illustrate our method on microbial ecology datasets, and show the importance of accounting for covariate effects to better understand interactions between species.
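To show the mechanics on the simplest possible case, here is a numpy sketch under my own simplifying assumptions (a diagonal latent covariance and no covariates; not the authors' implementation): Y_ij ~ Poisson(exp(Z_ij)) with Z_i ~ N(mu, diag(sigma2)), a Gaussian variational posterior q(Z_i) = N(M_i, diag(S2_i)), gradient ascent on the ELBO for the variational means, a fixed-point update for the variational variances, and closed-form M-steps for the model parameters.

    # Simplified sketch (diagonal covariance; illustrative only).
    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 200, 5
    mu_true = rng.normal(0.3, 0.3, p)
    Y = rng.poisson(np.exp(rng.normal(mu_true, 0.3, (n, p))))

    mu, sigma2 = np.zeros(p), np.ones(p)
    M = np.log1p(Y.astype(float))   # variational means, initialised from data
    S2 = np.ones((n, p))            # variational variances
    lr = 0.05
    for _ in range(2000):
        A = np.exp(M + S2 / 2)      # E_q[exp(Z)], entrywise
        # Gradient ascent on the ELBO for the variational means:
        M += lr * (Y - A - (M - mu) / sigma2)
        # Fixed-point update for the variances (ELBO stationarity condition):
        S2 = 1.0 / (A + 1.0 / sigma2)
        # Closed-form M-step for the model parameters:
        mu = M.mean(axis=0)
        sigma2 = (S2 + (M - mu) ** 2).mean(axis=0)

    print(np.round(mu, 2), np.round(mu_true, 2))  # estimated vs. true latent means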

Natural Language Processing for social computing: from opinion mining to human-agent interaction

Chloé Clavel (LTCI, Telecom-ParisTech)

The Social Computing topic aims to gather research around computational models for the analysis of social interactions, whether for web analysis or social robotics. The peculiarity of this theme is its multidisciplinary approach: computational models are established in close collaboration with research fields such as psychology, sociology, and linguistics. They build on methods from several areas: signal processing (e.g. speech signal processing for emotion recognition), machine learning (e.g. structured output learning for the detection of opinions in texts), and computer science (e.g. natural language processing for opinion detection, or the integration of the socio-emotional component into human-machine interactions). This presentation will describe examples of studies conducted around the Social Computing topic.

In particular, we will examine the role of natural language processing in human-agent interaction by presenting our progress on the different research topics we are currently working on, such as the analysis of the likes and dislikes of the user during her interactions with a virtual agent, using symbolic methods (Langlet & Clavel, 2016) and machine learning methods (Barriere et al., 2018). Opinion mining methods and their challenges in terms of machine learning will also be discussed (Garcia et al., 2018).

Simulating Alzheimer’s disease progression with personalised digital brain models

Stanley Durrleman (Aramis Lab, ICM)

Simulating the effects of Alzheimer's disease on the brain is essential to better understand, predict and control how the disease progresses in patients. Our limited understanding of how disease mechanisms lead to visible changes in brain images and clinical examination hampers the development of biophysical simulations.

Instead, we propose a statistical learning approach, where the repeated observations of several patients over time are used to synthesise personalised digital brain models. They provide spatiotemporal views of structural and functional brain alterations and associated scenarios of cognitive decline at the individual level.

We show that the personalisation of the models to unseen subjects reconstructs their progression with errors of the same order as the uncertainty of the measurements. The simulation of synthetic patients generalises the distributions of the data in the training cohort. The analysis of factors modulating disease progression evidences a prominent sexual dimorphism and probable compensatory mechanisms in APOE-ε4 carriers.

This first-of-its-kind simulator offers an unparalleled way to explore the heterogeneity of the disease's manifestation on the brain, and to predict its progression in each patient.

Concomitant Lasso with Repetitions (CLaR): beyond averaging multiple realizations of heteroscedastic noise

Joseph Salmon (Université de Montpellier)

Sparsity-promoting norms are frequently used in high dimensional regression. A limitation of Lasso-type estimators, however, is that the regularization parameter depends on the noise level, which varies between datasets and experiments. Estimators such as the concomitant Lasso address this dependence by jointly estimating the noise level and the regression coefficients. As sample sizes are often limited in high dimensional regimes, simplified heteroscedastic models are customary. However, in many experimental applications, data is obtained by averaging multiple measurements. This helps reduce the noise variance, yet it dramatically reduces the sample size, preventing refined noise modeling. In this work, we propose an estimator that can cope with complex heteroscedastic noise structures by using non-averaged measurements and a concomitant formulation. The resulting optimization problem is convex and, thanks to smoothing theory, amenable to state-of-the-art proximal coordinate descent techniques that can leverage the expected sparsity of the solutions. Practical benefits are demonstrated on simulations and on neuroimaging applications.

This is joint work with Quentin Bertrand, Mathurin Massias and Alexandre Gramfort.
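For readers unfamiliar with concomitant formulations, the following Python sketch shows the plain concomitant Lasso as a simple alternating scheme. It is my own illustration, not the CLaR algorithm of the talk, which additionally exploits non-averaged repeated measurements and heteroscedastic noise.

    # Minimal sketch of the *plain* concomitant Lasso (illustrative).
    # Objective: ||y - Xw||^2 / (2*n*sigma) + sigma/2 + lam*||w||_1,
    # minimized jointly over the coefficients w and the noise level sigma.
    import numpy as np
    from sklearn.linear_model import Lasso

    def concomitant_lasso(X, y, lam=0.1, sigma_min=1e-3, n_iter=20):
        n = len(y)
        sigma = max(np.std(y), sigma_min)  # crude initial noise level
        w = np.zeros(X.shape[1])
        for _ in range(n_iter):
            # w-step: with sigma fixed, this is an ordinary lasso whose
            # effective penalty is scaled by sigma (sklearn minimizes
            # ||y - Xw||^2 / (2n) + alpha * ||w||_1).
            w = Lasso(alpha=lam * sigma, fit_intercept=False).fit(X, y).coef_
            # sigma-step: closed-form minimizer, floored at sigma_min
            # (the floor plays the role of the smoothing).
            sigma = max(np.linalg.norm(y - X @ w) / np.sqrt(n), sigma_min)
        return w, sigma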