On machine learning algorithms and compositional data

Tolosana-Delgado, R.; Talebi, H.; Boogaart, K. G. van den

Predictive methods such as Lasso regression, partition trees and random forests (RF), artificial neural networks (ANN) and deep learning, or support-vector machines (SVM) and other kernel methods have become increasingly popular in recent years, including in the compositional data community. However, most contributions using machine learning algorithms on compositional data have simply applied the relevant method to an additive, centered or isometric log-ratio (alr, clr, ilr) transformed version of the training data, without regard for the properties of the resulting construct. In this contribution we briefly review the fundamental construction of these methods and examine how they can be tweaked or adapted to account for the compositional scale of the data.
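For readers unfamiliar with the transformations named above, the following is a minimal sketch of the alr and clr transformations using NumPy. The function names `alr` and `clr`, and the choice of the last part as the alr denominator, are assumptions made for this illustration and are not taken from the contribution; the ilr transformation is omitted because it additionally requires a choice of orthonormal basis.

```python
import numpy as np

def alr(x):
    """Additive log-ratio: log of each part relative to the last part.

    Using the last part as denominator is an arbitrary choice made for
    this sketch; any other part could serve as the reference.
    """
    x = np.asarray(x, dtype=float)
    return np.log(x[..., :-1] / x[..., -1:])

def clr(x):
    """Centered log-ratio: log of each part relative to the geometric mean."""
    logx = np.log(np.asarray(x, dtype=float))
    return logx - logx.mean(axis=-1, keepdims=True)

# Hypothetical 3-part composition, used only for illustration.
composition = np.array([0.2, 0.3, 0.5])
alr_scores = alr(composition)   # two coordinates
clr_scores = clr(composition)   # three coefficients that sum to zero
```

Both transformations depend only on ratios between parts, so rescaling the composition (for example, from proportions to percentages) leaves the scores unchanged.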

As an example, a binary partition tree aims at constructing a classification hierarchy, where each branch splits the data into two subgroups according to the single covariable that provides the highest purity of the two resulting subgroups; at the end of the hierarchy, each branch contains data from a single group only. Random Forests (Breiman, 2001) were introduced to deal with the obvious over-fitting of partition trees by means of a double randomisation strategy: first, the observations are bootstrapped, creating B different trees that form the forest; second, each branching of each tree is based not on the whole set of variables, but on a different random subset of them. Because only one variable is actively used at each branching, the method is not invariant under the choice of log-ratio transformation. A way to allow for this single-variable selection while keeping the relative nature of compositional information would be to build the trees on the set of pairwise log-ratios (pwlr). This applies to all kinds of tree-based methods with compositional covariables.
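The pairwise log-ratio construction mentioned above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `pairwise_logratios`, the default part names and the example data are hypothetical, and all parts are assumed to be strictly positive (zeros would need separate treatment).

```python
import numpy as np
from itertools import combinations

def pairwise_logratios(X, names=None):
    """Return all D*(D-1)/2 pairwise log-ratios log(x_i / x_j), i < j.

    X is an (n, D) array of strictly positive compositional parts.
    Hypothetical helper written for illustration only.
    """
    X = np.asarray(X, dtype=float)
    if np.any(X <= 0):
        raise ValueError("all parts must be strictly positive")
    n_parts = X.shape[1]
    if names is None:
        names = [f"x{j}" for j in range(n_parts)]
    pairs = list(combinations(range(n_parts), 2))
    log_X = np.log(X)
    # One column per unordered pair of parts.
    Z = np.column_stack([log_X[:, i] - log_X[:, j] for i, j in pairs])
    labels = [f"log({names[i]}/{names[j]})" for i, j in pairs]
    return Z, labels

# Hypothetical data: two observations of a 3-part composition.
X = np.array([[1.0, 2.0, 4.0],
              [2.0, 2.0, 2.0]])
Z, labels = pairwise_logratios(X)
```

The resulting matrix `Z` could then be passed as the covariable matrix to any tree-based learner (for instance scikit-learn's `RandomForestClassifier`), so that each single-variable split acts on one log-ratio of two parts rather than on one coordinate of a particular alr, clr or ilr representation.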

Keywords: affine equivariance; subcompositional coherence; variable selection

Contribution to proceedings
8th International Workshop on Compositional Data Analysis, 03.-08.06.2019, Terrassa, Spain

Permalink: https://www.hzdr.de/publications/Publ-28476