STIC - MATH - CLIMAT AmSud

Statistical modeling, nonparametric inference and model selection for complex data

1st SMILE Workshop

"Workshop en Modelamiento Estadístico para datos complejos"

As a kickoff initiative, a two-day workshop will take place in Valparaíso, Chile, during the first semester of the project, on 22 and 29 August 2024, with talks on the specific topics addressed in the project.

The talks will take place on Thursday 22 August and Thursday 29 August, 9:00-13:00 each day (Chile time; see the venue at the bottom of the page), in the CIAE building, Universidad de Valparaíso.

 

Programme:

Thursday 22 August:

9:00-9:45: John Barrera, Universidad de Valparaíso

10:00-10:45: Tania Roa, Universidad Adolfo Ibáñez

10:45-11:10: Coffee break

11:10-11:55: Maritza Márquez, Universidad Adolfo Ibáñez

12:10-13:00: Marta Avalos, Université de Bordeaux-INRIA

Thursday 29 August:

9:00-9:45: Miguel Padrino, Universidad de Valparaíso

10:00-10:45: Natalia Bahamonde, Pontificia Universidad Católica de Valparaíso

10:45-11:10: Coffee break

11:10-11:55: Cristina Chávez, Université Paris-Nanterre

12:10-13:00: Ana Karina Fermín, Université Paris-Nanterre

 

 


ABSTRACTS

  • Estimation of a two-part mixed effects model for longitudinal compositional data using the SAEM algorithm (John Barrera)

The study of the human microbiome with genetic sequencing techniques produces longitudinal data that must be analyzed with complex mixed-effects models able to explain temporal variation while accounting for the data's asymmetry and excess of zero values. The estimation procedures proposed to date rely on log-likelihood approximation methods such as Gauss-Hermite quadrature. These numerical approaches are known to produce inconsistent estimates and to require a sufficiently large number of quadrature points, which often makes convergence slow and unstable. This motivates the present work, which seeks to improve the quality of statistical inference for longitudinal microbiome data and thereby support better decision-making and conclusions. We propose a maximum likelihood estimation method for one of these models, zero-inflated beta regression (ZIBR), based on the Stochastic Approximation Expectation Maximization (SAEM) algorithm, which has shown good performance in complex mixed-effects models. The details of its application and implementation are presented, together with results on both simulated and real data. In simulation studies, comparisons with the method based on log-likelihood approximations show that SAEM gives better results in parameter estimation and hypothesis testing when the data are unbalanced. Finally, the SAEM-based estimation method is applied to two real microbiome datasets (pediatric inflammatory bowel disease patients and the vaginal microbiome in pregnant women), and the results show its usefulness in detecting changes in both the presence and the abundance of bacterial taxa.
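For readers unfamiliar with the model, one common way to write a two-part zero-inflated beta regression with random intercepts is sketched below. The notation ($p_{it}$, $u_{it}$, $\phi$, $a_i$, $b_i$, $X_{it}$, $Z_{it}$) is an assumption made here for illustration, and the parametrization used in the talk may differ.

$$
Y_{it} \sim
\begin{cases}
0 & \text{with probability } 1 - p_{it}, \\
\mathrm{Beta}\bigl(u_{it}\phi,\ (1 - u_{it})\phi\bigr) & \text{with probability } p_{it},
\end{cases}
$$

$$
\operatorname{logit}(p_{it}) = a_i + X_{it}^{\top}\alpha, \qquad
\operatorname{logit}(u_{it}) = b_i + Z_{it}^{\top}\beta, \qquad
a_i \sim \mathcal{N}(a, \sigma_1^{2}), \quad b_i \sim \mathcal{N}(b, \sigma_2^{2}).
$$

In this sketch the first part models the presence of a taxon in subject $i$ at time $t$ and the second part models its relative abundance when present, which matches the "presence and abundance" distinction made in the abstract.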

  • Consistency of the Nadaraya-Watson estimator in a nonparametric regression model driven by fractional Brownian motion sampled at random times (Tania Roa)

The Nadaraya-Watson (N-W) estimator can be regarded as a special case of a broader class of nonparametric estimators, the so-called local polynomial estimators. This work studies the $L^{2}$ consistency of the N-W estimator when the observations are sampled at random times and the noise is driven by a long-memory process, by controlling the bandwidth through the number of observations. Unlike previous work, a simulation study is also included, presenting two different kernels, the error analysis, and the asymptotic behavior of the Mean Squared Error (MSE). Finally, the main conclusions of the theoretical and practical study, implemented through the simulation study, are summarized, showing that controlling the number of recorded observations together with using the N-W estimator with a bounded-support kernel yields better results.
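As a rough illustration of the estimator discussed in this abstract, the following Python sketch computes a Nadaraya-Watson estimate with a bounded-support (Epanechnikov) kernel at uniformly random sampling times. It is a simplification under stated assumptions: the noise is i.i.d. Gaussian rather than driven by fractional Brownian motion, and the regression function, sample size, bandwidth, and all function names are hypothetical choices, not taken from the talk.

```python
import math
import random


def epanechnikov(u):
    """Bounded-support kernel: nonzero only for |u| <= 1."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def nadaraya_watson(x0, xs, ys, bandwidth, kernel=epanechnikov):
    """Kernel-weighted average of ys, with weights centred at x0."""
    weights = [kernel((x0 - x) / bandwidth) for x in xs]
    total = sum(weights)
    if total == 0.0:
        return float("nan")  # no observation falls inside the window
    return sum(w * y for w, y in zip(weights, ys)) / total


# Illustrative data: random sampling times on [0, 1] and a hypothetical
# regression function m(t) = sin(2*pi*t), observed with i.i.d. Gaussian
# noise (an assumption; the talk considers long-memory noise instead).
random.seed(0)
n = 200
times = [random.random() for _ in range(n)]
values = [math.sin(2 * math.pi * t) + random.gauss(0.0, 0.2) for t in times]

# Assumed bandwidth rate; shrinking it with n mirrors the idea of
# controlling the bandwidth through the number of observations.
h = n ** (-1 / 5)
estimate = nadaraya_watson(0.25, times, values, h)
print(f"estimate at t=0.25: {estimate:.3f} (true value 1.000)")
```

Swapping `epanechnikov` for another kernel function is enough to compare kernels in this sketch, in the spirit of the two-kernel simulation study mentioned above.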

  • Longitudinal Data Classification through nonlinear mixed effects models with heterogeneity in the random effects in some subpopulations (Maritza Márquez)

Biomedical markers are generally associated with longitudinal data, and these data are usually analyzed using linear or nonlinear mixed-effects models, depending on their complexity. Nonlinear mixed-effects (NLME) models with normal distribution are commonly used for modeling complicated longitudinal trajectories, assuming that individuals come from homogeneous populations with normal errors. However, a homogeneous population assumption may inappropriately ignore significant aspects of between-individual and within-individual variability, leading to incorrect modeling outcomes. Moreover, the normality assumption for model errors can result in a lack of robustness, leading to unreasonable estimates, especially if the data show asymmetry. Heterogeneity in the random effects can be incorporated by proposing NLME models that relax the assumption of a homogeneous population and allow us to distinguish different parameters between several unobserved classes within a heterogeneous population. Our research proposal consists of using the methodology suggested by [1] in a context of mixtures of NLME models, using the MSAEM algorithm, on a dataset of pregnant women that is divided by default into normal pregnancies, made up of women who carried their pregnancy to term, and abnormal pregnancies, made up of the remaining women who, for one reason or another, had a spontaneous miscarriage. In other investigations, this last group has always been challenging to estimate and discretize correctly.
For this reason, we propose i) to fit a mixture of NLME models based on a mixture of distributions (specifically, a mixture of distributions for the unobservable individual parameters), using the methodology proposed by [1], and ii) to propose an efficient method for supervised classification into two groups that takes into account the discretization of new subgroups or classes that may emerge, indicating the group to which each individual belongs. To illustrate this proposal, we consider a clinical study of the levels of the hormone β-HCG in pregnant women, a biomarker used to indicate changes during pregnancy. Our proposal provides a framework for modeling longitudinal trajectories, assuming individuals come from heterogeneous sub-populations. We categorize pregnant women into two groups using the proposed models and analyze their data, distinguishing between normal (birth) and abnormal (miscarriage) pregnancies, and separating new subgroups or classes that emerge in the abnormal pregnancies subpopulation by assuming heterogeneity.

  • A Biostatistician in the Era of a Paradigm Shift Towards Data Science in Epidemiology: Examples from My Experience (Marta Avalos)

 

  • Adaptive estimation in regression models for weakly dependent data and an explanatory variable with known density (Miguel Padrino)

This talk focuses on estimating the regression function in models where the explanatory variable is a weakly dependent process with an exponentially decaying correlation coefficient and a known, bounded density function. The accuracy of the estimate is assessed through the pointwise risk, and a data-driven approach is proposed that uses kernel estimation with bandwidth selection by the Goldenshluger-Lepski method. The resulting estimator is shown to satisfy an oracle-type inequality and to be adaptive over Hölder classes. In addition, unsupervised statistical learning techniques are explored to calibrate the method, supported by simulations that illustrate its performance.

  • An irregularly spaced ARMA(1,1) model for contamination data (Natalia Bahamonde)

Missing observations and unevenly spaced data are problems common to different disciplines in the context of time series analysis. This paper introduces a new approach to deal with both issues, by considering an irregularly spaced autoregressive moving average process of order (1,1) that is stationary, homoscedastic and invertible, but inhomogeneous, allowing temporal variations in its coefficients. We test our model in the analysis of greenhouse time series by comparing it with a standard benchmark in the literature. Our methodology shows a substantial advantage in computational time over the competitor.

  • Ridge regularization for spatial autoregressive models with multicollinearity issues (Cristina Chávez)

This work proposes a new method for building an explanatory spatial autoregressive model in a multicollinearity context. We use Ridge regularization to bypass the collinearity issue. We present new estimation algorithms that allow for the estimation of the regression coefficients as well as the spatial dependence parameter. A spatial cross-validation procedure is used to tune the regularization parameter, since ordinary cross-validation techniques are not applicable to spatially dependent observations. Variable importance is assessed by permutation tests since classical tests are not valid after Ridge regularization. We assess the performance of our methodology through numerical experiments on simulated data. Finally, we apply our method to a real dataset and evaluate the impact of some socio-economic variables on the COVID-19 intensity in France.

  • A viewpoint on regularization and model selection methods for ill-posed inverse problems in the modeling of spatially dependent data (Ana Karina Fermín)

 


Organizing Committee

  • Lisandro Fermin (Universidad de Valparaíso)

  • Cristian Meza (Universidad de Valparaíso) 


Venue

Sala 402, Edificio CIAE, Universidad de Valparaíso

Blanco 1931, Valparaíso