NApy: Efficient Statistics in Python for Large-Scale Heterogeneous Data with Enhanced Support for Missing Data

Fabian Klaus Woller; Lis Arend; Christian Fuchsberger; M List; DB. Blumenthal

Conference presentation

33rd Conference on Intelligent Systems for Molecular Biology (ISMB) / 24th European Conference on Computational Biology (ECCB) (Liverpool, 20/07/2025–24/07/2025)

2025

Handle:

https://hdl.handle.net/10863/51637

Abstract

Computational Genomics

Existing Python libraries and tools cannot efficiently run statistical tests (such as Pearson correlation, ANOVA, or the Mann-Whitney U test) on large datasets in the presence of missing values. This becomes a problem as soon as runtime and memory constraints are essential considerations for a particular use case. Relevant research areas where such limitations arise include interactive tools and databases for exploratory analysis of large mixed-type data. At the same time, biomedical analyses of such large datasets (e.g., population cohorts or electronic health record data) have to date mostly investigated statistical associations between specific variables (e.g., correlations between measurements such as body mass index and blood pressure). However, the rapidly growing popularity of systems approaches in biomedicine makes it increasingly relevant to efficiently compute pairwise statistical associations for all available pairs of variables in a dataset. To address this problem, we present the Python tool NApy, which relies on a Numba and C++ backend with OpenMP parallelization to enable scalable statistical testing for mixed-type datasets in the presence of missing values. We assess NApy’s efficiency with respect to both runtime and memory consumption on simulated as well as real-world input data originating from a population cohort study. We show that NApy outperforms Python competitor tools and baseline implementations with naïve Python-based parallelization by orders of magnitude, enabling on-the-fly analyses in interactive applications. NApy is publicly available at https://github.com/DyHealthNet/NApy.
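To make the task described in the abstract concrete, the following is a minimal NumPy sketch of all-pairs Pearson correlation with pairwise deletion of missing values. It is not NApy's API or implementation; the function name `pairwise_pearson_nan` and the layout (variables in rows, samples in columns, missing values encoded as NaN) are assumptions chosen for illustration only.

```python
import numpy as np

def pairwise_pearson_nan(data: np.ndarray) -> np.ndarray:
    """Pearson correlation for all pairs of rows, ignoring NaNs pairwise.

    Hypothetical helper for illustration; rows are variables, columns are samples.
    For each pair (i, j), only samples observed in both rows are used.
    """
    mask = ~np.isnan(data)           # True where a value is observed
    m = mask.astype(float)
    x = np.where(mask, data, 0.0)    # replace missing entries by 0 so they drop out of sums

    n = m @ m.T                      # n[i, j]: number of samples observed in both i and j
    sx = x @ m.T                     # sx[i, j]: sum of row i over samples shared with j
    sxx = (x * x) @ m.T              # sxx[i, j]: sum of squares of row i over shared samples
    sxy = x @ x.T                    # sxy[i, j]: sum of products over shared samples

    # Pairs with no shared samples or zero variance yield NaN instead of raising.
    with np.errstate(invalid="ignore", divide="ignore"):
        cov = sxy - sx * sx.T / n
        var_i = sxx - sx ** 2 / n
        var_j = sxx.T - sx.T ** 2 / n
        return cov / np.sqrt(var_i * var_j)
```

Calling `pairwise_pearson_nan(data)` on an array of shape `(variables, samples)` returns a square matrix of correlation coefficients. The sketch replaces a per-pair Python loop with a few matrix products, which is one common way to avoid the overhead the abstract attributes to naive approaches; a compiled, parallel backend such as the one described above would go further.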

Files and links (1)

url

https://www.iscb.org/ismbeccb2025/homeView

Details

Title: NApy: Efficient Statistics in Python for Large-Scale Heterogeneous Data with Enhanced Support for Missing Data
Creators: Fabian Klaus Woller; Lis Arend; Christian Fuchsberger; M List; DB. Blumenthal
Conference: 33rd Conference on Intelligent Systems for Molecular Biology (ISMB) / 24th European Conference on Computational Biology (ECCB) (Liverpool, 20/07/2025–24/07/2025)
Identifiers: (EURAC)30796881; 991007227587401241
Academic Unit: Institute for Biomedicine
Language: English
Resource Type: Conference presentation
Description coverage: none
Description audience: Scientific
Local Fields: Scientific
Author Names String: Woller F, Arend L, Fuchsberger C, List M, Blumenthal DB.
