Abstract
Existing Python libraries and tools lack the ability to efficiently run statistical tests (such as Pearson correlation, ANOVA, or the Mann-Whitney U test) on large datasets in the presence of missing values. This becomes a problem as soon as runtime and memory constraints are essential considerations for a particular use case. Relevant areas where such limitations arise include interactive tools and databases for exploratory analysis of large mixed-type data. At the same time, biomedical data analyses on such large datasets (e.g., population cohorts or electronic health record data) have to date mostly investigated statistical associations between specific variables (e.g., correlations between measurements such as body mass index and blood pressure). However, the rapidly growing popularity of systems approaches in biomedicine makes it increasingly relevant to efficiently compute statistical associations for all available pairs of variables in a dataset. To address this problem, we present the Python tool NApy, which relies on a Numba and C++ backend with OpenMP parallelization to enable scalable statistical testing for mixed-type datasets in the presence of missing values. We assess NApy’s efficiency with respect to both runtime and memory consumption on simulated as well as real-world input data originating from a population cohort study. We show that NApy outperforms competing Python tools and baseline implementations with naïve Python-based parallelization by orders of magnitude, enabling on-the-fly analyses in interactive applications. NApy is publicly available at https://github.com/DyHealthNet/NApy.
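To make the underlying task concrete, the following is a minimal sketch of an all-pairs Pearson correlation with pairwise deletion of missing values, written in plain NumPy. It is not NApy's API or implementation; the function name `pairwise_pearson_nan` and the convention that rows are variables and missing values are encoded as NaN are assumptions made for illustration. It represents the kind of naïve baseline whose quadratic loop over variable pairs becomes the bottleneck on large datasets.

```python
import numpy as np


def pairwise_pearson_nan(data: np.ndarray) -> np.ndarray:
    """Hypothetical baseline: Pearson r for all pairs of rows (variables).

    Missing values are assumed to be encoded as NaN. For each pair, only
    the samples observed in BOTH variables are used (pairwise deletion).
    """
    n_vars = data.shape[0]
    corr = np.full((n_vars, n_vars), np.nan)
    valid = ~np.isnan(data)  # True where a value is present

    for i in range(n_vars):
        for j in range(i, n_vars):
            mask = valid[i] & valid[j]  # samples present in both variables
            if mask.sum() < 2:
                continue  # too few shared samples, leave as NaN
            x = data[i, mask]
            y = data[j, mask]
            xc = x - x.mean()
            yc = y - y.mean()
            denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
            if denom == 0:
                continue  # constant variable, correlation undefined
            r = (xc * yc).sum() / denom
            corr[i, j] = corr[j, i] = r
    return corr


# Small example: three variables, four samples, two missing entries.
example = np.array([
    [1.0, 2.0, 3.0, np.nan],
    [2.0, 4.0, np.nan, 8.0],
    [3.0, 2.0, 1.0, 0.0],
])
result = pairwise_pearson_nan(example)
```

Because every pair of variables may share a different subset of observed samples, the means and variances have to be recomputed per pair, which is why standard vectorized correlation routines cannot be applied directly when values are missing.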