Abstract
Motivation
Polygenic risk scoring (PRS) is a common tool for understanding the contribution of genetics to complex traits and diseases. However, the heterogeneity of PRS algorithms and the complexity of genomic data make their integration and analysis challenging. Many algorithms are available, and because each score captures different characteristics, comparing their performance has become standard practice. Pre-processing data from multiple sources and integrating and harmonizing different data types is complex and error-prone. We implemented a new pipeline based on the Snakemake framework, a robust workflow management system, to address the challenges of reproducibility, scalability, and high-performance computing in genetic research. The pipeline is available at https://github.com/EuracBiomedicalResearch/prs_pipeline.
Methods
The pipeline integrates a variety of tools and programming languages within a Dockerized environment, ensuring a reproducible workflow. Each module runs in its own conda environment, which is defined through a requirements file and can be installed separately for debugging. This modular design allows configurations to be tailored to user needs and makes upgrades and modifications easier for expert users. The pipeline's robustness and accuracy have been validated on real-world genomic datasets. It has also been tested in various computing environments, including parallel clusters and personal notebooks, to confirm its versatility and reliability.
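As a minimal sketch of the per-module environment idea, the snippet below builds the conda command that would create one module's environment on its own, for example when debugging that module. The module names, environment names, and file paths are hypothetical and are not taken from the pipeline's repository; only the `conda create --name ... --file ...` form is standard conda usage.

```python
from typing import Dict, List

# Hypothetical mapping from module name to its requirements file.
# The real pipeline may name and organise these files differently.
MODULE_REQUIREMENTS: Dict[str, str] = {
    "prscs": "envs/prscs.txt",
    "ldpred2": "envs/ldpred2.txt",
    "sbayesr": "envs/sbayesr.txt",
}


def conda_create_command(module: str) -> List[str]:
    """Return the conda command that creates one module's environment in isolation."""
    if module not in MODULE_REQUIREMENTS:
        raise KeyError(f"unknown module: {module}")
    return [
        "conda", "create", "--yes",
        "--name", f"prs_{module}",              # hypothetical naming scheme
        "--file", MODULE_REQUIREMENTS[module],  # requirements file for this module
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(conda_create_command("ldpred2")))
```

Keeping one requirements file per module means a failing step can be reproduced in its own environment without installing the dependencies of every other module.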
Results
Our pipeline integrates multiple PRS algorithms, including PRS-CS, LDpred2, and SBayesR, with various genomic data formats, such as BGEN, PLINK, and VCF. By leveraging Snakemake's scalability, reproducibility, and parallel processing features, it provides a streamlined solution for PRS analysis that reduces the time and effort required for data processing and analysis. Its ability to handle large-scale genomic datasets and run several PRS algorithms efficiently lets researchers focus on data interpretation and on advancing the understanding of complex traits and diseases. The pipeline thereby makes PRS analysis more accessible and can enhance the utility of PRS in precision medicine and clinical genetics.
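For readers less familiar with how scores from different algorithms are compared, the sketch below illustrates the scoring step they share: an individual's score is the sum of allele dosages weighted by per-variant effect sizes, and the algorithms differ in how they derive those weights. This is an illustration only, not the pipeline's implementation; the function names are hypothetical and the dosages and weights are made-up toy values, not output from PRS-CS, LDpred2, or SBayesR.

```python
from typing import Dict, List


def polygenic_score(dosages: List[float], weights: List[float]) -> float:
    """Weighted sum of allele dosages (0 to 2 per variant) and per-variant weights."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must cover the same variants")
    return sum(d * w for d, w in zip(dosages, weights))


def score_all(dosages: List[float],
              weights_by_method: Dict[str, List[float]]) -> Dict[str, float]:
    """Score one individual with each method's weights for side-by-side comparison."""
    return {method: polygenic_score(dosages, weights)
            for method, weights in weights_by_method.items()}


# Toy example: one individual, three variants, illustrative weights only.
dosages = [0.0, 1.0, 2.0]
weights_by_method = {
    "PRS-CS": [0.10, -0.05, 0.20],
    "LDpred2": [0.08, -0.02, 0.25],
    "SBayesR": [0.00, -0.04, 0.30],
}
scores = score_all(dosages, weights_by_method)

if __name__ == "__main__":
    for method, value in scores.items():
        print(f"{method}: {value:.2f}")
```

Because the scoring step is the same across methods, harmonizing variant identifiers and alleles between the weight files and the genotype data is what makes the resulting scores comparable.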