Abstract
Ensuring high data quality is critical for effective decision-making and operational efficiency in sensor-driven environments. Machine learning (ML) models depend heavily on the quality of the data they use, yet existing methodologies often fail to address this dependency adequately. This thesis examines how data quality influences ML performance and vice versa, with a focus on anomaly detection, root cause analysis, and pattern recognition in sensor-based datasets. Despite advances in ML, significant gaps remain in understanding how data anomalies affect ML performance and in developing strategies for anomaly detection and root cause analysis. Current approaches lack a framework that integrates ML metrics with data quality dimensions, and they provide no mechanism for continuous improvement based on data feedback. To address these gaps, the study proposes a layered data architecture that integrates ML metrics with data quality dimensions, creating a feedback loop intended to continually refine data collection and analysis processes. The framework aims to improve the accuracy and utility of sensor-generated data, contributing to the field of data quality management within ML applications. The primary contribution of this thesis is a novel framework for ML quality management that improves data integrity and provides actionable insights for better decision-making and operational outcomes in sensor-driven environments. The thesis specifically explores the relationship between data quality and ML model performance by examining the impact of data anomalies on the accuracy, precision, and recall of ML algorithms.
It also introduces a rule-based anomaly identification system that detects various types of quality anomalies in ML models using a set of rules derived from actual data observations. In summary, this thesis offers an approach to enhancing data quality in ML applications, advancing both the management of sensor data quality and the performance of ML models.