Abstract
Time series data mining is a diverse field that offers algorithms for tasks ranging from anomaly detection to classification, underpinned by methods that assess similarity between subsequences within complex data. A critical yet under-explored area within this field is the assessment of dataset complexity, especially for multi-class time series. This thesis introduces the concept of empirical hardness, a dataset-specific measure of classification difficulty, and investigates the effectiveness of existing complexity measures in time series analysis. Our findings show that while many traditional complexity measures correlate with empirical hardness, they often offer redundant insights and fail to adequately capture the nuances of multi-class datasets, highlighting the need for new, multi-class-specific metrics tailored for time series data. A second focus of this thesis is the efficient extraction and evaluation of shapelets, discriminative subsequences that serve as features in classification tasks. We identify that the primary challenge in shapelet discovery lies in the high computational cost of evaluating distances across a vast number of candidates. To address this, we introduce an algorithm that reduces the candidate pool by clustering similar subsequences and selecting representative patterns, allowing us to explore multiple window lengths. Furthermore, we introduce an evaluation method that prioritizes intra-class versus inter-class distinction. It yields higher accuracy even with a small number of shapelets, which also improves efficiency. This approach enables us to capture a wider range of informative subsequences, achieving strong classification performance with fewer, more discriminative shapelets. Lastly, we address the challenge of exactly computing correlations between all aligned subsequences of two time series and present a visualization tool that compactly represents these relationships. This correlation analysis serves as a valuable tool for understanding when and where two signals align or diverge, enabling a deeper exploration of time-dependent patterns within the data.