Abstract
Predicting stock prices remains a complex challenge because market prices behave semi-randomly and are influenced by many factors. One significant determinant is company-related news. News articles that affect the market are termed events, while those with no influence are considered noise. Given the semi-random nature of stock prices and the high volume of news articles, it is challenging to establish a correlation between stock prices and news content, to identify events within news articles, and ultimately to predict price changes from those articles. This thesis introduces a model designed to predict the impact of news articles on future stock prices. The model captures the relationship between price changes and news articles through an event detection method. This method improves the prediction model by extracting the events from all previously published news articles and removing unimportant ones; in other words, it reduces the noise in the training set. It also helps to determine whether a new article is similar to past events, which in turn improves the performance of our prediction model. Our findings confirm that this method improves prediction accuracy. While developing this model, we found that different models identify different numbers of events. As a result, the number of predictions made on a single dataset varies from model to model, and the exact number is not known beforehand. Contrary to expectations, evaluating the performance of such classifiers is challenging: they have so far been treated like any other classifier, without taking into account that the number of predictions is unknown. To address this, we introduce the ‘Relative Information Superiority’ (RIS) metric, which is adaptable to scenarios with an unknown number of predictions and is particularly effective in our context.
RIS takes into account the information uncovered by the classifiers. This thesis demonstrates its benefits and applications and shows that it is superior to traditional metrics in specific cases. Finally, we identify a gap in the analysis and comparison of binary evaluation metrics for a given problem. The performance of a classifier is represented by a confusion matrix, which records the number of correct and incorrect predictions for each class. To determine whether one classifier outperforms another, one must assess whether the changes in the confusion matrix are worthwhile, yet no method exists for making this assessment. To address this, we propose the Worthiness Benchmark (γ), a novel concept that establishes the minimum change in the confusion matrix required for one classifier to be considered superior to another. We then conduct a γ-analysis of various binary classification evaluation metrics to reveal the specific worthiness benchmark each one adheres to when comparing models.
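To make the comparison problem concrete, the following minimal Python sketch builds confusion matrices for two hypothetical classifiers and scores them with two standard metrics. It does not implement RIS or the Worthiness Benchmark, whose definitions are not given in this abstract. The labels, the predictions, and the names `confusion_matrix`, `accuracy`, and `f1` are illustrative assumptions.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred):
    """Count TP, FN, FP, TN for binary labels (1 = positive, 0 = negative)."""
    counts = Counter(zip(y_true, y_pred))
    return {
        "TP": counts[(1, 1)],
        "FN": counts[(1, 0)],
        "FP": counts[(0, 1)],
        "TN": counts[(0, 0)],
    }


def accuracy(cm):
    """Fraction of all predictions that are correct."""
    total = cm["TP"] + cm["FN"] + cm["FP"] + cm["TN"]
    return (cm["TP"] + cm["TN"]) / total if total else 0.0


def f1(cm):
    """Harmonic mean of precision and recall for the positive class."""
    denom = 2 * cm["TP"] + cm["FP"] + cm["FN"]
    return 2 * cm["TP"] / denom if denom else 0.0


# Hypothetical data: 4 positive and 6 negative examples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# Classifier A is conservative: it misses two positives, with no false alarms.
pred_a = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# Classifier B finds every positive but raises three false alarms.
pred_b = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

cm_a = confusion_matrix(y_true, pred_a)
cm_b = confusion_matrix(y_true, pred_b)

for name, cm in (("A", cm_a), ("B", cm_b)):
    print(name, cm, f"accuracy={accuracy(cm):.3f}", f"F1={f1(cm):.3f}")
```

On this toy data, accuracy ranks classifier A higher while F1 ranks classifier B higher: the two metrics disagree on whether trading two false negatives for three false positives is worthwhile. This is the kind of implicit trade-off that the γ-analysis is intended to make explicit.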