Updating DistilBERT Stance Detection Model for Streaming Data
Jan 11, 2024
I would like to share some details about our stance detection project, conducted for CS533: Information Retrieval Systems, one of the courses I have taken this semester. The project was carried out together with my friends Ann Neşe Rende and Ekrem Polat. In this post, the most important parts of the project are explained. To view the code, navigate to the GitHub repository by clicking here.
Stance detection is the task of identifying the attitude (favor, against, or none) expressed in textual content, and it plays a critical role in unveiling the collective opinion of society. However, stance prediction has to cope with the constantly changing nature of the data. This phenomenon, known as concept drift, calls for adaptive techniques to maintain model performance over time. We address this issue by proposing an approach that overcomes concept drift and sustains the performance of stance detection models. The following datasets are used in the project:
- SemEval-2016: A stance detection dataset covering five target topics: Atheism, Climate Change, Feminist Movement, Hillary Clinton, and Legalization of Abortion.
- P-Stance: A dataset with train, validation, and test splits of opinions toward Bernie Sanders, Donald Trump, and Joe Biden.
- MDI: A political dataset from the 2020 US presidential election, containing tweets about Donald Trump and Joe Biden.
- XStance: A dataset of tweets in German, French, and Italian about various questions. All text in this dataset was translated to English with the T5 model before being used for training (a sketch of this step follows the list).
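The post does not show the translation step itself, so here is a minimal sketch of how it could look with Hugging Face transformers. The t5-base checkpoint and the task-prefix prompt are assumptions about the setup, not the project's exact code; plain T5 only ships with German and French to English translation prefixes.

```python
# Minimal sketch of translating XStance text to English with T5.
# Checkpoint and prompt format are assumptions, not the exact setup.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def translate(text: str, source_lang: str = "German") -> str:
    # T5 was pre-trained with task prefixes of this form.
    prompt = f"translate {source_lang} to English: {text}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    outputs = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(translate("Die Steuern sollten gesenkt werden."))
```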
Our first approach combines the pre-trained language model DistilBERT with a Long Short-Term Memory (LSTM) network to detect stance. DistilBERT-generated word embeddings are fed into a bidirectional LSTM with 128 hidden units, followed by a fully connected layer for classification. The final layer, consisting of three neurons, is tailored to our stance detection task, categorizing text as "favor", "against", or "neither". Since better accuracy on the test dataset is obtained with another approach, explained in the next paragraph, we proceed with that method instead.
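A minimal PyTorch sketch of this first architecture follows. The 128 hidden units and the three-neuron output come from the description above; the checkpoint, the pooling of the final hidden states, and the class order are assumptions.

```python
import torch
import torch.nn as nn
from transformers import DistilBertModel

class DistilBertBiLSTM(nn.Module):
    """DistilBERT embeddings -> bidirectional LSTM -> 3-way classifier."""

    def __init__(self, hidden_size: int = 128, num_classes: int = 3):
        super().__init__()
        self.bert = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.lstm = nn.LSTM(
            input_size=self.bert.config.dim,  # 768 for distilbert-base
            hidden_size=hidden_size,
            bidirectional=True,
            batch_first=True,
        )
        # Forward and backward final states are concatenated.
        self.classifier = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        embeddings = self.bert(
            input_ids, attention_mask=attention_mask
        ).last_hidden_state
        _, (h_n, _) = self.lstm(embeddings)
        # Concatenate the last forward and backward hidden states.
        features = torch.cat([h_n[-2], h_n[-1]], dim=-1)
        return self.classifier(features)  # logits: favor / against / neither
```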
In our second approach, an end-to-end solution is considered: we rely entirely on a fine-tuned DistilBERT to predict the stance of given inputs. Initially, DistilBERT is fine-tuned for the stance detection task using the P-Stance dataset. The accuracy, recall, precision, and F1 metrics of the fine-tuned DistilBERT model on the P-Stance dataset are displayed in the figure below.
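A minimal sketch of this fine-tuning step with the Hugging Face Trainer, assuming the P-Stance splits are stored locally as CSV files with text and label columns; the file paths and hyperparameters are illustrative assumptions, not the project's exact settings.

```python
from datasets import load_dataset
from transformers import (
    DistilBertForSequenceClassification,
    DistilBertTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Placeholder paths: P-Stance splits as CSVs with "text"/"label" columns.
ds = load_dataset("csv", data_files={"train": "pstance_train.csv",
                                     "validation": "pstance_val.csv"})

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3)  # favor / against / neither

ds = ds.map(lambda batch: tokenizer(batch["text"], truncation=True,
                                    padding="max_length"), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="pstance-distilbert",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=ds["train"],
    eval_dataset=ds["validation"],
)
trainer.train()
```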
Our fine-tuned DistilBERT stance detection model reaches an accuracy of 71.8% on the held-out test tweets. After fine-tuning the base model, we evaluate its performance on streaming data. An essential aspect of this approach is choosing how many data points to use when measuring the stance detection model's performance. Initially, this window is set to around 7,000 data points; the remaining data is then fed to the model incrementally in batches of 1,000 for performance measurement. Another critical decision is when to update the model, which requires an accuracy threshold; we set it at 50%. In other words, the stance detection model is updated whenever the accuracy on the next batch of streaming data falls below 50%. The figure below depicts the entire approach in detail.
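A sketch of the monitoring loop just described, with the 1,000-example batch size and 50% threshold from the text; the evaluate and update_model callables stand in for the project's actual evaluation and retraining routines and are hypothetical.

```python
from typing import Callable, List, Tuple

THRESHOLD = 0.5    # update when batch accuracy falls below 50%
BATCH_SIZE = 1000  # size of each incoming stream batch

def run_stream(model,
               stream: List[Tuple[str, int]],  # labeled (text, stance) pairs
               evaluate: Callable,             # (model, batch) -> accuracy
               update_model: Callable):        # (model, batch) -> new model
    """Feed labeled stream data in batches, retraining on concept drift."""
    for start in range(0, len(stream), BATCH_SIZE):
        batch = stream[start:start + BATCH_SIZE]
        accuracy = evaluate(model, batch)
        print(f"examples {start}-{start + len(batch)}: acc={accuracy:.3f}")
        if accuracy < THRESHOLD:
            # Concept drift detected: retrain on the recent stream data.
            model = update_model(model, batch)
    return model
```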
After fine-tuning DistilBERT for the stance detection task with the P-Stance dataset, the SemEval-2016 and MDI datasets are combined and treated as the first stream. Since this stream covers much broader topics than P-Stance, it is expected to decrease the accuracy of the model. After feeding around 7,000 data points, the accuracy of our model drops to 34.5%, as expected. Since this is below the threshold we set, the stance detection model is updated using the stream data: 70% of the current stream data is used for training, 20% as the validation set, and 10% as the test set. After the update, the model achieves 57% accuracy on the test set. This result validates our approach: updating a stance detection model that relies solely on fine-tuned DistilBERT whenever its accuracy drops below a threshold allows the model to adapt to evolving content and topics.
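One way to realize the 70/20/10 split used in the update step is a two-stage scikit-learn train_test_split, sketched below; the exact splitting code in the project may differ.

```python
from sklearn.model_selection import train_test_split

def split_stream(texts, labels, seed=42):
    """Split accumulated stream data 70/20/10 into train/val/test."""
    # Carve out 70% for training.
    X_train, X_rest, y_train, y_rest = train_test_split(
        texts, labels, train_size=0.7, random_state=seed)
    # Split the remaining 30% into validation (20% of the total)
    # and test (10% of the total), i.e. 2/3 and 1/3 of the remainder.
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, train_size=2 / 3, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```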
Afterwards, the XStance dataset is treated as streaming data. XStance is split into seven sets, each containing 1,000 data points, and each set is fed to our stance detection model in turn while the accuracies are recorded. Recall that the algorithm updates the model whenever accuracy drops below the 50% threshold; on the XStance streams, however, no update was needed, since accuracy never fell below it. Accuracy values for each stream are shown in the table below.
| Streams from XStance | Stream-1 | Stream-2 | Stream-3 | Stream-4 | Stream-5 | Stream-6 | Stream-7 |
|---|---|---|---|---|---|---|---|
| Accuracy | 0.531 | 0.531 | 0.537 | 0.533 | 0.525 | 0.502 | 0.526 |
Note that parameters such as the threshold and the number of data points over which accuracy is measured are important, because they change how frequently the system is updated. As the table above shows, the accuracy on the sixth stream is 50.2%, quite close to the predetermined threshold. If the measurement window were set to something other than 1,000 data points, the accuracy might drop below the threshold and a system update would be triggered. Similarly, if the threshold were set higher than 0.5, the system would be updated more frequently.
Please be aware that the explained approach assumes new data points can be labeled quickly, so that accuracy can be measured. As future work, instead of relying on that assumption, clustering algorithms could be used to assign each new data point to a similar group, which is highly likely to share the same label. This would eliminate the manual annotation process in real-life deployments.
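As a purely hypothetical sketch of that future direction (nothing here is implemented in the project), one could cluster DistilBERT embeddings of already-labeled data and pseudo-label each new tweet with the majority label of its nearest cluster:

```python
# Hypothetical future-work sketch: pseudo-labeling via clustering.
# Assumes embeddings are DistilBERT sentence vectors and labels are
# integers 0/1/2 for favor / against / neither.
import numpy as np
from sklearn.cluster import KMeans

def fit_label_clusters(embeddings: np.ndarray, labels: np.ndarray, k: int = 3):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)
    # Majority stance label within each cluster.
    cluster_labels = np.array([
        np.bincount(labels[kmeans.labels_ == c]).argmax() for c in range(k)
    ])
    return kmeans, cluster_labels

def pseudo_label(kmeans, cluster_labels, new_embeddings: np.ndarray):
    # Assign each new embedding the label of its nearest cluster.
    return cluster_labels[kmeans.predict(new_embeddings)]
```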
Overall, as new data is constantly created in our daily lives, existing models become less accurate with the passage of time. We focused on developing a model that can evolve with new data to maintain a certain level of accuracy. Therefore, when enough new data has accumulated and we observe a substantial decrease in the accuracy of our model, it is updated with the latest data. With this approach, we obtain a model that tries to keep its accuracy above 50%.