Data and Methods
We scraped written complaints addressed to the municipality and its enterprises via SikayetVar - an online platform for raising complaints to public organisations and private companies. The complaints have been classified as IGDAS (enterprise responsible for natural gas distribution and billing), IETT (enterprise responsible for public transportation, mainly buses), ISKI (enterprise responsible for water distribution and billing) or other enterprises with respect to their contents. To represent the texts that do not pertain to the municipal scope of work (with label irrelevant), we scraped tweets sent to the municipality but which do not need to be addressed by the enterprises. Before preprocessing and train-test split, we made sure that number of sentences from each class was roughly equal.
Preprocessing of texts involved removal of punctuation signs and URLs, detection and indication of numbers, monetary symbols and e-mail addresses. We introduced extra steps to tweet preprocessing as the algorithms could associate Twitter usernames, RT (stands for retweet) with the irrelevant class even though these phrases are not related to the content of those tweets. We obtained the roots of all words via stemmer and converted the texts to vector format via bag-of-words approach. We also removed extremely short sentences, so that short tweets with no meaning would not be considered within the dataset.
While splitting the data, we set aside 20% of the data for the test set. Within the remaining data, 80% and 20% were allocated to train and validation sets respectively. We chose 4 algoritms to work with: Logistic Regression, Stochastic Gradient Descent, Random Forest, Multinomial Naive Bayes. Algorithms were trained on the train set and evaluated by their performance on validation sets. f1-score was the specific criteria for performance evaluation.