I have been searching for good new papers and continued my annotated bibliography this week.
- Diale, M., Celik, T., & Wal, C. V. (2019). Unsupervised feature learning for spam email filtering. Elsevier, 89-104.
This study examines the data transformation stage that precedes classification, because a large number of features degrades both a classifier's performance and its processing time. As a solution, the authors propose a feature representation that preserves class separability in a lower-dimensional space for spam detection.
The proposed feature representation is robust, which helps classifiers such as Random Forest, the C4.5 decision tree and Support Vector Machines (SVM) identify spam. The system operates with a small feature size, generalizes well and is independent of the data source. The methodology uses the distributed memory and distributed bag-of-words models together with cosine similarity. In addition, an autoencoder, an unsupervised learning algorithm, compresses a large feature space into a small one. The experiments recorded significant improvements in the results with the C4.5 and SVM spam classifiers.
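To make the cosine similarity step concrete for my own notes, here is a minimal sketch. It is not the authors' pipeline: the paper compares learned document vectors (distributed memory and distributed bag of words), whereas this sketch assumes raw term counts as a stand-in, and the helper names `bow_vector` and `cosine_similarity` are my own.

```python
import math
from collections import Counter


def bow_vector(text):
    """Toy stand-in for a learned document vector: raw term counts (assumption)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Made-up example messages, not taken from any corpus in the paper.
spam_ref = bow_vector("win free money now")
msg_a = bow_vector("free money win big now")
msg_b = bow_vector("meeting agenda for monday")

sim_a = cosine_similarity(spam_ref, msg_a)  # shares four terms with spam_ref
sim_b = cosine_similarity(spam_ref, msg_b)  # shares no terms with spam_ref
```

A message that shares most of its vocabulary with the spam reference scores close to 1, while one with no shared terms scores 0. With learned vectors the idea is the same, but similarity would also reflect related words and not only identical ones.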
- Fenga, L., Wanga, Z., & Zuo, W. (2016). Quick online spam classification method based on active and incremental learning. ResearchGate, 17-27.
The study focuses on improving the speed of online spam classification while maintaining high filtering accuracy. The experiments use two corpora, Enron-spam and Trec2007. Support Vector Machine (SVM), K-Nearest Neighbors (KNN) and Naïve Bayes (NB) are used as the classifiers.
Firstly, the emails are grouped using term-frequency-based interest categories, which the study introduces. Secondly, the emails are analyzed and classified through active learning with the KNN, SVM and NB classifiers, based on the interest categories. In the final stage, the identified spam is evaluated and used for further learning. The study offers three solutions. First, an NB-based classifier reduces time consumption by using interest sets, which combine positive and negative frequency data. Second, a boundary-density-based evaluation function is used to label samples in the active learning method. Third, the study creates a sample-labeling model based on user-interest classifications.
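The active and incremental learning loop described above can be sketched roughly as follows. This is my own illustration under stated assumptions: plain uncertainty sampling (pick the message whose spam probability is closest to 0.5) replaces the paper's boundary-density evaluation function, and `TinyNB` and `most_uncertain` are hypothetical names, not taken from the paper.

```python
import math
from collections import Counter


class TinyNB:
    """Minimal multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()
        self.vocab = set()

    def update(self, text, label):
        # Incremental learning: fold one labeled message into the counts.
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def prob_spam(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Smoothed class prior plus smoothed per-word likelihoods.
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            for w in words:
                score += math.log((counts[w] + 1) / (total_words + len(self.vocab) + 1))
            log_scores[label] = score
        # Convert the two log scores into a normalized probability.
        top = max(log_scores.values())
        exp = {k: math.exp(v - top) for k, v in log_scores.items()}
        return exp["spam"] / (exp["spam"] + exp["ham"])


def most_uncertain(model, pool):
    """Uncertainty sampling: return the message the model is least sure about."""
    return min(pool, key=lambda text: abs(model.prob_spam(text) - 0.5))


# Made-up seed messages and unlabeled pool.
model = TinyNB()
model.update("win free money now", "spam")
model.update("project meeting on monday", "ham")

pool = ["free money offer", "monday meeting notes", "free meeting"]
query = most_uncertain(model, pool)  # the message a user would be asked to label
```

In a full loop, the user's label for `query` would be passed to `model.update`, and the selection would repeat on the remaining pool. Only the messages the classifier is unsure about need a human label, which is where the time saving comes from.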
- Huang, L., Jia, J., Ingram, E., & Peng, W. (2018). Enhancing the Naive Bayes Spam Filter through Intelligent Text Modification Detection. IEEE Xplore, 849-854.
The research focuses on detecting text modification with Naive Bayes for spam classification. The enhanced Naive Bayes algorithm classifies spam even when it contains diacritics or leetspeak. The novel algorithm improves spam filtering accuracy and reduces false positives. Furthermore, the study discovers a relationship between the spam score and the length of an email.
The researchers use an enhanced Naive Bayes algorithm in this project, which is able to convert symbols inside words to the letters they likely stand for. In addition, the Python algorithm they add combines semantic and keyword-based machine learning, and it performs fuzzy matching at its optimal threshold. Multinomial Naive Bayes is used for multi-class prediction. These advancements result in a 23.9% improvement in accuracy over the original SpamAssassin. Moreover, the authors identify an exponential regression relationship between the spam score and the length of an email. This confirms that Bayesian poisoning negatively influences the spam score.
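The symbol-to-letter conversion can be illustrated with a short sketch. The substitution table below is my own assumption and not the mapping used in the paper, and `normalize_token` is a hypothetical helper. It strips diacritics with Unicode decomposition and then maps common leetspeak characters back to letters before the token would reach a Naive Bayes classifier.

```python
import unicodedata

# Hypothetical substitution table; the paper's actual mapping is not given here.
LEET_MAP = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s",
})


def strip_diacritics(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_token(token):
    """Map a possibly obfuscated token back to plain lowercase letters."""
    plain = strip_diacritics(token).lower()
    # Leave tokens without any letter (e.g. years, prices) untouched.
    if not any(ch.isalpha() for ch in plain):
        return plain
    return plain.translate(LEET_MAP)


# Made-up obfuscated message for illustration.
message = "Fr3é m0n3y for V1@gr4 in 2019"
normalized = [normalize_token(t) for t in message.split()]
```

A fixed table like this is ambiguous (for example, "1" may stand for "i" or "l"), which is presumably why the paper pairs the conversion with fuzzy matching against known keywords instead of relying on a single substitution.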