Week 7 – Annotated Bibliography Continue 4

I have found all twelve research papers and completed the summaries for them. The details are shown below.

  • Kaur, H., & Sharma, A. (2016). Improved Email Spam Classification Method Using Integrated Particle Swarm Optimization and Decision Tree. IEEE Xplore, 516-521.

The proposed hybrid technique integrates particle swarm optimization (PSO) with a decision tree algorithm. It is an unsupervised machine learning methodology that classifies emails as spam or non-spam using the Spambase email dataset. The J48 algorithm, K-means clustering and a support vector machine are also used in the proposed system to evaluate the final output.

The study obtained the Spambase dataset from the UCI Machine Learning Repository. The dataset has an adequate number of instances, with attributes in discrete format, which improves the accuracy of the study. An advanced particle swarm optimization algorithm and J48 were used, with and without supervised learning, to calculate the final result on this dataset and identify spam. In addition, unsupervised filtering was used as a pre-processing step while classifying emails to achieve higher accuracy. According to the authors, the proposed method classifies emails into spam and non-spam with an accuracy of 98.32%.

 

  • Singh, M., Pamula, R., & Shekhar, S. K. (2018). Email Spam Classification by Support Vector Machine. IEEE Xplore, 878-882.

This paper proposes a system to evaluate the performance of non-linear SVM-based machine learning spam classifiers using two separate kernel functions: the Gaussian kernel and the linear kernel. The ‘SpamAssassin Public Corpus’ dataset was used for training. The dataset is analyzed using both kernels to determine which has higher accuracy in the training and testing steps. As the final stage, the researchers used a Gmail inbox and its spam collector to classify spam using the proposed system.

In this study, two separate datasets, spamTrain.mat and spamTest.mat, were used in processing to increase the accuracy. The processing is carried out using predefined processing functions. After training, ‘Real Time Spam Prediction’ is carried out to identify spam using binary classification. The proposed system can store an email sample in ‘.txt’ file format, which makes it easy to deploy on an email service provider's website. This facilitates classifying incoming spam email sent to a specific email ID. According to the final findings, training takes longer with the Gaussian kernel than with the linear kernel, but both kernels reach the same accuracy level. However, according to the authors, the Gaussian kernel is the more advanced and better-fitting kernel for the proposed project because the dataset used is large.
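
To make the difference between the two kernels in this paper concrete, here is a minimal sketch of the standard linear and Gaussian (RBF) kernel formulas in plain Python. The function names and the `sigma` parameter are my own assumptions for illustration and are not taken from the paper.

```python
import math


def linear_kernel(x, z):
    """Linear kernel: the plain dot product of two feature vectors."""
    return sum(a * b for a, b in zip(x, z))


def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||x - z||^2 / (2 * sigma^2)).

    sigma is a hypothetical bandwidth setting chosen for this sketch.
    """
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


# Two toy word-presence vectors (1 = word appears in the email).
email_a = [1, 0, 1]
email_b = [1, 1, 0]
print(linear_kernel(email_a, email_b))
print(gaussian_kernel(email_a, email_b, sigma=1.0))
```

The Gaussian kernel needs an exponential and a distance calculation for every pair of emails, which fits with the longer training time reported for it in the paper.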

 

  • Swetha, S. M., & Sarraf, G. (2019). Spam Email and Malware Elimination employing various Classification Techniques. IEEE Xplore, 140-145.

The research focuses on a solution for eliminating spam using a supervised machine learning classification method based on binary signature analysis. The authors compared ten different machine learning classification algorithms, including Support Vector Machine, Decision Tree, k-Nearest Neighbors and Naïve Bayes. The algorithms are trained using pre-labeled data, and the accuracy of each classifier is computed on a set of novel data.

In this research, all the algorithms are trained with pre-labelled datasets. The first is a partially processed text dataset containing 19,000 ham and 26,000 spam samples. The second is a malicious file dataset of about 11,000 MB, consisting of 16,000 malicious and 9,000 legitimate files. The novel dataset under analysis is examined by all ten algorithms separately using 32 pre-identified parameters. Multiple executions were carried out with every classifier to obtain higher accuracy, which resulted in varying levels of success for different classifiers. Higher accuracy was achieved by SVM for both text and file classification. According to the authors, the accuracy achieved in this study is about 99%.

Week 5 – Annotated Bibliography Continue 3

I have been searching for more good papers and continued with my annotated bibliography this week.

  • Diale, M., Celik, T., & Wal, C. V. (2019). Unsupervised feature learning for spam email filtering. Elsevier, 89-104.

This study examines the data transformation stage that precedes data classification, because a higher number of features in a classifier has a negative effect on the performance and processing time of a project. As a solution, it proposes a feature representation that preserves class separability in a lower-dimensional space for detecting spam.

The proposed feature representation is robust, which helps classifiers such as Random Forest, the C4.5 decision tree and Support Vector Machines to identify spam. The proposed system is able to operate with a small feature size and good generalization of features, and the method is independent of the data source. The proposed methodology uses distributed memory and distributed bag of words along with cosine similarity techniques to achieve this objective. In addition, an autoencoder, an unsupervised learning algorithm, is used to compress a large feature space into a small feature space. The experiment recorded significant improvements in the results when using the C4.5 and SVM spam classifiers.
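
Since the paper relies on cosine similarity between document vectors, here is a small sketch of how that measure is calculated. This is only the textbook formula written in plain Python; the function name and the toy vectors are my own assumptions and do not come from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors.

    Returns 0.0 when either vector is all zeros, to avoid dividing by zero.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy document vectors (hypothetical values for illustration only).
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]
doc_c = [0.0, 0.0, 3.0]
print(cosine_similarity(doc_a, doc_b))  # same direction, so close to 1
print(cosine_similarity(doc_a, doc_c))  # no shared features, so 0
```

Because the measure depends only on the direction of the vectors and not their length, a short email and a long email with the same word proportions are treated as similar.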

  • Fenga, L., Wanga, Z., & Zuo, W. (2016). Quick online spam classification method based on active and incremental learning. ResearchGate, 17-27.

The proposed study focuses on improving online spam classification speed while keeping filtering accuracy high. The experiment used two corpora, Enron-spam and Trec2007. Support Vector Machine (SVM), K-Nearest Neighbors (KNN) and Naïve Bayes (NB) were used as the classifiers.

First, the emails are grouped using term-frequency-based interest categories, which are introduced in the study. Second, the emails are analyzed and classified using active learning theories (KNN, SVM and NB) based on the interest categories. In the final stage, the identified spam emails are evaluated and used for learning purposes. The study presents three research solutions. First, an NB-based classifier is used to reduce time consumption using the interest sets, which are a combination of positive and negative frequency data. Second, a boundary-density-based evaluation function is used to label samples in the active learning methodology. Last, the study creates a sample labeling model based on user-interest classifications.

  • Huang, L., Jia, J., Ingram, E., & Peng, W. (2018). Enhancing the Naive Bayes Spam Filter through Intelligent Text Modification Detection. IEEE Xplore, 849-854.

The research focuses on detecting text modification in Naive Bayes spam classification. The advanced Naive Bayes algorithm classifies spam even when it contains diacritics or leetspeak. This novel algorithm enhances spam filtering accuracy and reduces false positives. Furthermore, the study discovered a relationship between the spam score and the length of the email.

The researchers used an advanced Naive Bayes algorithm in this project, which is able to convert symbols inside words to possible letters. In addition, the Python algorithm that was added consists of a semantic and keyword-based machine learning algorithm, which achieves fuzzy matching at its optimal threshold. Multinomial Naive Bayes optimization was used for the multi-class prediction feature. These advancements resulted in an accuracy improvement of 23.9% over the original SpamAssassin. Moreover, the authors identified an exponential regression relationship between the spam score and the length of an email. This confirms that Bayesian poisoning negatively influences the spam score.
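
To picture what converting symbols inside words to possible letters could look like, here is a rough sketch that strips diacritics and undoes a few common leetspeak substitutions before a word is passed to a spam filter. This is not the authors' algorithm; the `LEET_MAP` table and the `normalize_token` function are hypothetical and only cover a handful of characters.

```python
import unicodedata

# Hypothetical substitution table; real leetspeak has many more variants.
LEET_MAP = str.maketrans({
    "0": "o",
    "1": "i",
    "3": "e",
    "4": "a",
    "5": "s",
    "7": "t",
    "@": "a",
    "$": "s",
})


def normalize_token(token):
    """Remove diacritics, lowercase, then map leetspeak symbols to letters."""
    # NFKD splits an accented letter into the base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", token)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.lower().translate(LEET_MAP)


for word in ["Fr33", "m0n3y", "c@$h", "Frée"]:
    print(word, "->", normalize_token(word))
```

A simple table like this would also rewrite genuine numbers (for example a price), so a real filter would need extra rules to decide when a digit is really a disguised letter.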

Week 5 – Annotated Bibliography Continue 2

This week I am supposed to continue working on my annotated bibliography and search for good journal articles from the CSU database and Primo Search.

The method of finding and filtering articles was discussed during the class, and I found some good research articles for my research.

The following are the papers I found, along with the annotated bibliography I have written this week.

1. Ahsan, M. I., Nahian, T., Kafi, A. A., Hossain, M. I., & Shah, F. M. (2016). An Ensemble approach to detect Review Spam using hybrid Machine Learning Technique. IEEE Xplore, 388-394.

The proposed approach introduces an ensemble hybrid machine learning approach to detect fake online reviews. It has two different learning methods, supervised and active. This generates a hybrid dataset which can be used for both pseudo reviews and real-life applications. The proposed methodology is called the Hybrid approach to Detect Review Spam (HDRS).

 

2. Alamlahi, Y., & Muthana, A. (2018). An Email Modelling Approach for Neural Network Spam Filtering to Improve Score-based Anti-spam Systems. Modern Education and Computer Science Press.

This study proposes a system for presenting emails to an Artificial Neural Network (ANN) which can classify spam and ham. The model is based on 13 fixed features associated with spam, combined with pre-selected text features.

3. Alurkar, A. A., Ranade, S. B., Joshi, S. V., Ranade, S. S., Sonewar, P. A., Mahalle, P. N., & Deshpande, A. V. (2017). A Proposed Data Science Approach for Email Spam Classification using Machine Learning Techniques. IEEE Xplore.

The paper presents a pretrained machine learning technique that identifies patterns using repetitive keywords which are pre-classified as spam. Moreover, the study focuses on classifying incoming emails based on other features of an email such as the header, Cc/Bcc and domain. The main objective of the project is to block senders who are identified through this mechanism as likely to spam. This project is able to recognize spam emails more efficiently than specifying them manually.

 

4. Aski, A. S., & Sourati, N. K. (2016). Proposed efficient algorithm to filter spam using machine learning techniques. Elsevier, 145-149.

The study focuses on three machine learning algorithms which can be used to filter spam from legitimate emails with a lower error percentage and a higher efficiency rate. It uses a multilayer perceptron (MLP) model together with a C4.5 decision tree classifier and a Naïve Bayes classifier. These techniques are used to train on data in order to classify emails into spam or ham.

 

Week 4 – Annotated Bibliography

This week I am supposed to work on my annotated bibliography.

An annotated bibliography is a list of references to books, articles and research papers. Each citation is followed by a paragraph of around 200-250 words. The purpose of the annotated bibliography is to let the reader know what is included in the sources.

I have continued my research and found some more research papers for my project.

  • Aski, A. S., & Sourati, N. K. (2016). Proposed efficient algorithm to filter spam using machine learning techniques. Elsevier, 145-149.
  • Diale, M., Celik, T., & Wal, C. V. (2019). Unsupervised feature learning for spam email filtering. Elsevier, 89-104.
  • Santoso, B. (2019). An Analysis of Spam Email Detection Performance Assessment Using Machine Learning. JOIN (Jurnal Online Informatika), 53-54.
  • jarif, N. N., Azmi, N. F., Chuprat, S., Sarkan, H. M., Yahya, Y., & Sam, S. M. (2019). SMS Spam Message Detection using Term Frequency-Inverse Document Frequency and Random Forest Algorithm. ScienceDirect, 510-512.