ITC571 Emerging Technologies & Innovations

Week 10 – Final Report and Presentation

nathalidj on May 22, 2020 (updated May 27, 2020)

This week I am writing the final report. I am also due to give my presentation to the class this week.

The final report has the following structure.

Title page
Table of Contents
Abstract (From Journal Synopsis)
Introduction to research domain
Literature review
Analysis and findings (in tabular, graph, matrix formats)
Conclusion
References
Appendices

This marks the end of the research project that I have carried out over ten weeks.

Week 9 – Final Report and Presentation

nathalidj on May 13, 2020 (updated May 27, 2020)

I am working on my final presentation this week. The presentation is due next week.

The presentation structure is as follows.

Title slide, which should have the research title, student name, ID and blog address
Introduction, research domain and justification
Article reviews from the annotated bibliography
Results and findings after analyzing the 12 articles reviewed

Week 8 – Annotated Bibliography and Journal Synopsis

nathalidj on May 3, 2020 (updated May 27, 2020)

I have been working on my journal synopsis this week and have finalised my annotated bibliography.

Journal Synopsis

This research paper presents a ‘Review on Machine Learning Spam Classification Algorithms’. The main objective of this project is to identify the different types of Machine Learning (ML) algorithms that are used in spam detection, specifically in the email spam classification field. A comprehensive analysis has been carried out using twelve different research articles. This study has identified different types of ML algorithms such as supervised, unsupervised, semi-supervised and hybrid algorithms. Researchers have used different approaches to reach their intended outcomes, such as achieving higher accuracy and precision, facilitating new features, or reducing the resource and time consumption of their projects. Spam detection and classification is a trending and important area to focus on nowadays because everyone uses email and the frauds carried out using spam are increasing every day. Having a proper understanding of the mechanisms that can be used in this field is the foundation for creating a novel solution to this critical problem.

Week 7 – Annotated Bibliography Continue 4

nathalidj on April 27, 2020 (updated May 27, 2020)

I have found all twelve research papers and have completed the summaries for them. The details are shown below.

Kaur, H., & Sharma, A. (2016). Improved Email Spam Classification Method Using Integrated Particle Swarm Optimization and Decision Tree. IEEE Xplore, 516-521.

The proposed hybrid technique integrates particle swarm optimization (PSO) with a Decision Tree algorithm. This unsupervised machine learning methodology classifies spam and non-spam mails from the Spambase email dataset. The J48 algorithm, K-means clustering and a support vector machine have also been used in the proposed system to evaluate the final output.

The study obtained the Spambase dataset from the UCI Machine Learning Repository. It has an adequate number of instances and attributes in a discrete format, which improves the accuracy of the study. An advanced particle swarm optimization algorithm and J48 have been used with and without supervised learning to calculate the final result for the dataset and identify spam. Apart from that, unsupervised filtering has also been used as a pre-processing step while classifying email to achieve higher accuracy. According to the authors, the proposed project is able to classify email into spam and non-spam with an accuracy of 98.32%.

Singh, M., Pamula, R., & Shekhar, S. K. (2018). Email Spam Classification by Support Vector Machine. IEEE Xplore, 878-882.

This paper has proposed a system to evaluate the performance of non-linear SVM-based machine learning spam classifiers using two separate kernel functions: the Gaussian kernel and the linear kernel. It has used the ‘SpamAssassin – Public Corpus Dataset’ for training. The data set is analyzed using both kernels to determine which has higher accuracy in the testing and training steps. As the final stage, the researchers have used a Gmail inbox and its spam collector to classify spam using the proposed system.

In this study, two separate data sets, spamTrain.mat and spamTest.mat, have been used in processing to increase accuracy. The processing is carried out using predefined processing functions. After the training, ‘Real Time Spam Prediction’ is carried out for spam identification using binary classification. The proposed system can store an email sample in ‘.txt’ file format, which can easily be deployed to an email service provider’s website. This facilitates classifying incoming spam email addressed to a specific email ID. According to the final findings of this project, training time is higher for the Gaussian kernel than for the linear kernel, but both kernels reach the same accuracy level. However, according to the authors, the Gaussian kernel is the more advanced and better fitting kernel for the proposed project because the dataset used is large.
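As a minimal sketch of the two kernel functions compared in this paper (not the authors' code), both can be written in plain Python. The example vectors and the sigma value are assumptions chosen only for illustration.

```python
import math


def linear_kernel(x, y):
    """Linear kernel: the plain dot product of two feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||x - y||^2 / (2 * sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2 * sigma ** 2))


# Hypothetical word-count vectors for two emails (illustrative values only).
email_a = [1.0, 0.0, 2.0]
email_b = [1.0, 1.0, 0.0]

print(linear_kernel(email_a, email_b))               # 1.0
print(gaussian_kernel(email_a, email_b, sigma=2.0))  # about 0.535
```

The Gaussian kernel needs an exponential per pair of samples, while the linear kernel needs only a dot product. That is consistent with the longer training time the authors report for the Gaussian kernel.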

Swetha, S. M., & Sarraf, G. (2019). Spam Email and Malware Elimination employing various Classification Techniques. IEEE Xplore, 140-145.

The research focuses on a solution to eliminate spam using supervised machine learning classification with binary signature analysis. The authors have compared ten different machine learning classification algorithms, including Support Vector Machine, Decision Tree, k-Nearest Neighbors and Naïve Bayes. The algorithms are trained using pre-labeled data. The accuracy of each classifier is computed on a set of novel data.

In this research, all the algorithms are trained with pre-labelled data sets. The first is a partially processed text dataset containing 19,000 ham and 26,000 spam samples. The second is a malicious file dataset of about 11,000 MB, consisting of 16,000 malicious and 9,000 legitimate files. The novel data set is examined by all ten algorithms separately using 32 pre-identified parameters. Multiple executions have been carried out with every classifier to obtain higher accuracy, which has resulted in varying levels of success for different classifiers. Higher accuracy has been achieved for both text and file classification by SVM. The accuracy achieved in this study is about 99% according to the authors.
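Accuracy figures like the one above are computed by comparing predicted labels with the true labels on unseen data. The following is a small Python sketch of that calculation. The label lists are made-up values for illustration and are not taken from the paper.

```python
def confusion_counts(actual, predicted, positive="spam"):
    """Count true/false positives and negatives for a binary spam task."""
    tp = tn = fp = fn = 0
    for truth, guess in zip(actual, predicted):
        if guess == positive and truth == positive:
            tp += 1
        elif guess == positive:
            fp += 1  # ham wrongly flagged as spam
        elif truth == positive:
            fn += 1  # spam that slipped through
        else:
            tn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Share of all emails that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """Share of emails flagged as spam that really were spam."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical labels for six emails (illustrative only).
actual = ["spam", "spam", "ham", "ham", "spam", "ham"]
predicted = ["spam", "ham", "ham", "spam", "spam", "ham"]

tp, tn, fp, fn = confusion_counts(actual, predicted)
print(accuracy(tp, tn, fp, fn), precision(tp, fp))
```

Accuracy alone can hide a high false positive rate, which is why several of the reviewed papers also report precision.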

Week 5 – Annotated Bibliography Continue 3

nathalidj on April 21, 2020 (updated May 27, 2020)

I have been searching for more good papers and have continued with my annotated bibliography this week.

Diale, M., Celik, T., & Wal, C. V. (2019). Unsupervised feature learning for spam email filtering. Elsevier, 89-104.

This study examines the data transformation stage that precedes data classification, because a higher number of features in a classifier has a negative effect on the performance and processing time of a project. As a solution, the authors propose a feature representation for spam detection that preserves class separability in a lower-dimensional space.

The proposed feature representation is robust, which helps classifiers such as Random Forest, the C4.5 decision tree and Support Vector Machines to identify spam. The proposed system is able to operate with a small feature size with good generalization of features, and the method is independent of the data source. For the proposed methodology, distributed memory and distributed bag of words, along with cosine similarity techniques, have been used to achieve the objective. Apart from that, an autoencoder (an unsupervised learning algorithm) has been used to compress a large feature space into a small feature space. The experiment recorded significant improvements in the results when using the C4.5 and SVM spam classifiers.
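Cosine similarity, mentioned above, measures how closely two feature vectors point in the same direction. Below is a minimal Python sketch of the formula. The vectors are made-up examples, and the sketch does not reproduce the paper's document embedding pipeline.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical document vectors (illustrative only).
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]  # same direction as doc_a
doc_c = [0.0, 0.0, 3.0]  # orthogonal to doc_a

print(cosine_similarity(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_c))
```

Because the measure ignores vector length, a short and a long email with the same mix of features score as similar.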

Fenga, L., Wanga, Z., & Zuo, W. (2016). Quick online spam classification method based on active and incremental learning. ResearchGate, 17-27.

The proposed study is focused on improving online spam classification speed while keeping a high filtering accuracy. The experiment has used two corpora, Enron-spam and Trec2007. Support Vector Machine (SVM), K-Nearest Neighbors (KNN) and Naïve Bayes (NB) have been used as the classifiers.

Firstly, the emails are grouped using term-frequency-based interest categories, which are introduced in the study. Secondly, the emails are analyzed and classified using active learning (with KNN, SVM and NB) based on the interest categories. At the final stage, the identified spam emails are evaluated and used for learning purposes. The study presents three research solutions. First, an NB-based classifier is used to reduce time consumption using the interest sets, which are a combination of positive and negative frequency data. Second, a boundary-density-based evaluation function has been used to label samples in the active learning methodology. Last, the study has created a sample labeling model based on user-interest classifications.

Huang, L., Jia, J., Ingram, E., & Peng, W. (2018). Enhancing the Naive Bayes Spam Filter through Intelligent Text Modification Detection. IEEE Xplore, 849-854.

The research focuses on detecting text modification in spam using Naive Bayes classification. This advanced Naive Bayes algorithm classifies spam even when it contains diacritics or leetspeak. The novel algorithm enhances spam filtering accuracy and reduces false positives. Furthermore, the study has discovered a relationship between spam score and the length of the email.

The researchers have used an advanced Naive Bayes algorithm in this project, which is able to convert symbols inside words to possible letters. Apart from that, the added Python algorithm consists of a semantic and keyword-based machine learning algorithm, which has tuned fuzzy matching to its optimal threshold. Multinomial Naive Bayes optimization has been used for the multi-class prediction feature. These advancements have resulted in an improvement in accuracy of 23.9% over the original SpamAssassin. Moreover, the authors have identified an exponential regression relationship between spam score and the length of an email. According to the study, this confirms that Bayesian poisoning negatively influences the spam score.
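The idea of undoing diacritics and leetspeak before Naive Bayes scoring can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the substitution table, the training samples and the function names are hypothetical, and the paper's actual algorithm (including its fuzzy matching) is not reproduced here.

```python
import math
import unicodedata
from collections import Counter

# Hypothetical substitution table; the paper's real mapping is not given here.
LEET_MAP = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                          "5": "s", "7": "t", "@": "a", "$": "s"})


def normalize(text):
    """Strip diacritics and map common leetspeak symbols back to letters."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    without_marks = "".join(ch for ch in decomposed
                            if not unicodedata.combining(ch))
    return without_marks.translate(LEET_MAP)


def train(samples):
    """Count words per class from (text, label) pairs."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for text, label in samples:
        doc_counts[label] += 1
        word_counts[label].update(normalize(text).split())
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    """Pick the label with the higher log posterior (Laplace smoothing)."""
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)
        for word in normalize(text).split():
            score += math.log((counts[word] + 1)
                              / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up training data (illustrative only).
samples = [
    ("free viagra offer", "spam"),
    ("win free money now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project report attached", "ham"),
]
word_counts, doc_counts = train(samples)

print(normalize("Fr3e V1@grá"))                          # free viagra
print(classify("Fr3e V1@grá", word_counts, doc_counts))  # spam
```

One limitation of this simple table is that it also rewrites genuine digits (a year such as 2020 would be altered), so a real filter would need a more careful rule for when to apply the substitution.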

Week 5 – Annotated Bibliography Continue 2

nathalidj on April 17, 2020 (updated May 27, 2020)

This week I am supposed to continue working on my annotated bibliography and search for good journal articles in the CSU database and Primo Search.

The method of finding and filtering articles was discussed during the class, and I found some good research articles for my research.

The following are the papers I found, together with the annotated bibliography entries I have written this week.

1. Ahsan, M. I., Nahian, T., Kafi, A. A., Hossain, M. I., & Shah, F. M. (2016). An Ensemble approach to detect Review Spam using hybrid Machine Learning Technique. IEEE Xplore, 388-394.

The proposed approach introduces an ensemble hybrid machine learning approach to detect fake online reviews. It has two different learning methods, supervised and active. It generates a hybrid dataset which can be used for both pseudo reviews and real-life applications. The proposed methodology is called the Hybrid approach to Detect Review Spam (HDRS).

2. Alamlahi, Y., & Muthana, A. (2018). An Email Modelling Approach for Neural Network Spam Filtering to Improve Score-based Anti-spam Systems. Modern Education and Computer Science Press.

This study proposes a system for presenting email to an Artificial Neural Network (ANN) which can classify spam and ham. The model is based on 13 fixed features associated with spam, combined with pre-selected text features.

3. Alurkar, A. A., Ranade, S. B., Joshi, S. V., Ranade, S. S., Sonewar, P. A., Mahalle, P. N., & Deshpande, A. V. (2017). A Proposed Data Science Approach for Email Spam Classification using Machine Learning Techniques. IEEE Xplore.

The text provides a pretrained machine learning technique to identify patterns using repetitive keywords which are pre-classified as spam. Moreover, this study has focused on the classification of incoming emails based on other features of an email such as the header, Cc/Bcc and domain. The main objective of the project is to block senders identified through this mechanism who are likely to send spam. The project is able to recognize spam emails more efficiently than specifying them manually.

4. Aski, A. S., & Sourati, N. K. (2016). Proposed efficient algorithm to filter spam using machine learning techniques. Elsevier, 145-149.

The study focuses on three machine learning algorithms which can be used to filter spam from legitimate emails with a lower error percentage and a higher efficiency rate. The three techniques used in the proposed project are a multilayer perceptron (MLP) model, a C4.5 decision tree classifier and a Naïve Bayes classifier. These techniques are trained on data to classify email into spam or ham.

Week 4 – Annotated Bibliography

nathalidj on April 9, 2020 (updated May 27, 2020)

This week I am supposed to work on my annotated bibliography.

An annotated bibliography is a list of references to books, articles and research papers. Each citation is followed by a paragraph of around 200-250 words. The purpose of the annotated bibliography is to let the reader know what is included in the sources.

I have continued my research and found some more research papers for my project.

Aski, A. S., & Sourati, N. K. (2016). Proposed efficient algorithm to filter spam using machine learning techniques. Elsevier, 145-149.
Diale, M., Celik, T., & Wal, C. V. (2019). Unsupervised feature learning for spam email filtering. Elsevier, 89-104.
Santoso, B. (2019). An Analysis of Spam Email Detection Performance Assessment Using Machine Learning. JOIN (Jurnal Online Informatika), 53-54.
jarif, N. N., Azmi, N. F., Chuprat, S., Sarkan, H. M., Yahya, Y., & Sam, S. M. (2019). SMS Spam Message Detection using Term Frequency-Inverse Document Frequency and Random Forest Algorithm. ScienceDirect, 510-512.

Week 3 – Project Plan

nathalidj on March 20, 2020 (updated May 27, 2020)

In week 3, we were supposed to prepare the project proposal and the plan.

1. Research Questions

To generate the project report with detailed information about the algorithms used in big data analytics, I need to carry out comprehensive research on the following areas.

What are the methodologies and algorithmic analyses used in ML to organise data?

What are the Machine Learning algorithms used to analyse data and generate information?

The research will mainly focus on the commonly used algorithms in the field. For these algorithms, I will analyse in detail the requirements for using the selected algorithm and the specific outcomes it can generate. Some of the algorithms to be researched are listed below.

K-Means Clustering Algorithm
Linear Regression Algorithms
Association Rule Mining Algorithm
Regression Algorithms
Support Vector Machine (SVM)

2. Conceptual or theoretical framework

In the research, I will be using a ‘qualitative theoretical framework’ as the research framework. This approach will help me gain a wide understanding of the area and learn from previously conducted research projects. The research projects used in this project are by experts who have carried out comprehensive research in the particular area, which will help me understand, learn and research my selected topic in depth.

3. Methodology

3.1 Research data collection and analysis

In this research project, a detailed analysis will be carried out on 12-15 previously published research papers and journal articles from different authors. These research projects should contain adequate information on the area and be relevant to my research, since my research project is mainly a theoretical framework project that depends on these selected projects. The selected resources will be analysed and a critique documented, with an informative research paper on algorithms as the final outcome of the project.

3.2 Sample research papers

The following are two of the research papers that I intend to use in my project.

Bhargavi, P. (2018). Machine Learning Algorithms in Big data Analytics. ResearchGate.

Bruce, S., Li, Z., Yang, H.-C., & Mukhopadhyay, S. (2019). Nonparametric Distributed Learning Architecture for Big Data: Algorithm and Applications. IEEE Xplore.

3.3 Research Method

The planned research project is ‘a qualitative framework project’ which consists of the analysis of other research papers. In my research I will not be introducing new algorithms or frameworks, but will analyse existing algorithms and construct a comprehensive report on them. Therefore, the research method used will be the ‘report analysis research method’.

For the research project, a project plan was created with the tasks that have to be completed in order to successfully complete the project.

The tasks and the estimated dates are shown below.

Week 2 – Weekly Progress

nathalidj on March 16, 2020 (updated May 26, 2020)

During week 2 of my research, I have been writing the abstract for my selected topic. Apart from that, I have been producing a report assignment on the basic outline of my research project.

While writing my abstract and researching existing published journals on my selected topic, I found two more suitable journals which can be used for my project.

During the class I was able to finalise the topic after discussing the project further with my supervisor.

Overall, this week was productive: I gained a better understanding of the depth of the subject and successfully completed my first assignment on the project.

1. Alamlahi, Y., & Muthana, A. (2018). An Email Modelling Approach for Neural Network Spam Filtering to Improve Score-based Anti-spam Systems. Modern Education and Computer Science Press.

2. Ahsan, M. I., Nahian, T., Kafi, A. A., Hossain, M. I., & Shah, F. M. (2016). An Ensemble approach to detect Review Spam using hybrid Machine Learning Technique. IEEE Xplore, 388-394.

3. Alurkar, A. A., Ranade, S. B., Joshi, S. V., Ranade, S. S., Sonewar, P. A., Mahalle, P. N., & Deshpande, A. V. (2017). A Proposed Data Science Approach for Email Spam Classification using Machine Learning Techniques. IEEE Xplore.

Week 1 – Topic and Blog

nathalidj on March 10, 2020 (updated May 27, 2020)

Emerging technologies and Innovation

About me:

I’m Nathali Jayasinghe, and I am currently studying for my Master’s degree in Information Technology at Charles Sturt University. For the subject Emerging Technologies and Innovations, I have chosen one of the innovative and fundamental topics in the Big Data Analytics field, which is ‘Statistical Algorithms’. In my study, I will focus deeply on the Statistical Algorithms used in Big Data Analytics. I have selected this topic because all big data analytics is carried out using different kinds of Statistical Algorithms. I have always wanted to explore the big data analytics field, so it is essential to have a proper understanding of its fundamentals. This comprehensive study will help me achieve that.

Project Title: Machine Learning Algorithms in Spam Detection

I will be blogging the weekly updates on my study using the below blog site.

https://thinkspace.csu.edu.au/nathali

Project Problem Domain: In the Information Management field, one of the most critical areas is ‘how to deal with the massive and rapidly increasing amount of metadata in the current world, since everything is becoming digitalized’. Here, it is important to look at each phase that the metadata undergoes: the techniques used to gather data, the data storing phase, the phase of securing the collected data and, possibly the most critical, the phase of interpreting the data. Applying the most effective data analysis techniques and algorithms is the key to extracting massive amounts of meaningful information from the collected big data. Organizations that are able to do so gain a huge competitive advantage over others in the industry.

Background/Context/Description:

In the current environment, organizations are focusing on methods that can turn collected data into meaningful and valuable information. This derived information should be valid, accurate, usable and relevant to the organization. Statistical Algorithms come into the picture in this context. Although organizations have been collecting and analyzing their data for a long time, having specific algorithms to analyze big data is relatively new. Here, different statistical models are created with effective algorithms to sort the data, classify it into different categories, and process it into valuable information.

Project Aim/Objectives:

In my comprehensive study, the main objective is to dive into the different algorithms and techniques used in big data analytics and to understand their characteristics, similarities and differences, their effectiveness, and the environments in which each algorithm is suitable.

Scope:

There are different types and methods of Statistical Algorithms which cater to specific environments and requirements. I will focus mainly on the most widely used and effective algorithms in the field.

Deliverables/Outcomes:

In this project, the findings and deliverables will be presented as a paper after successful completion.

Resources:

Bhargavi, P. (2018). Machine Learning Algorithms in Big data Analytics. ResearchGate.

Bruce, S., Li, Z., Yang, H.-C., & Mukhopadhyay, S. (2019). Nonparametric Distributed Learning Architecture for Big Data: Algorithm and Applications. IEEE Xplore.