Inlämning av Examensarbete / Submission of Thesis

Syed Imran Haider; Raja M. Khurram Shahzad. MSC-2009:5, 52 pp. COM/School of Computing, 2009.

The work

Författare / Author: Syed Imran Haider, Raja M. Khurram Shahzad
Titel / Title: Detection of Spyware by Mining Executable Files
Abstrakt / Abstract:

Malicious programs have long been a serious threat to the confidentiality, integrity, and availability of a system. Considerable research has been done on detecting them, and two approaches have emerged: signature-based detection and heuristic-based detection. These approaches perform well against known malicious programs but cannot catch new ones. Researchers have therefore looked for new ways of detecting malicious programs. The application of data mining and machine learning is one of them and has shown good results compared to other approaches.
A new category of malicious programs, called spyware, has gained momentum. Spyware is particularly dangerous to the confidentiality of a user's private data: it may collect the data and send it to a third party. Traditional techniques have not performed well in detecting spyware, so new detection methods are needed. Data mining and machine learning have shown promising results in the detection of other malicious programs but have not yet been used for the detection of spyware.
We decided to employ data mining for the detection of spyware. We used a data set of 137 files, of which 119 are benign and 18 are spyware. A theoretical taxonomy of spyware was created, but only two classes, Benign and Spyware, were used in the experiment. An application, Binary Feature Extractor, was developed that extracts features, called n-grams, of different sizes on the basis of a common feature-based approach and a frequency-based approach. The number of features was reduced, and the reduced set was used to create an ARFF file. The ARFF file was used as input to WEKA for applying machine learning algorithms. The algorithms used in the experiment were J48, Random Forest, JRip, SMO, and Naive Bayes. 10-fold cross-validation and the area under the ROC curve were used to evaluate classifier performance. We performed experiments with three n-gram sizes: 4, 5, and 6. The results show that the common feature-based extraction approach produced better results than the others. We achieved an overall accuracy of 90.5 % with an n-gram size of 6 from the J48 classifier. The maximum area under the ROC curve was 83.3 %, achieved with Random Forest.
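The pipeline described in the abstract (byte n-grams extracted from executables, feature reduction, ARFF output for WEKA) can be illustrated with a minimal Python sketch. This is not the thesis's Binary Feature Extractor: the function names are hypothetical, and ranking n-grams by the number of files they occur in is an assumed stand-in for the feature reduction step, whose actual rule is not given here.

```python
from collections import Counter


def byte_ngrams(data, n):
    """Return the set of distinct n-byte sequences found in data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def frequent_features(files, n, top_k):
    """Keep the top_k n-grams that occur in the most files.

    Assumption: this document-frequency ranking only illustrates
    feature reduction; it is not the selection rule from the thesis.
    """
    counts = Counter()
    for data in files:
        counts.update(byte_ngrams(data, n))
    # Highest file count first; ties broken by byte order for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def to_arff(files, labels, features, n, relation="spyware"):
    """Render binary presence/absence features as ARFF text."""
    lines = ["@relation " + relation, ""]
    for gram in features:
        lines.append(f"@attribute ng_{gram.hex()} {{0,1}}")
    lines.append("@attribute class {benign,spyware}")
    lines += ["", "@data"]
    for data, label in zip(files, labels):
        present = byte_ngrams(data, n)
        row = ["1" if gram in present else "0" for gram in features]
        lines.append(",".join(row + [label]))
    return "\n".join(lines) + "\n"
```

In use, each element of `files` would be the raw bytes of one executable (for example read with `open(path, "rb").read()`), `labels` would hold `"benign"` or `"spyware"` per file, and the returned text would be saved with an `.arff` extension and loaded into WEKA.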

Ämnesord / Subject: Datavetenskap - Computer Science\Electronic Security
Mathematics\Probability and Statistics
Datavetenskap - Computer Science\Artificial Intelligence
Nyckelord / Keywords: Spyware Detection, Data Mining, Machine Learning, Feature Extraction, WEKA, ARFF

Publication info

Dokument id / Document id:
Program / Programme: Master of Science in Security Engineering
Registreringsdatum / Date of registration: 06/17/2009
Uppsatstyp / Type of thesis: Masterarbete/Master's Thesis (120 credits)


Handledare / Supervisor: Dr. Niklas Lavesson
Examinator / Examiner: Guohua Bai
Organisation / Organisation: Blekinge Institute of Technology
Institution / School: COM/School of Computing

+46 455 38 50 00
Anmärkningar / Comments:

+46709325761, +46762782550

Files & Access

Bifogad uppsats fil(er) / Files attached: thesis_imsy07_rmsh07.pdf (396 kB, opens in new window)