Syed Imran Haider; Raja M. Khurram Shahzad MSC-2009:5, pp. 52. COM/School of Computing, 2009.
Malicious programs have been a serious threat for the confidentiality, integrity and availability of a system. Different researches have been done to detect them. Two approaches have been derived for it i.e. Signature Based Detection and Heuristic Based Detection. These approaches performed well against known malicious programs but cannot catch the new malicious programs. Different researchers tried to find new ways of detecting malicious programs. The application of data mining and machine learning is one of them and has shown good results compared to other approaches.
A new category of malicious programs has gained momentum and it is called Spyware. Spyware are more dangerous for confidentiality of private data of the user of system. They may collect the data and send it to third party. Traditional techniques have not performed well in detecting Spyware. So there is a need to find new ways for the detection of Spyware. Data mining and machine learning have shown promising results in the detection of other malicious programs but it has not been used for detection of Spyware yet.
We decided to employ data mining for the detection of spyware. We used a data set of 137 files which contains 119 benign files and 18 Spyware files. A theoretical taxonomy of Spyware is created but for the experiment only two classes, Benign and Spyware, are used. An application Binary Feature Extractor have been developed which extract features, called n-grams, of different sizes on the basis of common feature-based and frequency-based approaches. The number of features were reduced and used to create an ARFF file. The ARFF file is used as input to WEKA for applying machine learning algorithms. The algorithms used in the experiment are: J48, Random Forest, JRip, SMO, and Naive Bayes. 10-fold cross-validation and the area under ROC curve is used for the evaluation of classifier performance. We performed experiments on three different n-gram sizes, i.e.: 4, 5, 6. Results have shown that extraction of common feature approach has produced better results than others. We achieved an overall accuracy of 90.5 % with an n-gram size of 6 from the J48 classifier. The maximum area under ROC achieved was 83.3 % with Random Forest.