Programs that have the potential to violate the privacy and security of a system can be labeled as Potentially Unwanted Programs (PUP). These programs include spyware, which may collect a user's personal information and relay it to a third party with or without the user's knowledge; adware, which automatically shows advertisements to users according to their personal preferences; Trojans, which are harmful programs; and backdoors, which can help an intruder gain remote access to the system. Spyware may compromise the confidentiality, integrity, and availability of the system. It may obtain sensitive information such as credit card numbers, logins and passwords, shopping habits, bank account information and personal preferences, with or without user consent.
Unlike viruses, which are always unwanted, PUP can sometimes be installed with the user's express consent, either because it provides some useful functionality (on its own or through an accompanying software application) or because it is mentioned in the End User License Agreement (EULA) in a misleading way. For this reason, PUP straddles the boundary between what is considered legal and illegal software.
Specific anti-PUP tools have been developed as countermeasures, but no single anti-PUP tool appears able to prevent all existing and future PUP. Current anti-PUP tools rely on two approaches: signature-based methods, which extract specific patterns (called signatures) from known PUP and match them against files for detection, and heuristic-based methods, which use rules written by human experts to detect new PUP. Signature-based systems demonstrate good detection results for known PUP but often lack the capability to identify new and unseen instances. Heuristic-based systems try to detect both known and unknown PUP on the basis of rules, but the heuristic method is considered costly, time-consuming and often ineffective against new PUP. Other existing technologies are therefore needed to help detect both known and new PUP.
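To make the limitation of signature matching concrete, the following minimal sketch scans a byte string for known patterns. The signature names and byte patterns are hypothetical values chosen for illustration; they are not taken from any real anti-PUP tool, and real tools use far richer signature formats.

```python
# Hypothetical signature database: name -> byte pattern (illustrative values only).
SIGNATURES = {
    "ExamplePUP.A": bytes.fromhex("deadbeef"),
    "ExamplePUP.B": b"track_user_activity",
}


def match_signatures(data: bytes, signatures=SIGNATURES):
    """Return the sorted names of all signatures whose pattern occurs in data."""
    return sorted(name for name, pattern in signatures.items() if pattern in data)


# A file containing a known pattern is flagged ...
known = b"\x00\x01" + bytes.fromhex("deadbeef") + b"\x02"
# ... but a file with no stored pattern passes, even if it were a new PUP variant.
unseen = b"\x00\x01\x02\x03"

print(match_signatures(known))
print(match_signatures(unseen))
```

The second call returns no match simply because no stored pattern occurs in the data, which is why purely signature-based detection cannot recognize previously unseen instances.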
I try to find a solution for detecting both known and unknown PUP by using data mining. On the basis of this detection, applications can be classified as PUP or not. In the process, I investigate which features of binary/executable files can be used to distinguish legitimate software from spyware; candidates include n-grams of byte sequences (i.e., fixed-length strings taken from the hexadecimal dump of a program), instruction sequences, calls to APIs or DLLs, and text strings. I also try to find suitable approaches for reducing the number of features extracted from the binaries, and to determine which learning algorithms and parameter configurations are suitable for optimized PUP detection.
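As an illustration of the byte n-gram feature, the sketch below extracts overlapping n-grams from the hexadecimal dump of a byte string, keeps a small set of n-grams, and turns a sample into a count vector that a learning algorithm could consume. The function names, the values of `n` and `k`, and the document-frequency ranking are assumptions made for this example; the ranking is only one simple way to reduce the number of features and is not necessarily the approach used in this work.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int = 4) -> Counter:
    """Count overlapping n-grams, each written as the hex dump of n bytes."""
    return Counter(data[i:i + n].hex() for i in range(len(data) - n + 1))


def top_k_features(samples, n: int = 4, k: int = 3):
    """Keep the k n-grams that occur in the most samples (document frequency).

    Ties are broken alphabetically so the result is deterministic.
    """
    doc_freq = Counter()
    for data in samples:
        doc_freq.update(set(byte_ngrams(data, n)))
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:k]]


def to_vector(data: bytes, features, n: int = 4):
    """Represent one sample as n-gram counts over the selected features."""
    counts = byte_ngrams(data, n)
    return [counts[gram] for gram in features]


# Two tiny made-up byte strings standing in for executable files.
samples = [bytes.fromhex("4d5a9000"), bytes.fromhex("4d5a0000")]
features = top_k_features(samples, n=2, k=2)
vectors = [to_vector(s, features, n=2) for s in samples]
print(features, vectors)
```

The resulting vectors, paired with labels (PUP or legitimate), are the kind of input on which different learning algorithms and parameter configurations could then be compared.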