
Abstract: We are living in one of the most defining periods in the development
of computing, in which the dominant platform has moved from large mainframes to
PCs to the cloud over the Internet. The growth of the Internet has made network
security an important issue: it brings people together, but it also exposes
them to substantial threats. Intrusion detection is regarded as a key technique
for securing networks behind firewalls. Machine Learning (ML) and Artificial
Intelligence (AI) have rapidly gained popularity in the past few years because
of properties such as adaptability, scalability, and the ability to adjust
quickly to new and unknown challenges. Cyber security is a fast-growing field
that demands a great deal of attention because of remarkable progress in social
networks, cloud and web technologies, online banking, mobile environments,
smart grids, etc. Diverse ML methods have been successfully deployed to address
such wide-ranging problems in computer security. ML is a powerful tool for
making predictions and calculated suggestions based on very large amounts of
data. Cyber security is positioned to leverage ML to improve malware detection,
triage events, recognize breaches, and alert organizations to security issues.
ML can be used to identify advanced targeting and threats such as organization
profiling, infrastructure vulnerabilities, and potential interdependent
vulnerabilities and exploits. ML can significantly change the cyber security
landscape. This paper describes the role of ML in detecting and highlighting
advanced malware for cyber defense analysts. Various ML algorithms are
discussed and compared. The paper also gives an overview of different
applications of ML in cyber security, such as phishing detection, spam
detection, and network intrusion detection.

 

Keywords: Machine learning algorithms, Artificial
Intelligence, Cyber Security


 

I. INTRODUCTION

 

In an era in which manual work is increasingly being computerized, the meaning
of "manual" is itself being reshaped. ML algorithms enable computers to play
games, assist in surgery, and become smarter and more personalized. Artificial
Intelligence (AI) affects the everyday lives of ordinary people, helping to
tackle the complexities of transportation (Google’s AI-powered predictions,
ridesharing apps like Uber and Lyft, AI autopilots on commercial flights),
email (spam filters, smart email categorization), grading and assessment
(plagiarism checkers, robo-readers), banking and personal finance (mobile check
deposits, fraud prevention, credit decisions), social networking (Facebook,
Pinterest, Instagram), online shopping (search, recommendations, fraud
protection), and mobile use (voice-to-text, smart personal assistants), and
helping to stay ahead of cyber security threats.

One of the main features of these transformations is how computing techniques
and tools have been democratized. In the past few years, data scientists have
assembled sophisticated data-crunching machines by seamlessly executing
advanced techniques [1]. The results have been impressive. ML is a data
analytics technique through which computers learn to do what comes naturally to
humans and animals, i.e., learn from experience. ML algorithms use
computational methods to learn information directly from data without depending
on a predetermined equation as a model. The algorithms adaptively improve their
performance as the number of samples available for learning increases.

AI/ML has rapidly gained popularity in the past few years. With Big Data a
major trend in the tech industry at present, ML is well suited to making
predictions and calculated suggestions based on large amounts of data. Specific
attacks such as malware and ransomware continue to pose a major challenge for
most commercial, government, and academic organizations. Training people and
providing cyber security expertise is a daunting challenge for the global
security community. New techniques such as ML offer an opportunity to close the
cyber skills gap by reducing the number of cyber security personnel needed to
research, analyze, and share malware detection information [1]. Well-known
examples of ML include the recommendation algorithms of Netflix and of Amazon,
which recommends books based on the books you have bought or searched for
before. ML is becoming indispensable in cyber security. It provides potential
solutions in all these domains and more, and is set to be a pillar of our
future civilization.

 

II.  EVOLUTION OF MACHINE LEARNING (ML)

 

ML was born from pattern recognition and the theory that computers can learn
without being explicitly programmed to perform specific tasks; researchers
interested in artificial intelligence wanted to see whether computers could
learn from data. The iterative aspect of ML is important because models learn
from previous computations to produce reliable, repeatable decisions and
results [1]. While many ML algorithms have been around for a long time, the
ability to automatically apply complex mathematical calculations to big data is
a recent development.

 

Types of Machine Learning Algorithms

There are three types of ML algorithms as depicted
in Fig.1.

 

Figure 1: Types of Machine Learning Algorithms

 

1.       Supervised Learning

A supervised learning algorithm works with an outcome (dependent) variable that
is to be predicted from a given set of predictor (independent) variables. Using
this set of variables, it generates a function that maps inputs to their
desired outputs. Training continues until the model reaches a desired level of
accuracy on the training data. Examples of supervised learning are Decision
Tree, Regression, Logistic Regression, etc.

 

2.     Unsupervised Learning

In this type of algorithm, there is no target or outcome variable to estimate.
It is used to cluster a population into different groups, for example to
segment customers into groups for specific interventions.

 

3.     Reinforcement Learning

In this type of algorithm, the machine is trained to make specific decisions.
The machine is exposed to an environment in which it trains itself continually
using trial and error. It learns from past experience and tries to capture the
best possible knowledge to make accurate decisions.

 

Some Machine Learning Algorithms

Machine learning algorithms can be broadly grouped into three categories:
supervised learning, unsupervised learning, and reinforcement learning.
Supervised learning is useful when a label is available for a certain training
set but needs to be predicted for other objects. Unsupervised learning is
useful when the challenge is to find implicit relationships in a given
unlabeled dataset. Reinforcement learning lies between these two categories.

 

· Linear Regression: In this method, a relationship is established between the
dependent and independent variables by fitting a line to them. This line is
known as the regression line and is represented by a linear equation of the
form Y = a*X + b, where a is the slope and b is the intercept.
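
As a minimal sketch of how a regression line can be fitted, the following
example uses ordinary least squares for a single independent variable. The
function name `fit_line` and the data points are illustrative assumptions, not
taken from any dataset discussed in this paper.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a * x + b for one independent variable."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    a = sxy / sxx            # slope of the regression line
    b = y_bar - a * x_bar    # intercept of the regression line
    return a, b

# Illustrative data only: points lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

With real, noisy data the points would not lie exactly on the line, and the
fitted slope and intercept would minimize the sum of squared vertical
distances to it.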

 

· Logistic Regression: Logistic regression is used to estimate discrete values
(typically binary values such as 0/1) from a set of independent variables. It
predicts the probability of an event by fitting the data to a logit function,
which is why it is also called logit regression.

Here are some methods that are often used to improve logistic regression
models:

· Include interaction terms

· Eliminate features

· Use regularization techniques

· Use a non-linear model
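
The logistic regression model described above can be sketched in a few lines.
This is a simplified illustration that assumes plain batch gradient descent on
the log-loss, without regularization; the function names, learning rate, epoch
count, and toy data are all hypothetical choices made for the example.

```python
import math

def sigmoid(z):
    """Logistic function mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights w and bias b by batch gradient descent on the log-loss."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi                      # derivative of log-loss w.r.t. z
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        # Step against the average gradient.
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def predict_proba(w, b, x):
    """Estimated probability that x belongs to class 1."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Illustrative data only: one feature, class 1 for larger values.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(X, y)
print(predict_proba(w, b, [0.0]), predict_proba(w, b, [5.0]))
```

The output of `predict_proba` is a probability; a discrete class is obtained
by thresholding it, commonly at 0.5.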

 

· Decision Tree: This is one of the most popular ML algorithms in use today. It
is a supervised learning algorithm that is used mainly for classification
problems. It works well for both categorical and continuous dependent
variables.

 

· Support Vector Machine (SVM): This is a classification method in which each
data item is plotted as a point in n-dimensional space, where n is the number
of features and the value of each feature is the value of a particular
coordinate. This makes it easier to classify the data. The lines (in general,
hyperplanes) that are used to split the data into classes are called
classifiers.

 

· Naïve Bayes: This classifier assumes that the presence of a particular
feature in a class is unrelated to the presence of any other feature. Even if
the features are related to each other, a Naïve Bayes classifier treats them as
independent when calculating the probability of a particular outcome. A Naïve
Bayes model is easy to build and useful for massive datasets [2]. Despite its
simplicity, it is known to outperform more sophisticated classification methods
in some cases.
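
The independence assumption can be made concrete with a small sketch. The
example below assumes a multinomial Naïve Bayes model over word tokens with
add-one (Laplace) smoothing, applied to a tiny hypothetical spam/ham corpus;
the function names and the messages are illustrative only.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label). Returns document and token counts per class."""
    class_counts = Counter(label for _, label in docs)
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, token_counts, vocab

def classify_nb(model, tokens):
    """Return the class with the highest log-posterior for the given tokens."""
    class_counts, token_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)             # log prior
        total_tokens = sum(token_counts[label].values())
        for tok in tokens:
            # Each token contributes independently (the "naive" assumption);
            # add-one smoothing avoids zero probabilities for unseen tokens.
            count = token_counts[label][tok]
            score += math.log((count + 1) / (total_tokens + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training messages.
training = [
    ("win free prize now".split(), "spam"),
    ("free money win".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
    ("project status meeting".split(), "ham"),
]
model = train_nb(training)
print(classify_nb(model, "free prize".split()))
print(classify_nb(model, "status meeting".split()))
```

Because the per-token probabilities are simply multiplied (added in log
space), training reduces to counting, which is why the model is easy to build
and scales to large datasets.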

 

· K-Nearest Neighbors (KNN): This algorithm can be applied to both
classification and regression problems, although it is more widely used for
classification. It is a simple algorithm that stores all available cases and
classifies any new case by taking a majority vote of its k nearest neighbors,
as measured by a distance function. The idea is intuitive: a new case is judged
by the company it keeps. Here are some things that must be considered before
selecting KNN:

· KNN is computationally expensive.

· Variables should be normalized; otherwise, variables with larger ranges can
bias the algorithm.

· The data still needs to be pre-processed.
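
The following sketch illustrates the majority vote together with the
normalization step mentioned above. It assumes Euclidean distance and min-max
scaling; the two features (packets per second and mean packet size) and all
values are hypothetical.

```python
import math
from collections import Counter

def min_max_normalize(rows):
    """Scale every column to [0, 1]; also return the scaler for new rows."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]

    def scale(row):
        return [(v - lo) / (hi - lo) if hi > lo else 0.0
                for v, lo, hi in zip(row, lows, highs)]

    return [scale(r) for r in rows], scale

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by a majority vote of its k nearest training cases."""
    dists = sorted((math.dist(x, query), label)
                   for x, label in zip(train_X, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Hypothetical traffic records: [packets per second, mean packet size in bytes].
raw_X = [[10, 500], [12, 520], [11, 480], [300, 60], [320, 64], [310, 70]]
labels = ["normal", "normal", "normal", "attack", "attack", "attack"]

norm_X, scale = min_max_normalize(raw_X)
print(knn_predict(norm_X, labels, scale([305, 66]), k=3))
print(knn_predict(norm_X, labels, scale([11, 505]), k=3))
```

Every prediction computes the distance to all stored cases, which is the
source of the computational cost noted in the list above.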

 

· K-Means: K-Means is an unsupervised algorithm for solving clustering
problems. Data sets are partitioned into a particular number of clusters (k) in
such a way that the data points within a cluster are homogeneous, and
heterogeneous with respect to the data in other clusters [2]. K-Means forms
clusters as follows:

· The K-Means algorithm picks k points, one for each cluster. These points are
known as centroids.

· Each data point forms a cluster with the closest centroid.

· New centroids are then computed from the existing cluster members.

· With these new centroids, the closest centroid is determined again for each
data point. This process is repeated until the centroids no longer change.
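
The four steps above can be sketched as follows. As a simplifying assumption,
the initial centroids are taken to be the first k data points; in practice a
random or otherwise more careful initialization is generally used. The
function name and the sample points are illustrative.

```python
import math

def kmeans(points, k, max_iter=100):
    """Cluster `points` into k groups; returns (centroids, clusters)."""
    # Step 1: pick k initial centroids (here simply the first k points).
    centroids = [list(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 2: each point joins the cluster of its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Step 3: recompute each centroid as the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append([sum(c) / len(members) for c in zip(*members)])
            else:
                new_centroids.append(centroids[i])  # keep centroid of an empty cluster
        # Step 4: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Illustrative data only: two well-separated groups of points.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (9.0, 10.0), (10.0, 9.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```

The result depends on the initial centroids, so different starting points can
lead to different clusterings of the same data.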

 

· Random Forest: A collection (ensemble) of decision trees is called a random
forest. To classify a new object based on its attributes, each tree produces a
classification, i.e., the tree "votes" for that class. The forest chooses the
classification that has the most votes.

Each tree is planted and grown as follows:

If the number of cases in the training set is N, then a sample of N cases is
taken at random, with replacement. This sample is the training set for growing
the tree. Each tree is grown to the largest extent possible. There is no
pruning.
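
The sampling and voting steps can be illustrated with a short sketch. To keep
the example small, each "tree" is replaced by a one-feature threshold rule
(`train_stump`), which is a stand-in assumption and not a fully grown,
unpruned decision tree; the data and the forest size of 25 are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(cases, rng):
    """Draw N cases at random, with replacement, from a training set of size N."""
    return [rng.choice(cases) for _ in range(len(cases))]

def train_stump(sample):
    """Stand-in for growing a tree: a threshold midway between the class means."""
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == 0]
    if not pos or not neg:
        # Degenerate sample containing one class only: always predict that class.
        only = 1 if pos else 0
        return lambda x: only
    pos_mean = sum(pos) / len(pos)
    threshold = (pos_mean + sum(neg) / len(neg)) / 2
    pos_above = pos_mean > threshold
    return lambda x: int((x > threshold) == pos_above)

def majority_vote(trees, x):
    """Each tree votes for a class; the forest returns the class with most votes."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# Illustrative data only: (feature value, class label).
cases = [(1.0, 0), (2.0, 0), (3.0, 0), (7.0, 1), (8.0, 1), (9.0, 1)]
rng = random.Random(0)  # fixed seed so the sketch is repeatable
forest = [train_stump(bootstrap_sample(cases, rng)) for _ in range(25)]
print(majority_vote(forest, 8.5), majority_vote(forest, 1.5))
```

Because every tree sees a different bootstrap sample, individual trees may
disagree, and the vote averages out their individual errors.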

· Dimensionality Reduction Algorithms: Vast amounts of data are being stored
and analyzed by corporations in today’s world. Dimensionality reduction
algorithms address the challenge of identifying significant patterns and
variables in such data.

 

· Gradient Boosting and AdaBoost: These are boosting algorithms used when large
volumes of data have to be handled in order to make predictions with high
accuracy. Boosting is an ensemble learning technique that combines the
predictive power of several base estimators to improve robustness.

In short, boosting combines multiple weak or average predictors to build a
strong predictor. These algorithms often perform well in data science
competitions such as Kaggle and AV Hackathon, and are among the most preferred
ML algorithms today. ML has played a critical role across several technologies
and practices that have been developed to reduce the opportunity for, and limit
the damage of, cyber-attacks [3].

Cyber security threats fall into two categories. The first comprises
vulnerabilities caused by factors outside the end user’s control, such as
security flaws in applications and protocols. The traditional remedies include
using firewalls and antivirus software, distributing patches that fix newly
discovered problems, and amending protocols. While the defense against such
threats is an ongoing battle, software engineers have been effective in
countering most threats and reducing the risk to an acceptable level in most
cases [4]. The second category, which has received less attention, comprises
problems caused by the actions of unaware users. For example, an attacker may
convince inexperienced users to install a fake antivirus program, which in
reality compromises their computers.

 

III. MACHINE LEARNING IN CYBER SECURITY

 

ML is a powerful tool that can be employed in many areas of cyber security.

 

· Phishing detection:

Phishing is a deception technique that uses a combination of social engineering
and technology to gather sensitive and personal information, such as passwords
and credit card details, by masquerading as a trustworthy person or business in
an electronic communication. ML is applied to predict whether a given URL or
domain is a phishing website. It can identify a wide variety of phishing pages,
including those that present users with only an image to elude content analysis
and those that deliver dynamic content to evade web crawlers. The role of ML is
to detect a phishing site and alert the affected users. It can also alert the
brand that the phishing site was trying to mimic, so that the brand can take
proper precautions to protect itself. The features used for phishing domain
detection with ML techniques are grouped as follows:

· URL-Based Features

· Domain-Based Features

· Page-Based Features

· Content-Based Features

LR, SVM, RF, Decision Tree, and K-Means algorithms are applied in phishing
detection.
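
To illustrate the first feature group, the sketch below extracts a few
URL-based features that could be fed to one of the classifiers listed above.
The specific features chosen here (URL length, dots and hyphens in the host,
an "@" symbol, a raw IP address as host, use of HTTPS) are assumptions made
for the example, not a feature set prescribed by this paper, and the URLs are
placeholders.

```python
import ipaddress
from urllib.parse import urlparse

def url_features(url):
    """Extract simple, hypothetical URL-based features for a phishing classifier."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    try:
        ipaddress.ip_address(host)   # raises ValueError if host is not an IP literal
        host_is_ip = True
    except ValueError:
        host_is_ip = False
    return {
        "url_length": len(url),
        "num_dots_in_host": host.count("."),
        "num_hyphens_in_host": host.count("-"),
        "has_at_symbol": "@" in url,
        "host_is_ip": host_is_ip,
        "uses_https": parsed.scheme == "https",
    }

# Placeholder URLs for illustration only.
print(url_features("http://192.168.0.1/login"))
print(url_features("https://www.example.com/account"))
```

Each dictionary can be converted to a numeric vector and combined with
domain-based, page-based, and content-based features before training.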

                                                                  

· Network Intrusion Detection:

Many intrusion detection systems (IDSs) are based on machine learning
techniques because of their adaptability to new and unknown attacks. There are
three main types of cyber analytics in support of IDSs: misuse-based (sometimes
also called signature-based), anomaly-based, and hybrid.

 

Misuse-based techniques are designed to detect known attacks by using
signatures of those attacks. They are effective for detecting known types of
attacks without generating an overwhelming number of false alarms. They require
frequent manual updates of the database of rules and signatures. Misuse-based
techniques cannot recognize novel (zero-day) attacks [5]. In misuse detection,
models are defined in advance; new data are passed through the model and
classified as either misuse or normal. To determine what has been stolen, an
analyst might examine file access logs or network traffic, looking for access
to sensitive files or large amounts of data flowing out of the network [7].
Malware analysis of the disk may be needed to track down known malware samples
using signatures developed by other human analysts. Alternatively, the running
system might be examined for unusual processes or other abnormal behavior as
part of the incident response.

 

Anomaly-based techniques model normal network and system behavior, and identify
anomalies as deviations from that normal behavior. Their main advantage is the
ability to detect zero-day attacks. Another advantage is that the profiles of
normal activity are customized for every system, application, or network,
thereby making it difficult for attackers to know which activities they can
carry out undetected. Additionally, the data on which anomaly-based techniques
alert (novel attacks) can be used to define the signatures for misuse
detectors. The main disadvantage of anomaly-based techniques is the potential
for high false alarm rates (FARs), because previously unseen (yet legitimate)
system behaviors may be categorized as anomalies.
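
A very simple form of anomaly-based detection can be sketched as follows. The
example assumes that "normal behavior" is summarized by the mean and standard
deviation of a single metric and that a value is flagged when it lies more
than a fixed number of standard deviations from the mean; the metric
(requests per minute), the threshold of 3, and the values are hypothetical.

```python
from statistics import mean, stdev

def fit_profile(normal_values):
    """Model normal behavior as the (mean, standard deviation) of a metric."""
    return mean(normal_values), stdev(normal_values)

def is_anomaly(value, profile, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu, sigma = profile
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold

# Hypothetical requests-per-minute observed during normal operation.
normal = [100, 104, 98, 101, 97, 103, 99, 102]
profile = fit_profile(normal)

print(is_anomaly(500, profile))  # far outside the learned profile
print(is_anomaly(101, profile))  # consistent with normal behavior
print(is_anomaly(110, profile))  # possibly legitimate, yet still flagged
```

The last call shows how a modest but legitimate increase in load can be
reported as an anomaly, which is the false alarm problem described above;
raising the threshold reduces false alarms at the cost of missing subtler
attacks.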

 

Hybrid techniques combine misuse and anomaly detection. They are used to raise
the detection rate of known intrusions and to decrease the false positive (FP)
rate for unknown attacks.

Most of the methods in use are in fact hybrids of the two. Therefore, in the
descriptions of ML methods, anomaly detection and hybrid methods are described
together [6].

An ML approach usually consists of two phases: training and testing. Often, the
following steps are performed:

• Identify class attributes (features) and classes from training data.

• Identify a subset of the attributes necessary for classification (i.e.,
dimensionality reduction).

• Learn the model using training data.

• Use the trained model to classify the unknown data.
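
The steps above can be sketched end to end. The example makes two simplifying
assumptions: attributes are selected by keeping the highest-variance columns
(an unsupervised, scale-dependent heuristic), and the model is a
nearest-centroid classifier. The connection records (duration, bytes sent, and
a constant flag) and all function names are hypothetical.

```python
from statistics import mean, pvariance

def select_features(X, keep):
    """Keep the indices of the `keep` highest-variance attributes."""
    variances = [pvariance(col) for col in zip(*X)]
    ranked = sorted(range(len(variances)), key=lambda j: variances[j], reverse=True)
    return sorted(ranked[:keep])

def project(row, cols):
    """Restrict a record to the selected attributes."""
    return [row[j] for j in cols]

def train_centroids(X, y):
    """Training phase: learn one mean vector (centroid) per class."""
    model = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        model[label] = [mean(col) for col in zip(*rows)]
    return model

def classify(model, row):
    """Testing phase: assign the class whose centroid is closest."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(model, key=lambda label: sq_dist(model[label], row))

# Step 1: features and classes from (hypothetical) training data.
train_X = [[1.0, 200.0, 1.0], [2.0, 220.0, 1.0], [1.5, 210.0, 1.0],
           [30.0, 5000.0, 1.0], [28.0, 5200.0, 1.0], [31.0, 4900.0, 1.0]]
train_y = ["normal", "normal", "normal", "attack", "attack", "attack"]

# Step 2: pick a subset of attributes (the constant third column is dropped).
cols = select_features(train_X, keep=2)

# Step 3: learn the model from the training data.
model = train_centroids([project(r, cols) for r in train_X], train_y)

# Step 4: classify previously unseen records.
test_X = [[1.2, 205.0, 1.0], [29.0, 5100.0, 1.0]]
predictions = [classify(model, project(r, cols)) for r in test_X]
print(cols, predictions)
```

Keeping the test records separate from the training records is what allows
the testing phase to estimate how the model behaves on unknown data.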

 

 

· Authentication and authorization with keystroke dynamics:

Keystroke dynamics is a class of behavioral biometrics that captures the typing
style of a user. The majority of computer systems employ a login ID and
password as the primary means of access security. In stand-alone situations
this level of security may be adequate, but when computers are connected to the
Internet, the vulnerability to a security breach increases. To reduce
vulnerability to attack, biometric solutions have been employed. The
Probabilistic Neural Network (PNN) is a highly suitable candidate ML algorithm
for keystroke dynamics authentication. At present, there are two major forms of
biometrics: those based on physiological attributes and those based on
behavioral characteristics. Physiological biometrics incorporate a measurement
of some physiological feature, such as fingerprints, retinal blood vessel
patterns, or iris patterns, into a computerized authentication schema.
Behavioral biometrics, by contrast, extract and integrate information about
human behavior, such as variations in speech pattern, gait, signature, and the
way we type, into the authentication schema.
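
As an illustration of how typing style can be turned into numbers, the sketch
below computes two timing features that are commonly used for keystroke
dynamics: dwell time (how long a key is held) and flight time (the gap between
releasing one key and pressing the next). Both the feature choice and the
simple tolerance-based comparison are assumptions made for this example; the
comparison is a stand-in and not the PNN classifier mentioned above. All
timings are hypothetical.

```python
def keystroke_features(events):
    """events: list of (key, press_time, release_time) in seconds, in typing order."""
    dwell = [release - press for _, press, release in events]
    flight = [events[i + 1][1] - events[i][2] for i in range(len(events) - 1)]
    return dwell, flight

def matches_profile(sample, enrolled, tolerance=0.05):
    """Accept if the mean absolute timing difference is within `tolerance` seconds."""
    s = sample[0] + sample[1]
    e = enrolled[0] + enrolled[1]
    if len(s) != len(e):
        return False
    return sum(abs(a - b) for a, b in zip(s, e)) / len(s) <= tolerance

# Hypothetical timings for typing the same three keys.
enrolled = keystroke_features([("p", 0.00, 0.10), ("a", 0.15, 0.24), ("s", 0.30, 0.41)])
genuine = keystroke_features([("p", 0.00, 0.11), ("a", 0.17, 0.25), ("s", 0.32, 0.42)])
impostor = keystroke_features([("p", 0.00, 0.30), ("a", 0.60, 0.85), ("s", 1.20, 1.50)])

print(matches_profile(genuine, enrolled))
print(matches_profile(impostor, enrolled))
```

A trained classifier would replace the fixed tolerance with a decision
learned from many enrollment samples per user.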

 

· Artificial intelligence and robotics:

For ML systems to have a real impact in these critical domains, they must be
able to communicate with highly skilled human experts in order to draw on their
judgment and expertise, and to share useful information or patterns from the
data. ML techniques and humans have skills that complement each other: ML
techniques are good at computation on data at the lowest level of granularity,
whereas people are better at abstracting knowledge from their experience and
transferring that knowledge. Neural networks are used for character
recognition.

 

· Encryption and decryption techniques:

Power consumption can be represented as floating-point values, and the results
can be represented as a confidence in each key bit. Strictly speaking, however,
this is not cryptography, as it is not a particularly meaningful way to connect
ML to a cryptosystem. It is rarely used as a way to analyze data; it is used
mainly for information extraction with ML.

 

· Examining security properties of protocols:

The security and reliability of network protocol implementations are essential
for communication services. Most approaches for verifying security and
reliability, such as formal validation and black-box testing, are limited to
checking the specification or the conformance of the implementation to it.
However, a protocol implementation may contain engineering details that are not
included in the system specification but may result in security flaws.
Black-box implementations are analyzed with Linear Regression and Logistic
Regression algorithms.

 

· End user techniques:

Spammers exploit social networks to carry out phishing attacks, disseminate
malware, and promote affiliate websites. It is no wonder, then, that ML is
making inroads here as well: the ability to perform tasks without being
explicitly programmed is what makes ML so powerful. Diverse techniques have
been designed to filter spam, including blacklists/whitelists, Bayesian
classification algorithms, keyword matching, header information processing,
analysis of spam-sending factors, and analysis of received mails. Spam
detection is organized into several stages: mapping and assembly,
pre-filtering, and classification. In mapping and assembly, a standard model is
defined for each object, characterized by its structure. For instance, in our
proposed framework we have used two models: a message model and a profile
model. In pre-filtering, the incoming object is checked by comparing it with a
blacklist. Spam detection in social networks using Decision Tree, SVM, Random
Forest, and Naïve Bayes approaches is highly effective, and a combination of
spam prevention filters gives higher accuracy. Spammers typically post numerous
messages by creating fake profiles, and also attempt to hack other users’
profiles. The SVM in this research work is therefore trained so that it
classifies the test data considering both the profile model and the message
model.

 

VI. FINDINGS AND RESULTS

 

ML in security is a fast-growing trend. Analysts at ABI Research estimate that
ML in cyber security will boost spending on big data, artificial intelligence
(AI), and analytics to $96 billion by 2021. Most major security companies have
shifted from a traditional “signature-based” system for detecting malware to an
ML system that tries to interpret actions and events and learns from a variety
of sources to determine which data or information is safe. ML is used to design
security systems, to evaluate protocol implementations, and to provide human
interaction with the machine. Random forest based classifiers are the best
classifiers for the given dataset of phishing sites, with a classification
accuracy of 97.47%. SVM techniques perform best, with a 95.5% detection rate.
Analysis of SVM in spam detection shows precision of 70% to 82% for the given
datasets. Table 1 consolidates the areas of cyber security and the ML
algorithms applied in them.

 

 

 

 

 

 

 

 

 

 

 

 

Table 1: Areas of Cyber Security and ML Algorithms

 

VII. CONCLUSION

 

The goal of this paper was to better understand how machine learning is applied
in the cyber security domain. Robust anti-phishing algorithms and network
intrusion detection systems already exist. Data mining is widely recognized as
an important means of extracting useful information from large volumes of data
that are noisy, fuzzy, and random. Machine learning algorithms can improve the
efficiency of IDSs. Machine learning can be successfully used for developing
authentication systems, evaluating protocol implementations, assessing the
security of human interaction proofs, smart meter data profiling, etc. There
are many opportunities in information security to apply machine learning to
address various challenges in this complex domain. Spam detection, virus
detection, and robbery detection from surveillance cameras are only some
examples.