Introduction:
With the increasingly deep integration of the Internet and society, the Internet is changing the way people live, study and work, but the security threats we face are becoming more and more serious. Identifying network attacks, especially unforeseen ones, is an unavoidable key technical issue. An Intrusion Detection System (IDS), a significant research achievement in the information security field, can identify an intrusion, whether it is ongoing or has already occurred. Intrusion detection has long relied on analyzing network flows, logs, and system events, and these sources generate big data. Big Data analytics can correlate multiple information sources into a coherent view, identify anomalies and suspicious activities, and ultimately achieve effective and efficient intrusion detection. In this project, I aim to construct an IDS model with deep learning techniques: I implement a deep learning approach for intrusion detection using a Recurrent Neural Network (RNN) and train the IDS model on the KDD Cup 1999 dataset.
What is an Intrusion Detection System:
Intrusion Detection Systems look for attack signatures, which are specific patterns that usually indicate malicious or suspicious intent. Intrusion detection is usually treated as a classification problem: either a binary problem, i.e., identifying whether network traffic behavior is normal or anomalous, or a five-category problem, i.e., identifying whether it is normal or one of four attack types: Denial of Service (DoS), User to Root (U2R), Probe (Probing) and Remote to Local (R2L). In short, the main goal of intrusion detection is to improve the accuracy of classifiers in identifying intrusive behavior.

Why Security Analytics:
Information security is now crucial for every organization to protect its information and conduct its business. Information security is defined as the protection of information and of the systems and hardware that use, store and transmit that information. It performs four important functions for an organization: it protects the organization’s ability to function, enables the safe operation of applications implemented on the organization’s IT systems, protects the data the organization collects and uses, and safeguards the technology assets in use at the organization. Implementing information security in an organization also involves challenges and risks. To mitigate these risks, Intrusion Detection Systems (IDS) have become a necessary component of almost every security infrastructure. Intrusion detection plays an important role in ensuring information security, and its key technology is the accurate identification of various attacks in the network.
Dataset Description:
The KDD Cup 1999 dataset has been used to measure IDS performance in many studies. Although the dataset is old, it is useful for comparing IDS models because many performance results have been published on it. That is the main reason I chose the KDD Cup 1999 dataset. The dataset contains 4,898,431 network connection records, each with 41 features, and 22 attack types are categorized according to their characteristics. The table below shows the attack categories. A DoS attack depletes the resources of the target servers so that they cannot provide any service. An R2L attack enables unauthorized remote access. A U2R attack tries to acquire superuser privileges. A Probe attack is used to find vulnerabilities of the target server.
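The grouping of the 22 attack labels into the four categories can be sketched as a lookup table. This is a minimal sketch: the label-to-category assignment below follows the grouping commonly published for the KDD Cup 1999 training data and should be checked against the category table of this report. The names `ATTACK_CATEGORIES`, `LABEL_TO_CATEGORY` and `categorize` are hypothetical helpers introduced only for illustration.

```python
# Commonly published grouping of the 22 KDD Cup 1999 training attack labels
# (an assumption here; verify against the category table of this report).
ATTACK_CATEGORIES = {
    "DoS": ["back", "land", "neptune", "pod", "smurf", "teardrop"],
    "R2L": ["ftp_write", "guess_passwd", "imap", "multihop",
            "phf", "spy", "warezclient", "warezmaster"],
    "U2R": ["buffer_overflow", "loadmodule", "perl", "rootkit"],
    "Probe": ["ipsweep", "nmap", "portsweep", "satan"],
}

# Invert the table so that each raw label points to its category.
LABEL_TO_CATEGORY = {
    label: category
    for category, labels in ATTACK_CATEGORIES.items()
    for label in labels
}
LABEL_TO_CATEGORY["normal"] = "Normal"


def categorize(raw_label):
    """Map a raw label such as 'smurf.' to one of the five classes.

    Labels in the raw data files end with a period, so it is stripped first.
    """
    return LABEL_TO_CATEGORY[raw_label.strip().rstrip(".")]
```

For the binary formulation, the same helper can be reused by treating every category other than "Normal" as anomalous.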
Because the original dataset contains too many records, I use the KDD Cup 1999 10 percent subset for training and testing. However, the attack data are heavily weighted towards DoS, and the other attack categories account for only about 1 percent, as shown in Figure 2. An IDS model trained on this data would therefore be biased: DoS attacks and normal traffic would be detected easily, but the other attacks would be missed. To solve this problem, I generate a new, more balanced training dataset. I extract 300 instances from each attack category except U2R, from which I extract only 30 instances because so few are available. I also use 1,000 normal instances.
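The resampling step described above could look roughly like the following. It is a sketch under assumptions: records are held in memory as a list, `category_of` is a hypothetical caller-supplied function that returns the category of a record, and sampling is done at random without replacement with a fixed seed. The report does not state how the instances were actually chosen.

```python
import random
from collections import defaultdict

# Per-category sample sizes taken from the text above.
SAMPLE_SIZES = {"Normal": 1000, "DoS": 300, "R2L": 300, "Probe": 300, "U2R": 30}


def balanced_sample(records, category_of, sizes=SAMPLE_SIZES, seed=0):
    """Draw up to sizes[c] records of each category c without replacement."""
    rng = random.Random(seed)

    # Group the records by category.
    by_category = defaultdict(list)
    for record in records:
        by_category[category_of(record)].append(record)

    # Sample each category separately; take everything if the pool is too small.
    sample = []
    for category, size in sizes.items():
        pool = by_category.get(category, [])
        sample.extend(rng.sample(pool, min(size, len(pool))))

    # Shuffle so that the categories are interleaved in the training set.
    rng.shuffle(sample)
    return sample
```

With enough records in every category, this yields 1,000 + 3 × 300 + 30 = 1,930 training instances.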
Data Preprocessing:
- Numericalization: The KDD dataset has 41 features, of which 38 are numeric and 3 are non-numeric. Because the input of RNN-IDS must be a numeric matrix, the non-numeric features ‘protocol_type’, ‘service’ and ‘flag’ must be converted into numeric form. For example, the feature ‘protocol_type’ has three attribute values, ‘tcp’, ‘udp’, and ‘icmp’, which are encoded as the binary vectors (1,0,0), (0,1,0) and (0,0,1). Similarly, the feature ‘service’ has 70 attribute values, and the feature ‘flag’ has 11. In this way, the 41-dimensional features map into 122-dimensional features after transformation (38 + 3 + 70 + 11 = 122).
- Normalization: First, for features such as ‘duration[0,58329]’, ‘src_bytes[0,1.3×10^9]’ and ‘dst_bytes[0,1.3×10^9]’, where the difference between the maximum and minimum values is very large, we apply logarithmic scaling to obtain the ranges ‘duration[0,4.77]’, ‘src_bytes[0,9.11]’ and ‘dst_bytes[0,9.11]’. Second, the value of every feature is mapped linearly to the [0, 1] range according to (1), where Max denotes the maximum value and Min denotes the minimum value of each feature.
$$x_i = \frac{x_i - \mathrm{Min}}{\mathrm{Max} - \mathrm{Min}} \tag{1}$$
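The two preprocessing steps can be sketched in plain Python as follows. The function names are hypothetical. The use of log10(1 + x) is an assumption made here so that a value of 0 stays at 0 (the text only says "logarithmic scaling"); it reproduces the quoted upper bounds of 4.77 and 9.11. Only the ‘protocol_type’ vocabulary is spelled out because it is the only one listed in the text; the ‘service’ and ‘flag’ vocabularies would be collected from the data.

```python
import math

# Order matches the encoding given in the text: tcp, udp, icmp.
PROTOCOL_TYPES = ["tcp", "udp", "icmp"]


def one_hot(value, vocabulary):
    """Return a binary vector with a single 1 at the position of value."""
    return [1.0 if value == item else 0.0 for item in vocabulary]


def log_scale(x):
    """Compress a wide-ranging, non-negative feature (assumed form: log10(1 + x))."""
    return math.log10(1.0 + x)


def min_max_normalize(column):
    """Apply equation (1) to one feature column, mapping it to [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to 0 to avoid 0/0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]
```

For example, `one_hot("udp", PROTOCOL_TYPES)` gives the vector (0, 1, 0), and a ‘duration’ column would first pass through `log_scale` and then through `min_max_normalize`.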
Algorithm defining process:
After identifying the data source, a set of algorithms can be selected for data exploration. This is a trial-and-error process because it is very difficult to assess the effectiveness of an algorithm before applying it to a particular dataset. Algorithms are like puzzle pieces, and figuring out which ones are needed and how they work together is a tedious task. After studying several options, below is my analysis for choosing an algorithm.
Machine learning methodologies have been widely used in identifying various types of attacks, and a machine learning approach can help the network administrator take the corresponding measures for preventing intrusions. However, most of the traditional machine learning methodologies belong to shallow learning and often emphasize feature engineering and selection; they cannot effectively solve the massive intrusion data classification problem that arises in the face of a real network application environment. With the dynamic growth of data sets, multiple classification tasks will lead to decreased accuracy. In addition, shallow learning is unsuited to intelligent analysis and the forecasting requirements of high-dimensional learning with massive data. In contrast, deep learners have the potential to extract better representations from the data to create much better models. As a result, intrusion detection technology has experienced rapid development after falling into a relatively slow period.
After Professor Hinton proposed the theory of deep learning in 2006, deep learning theory and technology underwent a meteoric rise in the field of machine learning. This rapid development in recent years has opened a new era of artificial intelligence and offers a completely new way to develop intelligent intrusion detection technology.
Due to growing computational resources, Recurrent Neural Networks (RNNs) have recently driven significant developments in the domain of deep learning. Like convolutional neural networks (CNNs), RNNs have been around for decades, but their full potential has only recently started to become widely recognized. In recent years, RNNs have played an important role in the fields of computer vision, natural language processing (NLP), semantic understanding, speech recognition, language modelling, translation, picture description, and human action recognition.

Because deep learning has the potential to extract better representations from the data to create much better models, and inspired by recurrent neural networks, I have proposed a deep learning approach for an intrusion detection system using recurrent neural networks (RNN-IDS).
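To make the idea of a recurrent classifier concrete, the following is a minimal forward-pass sketch of a simple (Elman-style) RNN in plain Python. It is not the RNN-IDS implementation: the class name `SimpleRNN`, the hidden size of 80, the tanh activation, and the random initialization are all assumptions chosen for illustration. Only the 122 inputs and the 5 output classes come from the text. Training (backpropagation through time) is omitted.

```python
import math
import random


def init_matrix(rows, cols, rng, scale=0.1):
    """Small random weights in [-scale, scale]."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class SimpleRNN:
    """Untrained Elman-style RNN: 122 inputs, 5 classes, assumed hidden size."""

    def __init__(self, n_in=122, n_hidden=80, n_out=5, seed=0):
        rng = random.Random(seed)
        self.w_xh = init_matrix(n_hidden, n_in, rng)      # input -> hidden
        self.w_hh = init_matrix(n_hidden, n_hidden, rng)  # hidden -> hidden
        self.w_hy = init_matrix(n_out, n_hidden, rng)     # hidden -> output
        self.b_h = [0.0] * n_hidden
        self.b_y = [0.0] * n_out
        self.n_hidden = n_hidden

    def forward(self, sequence):
        """Return class probabilities after reading a sequence of feature vectors."""
        h = [0.0] * self.n_hidden
        for x in sequence:
            # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
            pre = [a + b + c for a, b, c in
                   zip(matvec(self.w_xh, x), matvec(self.w_hh, h), self.b_h)]
            h = [math.tanh(p) for p in pre]
        # y = softmax(W_hy h_T + b_y)
        logits = [s + b for s, b in zip(matvec(self.w_hy, h), self.b_y)]
        return softmax(logits)
```

Calling `SimpleRNN().forward(sequence)` on a list of 122-dimensional preprocessed records returns five probabilities, one for each of Normal, DoS, R2L, U2R and Probe. The order of the classes is a labeling choice, not something the network fixes.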

Conclusion:
Information security is crucial in an organization. All information stored in the organization should be kept secure. Information security is important because it protects confidential information, enables the organization to function, and enables the safe operation of applications implemented on the organization’s Information Technology systems; information is also an asset for an organization. For these reasons, I am designing an Intrusion Detection System using Recurrent Neural Networks (RNN).