Illegal cyber activity is going bonkers!
As network defenses grow more sophisticated alongside new technology, the demand for lawful activity on the internet only increases. Despite that hopeful outlook, malicious actors around the world continue to spread chaos across the internet through exploitation, social engineering, malware, and more. At a global scale, these hackers may not be united by goal or driven by the same ideologies, but they do share tendencies; tendencies this tutorial aims to explore.
How are we going to explore these tendencies?
We'll explore these tendencies by analyzing attacks recorded on AWS honeypots between 9:53pm on March 3, 2013 and 5:55am on September 8, 2013. The dataset we will be exploring in this analysis contains variables that I will define here, to ease the understanding of what they mean later on:
All this talk of honeypots, what are they?
In the simplest terms, a honeypot is a trap for network attacks: it records the metadata from those attacks (the information listed above) for analytical purposes. AWS describes honeypots as "a security mechanism intended to lure and deflect an attempted attack. AWS’s honeypot is a trap point that one can insert into a website to detect inbound requests from content scrapers and bad bots."
Main objective
The main objective of this tutorial is to take you through the analysis of a real-world dataset using Python and its libraries and packages, and to figure out whether, given the dataset used below, a malicious act against a honeypot can be predicted.
Final Notes
Hereinafter, "columns" and "attributes" are interchangeable. Anything related to the "dataset" refers to the most up-to-date version of the "data" variable, which is used for all current analysis.
import pandas as pd
import numpy as np
import folium
import matplotlib.pyplot as plt
from folium.plugins import HeatMap
from sklearn import datasets, metrics
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, RepeatedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
The dataset used in this tutorial was found on Kaggle.com, an online database of public-use datasets. You can click here to find this dataset's page.
Additionally, this is a link to the author's "Data Driven Security" blog; the original source of the dataset.
How much data is there?
The dataset contains 451,581 data points. Each datapoint (row) represents a cyber-attack that occurred at an AWS Honeypot. Among a few other attributes (columns), each datapoint has a corresponding date, time, source country, source IP address and port, destination port, and the Honeypot attacked.
To get a good idea of what we're working with, we have to read in the downloaded .csv file from the link above. To do this - and to manipulate and analyze the data later on - we'll need to read it into a pandas dataframe.
Here, the data is read into said dataframe and a sample of the data is printed.
# Read in the .csv file.
data = pd.read_csv('marx-geo.csv')
# Display a sample of what the data looks like.
data
How do we clean it?
Since the dataset contains a very large number of datapoints, the main analysis will disregard all rows holding missing values ("NaN") in columns where data is expected: datetime, host, src, proto, spt, dpt, srcstr, country, locale, latitude, and longitude.
Some rows were input invalidly by the dataset's authors: they have an extra column associated with them when they shouldn't. For example, some rows have 16 columns instead of the expected 15. I have custom-labelled this extra column "format-issues", and any row with a value in it will be disregarded in all analysis.
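As an aside, one way to spot such malformed rows is to count the fields on each line before loading. This is just a sketch on a toy file, not the author's actual preprocessing; the 3-column layout below is hypothetical, standing in for the dataset's 15-column layout:

```python
import csv
import io

# Toy CSV with an expected width of 3 columns; the last row has an extra field.
raw = "a,b,c\n1,2,3\n4,5,6,7\n"
expected_width = 3

# Collect the indices of rows whose field count deviates from the expected width.
bad_rows = [i for i, row in enumerate(csv.reader(io.StringIO(raw)))
            if len(row) != expected_width]
print(bad_rows)  # [2]
```

Rows flagged this way can then be dropped, or routed into an extra column like "format-issues" for later filtering.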
Additionally, the following columns will be disregarded as they bear no value on the analysis: localeabbr and CC (because they are equivalent to the locale and country attributes, respectively), and postalcode and type (because the majority of data points do not hold values for these attributes).
The authors also entered some latitude and longitude values outside the valid ranges [-90, 90] and [-180, 180], respectively. Any values outside these ranges will likewise be disregarded in all analysis.
We will also separate the datetime attribute into date and time, and remove "groucho-" from host strings.
Here, the code gets cleaned up and prepared for analysis, per the guidelines established above.
# Removing columns to be disregarded, as defined above.
data.drop(columns = ['postalcode', 'format-issues', 'cc', 'localeabbr', 'type'], inplace = True)
# Dropping "NaN" entries for important rows.
data.dropna(subset=['datetime', 'host', 'src', 'proto', 'spt', 'dpt', 'srcstr', 'country', 'locale', 'latitude', 'longitude'], inplace= True)
# Dropping all entries with invalid latitude and longitude values, as defined above.
data = data[data.latitude >= -90]
data = data[data.latitude <= 90]
data = data[data.longitude >= -180]
data = data[data.longitude <= 180]
# Reset the indices of the cleaned-up data.
data.reset_index(inplace = True, drop = True)
# Separating 'datetime'
new = data['datetime'].str.split(" ", n = 1, expand = True)
# Removing "groucho" from host names
for row, col in data.iterrows():
if data.at[row, 'host'] != 'groucho-norcal':
data.at[row, 'host'] = data.at[row, 'host'].replace('groucho-', '',)
# Initialize a new DataFrame holding the separated date and time columns
temp = pd.DataFrame()
temp['date'] = new[0]
temp['time'] = new[1]
data.drop(columns= ['datetime'], inplace = True)
# Merging temp into data
data = temp.join(data)
data
The data's clean, now what?
As you may notice from the printed samples from before and after cleaning, the number of datapoints drops from 451,581 to 312,715. Despite a roughly 31% decrease in data points, the remaining entries are complete and have no missing attributes; this decluttering will allow for more precise and informative analysis.
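For the record, the size of that drop follows directly from the two row counts quoted above:

```python
# Row counts quoted above: before and after cleaning.
total_rows = 451_581
clean_rows = 312_715

dropped = total_rows - clean_rows
pct_dropped = dropped / total_rows * 100
print(f"{dropped} rows dropped ({pct_dropped:.1f}% of the data)")
```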
Now that the code's cleaned and prepped, let's start visualizing it!
First, we'll see which month, from March through September, had the most activity.
# Separating the dates across the 7 months.
count = [0] * 7
months = ['March', 'April', 'May', 'June', 'July', 'August', 'September']
for row, col in data.iterrows():
# Getting the month in the date value
curr = data.at[row, 'date'][6]
# Converting the string to an int
curr = int(curr)
if curr == 3:
count[0] += 1
elif curr == 4:
count[1] += 1
elif curr == 5:
count[2] += 1
elif curr == 6:
count[3] += 1
elif curr == 7:
count[4] += 1
elif curr == 8:
count[5] += 1
else:
count[6] += 1
# Pie plot!
plt.figure(figsize =(10, 7))
plt.pie(count, labels = months)
plt.legend(months)
plt.title('Months With Most Activity')
plt.show()
Interesting! It looks like we have a fairly uniform spread of activity across the months with data collected. September looks like it's lagging behind, but data was only collected for 8 days of that month. Attackers are consistent with their attacks throughout the observed period!
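As a side note, the same per-month tally can be computed without an explicit loop by letting pandas parse the dates. A minimal sketch on synthetic dates, assuming the same "YYYY-MM-DD" layout the loop above relies on:

```python
import pandas as pd

# Synthetic stand-in for the cleaned 'date' column.
sample = pd.DataFrame({'date': ['2013-03-03', '2013-03-04', '2013-04-01', '2013-09-08']})

# Parse the dates and tally attacks per month in one pass.
month_counts = pd.to_datetime(sample['date']).dt.month.value_counts().sort_index()
print(month_counts.to_dict())  # {3: 2, 4: 1, 9: 1}
```

On the real dataset this collapses the 20-odd lines of counting above into a single expression, at the cost of a (fast) datetime parse.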
Next up we'll show how the number of attacks at each honeypot looked among the rest by using a bar graph.
# Separating the 9 honeypots
count = [0] * 9
pots = ['EU', 'Oregon', 'SA', 'Singapore', 'Sydney', 'Tokyo', 'US East', 'Groucho-norcal', 'Zeppo-norcal']
for row, col in data.iterrows():
# Getting the host name
curr = data.at[row, 'host']
if curr == 'eu':
count[0] += 1
elif curr == 'oregon':
count[1] += 1
elif curr == 'sa':
count[2] += 1
elif curr == 'singapore':
count[3] += 1
elif curr == 'sydney':
count[4] += 1
elif curr == 'tokyo':
count[5] += 1
elif curr == 'us-east':
count[6] += 1
elif curr == 'groucho-norcal':
count[7] += 1
elif curr == 'zeppo-norcal':
count[8] += 1
# Bar graph!
plt.figure(figsize =(15, 10))
bar = plt.bar(pots, count)
# Setting different colors for each honeypot, then plotting.
bar[0].set_color('r')
bar[1].set_color('b')
bar[2].set_color('g')
bar[3].set_color('y')
bar[4].set_color('pink')
bar[5].set_color('purple')
bar[6].set_color('orange')
bar[7].set_color('cyan')
bar[8].set_color('yellowgreen')
plt.title('Honeypots Most Attacked')
plt.show()
This is fascinating! The Oregon and Tokyo honeypots are targeted almost 4 times as often as EU, SA, Sydney, US East, Groucho-norcal, and Zeppo-norcal. Singapore is the third most targeted honeypot, drawing roughly 3 times as many attacks as those less-targeted servers.
We've now identified malicious actors' activity across the months and their favorite honeypots to target; next, we have to see where these attacks are coming from.
Here, we're going to use the folium.plugins package to create a heatmap of the attack sources. To do this, we'll be utilizing the latitude and longitude values available to us in the dataset.
# Creating the map, centered at the (0, 0) coordinates and zoomed out at the global scale
global_map = folium.Map(location = [0, 0], zoom_start= 2)
# Defining the heat map
heat = []
for row, col in data.iterrows():
heat.append((data.at[row, 'latitude'], data.at[row, 'longitude']))
HeatMap(heat).add_to(global_map)
# Showing the map
global_map
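If the row-by-row loop above feels slow on ~300,000 points, the coordinate pairs can also be extracted in one vectorized call. A sketch on synthetic coordinates (the column names match the cleaned dataset; the values are made up):

```python
import pandas as pd

# Synthetic coordinates standing in for the cleaned dataset.
coords = pd.DataFrame({'latitude': [37.5, 35.7, 45.5],
                       'longitude': [127.0, 139.7, -122.7]})

# One vectorized call replaces the per-row append loop.
heat = coords[['latitude', 'longitude']].values.tolist()
print(heat)  # [[37.5, 127.0], [35.7, 139.7], [45.5, -122.7]]
```

The resulting list of [lat, lon] pairs can be passed to `HeatMap` exactly as in the loop-built version.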