know.bi blog

Basic Machine Learning - Anomaly Detection

Apr 11, 2018 10:00:00 AM / by Yannick Mols

What's weird about this?

At times you might be faced with unexpected patterns or events appearing in your data. Let's take a look at how we can tackle anomalies by detecting them.

Imagine you're exploring a data set and suddenly notice some anomalies.

As an example we'll take a look at unexpected locations of player kills in a video game. Every record has a map and an x and y position attached to it.

Going by the data there are two maps, so first we need to filter the data down to just one of them:

import pandas as pd

df = pd.read_csv(filePath, usecols=['map','victim_position_x','victim_position_y'], nrows=20000) # first 20K rows
df = df.loc[df['map']=='ERANGEL'] # select one of the maps

If we then plot this data we get a good-looking cluster of points:

import matplotlib.pyplot as plt

deaths = df[['victim_position_x','victim_position_y']].to_numpy() # as_matrix() was removed in pandas 1.0
plt.scatter(deaths[:,0],deaths[:,1])
plt.show()
plt.clf()


Immediately we can spot quite a few outliers in our data, but how do we decide which points are anomalies and which aren't? To do this we can use the Gaussian (also called normal) distribution.

[Image: anomaly detection.png]

The Gaussian distribution is a probability distribution that is fully described by its mean and variance. Once we fit it to our data, it tells us how likely each observation is, so we can use it to find extreme values that fall outside the general pool of observations.

[Image: normal-distr.png]
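For reference, this is the standard density of the normal distribution for a single variable x with mean mu and variance sigma squared. A point whose density is very low is a candidate outlier:

```latex
p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
```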

mu = deaths.mean(axis=0) # per-axis mean
sigma = deaths.var(axis=0) # per-axis variance (sigma squared), despite the name
print(mu, sigma)

[5.71298987 5.35145847] [7.36143001 6.82879176]
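The step from the mean and variance to a probability per death isn't shown in this post. Here is a minimal sketch of how it could look, assuming each axis is modelled as an independent Gaussian (the notebook may well use a full multivariate Gaussian instead). The helper name gaussian_probability and the sample points are made up for illustration:

```python
import numpy as np

def gaussian_probability(points, mu, var):
    """Density of each row of `points` under independent per-axis Gaussians.

    `var` is the per-axis variance (what the post stores in `sigma`).
    Hypothetical helper, not taken from the notebook.
    """
    points = np.asarray(points, dtype=float)
    # Per-axis normal density: exp(-(x - mu)^2 / (2 var)) / sqrt(2 pi var)
    per_axis = np.exp(-((points - mu) ** 2) / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)
    # Independence assumption: the joint density is the product over the axes
    return per_axis.prod(axis=1)

# Made-up sample: a tight cluster plus one far-away point
sample = np.array([[5.0, 5.0], [5.5, 4.5], [4.5, 5.5], [6.0, 5.0], [30.0, 30.0]])
mu = sample.mean(axis=0)
var = sample.var(axis=0)
p = gaussian_probability(sample, mu, var)
```

In this toy sample the far-away point gets the lowest density, which is exactly the property the threshold step relies on.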

Next we compute the probability that each death falls into the normal distribution, and determine a probability threshold below which a death counts as an outlier (see the notebook for the select_threshold function).

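import numpy as np
# p holds the probability of every death; pval and yval are not defined in this post, see the notebook
epsilon, f1 = select_threshold(pval, yval)
outliers = np.where(p < epsilon) # get outliers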
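The select_threshold function lives in the notebook and isn't reproduced here. As a rough sketch of what such a helper could look like, assume pval holds the probabilities for a labelled validation set and yval its ground-truth labels (1 = anomaly, 0 = normal). We then try a range of candidate thresholds and keep the one with the best F1 score. The step count and the sample values are illustrative assumptions, not the notebook's actual code:

```python
import numpy as np

def select_threshold(pval, yval, steps=1000):
    """Return (best_epsilon, best_f1) by scanning candidate thresholds.

    Sketch only: assumes yval uses 1 for anomalies and 0 for normal points.
    """
    pval = np.asarray(pval, dtype=float)
    yval = np.asarray(yval).astype(bool)
    best_epsilon, best_f1 = 0.0, 0.0
    # Candidates evenly spaced between the smallest and largest probability
    for epsilon in np.linspace(pval.min(), pval.max(), steps):
        predicted = pval < epsilon  # True where a point is flagged as anomaly
        tp = np.sum(predicted & yval)
        fp = np.sum(predicted & ~yval)
        fn = np.sum(~predicted & yval)
        if tp == 0:
            continue  # no true positives, so F1 is zero or undefined
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_epsilon, best_f1 = epsilon, f1
    return best_epsilon, best_f1

# Made-up validation data: the two lowest probabilities are the labelled anomalies
pval = np.array([0.20, 0.18, 0.001, 0.22, 0.002, 0.19])
yval = np.array([0, 0, 1, 0, 1, 0])
epsilon, f1 = select_threshold(pval, yval)
outliers = np.where(pval < epsilon)
```

F1 is used here because anomalies are rare, so plain accuracy would reward a threshold that never flags anything.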

We can then use these probabilities to indicate which deaths are normal and which are anomalies. Plotting this data, we can easily show the normal deaths as blue dots and the outliers as red dots:

# plot data
plt.scatter(deaths[:,0], deaths[:,1])
# plot outliers
plt.scatter(deaths[outliers[0],0], deaths[outliers[0],1], s=50, color='r', marker='o')
plt.show()

Of course this is only one way of doing anomaly detection. In the future we may look at other techniques to tackle this problem.

Get the code here!


Topics: data science, anomaly detection, outliers
