Methodology

How Overwatch detects network intrusions using anomaly-based machine learning.

Problem Statement

Network intrusion detection is critical but challenging. Traditional signature-based systems (such as Snort with its rule sets) can only catch known attack patterns. A human has to write each rule first, so these systems cannot detect zero-day attacks that no rule describes yet.

Anomaly-based detection inverts this: learn what normal network traffic looks like, then flag anything that deviates. This approach can theoretically catch attacks it has never seen before.

Dataset: CICIDS2017

This model is trained on the CICIDS2017 dataset from the Canadian Institute for Cybersecurity.

Size: 2.8M flows
Features: 78 per flow
Labels: Benign + 14 attacks
Duration: Mon–Fri capture

Model: Isolation Forest

We use Isolation Forest, an ensemble method designed for anomaly detection. It fits this project for three reasons:

One-Class Learning

Trained exclusively on benign traffic. Learns normality, not attack patterns. Can catch zero-day attacks.

Browser-Portable

Just a collection of binary trees. Scoring is tree traversal + averaging. No ML runtime required.
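As a rough illustration of "tree traversal + averaging", the sketch below walks a tree stored as flat arrays and averages the resulting path lengths. The array layout (children_left, children_right, feature, threshold, n_node_samples, with -1 marking a leaf) is an assumption modeled on how scikit-learn exposes fitted trees, not the project's confirmed export format, and the toy tree is invented. It is written in Python for illustration only.

```python
import math

EULER_GAMMA = 0.5772156649015329


def c(n: int) -> float:
    """Expected path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def path_length(tree: dict, x: list) -> float:
    """Walk one tree from the root to a leaf and return the adjusted depth."""
    node, depth = 0, 0
    while tree["children_left"][node] != -1:  # -1 marks a leaf (assumed convention)
        if x[tree["feature"][node]] <= tree["threshold"][node]:
            node = tree["children_left"][node]
        else:
            node = tree["children_right"][node]
        depth += 1
    # A leaf that still holds several training points stands in for an unbuilt subtree.
    return depth + c(tree["n_node_samples"][node])


def mean_path_length(trees: list, x: list) -> float:
    """Average the path length over all trees; shorter means easier to isolate."""
    return sum(path_length(t, x) for t in trees) / len(trees)


# Hypothetical one-feature tree: values above 10 are isolated after a single split.
toy_tree = {
    "children_left": [1, 3, -1, -1, -1],
    "children_right": [2, 4, -1, -1, -1],
    "feature": [0, 0, -2, -2, -2],
    "threshold": [10.0, 5.0, -2.0, -2.0, -2.0],
    "n_node_samples": [256, 255, 1, 100, 155],
}

short_path = mean_path_length([toy_tree], [100.0])  # isolated quickly
long_path = mean_path_length([toy_tree], [0.0])     # buried among many points
print(short_path, long_path)
```

Nothing here needs more than array indexing and a logarithm, which is why this kind of scoring can run without an ML runtime.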

Interpretable

The anomaly score is based on path length: the fewer random splits it takes to isolate a flow, the more unusual that flow is.
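For reference, the path-length score defined by Liu, Ting, and Zhou (2008) can be written as follows. Here h(x) is the path length of flow x in a single tree, E[h(x)] is its average over all trees, and n is the per-tree subsample size; whether the exported model applies exactly this normalization is an assumption.

```latex
s(x, n) = 2^{-\frac{E[h(x)]}{c(n)}}, \qquad
c(n) = 2H(n-1) - \frac{2(n-1)}{n}, \qquad
H(i) \approx \ln i + 0.5772156649
```

Scores close to 1 indicate flows that are isolated in very few splits (likely anomalies), scores well below 0.5 indicate ordinary flows, and E[h(x)] = c(n) gives exactly 0.5.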

Model Parameters

n_estimators = 100 trees

max_samples = auto (min(256, n_samples) flows per tree)

contamination = 0.01

scaler = RobustScaler
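The parameter names above match scikit-learn's API, so a configuration along these lines is plausible. This is a sketch under that assumption; random_state is not stated in the source and is added only to make the example reproducible.

```python
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import RobustScaler

# RobustScaler centers each feature on its median and scales by the
# interquartile range, so extreme values distort the scaling less.
scaler = RobustScaler()

model = IsolationForest(
    n_estimators=100,      # number of isolation trees
    max_samples="auto",    # min(256, n_samples) flows drawn per tree
    contamination=0.01,    # expected outlier fraction; sets the decision offset
    random_state=42,       # assumption: not given in the source
)
```

With contamination = 0.01 and benign-only training data, roughly 1% of the training flows end up on the anomalous side of the default decision boundary.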

Training Protocol

01 Load all 8 CSV files, concatenate to 2.8M flows
02 Clean: fix whitespace, handle infinities, drop NaN
03 Select 15 discriminative features
04 Train/test split: 70% benign train, 30% benign + ALL attacks test
05 Fit RobustScaler on training data only
06 Fit Isolation Forest on scaled benign data
07 Evaluate on test set
08 Export model as JSON for browser inference
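The steps above can be sketched end to end on synthetic data. Everything concrete here is an assumption for illustration: the three feature columns, the generated values, the JSON layout, and the use of scikit-learn and pandas. Real runs would load the CICIDS2017 CSV files and the 15 selected features instead.

```python
import json

import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
cols = [" f0", " f1", " f2"]  # leading spaces mimic untidy CSV headers

# Step 01 stand-in: synthetic flows instead of the concatenated CSV files.
benign = pd.DataFrame(rng.normal(0.0, 1.0, size=(2000, 3)), columns=cols)
attack = pd.DataFrame(rng.normal(8.0, 1.0, size=(200, 3)), columns=cols)
benign[" Label"] = "BENIGN"
attack[" Label"] = "DDoS"
df = pd.concat([benign, attack], ignore_index=True)
df.iloc[0, 0] = np.inf  # simulate a divide-by-zero rate feature
df.iloc[1, 1] = np.nan  # simulate a missing value

# Step 02: fix header whitespace, turn infinities into NaN, drop incomplete rows.
df.columns = df.columns.str.strip()
df = df.replace([np.inf, -np.inf], np.nan).dropna()

# Step 03: feature selection (all three toy features here).
features = ["f0", "f1", "f2"]

# Step 04: 70% of benign flows train; 30% of benign plus all attacks test.
is_benign = df["Label"] == "BENIGN"
benign_train, benign_test = train_test_split(df[is_benign], test_size=0.30, random_state=0)
test = pd.concat([benign_test, df[~is_benign]])
y_test = (test["Label"] != "BENIGN").astype(int)  # 1 = attack

# Steps 05-06: fit the scaler and the forest on benign training flows only.
scaler = RobustScaler().fit(benign_train[features])
model = IsolationForest(n_estimators=100, max_samples="auto",
                        contamination=0.01, random_state=0)
model.fit(scaler.transform(benign_train[features]))

# Step 07: predict() returns -1 for anomalies; map that to 1 = flagged.
pred = (model.predict(scaler.transform(test[features])) == -1).astype(int)
attack_recall = recall_score(y_test, pred)
print("attack recall:", attack_recall)


# Step 08: dump each tree's flat arrays plus scaler statistics (hypothetical layout).
def export_tree(estimator) -> dict:
    t = estimator.tree_
    return {
        "children_left": t.children_left.tolist(),
        "children_right": t.children_right.tolist(),
        "feature": t.feature.tolist(),
        "threshold": t.threshold.tolist(),
        "n_node_samples": t.n_node_samples.tolist(),
    }


payload = json.dumps({
    "trees": [export_tree(e) for e in model.estimators_],
    "center": scaler.center_.tolist(),
    "scale": scaler.scale_.tolist(),
    "offset": float(model.offset_),
})
```

Fitting the scaler only on the benign training split keeps test-set statistics, including attack traffic, out of the preprocessing.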

Results

What Works Well

Volumetric and rate-based attacks (DDoS, DoS, Port Scan) detected with high recall. These produce genuinely anomalous traffic.

Harder Cases

Low-volume attacks (Infiltration, some Web Attacks) are tougher. These mimic normal traffic and look statistically similar to benign browsing.

Why

Tradeoff: signature IDS catches known attacks but misses zero-days. Anomaly detection catches novel attacks but has incomplete recall on stealthy known ones.

Limitations

01 No temporal modeling: Flows are treated independently, so sequences of low-volume probes might be missed.
02 Fixed threshold: A single global detection threshold is used. It should be tunable in production.
03 Concept drift: Normal traffic patterns change over time. The model needs periodic retraining.
04 Feature extraction: CICFlowMeter computes features only after a flow completes. Live deployment needs a streaming extractor.
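One way to make the threshold tunable, sketched under assumptions: choose the cutoff from the score distribution of held-out benign flows so that a target false-positive rate is met. The helper name threshold_for_fpr and the synthetic scores are hypothetical, and the sketch assumes that higher scores mean more anomalous.

```python
import numpy as np


def threshold_for_fpr(benign_scores: np.ndarray, target_fpr: float) -> float:
    """Return the score cutoff that flags roughly target_fpr of benign flows."""
    return float(np.quantile(benign_scores, 1.0 - target_fpr))


rng = np.random.default_rng(0)
# Synthetic stand-in for anomaly scores of held-out benign flows.
benign_scores = rng.beta(2, 5, size=10_000)

cutoffs = {}
for fpr in (0.001, 0.01, 0.05):
    cutoffs[fpr] = threshold_for_fpr(benign_scores, fpr)
    flagged = float(np.mean(benign_scores > cutoffs[fpr]))
    print(f"target FPR {fpr:.3f}: cutoff {cutoffs[fpr]:.3f}, flagged {flagged:.4f}")
```

A stricter security policy would pick a larger target rate (lower cutoff, more alerts), a noisier environment a smaller one.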

Future Work

LSTM/autoencoder for sequence-aware detection

Online learning for concept drift adaptation

Ensemble: Isolation Forest + Local Outlier Factor

Real-time feature extraction (Apache Flink)

Interactive threshold tuning per security policy

References

Sharafaldin et al. (2018). "Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization." ICISSP 2018.
Liu, Ting, Zhou (2008). "Isolation Forest." ICDM 2008.
UNB CICIDS2017 Dataset.