Methodology

How Overwatch detects network intrusions using anomaly-based machine learning.

Problem Statement

Network intrusion detection is critical but challenging. Traditional signature-based systems (like Snort rules) can only catch known attack patterns. They require humans to write rules first, and they're completely blind to zero-day attacks.

Anomaly-based detection inverts this: learn what normal network traffic looks like, then flag anything that deviates. This approach can theoretically catch attacks it has never seen before.

Dataset: CICIDS2017

This model is trained on the CICIDS2017 dataset from the Canadian Institute for Cybersecurity.

Size: 2.8M flows
Features: 78 per flow
Labels: Benign + 14 attacks
Duration: Mon–Fri capture

Model: Isolation Forest

We use Isolation Forest, an ensemble method designed for anomaly detection. It's ideal because:

One-Class Learning: Trained exclusively on benign traffic, so it learns normality rather than attack patterns and can, in principle, flag zero-day attacks.
Browser-Portable: The model is just a collection of binary trees. Scoring is tree traversal plus averaging, so no ML runtime is required.
Interpretable: The anomaly score is based on path length. The intuition: a point that is easy to isolate is unusual.
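
The "tree traversal + averaging" scoring can be sketched in a few lines. This is a minimal illustration of the path-length score from Liu et al. (2008), not the project's actual inference code: the nested-dict tree layout (`feature`, `threshold`, `left`, `right`, `size`), the `<` split convention, and the toy tree are all assumptions made for the example.

```python
import math

EULER_GAMMA = 0.5772156649015329


def c(n: int) -> float:
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def path_length(node: dict, x: list) -> float:
    """Walk one tree from the root to a leaf and return the adjusted depth."""
    depth = 0
    while "feature" in node:  # internal nodes carry a split, leaves do not
        go_left = x[node["feature"]] < node["threshold"]
        node = node["left"] if go_left else node["right"]
        depth += 1
    # A leaf holding several training points stands in for an unbuilt subtree.
    return depth + c(node["size"])


def anomaly_score(trees: list, x: list, sample_size: int) -> float:
    """Score in (0, 1]; values near 1 mean 'isolated quickly', i.e. unusual."""
    mean_depth = sum(path_length(tree, x) for tree in trees) / len(trees)
    return 2.0 ** (-mean_depth / c(sample_size))


# Hypothetical one-tree forest built from a sub-sample of 256 points.
toy_tree = {
    "feature": 0, "threshold": 10.0,
    "left": {
        "feature": 1, "threshold": 5.0,
        "left": {"size": 120},
        "right": {"size": 130},
    },
    "right": {"size": 6},
}

outlier_score = anomaly_score([toy_tree], [50.0, 0.0], sample_size=256)
inlier_score = anomaly_score([toy_tree], [1.0, 1.0], sample_size=256)
print(outlier_score > inlier_score)
```

A point that lands in a small leaf after one split gets a short path and therefore a higher score than a point that ends up deep in the tree among many neighbours.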
Model Parameters
n_estimators = 100 trees
max_samples = auto (min(256, n_samples), i.e. 256 flows per tree here)
contamination = 0.01
scaler = RobustScaler
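
The parameter names above match scikit-learn's `IsolationForest` and `RobustScaler`, so the configuration might look like the sketch below. That the project uses scikit-learn is an assumption, as are the `random_state` value and the synthetic 15-feature matrix standing in for scaled benign flows.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for benign flows with 15 selected features (assumption).
rng = np.random.default_rng(0)
X_benign = rng.normal(size=(1000, 15))

scaler = RobustScaler()  # centers on the median, scales by the IQR
forest = IsolationForest(
    n_estimators=100,     # number of isolation trees
    max_samples="auto",   # resolves to min(256, n_samples) per tree
    contamination=0.01,   # sets the decision threshold at the 1% quantile
    random_state=0,       # assumption: fixed seed for reproducibility
)

X_scaled = scaler.fit_transform(X_benign)
forest.fit(X_scaled)

# predict() returns -1 for flows flagged as anomalous, +1 otherwise.
flagged_fraction = float((forest.predict(X_scaled) == -1).mean())
print(forest.max_samples_, flagged_fraction)
```

Note that `contamination` only positions the threshold on the training scores. It does not change how the trees are built.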
Training Protocol
  1. Load all 8 CSV files and concatenate them into 2.8M flows
  2. Clean: fix whitespace, handle infinities, drop NaN
  3. Select 15 discriminative features
  4. Train/test split: train on 70% of the benign flows; test on the remaining 30% of benign flows plus ALL attack flows
  5. Fit RobustScaler on training data only
  6. Fit Isolation Forest on scaled benign data
  7. Evaluate on test set
  8. Export model as JSON for browser inference
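
Steps 2 and 4 through 7 can be sketched as follows, assuming pandas and scikit-learn. The helper names `clean` and `split_benign`, the two column names, and the synthetic frame are hypothetical. Loading the real CSVs, feature selection, and the JSON export are left out.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Strip whitespace from headers and labels, turn +/-inf into NaN, drop NaN rows."""
    df = df.copy()
    df.columns = df.columns.str.strip()
    df["Label"] = df["Label"].str.strip()
    df = df.replace([np.inf, -np.inf], np.nan)
    return df.dropna().reset_index(drop=True)


def split_benign(df: pd.DataFrame, feature_cols: list, seed: int = 0):
    """Train on 70% of benign rows; test on the other 30% plus every attack row."""
    benign = df[df["Label"] == "BENIGN"]
    attacks = df[df["Label"] != "BENIGN"]
    train, held_out = train_test_split(benign, test_size=0.30, random_state=seed)
    test = pd.concat([held_out, attacks])
    y_test = (test["Label"] != "BENIGN").astype(int).to_numpy()  # 1 = attack
    return train[feature_cols].to_numpy(), test[feature_cols].to_numpy(), y_test


# Synthetic stand-in for the concatenated CSVs (values are made up).
rng = np.random.default_rng(0)
n_benign, n_attack = 2000, 200
frame = pd.DataFrame({
    " Flow Duration": np.concatenate(
        [rng.normal(100, 10, n_benign), rng.normal(400, 10, n_attack)]),
    "Flow Bytes/s": np.concatenate(
        [rng.normal(50, 5, n_benign), rng.normal(500, 5, n_attack)]),
    " Label": ["BENIGN"] * n_benign + ["DDoS"] * n_attack,
})
frame.loc[0, "Flow Bytes/s"] = np.inf    # simulate a divide-by-zero artifact
frame.loc[1, " Flow Duration"] = np.nan  # simulate a missing value

features = ["Flow Duration", "Flow Bytes/s"]
X_train, X_test, y_test = split_benign(clean(frame), features)

scaler = RobustScaler().fit(X_train)  # fitted on benign training rows only
forest = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
forest.fit(scaler.transform(X_train))

y_pred = (forest.predict(scaler.transform(X_test)) == -1).astype(int)
recall = recall_score(y_test, y_pred)  # share of attack rows that were flagged
print(X_train.shape, X_test.shape, recall)
```

Fitting the scaler on the training rows alone keeps test-set statistics, including those of the attack flows, from leaking into the model.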
Results
What Works Well
Volumetric and rate-based attacks (DDoS, DoS, Port Scan) detected with high recall. These produce genuinely anomalous traffic.
Harder Cases
Low-volume attacks (Infiltration, some Web Attacks) are tougher. These mimic normal traffic and look statistically similar to benign browsing.
Why
Tradeoff: signature IDS catches known attacks but misses zero-days. Anomaly detection catches novel attacks but has incomplete recall on stealthy known ones.
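
A per-attack breakdown like the one described above could be computed as in this sketch. The helper `recall_by_attack` is hypothetical, and the labels and flags are toy values chosen only to exercise the function. They are not the project's results.

```python
from collections import defaultdict


def recall_by_attack(labels, flagged):
    """Fraction of flows flagged as anomalous, per attack label (benign skipped)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for label, is_flagged in zip(labels, flagged):
        if label == "BENIGN":
            continue
        totals[label] += 1
        hits[label] += int(is_flagged)
    return {label: hits[label] / totals[label] for label in totals}


# Toy inputs, not measured data.
toy_labels = ["DDoS", "DDoS", "PortScan", "Infiltration", "Infiltration", "BENIGN"]
toy_flags = [True, True, True, False, True, False]

per_class = recall_by_attack(toy_labels, toy_flags)
print(per_class)
```

Reporting recall per label rather than one aggregate number keeps the large volumetric classes from hiding weak recall on rare, stealthy ones.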
Limitations
  1. No temporal modeling: Flows are treated independently, so sequences of low-volume probes might be missed.
  2. Fixed threshold: A single global detection threshold is used. It should be tunable in production.
  3. Concept drift: Normal traffic patterns change over time, so the model needs periodic retraining.
  4. Feature extraction: CICFlowMeter extracts features only after a flow completes. Live deployment needs a streaming extractor.
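
One way to make the threshold tunable is to derive it from a target false-positive rate on held-out benign scores. The sketch below assumes scores where higher means more anomalous (the paper's convention; scikit-learn's `score_samples` uses the opposite sign). The helper `threshold_for_fpr` and the evenly spaced scores are hypothetical.

```python
def threshold_for_fpr(benign_scores, target_fpr):
    """Pick a threshold so that at most target_fpr of benign scores exceed it."""
    if not benign_scores:
        raise ValueError("need at least one benign score")
    if not 0.0 <= target_fpr < 1.0:
        raise ValueError("target_fpr must be in [0, 1)")
    ordered = sorted(benign_scores)
    allowed = int(target_fpr * len(ordered))  # benign flows allowed above the cut
    return ordered[len(ordered) - 1 - allowed]


# Hypothetical held-out benign scores: 0.00, 0.01, ..., 0.99.
validation_scores = [i / 100 for i in range(100)]

threshold_5pct = threshold_for_fpr(validation_scores, 0.05)
threshold_1pct = threshold_for_fpr(validation_scores, 0.01)

flagged_5pct = sum(score > threshold_5pct for score in validation_scores)
flagged_1pct = sum(score > threshold_1pct for score in validation_scores)
print(threshold_5pct, flagged_5pct, threshold_1pct, flagged_1pct)
```

A flow is flagged when its score is strictly greater than the threshold, so a lower target rate moves the cut upward and trades recall for fewer false alarms.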
Future Work
LSTM/autoencoder for sequence-aware detection
Online learning for concept drift adaptation
Ensemble: Isolation Forest + Local Outlier Factor
Real-time feature extraction (Apache Flink)
Interactive threshold tuning per security policy
References
  • Sharafaldin et al. (2018). "Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization." ICISSP 2018.
  • Liu, Ting, Zhou (2008). "Isolation Forest." ICDM 2008.
  • UNB CICIDS2017 Dataset