Methodology
How Overwatch detects network intrusions using anomaly-based machine learning.
Problem Statement
Network intrusion detection is critical but challenging. Traditional signature-based systems (like Snort rules) can only catch known attack patterns. They require humans to write rules first, and they're completely blind to zero-day attacks.
Anomaly-based detection inverts this: learn what normal network traffic looks like, then flag anything that deviates. This approach can theoretically catch attacks it has never seen before.
Dataset: CICIDS2017
This model is trained on the CICIDS2017 dataset from the Canadian Institute for Cybersecurity.
Size: 2.8M flows
Features: 78 per flow
Labels: Benign + 14 attacks
Duration: Mon–Fri capture
Model: Isolation Forest
We use Isolation Forest, an ensemble method designed for anomaly detection. It suits this use case because:
One-Class Learning
Trained exclusively on benign traffic, so it learns normality rather than attack patterns. In principle it can flag zero-day attacks that deviate from that baseline.
Browser-Portable
Just a collection of binary trees. Scoring is tree traversal + averaging. No ML runtime required.
Interpretable
The anomaly score is derived from the average path length across the trees. The intuition: a flow that can be isolated in only a few splits is unusual.
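To make "tree traversal + averaging" concrete, here is a minimal sketch of path-length scoring in the style of Liu, Ting, and Zhou (2008). The nested-dict tree layout (`feature`, `threshold`, `left`, `right`, `size`) and the `<=` split convention are assumptions for illustration, not Overwatch's actual export format.

```python
import math

EULER_GAMMA = 0.5772156649015329


def c(n: int) -> float:
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def path_length(node: dict, x: list, depth: int = 0) -> float:
    """Walk one tree; a leaf adds c(size) for the points it still holds."""
    if "size" in node:  # leaf node (hypothetical layout)
        return depth + c(node["size"])
    go_left = x[node["feature"]] <= node["threshold"]
    return path_length(node["left"] if go_left else node["right"], x, depth + 1)


def anomaly_score(trees: list, x: list, sample_size: int) -> float:
    """Score in (0, 1]; values near 1 mean 'isolated quickly', i.e. unusual."""
    mean_depth = sum(path_length(t, x) for t in trees) / len(trees)
    return 2.0 ** (-mean_depth / c(sample_size))


# Toy one-tree forest: a single split separates one extreme point from the rest.
toy_tree = {
    "feature": 0,
    "threshold": 10.0,
    "left": {"size": 255},
    "right": {"size": 1},
}
outlier_score = anomaly_score([toy_tree], [50.0], sample_size=256)
inlier_score = anomaly_score([toy_tree], [1.0], sample_size=256)
```

In this toy forest the point that lands alone in a leaf after one split scores higher than the point that falls in with the other 255, which is the "easy to isolate means unusual" intuition.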
Model Parameters
n_estimators = 100 trees
max_samples = auto (at most 256 samples per tree)
contamination = 0.01
scaler = RobustScaler
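The parameter names above match scikit-learn's `IsolationForest` and `RobustScaler`, so the configuration can be sketched as follows, assuming scikit-learn is the training library (the text does not state this explicitly). The `random_state` argument, the `build_detector` helper, and the synthetic data are illustrative additions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler


def build_detector(random_state: int = 0):
    """Scaler followed by an Isolation Forest with the parameters listed above."""
    return make_pipeline(
        RobustScaler(),
        IsolationForest(
            n_estimators=100,
            max_samples="auto",  # scikit-learn resolves this to min(256, n_samples)
            contamination=0.01,
            random_state=random_state,  # assumption: added for reproducibility
        ),
    )


rng = np.random.default_rng(0)
benign = rng.normal(size=(1000, 15))  # synthetic stand-in for 15 flow features
detector = build_detector().fit(benign)
labels = detector.predict(benign)  # +1 = inlier, -1 = flagged as anomaly
```

With `contamination = 0.01`, the decision threshold is placed so that roughly 1% of the training flows fall on the anomalous side.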
Training Protocol
- 01 Load all 8 CSV files and concatenate them into 2.8M flows
- 02 Clean: fix whitespace, handle infinities, drop NaN
- 03 Select 15 discriminative features
- 04 Train/test split: train on 70% of benign flows; test on the remaining 30% of benign flows plus ALL attack flows
- 05 Fit RobustScaler on training data only
- 06 Fit Isolation Forest on scaled benign data
- 07 Evaluate on test set
- 08 Export model as JSON for browser inference
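The protocol above can be sketched end to end on synthetic data. This is an illustration under stated assumptions, not the project's training script: it assumes pandas and scikit-learn, a `Label` column with the value `BENIGN`, and a hypothetical JSON layout (`center`, `scale`, `offset`, `trees`). The feature names `f0`..`f2` and the `DDoS` rows are made up.

```python
import json

import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Step 02: strip header whitespace, turn +/-inf into NaN, drop NaN rows."""
    df = df.rename(columns=lambda name: name.strip())
    return df.replace([np.inf, -np.inf], np.nan).dropna()


def split(df, features, label_col="Label", benign="BENIGN", seed=0):
    """Step 04: 70% of benign rows for training; the rest plus all attacks for test."""
    benign_df = df[df[label_col] == benign]
    attacks = df[df[label_col] != benign]
    train, held_out = train_test_split(benign_df, train_size=0.7, random_state=seed)
    test = pd.concat([held_out, attacks])
    y_test = (test[label_col] != benign).astype(int).to_numpy()  # 1 = attack
    return train[features].to_numpy(), test[features].to_numpy(), y_test


def export_tree(estimator) -> dict:
    """Step 08: flatten one fitted tree into JSON-friendly arrays (hypothetical layout)."""
    t = estimator.tree_
    return {
        "feature": t.feature.tolist(),
        "threshold": t.threshold.tolist(),
        "left": t.children_left.tolist(),
        "right": t.children_right.tolist(),
        "n_samples": t.n_node_samples.tolist(),
    }


# Synthetic stand-in for the concatenated CSVs (step 01) and feature choice (step 03).
rng = np.random.default_rng(0)
features = ["f0", "f1", "f2"]
benign_rows = pd.DataFrame(rng.normal(0, 1, (2000, 3)), columns=features)
benign_rows["Label"] = "BENIGN"
attack_rows = pd.DataFrame(rng.normal(8, 1, (200, 3)), columns=features)
attack_rows["Label"] = "DDoS"
raw = pd.concat([benign_rows, attack_rows], ignore_index=True)
raw.loc[0, "f0"] = np.inf                  # mimic an infinite rate value
raw = raw.rename(columns={"f1": " f1"})    # mimic stray header whitespace

df = clean(raw)
X_train, X_test, y_test = split(df, features)
scaler = RobustScaler().fit(X_train)                                  # step 05
forest = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
forest.fit(scaler.transform(X_train))                                 # step 06
pred = (forest.predict(scaler.transform(X_test)) == -1).astype(int)   # 1 = flagged
attack_recall = recall_score(y_test, pred)                            # step 07
model_json = json.dumps({                                             # step 08
    "center": scaler.center_.tolist(),
    "scale": scaler.scale_.tolist(),
    "offset": float(forest.offset_),
    "trees": [export_tree(e) for e in forest.estimators_],
})
```

Fitting the scaler and the forest on `X_train` only keeps attack traffic and held-out benign flows from leaking into the learned notion of "normal".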
Results
What Works Well
Volumetric and rate-based attacks (DDoS, DoS, Port Scan) are detected with high recall because they produce genuinely anomalous traffic.
Harder Cases
Low-volume attacks (Infiltration, some Web Attacks) are tougher. These mimic normal traffic and look statistically similar to benign browsing.
Why
Tradeoff: signature IDS catches known attacks but misses zero-days. Anomaly detection catches novel attacks but has incomplete recall on stealthy known ones.
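The contrast between attack families in this section is naturally measured as recall per attack label. A minimal sketch follows; `recall_by_label` is a hypothetical helper, and the inputs are made-up toy values for illustration, not Overwatch's measured results.

```python
from collections import defaultdict


def recall_by_label(labels, flagged, benign="BENIGN"):
    """Fraction of flows flagged as anomalous, computed separately per attack label."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for label, is_flagged in zip(labels, flagged):
        if label == benign:
            continue  # recall is only defined over attack flows
        totals[label] += 1
        hits[label] += int(is_flagged)
    return {label: hits[label] / totals[label] for label in totals}


# Toy inputs only: five flows with their true labels and the detector's decisions.
toy_labels = ["DDoS", "DDoS", "Infiltration", "Infiltration", "BENIGN"]
toy_flagged = [True, True, True, False, False]
per_label = recall_by_label(toy_labels, toy_flagged)
```

Reporting recall per label rather than one aggregate number keeps a large, easy class such as DDoS from masking weak detection of a small, stealthy one.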
Limitations
- 01 No temporal modeling: Flows are scored independently, so sequences of low-volume probes may be missed.
- 02 Fixed threshold: A single global detection threshold is used; it should be tunable in production.
- 03 Concept drift: Normal traffic patterns change over time, so the model needs periodic retraining.
- 04 Feature extraction: CICFlowMeter extracts features after a flow completes; live deployment needs a streaming extractor.
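One way the fixed-threshold limitation could be relaxed is to derive the threshold from a target false-positive rate on held-out benign scores. The sketch below assumes higher scores mean "more anomalous"; `threshold_for_fpr` and `flag` are hypothetical helpers, not part of Overwatch.

```python
import math


def threshold_for_fpr(benign_scores, target_fpr):
    """Return a threshold that at most target_fpr of the benign scores exceed."""
    if not benign_scores:
        raise ValueError("need at least one benign score")
    if not 0.0 <= target_fpr < 1.0:
        raise ValueError("target_fpr must be in [0, 1)")
    ranked = sorted(benign_scores)
    allowed = math.floor(target_fpr * len(ranked))  # benign flows we may flag
    return ranked[len(ranked) - allowed - 1]


def flag(scores, threshold):
    """Flag every flow whose score is strictly above the threshold."""
    return [s > threshold for s in scores]


# Toy held-out benign scores; a stricter policy would pass a smaller target_fpr.
benign_scores = [i / 10 for i in range(1, 11)]
threshold = threshold_for_fpr(benign_scores, target_fpr=0.2)
false_alarms = sum(flag(benign_scores, threshold))
```

Exposing `target_fpr` rather than a raw score lets an operator express the policy in terms of alert volume, which stays meaningful after the model is retrained.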
Future Work
LSTM/autoencoder for sequence-aware detection
Online learning for concept drift adaptation
Ensemble: Isolation Forest + Local Outlier Factor
Real-time feature extraction (Apache Flink)
Interactive threshold tuning per security policy
References
- Sharafaldin et al. (2018). "Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization." ICISSP 2018.
- Liu, Ting, Zhou (2008). "Isolation Forest." ICDM 2008.
- UNB CICIDS2017 Dataset