Isolation Forest (iForest) is an unsupervised machine learning algorithm specifically designed for anomaly detection. Unlike most anomaly detection methods that try to define a “normal” profile and look for deviations, Isolation Forest takes the opposite approach: it explicitly identifies anomalies by isolating them from the rest of the data.
The algorithm is based on the premise that anomalies are few and different.
Because they are rare and have attribute values that vary significantly from normal points, they are easier to separate (isolate) using random partitions.
- Random Partitioning: The algorithm selects a random feature, then a random split value between that feature's minimum and maximum values.
- Tree Structure: This process is repeated recursively to create a binary tree called an Isolation Tree (iTree).
- Path Length: An anomaly is typically separated from the rest of the data in far fewer splits than a normal point, so it ends up with a shorter path length (the number of edges from the root to its leaf) in the tree.
- The “Forest”: Since any single tree might be biased, the algorithm builds an ensemble of these trees. The final anomaly score is based on the average path length across the entire forest (a toy sketch follows this list).
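To make the path-length idea concrete, here is a toy sketch of a single iTree and the resulting average path lengths across a small forest. It is illustrative only, not the scikit-learn implementation: the helpers grow_itree and path_length are made-up names, and the real algorithm also adds a correction term c(size) at truncated leaves, which is omitted here.

import numpy as np

def grow_itree(X, rng, depth=0, max_depth=8):
    # Hypothetical helper: grow one isolation tree on a 2-D array X
    n = len(X)
    if n <= 1 or depth >= max_depth:
        return {'size': n}  # external (leaf) node
    f = rng.integers(X.shape[1])           # pick a random feature...
    lo, hi = X[:, f].min(), X[:, f].max()
    if lo == hi:
        return {'size': n}
    split = rng.uniform(lo, hi)            # ...and a random split value
    return {'feature': f, 'split': split,
            'left': grow_itree(X[X[:, f] < split], rng, depth + 1, max_depth),
            'right': grow_itree(X[X[:, f] >= split], rng, depth + 1, max_depth)}

def path_length(x, node, depth=0):
    # Edges traversed from the root until x lands in a leaf
    if 'size' in node:
        return depth
    child = 'left' if x[node['feature']] < node['split'] else 'right'
    return path_length(x, node[child], depth + 1)

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(256, 2))   # dense "normal" cluster
outlier = np.array([8.0, 8.0])             # obvious anomaly
X = np.vstack([normal, outlier])

forest = [grow_itree(X, rng) for _ in range(100)]
avg = lambda p: np.mean([path_length(p, t) for t in forest])
print(f"normal point average path length: {avg(normal[0]):.2f}")
print(f"outlier average path length:      {avg(outlier):.2f}")  # noticeably shorter

On a run like this, the outlier's average path length comes out well below the normal point's, which is exactly the signal the anomaly score formalizes: s(x, n) = 2^(-E[h(x)]/c(n)), where E[h(x)] is the average path length over the forest and c(n) is a normalizing constant.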
Why is this relevant for cybersecurity telemetry / SIEM data?
- Unsupervised: It does not require labeled training data, which makes it ideal for real-world security telemetry, where malicious events are rare and almost never labeled.
- Efficiency: It has linear time complexity with a low constant and a small memory footprint, which lets it handle very large, high-dimensional datasets.
- Sub-sampling: Isolation Forest performs remarkably well on small sub-samples of the data, which helps avoid “masking” (where dense clusters of anomalies conceal one another). A configuration sketch follows this list.
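In scikit-learn, that sub-sampling behavior is controlled by the max_samples parameter (the original paper uses 256 as its default). A minimal configuration sketch; the exact values here are illustrative, not prescriptive:

import numpy as np
from sklearn.ensemble import IsolationForest

# Each of the 200 trees is grown on its own random sub-sample of 256
# rows, which keeps trees shallow and reduces masking effects
iso = IsolationForest(
    n_estimators=200,      # number of iTrees in the forest
    max_samples=256,       # sub-sample size per tree
    contamination='auto',  # threshold follows the original paper
    random_state=42,
)
X = np.random.default_rng(0).normal(size=(10_000, 5))  # stand-in telemetry features
iso.fit(X)
labels = iso.predict(X)   # 1 = inlier, -1 = anomaly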
So, let’s crank out some code.
import pandas as pd
from sklearn.ensemble import IsolationForest
import re

def clean_siem_logs(logs_df):
    logs_df = logs_df.copy()  # avoid mutating the caller's DataFrame
    # Normalize timestamps (unparseable values become NaT) and strip
    # anything that isn't a digit or a dot out of the IP column
    logs_df['timestamp'] = pd.to_datetime(logs_df['timestamp'], errors='coerce')
    logs_df['ip'] = logs_df['ip'].apply(
        lambda x: re.sub(r'[^0-9.]', '', str(x)) if pd.notna(x) else None)
    logs_df = logs_df.drop_duplicates().dropna(subset=['timestamp', 'event'])
    # Anomaly detection (Isolation Forest): one-hot encode event_type,
    # keep numeric severity as-is
    features = pd.get_dummies(logs_df[['event_type', 'severity']])
    iso = IsolationForest(contamination=0.1, random_state=42).fit(features)
    logs_df['anomaly'] = iso.predict(features)  # 1 = inlier, -1 = anomaly
    cleaned = logs_df[logs_df['anomaly'] == 1].drop(columns=['anomaly'])
    return cleaned

# Sample messy SIEM logs
data = {
    'timestamp': ['2026-04-28T10:00:00', '2026-04-28T10:00:00', 'invalid', '2026-04-28 11:00'],
    'ip': ['192.168.1.1', '192.168.1.1', '10.0.0.x', '192.168.1.2'],
    'event': ['login success', 'login success', 'brute force', 'phish attempt'],
    'event_type': ['auth', 'auth', 'intrusion', 'malware'],
    'severity': [1, 1, 5, 4],
}
df = pd.DataFrame(data)
cleaned_df = clean_siem_logs(df)
cleaned_df.to_parquet('cleaned_siem_logs.parquet')  # ready for the data lake (needs pyarrow or fastparquet)
print(cleaned_df)
TLDR?
This code normalizes timestamps and IPs, removes duplicates and NaNs, one-hot encodes the event features, runs Isolation Forest to flag anomalies, keeps only the inlier rows, and saves them as Parquet for a SIEM data lake.
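One caveat worth knowing: predict() collapses everything to a binary -1/1 label. If you would rather rank events for analyst triage, scikit-learn also exposes a continuous score via decision_function. A small self-contained sketch on the same toy events:

import pandas as pd
from sklearn.ensemble import IsolationForest

# Toy feature frame standing in for the encoded SIEM events above
features = pd.get_dummies(pd.DataFrame({
    'event_type': ['auth', 'auth', 'intrusion', 'malware'],
    'severity': [1, 1, 5, 4],
}))
iso = IsolationForest(contamination=0.1, random_state=42).fit(features)

# decision_function: the lower (more negative) the score, the more anomalous
ranked = features.assign(score=iso.decision_function(features))
print(ranked.sort_values('score'))  # most suspicious events first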
Changes to make this production-grade?
- Scale to millions of rows with Dask/Spark
- Add regex-based parsers for 50+ log formats (Syslog, JSON, CEF)
- Integrate real-time streaming (Kafka → Spark)
- Use supervised models + labeled threat data
- Add schema validation, audit logging, and automated retraining
- Deploy as a Docker + FastAPI endpoint with monitoring (a minimal endpoint sketch follows this list)
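For that last item, here is a minimal sketch of what a scoring endpoint might look like. Everything here is an assumption for illustration: the model file iforest.joblib, the Event fields, and the /score route are invented, and the sketch presumes the model was fit on a DataFrame (so feature_names_in_ is available) and pydantic v2.

# app.py - hypothetical scoring service; run with: uvicorn app:app
import joblib
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
# Assumes the fitted forest was saved at train time with joblib.dump(iso, 'iforest.joblib')
model = joblib.load('iforest.joblib')

class Event(BaseModel):
    event_type: str
    severity: int

@app.post('/score')
def score(event: Event):
    # One-hot encode a single event and align its columns with the training
    # features; in production, persist the encoder with the model (e.g. in a Pipeline)
    row = pd.get_dummies(pd.DataFrame([event.model_dump()]))
    row = row.reindex(columns=model.feature_names_in_, fill_value=0)
    return {'anomaly_score': float(model.decision_function(row)[0])}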
