Isolation Forest (iForest) is an unsupervised machine learning algorithm specifically designed for anomaly detection. Unlike most anomaly detection methods that try to define a “normal” profile and look for deviations, Isolation Forest takes the opposite approach: it explicitly identifies anomalies by isolating them from the rest of the data.
The algorithm is based on the premise that anomalies are few and different.
Because they are rare and have attribute values that vary significantly from normal points, they are easier to separate (isolate) using random partitions.
- Random Partitioning: The algorithm selects a random feature, then a random split value between that feature's minimum and maximum values.
- Tree Structure: This process is repeated recursively to create a binary tree called an Isolation Tree (iTree).
- Path Length: An anomaly is typically separated from the rest of the data in far fewer splits than a normal point, so it ends up with a shorter path length (the number of edges from the root to its leaf) in the tree.
- The “Forest”: Since any single tree might be biased, the algorithm builds an ensemble of these trees. The final anomaly score is based on the average path length across the entire forest (a toy sketch follows this list).
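To make the path-length idea concrete, here is a toy sketch of a single iTree and the resulting average path lengths across a small forest. It is illustrative only, not the scikit-learn implementation: the helpers grow_itree and path_length are made-up names, and the real algorithm also adds a correction term c(size) at truncated leaves, which is omitted here.

import numpy as np

def grow_itree(X, rng, depth=0, max_depth=8):
    # Hypothetical helper: grow one isolation tree on a 2-D array X
    n = len(X)
    if n <= 1 or depth >= max_depth:
        return {'size': n}  # external (leaf) node
    f = rng.integers(X.shape[1])           # pick a random feature...
    lo, hi = X[:, f].min(), X[:, f].max()
    if lo == hi:
        return {'size': n}
    split = rng.uniform(lo, hi)            # ...and a random split value
    return {'feature': f, 'split': split,
            'left': grow_itree(X[X[:, f] < split], rng, depth + 1, max_depth),
            'right': grow_itree(X[X[:, f] >= split], rng, depth + 1, max_depth)}

def path_length(x, node, depth=0):
    # Edges traversed from the root until x lands in a leaf
    if 'size' in node:
        return depth
    child = 'left' if x[node['feature']] < node['split'] else 'right'
    return path_length(x, node[child], depth + 1)

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(256, 2))   # dense "normal" cluster
outlier = np.array([8.0, 8.0])             # obvious anomaly
X = np.vstack([normal, outlier])

forest = [grow_itree(X, rng) for _ in range(100)]
avg = lambda p: np.mean([path_length(p, t) for t in forest])
print(f"normal point average path length: {avg(normal[0]):.2f}")
print(f"outlier average path length:      {avg(outlier):.2f}")  # noticeably shorter

On a run like this, the outlier's average path length comes out well below the normal point's, which is exactly the signal the anomaly score formalizes: s(x, n) = 2^(-E[h(x)]/c(n)), where E[h(x)] is the average path length over the forest and c(n) is a normalizing constant.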
Why is this relevant for cybersecurity telemetry / SIEM data?
- Unsupervised: It does not require labeled training data, which makes it ideal for real-world security telemetry, where malicious events are rare and almost never labeled.
- Efficiency: It has linear time complexity with a low constant and a small memory footprint, which lets it handle very large, high-dimensional datasets.
- Sub-sampling: Isolation Forest performs remarkably well on small sub-samples of the data, which helps avoid “masking” (where dense clusters of anomalies conceal one another). A configuration sketch follows this list.
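In scikit-learn, that sub-sampling behavior is controlled by the max_samples parameter (the original paper uses 256 as its default). A minimal configuration sketch; the exact values here are illustrative, not prescriptive:

import numpy as np
from sklearn.ensemble import IsolationForest

# Each of the 200 trees is grown on its own random sub-sample of 256
# rows, which keeps trees shallow and reduces masking effects
iso = IsolationForest(
    n_estimators=200,      # number of iTrees in the forest
    max_samples=256,       # sub-sample size per tree
    contamination='auto',  # threshold follows the original paper
    random_state=42,
)
X = np.random.default_rng(0).normal(size=(10_000, 5))  # stand-in telemetry features
iso.fit(X)
labels = iso.predict(X)   # 1 = inlier, -1 = anomaly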
So, let’s crank out some code.
import pandas as pd
from sklearn.ensemble import IsolationForest
import re

def clean_siem_logs(logs_df):
    logs_df = logs_df.copy()  # avoid mutating the caller's DataFrame
    # Normalize timestamps (unparseable values become NaT) and strip
    # anything that isn't a digit or a dot out of the IP column
    logs_df['timestamp'] = pd.to_datetime(logs_df['timestamp'], errors='coerce')
    logs_df['ip'] = logs_df['ip'].apply(
        lambda x: re.sub(r'[^0-9.]', '', str(x)) if pd.notna(x) else None)
    logs_df = logs_df.drop_duplicates().dropna(subset=['timestamp', 'event'])
    # Anomaly detection (Isolation Forest): one-hot encode event_type,
    # keep numeric severity as-is
    features = pd.get_dummies(logs_df[['event_type', 'severity']])
    iso = IsolationForest(contamination=0.1, random_state=42).fit(features)
    logs_df['anomaly'] = iso.predict(features)  # 1 = inlier, -1 = anomaly
    cleaned = logs_df[logs_df['anomaly'] == 1].drop(columns=['anomaly'])
    return cleaned

# Sample messy SIEM logs
data = {
    'timestamp': ['2026-04-28T10:00:00', '2026-04-28T10:00:00', 'invalid', '2026-04-28 11:00'],
    'ip': ['192.168.1.1', '192.168.1.1', '10.0.0.x', '192.168.1.2'],
    'event': ['login success', 'login success', 'brute force', 'phish attempt'],
    'event_type': ['auth', 'auth', 'intrusion', 'malware'],
    'severity': [1, 1, 5, 4],
}
df = pd.DataFrame(data)
cleaned_df = clean_siem_logs(df)
cleaned_df.to_parquet('cleaned_siem_logs.parquet')  # ready for the data lake (needs pyarrow or fastparquet)
print(cleaned_df)
TLDR?
This code normalizes timestamps and IPs, removes duplicates and NaNs, one-hot encodes the event features, runs Isolation Forest to flag anomalies, keeps only the inlier rows, and saves them as Parquet for a SIEM data lake.
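One caveat worth knowing: predict() collapses everything to a binary -1/1 label. If you would rather rank events for analyst triage, scikit-learn also exposes a continuous score via decision_function. A small self-contained sketch on the same toy events:

import pandas as pd
from sklearn.ensemble import IsolationForest

# Toy feature frame standing in for the encoded SIEM events above
features = pd.get_dummies(pd.DataFrame({
    'event_type': ['auth', 'auth', 'intrusion', 'malware'],
    'severity': [1, 1, 5, 4],
}))
iso = IsolationForest(contamination=0.1, random_state=42).fit(features)

# decision_function: the lower (more negative) the score, the more anomalous
ranked = features.assign(score=iso.decision_function(features))
print(ranked.sort_values('score'))  # most suspicious events first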
Changes to make this production-grade?
- Scale to millions of rows with Dask/Spark
- Add regex-based parsers for 50+ log formats (Syslog, JSON, CEF)
- Integrate real-time streaming (Kafka → Spark)
- Use supervised models + labeled threat data
- Add schema validation, audit logging, and automated retraining
- Deploy as a Docker + FastAPI endpoint with monitoring (a minimal endpoint sketch follows this list)
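For that last item, here is a minimal sketch of what a scoring endpoint might look like. Everything here is an assumption for illustration: the model file iforest.joblib, the Event fields, and the /score route are invented, and the sketch presumes the model was fit on a DataFrame (so feature_names_in_ is available) and pydantic v2.

# app.py - hypothetical scoring service; run with: uvicorn app:app
import joblib
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
# Assumes the fitted forest was saved at train time with joblib.dump(iso, 'iforest.joblib')
model = joblib.load('iforest.joblib')

class Event(BaseModel):
    event_type: str
    severity: int

@app.post('/score')
def score(event: Event):
    # One-hot encode a single event and align its columns with the training
    # features; in production, persist the encoder with the model (e.g. in a Pipeline)
    row = pd.get_dummies(pd.DataFrame([event.model_dump()]))
    row = row.reindex(columns=model.feature_names_in_, fill_value=0)
    return {'anomaly_score': float(model.decision_function(row)[0])}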
