Sujeet Bambawale

Using AI for URL threat classification

What it does: Extracts 7 cybersecurity features from any URL (length, dots, hyphens, @, IP, HTTPS, suspicious TLD), trains Random Forest classifier on them, saves model, predicts THREAT/SAFE. Good at: Interpretable features for phishing detection, fast training/prediction, works on tiny datasets. Potential failures: Overfits on small sample data, misses advanced phishing (homoglyphs, JS redirects), no…

April 27, 2026

ai, artificial-intelligence, classification, data-science, machine-learning, random forest, technology, threat, URL

Phishing URLs are getting more convincing. Isolating the common characteristics of suspect links gives us seven features:

Length
Dots
Hyphens
The “at” symbol
An IP address
The lack of HTTPS
A “suspicious” top-level domain (TLD), such as .ru, .cn, .top or .xyz

A Random Forest Classifier is a supervised machine learning algorithm that constructs an “ensemble” of multiple decision trees to make more accurate and stable predictions. While a single decision tree is prone to overfitting, a Random Forest reduces this by combining the results of many trees, typically through a majority voting system. It relies on two primary techniques to keep the trees in the forest diverse and uncorrelated, then combines their outputs by voting:

Bootstrap Aggregating (Bagging): Each tree is trained on a random bootstrap sample of the original data (sampling with replacement). This means some data points appear multiple times in one tree’s training set while being excluded from another.
Feature Randomization: When splitting a node in a tree, the algorithm only considers a random subset of features (for classification, usually the square root of the total number of features). This prevents a single dominant feature from driving the first splits of every tree, so the trees are less correlated with one another and their combined prediction is more stable.
Majority Voting: To classify a new object, each individual tree provides a “vote.” The class that receives the most votes across all trees becomes the final prediction of the forest.
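
The three steps above can be sketched by hand in a few lines. This is a minimal illustration for binary labels (0/1), not how scikit-learn implements its forest internally; the helper names `fit_tiny_forest` and `predict_majority` and the toy data are assumptions made up for this example. Note that scikit-learn's own RandomForestClassifier averages the trees' predicted probabilities rather than counting hard votes, so results can differ slightly.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_tiny_forest(X, y, n_trees=25, seed=0):
    """Hypothetical helper: train n_trees decision trees with bagging."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    trees = []
    for _ in range(n_trees):
        # Bagging: draw n_samples row indices with replacement
        idx = rng.integers(0, n_samples, size=n_samples)
        # Feature randomization: each split considers sqrt(n_features) candidates
        tree = DecisionTreeClassifier(
            max_features="sqrt",
            random_state=int(rng.integers(0, 2**31 - 1)),
        )
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_majority(trees, X):
    """Hypothetical helper: majority vote over binary (0/1) tree predictions."""
    # votes has shape (n_trees, n_samples)
    votes = np.stack([t.predict(X) for t in trees]).astype(int)
    # Class 1 wins when more than half of the trees vote for it
    return (votes.sum(axis=0) * 2 > len(trees)).astype(int)

# Toy data standing in for seven URL features (purely synthetic)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 7))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

trees = fit_tiny_forest(X, y)
preds = predict_majority(trees, X)
print("training accuracy:", (preds == y).mean())
```

An odd number of trees avoids ties in the vote. The accuracy printed here is measured on the training data, so it says nothing about how well the forest generalizes.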

Applying Random Forest to get a “Threat” or “Safe” prediction for URLs could look like:

			
import ipaddress
import pickle
from urllib.parse import urlparse

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

SUSPICIOUS_TLDS = ('.ru', '.cn', '.top', '.xyz')
MODEL_PATH = "cyber_threat_model.pkl"

def is_ip_address(host):
    """Return True if the host is a literal IPv4/IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False

def extract_url_features(url):
    """Turn a URL string into the seven numeric features listed above."""
    host = (urlparse(url).hostname or '').lower()
    return {
        'url_length': len(url),
        'num_dots': url.count('.'),
        'num_hyphens': url.count('-'),
        'has_at': int('@' in url),
        # Check the parsed host, so "example.2024.com" is not mistaken for an IP
        'has_ip': int(is_ip_address(host)),
        'is_https': int(url.lower().startswith('https://')),
        # Match the end of the host, so ".cn" does not fire on "www.cnn.com"
        'suspicious_tld': int(host.endswith(SUSPICIOUS_TLDS)),
    }

# Sample data (expand with real phishing dataset); 1 = threat, 0 = safe
urls = ["http://bank-login.com", "https://google.com", "http://fake-paypal.ru", "https://secure.bank.com", "http://192.168.1.1/login", "http://secure-bank-login.ru"]
labels = [1, 0, 1, 0, 1, 1]
data = pd.DataFrame({'url': urls, 'label': labels})
X = pd.DataFrame([extract_url_features(u) for u in data['url']])
y = data['label']

# With six URLs the test set holds only two samples, so this accuracy is illustrative only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Save the trained model to disk
with open(MODEL_PATH, "wb") as f:
    pickle.dump(model, f)

def predict_threat(url):
    """Load the saved model and classify a single URL."""
    with open(MODEL_PATH, "rb") as f:
        model = pickle.load(f)
    feats = pd.DataFrame([extract_url_features(url)])
    return "THREAT" if model.predict(feats)[0] else "SAFE"

print(predict_threat("http://secure-bank-login.ru"))

		

Potential failures: It overfits on small sample data, misses advanced phishing techniques (homoglyphs, JavaScript redirects), performs no content analysis, and is not production-grade.
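
To illustrate the homoglyph gap, one possible extra feature is a flag for hostnames that contain non-ASCII characters or punycode labels. This is only a rough sketch: the function name `has_homoglyph_risk` is hypothetical, it also flags legitimate internationalized domains, and it does nothing about ASCII lookalikes such as "rn" for "m".

```python
from urllib.parse import urlparse

def has_homoglyph_risk(url):
    """Hypothetical extra feature: 1 if the hostname contains non-ASCII
    characters or a punycode ("xn--") label, else 0."""
    host = (urlparse(url).hostname or "").lower()
    non_ascii = any(ord(ch) > 127 for ch in host)
    punycode = any(label.startswith("xn--") for label in host.split("."))
    return int(non_ascii or punycode)

print(has_homoglyph_risk("https://paypal.com"))         # plain ASCII host
print(has_homoglyph_risk("https://p\u0430ypal.com"))    # Cyrillic "a" (U+0430)
print(has_homoglyph_risk("https://xn--80ak6aa92e.com")) # punycode-encoded host
```

A flag like this could be added as one more key in the feature dictionary, provided the model is retrained on data that includes such URLs.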

Production-grade change considerations:

Train on large real datasets (PhishTank / Kaggle phishing URLs, 100k+ samples).
Add host-based features (WHOIS age, domain reputation via API).
Use a deep model (URLNet or BERT embeddings) instead of Random Forest.
Deploy as a FastAPI service in Docker, with rate limiting and logging.
Add live checks (VirusTotal/URLScan.io API) and a continuous retraining pipeline.
