Overview
Phishing is one of the most widespread and damaging tactics used by attackers, responsible for a large share of security breaches worldwide. This tool analyzes .eml email files to detect phishing indicators by combining rule-based keyword and URL analysis with external threat intelligence from the VirusTotal API.
The tool assigns each email a numeric risk score, delivers a clear verdict, and can optionally check link reputations against VirusTotal's database of known malicious URLs.
Detection Mechanisms
Suspicious Keywords
The body text is scanned for urgency and credential-harvesting language:
- "verify your account", "urgent", "update your password"
- "click here", "login now", "security alert"
- "password expired", "confirm your identity", "unusual activity"
- "restricted", "suspended", "invoice attached"
Dangerous Attachment Extensions
Attachments are flagged if they match high-risk file types:
- Executables: .exe, .scr, .js, .vbs, .bat, .cmd, .ps1, .hta, .lnk, .jar
- Macro-enabled Office: .docm, .xlsm
- Archives: .zip, .rar, .7z, .iso
High-Risk Brands Monitored
Display name spoofing is detected when a sender claims to be one of these brands but the domain doesn't match:
PayPal · Microsoft · Google · Apple · Amazon · Netflix · Bank of America · Wells Fargo · Chase · Office · OneDrive · Outlook · Teams · Instagram · Facebook · TikTok · Coinbase · Binance · ADP · Workday
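The core of this check is comparing brand mentions in the display name against the actual sender domain. A minimal sketch of the idea, using an abbreviated brand list and a made-up sender address (`"PayPal Support" <alerts@secure-pay.example>` is purely illustrative):

```python
from email.utils import parseaddr

HIGH_RISK_BRANDS = ["paypal", "microsoft", "google"]   # abbreviated for the sketch
KNOWN_GOOD_DOMAINS = {"paypal.com", "microsoft.com", "google.com"}

def display_name_spoofed(from_header):
    # Split "Name <addr>" into parts, then compare brand mention vs. real domain.
    name, addr = parseaddr(from_header)
    domain = addr.split("@")[-1].lower() if addr else ""
    mentions_brand = any(b in name.lower() for b in HIGH_RISK_BRANDS)
    return mentions_brand and domain not in KNOWN_GOOD_DOMAINS

print(display_name_spoofed('"PayPal Support" <alerts@secure-pay.example>'))  # True
print(display_name_spoofed('"PayPal" <service@paypal.com>'))                 # False
```

A brand name in the display text is harmless on its own; it only scores when the sending domain fails to back it up.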
Risk Scoring
| Indicator | Points |
|---|---|
| Display name mismatch (brand spoofing) | +2 |
| Reply-To domain differs from From domain | +2 |
| Phishing keyword in body | +1 each |
| Suspicious attachment type | +3 |
| Suspicious URL (typosquat, raw IP, obfuscated) | +3 |
| Link text mismatch or HTML obfuscation | +2 |
| VirusTotal malicious flag | +4 |
| VirusTotal suspicious flag | +2 |

| Score | Verdict |
|---|---|
| ≥ 6 | Likely Phishing |
| ≥ 3 | Needs Review |
| < 3 | Probably Safe |
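As a worked example of the model above (the indicator mix is hypothetical): a spoofed display name (+2), two phishing keywords (+1 each), and one typosquatted URL (+3) total 7 points, crossing the ≥ 6 "Likely Phishing" line. The score-to-verdict mapping can be sketched as:

```python
THRESHOLDS = {"likely_phish": 6, "needs_review": 3}

def verdict(score):
    # Highest threshold wins; anything under 3 is treated as safe.
    if score >= THRESHOLDS["likely_phish"]:
        return "Likely Phishing"
    if score >= THRESHOLDS["needs_review"]:
        return "Needs Review"
    return "Probably Safe"

# Spoofed display name (2) + two keywords (1 + 1) + typosquat URL (3) = 7
print(verdict(2 + 1 + 1 + 3))  # Likely Phishing
print(verdict(2))              # Probably Safe
```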
How It Works
The script performs seven detection passes on each .eml file:
- Display Name Spoofing — Identifies when display names reference high-risk brands but the sender domain doesn't match
- Reply-To Mismatch — Flags instances where the Reply-To domain differs from the From domain
- Keyword Analysis — Searches body text for urgent or credential-stealing language
- Attachment Inspection — Scans for dangerous executable and compressed file types
- HTML Deception Analysis — Detects link text mismatches, JavaScript/data: schemes, and hidden CSS content
- URL Analysis — Checks for typosquatting via Levenshtein distance, raw IP addresses, and obfuscation patterns
- VirusTotal Integration — Optional API checks against known malicious URLs
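The typosquat heuristic in the URL pass can be illustrated in isolation. This sketch mirrors the edit-distance comparison the script performs; `micros0ft` and `examplecorp` are fabricated test domains:

```python
def levenshtein(a, b):
    # Dynamic-programming edit distance (insertions, deletions, substitutions).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]

BRANDS = ["paypal", "microsoft", "google", "amazon"]  # abbreviated for the sketch

def suspected_brand(base):
    # One edit away from a brand (but not the brand itself) is suspicious.
    for b in BRANDS:
        if base != b and levenshtein(base, b) <= 1:
            return b
    return None

print(suspected_brand("micros0ft"))    # microsoft ("0" substituted for "o")
print(suspected_brand("examplecorp"))  # None
```

The full implementation additionally widens the net to edit distance 2 when the domain contains confusable characters like "0", "1", or "l".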
Source Code
```python
import os
import re
import json
import base64
import argparse
from email import policy
from email.parser import BytesParser
from email.header import decode_header, make_header
from email.utils import parseaddr

from bs4 import BeautifulSoup
import tldextract
import requests

# --------------------------
# Config
# --------------------------
SUSPICIOUS_KEYWORDS = [
    "verify your account", "urgent", "update your password", "click here",
    "login now", "security alert", "password expired", "confirm your identity",
    "unusual activity", "restricted", "suspended", "invoice attached"
]

SUSPICIOUS_EXTENSIONS = [
    ".exe", ".scr", ".js", ".vbs", ".bat", ".cmd", ".ps1", ".hta", ".lnk", ".jar",
    ".docm", ".xlsm", ".zip", ".rar", ".7z", ".iso"
]

KNOWN_GOOD_DOMAINS = {
    "microsoft.com", "google.com", "apple.com", "amazon.com", "outlook.com",
    "workday.com", "adp.com", "wellsfargo.com", "chase.com", "bankofamerica.com", "paypal.com"
}

HIGH_RISK_BRANDS = [
    "paypal", "microsoft", "google", "apple", "amazon", "netflix", "bankofamerica",
    "wellsfargo", "chase", "office", "onedrive", "outlook", "teams", "instagram",
    "facebook", "tiktok", "coinbase", "binance", "adp", "workday"
]

WEIGHTS = {
    "display_name_mismatch": 2,
    "replyto_mismatch": 2,
    "keyword": 1,
    "suspicious_attachment": 3,
    "suspicious_url": 3,
    "link_mismatch_or_obfuscation": 2,
    "vt_malicious": 4,
    "vt_suspicious": 2,
}

THRESHOLDS = {
    "likely_phish": 6,
    "needs_review": 3
}

# --------------------------
# Helpers
# --------------------------
def safe_decode(value):
    """Decode RFC 2047 encoded headers, falling back to the raw value."""
    if not value:
        return ""
    try:
        return str(make_header(decode_header(value)))
    except Exception:
        return value


def extract_text_and_html(msg):
    text = ""
    html = ""
    if msg.is_multipart():
        for part in msg.walk():
            ctype = part.get_content_type()
            if ctype == "text/plain":
                try:
                    text += part.get_content()
                except Exception:
                    pass
            elif ctype == "text/html":
                try:
                    html += part.get_content()
                except Exception:
                    pass
    else:
        ctype = msg.get_content_type()
        if ctype == "text/plain":
            text = msg.get_content()
        elif ctype == "text/html":
            html = msg.get_content()
    # If there is no plain-text part, fall back to stripped HTML.
    if not text and html:
        soup = BeautifulSoup(html, "lxml")
        text = soup.get_text("\n")
    return text, html


def extract_urls(text, html):
    urls = set()
    for u in re.findall(r'(https?://[^\s"<>\)]+)', text, re.IGNORECASE):
        urls.add(u.strip(").,;\"'"))
    if html:
        soup = BeautifulSoup(html, "lxml")
        for a in soup.find_all("a", href=True):
            urls.add(a["href"])
        for tag in soup.find_all(src=True):
            urls.add(tag["src"])
    return list(urls)


def domain_from_url(url):
    ext = tldextract.extract(url)
    if ext.suffix:
        return f"{ext.domain}.{ext.suffix}".lower()
    return ext.domain.lower()


def levenshtein(a, b):
    """Iterative edit distance (insertions, deletions, substitutions)."""
    if a == b:
        return 0
    if not a:
        return len(b)
    if not b:
        return len(a)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def looks_like_typosquat(domain):
    ext = tldextract.extract(domain)
    base = ext.domain.lower()
    full = f"{ext.domain}.{ext.suffix}".lower() if ext.suffix else base
    if full in KNOWN_GOOD_DOMAINS:
        return False, None
    # Domains containing confusable characters get a wider edit-distance net.
    if re.search(r'[10oOIl]', base):
        for b in HIGH_RISK_BRANDS:
            if levenshtein(base, b) <= 2:
                return True, b
    for b in HIGH_RISK_BRANDS:
        if base != b and levenshtein(base, b) <= 1:
            return True, b
    return False, None

# --------------------------
# Checks
# --------------------------
def check_display_name_spoof(from_header):
    score, indicators = 0, []
    name, addr = parseaddr(from_header or "")
    dom = addr.split("@")[-1].lower() if addr else ""
    if name and any(b in name.lower() for b in HIGH_RISK_BRANDS):
        if dom and dom not in KNOWN_GOOD_DOMAINS:
            score += WEIGHTS["display_name_mismatch"]
            indicators.append(f"Display-name spoofing: {name} <{addr}>")
    if dom and any(dom.endswith(f) for f in ["gmail.com", "outlook.com", "yahoo.com"]):
        if re.search(r"(support|billing|payroll|hr|helpdesk|security|admin)", (name or ""), re.I):
            score += WEIGHTS["display_name_mismatch"]
            indicators.append(f"Corporate-sounding display name on freemail: {name} <{addr}>")
    return score, indicators


def check_replyto_mismatch(from_header, replyto_header):
    score, indicators = 0, []
    _, faddr = parseaddr(from_header or "")
    _, raddr = parseaddr(replyto_header or "")
    if faddr and raddr:
        fdom = faddr.split("@")[-1].lower()
        rdom = raddr.split("@")[-1].lower()
        if fdom != rdom:
            score += WEIGHTS["replyto_mismatch"]
            indicators.append(f"Reply-To domain differs from From domain: {fdom} -> {rdom}")
    return score, indicators


def check_keywords(body_text):
    score, indicators = 0, []
    lower = (body_text or "").lower()
    for kw in SUSPICIOUS_KEYWORDS:
        if kw in lower:
            score += WEIGHTS["keyword"]
            indicators.append(f"Phishing language: '{kw}'")
    return score, indicators


def analyze_html_for_tricks(html):
    score, indicators = 0, []
    urls_from_html = []
    if not html:
        return score, indicators, urls_from_html
    soup = BeautifulSoup(html, "lxml")
    for a in soup.find_all("a", href=True):
        href = a["href"].strip()
        text = a.get_text(strip=True)
        if href.lower().startswith(("javascript:", "data:")):
            score += WEIGHTS["link_mismatch_or_obfuscation"]
            indicators.append(f"Suspicious link scheme: {href[:30]}...")
        if text and "@" in text and href.lower().startswith("http"):
            score += WEIGHTS["link_mismatch_or_obfuscation"]
            indicators.append("Visible email text links to http URL")
        if text:
            # Compare any domain-looking strings in the anchor text to the real target.
            disp = re.findall(r'([a-z0-9\-]+\.[a-z\.]{2,})', text.lower())
            real = domain_from_url(href) if href.startswith("http") else None
            if real and disp and all(d not in real for d in disp):
                score += WEIGHTS["link_mismatch_or_obfuscation"]
                indicators.append(f"Displayed URL differs from destination: '{text}' -> {href}")
        if href.startswith(("http://", "https://")):
            urls_from_html.append(href)
    for tag in soup.find_all(True):
        styles = tag.get("style", "")
        if re.search(r"display\s*:\s*none|visibility\s*:\s*hidden", styles, re.I):
            score += WEIGHTS["link_mismatch_or_obfuscation"]
            indicators.append("Hidden content via CSS")
    return score, indicators, urls_from_html


def check_urls(urls):
    score, indicators, suspicious_urls = 0, [], []
    for url in urls:
        dom = domain_from_url(url)
        sus, brand = looks_like_typosquat(dom)
        if sus:
            score += WEIGHTS["suspicious_url"]
            indicators.append(f"Typosquatting suspected '{dom}' (brand: {brand})")
            suspicious_urls.append(url)
        if re.match(r"^https?://\d{1,3}(\.\d{1,3}){3}", url):
            score += WEIGHTS["suspicious_url"]
            indicators.append(f"URL uses raw IP: {url}")
            suspicious_urls.append(url)
        if url.count("%") > 5 or "@" in url:
            score += WEIGHTS["suspicious_url"]
            indicators.append(f"Obfuscated/unusual URL: {url}")
            suspicious_urls.append(url)
    return score, indicators, suspicious_urls


def check_attachments(msg):
    score, indicators = 0, []
    for part in msg.iter_attachments():
        filename = part.get_filename()
        if filename and any(filename.lower().endswith(ext) for ext in SUSPICIOUS_EXTENSIONS):
            score += WEIGHTS["suspicious_attachment"]
            indicators.append(f"Dangerous attachment: {filename}")
    return score, indicators


def vt_url_reputation(urls, api_key):
    score, indicators = 0, []
    if not api_key or not urls:
        return score, indicators
    session = requests.Session()
    session.headers.update({"x-apikey": api_key})
    for url in urls:
        try:
            # VT API v3 identifies URLs by unpadded URL-safe base64.
            url_id = base64.urlsafe_b64encode(url.encode()).decode().strip("=")
            r = session.get(f"https://www.virustotal.com/api/v3/urls/{url_id}", timeout=10)
            if r.status_code == 404:
                # Not yet known to VT: submit it for analysis and move on.
                session.post("https://www.virustotal.com/api/v3/urls", data={"url": url}, timeout=10)
                continue
            if not r.ok:
                continue
            data = r.json().get("data", {}).get("attributes", {})
            stats = data.get("last_analysis_stats", {})
            if stats.get("malicious", 0) >= 1:
                score += WEIGHTS["vt_malicious"]
                indicators.append(f"VirusTotal: malicious for {url} ({stats.get('malicious')})")
            elif stats.get("suspicious", 0) >= 1:
                score += WEIGHTS["vt_suspicious"]
                indicators.append(f"VirusTotal: suspicious for {url} ({stats.get('suspicious')})")
        except Exception:
            continue
    return score, indicators


def analyze_eml(path, use_vt=True, verbose=False):
    with open(path, "rb") as f:
        raw = f.read()
    msg = BytesParser(policy=policy.default).parsebytes(raw)

    subject = safe_decode(msg["Subject"])
    from_h = safe_decode(msg["From"])
    replyto_h = safe_decode(msg["Reply-To"])
    date_h = safe_decode(msg["Date"])

    text, html = extract_text_and_html(msg)
    urls = extract_urls(text, html)

    total, ind = 0, []
    s, i = check_display_name_spoof(from_h); total += s; ind += i
    s, i = check_replyto_mismatch(from_h, replyto_h); total += s; ind += i
    s, i = check_keywords(text); total += s; ind += i
    s, i = check_attachments(msg); total += s; ind += i
    s, i, html_urls = analyze_html_for_tricks(html); total += s; ind += i

    urls = sorted(set(urls + html_urls))
    s, i, sus_urls = check_urls(urls); total += s; ind += i

    vt_key = os.environ.get("VT_API_KEY")
    if use_vt and vt_key:
        s, i = vt_url_reputation(urls, vt_key); total += s; ind += i

    verdict = (
        "Likely Phishing" if total >= THRESHOLDS["likely_phish"]
        else "Needs Review" if total >= THRESHOLDS["needs_review"]
        else "Probably Safe"
    )

    result = {
        "subject": subject, "from": from_h, "reply_to": replyto_h, "date": date_h,
        "score": int(total), "verdict": verdict, "indicators": ind,
        "urls": urls, "suspicious_urls": sus_urls
    }
    if verbose:
        result["body_preview"] = (text[:400] + "...") if text else ""
        result["html_present"] = bool(html)
        result["attachment_count"] = sum(1 for _ in msg.iter_attachments())
    return result

# --------------------------
# CLI
# --------------------------
def main():
    ap = argparse.ArgumentParser(description="Phishing analyzer — rule-based with optional VirusTotal URL reputation.")
    ap.add_argument("eml", help="Path to .eml file")
    ap.add_argument("--json", help="Write JSON report to file")
    ap.add_argument("--no-vt", action="store_true", help="Disable VirusTotal URL checks")
    ap.add_argument("--verbose", action="store_true", help="Extra details in output")
    args = ap.parse_args()

    res = analyze_eml(args.eml, use_vt=not args.no_vt, verbose=args.verbose)
    print(f"Subject: {res['subject']}")
    print(f"From: {res['from']}")
    if res.get("reply_to"):
        print(f"Reply-To: {res['reply_to']}")
    if res.get("date"):
        print(f"Date: {res['date']}")
    print(f"Score: {res['score']}")
    print(f"Verdict: {res['verdict']}")
    print("Indicators:")
    for i in res["indicators"]:
        print(f" - {i}")
    if res["urls"]:
        print("URLs found:")
        for u in res["urls"]:
            print(f" - {u}")
    if args.json:
        with open(args.json, "w", encoding="utf-8") as f:
            json.dump(res, f, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    main()
```
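Typical invocations might look like the following — the filename phish_analyzer.py is an assumption (the script's name is not stated above), and the API key is a placeholder:

```shell
# Rule-based analysis only, skipping VirusTotal lookups
python phish_analyzer.py suspicious.eml --no-vt

# Full analysis with VirusTotal URL reputation and a JSON report
export VT_API_KEY="your-api-key"
python phish_analyzer.py suspicious.eml --verbose --json report.json
```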
Test Results
Analysis 1 — Malware Sample
Verdict: Likely Phishing (Score: 7)
- Obfuscated URL detected: +3 points
- VirusTotal malicious flag: +4 points
- Key finding: URL https://servervirto.com.co/ed/trn/update spoofing a legitimate domain
Analysis 2 — Advanced Phishing Sample
Verdict: Likely Phishing (Score: 11)
- HTTP instead of HTTPS
- Domain spoofing (.af TLD appended to legitimate brand)
- URL redirect mismatch
- VirusTotal malicious confirmation
- Key finding: Multiple deception techniques layered together
Analysis 3 — Safe Email Test
Verdict: Probably Safe (Score: 2)
- Plain text URL detected: +1 point
- Key finding: The weighting system successfully prevents false positives — a single URL detection alone is not enough to flag a legitimate email
Lessons Learned
Phishing remains the most common method attackers use to target individuals and organizations because it exploits people rather than software: humans are consistently the most vulnerable link in the chain. This project demonstrates how automation and scripting can assist security analysts while protecting users from sophisticated attacks.
Proper scoring weights are critical — they reduce false positives while maintaining effective threat detection. A single suspicious indicator shouldn't condemn a legitimate email, but a combination of brand spoofing, suspicious URLs, and known-malicious VirusTotal hits creates a clear signal.