Overview
Phishing is one of the most widespread and damaging tactics used by attackers, responsible for a large share of security breaches worldwide. This tool analyzes .eml email files to detect phishing indicators by combining rule-based keyword and URL analysis with external threat intelligence from the VirusTotal API.
The tool assigns each email a numeric risk score, delivers a clear verdict, and can optionally check link reputations against VirusTotal's database of known malicious URLs.
Detection Mechanisms
Suspicious Keywords
The body text is scanned for urgency and credential-harvesting language:
- "verify your account", "urgent", "update your password"
- "click here", "login now", "security alert"
- "password expired", "confirm your identity", "unusual activity"
- "restricted", "suspended", "invoice attached"
Dangerous Attachment Extensions
Attachments are flagged if they match high-risk file types:
- Executables: .exe, .scr, .js, .vbs, .bat, .cmd, .ps1, .hta, .lnk, .jar
- Macro-enabled Office: .docm, .xlsm
- Archives: .zip, .rar, .7z, .iso
High-Risk Brands Monitored
Display name spoofing is detected when a sender claims to be one of these brands but the domain doesn't match:
PayPal · Microsoft · Google · Apple · Amazon · Netflix · Bank of America · Wells Fargo · Chase · Office · OneDrive · Outlook · Teams · Instagram · Facebook · TikTok · Coinbase · Binance · ADP · Workday
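The core of this check is comparing brand mentions in the display name against the actual sender domain. A minimal sketch of the idea, using an abbreviated brand list and a made-up sender address (`"PayPal Support" <alerts@secure-pay.example>` is purely illustrative):

```python
from email.utils import parseaddr

HIGH_RISK_BRANDS = ["paypal", "microsoft", "google"]   # abbreviated for the sketch
KNOWN_GOOD_DOMAINS = {"paypal.com", "microsoft.com", "google.com"}

def display_name_spoofed(from_header):
    # Split "Name <addr>" into parts, then compare brand mention vs. real domain.
    name, addr = parseaddr(from_header)
    domain = addr.split("@")[-1].lower() if addr else ""
    mentions_brand = any(b in name.lower() for b in HIGH_RISK_BRANDS)
    return mentions_brand and domain not in KNOWN_GOOD_DOMAINS

print(display_name_spoofed('"PayPal Support" <alerts@secure-pay.example>'))  # True
print(display_name_spoofed('"PayPal" <service@paypal.com>'))                 # False
```

A brand name in the display text is harmless on its own; it only scores when the sending domain fails to back it up.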
Risk Scoring
| Indicator | Points |
|---|---|
| Display name mismatch (brand spoofing) | +2 |
| Reply-To domain differs from From domain | +2 |
| Phishing keyword in body | +1 each |
| Suspicious attachment type | +3 |
| Suspicious URL (typosquat, raw IP, obfuscated) | +3 |
| Link text mismatch or HTML obfuscation | +2 |
| VirusTotal malicious flag | +4 |
| VirusTotal suspicious flag | +2 |

| Score | Verdict |
|---|---|
| ≥ 6 | Likely Phishing |
| ≥ 3 | Needs Review |
| < 3 | Probably Safe |
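As a worked example of the model above (the indicator mix is hypothetical): a spoofed display name (+2), two phishing keywords (+1 each), and one typosquatted URL (+3) total 7 points, crossing the ≥ 6 "Likely Phishing" line. The score-to-verdict mapping can be sketched as:

```python
THRESHOLDS = {"likely_phish": 6, "needs_review": 3}

def verdict(score):
    # Highest threshold wins; anything under 3 is treated as safe.
    if score >= THRESHOLDS["likely_phish"]:
        return "Likely Phishing"
    if score >= THRESHOLDS["needs_review"]:
        return "Needs Review"
    return "Probably Safe"

# Spoofed display name (2) + two keywords (1 + 1) + typosquat URL (3) = 7
print(verdict(2 + 1 + 1 + 3))  # Likely Phishing
print(verdict(2))              # Probably Safe
```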
How It Works
The script performs seven detection passes on each .eml file:
- Display Name Spoofing — Identifies when display names reference high-risk brands but the sender domain doesn't match
- Reply-To Mismatch — Flags instances where the Reply-To domain differs from the From domain
- Keyword Analysis — Searches body text for urgent or credential-stealing language
- Attachment Inspection — Scans for dangerous executable and compressed file types
- HTML Deception Analysis — Detects link text mismatches, JavaScript/data: schemes, and hidden CSS content
- URL Analysis — Checks for typosquatting via Levenshtein distance, raw IP addresses, and obfuscation patterns
- VirusTotal Integration — Optional API checks against known malicious URLs
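The typosquat heuristic in the URL pass can be illustrated in isolation. This sketch mirrors the edit-distance comparison the script performs; `micros0ft` and `examplecorp` are fabricated test domains:

```python
def levenshtein(a, b):
    # Dynamic-programming edit distance (insertions, deletions, substitutions).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]

BRANDS = ["paypal", "microsoft", "google", "amazon"]  # abbreviated for the sketch

def suspected_brand(base):
    # One edit away from a brand (but not the brand itself) is suspicious.
    for b in BRANDS:
        if base != b and levenshtein(base, b) <= 1:
            return b
    return None

print(suspected_brand("micros0ft"))    # microsoft ("0" substituted for "o")
print(suspected_brand("examplecorp"))  # None
```

The full implementation additionally widens the net to edit distance 2 when the domain contains confusable characters like "0", "1", or "l".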
Source Code
```python
import os
import re
import json
import base64
import argparse
from email import policy
from email.parser import BytesParser
from email.header import decode_header, make_header
from email.utils import parseaddr

from bs4 import BeautifulSoup
import tldextract
import requests

# --------------------------
# Config
# --------------------------
SUSPICIOUS_KEYWORDS = [
    "verify your account", "urgent", "update your password", "click here",
    "login now", "security alert", "password expired", "confirm your identity",
    "unusual activity", "restricted", "suspended", "invoice attached"
]

SUSPICIOUS_EXTENSIONS = [
    ".exe", ".scr", ".js", ".vbs", ".bat", ".cmd", ".ps1", ".hta", ".lnk", ".jar",
    ".docm", ".xlsm", ".zip", ".rar", ".7z", ".iso"
]

KNOWN_GOOD_DOMAINS = {
    "microsoft.com", "google.com", "apple.com", "amazon.com", "outlook.com",
    "workday.com", "adp.com", "wellsfargo.com", "chase.com", "bankofamerica.com", "paypal.com"
}

HIGH_RISK_BRANDS = [
    "paypal", "microsoft", "google", "apple", "amazon", "netflix", "bankofamerica",
    "wellsfargo", "chase", "office", "onedrive", "outlook", "teams", "instagram",
    "facebook", "tiktok", "coinbase", "binance", "adp", "workday"
]

WEIGHTS = {
    "display_name_mismatch": 2,
    "replyto_mismatch": 2,
    "keyword": 1,
    "suspicious_attachment": 3,
    "suspicious_url": 3,
    "link_mismatch_or_obfuscation": 2,
    "vt_malicious": 4,
    "vt_suspicious": 2,
}

THRESHOLDS = {
    "likely_phish": 6,
    "needs_review": 3
}

# --------------------------
# Helpers
# --------------------------
def safe_decode(value):
    """Decode RFC 2047 encoded headers, falling back to the raw value."""
    if not value:
        return ""
    try:
        return str(make_header(decode_header(value)))
    except Exception:
        return value


def extract_text_and_html(msg):
    text = ""
    html = ""
    if msg.is_multipart():
        for part in msg.walk():
            ctype = part.get_content_type()
            if ctype == "text/plain":
                try:
                    text += part.get_content()
                except Exception:
                    pass
            elif ctype == "text/html":
                try:
                    html += part.get_content()
                except Exception:
                    pass
    else:
        ctype = msg.get_content_type()
        if ctype == "text/plain":
            text = msg.get_content()
        elif ctype == "text/html":
            html = msg.get_content()
    # If there is no plain-text part, fall back to stripped HTML.
    if not text and html:
        soup = BeautifulSoup(html, "lxml")
        text = soup.get_text("\n")
    return text, html


def extract_urls(text, html):
    urls = set()
    for u in re.findall(r'(https?://[^\s"<>\)]+)', text, re.IGNORECASE):
        urls.add(u.strip(").,;\"'"))
    if html:
        soup = BeautifulSoup(html, "lxml")
        for a in soup.find_all("a", href=True):
            urls.add(a["href"])
        for tag in soup.find_all(src=True):
            urls.add(tag["src"])
    return list(urls)


def domain_from_url(url):
    ext = tldextract.extract(url)
    if ext.suffix:
        return f"{ext.domain}.{ext.suffix}".lower()
    return ext.domain.lower()


def levenshtein(a, b):
    """Iterative edit distance (insertions, deletions, substitutions)."""
    if a == b:
        return 0
    if not a:
        return len(b)
    if not b:
        return len(a)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def looks_like_typosquat(domain):
    ext = tldextract.extract(domain)
    base = ext.domain.lower()
    full = f"{ext.domain}.{ext.suffix}".lower() if ext.suffix else base
    if full in KNOWN_GOOD_DOMAINS:
        return False, None
    # Domains containing confusable characters get a wider edit-distance net.
    if re.search(r'[10oOIl]', base):
        for b in HIGH_RISK_BRANDS:
            if levenshtein(base, b) <= 2:
                return True, b
    for b in HIGH_RISK_BRANDS:
        if base != b and levenshtein(base, b) <= 1:
            return True, b
    return False, None

# --------------------------
# Checks
# --------------------------
def check_display_name_spoof(from_header):
    score, indicators = 0, []
    name, addr = parseaddr(from_header or "")
    dom = addr.split("@")[-1].lower() if addr else ""
    if name and any(b in name.lower() for b in HIGH_RISK_BRANDS):
        if dom and dom not in KNOWN_GOOD_DOMAINS:
            score += WEIGHTS["display_name_mismatch"]
            indicators.append(f"Display-name spoofing: {name} <{addr}>")
    if dom and any(dom.endswith(f) for f in ["gmail.com", "outlook.com", "yahoo.com"]):
        if re.search(r"(support|billing|payroll|hr|helpdesk|security|admin)", (name or ""), re.I):
            score += WEIGHTS["display_name_mismatch"]
            indicators.append(f"Corporate-sounding display name on freemail: {name} <{addr}>")
    return score, indicators


def check_replyto_mismatch(from_header, replyto_header):
    score, indicators = 0, []
    _, faddr = parseaddr(from_header or "")
    _, raddr = parseaddr(replyto_header or "")
    if faddr and raddr:
        fdom = faddr.split("@")[-1].lower()
        rdom = raddr.split("@")[-1].lower()
        if fdom != rdom:
            score += WEIGHTS["replyto_mismatch"]
            indicators.append(f"Reply-To domain differs from From domain: {fdom} -> {rdom}")
    return score, indicators


def check_keywords(body_text):
    score, indicators = 0, []
    lower = (body_text or "").lower()
    for kw in SUSPICIOUS_KEYWORDS:
        if kw in lower:
            score += WEIGHTS["keyword"]
            indicators.append(f"Phishing language: '{kw}'")
    return score, indicators


def analyze_html_for_tricks(html):
    score, indicators = 0, []
    urls_from_html = []
    if not html:
        return score, indicators, urls_from_html
    soup = BeautifulSoup(html, "lxml")
    for a in soup.find_all("a", href=True):
        href = a["href"].strip()
        text = a.get_text(strip=True)
        if href.lower().startswith(("javascript:", "data:")):
            score += WEIGHTS["link_mismatch_or_obfuscation"]
            indicators.append(f"Suspicious link scheme: {href[:30]}...")
        if text and "@" in text and href.lower().startswith("http"):
            score += WEIGHTS["link_mismatch_or_obfuscation"]
            indicators.append("Visible email text links to http URL")
        if text:
            # Compare any domain-looking strings in the anchor text to the real target.
            disp = re.findall(r'([a-z0-9\-]+\.[a-z\.]{2,})', text.lower())
            real = domain_from_url(href) if href.startswith("http") else None
            if real and disp and all(d not in real for d in disp):
                score += WEIGHTS["link_mismatch_or_obfuscation"]
                indicators.append(f"Displayed URL differs from destination: '{text}' -> {href}")
        if href.startswith(("http://", "https://")):
            urls_from_html.append(href)
    for tag in soup.find_all(True):
        styles = tag.get("style", "")
        if re.search(r"display\s*:\s*none|visibility\s*:\s*hidden", styles, re.I):
            score += WEIGHTS["link_mismatch_or_obfuscation"]
            indicators.append("Hidden content via CSS")
    return score, indicators, urls_from_html


def check_urls(urls):
    score, indicators, suspicious_urls = 0, [], []
    for url in urls:
        dom = domain_from_url(url)
        sus, brand = looks_like_typosquat(dom)
        if sus:
            score += WEIGHTS["suspicious_url"]
            indicators.append(f"Typosquatting suspected '{dom}' (brand: {brand})")
            suspicious_urls.append(url)
        if re.match(r"^https?://\d{1,3}(\.\d{1,3}){3}", url):
            score += WEIGHTS["suspicious_url"]
            indicators.append(f"URL uses raw IP: {url}")
            suspicious_urls.append(url)
        if url.count("%") > 5 or "@" in url:
            score += WEIGHTS["suspicious_url"]
            indicators.append(f"Obfuscated/unusual URL: {url}")
            suspicious_urls.append(url)
    return score, indicators, suspicious_urls


def check_attachments(msg):
    score, indicators = 0, []
    for part in msg.iter_attachments():
        filename = part.get_filename()
        if filename and any(filename.lower().endswith(ext) for ext in SUSPICIOUS_EXTENSIONS):
            score += WEIGHTS["suspicious_attachment"]
            indicators.append(f"Dangerous attachment: {filename}")
    return score, indicators


def vt_url_reputation(urls, api_key):
    score, indicators = 0, []
    if not api_key or not urls:
        return score, indicators
    session = requests.Session()
    session.headers.update({"x-apikey": api_key})
    for url in urls:
        try:
            # VT API v3 identifies URLs by unpadded URL-safe base64.
            url_id = base64.urlsafe_b64encode(url.encode()).decode().strip("=")
            r = session.get(f"https://www.virustotal.com/api/v3/urls/{url_id}", timeout=10)
            if r.status_code == 404:
                # Not yet known to VT: submit it for analysis and move on.
                session.post("https://www.virustotal.com/api/v3/urls", data={"url": url}, timeout=10)
                continue
            if not r.ok:
                continue
            data = r.json().get("data", {}).get("attributes", {})
            stats = data.get("last_analysis_stats", {})
            if stats.get("malicious", 0) >= 1:
                score += WEIGHTS["vt_malicious"]
                indicators.append(f"VirusTotal: malicious for {url} ({stats.get('malicious')})")
            elif stats.get("suspicious", 0) >= 1:
                score += WEIGHTS["vt_suspicious"]
                indicators.append(f"VirusTotal: suspicious for {url} ({stats.get('suspicious')})")
        except Exception:
            continue
    return score, indicators


def analyze_eml(path, use_vt=True, verbose=False):
    with open(path, "rb") as f:
        raw = f.read()
    msg = BytesParser(policy=policy.default).parsebytes(raw)

    subject = safe_decode(msg["Subject"])
    from_h = safe_decode(msg["From"])
    replyto_h = safe_decode(msg["Reply-To"])
    date_h = safe_decode(msg["Date"])

    text, html = extract_text_and_html(msg)
    urls = extract_urls(text, html)

    total, ind = 0, []
    s, i = check_display_name_spoof(from_h); total += s; ind += i
    s, i = check_replyto_mismatch(from_h, replyto_h); total += s; ind += i
    s, i = check_keywords(text); total += s; ind += i
    s, i = check_attachments(msg); total += s; ind += i
    s, i, html_urls = analyze_html_for_tricks(html); total += s; ind += i

    urls = sorted(set(urls + html_urls))
    s, i, sus_urls = check_urls(urls); total += s; ind += i

    vt_key = os.environ.get("VT_API_KEY")
    if use_vt and vt_key:
        s, i = vt_url_reputation(urls, vt_key); total += s; ind += i

    verdict = (
        "Likely Phishing" if total >= THRESHOLDS["likely_phish"]
        else "Needs Review" if total >= THRESHOLDS["needs_review"]
        else "Probably Safe"
    )

    result = {
        "subject": subject, "from": from_h, "reply_to": replyto_h, "date": date_h,
        "score": int(total), "verdict": verdict, "indicators": ind,
        "urls": urls, "suspicious_urls": sus_urls
    }
    if verbose:
        result["body_preview"] = (text[:400] + "...") if text else ""
        result["html_present"] = bool(html)
        result["attachment_count"] = sum(1 for _ in msg.iter_attachments())
    return result

# --------------------------
# CLI
# --------------------------
def main():
    ap = argparse.ArgumentParser(description="Phishing analyzer — rule-based with optional VirusTotal URL reputation.")
    ap.add_argument("eml", help="Path to .eml file")
    ap.add_argument("--json", help="Write JSON report to file")
    ap.add_argument("--no-vt", action="store_true", help="Disable VirusTotal URL checks")
    ap.add_argument("--verbose", action="store_true", help="Extra details in output")
    args = ap.parse_args()

    res = analyze_eml(args.eml, use_vt=not args.no_vt, verbose=args.verbose)
    print(f"Subject: {res['subject']}")
    print(f"From: {res['from']}")
    if res.get("reply_to"):
        print(f"Reply-To: {res['reply_to']}")
    if res.get("date"):
        print(f"Date: {res['date']}")
    print(f"Score: {res['score']}")
    print(f"Verdict: {res['verdict']}")
    print("Indicators:")
    for i in res["indicators"]:
        print(f" - {i}")
    if res["urls"]:
        print("URLs found:")
        for u in res["urls"]:
            print(f" - {u}")
    if args.json:
        with open(args.json, "w", encoding="utf-8") as f:
            json.dump(res, f, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    main()
```
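Typical invocations might look like the following — the filename phish_analyzer.py is an assumption (the script's name is not stated above), and the API key is a placeholder:

```shell
# Rule-based analysis only, skipping VirusTotal lookups
python phish_analyzer.py suspicious.eml --no-vt

# Full analysis with VirusTotal URL reputation and a JSON report
export VT_API_KEY="your-api-key"
python phish_analyzer.py suspicious.eml --verbose --json report.json
```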
Test Results
Analysis 1 — Malware Sample
Verdict: Likely Phishing (Score: 7)
- Obfuscated URL detected: +3 points
- VirusTotal malicious flag: +4 points
- Key finding: URL https://servervirto.com.co/ed/trn/update spoofing a legitimate domain
Analysis 2 — Advanced Phishing Sample
Verdict: Likely Phishing (Score: 11)
- HTTP instead of HTTPS
- Domain spoofing (.af TLD appended to legitimate brand)
- URL redirect mismatch
- VirusTotal malicious confirmation
- Key finding: Multiple deception techniques layered together
Analysis 3 — Safe Email Test
Verdict: Probably Safe (Score: 2)
- Plain text URL detected: +1 point
- Key finding: The weighting system successfully prevents false positives — a single URL detection alone is not enough to flag a legitimate email
Lessons Learned
Phishing remains the most common method attackers use to target individuals and organizations because it exploits people rather than software: humans are consistently the most vulnerable link in the chain. This project demonstrates how automation and scripting can assist security analysts while protecting users from sophisticated attacks.
Proper scoring weights are critical — they reduce false positives while maintaining effective threat detection. A single suspicious indicator shouldn't condemn a legitimate email, but a combination of brand spoofing, suspicious URLs, and known-malicious VirusTotal hits creates a clear signal.