The AI Arms Race Nobody's Talking About: Content Safety in the Age of Opus and Beyond
We started building GlobiSafe out of frustration.
Not the "I wish this API was faster" kind. The kind where you watch a platform you helped build get flooded with AI-generated fraud, and the best tool available is a third-party API that wants you to send all your users' messages to their servers. In plaintext. To "moderate" them.
No thanks.
The problem is accelerating
Two years ago, catching AI-generated spam was easy. The text was robotic, repetitive, full of tells. A basic classifier could flag 95% of it without breaking a sweat.
That world is gone.
Today's models — Claude, GPT, Gemini, open-source alternatives — produce text that's indistinguishable from human writing. Which is incredible for productivity. And terrifying for trust & safety teams.
Here's what we're seeing in the wild right now:
- Phishing at scale — Thousands of unique messages, each one slightly different, each one bypassing pattern-based filters
- Synthetic social engineering — AI-generated support conversations designed to extract credentials, one empathetic message at a time
- Review manipulation — Fake reviews so well-written that even experienced moderators can't tell them apart from real ones
- Adaptive harassment — Bots that rephrase toxic content on the fly to dodge keyword filters
The attackers have access to the same frontier models everyone else does. The difference is they don't care about terms of service.
Why the current tools don't cut it
Most platforms are still relying on approaches that were designed for a pre-LLM world:
Keyword filters — Fast, simple, and completely useless against anyone who's spent five minutes thinking about evasion: swapped characters, slang, zero-width spaces. These filters give you a false sense of security while catching almost nothing that matters. (A quick sketch below shows just how trivial the evasion is.)
Third-party moderation APIs — They work. Sort of. But you're sending your users' private messages to someone else's servers, getting back a generic toxicity score trained on Reddit comments, and hoping it maps to your specific use case. It usually doesn't. A message that's fine on a gaming platform might be a red flag on a banking app.
Human review queues — The gold standard for accuracy, but impossible to scale. At 100K messages per day, you'd need a small army of reviewers working around the clock. And by the time a human sees the content, the scam has already landed and the harassment has already done its damage.
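To make the evasion point concrete, here's a minimal sketch. The blocklist and messages are made up for illustration — this isn't GlobiSafe code, just a naive substring filter being walked past by a zero-width space and a single swapped character.

```python
# Minimal sketch: why naive keyword filters fail against trivial evasion.
# The blocklist and example messages are illustrative, not real rules.

BLOCKLIST = {"free crypto", "wire transfer now"}

def naive_filter(message: str) -> bool:
    """Return True if the message contains a blocklisted phrase verbatim."""
    lowered = message.lower()
    return any(phrase in lowered for phrase in BLOCKLIST)

clean   = "Claim your free crypto today!"
evasive = "Claim your fr\u200bee crypt0 today!"   # zero-width space + character swap

print(naive_filter(clean))    # True  -- caught
print(naive_filter(evasive))  # False -- sails right through
```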
None of these were built for a world where harmful content is generated by AI, at scale, on demand.
How we think about it differently
We spent months talking to trust & safety teams, security engineers, and compliance officers across fintech, marketplaces, and SaaS platforms. The pattern was consistent: no single layer works alone.
So we built a pipeline. Four layers, each one catching what the previous one missed:
Layer 1 — Rules Engine
Deterministic checks that run in microseconds. Pattern matching, blocklists, rate limiting, known-bad signatures. No ML, no ambiguity.
This catches 60–70% of harmful content instantly — the obvious stuff that doesn't need a model to detect.
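For a feel of what this layer does, here's a stripped-down sketch. The patterns, rate limit, and function names are illustrative assumptions, not our actual rules engine.

```python
import re
import time
from collections import defaultdict

# Illustrative signatures and limits -- a real deployment would load these
# from configuration, not hard-code them.
KNOWN_BAD_PATTERNS = [
    re.compile(r"https?://\S*\.(?:zip|mov)\b", re.IGNORECASE),           # suspicious link patterns
    re.compile(r"\bverify your account within 24 hours\b", re.IGNORECASE),
]
RATE_LIMIT = 20                      # messages per sender per minute
_sent_times = defaultdict(list)

def layer1_rules(sender_id: str, message: str) -> str:
    """Return 'block', or 'pass' to hand the message to the next layer."""
    now = time.time()
    # Rate limiting: keep only timestamps from the last 60 seconds, then count.
    _sent_times[sender_id] = [t for t in _sent_times[sender_id] if now - t < 60]
    _sent_times[sender_id].append(now)
    if len(_sent_times[sender_id]) > RATE_LIMIT:
        return "block"
    # Known-bad signatures: deterministic regex matches, no ambiguity.
    if any(p.search(message) for p in KNOWN_BAD_PATTERNS):
        return "block"
    return "pass"
```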
Layer 2 — Custom Classifiers
Fine-tuned models trained on your data, for your vertical. A fintech platform needs to catch account takeover language. A marketplace needs to detect listing fraud. A healthcare app needs to flag medical misinformation. Generic models can't do this well. Vertical-specific ones can.
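As a rough sketch of what calling a vertical-specific classifier can look like, using the Hugging Face pipeline API. The model name, label set, and threshold are placeholders, not a published checkpoint.

```python
from transformers import pipeline

# Hypothetical fine-tuned checkpoint for a fintech vertical -- the model name
# and labels are placeholders for illustration only.
classifier = pipeline(
    "text-classification",
    model="your-org/fintech-abuse-classifier",
)

def layer2_classify(message: str, threshold: float = 0.85) -> str:
    """Flag messages the vertical-specific model scores as risky."""
    result = classifier(message, truncation=True)[0]
    # e.g. {"label": "account_takeover", "score": 0.97}
    if result["label"] != "benign" and result["score"] >= threshold:
        return "flag"
    return "pass"
```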
Layer 3 — LLM Analysis
For the hard cases: sarcasm, coded language, context-dependent threats, content that's technically "clean" but clearly malicious in context. This layer only processes the 5–10% of content that passed the first two layers — so the cost stays manageable.
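Here's a sketch of the escalation step, with a placeholder `call_llm` standing in for whatever model client you run. The prompt wording and verdict labels are assumptions, not our production prompt.

```python
# Only content that cleared layers 1 and 2 reaches this function, so the
# expensive model call happens on a small slice of traffic.

ESCALATION_PROMPT = """You are reviewing a message that passed automated filters.
Platform context: {context}
Message: {message}
Answer with one word -- SAFE, SUSPICIOUS, or HARMFUL -- then one sentence of reasoning."""

def layer3_llm_review(message: str, context: str, call_llm) -> str:
    """Ask an LLM to judge context-dependent cases the cheaper layers can't."""
    verdict = call_llm(ESCALATION_PROMPT.format(context=context, message=message))
    return "flag" if verdict.strip().upper().startswith(("SUSPICIOUS", "HARMFUL")) else "pass"
```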
Layer 4 — Human Review
Not as a bottleneck. As a feedback loop. Edge cases, appeals, and model calibration. Every human decision makes the automated layers smarter. Over time, the system needs less human intervention — not more.
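Putting the four layers together, reusing the sketch functions above. The review queue and label store are placeholders for whatever your stack uses; the point is that reviewer verdicts flow back in as labeled training data.

```python
def moderate(sender_id: str, message: str, context: str, call_llm, review_queue) -> str:
    """Sketch of the full pipeline: each layer catches what the previous one missed."""
    if layer1_rules(sender_id, message) == "block":
        return "blocked"                              # layer 1: deterministic rules
    if layer2_classify(message) == "flag":
        review_queue.put((sender_id, message))        # layer 2: vertical classifier
        return "held_for_review"
    if layer3_llm_review(message, context, call_llm) == "flag":
        review_queue.put((sender_id, message))        # layer 3: LLM on the hard cases
        return "held_for_review"
    return "allowed"

def record_review(message: str, reviewer_decision: str, label_store) -> None:
    """Layer 4 feedback loop: every human verdict becomes a labeled example
    for retraining the layer 2 classifier and tightening layer 1 rules."""
    label_store.append({"text": message, "label": reviewer_decision})
```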
Why self-hosted matters
This is the part most moderation vendors don't want to talk about.
If you're handling financial data, health records, legal documents, or any form of PII — you probably can't send that data to a third-party API. Not "shouldn't." Can't. Your compliance team will block it. Your SOC 2 auditor will flag it. Your enterprise customers will walk.
We built GlobiSafe to deploy on your infrastructure. A Docker container. Runs on your servers. Talks to your app over localhost. Your users' data never leaves your network.
Your data stays yours. That's not a feature — it's the whole point.
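Integration, in sketch form, is a plain HTTP call to a service running on your own network. The endpoint path and response shape below are assumptions for illustration, not the documented API.

```python
import requests

# The request never leaves your infrastructure: the moderation engine listens
# on localhost next to your app. Endpoint and response fields are illustrative.
LOCAL_MODERATION_URL = "http://localhost:8080/v1/moderate"

def check_message(message: str) -> dict:
    response = requests.post(
        LOCAL_MODERATION_URL,
        json={"content": message},
        timeout=2,
    )
    response.raise_for_status()
    return response.json()   # e.g. {"verdict": "pass", "layer": "rules"}
```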
What happens next
We're opening early access. The first cohort gets:
- Full access to the self-hosted moderation engine
- Direct line to our engineering team for custom model tuning
- Early-adopter pricing locked in permanently
We're building this for teams who take safety seriously but refuse to compromise on privacy. If that sounds like you — get on the waitlist.
The AI models will keep getting better. The real question is whether your safety stack keeps up.