> how it works

agentsweb.org is a multi-tier pipeline that turns the hostile web into clean, verified markdown for AI agents. Here's what happens when your agent makes a request.

the request flow

Your Agent → agentsweb.org → Edge Cache → KV Store → Live Fetch → Markdown

Every request flows through up to three layers, each one slower than the last. Most requests never make it past layer two.
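
In code, the cascade is just a chain of fallbacks. A minimal sketch, with one hypothetical lookup function per layer (these names are not agentsweb's actual internals):

```ts
// Hypothetical sketch of the read path; edgeCacheGet, kvGet, and
// liveFetch stand in for the three layers described below.
declare function edgeCacheGet(url: string): Promise<string | null>;
declare function kvGet(url: string): Promise<string | null>;
declare function liveFetch(url: string): Promise<string>;

async function readPage(url: string): Promise<string> {
  return (await edgeCacheGet(url))  // layer 1: sub-1ms
      ?? (await kvGet(url))         // layer 2: 5-50ms
      ?? (await liveFetch(url));    // layer 3: 1-15s, then cached for the next agent
}
```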

layer 1: edge cache (sub-1ms)

Cloudflare's edge network spans 300+ cities worldwide. When a page has been read recently, it's cached at the edge node closest to your agent. The response comes back in under 1 millisecond. No KV lookup. No network hop. Just memory.

Edge cache TTL: 5 minutes. Popular pages stay warm indefinitely because agents keep reading them.
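
On Cloudflare Workers, the edge layer is typically the Cache API. A sketch of how a 5-minute TTL is usually expressed there (standard Workers usage, not agentsweb's published code):

```ts
// Layer 1 sketch using the Workers Cache API. s-maxage=300 gives the
// 5-minute edge TTL; every fresh read re-caches the page, which is
// what keeps popular pages warm.
async function edgeLookup(request: Request): Promise<Response | undefined> {
  return caches.default.match(request); // in-colo memory, no network hop
}

function cacheableResponse(markdown: string): Response {
  return new Response(markdown, {
    headers: {
      "Content-Type": "text/markdown",
      "Cache-Control": "public, s-maxage=300", // 5-minute edge TTL
    },
  });
}
```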

layer 2: kv store (5-50ms)

Cloudflare KV is a globally replicated key-value store. If the edge cache is cold, we check KV. The data is replicated across Cloudflare's entire network — reads are fast from anywhere on earth.

KV entries have dynamic TTLs based on trust level and domain type.
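
The exact TTL values aren't specified here, but the read/write shape on Cloudflare KV looks roughly like this (PAGES and ttlFor are assumptions, not agentsweb's real binding or policy):

```ts
// Layer 2 sketch. PAGES is an assumed KV namespace binding and ttlFor
// is a placeholder policy; the real TTL values per trust level and
// domain type aren't documented on this page.
interface Env {
  PAGES: KVNamespace;
}

function ttlFor(trust: number): number {
  return Math.min(86_400, 300 * trust); // placeholder: more trust, longer TTL
}

async function kvRead(url: string, env: Env): Promise<string | null> {
  return env.PAGES.get(url); // replicated read: 5-50ms from anywhere
}

async function kvWrite(url: string, markdown: string, trust: number, env: Env): Promise<void> {
  await env.PAGES.put(url, markdown, { expirationTtl: ttlFor(trust) });
}
```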

layer 3: live fetch

Cache miss. The page hasn't been cached yet, or the entry expired. agentsweb fetches the page through a markdown conversion service, runs it through every security gate, and stores it in KV. The next agent gets it from cache.

Live fetch takes 1-15 seconds depending on the target site. But it only happens once per page per TTL window. Every agent after the first one gets the cached version.
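
A sketch of the miss path, with convertToMarkdown and passesSecurityGates as hypothetical stand-ins for the conversion service and the gates described below:

```ts
// Layer 3 sketch. convertToMarkdown and passesSecurityGates are
// stand-ins for the conversion service and the security pipeline;
// new entries start at trust 1 (see the consensus section below).
declare function convertToMarkdown(url: string): Promise<string>;
declare function passesSecurityGates(markdown: string): boolean;

async function liveFetch(url: string, env: { PAGES: KVNamespace }): Promise<string> {
  const markdown = await convertToMarkdown(url); // 1-15s, once per TTL window
  if (!passesSecurityGates(markdown)) {
    throw new Error("content rejected by security pipeline");
  }
  await env.PAGES.put(url, markdown, { expirationTtl: 300 }); // trust-1 TTL (placeholder value)
  return markdown;
}
```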

the self-healing consensus engine

This is the part that makes agentsweb fundamentally different from a simple cache.

Every cached entry has a trust level (1-100). Trust starts at 1 when a single agent writes the entry. When a different agent — identified by IP address, not self-reported IDs — reads the same page and confirms the content matches, trust increments.

At trust level 2+, the entry is protected from overwrites. An attacker can't just submit a poisoned version — the existing trusted content wins.

If someone does manage to poison a low-trust entry, it self-destructs on the next legitimate read. The reading agent fetches the page locally, sees the mismatch, and submits the correct version. The poison survives exactly one read.
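
The rules are simple enough to sketch directly. The entry shape and function name here are assumptions; only the rules themselves come from this page:

```ts
// Sketch of the consensus rules described above: trust starts at 1,
// a confirmation from a different IP increments it, trust 2+ blocks
// overwrites, and a mismatch replaces low-trust content.
interface Entry {
  markdown: string;
  trust: number;    // 1-100
  writerIp: string; // IP that wrote this version
}

function onSubmission(existing: Entry | null, submitted: string, submitterIp: string): Entry {
  if (existing === null) {
    // First write: trust starts at 1.
    return { markdown: submitted, trust: 1, writerIp: submitterIp };
  }
  const sameContent = existing.markdown === submitted;
  const differentAgent = existing.writerIp !== submitterIp;

  if (sameContent && differentAgent) {
    // Independent confirmation: trust climbs toward 100. A real version
    // would track distinct confirming IPs rather than increment blindly.
    return { ...existing, trust: Math.min(100, existing.trust + 1) };
  }
  if (!sameContent && existing.trust >= 2) {
    // Protected: the existing trusted content wins, the submission is dropped.
    return existing;
  }
  if (!sameContent) {
    // Low-trust mismatch: self-heal by replacing the entry. A poisoned
    // entry survives exactly one legitimate read.
    return { markdown: submitted, trust: 1, writerIp: submitterIp };
  }
  // Same agent re-confirming its own write: no change.
  return existing;
}
```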

security pipeline

Every piece of content passes through a series of security gates before being stored.

Content that fails any gate is rejected, the submitter gets a strike, and after 5 strikes the IP is auto-banned for an hour.
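
The strike bookkeeping maps naturally onto KV, with an expiring ban key as the 1-hour timer. A sketch, with key names and mechanics as assumptions:

```ts
// Strike/ban sketch on KV: the ban key's expirationTtl doubles as the
// 1-hour auto-unban timer. Note a KV counter is racy under concurrent
// writes; a real version might use a Durable Object instead.
const MAX_STRIKES = 5;
const BAN_SECONDS = 3600;

async function recordStrike(ip: string, env: { STRIKES: KVNamespace }): Promise<void> {
  const strikes = Number(await env.STRIKES.get(`strikes:${ip}`)) + 1; // null coerces to 0
  if (strikes >= MAX_STRIKES) {
    await env.STRIKES.put(`ban:${ip}`, "1", { expirationTtl: BAN_SECONDS }); // auto-unban after an hour
    await env.STRIKES.delete(`strikes:${ip}`);
  } else {
    await env.STRIKES.put(`strikes:${ip}`, String(strikes));
  }
}

async function isBanned(ip: string, env: { STRIKES: KVNamespace }): Promise<boolean> {
  return (await env.STRIKES.get(`ban:${ip}`)) !== null;
}
```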

cache warming (cron)

A background cron job samples 10 random URLs from the index every run. If an entry is past 75% of its TTL, the cron fetches a fresh version and updates the cache. Popular pages never expire — they're always warm and ready.
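
As a Workers scheduled handler, the warming pass could look like this. It assumes each entry carries { writtenAt, ttl } metadata so the 75% check is possible; that metadata shape, and fetching the raw URL as a stand-in for the markdown pipeline, are assumptions:

```ts
// Cache-warming sketch as a Workers scheduled handler. Samples 10
// random keys and refreshes any entry past 75% of its TTL.
interface PageMeta {
  writtenAt: number; // epoch seconds
  ttl: number;       // seconds
}

export default {
  async scheduled(_controller: ScheduledController, env: { PAGES: KVNamespace }): Promise<void> {
    const { keys } = await env.PAGES.list<PageMeta>({ limit: 1000 });
    const sample = [...keys].sort(() => Math.random() - 0.5).slice(0, 10); // crude 10-URL sample
    const now = Date.now() / 1000;

    for (const key of sample) {
      const meta = key.metadata;
      if (!meta) continue;
      if (now - meta.writtenAt > 0.75 * meta.ttl) {
        // Past 75% of TTL: refresh before it expires.
        const fresh = await (await fetch(key.name)).text(); // stand-in for the markdown pipeline
        await env.PAGES.put(key.name, fresh, {
          expirationTtl: meta.ttl,
          metadata: { writtenAt: now, ttl: meta.ttl },
        });
      }
    }
  },
};
```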

why this architecture

The web wasn't built for AI agents. HTML is for browsers. JavaScript is for humans. CAPTCHAs exist specifically to stop automated access. agentsweb is the translation layer — it absorbs all that complexity so your agent doesn't have to.

One fetch. Clean markdown. Verified by consensus. Cached at the edge. That's the whole idea.

< back to agentsweb.org