haggl.ai Blog

How to Measure AI Agent Traffic on Your Site Today

May 15, 2026|10 min read

The first question every vendor asks before they deploy haggl is the same: how much agent traffic am I actually getting? And the second question, usually five seconds later, is: why doesn’t Google Analytics show me any?

We’ve covered the “why” before — Agents Already Visit Your Site, You Just Can’t See Them walks through why a JavaScript-based analytics stack is blind to clients that don’t execute JavaScript. This post is the practical companion. If you own a pricing page and you want a defensible number for the agent channel before next week’s board meeting, here’s how to get one.

Everything below works with what you already have: server logs, edge logs, or a CDN dashboard. No haggl install required.

Where the data actually lives

AI agents do produce a record — just not in your analytics tool. They produce it one layer down, in the same place every HTTP request leaves a trail: your origin server access log, your reverse proxy log, or your CDN request log. Cloudflare, Fastly, Vercel, CloudFront, and nginx all expose this. So does Heroku, if you turn on log drains.

Open one of those logs, ignore the JS-based pageview count for a minute, and the picture changes. There’s a population of clients that fetch your HTML, never request a single asset, and leave inside two seconds. Those are the ones we want to count.

Signal 1 — the User-Agent string

Easiest to grep, easiest to spoof. Start here anyway, because the named agents have not bothered to hide.

# Bot strings worth grepping for in 2026:
GPTBot                     # OpenAI training crawl
ChatGPT-User               # ChatGPT-driven runtime fetches
OAI-SearchBot              # ChatGPT search index
ClaudeBot                  # Anthropic training crawl
Claude-User                # Claude.ai runtime fetches
Claude-SearchBot           # Anthropic search index
PerplexityBot              # Perplexity search index crawl
Perplexity-User            # Perplexity runtime
Google-Extended            # Google AI training opt-out token (robots.txt only; not sent as a User-Agent)
GoogleOther                # Generic Google crawler for non-Search product teams
Applebot-Extended          # Apple Intelligence training opt-out token (robots.txt only)
Bytespider                 # ByteDance (Doubao, etc.)
DuckAssistBot              # DuckDuckGo assistant
cohere-ai                  # Cohere
anthropic-ai               # Anthropic (legacy token)

A useful one-liner against an nginx-style access log:

awk -F'"' '{print $6}' access.log \
  | grep -Ei 'gpt|oai-searchbot|claude|anthropic|perplexity|gemini|googleother|cohere|bytespider|applebot|duckassist' \
  | sort | uniq -c | sort -rn | head -20

Two important distinctions show up immediately:

Training crawlers (GPTBot, ClaudeBot, Bytespider) sweep your site occasionally. They’re building a model snapshot. They don’t indicate a current buyer.
Runtime fetches (ChatGPT-User, Claude-User, Perplexity-User) are firing because a real human just asked an agent a question that brought it to your URL. Those are the ones worth caring about, and the ones a vendor can intercept.

If you filter only on training crawlers, you’ll miss most of the buyer channel. The runtime fetch is the buying signal.
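For anyone who would rather script the split than eyeball grep output, here is a minimal Python sketch of the training-versus-runtime bucketing. It assumes the nginx "combined" log format. The helper names (classify_user_agent, bucket_counts) and the bucket labels are ours for illustration, and PerplexityBot is placed in a separate "search" bucket on the assumption that it feeds Perplexity’s search index rather than model training.

```python
import re
from collections import Counter

# Tokens come from the list above; the bucket labels are this sketch's own.
RUNTIME_TOKENS = ("ChatGPT-User", "Claude-User", "Perplexity-User")
TRAINING_TOKENS = ("GPTBot", "ClaudeBot", "Bytespider")
SEARCH_TOKENS = ("OAI-SearchBot", "Claude-SearchBot", "PerplexityBot")

# Assumes nginx "combined" format:
#   ... "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(r'"[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"\s*$')


def classify_user_agent(ua: str) -> str:
    """Return 'runtime', 'training', 'search', or 'other' for a User-Agent."""
    lowered = ua.lower()
    buckets = (
        ("runtime", RUNTIME_TOKENS),
        ("training", TRAINING_TOKENS),
        ("search", SEARCH_TOKENS),
    )
    for bucket, tokens in buckets:
        if any(token.lower() in lowered for token in tokens):
            return bucket
    return "other"


def bucket_counts(lines):
    """Count log lines per bucket, skipping lines that do not parse."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match:
            counts[classify_user_agent(match.group("ua"))] += 1
    return counts
```

Feed it an open file handle (bucket_counts(open("access.log"))) and plot the "runtime" bucket on its own. Substring matching is deliberately loose, so a spoofed User-Agent lands in whichever bucket it imitates.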

Signal 2 — the request fingerprint

Many agents (and almost every custom one built on a Python or Node HTTP client) ship with a generic User-Agent or one that mimics a browser. python-requests/2.31, node-fetch/3.x, or a copy-pasted Chrome UA from a tutorial. The User-Agent filter misses these entirely.

What it doesn’t miss: the shape of the request itself. A real browser session, even a scripted one, fetches dozens of assets after the initial HTML. CSS, fonts, JS bundles, favicon, the obligatory Google Tag Manager beacon. A scraping agent fetches the HTML and leaves.

That asymmetry shows up cleanly in your edge log:

-- Clients whose only request in the last day was the pricing HTML doc
-- (interval syntax varies by warehouse; adjust for yours)
SELECT
  client_ip,
  user_agent,
  COUNT(*) AS requests,
  MAX(uri) AS path
FROM edge_logs
WHERE timestamp > now() - INTERVAL 1 DAY
GROUP BY client_ip, user_agent
HAVING COUNT(*) = 1
  AND MAX(uri) LIKE '/pricing%';

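One request, hit the pricing page, exit. That’s not a browser. That’s a client that parsed your HTML, extracted what it needed, and walked away. In 2026 many of those clients are agents, though uptime monitors, link-preview bots, and classic search crawlers leave the same footprint, so cross-check against the other signals.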

Other fingerprint signals that hold up well:

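No image types in the Accept header. Chrome-family browsers list image/avif,image/webp,image/... in the Accept header on page loads. Agents send text/html or */*. Some browsers send a shorter list, so treat this as a weak signal on its own.
No sec-ch-ua client hints. Chrome and Edge send these by default on HTTPS requests. Plain HTTP-client agents almost never do. Firefox and Safari don’t send them either, and headless Chrome does, so pair this with another signal.
JA3 / JA4 TLS fingerprints. The TLS handshake from python-requests looks nothing like the TLS handshake from Chrome 138. If your CDN gives you the TLS fingerprint (Cloudflare exposes it as ja3_hash), bucket by it. Prefer JA4 where you have it: recent Chrome versions shuffle the order of TLS extensions, which scatters one browser across many JA3 hashes, and JA4 sorts them to compensate. Fingerprints outside the handful that dominate your traffic are unlikely to be consumer browsers.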
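The two header checks from this list are easy to express in code. The sketch below is illustrative only: fingerprint_flags and suspicion_score are hypothetical helper names, the input is assumed to be a plain mapping of request header names to values, and the TLS fingerprint is left out because how you obtain it depends on your CDN.

```python
from typing import Dict, Mapping


def fingerprint_flags(headers: Mapping[str, str]) -> Dict[str, bool]:
    """Per-signal booleans; True means 'looks unlike a Chrome-family browser'."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    accept = normalised.get("accept", "")
    return {
        # No image/* media type advertised on the request.
        "no_image_accept": "image/" not in accept.lower(),
        # No User-Agent Client Hints header at all.
        "no_client_hints": "sec-ch-ua" not in normalised,
    }


def suspicion_score(headers: Mapping[str, str]) -> int:
    """Number of flags raised (0 to 2). A hint for ranking, not a verdict."""
    return sum(fingerprint_flags(headers).values())
```

A default python-requests call scores 2 and a current Chrome page load scores 0. Browsers that omit client hints will score at least 1, which is why we would use the score to rank sessions for review rather than to block anything.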

Signal 3 — the behavioral pattern

Aggregate signals are more durable than per-request ones. Even an agent that perfectly fakes its TLS fingerprint and User-Agent leaves a behavioral trace, because it’s doing a different job than a human.

Three patterns we see consistently:

Fetches structured pages, ignores marketing pages. Pricing, docs, FAQ, terms, comparison pages. Skips the homepage hero, the blog, the about page. A human shopping for a product wanders. An agent extracting facts goes straight to the page where the facts live.
Sub-second between requests, or a single request total. Real users hit a page, read, click. The gap between requests is 8–30 seconds. Agent sessions either fire requests near-instantly (parallel fetch) or fire one and exit.
Asks for /robots.txt, /llms.txt, or /.well-known/ first. This is the cleanest tell of all. No human types yoursite.com/llms.txt into a browser. An agent doing due diligence does, and the request is in your log.

If you’ve already published an llms.txt (and if you haven’t, you should), just grep for hits on it. Almost every hit comes from an agent or crawler rather than a human visitor.
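The behavioral patterns above can be turned into per-client flags once your log rows are parsed. This is a sketch under stated assumptions: the Hit record, the way you key a client (IP plus User-Agent, a session ID, whatever you have), the one-second threshold, and the STRUCTURED_PREFIXES list are all ours for illustration and will need tuning to your own URL layout.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from itertools import groupby


@dataclass(frozen=True)
class Hit:
    client: str   # however you key a session, e.g. IP plus User-Agent
    at: datetime
    path: str


DISCOVERY_PREFIXES = ("/robots.txt", "/llms.txt", "/.well-known/")
STRUCTURED_PREFIXES = ("/pricing", "/docs", "/faq", "/terms")  # assumed layout
FAST_GAP = timedelta(seconds=1)  # the "sub-second" threshold from the text


def behaviour_flags(hits):
    """Summarise one client's hits (non-empty list) into behavioural tells."""
    ordered = sorted(hits, key=lambda h: h.at)
    gaps = [later.at - earlier.at for earlier, later in zip(ordered, ordered[1:])]
    # Discovery files are not content pages, so leave them out of the page mix.
    content = [h.path for h in ordered if not h.path.startswith(DISCOVERY_PREFIXES)]
    return {
        "single_request": len(ordered) == 1,
        "all_gaps_sub_second": bool(gaps) and all(g < FAST_GAP for g in gaps),
        "discovery_first": ordered[0].path.startswith(DISCOVERY_PREFIXES),
        "structured_only": bool(content)
        and all(p.startswith(STRUCTURED_PREFIXES) for p in content),
    }


def flags_by_client(hits):
    """Group hits by client key and compute the flags for each group."""
    keyed = sorted(hits, key=lambda h: h.client)
    return {
        client: behaviour_flags(list(group))
        for client, group in groupby(keyed, key=lambda h: h.client)
    }
```

Grouping by client over a whole log file is crude. In practice you would probably cut sessions on an idle timeout first, then run the flags per session.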

Signal 4 — referer and IP

Two more signals worth wiring up, with caveats.

Referer. Some agent runtimes preserve the upstream context as a Referer header. You’ll see chat.openai.com, claude.ai, perplexity.ai, chatgpt.com, and gemini.google.com show up. Treat anything from those domains as agent-mediated traffic, full stop — even if a human eventually clicks the link, the discovery happened inside an agent UI.

awk -F'"' '{print $4}' access.log \
  | grep -Eo 'chat\.openai\.com|claude\.ai|perplexity\.ai|chatgpt\.com|gemini\.google\.com' \
  | sort | uniq -c | sort -rn

IP ranges. OpenAI, Anthropic, Google, and Perplexity publish (or effectively leak via reverse-DNS) the IP ranges their runtime egress traffic uses. OpenAI publishes them at openai.com/gptbot.json, with an equivalent chatgpt-user.json for runtime fetches; Anthropic publishes ranges via their support docs; Cloudflare has a maintained list under their AI bot management feature. Matching client IPs against those ranges gives you a high-confidence count even when the User-Agent is missing.

The caveat: published ranges drift, and a long tail of custom procurement agents will egress from arbitrary cloud IPs. Use this as a confirming signal, not a primary filter.
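Checking a client IP against published ranges needs nothing beyond Python’s standard ipaddress module. A minimal sketch follows. The CIDRs in EXAMPLE_RANGES are placeholders from the reserved documentation ranges, not real vendor ranges, and the labels and helper names are hypothetical. Loading the vendors’ JSON files is left out on purpose because we don’t want to guess at their schema here.

```python
import ipaddress

# Placeholder CIDRs from the reserved documentation ranges.
# Replace with the ranges each vendor actually publishes.
EXAMPLE_RANGES = {
    "example-runtime": ["192.0.2.0/24", "2001:db8::/32"],
    "example-crawler": ["198.51.100.0/24"],
}


def compile_ranges(raw):
    """Parse CIDR strings once so lookups do not re-parse them."""
    return {
        label: [ipaddress.ip_network(cidr) for cidr in cidrs]
        for label, cidrs in raw.items()
    }


def match_ip(ip, compiled):
    """Return the first label whose ranges contain ip, or None."""
    addr = ipaddress.ip_address(ip)
    for label, networks in compiled.items():
        for network in networks:
            # Only compare like with like (IPv4 vs IPv6).
            if addr.version == network.version and addr in network:
                return label
    return None
```

Refresh the compiled ranges on a schedule, since published ranges drift, and keep the result as a confirming column next to the User-Agent bucket instead of a filter on its own.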

Wiring it into a dashboard

For a one-off audit, the awk one-liners above are enough. For an ongoing read on the channel, three options in order of effort:

Cloudflare bot analytics. If you’re behind Cloudflare, the dashboard already classifies known AI bots into their own bucket, so you can get a live count without writing anything. Feature names and plan availability have shifted over time, so check what your plan exposes before you rely on it.
An edge function that tags is_agent. Drop a Cloudflare Worker, Vercel Edge Middleware, or Fastly VCL snippet that inspects User-Agent + TLS fingerprint + asset-fetch behavior, and writes a header or cookie. Pipe it into the same log stream you already query. ~30 lines of code, classifies every request at the edge.
Log shipper into your warehouse. If you already drain access logs into BigQuery, Snowflake, or ClickHouse, write the queries above as a daily view. Build the chart you actually want: distinct agent sessions per day, by referer source, hitting /pricing.

We’ve done option 2 for several haggl design partners. The most useful single chart was always the same: agent sessions per day, stacked by upstream agent runtime, with the pricing page isolated from everything else. That chart goes up and to the right faster than anyone expects.
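The data behind that chart is a small aggregation. The sketch below assumes you have already reduced your logs to one tuple per agent session: the day, a runtime label you assigned, and whether the session touched the pricing page. The function names and the tuple shape are illustrative, not a haggl API.

```python
from collections import Counter
from datetime import date


def daily_series(sessions):
    """Count sessions per (day, runtime, page group).

    sessions: iterable of (day, runtime, hit_pricing) tuples,
    one tuple per agent session.
    """
    series = Counter()
    for day, runtime, hit_pricing in sessions:
        page_group = "pricing" if hit_pricing else "other"
        series[(day, runtime, page_group)] += 1
    return series


def stacked_rows(series, page_group="pricing"):
    """Pivot to {day: {runtime: count}} in date order, ready for a stacked chart."""
    rows = {}
    for (day, runtime, group), count in series.items():
        if group == page_group:
            rows.setdefault(day, {})[runtime] = count
    return dict(sorted(rows.items()))
```

Each row of stacked_rows is one bar and each runtime key is one band in the stack. Keeping "pricing" and "other" as separate series is what isolates the pricing page from everything else.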

What the numbers will tell you (and what they won’t)

Once you have a count, a few interpretive guardrails:

Channel growth rate is the real headline, not absolute volume. The agent channel in any given B2B vertical is doubling every 2–4 months right now. Even if your absolute count looks small in May, the 90-day extrapolation is the number that matters for planning.
Don’t compare agent sessions to human pageviews 1:1. An agent session usually stands in for a human who outsourced a decision to an agent. The downstream conversion economics are completely different from a human bouncing through your funnel. Comparing the raw counts is a category error: you’re mixing intent-weighted and intent-blind populations.
You cannot tell intent from the log alone. A ChatGPT-User hit on your pricing page might be a buyer comparison, might be a curiosity check, might be an LLM answering “what does this company charge?” on behalf of someone who’ll never buy. You see that they came. You don’t see what they were asked.
You will under-count. Spoofed User-Agents, residential proxies, and the long tail of custom agents will fall through every filter. Treat your number as a lower bound, not a measurement.
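To make the growth-rate point concrete, here is the arithmetic behind a 90-day extrapolation, assuming plain exponential growth with a constant doubling time. That constant-doubling model is our simplifying assumption and the function names are illustrative. Because the count is a lower bound, the projection is one too.

```python
import math


def projected_sessions(current: float, days_ahead: float, doubling_days: float) -> float:
    """Exponential projection: current * 2 ** (days_ahead / doubling_days)."""
    if doubling_days <= 0:
        raise ValueError("doubling_days must be positive")
    return current * 2 ** (days_ahead / doubling_days)


def observed_doubling_days(earlier: float, later: float, days_between: float) -> float:
    """Doubling time implied by two counts taken days_between days apart."""
    if earlier <= 0 or later <= earlier:
        raise ValueError("need positive, growing counts")
    return days_between * math.log(2) / math.log(later / earlier)
```

If you read "2–4 months" as roughly 60 to 120 days, a count of 100 sessions today projects to somewhere between about 168 and 283 sessions in 90 days. Measuring your own doubling time with observed_doubling_days from two real counts is better than borrowing a vertical-wide figure.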

The deeper limit — and where this stops being enough

Counting agent traffic with logs answers a single question: how many agents are showing up? That’s a useful question. It’s also the only question this approach can answer.

The questions you actually want answers to are:

What were they asked to find?
What did they conclude about us?
What did they recommend instead?
Did the user buy from the recommendation?

None of those are answerable from a server log. The agent showed up, parsed your HTML, produced an answer in some upstream chat session you have no visibility into, and the human either acted on it or didn’t. The funnel happened off-platform.

This is the structural argument for a declarative protocol like haggl: instead of trying to observe agents externally, you make the negotiation step happen on your turf, where you can record the actual interaction. The <meta name="haggl-negotiate"> tag is the explicit handshake. When an agent uses it, you see:

which ICP segment it targeted,
what evidence it submitted,
what offer it received,
whether it converted.

That’s a measurable channel. The log-based approach in this post is the right starting point — it tells you the channel exists and how fast it’s growing. But it doesn’t give you the engagement loop. For that, you need agents to identify themselves on the way in. (See Anatomy of a <meta name="haggl-negotiate"> Tag for what that looks like in practice.)

The 30-minute version

If you want a single Monday-morning task list to walk into the week with:

1. Pull the last 30 days of access logs from your CDN or origin.
2. Grep for the named user-agents in the list above. Bucket into “training crawler” vs “runtime fetch.” Plot the runtime bucket over time.
3. Filter to sessions with a single request, no asset fetches, hitting /pricing or /docs. Overlay on the chart from step 2.
4. Add a third series: hits with Referer from chat.openai.com, claude.ai, perplexity.ai, chatgpt.com, or gemini.google.com.
5. Look at the 90-day slope. That’s your agent channel. It’s not zero, and it’s not going down.

Once that number stops being negligible — usually within a quarter or two — the question shifts from do agents visit? to what happens when they do? And that’s the question the meta tag answers.
