Web scraping for AI agents: building the data layer for LLM apps

If you're building an AI agent that browses the web (research assistants, market-intel bots, comparison shoppers, lead-gen agents), your scraping layer is your reliability ceiling. The model is good. The orchestrator is fine. The tool that fetches web pages is what determines whether the agent finishes the task or loops on garbage.

This post is about that tool. What it has to do, what it must not do, and how to architect it so the agent above stays well-behaved.

The agent loop, briefly

A typical autonomous agent loop looks like:

  1. Plan a step
  2. Call a tool (fetch URL, search, calculator, etc.)
  3. Read the tool result
  4. Decide next step
  5. Loop

The web-fetch tool is called repeatedly, sometimes hundreds of times per task. Every call adds latency to the user-perceived response time, every byte returned competes for context window space, and every malformed result risks the agent making a bad decision.
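
As a rough illustration, the loop above can be sketched in a few lines. This is a minimal sketch, not any particular framework's API: `run_agent`, `stub_planner`, and `stub_fetch` are hypothetical names, and the planner stands in for the LLM.

```python
from typing import Callable, Optional

# A tool takes a dict of arguments and returns a dict result (illustrative shape).
Tool = Callable[[dict], dict]

def run_agent(plan_step: Callable[[list], Optional[dict]],
              tools: dict,
              max_steps: int = 50) -> list:
    """Plan, call a tool, read the result, decide again; stop when the planner returns None."""
    history = []
    for _ in range(max_steps):
        step = plan_step(history)                     # plan / decide the next step
        if step is None:                              # planner says the task is done
            break
        result = tools[step["tool"]](step["args"])    # call the tool
        history.append({"step": step, "result": result})  # result joins the context
    return history

# Stub planner and stub fetch tool, standing in for the LLM and the scraper.
def stub_planner(history: list) -> Optional[dict]:
    if len(history) >= 2:
        return None
    return {"tool": "fetch", "args": {"url": f"https://example.com/page{len(history)}"}}

def stub_fetch(args: dict) -> dict:
    return {"data": {"url": args["url"]}}

history = run_agent(stub_planner, {"fetch": stub_fetch})
```

Everything the tool returns is appended to `history`, which is why the size and shape of each result matter so much.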

Three constraints fall out:

  • Latency matters per-call. The agent waits synchronously.
  • Output format matters. Whatever you return goes into the agent's context window, where every token costs money and crowds out reasoning.
  • Error semantics matter. The agent has to know whether to retry, give up, or try a different URL, based on a structured signal rather than a regex over an error string.

What the scraping layer must provide

1. Typed JSON, not Markdown

Agents are LLMs. They can read Markdown. They read JSON faster.

If your scraping tool returns Markdown, the agent has to re-parse the Markdown to extract the values it needs. That's another LLM hop, with another opportunity to misread, on every fetch. Multiply by hundreds of tool calls per task and you've doubled your latency and cost for no functional gain.

Return JSON shaped to a schema the agent already knows. The agent's downstream logic can branch on data.price directly without an interpretation pass. We covered the schema design pattern in extracting structured JSON from any HTML.
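
To make "branch on data.price directly" concrete, here is a small sketch. The response shape and field names (`title`, `price`, `in_stock`) are assumptions for illustration, not a documented payload.

```python
import json

# Hypothetical tool result; the field names are illustrative only.
raw = '{"data": {"title": "Espresso machine", "price": 249.0, "in_stock": true}}'

def within_budget(tool_result: str, budget: float) -> bool:
    """Branch on a typed field with plain code, no second LLM pass."""
    data = json.loads(tool_result)["data"]
    price = data.get("price")          # None means the page had no price
    return price is not None and price <= budget
```

With Markdown, the same check would need a regex or another model call to find the price first.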

2. Compact responses

Agent context windows are finite. Even when they're large (1M tokens for Gemini, 200K for GPT-4 class), every token competes with the agent's own reasoning trace. A 50KB Markdown blob from one tool call can crowd out the planning context for the next ten steps.

Concrete numbers from a typical product-page extraction:

| Output             | Tokens  |
|--------------------|---------|
| Raw HTML           | ~30,000 |
| Cleaned Markdown   | ~3,000  |
| Schema-typed JSON  | ~150    |

The JSON form is 200× smaller than raw HTML and 20× smaller than Markdown. Across a 50-step agent task, that's the difference between fitting the trajectory in context and triggering a summarisation pass that loses information.
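
A back-of-the-envelope check of that claim, using the per-call figures above. It assumes one fetch per step and that every result stays in context, which is a simplification.

```python
# Approximate tokens per tool result, taken from the figures above.
TOKENS_PER_CALL = {"raw_html": 30_000, "markdown": 3_000, "json": 150}
STEPS = 50  # one fetch per step (assumption)

def trajectory_tokens(output_format: str, steps: int = STEPS) -> int:
    """Total tokens of tool output accumulated over a task."""
    return TOKENS_PER_CALL[output_format] * steps

totals = {fmt: trajectory_tokens(fmt) for fmt in TOKENS_PER_CALL}
# raw_html: 1,500,000   markdown: 150,000   json: 7,500
```

Against a 200K-token window, Markdown results alone would take about three quarters of the context, while the JSON results take under 4%.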

3. Honest error codes

If a fetch fails, the agent needs to know why in machine-readable form:

  • URL_UNREACHABLE: don't retry, the URL is dead, try a different one
  • FETCH_BLOCKED: don't retry the same way, escalate or give up
  • RATE_LIMITED: back off, then retry
  • SCHEMA_INVALID: bug in the agent's tool call, surface to the user
  • Field-level null: the page exists but didn't have this field, decide how to proceed

A scraper that returns {"error": "Something went wrong"} forces the agent to guess. A scraper that returns {"error": {"code": "FETCH_BLOCKED", "retryable": false}} lets the agent's policy code branch correctly.

This is one of the biggest wins from using a scraping API with a typed error taxonomy. See the Runo error model for the full enum we expose.
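
Here is a minimal sketch of the policy code such a taxonomy makes possible. The four codes come from the list above; the `Action` names, the `next_action` helper, and the fallback for unknown codes are assumptions, not part of any published API.

```python
from enum import Enum

class Action(Enum):
    CONTINUE = "continue"
    TRY_OTHER_URL = "try_other_url"
    ESCALATE_OR_GIVE_UP = "escalate_or_give_up"
    BACKOFF_AND_RETRY = "backoff_and_retry"
    SURFACE_TO_USER = "surface_to_user"

# One entry per error code listed above.
POLICY = {
    "URL_UNREACHABLE": Action.TRY_OTHER_URL,
    "FETCH_BLOCKED": Action.ESCALATE_OR_GIVE_UP,
    "RATE_LIMITED": Action.BACKOFF_AND_RETRY,
    "SCHEMA_INVALID": Action.SURFACE_TO_USER,
}

def next_action(result: dict) -> Action:
    """Map a tool result to what the orchestrator should do next."""
    error = result.get("error")
    if error is None:
        return Action.CONTINUE
    # Unknown codes are surfaced rather than guessed at (an assumption).
    return POLICY.get(error["code"], Action.SURFACE_TO_USER)
```

The lookup is a dictionary access, not a regex over an error string, so the branch is cheap and cannot be thrown off by a reworded message.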

4. Cancellation

Agent tasks get cancelled. The user closes the chat, hits stop, or the orchestrator decides to abandon a branch. A scraping tool call that's mid-flight on a slow page should be cancellable cleanly with the unused work refunded.

Without cancellation, you pay for work the user no longer cares about, and the agent process holds resources for completed-but-unconsumed results. A DELETE /v1/jobs/{job_id} endpoint that refunds unused units (Runo does this) closes the loop cleanly.
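
On the agent side, cancellation can be wired through `asyncio` roughly like this. It is a sketch under assumptions: `start_job`, `wait_for_result`, and `cancel_job` are hypothetical async callables that would wrap your scraping API (with `cancel_job` issuing the DELETE), and the stubs below replace real network calls.

```python
import asyncio

async def fetch_with_cancel(start_job, wait_for_result, cancel_job):
    """Run a scraping job; if the surrounding task is cancelled, cancel the job too."""
    job_id = await start_job()
    try:
        return await wait_for_result(job_id)
    except asyncio.CancelledError:
        await cancel_job(job_id)   # tell the server to stop (and refund) the job
        raise                      # keep propagating the cancellation

# Stubs standing in for real API calls.
cancelled_jobs = []

async def _start():
    return "job-1"

async def _wait(job_id):
    await asyncio.sleep(10)        # simulates a slow page

async def _cancel(job_id):
    cancelled_jobs.append(job_id)

async def _demo():
    task = asyncio.create_task(fetch_with_cancel(_start, _wait, _cancel))
    await asyncio.sleep(0.01)      # let the job start
    task.cancel()                  # user hit stop
    try:
        await task
    except asyncio.CancelledError:
        pass

asyncio.run(_demo())
```

Re-raising `CancelledError` after the cleanup matters: swallowing it would make the orchestrator think the fetch finished normally.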

5. Bypass that doesn't surface

Cloudflare, Datadome, and PerimeterX are increasingly aggressive in 2026. A scraping tool that returns 403 or "Verifying you are human" pages forces the agent to handle bypass strategy itself, which agents are bad at. Either:

  • The tool handles bypass transparently, escalating internally and returning success or a typed FETCH_BLOCKED, or
  • You build a retry-with-different-strategy loop in your orchestrator and pay the latency.

Option one is strictly better for agent reliability. The general bypass landscape is covered in how to scrape Cloudflare-protected sites. The relevant point for agent builders is that a good scraping API handles bypass server-side, transparent to the caller.
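
The first option, seen from inside the tool, amounts to an escalation ladder. The sketch below is illustrative only: the strategy names and the convention that a strategy returns `None` when blocked are assumptions, and real bypass logic is far more involved.

```python
def fetch_with_escalation(url: str, strategies: list) -> dict:
    """Try strategies cheapest-first; return data or a typed FETCH_BLOCKED error.

    `strategies` is an ordered list of (name, fetch_fn) pairs. Each fetch_fn
    returns the page HTML, or None if it was blocked (hypothetical convention).
    """
    for _name, fetch in strategies:
        html = fetch(url)
        if html is not None:
            return {"data": {"html": html}}   # caller never learns which strategy won
    return {"error": {"code": "FETCH_BLOCKED", "retryable": False}}

# Stub strategies standing in for a plain HTTP client and a headless browser.
def plain_http(url: str):
    return None                    # pretend the plain request got a challenge page

def headless_browser(url: str):
    return "<html>ok</html>"

result = fetch_with_escalation(
    "https://example.com", [("plain_http", plain_http), ("headless", headless_browser)]
)
blocked = fetch_with_escalation("https://example.com", [("plain_http", plain_http)])
```

The agent sees either data or one typed error, never a 403 body or a challenge page.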

Architecting the tool spec

The agent's tool definition is part of the design. A good spec for a web-fetch tool looks something like:

{
  "name": "fetch_url_data",
  "description": "Fetches a URL and extracts structured data matching the provided schema. Returns typed JSON or a typed error.",
  "parameters": {
    "type": "object",
    "properties": {
      "url": { "type": "string", "format": "uri" },
      "schema": {
        "type": "array",
        "description": "Fields to extract. Each field needs a name, type, and example value.",
        "items": {
          "type": "object",
          "properties": {
            "field": { "type": "string" },
            "type": { "enum": ["string", "integer", "float", "boolean", "date"] },
            "example": {}
          },
          "required": ["field", "type", "example"]
        }
      }
    },
    "required": ["url", "schema"]
  }
}

This forces the agent to commit to a schema before the call. That's the leverage point: once the agent declares what it wants, the extraction is deterministic and the result is small.

Without a schema, the agent gets back a giant document and has to re-extract fields itself, which is the slow, expensive, error-prone path.
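
Because the agent writes the schema, it helps to check the argument before spending a fetch on it. Below is a minimal validator for the spec above; `validate_schema` is a hypothetical helper, and a JSON Schema library could do the same job.

```python
# Allowed field types, mirroring the enum in the tool spec above.
ALLOWED_TYPES = ("string", "integer", "float", "boolean", "date")

def validate_schema(schema) -> list:
    """Return a list of human-readable problems; an empty list means the schema is usable."""
    if not isinstance(schema, list):
        return ["schema must be a list of field definitions"]
    problems = []
    for i, field in enumerate(schema):
        if not isinstance(field, dict):
            problems.append(f"schema[{i}] must be an object")
            continue
        for key in ("field", "type", "example"):
            if key not in field:
                problems.append(f"schema[{i}] is missing '{key}'")
        if "type" in field and field["type"] not in ALLOWED_TYPES:
            problems.append(f"schema[{i}] has unsupported type {field['type']!r}")
    return problems
```

A non-empty result maps naturally onto a SCHEMA_INVALID error returned to the agent without touching the network.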

Latency: where to spend the budget

For agentic workloads, the latency components stack like this:

  • DNS + TCP + TLS handshake: 50–500ms (avoidable with persistent connections)
  • HTTP fetch: 200ms–5s (depends on target)
  • HTML cleaning: 50–200ms
  • LLM extraction: 1–3s for a cheap-tier model at typical schema sizes
  • Type coercion + response: <50ms

Total: ~2–8 seconds for a typical fetch in 2026. Pre-warming the headless browser, persistent HTTP/2 connections, and a structured-data fast path (skip the LLM when JSON-LD/OG/Twitter Cards already cover the schema) shave ~3–5 seconds off the upper end.
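
The structured-data fast path can be sketched with the standard library alone. This is a simplified illustration: it only looks at JSON-LD, and it assumes the requested field names match top-level JSON-LD keys, which real pages often violate (prices, for example, usually sit under a nested `offers` object).

```python
import json
from html.parser import HTMLParser

class JsonLdCollector(HTMLParser):
    """Collect parsed <script type="application/ld+json"> blocks from a page."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buf = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buf = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buf)))
            except json.JSONDecodeError:
                pass  # malformed JSON-LD: ignore it and fall back to the LLM path

def fast_path(html: str, wanted: list):
    """Return the wanted fields if one JSON-LD block covers all of them, else None."""
    collector = JsonLdCollector()
    collector.feed(html)
    for block in collector.blocks:
        if isinstance(block, dict) and all(name in block for name in wanted):
            return {name: block[name] for name in wanted}
    return None  # caller falls through to LLM extraction
```

When `fast_path` returns a dict, the 1–3s LLM extraction step is skipped entirely for that call.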

Users tolerate per-call latency from an agent better than from a direct request (they can see the agent "working"), but you still want per-call latency under ~3s if you can manage it. Above that, the user experience of a 50-step task degrades sharply.

Cost: the model that doesn't kill you

Per-call cost for a typical extraction lands around $0.0004–$0.001 with the right stack (cheap-tier model as default, structured-data fast path when applicable, length-aware trimming for long pages). For an agent doing ~50 fetches per task, that's $0.02–$0.05 per task in scraping cost, well below the LLM reasoning cost for the agent's own model.

The cost lever you control is the schema. Specific schemas extract less, cost less, and return smaller responses. Vague schemas ("get all the data on this page") cost more and produce noisier responses. Be specific.

The other lever is caching. If your agents fetch the same URLs repeatedly across users (think: comparison shopping bots hitting the same product pages), a result cache with content-type-aware TTLs eliminates a meaningful fraction of LLM calls. Many scraping APIs ship this server-side; if you're rolling your own, build it.
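
If you do roll your own, a result cache keyed on URL plus schema with per-content-type TTLs might look like this. The TTL values and content-type labels are placeholders, and the in-memory dict stands in for whatever store you actually use.

```python
import time

# Illustrative TTLs in seconds; the labels and values are assumptions.
TTL_BY_CONTENT_TYPE = {"product": 15 * 60, "article": 24 * 60 * 60}
DEFAULT_TTL = 60 * 60

class ResultCache:
    """In-memory cache of extraction results, keyed by (url, schema_key)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock          # injectable so expiry can be tested
        self._entries = {}

    def get(self, url: str, schema_key: str):
        entry = self._entries.get((url, schema_key))
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[(url, schema_key)]   # expired: drop and report a miss
            return None
        return value

    def put(self, url: str, schema_key: str, value: dict, content_type: str) -> None:
        ttl = TTL_BY_CONTENT_TYPE.get(content_type, DEFAULT_TTL)
        self._entries[(url, schema_key)] = (self._clock() + ttl, value)
```

Keying on the schema as well as the URL matters, because two agents asking for different fields from the same page need different results.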

What to avoid

A few patterns we'd push back on:

Letting the agent write its own selectors or parsers. Agents are bad at this and the output rots. Schema-driven extraction over an LLM-backed pipeline is strictly better.

One huge fetch_html tool that returns raw HTML. The agent will choke on token count and re-parse with a second LLM call. Double the cost, halve the reliability.

Per-tool-call error strings. "Fetch failed: connection refused" is not a code. The agent needs URL_UNREACHABLE so it can route logic.

Sequential fetching when a concurrent batch would do. If the agent legitimately needs 10 URLs at once (e.g. comparison shopping across 10 retailers), use an async batch endpoint with concurrency. Sequential fetches kill latency. Runo's /batch endpoint runs concurrent fetches against a single schema.

Building bypass into the agent. This is a category error. Bypass is server-side infrastructure. The agent doesn't know about Cloudflare and shouldn't.
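
On the concurrency point above: if your scraping API has no batch endpoint, bounded concurrency on the client side gets most of the benefit. This sketch uses only `asyncio`; `fetch_one` is a hypothetical async callable wrapping a single extraction call, and the stub replaces real network traffic. It does not show how Runo's /batch endpoint works.

```python
import asyncio

async def fetch_all(urls: list, fetch_one, max_concurrency: int = 5) -> list:
    """Fetch all URLs concurrently, with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url: str):
        async with semaphore:
            return await fetch_one(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded(url) for url in urls))

# Stub standing in for a real extraction call.
async def _stub_fetch(url: str) -> dict:
    await asyncio.sleep(0.01)
    return {"data": {"url": url}}

urls = [f"https://example.com/retailer{i}" for i in range(10)]
results = asyncio.run(fetch_all(urls, _stub_fetch, max_concurrency=4))
```

The semaphore keeps you from tripping RATE_LIMITED on your own account while still overlapping the slow network waits.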

Picking a stack

The architectural choices for the data layer behind an agent:

| Choice | When |
|--------|------|
| Build it yourself (proxies + headless + LLM + plumbing) | You have an in-house team that already does scraping ops |
| Crawl-and-Markdown API (Firecrawl, etc.) | Your agent works on document content, not fields |
| AI extraction API (Runo, Diffbot's structured endpoints) | Your agent needs typed fields per URL, the most common case |
| Browser-as-a-service (Browserless, Browserbase) | Your agent literally needs to drive a browser session (form fills, multi-step workflows) |

For most agent builders, the AI extraction category is the right starting point because the output format matches what the agent's downstream logic expects. We compared three APIs across these categories in Firecrawl vs Apify vs Runo.

A reference integration shape

A few lines in your agent:

@tool
async def fetch_url_data(url: str, schema: list[dict]) -> dict:
    """Fetch a URL and extract structured data matching the schema."""
    return await runo_client.extract(url=url, schema=schema)

That's the integration. The agent decides the schema; Runo handles fetch, bypass, clean, extract, type coercion, and error taxonomy. You handle the orchestration.

The free tier is 500 requests/month, enough to wire your agent up and run a few real tasks before paying anything. The docs cover the full API surface.

TL;DR

  • The scraping tool determines the agent's reliability ceiling. Pick it carefully.
  • Return typed JSON, not Markdown. JSON is ~20× smaller and skips an interpretation pass.
  • Use a schema-driven extraction API. Force the agent to commit to a schema before the call.
  • Errors must be typed (URL_UNREACHABLE, FETCH_BLOCKED, RATE_LIMITED) so the agent's retry logic can branch correctly.
  • Make tool calls cancellable; refund unused work.
  • Handle bypass server-side, never in the agent.
  • Per-call cost ~$0.0004–$0.001 with the right stack, usually well below the agent's own LLM cost.
  • Start with an AI extraction API like Runo; build your own only if you have an existing scraping team.