How to monitor competitor prices with a scraping API

Price monitoring is one of the most common reasons companies start scraping. The use case is concrete: pull competitor SKU prices on a schedule, alert when something changes, feed the data into pricing models. The problem is that "scrape product pages" makes it sound simpler than it is in 2026, when most retail sites are wrapped in anti-bot protection and the schema you actually want spans more than just price.

This is a working architecture for competitor price monitoring with an AI-extraction-API stack. It's the same shape we'd build internally and the shape that holds up under retailer DOM changes, weekly bypass updates, and the ten edge cases nobody warns you about.

What you actually need to extract

"Price" is a misleading single word. A useful competitor pricing dataset has at minimum:

[
  { "field": "title",        "type": "string",  "example": "Acme Widget Pro 2026" },
  { "field": "sku",          "type": "string",  "example": "AWP-2026-BLK" },
  { "field": "price",        "type": "float",   "example": 29.99 },
  { "field": "salePrice",    "type": "float",   "example": 24.99 },
  { "field": "currency",     "type": "string",  "example": "USD" },
  { "field": "inStock",      "type": "boolean", "example": true },
  { "field": "stockMessage", "type": "string",  "example": "Only 3 left in stock" },
  { "field": "rating",       "type": "float",   "example": 4.5 },
  { "field": "reviewCount",  "type": "integer", "example": 142 },
  { "field": "shippingCost", "type": "float",   "example": 5.99 },
  { "field": "deliveryDays", "type": "integer", "example": 3 }
]

Why all of these matter:

price and salePrice separately so you can distinguish "they're discounting" from "they raised prices." Use a hint per field if needed: hint: "Original list price, not promotional" and hint: "Promotional price if shown, otherwise null".
stockMessage because "low stock" signals are competitive intelligence. A scarcity pattern often precedes a restock or a price hike.
rating + reviewCount because a competitor's product velocity (review accumulation rate) tells you about their unit volume.
shippingCost + deliveryDays because the customer-perceived price is price + shipping, and a competitor with cheaper shipping at the same listing price is winning.

If you only extract price, you're missing most of the actionable signal.
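The "price + shipping" comparison above can be sketched in a few lines. This is a minimal illustration, not part of any vendor API: the helper name `effective_price` is hypothetical, and it assumes a missing `shippingCost` means free shipping, which you may want to treat differently.

```python
from typing import Optional


def effective_price(snapshot: dict) -> Optional[float]:
    """Customer-perceived price: the price actually charged plus shipping."""
    # Prefer the promotional price when one is shown, else the list price.
    base = snapshot.get("salePrice")
    if base is None:
        base = snapshot.get("price")
    if base is None:
        return None  # nothing to compare without a price
    # Assumption: a missing shippingCost is treated as free shipping.
    shipping = snapshot.get("shippingCost") or 0.0
    return round(base + shipping, 2)


# Same listing price, different shipping: the second seller is cheaper overall.
ours = {"price": 29.99, "salePrice": None, "shippingCost": 5.99}
theirs = {"price": 29.99, "salePrice": None, "shippingCost": 0.0}
print(effective_price(ours), effective_price(theirs))
```

Comparing on this value rather than on `price` alone is what surfaces the competitor who matches your listing price but ships for less.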

The architecture

┌──────────────────────┐
│  SKU registry (DB)   │ ← URLs, identifiers, your SKU mapping
└──────────┬───────────┘
           │ scheduled (e.g. daily)
           ↓
┌──────────────────────┐
│  Scraping API call   │ ← /batch endpoint with all URLs + schema
│  (Runo, Firecrawl,   │
│   built in-house)    │
└──────────┬───────────┘
           │ typed JSON per URL
           ↓
┌──────────────────────┐
│  Snapshot store      │ ← time-series of extracted fields
└──────────┬───────────┘
           │ diff against previous
           ↓
┌──────────────────────┐
│  Change detector     │ ← flag price changes, stock changes
└──────────┬───────────┘
           │
           ├──→ Alerting (Slack, email)
           ├──→ Pricing model input
           └──→ BI dashboard

The interesting design choices live in three places: the scraping layer, the snapshot store, and the change-detection logic.

Scraping layer

For a competitor monitoring workload, the right call shape is /batch (or whatever your vendor's batch endpoint is called). One schema, N URLs, concurrent extraction.

curl -X POST https://api.scrapewithruno.com/v1/batch \
  -H "X-API-Key: $RUNO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://competitor.com/p/sku1", "https://competitor.com/p/sku2", ...],
    "schema": [...],
    "options": { "concurrency": 8 }
  }'

Why batch: polite scrapers cap per-host concurrency (4–8 parallel requests against one domain is sane), and the batch endpoint handles that for you. Sequential calls leave throughput on the table; uncoordinated parallel calls get you banned.

Why an AI extraction API rather than custom scrapers per retailer: every retailer redesigns their product page on their own schedule. Selector-based scrapers per site need maintenance every time anything changes. LLM-driven extraction with the schema above adapts to site changes by default. We covered this tradeoff in "LLM extraction vs CSS selectors".

Why bypass matters specifically for retail: most major retail sites (Amazon, Walmart, Best Buy, the long tail of Shopify-on-Cloudflare DTC stores) sit behind some combination of Cloudflare, DataDome, and PerimeterX. Without the bypass stack described in "How to scrape Cloudflare-protected sites" running server-side, you're going to see a wall of 403s.

Snapshot store

Don't just keep the latest value. Keep a time series.

A schema like:

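CREATE TABLE price_snapshots (
  sku_id     TEXT        NOT NULL,  -- your internal SKU identifier
  source_url TEXT        NOT NULL,  -- competitor page the snapshot came from
  scraped_at TIMESTAMPTZ NOT NULL,  -- timezone-aware scrape time
  data       JSONB       NOT NULL,  -- extracted fields, as returned by the API
  PRIMARY KEY (sku_id, source_url, scraped_at)
);

-- Serves "latest snapshots for this SKU" lookups
CREATE INDEX ON price_snapshots (sku_id, scraped_at DESC);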

Reasons to keep history:

Change detection compares snapshot N to snapshot N-1
Trend reports show "competitor price has dropped 3 times in the last month"
Backtesting your pricing model against historical competitor data
Debugging: when extraction goes weird, you want to see the raw history
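As an illustration of the trend-report use case, here is a minimal sketch that counts snapshot-to-snapshot price drops in a recent window. The function name `count_price_drops` and the flattened row shape (`scraped_at`, `price`) are assumptions for the example; in practice you would read these values out of the `data` column.

```python
from datetime import datetime, timedelta, timezone


def count_price_drops(history: list[dict], now: datetime, days: int = 30) -> int:
    """Count price decreases between consecutive snapshots in the last `days` days.

    `history` is a list of {"scraped_at": datetime, "price": float} rows,
    a simplified stand-in for rows read from price_snapshots.
    """
    cutoff = now - timedelta(days=days)
    rows = sorted(history, key=lambda r: r["scraped_at"])
    drops = 0
    # Compare each snapshot N with snapshot N-1.
    for prev, curr in zip(rows, rows[1:]):
        if curr["scraped_at"] < cutoff:
            continue  # change happened before the reporting window
        if prev["price"] is None or curr["price"] is None:
            continue  # extraction gap, cannot compare
        if curr["price"] < prev["price"]:
            drops += 1
    return drops


now = datetime(2026, 3, 31, tzinfo=timezone.utc)
history = [
    {"scraped_at": now - timedelta(days=d), "price": p}
    for d, p in [(40, 32.99), (25, 29.99), (18, 27.99), (10, 27.99), (3, 24.99)]
]
print(count_price_drops(history, now))
```

None of this is possible if the store only holds the latest value, which is the practical argument for the time series.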

Storage cost is trivial for sane volumes. 10K SKUs × daily snapshots × 1 KB per snapshot = ~3.6 GB/year. At PostgreSQL pricing on a managed service, single-digit dollars per month.
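The storage estimate is simple enough to parameterize. A small sketch, assuming decimal units (1 GB = 1,000,000 KB) and ignoring index and row overhead; the helper name is made up for the example:

```python
def yearly_storage_gb(skus: int, snapshots_per_day: int, kb_per_snapshot: float) -> float:
    """Raw snapshot payload per year in GB (decimal units, no index overhead)."""
    kb_per_year = skus * snapshots_per_day * 365 * kb_per_snapshot
    return kb_per_year / 1_000_000  # 1 GB = 1,000,000 KB


# The 10K-SKU daily case from the text.
print(yearly_storage_gb(10_000, 1, 1.0))
# A hypothetical higher-volume case: 100K SKUs, four runs per day.
print(yearly_storage_gb(100_000, 4, 1.0))
```

Plugging in your own SKU count and snapshot size is a quick way to tell whether you are still in "plain PostgreSQL" territory.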

If you're at higher volume (100K+ SKUs, multiple runs per day), graduate to a columnar store (ClickHouse, DuckDB-as-OLAP, or BigQuery) and partition by date. The query patterns are scan-heavy, and a column store wins decisively.

Change detection

The naïve version is: "if price changed since last snapshot, alert." The reality is that you'll get alerts on rounding noise, transient stock-message updates, and cents-per-dollar fluctuations on currency-converted listings.

Better rules:

def is_significant_change(prev: dict, curr: dict) -> list[str]:
    """Return human-readable alerts for meaningful differences between two snapshots."""
    alerts = []

    # Price changed by >1% (ignore rounding/currency noise).
    # The truthiness check also skips missing prices and avoids dividing by zero.
    prev_price, curr_price = prev.get("price"), curr.get("price")
    if prev_price and curr_price:
        delta_pct = abs(curr_price - prev_price) / prev_price
        if delta_pct > 0.01:
            alerts.append(f"price changed {prev_price} → {curr_price}")

    # Stock state flipped (boolean change is always significant)
    if prev.get("inStock") != curr.get("inStock"):
        alerts.append(f"stock changed {prev.get('inStock')} → {curr.get('inStock')}")

    # Sale price appeared or disappeared
    if (prev.get("salePrice") is None) != (curr.get("salePrice") is None):
        alerts.append("sale price toggled")

    # Title changed (probably means SKU has been replaced or remerchandised)
    if prev.get("title") != curr.get("title"):
        alerts.append("title changed, SKU may have been replaced")

    return alerts

The "title changed" check is one nobody thinks of and it catches a real failure mode: retailers sometimes reuse a URL for a different product, and your "price went up by $200" alert is actually "this SKU is now a different product."

Frequency and politeness

Be honest with yourself about how often you need fresh data.

Daily is fine for most pricing intelligence. Most retailers don't change prices more than once a day.
Hourly if you're in a flash-sale-sensitive category (consumer electronics, fashion drops, scarcity merch).
Continuous (sub-hourly) is rarely justified. At that frequency you're approaching what the retailer might consider abuse.

The polite default is daily, jittered across the day so you're not hammering all SKUs at midnight. Within a single host, cap concurrency at 4–8 parallel requests with 200–800ms jitter between them. Your scraping API should do this for you; if it doesn't, do it yourself.
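If you do have to do it yourself, one way to enforce the per-host cap is a semaphore per hostname plus a random pause before each request. This is a sketch under assumptions: `fetch_politely` is a hypothetical helper, `fetch` is whatever async callable you supply, and the demo uses a stand-in fetcher with shortened jitter so it runs quickly.

```python
import asyncio
import random
from collections import defaultdict
from urllib.parse import urlsplit


async def fetch_politely(urls, fetch, per_host=4, jitter_ms=(200, 800)):
    """Run `fetch(url)` for every URL, capping in-flight requests per host.

    Results come back in the same order as `urls`.
    """
    # One semaphore per hostname, created lazily on first use.
    limits = defaultdict(lambda: asyncio.Semaphore(per_host))

    async def one(url):
        host = urlsplit(url).hostname or ""
        async with limits[host]:
            # Random pause so requests to one host are not evenly spaced.
            await asyncio.sleep(random.uniform(*jitter_ms) / 1000)
            return await fetch(url)

    return await asyncio.gather(*(one(u) for u in urls))


async def _demo():
    in_flight = 0
    peak = 0

    async def fake_fetch(url):  # stand-in for a real HTTP call
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return url

    urls = [f"https://competitor.example/p/sku{i}" for i in range(12)]
    results = await fetch_politely(urls, fake_fetch, per_host=4, jitter_ms=(0, 5))
    return results, peak


results, peak = asyncio.run(_demo())
print(len(results), "fetched; peak in-flight requests:", peak)
```

The jitter sleep happens while the semaphore slot is held, so a host never sees more than `per_host` requests in progress even when the URL list is long.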

A quick legal note: scraping public product pages for competitive intelligence is generally legal in most jurisdictions, but the boundary is fuzzy and case law (hiQ v. LinkedIn and successors) has been unpredictable. The defensible posture is: respect robots.txt, identify your scraper with a stable User-Agent, don't bypass authentication, don't extract personal data, don't impersonate a real customer (no fake order placement). This isn't legal advice; consult counsel for your specific situation.

What to alert on

Once the data flows, the question is what's worth a human's attention.

Signal	Alert?
Price drop on a high-volume SKU	Yes (pricing team)
Price increase on a high-volume SKU	Yes (pricing team)
Out-of-stock on competitor product where we're in stock	Yes (marketing/sales team)
New product launched in a category we cover	Yes (product team)
Review velocity spike on a competitor product	Yes (analytics)
Daily price tick of <1%	No (noise)
Title change without price change	Yes (flags possible SKU replacement)

Route alerts by stakeholder. Send pricing alerts to a #pricing-intel Slack channel; send product launch alerts to product. Don't put everything in one firehose. It gets ignored within a week.
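Routing can be as simple as a lookup table keyed by alert kind. A minimal sketch: only `#pricing-intel` comes from the text above; the other channel names, the alert kinds, and `route_alerts` itself are hypothetical placeholders.

```python
# Hypothetical routing table: alert kind -> Slack channel.
ROUTES = {
    "price": "#pricing-intel",
    "sale": "#pricing-intel",
    "title": "#pricing-intel",
    "stock": "#sales-marketing",
    "launch": "#product",
    "review_velocity": "#analytics",
}
DEFAULT_CHANNEL = "#competitor-misc"  # catch-all for unknown alert kinds


def route_alerts(alerts: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group (kind, message) alerts by destination channel."""
    by_channel: dict[str, list[str]] = {}
    for kind, message in alerts:
        channel = ROUTES.get(kind, DEFAULT_CHANNEL)
        by_channel.setdefault(channel, []).append(message)
    return by_channel


routed = route_alerts([
    ("price", "AWP-2026-BLK: price changed 29.99 → 24.99"),
    ("stock", "AWP-2026-BLK: stock changed True → False"),
    ("launch", "new SKU seen in the widgets category"),
])
for channel, messages in routed.items():
    print(channel, messages)
```

Grouping before posting also lets you send one digest message per channel per run instead of one message per alert.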

What to do with the data

The dataset is the input to a few different things:

Dynamic repricing. Feed competitor prices into a model that adjusts your own listing prices. Even simple rules ("match the lowest competitor price within 2%") can produce real margin improvements.
Margin analysis. Your COGS plus competitor pricing tells you which SKUs have headroom for price increases without losing share.
Product launch monitoring. Knowing a competitor launched a new SKU same-day is the difference between a 6-week reaction time and a 6-hour reaction time.
MAP (Minimum Advertised Price) compliance enforcement. For brands selling through resellers, watching whether resellers honour MAP.

Each of these is a separate pipeline downstream of the snapshot store. The scraping layer doesn't need to know about any of them. It just keeps the snapshot table fresh.
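To show how thin a downstream pipeline can be, here is a sketch of the MAP compliance check reading from the latest snapshots. The function name, the `map_prices` lookup, and the `reseller` field are assumptions for the example, and it treats the sale price, when present, as the advertised price.

```python
def map_violations(map_prices: dict[str, float], listings: list[dict]) -> list[dict]:
    """Return reseller listings advertised below the SKU's MAP."""
    violations = []
    for row in listings:
        floor = map_prices.get(row["sku"])
        # Advertised price: the sale price if shown, else the list price.
        advertised = row.get("salePrice")
        if advertised is None:
            advertised = row.get("price")
        if floor is None or advertised is None:
            continue  # no MAP on file, or no price extracted
        if advertised < floor:
            shortfall = round(floor - advertised, 2)
            violations.append({**row, "map": floor, "shortfall": shortfall})
    return violations


map_prices = {"AWP-2026-BLK": 29.99}
listings = [
    {"reseller": "shop-a", "sku": "AWP-2026-BLK", "price": 29.99, "salePrice": None},
    {"reseller": "shop-b", "sku": "AWP-2026-BLK", "price": 29.99, "salePrice": 24.99},
]
for v in map_violations(map_prices, listings):
    print(v["reseller"], "is", v["shortfall"], "below MAP")
```

The check needs nothing from the scraping layer beyond the extracted fields, which is the separation the paragraph above describes.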

A reasonable starter implementation

Pseudocode for a daily monitor:

import asyncio
import os
from datetime import datetime, timezone

import httpx

# db, slack, and log_failure are placeholders for your own storage,
# alerting, and logging helpers; is_significant_change is defined above.
API_KEY = os.environ["RUNO_API_KEY"]

SCHEMA = [
    {"field": "title",       "type": "string",  "example": "Acme Widget Pro 2026"},
    {"field": "sku",         "type": "string",  "example": "AWP-2026-BLK"},
    {"field": "price",       "type": "float",   "example": 29.99},
    {"field": "salePrice",   "type": "float",   "example": 24.99},
    {"field": "inStock",     "type": "boolean", "example": True},
    {"field": "rating",      "type": "float",   "example": 4.5},
    {"field": "reviewCount", "type": "integer", "example": 142},
]

async def daily_run():
    urls = await db.fetch_active_skus()  # list of competitor URLs
    async with httpx.AsyncClient(timeout=120) as client:
        r = await client.post(
            "https://api.scrapewithruno.com/v1/batch",
            headers={"X-API-Key": API_KEY},
            json={"urls": urls, "schema": SCHEMA, "options": {"concurrency": 8}},
        )
        r.raise_for_status()  # fail loudly on a non-2xx batch response
        results = r.json()["results"]

    # One timezone-aware timestamp for the whole run (utcnow() is deprecated).
    now = datetime.now(timezone.utc)
    for result in results:
        if result["status"] != "success":
            await log_failure(result["url"], result["error"])
            continue

        # Read the previous snapshot before inserting the new one.
        prev = await db.fetch_latest_snapshot(result["url"])
        await db.insert_snapshot(result["url"], result["data"], scraped_at=now)

        if prev:
            alerts = is_significant_change(prev["data"], result["data"])
            for alert in alerts:
                await slack.post(f"{result['url']}: {alert}")

if __name__ == "__main__":
    asyncio.run(daily_run())

That's the entire monitoring loop in roughly 40 lines. Schedule it on cron (or your orchestrator of choice), point it at your real SKU list, and you have a working price intelligence pipeline.

Cost

For a 1,000-SKU daily monitor running 365 days/year:

1,000 × 365 = 365,000 requests/year
At $0.0009/request (Runo Pro tier blended), ~$330/year scraping cost
Plus storage: trivial (~$10/year)
Plus alerting: $0 (Slack webhook is free)

For about $30/month all-in, you have a working competitor price intelligence pipeline against 1,000 SKUs. Compare to the engineer-quarter you'd spend building the scraping layer yourself.
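The arithmetic above, written out so you can plug in your own numbers. A sketch only: the function name is hypothetical, the per-request price and storage figure are the ones quoted in this section, and your actual rate will depend on your plan and volume.

```python
def yearly_cost(skus: int, runs_per_day: int = 1,
                price_per_request: float = 0.0009,
                storage_per_year: float = 10.0) -> dict:
    """Rough yearly cost of a monitor: scraping requests plus storage."""
    requests = skus * runs_per_day * 365
    scraping = requests * price_per_request
    total = scraping + storage_per_year
    return {
        "requests": requests,
        "scraping": round(scraping, 2),
        "total": round(total, 2),
        "per_month": round(total / 12, 2),
    }


# The 1,000-SKU daily monitor from the text.
estimate = yearly_cost(1_000)
print(estimate)
```

Raising `runs_per_day` to 24 for an hourly monitor multiplies the request count by the same factor, which is the main cost lever in this setup.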

TL;DR

Don't extract just price. Extract title, SKU, salePrice, currency, inStock, stockMessage, rating, reviewCount, shipping. The non-price fields are where the signal lives.
Use a /batch endpoint with concurrency 4–8 per host. Don't fan out uncoordinated parallel calls.
Use LLM-based extraction (e.g. Runo), not selectors. Retailer DOMs change too often for selector maintenance to scale.
Keep a time-series snapshot store, not just the latest value. Storage is cheap; history is gold.
Alert on >1% price changes, stock state flips, sale toggles, and title changes (which often signal SKU replacement). Ignore sub-1% noise.
Daily polling is right for most use cases; hourly for flash-sale categories.
Total cost for a 1,000-SKU daily monitor is roughly $30/month, all-in.