Scraping Built On Facts: Engineering Choices That Match The Real Web


Modern crawlers succeed or fail on the strength of their assumptions.

Building those assumptions on measurable web data keeps systems stable, keeps bans rare, and keeps costs in check.

The web today looks different from the one many legacy scrapers were designed for, and the numbers make that clear.

HTTPS and multiplexed protocols are the default path

The open web is effectively encrypted by default. Chrome’s telemetry shows that over 95% of page loads happen over HTTPS, which means certificate handling, SNI, and ALPN negotiation are no longer optional details. On top of that, network traces collected at scale show protocol shifts that affect scraper architecture: more than two thirds of requests are served over HTTP/2, and HTTP/3 already accounts for a meaningful double‑digit share. Practical takeaway: a crawler that does not speak HTTP/2 is leaving throughput on the table, and one that mishandles connection coalescing or prioritization will trip rate limits despite modest request volumes.
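
As a minimal sketch of that takeaway, the snippet below reuses one multiplexed HTTP/2 connection for many requests instead of opening a socket per URL. It assumes the third-party httpx library installed with its http2 extra; the target URLs are placeholders.

```python
# Minimal sketch: many requests over one multiplexed HTTP/2 connection.
# Assumes `pip install "httpx[http2]"`; URLs are placeholders.
import httpx

urls = [f"https://example.com/page/{i}" for i in range(20)]

# ALPN negotiates h2 during the TLS handshake; requests to the same origin
# are multiplexed over a single connection instead of 20 separate sockets.
with httpx.Client(http2=True, timeout=10.0) as client:
    for url in urls:
        resp = client.get(url)
        print(url, resp.http_version, resp.status_code)
```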

IPv6 matters for reachability and block avoidance

Google’s measurement of IPv6 usage consistently hovers around 40% of user traffic. Many hosts, CDNs, and anti‑bot systems now publish AAAA records and apply different controls by address family. A dual‑stack proxy fleet improves origin reachability, reduces NAT contention, and creates more granular routing choices across regions. It also avoids false positives in blocklists that cluster by IPv4 subnets. Scrapers that never resolve AAAA, or that tunnel IPv6 through IPv4‑only egress, miss these advantages and face unnecessary refusal rates.
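
Whether a target is reachable over IPv6 starts with resolving AAAA records. The standard-library sketch below groups a host's addresses by family; the host name is a placeholder, and how a dual-stack fleet routes over the results is left to the scheduler.

```python
# Minimal sketch: resolve a host and group its addresses by family,
# so the scheduler can see which targets publish AAAA records.
import socket

def address_families(host: str, port: int = 443) -> dict:
    """Return resolved addresses for a host, grouped by address family."""
    results = {"ipv4": [], "ipv6": []}
    for family, _type, _proto, _canon, sockaddr in socket.getaddrinfo(
        host, port, proto=socket.IPPROTO_TCP
    ):
        if family == socket.AF_INET:
            results["ipv4"].append(sockaddr[0])
        elif family == socket.AF_INET6:
            results["ipv6"].append(sockaddr[0])
    return results

print(address_families("example.com"))  # placeholder host
```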

CDNs are everywhere, and they change the rules

About 30% of websites use a CDN. For scrapers, that single fact reshapes connection strategy. Many CDNs enforce per‑IP, per‑ASN, and per‑path limits, and they react differently to header shape, TLS ciphers, and protocol downgrade attempts. A crawler that opens a large number of short‑lived connections will trigger challenges sooner than one that reuses a few multiplexed sessions with realistic headers and pacing. Because CDNs front multiple origins on shared IPs, the cost of looking like a bot is amplified: poor hygiene can affect access to several unrelated targets at once.
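
One way to approximate that behaviour, sketched below with the same assumed httpx client, is a single long-lived session carrying browser-like headers and a fixed delay between requests. The header values and the delay are illustrative, not tuned numbers.

```python
# Minimal sketch: one long-lived, multiplexed session with realistic headers
# and simple pacing, instead of many short-lived connections.
# Header values and the delay are illustrative assumptions.
import time
import httpx

HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ...",  # placeholder UA string
    "Accept": "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    # httpx adds Accept-Encoding itself based on the decoders it has installed.
}

def fetch_paced(urls, delay_s=1.5):
    """Fetch URLs over one reused session with a fixed delay between requests."""
    with httpx.Client(http2=True, headers=HEADERS, timeout=15.0) as client:
        for url in urls:
            resp = client.get(url)
            yield url, resp.status_code
            time.sleep(delay_s)  # pacing keeps the per-IP request rate unremarkable
```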

Payload economics: pages are heavy, JavaScript is heavier than expected

HTTP Archive data shows the median mobile page weighs over 2 MB, and JavaScript alone commonly contributes around 450 KB. That size profile hurts scrapers twice: first in bandwidth, second in render time for any workflow that executes scripts. Headless execution should be a last resort reserved for pages where primary content truly requires it. When static fetches suffice, reduce transfer cost by requesting only essential resources, following rel=canonical carefully, and avoiding image endpoints. Small math helps scope budgets: at a 2 MB median, 100,000 pages represent roughly 200 GB of transfer; shaving even 20% saves dozens of gigabytes and noticeable time on modest links.
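
That budget arithmetic is easy to script. The sketch below simply restates the figures quoted above, so the constants are medians from the text rather than measurements of any particular site.

```python
# Minimal sketch of the transfer budget: crawl size x median page weight,
# and what trimming non-essential resources saves. Constants are the medians
# quoted above, not measurements of a specific target.
MEDIAN_PAGE_MB = 2.0
PAGES = 100_000

def transfer_gb(pages: int, page_mb: float, savings: float = 0.0) -> float:
    """Estimated transfer in GB after applying a fractional savings."""
    return pages * page_mb * (1.0 - savings) / 1024

baseline = transfer_gb(PAGES, MEDIAN_PAGE_MB)               # ~195 GB
trimmed = transfer_gb(PAGES, MEDIAN_PAGE_MB, savings=0.2)   # ~156 GB
print(f"baseline: {baseline:.0f} GB, with 20% trimmed: {trimmed:.0f} GB")
```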

Geography and latency influence ban rates

CDN footprints compress latency for nearby users, but cross‑continent round trips still add hundreds of milliseconds. That delay stacks with JavaScript execution and third‑party calls. For scrapers, long tails in latency push retries into overlap and look like bursts from the server’s perspective. Region‑aware routing, smaller per‑region concurrency, and protocol reuse lower perceived burstiness. Connecting from the same geography as the site’s primary audience also avoids some geo‑based bot heuristics that penalize distant traffic patterns.
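
A simple way to express region-aware pacing, sketched below with asyncio and the assumed httpx async client, is a separate concurrency cap per region so slow responses and retries in one geography cannot pile up into a burst. The region names and limits are illustrative.

```python
# Minimal sketch: cap in-flight requests separately per region so retries and
# slow responses in one geography do not stack into a burst.
# Region names and limits are illustrative assumptions, not tuned values.
import asyncio
import httpx

REGION_LIMITS = {"eu": 4, "us": 6, "apac": 2}

async def crawl(targets):
    """targets: iterable of (region, url) pairs, routed via region-local egress."""
    limits = {region: asyncio.Semaphore(n) for region, n in REGION_LIMITS.items()}

    async with httpx.AsyncClient(http2=True, timeout=15.0) as client:
        async def fetch(region, url):
            async with limits[region]:  # at most N requests in flight per region
                resp = await client.get(url)
                return url, resp.status_code

        return await asyncio.gather(*(fetch(r, u) for r, u in targets))

# asyncio.run(crawl([("eu", "https://example.com/a"), ("us", "https://example.com/b")]))
```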

Headers, fingerprints, and small correctness details

When most traffic is HTTPS and multiplexed, small correctness issues become loud signals. Mismatched Accept‑Language, missing compression, impossible viewport strings, or ALPN downgrades stand out more than they used to. Handling conditional requests properly, sending If‑None‑Match when an ETag is cached and accepting 304 Not Modified in return, prevents wasteful full transfers and mirrors real browsers. Honoring robots.txt and crawl‑delay avoids server‑side throttles that accumulate silently until a block lands. All of these are operational rather than philosophical points, but they show up directly in success rates and cost per page.
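
A conditional-request sketch, again assuming httpx and an in-memory cache kept deliberately simple for brevity, looks roughly like this: send If-None-Match when an ETag has been seen and reuse the stored body on a 304.

```python
# Minimal sketch: ETag-based revalidation. On a repeat visit the server can
# answer 304 Not Modified and skip sending the body again.
import httpx

etag_cache: dict[str, str] = {}    # url -> last seen ETag (in-memory for brevity)
body_cache: dict[str, bytes] = {}  # url -> last full body

def fetch_conditional(client: httpx.Client, url: str) -> bytes:
    headers = {}
    if url in etag_cache:
        headers["If-None-Match"] = etag_cache[url]
    resp = client.get(url, headers=headers)
    if resp.status_code == 304:        # unchanged: reuse the cached body
        return body_cache[url]
    if "ETag" in resp.headers:
        etag_cache[url] = resp.headers["ETag"]
        body_cache[url] = resp.content
    return resp.content
```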

Data hygiene starts with proxy hygiene

Good results depend on clean, predictable egress. Standardizing authentication formats, protocols, and address families across vendors prevents hard‑to‑trace spikes in 407s, 5xx ladders, or TLS failures. A small investment in tooling pays off: normalize and validate endpoints before they ever reach the scheduler, make dual‑stack behavior explicit, and tag entries by geography and ASN to spread risk intelligently. If you maintain lists from many suppliers, a simple way to reduce friction is to pass them through a reliable proxy formatter so every component gets what it expects.
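
A normalizer of that kind can stay small. The sketch below assumes two common vendor formats (host:port:user:pass and full proxy URLs) plus an illustrative region tag; the field names and accepted formats are assumptions, not a description of any particular tool.

```python
# Minimal sketch: normalize mixed-format proxy entries into one canonical
# scheme://user:pass@host:port form before they reach the scheduler.
# Accepted input formats and the region/ASN tags are assumptions.
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse

@dataclass
class ProxyEndpoint:
    url: str                  # canonical form, e.g. "http://user:pass@203.0.113.10:8080"
    region: str               # tag used to spread risk across geographies
    asn: Optional[str] = None

def normalize(entry: str, region: str) -> ProxyEndpoint:
    """Accept 'host:port', 'host:port:user:pass', or a full proxy URL."""
    if "://" in entry:
        parsed = urlparse(entry)
        if not (parsed.hostname and parsed.port):
            raise ValueError(f"missing host or port: {entry!r}")
        return ProxyEndpoint(url=entry, region=region)
    parts = entry.split(":")
    if len(parts) == 4:
        host, port, user, password = parts
        return ProxyEndpoint(url=f"http://{user}:{password}@{host}:{port}", region=region)
    if len(parts) == 2:
        host, port = parts
        return ProxyEndpoint(url=f"http://{host}:{port}", region=region)
    raise ValueError(f"unrecognized proxy format: {entry!r}")
```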

Putting the numbers to work

Treat the modern web’s baselines as constraints, not curiosities. Build for HTTPS everywhere and multiplexed protocols. Expect IPv6. Assume heavy pages, and avoid headless work unless it is the only path to the data. Shape traffic with geography and pacing in mind, because a substantial slice of the web sits behind CDNs that react quickly to anomalies. The figures above are not abstract; they are what your crawler meets minute by minute. Aligning with them is the simplest way to raise yield and lower noise.