    Crawl Budget for Large Ecommerce Sites: A Practical Playbook

    Crawl budget is the number of URLs Google is willing and able to crawl on your site in a given window. On a large store it gets eaten alive by faceted navigation, filter and sort parameters, and variant URLs, so your money pages go uncrawled. Large stores fix it by blocking junk URLs in robots.txt, canonicalising filters, flattening crawl depth, and proving the gains with log files.

    Source: Google, Crawl Budget Management, updated 2025.

    Chris Coussons | Visionary Marketing | Published: 28 June 2026 | 14 min read

    40%: share of URLs Google crawls on an unoptimised store (Botify)

    1%: share crawled on one 10m-page marketplace (Botify)

    +0.35 / -0.34: rank correlation, managed vs unmanaged facets (VM 2026)

    What crawl budget actually is (and why only big stores need to care)

    Crawl budget is the set of URLs Google can and wants to crawl on a host, set by two things: crawl capacity limit (how hard Google can hit your server without slowing it down) and crawl demand (how much Google wants your pages, driven by popularity, freshness and perceived inventory). Source: Google, Crawl Budget Management, updated 2025.

    Google itself is clear that most sites do not need to worry about this. The guidance is aimed at sites with over a million pages that change weekly, sites over ten thousand pages that change daily, or sites with a large share of URLs sitting in "Discovered, currently not indexed". Large catalogues hit all three.

    This article is the playbook that sits under our wider technical SEO for ecommerce checklist, focused entirely on getting Google to spend its crawl on the pages that earn.

    The crawl budget reference table

    A one-screen summary of the concept, the drains and the fix levers before we go deeper.

    Attribute | Value
    Definition | The set of URLs Googlebot can and will crawl on a host in a given window, governed by crawl capacity limit and crawl demand. Source: Google, 2025.
    What consumes it on ecommerce | Faceted navigation, filter and sort parameters, internal search result pages, session IDs, infinite pagination, and a separate URL per variant.
    How you detect it | Server log file analysis (Googlebot hits by URL type), plus the Crawl Stats report in Google Search Console.
    Primary fix levers | robots.txt disallow rules for parameters, canonical tags on filtered pages, flatter crawl depth, clean XML sitemaps, 404/410 for dead URLs, faster server response.
    Revenue impact | Uncrawled product and category pages cannot be indexed, so they cannot rank or earn. On an unoptimised store Google crawls roughly 40% of URLs. Source: Botify.
    Thresholds by catalogue size | Under ~10k URLs: rarely an issue. 10k to 1m with frequent change: monitor. 1m+ URLs: active management required. Source: Google, 2025.

    What drains crawl budget on a large catalogue

    The top culprit is faceted navigation. Filter, sort and variant combinations multiply your URL count into the millions without adding a single new product.

    A store with 5,000 products and twenty filterable attributes can spin up millions of addresses, most near-duplicate. The six biggest drains we see on real catalogues:

    • Filter and sort parameters (colour, size, price brackets, sort order).
    • Faceted navigation combinations (multiple filters applied together).
    • Internal search result pages crawlable and indexable.
    • Session IDs and tracking parameters appended to URLs.
    • Infinite or unbounded pagination chains.
    • A unique URL per variant (every colour and size of one product).

    On the facet point, the deep treatment is in our faceted navigation SEO guide. On the near-duplicate URL point, see ecommerce index bloat.
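
    To see how filters multiply URLs, here is a rough sketch of the arithmetic for a single category page. It assumes each attribute is either unset or set to exactly one value, and that every combination is reachable at its own URL; the function name and the example inputs are hypothetical, and a real store with multi-select filters would produce even more.

```python
from math import prod


def filter_url_count(values_per_attribute, sort_orders=1):
    """Count the crawlable URL variations of ONE category page.

    Assumption: each attribute is either unset or set to exactly one of
    its values, and each combination can be served in every sort order.
    The clean, unfiltered URL in the default sort order is excluded.
    """
    combinations = prod(v + 1 for v in values_per_attribute)
    return combinations * sort_orders - 1


# Hypothetical category: 20 attributes with a single selectable value each.
print(filter_url_count([1] * 20))  # 1048575
```

    Even in this stripped-down case, twenty on/off filters yield over a million filtered addresses for one category, none of which adds a product.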

    How to detect crawl waste with log file analysis

    The only way to see what Google actually crawls, rather than what you think it crawls, is your server log files.

    A log line records the requesting user agent, the URL, the status code, the timestamp and the bytes returned. Verify it really is Googlebot with reverse DNS rather than trusting the user-agent string (Google, verifying Googlebot guidance, 2025).
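
    The reverse DNS check can be sketched as below. This is a minimal illustration, not a production verifier: it does a reverse lookup on the IP, checks the hostname against the domains Google documents for its crawlers, then does a forward lookup to confirm the hostname resolves back to the same IP. The function name and the injectable lookup parameters are our own, and `socket.gethostbyname_ex` only covers IPv4.

```python
import socket

# Hostname suffixes Google documents for its crawlers.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com", ".googleusercontent.com")


def is_verified_googlebot(
    ip,
    reverse_lookup=socket.gethostbyaddr,
    forward_lookup=socket.gethostbyname_ex,
):
    """Return True only if the IP passes reverse then forward DNS checks."""
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False  # no PTR record, so it cannot be verified
    if not hostname.endswith(GOOGLE_SUFFIXES):
        return False  # user agent may say Googlebot, DNS says otherwise
    try:
        forward_ips = forward_lookup(hostname)[2]
    except OSError:
        return False
    # The hostname must resolve back to the IP that made the request.
    return ip in forward_ips
```

    DNS lookups are slow, so in practice you would cache the result per IP rather than checking every log line.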

    The practical method: pull a representative window of logs, isolate verified Googlebot, group requests by URL type (product, category, filter/parameter, search, other), and calculate the share of crawl spent on each. The headline metric is the percentage of crawl landing on parameter and search URLs that should never be crawled.
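
    A minimal sketch of that grouping step follows. The URL patterns in `RULES` are hypothetical and loosely modelled on a /shop/ and /search layout; you would swap in your own store's patterns. It assumes the lines passed in are already filtered to verified Googlebot.

```python
import re
from collections import Counter

# Pulls the requested URL out of the quoted request part of a log line.
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+"')

# Hypothetical URL patterns, checked in order; first match wins.
RULES = [
    ("search", re.compile(r"^/search")),
    ("filter/parameter", re.compile(r"\?")),
    ("product", re.compile(r"^/shop/[^/]+/[^/?]+$")),
    ("category", re.compile(r"^/shop/[^/]+/?$")),
]
WASTE_TYPES = {"search", "filter/parameter"}


def classify(url):
    """Label a URL with the first matching rule, else 'other'."""
    for label, pattern in RULES:
        if pattern.search(url):
            return label
    return "other"


def crawl_share(log_lines):
    """Return the fraction of Googlebot requests landing on each URL type."""
    counts = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match:
            counts[classify(match.group(1))] += 1
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}


def wasted_share(shares):
    """Headline metric: share of crawl spent on parameter and search URLs."""
    return sum(shares.get(label, 0.0) for label in WASTE_TYPES)
```

    Run it over the same window before and after a fix and compare `wasted_share` to see whether the change moved the crawl.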

    Pair this with the Crawl Stats report in Search Console, which shows total crawl requests, average response time and host availability problems. That is your capacity-limit signal. Site architecture also shapes what gets crawled, which we cover in ecommerce site architecture.

    66.249.66.1 "GET /shop/jackets/oxford-jacket HTTP/1.1" 200 → wanted: product

    66.249.66.1 "GET /shop/jackets/category HTTP/1.1" 200 → wanted: category

    66.249.66.1 "GET /shop/jackets?colour=red&sort=price HTTP/1.1" 200 → wasted: filter

    66.249.66.1 "GET /shop/jackets?colour=red&size=L&sort=price HTTP/1.1" 200 → wasted: facet combo

    66.249.66.1 "GET /search?q=red+jacket HTTP/1.1" 200 → wasted: internal search

    Verify Googlebot by reverse DNS, not the user-agent string. Source: Google, 2025.

    The crawl budget calculator

    Plug in your catalogue size and Googlebot's daily rate, then see how long Google needs to crawl every URL at your current waste level versus a healthier 10%.

    Crawl Budget Calculator: how long to crawl your whole catalogue

    Days to full crawl as-is: 125 days
    After fixing waste to 10%: 56 days
    Days saved: 69 days

    At this crawl rate, cutting waste from 60% to 10% gets your full catalogue crawled 69 days sooner.

    Indicative estimate. Real crawl rate varies by server response, site authority and demand. Crawl-rate concept and capacity limit per Google, Crawl Budget Management, 2025.
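
    The calculation behind figures like these can be sketched as total URLs divided by the effective crawl rate, where the effective rate is the daily crawl rate multiplied by the share that is not wasted. The calculator's own inputs are not preserved in this text, so the 250,000 URLs and 5,000 requests per day below are assumptions, chosen because they reproduce the 125, 56 and 69 day outputs.

```python
def days_to_full_crawl(total_urls, daily_crawl_rate, waste_share):
    """Estimate days for Googlebot to reach every wanted URL once.

    Assumes only the non-wasted share of daily requests lands on URLs
    you actually want crawled, and that the rate stays constant.
    """
    if not 0 <= waste_share < 1:
        raise ValueError("waste_share must be in the range [0, 1)")
    effective_rate = daily_crawl_rate * (1 - waste_share)
    return total_urls / effective_rate


# Assumed inputs: 250,000 URLs, 5,000 Googlebot requests per day.
as_is = days_to_full_crawl(250_000, 5_000, 0.60)
fixed = days_to_full_crawl(250_000, 5_000, 0.10)
print(round(as_is), round(fixed), round(as_is - fixed))  # 125 56 69
```

    Like the calculator, this is indicative only: real crawl rates shift with server response and demand.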

    Crawl depth in plain terms is the number of clicks from the homepage to a page. Pages buried six clicks deep get crawled rarely. Our rule: every revenue page within three or four clicks of the homepage.
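
    Click depth can be measured with a breadth-first search over your internal link graph. The sketch below assumes you already have that graph as a dictionary of page to linked pages (for example from a site crawler export); the function names and the four-click default are illustrative, taken from the rule of thumb above.

```python
from collections import deque


def click_depths(links, start="/"):
    """Breadth-first search: minimum clicks from the homepage to each page."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def audit_revenue_pages(links, revenue_pages, max_depth=4):
    """Return (pages deeper than max_depth, pages with no path from home)."""
    depths = click_depths(links)
    too_deep = sorted(p for p in revenue_pages if depths.get(p, 0) > max_depth)
    orphans = sorted(p for p in revenue_pages if p not in depths)
    return too_deep, orphans
```

    Pages in the second list are orphans: no chain of internal links reaches them from the homepage, so they depend on sitemaps or external links to be discovered.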

    How to reclaim crawl budget in seven steps

    You fix crawl budget by removing junk from the crawl, then concentrating it on revenue pages.
    1. Block parameter and sort URLs in robots.txt. Disallow the filter, sort and session parameters that create near-duplicate URLs. Google will not shift freed budget elsewhere unless it is already hitting your serving limit, so do this to stop waste, not as a magic boost. Source: Google, 2025.
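    2. Canonicalise filtered pages to the parent category. Point every filtered or sorted version you leave crawlable at the clean category URL so signals consolidate on one page. Google cannot read a canonical tag on a URL that robots.txt blocks, so choose one control per URL pattern.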
    3. Do not rely on noindex for crawl control. Google still has to request a noindexed page before it sees the tag, which spends budget. Use robots.txt to keep it out of the crawl entirely. Source: Google, 2025.
    4. Return 404 or 410 for dead URLs, and flatten redirect chains. A 404 is a strong signal to stop crawling a URL. Long redirect chains cost an extra request at every hop and have a negative effect on crawling. Source: Google, 2025.
    5. Flatten crawl depth and fix orphan pages. Get revenue pages within three or four clicks of the homepage and make sure every important page has internal links pointing at it.
    6. Keep XML sitemaps to clean, indexable, canonical URLs only, with accurate lastmod values so Google can prioritise what changed.
    7. Speed up server response and re-crawl to confirm. Faster responses raise the crawl capacity limit, so Google can read more. Then re-pull logs and confirm the parameter-crawl share has dropped. Source: Google, 2025.
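
    Before shipping robots.txt rules for step 1, it helps to test candidate Disallow patterns against real URLs from your logs. Google's robots.txt matching supports `*` (any run of characters) and a trailing `$` (end of URL). The sketch below is a simplified matcher written for this illustration: it ignores Allow rules and longest-match precedence, and the patterns and parameter names (`sort`, `colour`, `sessionid`) are hypothetical.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern ('*' and trailing '$') to a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


# Hypothetical Disallow patterns for search, sort, filter and session URLs.
DISALLOW = ["/search", "/*?*sort=", "/*?*colour=", "/*?*sessionid="]


def is_blocked(path_and_query, disallow_patterns=DISALLOW):
    """True if any Disallow pattern matches the URL's path plus query string."""
    return any(
        robots_pattern_to_regex(p).match(path_and_query)
        for p in disallow_patterns
    )


print(is_blocked("/shop/jackets?colour=red&sort=price"))  # True
print(is_blocked("/shop/jackets/oxford-jacket"))          # False
```

    Checking a sample of product and category URLs the same way guards against a pattern that accidentally blocks revenue pages.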

    First-party data: what the numbers say

    Faceted navigation handling correlates with ecommerce rank at +0.35 when managed and -0.34 when left unmanaged, and unmanaged faceted navigation wastes 2.7x more crawl budget on parameter URLs. Source: Visionary 2026 Ecommerce SEO Ranking Factor Study, 100,000 pages crawled Q1 2026. Set this alongside Botify's figure that Google crawls roughly 40% of URLs on an unoptimised store.

    Days to crawl a 250,000-URL catalogue, by crawl rate and waste level. Computed from total URLs divided by effective crawl rate. Crawl-rate model per Google, 2025.

    How this playbook beats the standard advice

    What you get | Google's own doc | Typical SEO tool guide | This playbook
    Definition of crawl budget | Yes | Yes | Yes
    Ecommerce-specific drains named | Partial | Partial | Yes, all six
    Log file analysis walkthrough | No | Sometimes | Yes, step by step
    Interactive crawl budget calculator | No | No | Yes
    First-party correlation data | No | No | Yes, 100k-page study
    UK ecommerce framing and GBP | No | No | Yes

    Where this fits, and where to get help

    Crawl budget is the first technical fix to make on any large store, ahead of speed and schema work. A fast page Google never crawls earns nothing.

    This is exactly the work we do inside our ecommerce SEO services. Log-led crawl audits are the first step on large catalogues.

    Methodology and sources

    • Google, Crawl Budget Management, Google Search Central, updated 2025.
    • Google, verifying Googlebot guidance, Google Search Central, 2025.
    • Botify, crawl budget research (carried via our technical SEO for ecommerce pillar).
    • Visionary 2026 Ecommerce SEO Ranking Factor Study, 100,000 pages crawled Q1 2026.

    Frequently Asked Questions

    What is crawl budget?

    Crawl budget is the number of URLs Googlebot is willing and able to crawl on your store in a given window. It is set by crawl capacity limit (how hard Google can hit your server safely) and crawl demand (how much Google wants your pages). On a large catalogue, filter and variant URLs eat into it. Source: Google, Crawl Budget Management, updated 2025.

    Do small stores need to worry about crawl budget?

    Usually not. Google says crawl budget management is for sites with over a million pages that change weekly, sites over ten thousand pages that change daily, or sites with many URLs stuck in 'Discovered, currently not indexed'. A small store crawled the same day it publishes does not need to worry. Source: Google, 2025.

    What wastes the most crawl budget on an ecommerce site?

    Faceted navigation. Filters, sort orders and variant combinations spin up huge numbers of near-duplicate URLs that soak up crawl and leave less for product and category pages. In our 2026 study, unmanaged faceted navigation wasted 2.7x more crawl on parameter URLs. Source: Visionary 2026 Ecommerce SEO Ranking Factor Study.

    How do I find out where crawl is being wasted?

    Server log file analysis. Filter your logs to verified Googlebot, then group requests by URL type to see how much crawl lands on product pages versus filter and search URLs. Cross-reference with the Crawl Stats report in Google Search Console for total requests and server response time. Source: Google, 2025.

    Should I use robots.txt or noindex to control crawl?

    robots.txt for crawl control. Google still has to request a noindexed page before it can read the tag, which spends budget. To keep junk URLs out of the crawl entirely, disallow them in robots.txt. Source: Google, 2025.

    What is crawl depth?

    Crawl depth is the number of clicks from the homepage to a page. Pages buried deep get crawled less often and rank worse. Keep every revenue page within three or four clicks of the homepage.

    How long does it take to see results after a fix?

    It varies by catalogue size and crawl rate. Re-pull your logs a few weeks after the fix and check whether the share of crawl spent on parameter URLs has dropped and the share on revenue pages has risen. That shift is the leading indicator, before rankings move.

    About the Author

    Chris Coussons, Founder of Visionary Marketing

    Chris is the founder of Visionary Marketing, a world-leading, award-winning UK SEO and Google Ads agency named in Digital Reference's Best UK Digital Marketing Agencies 2026. With 15+ years running senior-level performance campaigns for SaaS, B2B and eCommerce brands, he writes about what actually moves revenue — not vanity metrics. Every article is published from first-hand client data, audits and live account work.