Crawl Budget for Large Ecommerce Sites: A Practical Playbook
Crawl budget is the number of URLs Google is willing and able to crawl on your site in a given window. On a large store it gets eaten alive by faceted navigation, filter and sort parameters, and variant URLs, so your money pages go uncrawled. Large stores fix it by blocking junk URLs in robots.txt, canonicalising filters, flattening crawl depth, and proving the gains with log files.
Source: Google, Crawl Budget Management, updated 2025.
- 40%: share of URLs Google crawls on an unoptimised store (Botify).
- 1%: share crawled on one 10m-page marketplace (Botify).
- +0.35 / -0.34: rank correlation for managed vs unmanaged facets (Visionary 2026 study).
What crawl budget actually is (and why only big stores need to care)
Crawl budget is the set of URLs Google can and wants to crawl on a host, set by two things: crawl capacity limit (how hard Google can hit your server without slowing it down) and crawl demand (how much Google wants your pages, driven by popularity, freshness and perceived inventory). Source: Google, Crawl Budget Management, updated 2025.
Google itself is clear that most sites do not need to worry about this. The guidance is aimed at sites with over a million pages that change weekly, sites over ten thousand pages that change daily, or sites with a large share of URLs sitting in "Discovered, currently not indexed". Large catalogues hit all three.
This article is the playbook that sits under our wider technical SEO for ecommerce checklist, focused entirely on getting Google to spend its crawl on the pages that earn.
The crawl budget reference table
A one-screen summary of the concept, the drains and the fix levers before we go deeper.
| Attribute | Value |
|---|---|
| Definition | The set of URLs Googlebot can and will crawl on a host in a given window, governed by crawl capacity limit and crawl demand. Source: Google, 2025. |
| What consumes it on ecommerce | Faceted navigation, filter and sort parameters, internal search result pages, session IDs, infinite pagination, and a separate URL per variant. |
| How you detect it | Server log file analysis (Googlebot hits by URL type), plus the Crawl Stats report in Google Search Console. |
| Primary fix levers | robots.txt disallow rules for parameters, canonical tags on filtered pages, flatter crawl depth, clean XML sitemaps, 404/410 for dead URLs, faster server response. |
| Revenue impact | Uncrawled product and category pages cannot be indexed, so they cannot rank or earn. On an unoptimised store Google crawls roughly 40% of URLs. Source: Botify. |
| Thresholds by catalogue size | Under ~10k URLs: rarely an issue. 10k to 1m with frequent change: monitor. 1m+ URLs: active management required. Source: Google, 2025. |
What drains crawl budget on a large catalogue
The top culprit is faceted navigation. Filter, sort and variant combinations multiply your URL count into the millions without adding a single new product.
A store with 5,000 products and twenty filterable attributes can spin up millions of addresses, most near-duplicate. The six biggest drains we see on real catalogues:
- Filter and sort parameters (colour, size, price brackets, sort order).
- Faceted navigation combinations (multiple filters applied together).
- Internal search result pages left crawlable and indexable.
- Session IDs and tracking parameters appended to URLs.
- Infinite or unbounded pagination chains.
- A unique URL per variant (every colour and size of one product).
On the facet point, the deep treatment is in our faceted navigation SEO guide. On the near-duplicate URL point, see ecommerce index bloat.
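To make the multiplication concrete, here is a rough Python sketch of the arithmetic. The twenty attributes come from the example above; treating each one as a simple on/off toggle is our simplifying assumption, and real facets with several values each, plus sort orders, push the count far higher.

```python
from math import prod


def facet_url_count(values_per_attribute):
    """Count distinct filter combinations for one category listing.

    Each attribute is either unset or set to exactly one of its values,
    so it contributes (values + 1) choices. The count includes the
    all-unset combination, which is the clean category URL itself.
    """
    return prod(v + 1 for v in values_per_attribute)


# Assumption for illustration: twenty attributes, each an on/off toggle.
on_off_toggles = [1] * 20
print(facet_url_count(on_off_toggles))  # 1048576
```

That is over a million addresses for a single category listing, before sort order or parameter ordering is counted.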
How to detect crawl waste with log file analysis
The only way to see what Google actually crawls, rather than what you think it crawls, is your server log files.
A log line records the requesting user agent, the URL, the status code, the timestamp and the bytes returned. Verify it really is Googlebot with reverse DNS rather than trusting the user-agent string (Google, verifying Googlebot guidance, 2025).
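A minimal Python sketch of that check, assuming the two-step process Google describes: reverse-resolve the IP, confirm the hostname sits under a Google domain, then forward-resolve the hostname and confirm it maps back to the same IP. The function name and the injectable lookup parameters are ours, added so the logic can be exercised without live DNS. The forward lookup used here is IPv4 only.

```python
import socket

# Hostname suffixes accepted as Google crawlers in this sketch.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if the IP reverse-resolves to Google and forward-confirms.

    reverse_lookup and forward_lookup are hypothetical hooks for testing;
    by default the standard library resolver is used.
    """
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip)
        if not host.endswith(GOOGLE_SUFFIXES):
            return False  # user agent may say Googlebot, DNS says otherwise
        return ip in forward_lookup(host)
    except OSError:
        return False  # lookup failed, so treat the request as unverified
```

On a large log it is worth caching the result per IP, since the same handful of crawler addresses repeat across millions of lines.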
The practical method: pull a representative window of logs, isolate verified Googlebot, group requests by URL type (product, category, filter/parameter, search, other), and calculate the share of crawl spent on each. The headline metric is the percentage of crawl landing on parameter and search URLs that should never be crawled.
Pair this with the Crawl Stats report in Search Console, which shows total crawl requests, average response time and host availability problems. That is your capacity-limit signal. Site architecture also shapes what gets crawled, which we cover in ecommerce site architecture.
66.249.66.1 "GET /shop/jackets/oxford-jacket HTTP/1.1" 200 → wanted: product
66.249.66.1 "GET /shop/jackets/category HTTP/1.1" 200 → wanted: category
66.249.66.1 "GET /shop/jackets?colour=red&sort=price HTTP/1.1" 200 → wasted: filter
66.249.66.1 "GET /shop/jackets?colour=red&size=L&sort=price HTTP/1.1" 200 → wasted: facet combo
66.249.66.1 "GET /search?q=red+jacket HTTP/1.1" 200 → wasted: internal search
Verify Googlebot by reverse DNS, not the user-agent string. Source: Google, 2025.
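The grouping step can be sketched in a few lines of Python. This assumes the log has already been reduced to verified Googlebot requests and uses a simplified line format modelled on the sample above. The three buckets (clean, parameter, search) and the `/search` path are illustrative; splitting clean URLs into product and category depends on your own URL scheme.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches the request path and status in lines like the samples above.
REQUEST_RE = re.compile(r'"GET (\S+) HTTP/[\d.]+" (\d{3})')


def classify(url):
    """Bucket a URL as search, parameter or clean (illustrative rules)."""
    parts = urlsplit(url)
    if parts.path.startswith("/search"):
        return "search"
    if parts.query:
        return "parameter"
    return "clean"


def crawl_share(log_lines):
    """Return the share of Googlebot requests landing in each bucket."""
    counts = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match:
            counts[classify(match.group(1))] += 1
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.items()} if total else {}


SAMPLE = [
    '66.249.66.1 "GET /shop/jackets/oxford-jacket HTTP/1.1" 200',
    '66.249.66.1 "GET /shop/jackets/category HTTP/1.1" 200',
    '66.249.66.1 "GET /shop/jackets?colour=red&sort=price HTTP/1.1" 200',
    '66.249.66.1 "GET /shop/jackets?colour=red&size=L&sort=price HTTP/1.1" 200',
    '66.249.66.1 "GET /search?q=red+jacket HTTP/1.1" 200',
]
shares = crawl_share(SAMPLE)
waste = shares.get("parameter", 0) + shares.get("search", 0)
print(f"Crawl wasted on parameter and search URLs: {waste:.0%}")  # 60%
```

On the five sample lines, three of the five requests land on parameter or search URLs, which is the headline waste figure you would track between log pulls.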
The crawl budget calculator
Plug in your catalogue size and Googlebot's daily rate, then see how long Google needs to crawl every URL at your current waste level versus a healthier 10%.
Crawl Budget Calculator: how long to crawl your whole catalogue
| Result | Days |
|---|---|
| Days to full crawl as-is | 125 days |
| After fixing waste to 10% | 56 days |
| Days saved | 69 days |
At this crawl rate, cutting waste from 60% to 10% gets your full catalogue crawled 69 days sooner.
Indicative estimate. Real crawl rate varies by server response, site authority and demand. Crawl-rate concept and capacity limit per Google, Crawl Budget Management, 2025.
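The arithmetic behind the calculator can be sketched as total URLs divided by the effective crawl rate, where the effective rate is the daily crawl rate minus the wasted share. The inputs below are assumptions: a 250,000-URL catalogue and 5,000 Googlebot requests a day, picked because together they reproduce the 125, 56 and 69-day figures shown. Rounding to the nearest whole day is also our assumption.

```python
def days_to_full_crawl(total_urls, crawls_per_day, waste_pct):
    """Estimate days for Googlebot to request every URL in the catalogue once.

    waste_pct is the percentage of crawl landing on URLs that should
    never be crawled (parameters, internal search and so on).
    """
    if not 0 <= waste_pct < 100:
        raise ValueError("waste_pct must be at least 0 and below 100")
    effective_rate = crawls_per_day * (100 - waste_pct) / 100
    return round(total_urls / effective_rate)


# Assumed inputs, chosen to match the example output above.
as_is = days_to_full_crawl(250_000, 5_000, waste_pct=60)
fixed = days_to_full_crawl(250_000, 5_000, waste_pct=10)
print(as_is, fixed, as_is - fixed)  # 125 56 69
```

Like the calculator itself, this is an indicative model: it treats the crawl rate as constant and assumes Google requests each wanted URL exactly once.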
Crawl depth in plain terms is the number of clicks from the homepage to a page. Pages buried six clicks deep get crawled rarely. Our rule: every revenue page within three or four clicks of the homepage.
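Click depth is straightforward to measure once you have a crawl export of internal links. The sketch below runs a breadth-first search from the homepage. The link graph is a made-up example, and the four-click threshold follows the rule above.

```python
from collections import deque


def click_depths(links, home="/"):
    """Breadth-first search from the homepage over internal links.

    links maps each page to the pages it links to. The result maps every
    reachable page to its minimum number of clicks from the homepage.
    """
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


# Hypothetical link graph for illustration only.
links = {
    "/": ["/shop"],
    "/shop": ["/shop/jackets"],
    "/shop/jackets": ["/shop/jackets/oxford-jacket"],
    "/shop/old-promo": ["/shop/jackets"],  # no page links here
}
depths = click_depths(links)
too_deep = [page for page, depth in depths.items() if depth > 4]
orphans = [page for page in links if page not in depths]
print(depths["/shop/jackets/oxford-jacket"], too_deep, orphans)
# 3 [] ['/shop/old-promo']
```

Pages missing from the result are unreachable from the homepage, which is how orphan pages show up in this kind of check.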
How to reclaim crawl budget in seven steps
You fix crawl budget by removing junk from the crawl, then concentrating it on revenue pages.
- Block parameter and sort URLs in robots.txt. Disallow the filter, sort and session parameters that create near-duplicate URLs. Google will not shift freed budget elsewhere unless it is already hitting your serving limit, so do this to stop waste, not as a magic boost. Source: Google, 2025.
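- Canonicalise filtered pages to the parent category. Point every filtered or sorted version that remains crawlable at the clean category URL so signals consolidate on one page. Google can only read a canonical tag on a page it is allowed to fetch, so this covers the filtered URLs you have not blocked in robots.txt.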
- Do not rely on noindex for crawl control. Google still has to request a noindexed page before it sees the tag, which spends budget. Use robots.txt to keep it out of the crawl entirely. Source: Google, 2025.
- Return 404 or 410 for dead URLs, and flatten redirect chains. A 404 or 410 is a strong signal not to crawl a URL again. Long redirect chains spend a crawl request on every hop, so point each redirect straight at its final destination. Source: Google, 2025.
- Flatten crawl depth and fix orphan pages. Get revenue pages within three or four clicks of the homepage and make sure every important page has internal links pointing at it.
- Keep XML sitemaps to clean, indexable, canonical URLs only, with accurate lastmod values so Google can prioritise what changed.
- Speed up server response and re-crawl to confirm. Faster responses raise the crawl capacity limit, so Google can read more. Then re-pull logs and confirm the parameter-crawl share has dropped. Source: Google, 2025.
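The first two steps can be checked before you deploy them. The sketch below is a simplified take on Google-style robots.txt matching, where `*` is a wildcard and a trailing `$` anchors the end of the URL, plus a helper that derives the clean category URL a filtered page should canonicalise to. The disallow patterns and parameter names are hypothetical, borrowed from the sample log lines earlier. Allow rules and longest-match precedence are deliberately left out.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Hypothetical rules: adjust the parameter names to your own platform.
DISALLOW = ["/search", "/*?*sort=", "/*?*colour=", "/*?*size=", "/*?*sessionid="]


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(piece) for piece in body.split("*"))
    return re.compile("^" + regex + ("$" if anchored else ""))


def is_blocked(path_and_query):
    """True if any disallow pattern matches (simplified: no Allow rules)."""
    return any(robots_pattern_to_regex(p).match(path_and_query) for p in DISALLOW)


def canonical_category_url(url):
    """Drop the query string and fragment to get the parent category URL."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


print(is_blocked("/shop/jackets?colour=red&sort=price"))  # True
print(is_blocked("/shop/jackets/oxford-jacket"))  # False
print(canonical_category_url("/shop/jackets?colour=red&size=L&sort=price"))
# /shop/jackets
```

Running your top revenue URLs through a check like this before shipping a robots.txt change is a cheap way to catch a pattern that blocks more than you intended. These loose patterns also match any parameter whose name merely ends in `sort`, `colour` or `size`.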
First-party data: what the numbers say
Faceted navigation handling correlates with ecommerce rank at +0.35 when managed and -0.34 when left unmanaged, and unmanaged faceted navigation wastes 2.7x more crawl budget on parameter URLs. Source: Visionary 2026 Ecommerce SEO Ranking Factor Study, 100,000 pages crawled Q1 2026. Read this alongside Botify's figure that Google crawls only around 40% of URLs on unoptimised stores.
Days to crawl a 250,000-URL catalogue, by crawl rate and waste level. Computed from total URLs divided by effective crawl rate. Crawl-rate model per Google, 2025.
How this playbook beats the standard advice
| What you get | Google's own doc | Typical SEO tool guide | This playbook |
|---|---|---|---|
| Definition of crawl budget | Yes | Yes | Yes |
| Ecommerce-specific drains named | Partial | Partial | Yes, all six |
| Log file analysis walkthrough | No | Sometimes | Yes, step by step |
| Interactive crawl budget calculator | No | No | Yes |
| First-party correlation data | No | No | Yes, 100k-page study |
| UK ecommerce framing and GBP | No | No | Yes |
Where this fits, and where to get help
Crawl budget is the first technical fix to make on any large store, ahead of speed and schema work. A fast page Google never crawls earns nothing.
This is exactly the work we do inside our ecommerce SEO services. Log-led crawl audits are the first step on large catalogues.
Methodology and sources
- Google, Crawl Budget Management, Google Search Central, updated 2025.
- Google, verifying Googlebot guidance, Google Search Central, 2025.
- Botify, crawl budget research (carried via our technical SEO for ecommerce pillar).
- Visionary 2026 Ecommerce SEO Ranking Factor Study, 100,000 pages crawled Q1 2026.
Work With Visionary Marketing
Reclaim the crawl Google is wasting
We run log-led crawl audits on large UK ecommerce catalogues. Find the waste, fix it, and prove the recovery in the next log pull.
Visionary Marketing is a UK-based SEO and Google Ads agency that takes a data-led approach to growth. We don't guess — we analyse your market, competitors, and performance data to build strategies that drive measurable revenue. Every campaign is grounded in real numbers, not assumptions.