Every week we see sites with 30-60% of pages missing from Google's index. The root cause is almost never a single setting. Use this structured checklist to isolate crawl errors, misconfigured tags, and content quality issues. No fluff. Only actionable diagnostics.
Pages go missing at three gates: crawlability, indexability, and quality threshold. Most SEOs jump straight to content. That is a mistake. The first gate kills the highest number of pages. If a page cannot be found by Googlebot, no amount of rewriting will help.
A common situation we see: an agency inherits a 10,000-page site, runs a site: search, and sees only 4,000 results. They rewrite half the content. Three months later the index count is 4,050. The other 6,000 pages were blocked by a misconfigured robots.txt or a noindex tag inherited from a redesign. Crawl first. Index second. Content last.
This checklist follows the three-gate model. Each gate has 3-5 checks. Run them in order. Skip nothing.
Check robots.txt, sitemap submission, server response codes (200 vs 3xx/5xx), and internal linking depth.
Scan for noindex tags, canonical confusion, X-Robots-Tag headers, and login walls.
Evaluate uniqueness, length, E-E-A-T signals, internal linking density, and thin content risk.
Use Google Search Console's URL Inspection tool per page, or check in bulk via the Index Coverage report (renamed the Page indexing report in current versions of Search Console).
Implement fixes, request indexing via GSC, and re-check after 2-4 weeks.
| Failure Mode | Root Cause | Diagnostic Signal | Fix Timeline | Hidden Risk |
|---|---|---|---|---|
| Blocked by robots.txt: entire section excluded | Disallow rule too broad or left over from staging | URL Inspection shows 'Blocked by robots.txt' | 5 minutes to fix, 1-2 weeks to recrawl | Accidental disallow of /blog/ or /products/ |
| Noindex tag present: page returns 200 but has noindex | Template-level noindex inherited from dev site | View page source or use a browser extension; meta robots noindex present | 30 minutes to audit + fix, 2-4 weeks to reappear in the index | Noindex on pagination pages or filter URLs |
| Canonical to different URL: self-canonical missing or pointing elsewhere | CMS plugin misconfiguration or canonical set to homepage | Check the canonical link in the HTML; GSC shows 'Alternate page with proper canonical tag' | 1-2 hours to correct via template, 1-4 weeks to see index change | Canonical chains where A canonicalizes to B and B canonicalizes to C |
| Thin or duplicate content: page lacks unique value | Scraped content, auto-generated summaries, or affiliate pages with no original text | GSC 'Discovered - currently not indexed' for weeks; manual review shows <300 words | 2-4 weeks to rewrite + gain authority | Google may soft-404 the page instead of indexing it |
| Server errors (5xx): page intermittently fails to load | Resource limits, CDN misconfiguration, or database timeout | GSC shows 'Server error (5xx)'; crawl log shows 503s | 1-3 days for engineering fix | Partial errors only on mobile or during high traffic |
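The canonical-chain risk in the table is easy to miss by eye. The sketch below follows canonical declarations hop by hop and reports the path. It assumes you have already extracted a URL-to-canonical mapping from a crawl export; the function name `canonical_path` and the sample URLs are hypothetical.

```python
def canonical_path(url, canonical_of, max_hops=10):
    """Follow canonical declarations starting at url and return the visited path.

    canonical_of maps a URL to the canonical URL it declares.
    A URL missing from the mapping is treated as self-canonical.
    """
    path = [url]
    while len(path) <= max_hops:
        target = canonical_of.get(path[-1], path[-1])
        if target == path[-1]:
            break  # self-canonical: the chain ends here
        path.append(target)
        if target in path[:-1]:
            break  # loop detected: the last element repeats an earlier one
    return path


# Hypothetical mapping extracted from a crawl
canonical_of = {
    "/a": "/b",   # A canonicalizes to B
    "/b": "/c",   # B canonicalizes to C: a chain
    "/c": "/c",
    "/x": "/y",
    "/y": "/x",   # loop
}

for start in ("/a", "/c", "/x"):
    path = canonical_path(start, canonical_of)
    label = "chain or loop" if len(path) > 2 else "ok"
    print(start, "->", " -> ".join(path), f"({label})")
```

Any path longer than two entries means the declared canonical is not itself the final target, so the template should point directly at the last URL in the path.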
Open robots.txt in a browser. Check that the page URL is not disallowed. Pay attention to wildcards.
Submit a clean XML sitemap to Google Search Console. Verify that the sitemap includes the target page.
Run a crawl with Screaming Frog or Sitebulb. Filter by status code: look for 3xx, 4xx, 5xx on the target pages.
Check internal linking depth. Any page more than 3 clicks from the homepage may not be crawled regularly.
Inspect the page in GSC URL Inspection tool. Confirm the status is 'URL is available to Google'.
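The wildcard check in the first step can be approximated offline. The sketch below is a simplified matcher, not a full robots.txt parser: it assumes you have already picked out the Allow/Disallow rules that apply to Googlebot, treats `*` as "any sequence" and a trailing `$` as "end of URL", and lets the longest matching pattern win, with Allow winning ties. The rule list and the function names are hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal parts and join them with '.*' where '*' appeared
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_allowed(path, rules):
    """Return True if path is crawlable under the given (directive, pattern) rules."""
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        if pattern_to_regex(pattern).match(path):
            is_allow = directive.lower() == "allow"
            longer = len(pattern) > best_len
            tie_for_allow = len(pattern) == best_len and is_allow
            if longer or tie_for_allow:
                best_len = len(pattern)
                allowed = is_allow
    return allowed


# Hypothetical rules, e.g. left over from a staging setup
rules = [
    ("Disallow", "/staging/"),
    ("Allow", "/staging/public/"),
    ("Disallow", "/*?filter="),
    ("Disallow", "/*.pdf$"),
]

for path in ("/blog/post", "/staging/draft", "/staging/public/a",
             "/products?filter=red", "/guide.pdf"):
    print(path, "allowed" if is_allowed(path, rules) else "BLOCKED")
```

Treat the output as a first pass only; the URL Inspection result in the last step remains the authoritative answer for any individual page.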
Situation: An ecommerce site with 8,000 product pages. Only 5,600 were indexed. Competitor analysis showed similar sites at 90%+ indexation.
Diagnosis:
noindex tag. Root cause: the CMS applied 'noindex' to any product with 'out of stock' status. The tag was in the head but also duplicated in a plugin.
Fix: Removed noindex from out-of-stock products (redirected to similar in-stock instead). Rewrote 200 thin pages to 400+ words each with original copy. Requested reindexing via GSC Indexing API. After 5 weeks, index count rose to 7,400.
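For comparison with the 90%+ competitor benchmark, the case-study figures can be turned into indexation rates. This is plain arithmetic on the numbers quoted above; the helper name `indexation_rate` is hypothetical.

```python
def indexation_rate(indexed, total):
    """Return the share of pages indexed, as a percentage rounded to one decimal."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * indexed / total, 1)


before = indexation_rate(5600, 8000)  # before the fix
after = indexation_rate(7400, 8000)   # five weeks after the fix
print(f"before: {before}%  after: {after}%  gain: {after - before} points")
```

On these figures the site moved from 70.0% to 92.5% indexation, which puts it in the range observed for competitors.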
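In our audits, noindex tags are the most common cause of missing pages. But there are subtler traps. Canonical confusion is common: a page has a self-canonical but also a second canonical pointing to a filter page. Google has said that when a page carries multiple rel=canonical declarations it is likely to ignore all of them, so the URL it selects may not be the one you intend. Another trap is the X-Robots-Tag: noindex HTTP header. It carries the same weight as the meta tag, and when the two conflict Google applies the more restrictive directive, so a noindex header wins over an 'index' meta tag. We once found a CDN injecting a noindex header on all PDFs. The client had no idea.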
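Both traps can be checked in one pass over a page's response headers and HTML. The following is a minimal sketch using only the standard library; it assumes you have already fetched the headers and body by some other means, it only looks for the literal `noindex` directive, and the names `RobotsMetaParser` and `audit_indexability` are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect robots meta directives and rel=canonical hrefs from HTML."""

    def __init__(self):
        super().__init__()
        self.directives = []
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        attr = {name: (value or "") for name, value in attrs}
        if tag == "meta" and attr.get("name", "").lower() in ("robots", "googlebot"):
            content = attr.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]
        if tag == "link" and "canonical" in attr.get("rel", "").lower().split():
            self.canonicals.append(attr.get("href", ""))


def audit_indexability(headers, html):
    """Report noindex signals and conflicting canonicals for one page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    header_value = ""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":  # header names are case-insensitive
            header_value = value.lower()
    return {
        "header_noindex": "noindex" in header_value,
        "meta_noindex": "noindex" in parser.directives,
        "canonicals": parser.canonicals,
        "conflicting_canonicals": len(set(parser.canonicals)) > 1,
    }


# Hypothetical page: the meta tag says "index" but a header says noindex,
# and two different canonicals are declared.
sample_headers = {"X-Robots-Tag": "noindex"}
sample_html = """
<html><head>
<meta name="robots" content="index, follow">
<link rel="canonical" href="https://example.com/p/1">
<link rel="canonical" href="https://example.com/p?color=red">
</head><body></body></html>
"""
print(audit_indexability(sample_headers, sample_html))
```

Checking the header separately matters because viewing page source will never reveal a noindex that a CDN or server adds at the HTTP layer.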
For technical validation, refer to Google's own documentation on snippet and appearance controls. It confirms that the robots meta tag and headers are the primary signals. If you need to accelerate indexation after fixing these issues, some practitioners use automated verification tools; for a comparison of such services, see this index backlinks service comparison which breaks down turnaround times and verification methods.
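Edge case: pages behind a login wall cannot be indexed, even if you intend the content to be public. Googlebot does not log in, so it sees only the login page, a redirect to it, or a 401/403 response, and GSC reports the URL under that reason rather than as indexed. A 'noindex' tag on the login page itself is fine, but if a login wall blocks the content, Google cannot index it. Use rel="nofollow" on login links, not a blanket block.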
Start by exporting the Index Coverage report from Google Search Console. Filter by 'Excluded' and 'Error' statuses. Then run the checklist gate-by-gate: first check crawlability (robots, sitemap, status codes), then indexability (noindex, canonicals), then content quality. Document each failed check and assign a fix owner. Repeat the audit monthly.
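The gate-by-gate ordering can be applied to the export mechanically. The sketch below groups URLs by gate based on their reported status. The status labels and their assignment to gates are assumptions that follow this checklist's three-gate model; adjust them to match the exact labels in your own export. The names `GATE_BY_STATUS` and `triage` are hypothetical.

```python
# Assumed mapping from reported status to the gate that should be checked first
GATE_BY_STATUS = {
    "Blocked by robots.txt": "crawlability",
    "Server error (5xx)": "crawlability",
    "Not found (404)": "crawlability",
    "Excluded by 'noindex' tag": "indexability",
    "Alternate page with proper canonical tag": "indexability",
    "Discovered - currently not indexed": "quality",
    "Crawled - currently not indexed": "quality",
}

# Work through the gates in this order: crawl first, index second, content last
GATE_ORDER = ["crawlability", "indexability", "quality", "unmapped"]


def triage(rows):
    """Group (url, status) rows by gate; unknown statuses land in 'unmapped'."""
    buckets = {gate: [] for gate in GATE_ORDER}
    for url, status in rows:
        gate = GATE_BY_STATUS.get(status.strip(), "unmapped")
        buckets[gate].append(url)
    return buckets


# Hypothetical rows taken from an exported report
rows = [
    ("/blog/a", "Blocked by robots.txt"),
    ("/p/1", "Excluded by 'noindex' tag"),
    ("/p/2", "Discovered - currently not indexed"),
    ("/p/3", "Server error (5xx)"),
    ("/p/4", "Some new status"),
]

for gate, urls in triage(rows).items():
    print(f"{gate}: {len(urls)} URL(s) {urls}")
```

Keeping an explicit 'unmapped' bucket makes it obvious when the export contains a status the audit has not yet accounted for.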
WordPress sites often inherit a noindex tag from the 'Search Engine Visibility' setting in Settings > Reading. Also common: SEO plugins like Yoast or Rank Math applying noindex to categories, tags, or post types by default. Check the 'Advanced' tab in the post editor. Some themes inject a noindex via functions.php without notice.
This status means Google found the URL but chose not to index it yet, often due to low perceived content quality or insufficient crawl budget. Check word count (aim for 800+), internal links, and whether the page has been live less than 4 weeks. If the issue persists, improve the content and request indexing via GSC.
Use Screaming Frog SEO Spider: set the configuration to 'Check robots.txt' and crawl. Filter the results by 'Blocked by robots.txt'. Export the list. Alternatively, in Google Search Console, under 'Index > Index Coverage' (shown as 'Indexing > Pages' in current versions), filter by 'Blocked by robots.txt' to see all affected URLs. This is faster for large sites.
Noindex tells Google 'do not include this page in the index at all'. Canonical tells Google 'this URL is a duplicate, prefer the canonical instead'. If you use noindex, the page will not be indexed regardless of its content. If you use a canonical pointing elsewhere, the page may still be indexed if Google ignores the canonical.
After removing the noindex tag and requesting indexing via GSC URL Inspection, it usually takes 1-4 weeks for Google to recrawl and index the page. Factors: crawl budget of the site, page authority, and how quickly Google discovers the change. For large sites with high crawl budget, it can be as fast as 3 days.
Yes. Pages with zero internal links are often not discovered by Googlebot. Even if they are in the sitemap, deep pages (4+ clicks from homepage) have lower crawl priority. A practical rule: every important page should have at least 2-3 internal links from other indexed pages. Use breadcrumbs and related posts modules to distribute link equity.
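Click depth and unlinked pages can both be computed from a crawl's link graph with a breadth-first search. The sketch below assumes the graph is available as a dictionary of page to outgoing internal links and that the sitemap URLs are a plain list; the function names and sample data are hypothetical. Here a page counts as unlinked when it cannot be reached from the homepage at all.

```python
from collections import deque


def click_depths(links, home):
    """Return the minimum number of clicks from home to each reachable page."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # first visit is the shortest path in BFS
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def audit_links(links, home, sitemap_urls, max_depth=3):
    """Return (unreachable sitemap URLs, pages deeper than max_depth)."""
    depths = click_depths(links, home)
    unreachable = sorted(u for u in sitemap_urls if u not in depths)
    deep = sorted(u for u, d in depths.items() if d > max_depth)
    return unreachable, deep


# Hypothetical link graph: /p/1 sits 4 clicks deep, /p/2 has no inbound path
links = {
    "/": ["/cat"],
    "/cat": ["/cat/sub"],
    "/cat/sub": ["/cat/sub/list"],
    "/cat/sub/list": ["/p/1"],
}
sitemap_urls = ["/p/1", "/p/2"]

unreachable, deep = audit_links(links, "/", sitemap_urls)
print("unreachable from homepage:", unreachable)
print("deeper than 3 clicks:", deep)
```

Pages in either list are candidates for extra internal links from breadcrumbs or related-post modules.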
For small sites (under 10k pages): Google Search Console + Screaming Frog. For mid-scale (10k-100k): Sitebulb or DeepCrawl for automated crawl analysis. For enterprise (100k+): Botify or Oncrawl with log file analysis. All of these can export lists of pages with specific statuses like 'noindex', 'blocked by robots', or 'crawled but not indexed'.
Yes. The site: operator is not comprehensive. A page can be indexed but not shown in a site: query, because the operator returns only a sample of indexed URLs and its result counts are estimates. Use GSC URL Inspection to confirm index status definitively. If it says 'URL is on Google', the page is indexed regardless of site: results.