
Pruning URLs for News Websites: Guiding Bots to the Right Pages

When indexed URLs dropped from 2 million to 350,000, organic traffic grew.

Published Mar 16, 2026 · Updated Mar 16, 2026 · 4 min read

The Standard Playbook Works… Until It Doesn't

The standard SEO playbook does work, but in a highly competitive environment it might not be enough. Beyond making sure all technical SEO requirements and best practices are met, here are some common issues I've noticed when dealing with large news websites and publishing networks.

These networks might be a combination of many recently acquired websites, each with its own CMS, folder structure, taxonomy, and SEO history. There tends to be a lot of overlapping content and repeated, redundant URLs. In my experience, it is not uncommon to see a website with thousands of URLs but only a small percentage actually being indexed, with Google Search Console full of statuses like:

  • Crawled but currently not indexed
  • Discovered but currently not indexed
  • Duplicate without user-selected canonical

In this sea of URLs, we must help direct Google to the best content, i.e., the correct URLs.

Crawl budget is a real constraint: tag pages, pagination and filters, author pages, archives, tracking parameters... In extreme cases, it is possible to accidentally create thousands or even millions of URLs overnight, many of them with very little SEO value.

Understanding What Google Actually Sees

When I found a site where only a small fraction of URLs were being indexed, step one was to understand what Google was "seeing": how it was interpreting and perceiving the website, which pages and template types it chose to index, and which ones it chose not to.
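
A quick way to get that template-level view is to bucket an exported URL list by pattern. Here is a minimal sketch in Python, assuming a CSV export from Search Console with a "URL" column; the file name, column name, and path patterns are hypothetical, so adjust them to your site's structure:

```python
import csv
import re
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical template patterns for a news site; adjust to your URL structure.
PATTERNS = {
    "article": re.compile(r"^/\d{4}/\d{2}/"),  # e.g. /2024/03/some-story
    "tag":     re.compile(r"^/tag/"),
    "author":  re.compile(r"^/author/"),
    "archive": re.compile(r"^/archive/"),
}

def classify(url: str) -> str:
    parts = urlsplit(url)
    if parts.query:                      # tracking parameters, filters, etc.
        return "parameterized"
    for name, pattern in PATTERNS.items():
        if pattern.match(parts.path):
            return name
    return "other"

counts = Counter()
with open("gsc_export.csv", newline="") as f:  # assumed export file name
    for row in csv.DictReader(f):
        counts[classify(row["URL"])] += 1      # assumed column name

for template, n in counts.most_common():
    print(f"{template:14} {n:>8}")
```

Comparing these counts per index status quickly shows which template types Google likes and which ones it ignores.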

From there, I created a hard-coded sitemap with the most strategic URLs, updated all those pages to deep-link into the main categories and subcategories, and updated robots.txt to restrict several low-value tag pages and article URLs with parameters that were causing duplicates.
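
For illustration, a minimal sketch of what those robots.txt restrictions could look like; the paths and parameter names are hypothetical stand-ins, not the actual rules from that project:

```
User-agent: *
# Low-value template types (hypothetical paths)
Disallow: /tag/
Disallow: /archive/
# Parameterized duplicates of article URLs
Disallow: /*?utm_
Disallow: /*?ref=
```

The hard-coded sitemap itself is just a standard XML sitemap, except curated by hand around the most strategic URLs instead of auto-generated from the CMS:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Hand-picked strategic URLs only; example.com is a placeholder -->
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/politics/</loc></url>
  <url><loc>https://example.com/politics/elections/</loc></url>
</urlset>
```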

Deciding Which URLs Make the Cut

Choosing which URLs belong in that hard-coded sitemap came down to a handful of concrete signals: Google Analytics organic traffic numbers, search health (position, indexed keywords), the number of inbound links, the URL itself, page depth from the home page, and the content on the page.

The last time I looked at a URL and decided it didn't make the cut, the questions were straightforward. Was there a better version of that same page or content already live? Was that other version getting more traffic? If the page content was thin, if its traffic was low, or if a better version existed, it would get cut.

If there was not a better version, we would look into business goals and keyword traffic data to see if it would be worthwhile to create such content.
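
That decision logic is simple enough to sketch in code. The thresholds, field names, and example URL below are hypothetical assumptions for illustration, not the actual values used on that project:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PageSignals:
    """Signals gathered per URL from GA, GSC, and the crawl."""
    url: str
    organic_sessions: int          # Google Analytics organic traffic
    avg_position: float            # Search Console average position
    inbound_links: int
    depth_from_home: int
    word_count: int
    better_version: Optional[str] = None  # URL of a stronger duplicate, if any

def make_the_cut(page: PageSignals) -> bool:
    """Return True if the URL earns a spot in the hard-coded sitemap."""
    if page.better_version:            # a stronger version already exists
        return False
    if page.word_count < 300:          # thin content (threshold is assumed)
        return False
    if page.organic_sessions < 10 and page.inbound_links == 0:
        return False                   # no traffic and no links: cut
    return True

# Hypothetical example: a thin, low-traffic tag page
page = PageSignals(url="/tag/breaking-news", organic_sessions=3,
                   avg_position=48.2, inbound_links=0,
                   depth_from_home=4, word_count=120)
print(make_the_cut(page))  # False
```

The point is less the exact thresholds and more that every cut maps back to one of the concrete signals above.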

Fewer Indexed URLs, More Traffic

This was a few years ago and I don't have the exact numbers, but from memory, the count of indexed URLs in Google Search Console's Index Coverage report dropped from roughly 2 million to approximately 350,000.

Between robots.txt blocking bad pages, canonical tags on duplicates, and 301 redirects from older content to fresh versions, we shaved off many low-quality URLs without impacting traffic. By also feeding Google previously hidden-but-good pages, we managed to grow organic traffic while cutting hundreds of thousands of URLs.
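
For reference, a minimal sketch of those two mechanisms, with placeholder URLs throughout. The canonical tag lives in the duplicate page's head, while the 301 is handled server-side (shown here as an nginx rule, assuming an nginx setup):

```html
<!-- On the duplicate page: point crawlers at the preferred version -->
<link rel="canonical" href="https://example.com/news/fresh-version/" />
```

```nginx
# Permanently redirect an outdated article to its fresh replacement
location = /2018/old-version {
    return 301 https://example.com/news/fresh-version/;
}
```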

In this particular case, we began seeing changes within about 2 weeks, though timelines can vary significantly depending on site size and crawl frequency. Google and other search engines were no longer wasting time on thousands of extremely low-quality, near-duplicate pages, and quality pages that had previously gone unseen were being found and indexed.

Getting Buy-In and Getting Started

On the technical side, I recommend a Screaming Frog deep scan to find as many URL versions as possible. Add that to a large export from Search Console and you should have a hefty list of URLs to analyze.
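
A hedged sketch of that merge, assuming both tools have been exported to CSV. Screaming Frog crawl exports typically carry an "Address" column; the file names and the GSC column name below are assumptions, so rename them to match your own exports:

```python
import pandas as pd

# Screaming Frog crawl export ("Address" column) merged with a GSC pages export.
crawl = pd.read_csv("screaming_frog_internal.csv", usecols=["Address"])
gsc = pd.read_csv("gsc_pages.csv").rename(columns={"Top pages": "Address"})

# Outer join keeps URLs that appear in either source.
merged = crawl.merge(gsc, on="Address", how="outer", indicator=True)

# Orphans: Google reports on them, but the crawler never reached them.
orphans = merged[merged["_merge"] == "right_only"]
# Crawlable pages Google shows no search data for.
no_search_data = merged[merged["_merge"] == "left_only"]

print(f"{len(merged)} total URLs, {len(orphans)} orphans, "
      f"{len(no_search_data)} with no search data")
```

The two leftover buckets are exactly where the analysis starts: orphans Google knows about but your crawler can't reach, and crawlable pages that earn no search data.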

To get buy-in, you can create reports such as: "These 200,000 URLs gave us 23 impressions and no clicks last year. Google is wasting time visiting the wrong pages."
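
Producing that kind of number from the same GSC export takes a few lines. Again, the file and column names are assumptions:

```python
import pandas as pd

# Assumed columns: Address, Clicks, Impressions (rename to match your export)
gsc = pd.read_csv("gsc_pages.csv")

# URLs that earned zero clicks over the reporting period
dead_weight = gsc[gsc["Clicks"] == 0]

print(f"These {len(dead_weight):,} URLs gave us "
      f"{int(dead_weight['Impressions'].sum()):,} impressions "
      f"and no clicks last year.")
```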

In my case, I was allowed to test on smaller editions, or on a lower-traffic category, before rolling the changes out more widely.

When Everything Is Right and You're Still Losing

Competition was the problem. You can meet all technical requirements and still not be among the top 10 best user experiences for a given search query. This is the moment when hard KPIs are replaced by vague terms such as "content quality" or "thin content." You did the checklist: pages index, Core Web Vitals are fine, schema is clean, internal links make sense... That's when you realize the problem isn't technical anymore; it's relative competitiveness.

It's also time to step back, be honest with yourself, and determine whether your page does, indeed, belong among the top 10 best user experiences for that search query and that user's intent. Are there better pages? OK, no problem!

This realization is actually great news. It means we are ready for the next step: reverse-engineering the SERP experiences that are outranking you.

Frequently Asked Questions

Why do news and editorial websites struggle with SEO indexing?
Large publishing networks often consist of many recently acquired websites, each with its own CMS, folder structure, taxonomy, and SEO history. This leads to overlapping content, redundant URLs, and crawl budget waste from tag pages, pagination, filters, author pages, and tracking parameters — resulting in only a small percentage of URLs actually being indexed by Google.
What does 'crawled but currently not indexed' mean in Google Search Console?
This status means Google has visited and crawled the URL but decided not to include it in its search index. For news and editorial sites, this often indicates issues like thin content, duplicate pages, or low perceived value, and it's a common signal that a site has too many low-quality URLs competing for crawl budget.
How do you decide which URLs to include in a sitemap for a large publishing site?
The decision is based on concrete signals including Google Analytics organic traffic numbers, search health (position and indexed keywords), inbound link count, URL structure, page depth from the home page, and the quality of content on the page. If a better version of the same content exists, or if the page has thin content and low traffic, it gets cut from the sitemap.
Can reducing the number of indexed URLs actually increase organic traffic?
Yes. By cutting low-quality URLs through robots.txt blocking, canonical tags on duplicates, and 301 redirects from older content to fresh versions, you can focus Google's attention on your best content. In one case, indexed URLs dropped from roughly 2 million to around 350,000 without impacting traffic, and organic traffic actually grew because previously hidden but high-quality pages surfaced.
What is crawl budget and why does it matter for news websites?
Crawl budget refers to the number of pages Google will crawl on your site within a given timeframe. For news and editorial sites, tag pages, pagination, filters, author pages, archives, and tracking parameters can accidentally create millions of URLs overnight — many with little SEO value — which wastes crawl budget and prevents Google from discovering and indexing your most important content.
What is a hard-coded sitemap and how does it help SEO for publishers?
A hard-coded sitemap is a manually curated sitemap containing only the most strategic URLs rather than an auto-generated list of every page on the site. It helps direct Google toward your best content by prioritizing pages with strong traffic, good search health, and quality content, while excluding low-value or duplicate URLs that waste crawl budget.
How quickly can SEO improvements take effect on large editorial websites?
In the case study described, the impact of consolidating URLs, blocking low-quality pages, implementing canonicals, and setting up redirects became clear within just 2 weeks. This relatively fast turnaround was achieved by helping Google focus its crawling and indexing on the site's strongest content.