The Standard Playbook Works… Until It Doesn't
The standard SEO playbook does work, but in a highly competitive environment it might not be enough. Beyond making sure all technical SEO requirements and best practices are met, here are some recurring issues I've noticed when dealing with large news websites and publishing networks.
These networks might be a combination of many recently acquired websites, each with its own CMS, folder structure, taxonomy, and SEO history. There might be a lot of overlapping content and repeated, redundant URLs. In my experience, it is not uncommon to see a website with thousands of URLs but just a small percentage actually indexed. Open Google Search Console and you will see statuses like:
- Crawled but currently not indexed
- Discovered but currently not indexed
- Duplicate without user-selected canonical
In this sea of URLs, we must help direct Google to the best content, i.e., the correct URLs.
Crawl budget is a thing: tag pages, pagination and filters, author pages, archives, tracking parameters... In some extreme cases, it is possible to accidentally create an enormous number of URLs overnight, many of them with very little SEO value.
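To get a feel for how bad that bloat is, a quick script can collapse a crawled URL list down to parameter-free paths and count the variants. A minimal sketch, assuming a plain one-URL-per-line text file; the filename is hypothetical:

```python
from collections import Counter
from urllib.parse import urlsplit

# Count how many distinct URLs collapse onto the same parameter-free path.
# "crawl_urls.txt" is a hypothetical one-URL-per-line crawler export.
paths = Counter()
with open("crawl_urls.txt") as f:
    for line in f:
        url = line.strip()
        if not url:
            continue
        parts = urlsplit(url)
        # Drop query string and fragment; keep scheme + host + path.
        paths[f"{parts.scheme}://{parts.netloc}{parts.path}"] += 1

# Paths with many URL variants are prime crawl-budget offenders.
for path, count in paths.most_common(20):
    if count > 1:
        print(f"{count:>6} variants  {path}")
```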
Understanding What Google Actually Sees
When I found a site where only a small fraction of URLs were being indexed, step one was to understand what Google was “seeing”: how it was interpreting and perceiving the website, which pages and template types it chose to index, and which ones it did not.
From there, I created a hard-coded sitemap with the most strategic URLs, updated all those pages to deep link into the main categories and subcategories, and updated robots.txt to restrict several low-value tag pages and parameterized article URLs that were causing duplicates.
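To make that concrete, the two artifacts look roughly like this. The domain, paths, and dates are hypothetical placeholders; treat this as a sketch of the pattern, not the actual files:

```text
# robots.txt: keep crawlers away from low-value tag pages and parameterized duplicates
User-agent: *
Disallow: /tag/
Disallow: /*?utm_
Disallow: /*?ref=

Sitemap: https://www.example.com/sitemap-priority.xml
```

The hard-coded sitemap is then just a short, hand-curated list of the strategic URLs:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- sitemap-priority.xml: hand-curated, strategic URLs only -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/news/</loc>
    <lastmod>2020-01-15</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/news/major-evergreen-story/</loc>
    <lastmod>2020-01-10</lastmod>
  </url>
</urlset>
```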
Deciding Which URLs Make the Cut
Choosing which URLs belonged in that hard-coded sitemap came down to a handful of concrete signals: organic traffic numbers from Google Analytics, search health (average position, ranking keywords), number of inbound links, the URL itself, page depth from the home page, and the content on the page.
The last time I looked at a URL and decided it didn't make the cut, the questions were straightforward. Was there a better version of that same page or content already live? Was that better version getting more traffic? If the content was thin, the traffic was low, or a better version existed, the page would get cut.
If there was no better version, we would look into business goals and keyword traffic data to decide whether it would be worthwhile to create that content.
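As a sketch of that triage, imagine each URL as a record carrying those signals. Everything below is hypothetical illustration: the thresholds, field names, and example URL are invented, not the actual rules we used:

```python
from dataclasses import dataclass

@dataclass
class UrlSignals:
    url: str
    organic_sessions: int   # from Google Analytics
    avg_position: float     # from Search Console
    inbound_links: int
    depth_from_home: int    # clicks needed to reach it from the homepage
    word_count: int
    better_version: str | None = None  # URL of a superior live page, if any

def keep_in_sitemap(u: UrlSignals) -> bool:
    """Hypothetical triage rules mirroring the questions above."""
    if u.better_version:
        return False  # a stronger duplicate already exists: cut
    if u.word_count < 300 and u.organic_sessions < 10:
        return False  # thin content with no traffic: cut
    # Keep pages that earn traffic, rank, attract links,
    # or sit near the top of the site architecture.
    return (
        u.organic_sessions >= 10
        or u.avg_position <= 20
        or u.inbound_links >= 5
        or u.depth_from_home <= 2
    )

page = UrlSignals("https://www.example.com/tag/misc-2014/", 2, 48.0, 0, 5, 120)
print(keep_in_sitemap(page))  # False: thin, deep, unlinked, no traffic
```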
Fewer Indexed URLs, More Traffic
This was a few years ago and I don't have the exact numbers, but from memory, the number of indexed URLs in Google Search Console's Index Coverage report dropped from roughly 2 million to approximately 350,000.
Between robots.txt blocking bad pages, canonicals on duplicates, and 301 redirects from older content to fresh versions, we shaved off hundreds of thousands of low-quality URLs without impacting traffic. By also feeding Google previously hidden-but-good pages, we managed to grow organic traffic even as the indexed URL count fell dramatically.
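For reference, on the duplicate page itself the canonical is a single line in the `<head>`, e.g. `<link rel="canonical" href="https://www.example.com/news/original-story/">`, and a 301 from a stale article to its fresh version can be a one-liner in the server config. A hypothetical nginx sketch, with invented paths:

```nginx
# Permanently redirect a superseded article to its refreshed version.
location = /2015/06/old-coverage-of-topic/ {
    return 301 /news/updated-coverage-of-topic/;
}
```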
In this particular case, we began seeing changes within about two weeks, though timelines can vary significantly depending on site size and crawl frequency. Google and other search engines were no longer wasting time on thousands of extremely low-quality, near-duplicate pages, and quality pages that previously could not be seen were being found and indexed.
Getting Buy-In and Getting Started
On the technical side, I recommend a Screaming Frog deep scan to find as many URL versions as possible. Add that to a large export from Search Console and you should have a hefty list of URLs to analyze.
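A minimal sketch of that combination, assuming a Screaming Frog "Internal" export and a Search Console Performance "Pages" export saved as CSVs; the filenames and column names match typical exports but may differ in your setup:

```python
import pandas as pd

# Screaming Frog "Internal" export: one row per crawled URL ("Address" column).
crawl = pd.read_csv("internal_all.csv", usecols=["Address"])

# Search Console Performance export, "Pages" tab ("Top pages" column).
gsc = pd.read_csv("Pages.csv").rename(columns={"Top pages": "Address"})

merged = crawl.merge(gsc, on="Address", how="left")
merged[["Clicks", "Impressions"]] = merged[["Clicks", "Impressions"]].fillna(0)

# URLs that get crawled but earn essentially nothing in search.
dead_weight = merged[(merged["Clicks"] == 0) & (merged["Impressions"] < 10)]
print(f"{len(dead_weight):,} crawlable URLs with no clicks and <10 impressions")
```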
To get buy-in, you can create reports such as: "These 200,000 URLs gave us 23 impressions and no clicks last year. Google is wasting time visiting the wrong pages."
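Building on the `merged` frame from the sketch above, that headline number is a hypothetical two-liner (pick whatever cutoffs make your case):

```python
wasted = merged[merged["Clicks"] == 0]
print(f"These {len(wasted):,} URLs gave us "
      f"{int(wasted['Impressions'].sum()):,} impressions and no clicks last year.")
```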
In my case, I was allowed to test on smaller editions, or on a lower-traffic category, first.
When Everything Is Right and You're Still Losing
Competition was the problem. You can meet all technical requirements and still not be among the top 10 best user experiences for that search query. This is the moment where hard KPIs are replaced by vague terms such as "content quality" or "thin content." You did the checklist: pages index, Core Web Vitals are fine, schema is clean, internal links make sense… That’s when you realize the problem isn’t technical anymore, it’s relative competitiveness.
It is also time to step back, be honest with yourself, and determine if your page does, indeed, belong among the top 10 best user experiences for that search query and that user's intent… Are there better pages? OK, no problem!
This realization is actually great news… It means we are ready for the next step: reverse-engineer the SERP experiences that are outranking you.