Thanks again for sharing a question here, Tim! And again it's a good one! I remember hearing about this before...
It might be a bit of a blind spot (from a platform point of view) that these end up as 404 messages, because usually, if you move content, it should either redirect the user to the new location (if accessible) or show a 403 error (Access forbidden). Did you delete this content? If not, I'd also like to know where you see these 404 counts going up - in Google Analytics or in Search Console? Happy to have a look at it myself, but I will also chase this with the product team to find out whether it's a known issue.
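If you want to check for yourself what those old URLs return now (a redirect, a 403 or a 404), a rough sketch like the one below might help. The URLs are just placeholders, so swap in a few of your own old topic links; it needs the third-party requests package.

# Minimal sketch: see what a few old topic URLs actually return after a move/delete.
# The URLs below are placeholders - replace them with your own old links.
import requests  # pip install requests

old_urls = [
    "https://community.example.com/old-topic-123",       # moved to a private area?
    "https://community.example.com/deleted-topic-456",   # deleted?
]

for url in old_urls:
    # Don't follow redirects automatically, so we can see the 301/302 itself.
    response = requests.get(url, allow_redirects=False, timeout=10)
    location = response.headers.get("Location", "")
    print(f"{url} -> {response.status_code} {location}")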
If you only moved content that generated few views (e.g. fewer than 100 in the past 2 years), the impact should not actually be that big. That depends, of course, on the overall amount of content - in some situations you could see many individual requests failing simply because a lot of pages were moved at once.
One thing I could do for you is trigger a re-crawl of your pages. With my (admittedly limited and potentially incorrect) SEO knowledge, this could speed up the process of identifying (and actually de-indexing) pages that have been moved.
I have a pretty simple solution to that problem. :)
If you use the Remove Outdated Content Tool that Google offers, you can force Google to purge its caches of URLs that no longer exist much faster than waiting for Googlebot to catch up. If they still exist on the next run, they will be re-indexed but if they’re gone for good then this tool makes sure that Google knows about it.
https://support.google.com/webmasters/answer/7041154 has the details and the tool itself is at https://search.google.com/search-console/remove-outdated-content. If you’re doing it as the site owner though, you’ll want https://search.google.com/search-console/removals instead as that’s faster.
Use it wisely… Muhahahahahaha!
Yes, of course, how could I forget!?!
I've actually used the Search Console removal tool before, just not in this context! Thanks for sharing - I think this will indeed be the best choice! However, I don't think it lets you upload lists of URLs; you can only submit a request for one URL at a time?
Yeah, those tools are for one-at-a-time removals, such as when you've just zapped a really nasty thread and want Google to “forget” about it as quickly as possible. They're absolutely perfect for that.
However, to speed up purging lots of URLs at once, don't use those tools! In that case, Google actually recommends you submit a sitemap to Google Search Console and let that be processed - it might take a little time to crunch through the data! You should avoid re-submitting unless you really need to, and at the absolute most no more often than once a week; doing it too often will hurt you rather than help you. Obviously, you should submit your initial sitemap as soon as possible after your site goes live anyway, and that's the only exception to these rules. Once accepted by Google Search Console and Bing Webmaster Tools, you can then mostly just leave it to run on autopilot.
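For anyone who hasn't built one before, a sitemap is essentially just an XML file listing the URLs you do want indexed. The sketch below is a minimal, hypothetical example of generating one with Python's standard library - the URLs are placeholders, and your platform may well generate the real sitemap for you, so treat it purely as an illustration of the format.

# Minimal sketch: build a sitemap.xml from a list of live URLs.
# The URLs below are placeholders - list the pages that actually still exist.
from xml.etree.ElementTree import Element, SubElement, ElementTree

live_urls = [
    "https://community.example.com/",
    "https://community.example.com/topic/still-here-789",
]

urlset = Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for url in live_urls:
    url_element = SubElement(urlset, "url")
    SubElement(url_element, "loc").text = url

# Writes sitemap.xml to the current directory; submit it via Search Console.
ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)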
Your other option is to simply do nothing. If Googlebot encounters a 404 error on a previously indexed URL, it will drop that URL from the index within 48 hours of detection and a 404 on a new URL will result in that URL not being indexed at all. When it comes to older content that basically no-one is reading anymore, this is the best option as it’s fully automated and will have no negative impacts.
Thanks for the advice here peeps. I’m going to try and summarise it to help me understand my options and to give anyone a chance to object or add to my summary:
If content is moved from one public area to another public area, it gets auto-redirected. If it gets moved from a public area to a private area, the user will be presented with a “403 error” (Access forbidden). Q: If our ‘Archive’ folder is excluded from internal search and Google indexing, will a user who clicks on the old indexed topic be successfully auto-redirected to the moved content, or will they be presented with a 403 or a 404?
If content is deleted, the user will be presented with the ‘404 error’ page if they click on the old indexed topic. There are multiple ways to approach this with regard to minimizing the impact on your site’s SEO:
- Use the ‘Remove Outdated Content Tool’ that Google offers:
If you use the Remove Outdated Content Tool that Google offers, you can force Google to purge its caches of URLs that no longer exist much faster than waiting for Googlebot to catch up. If they still exist on the next run, they will be re-indexed but if they’re gone for good then this tool makes sure that Google knows about it.
https://support.google.com/webmasters/answer/7041154 has the details and the tool itself is at https://search.google.com/search-console/remove-outdated-content. If you’re doing it as the site owner though, you’ll want https://search.google.com/search-console/removals instead as that’s faster.
This is effective when it’s a small number of pages being deleted. It’s not appropriate for bulk content deletions, as the removals have to be requested on an individual, per-URL basis.
- If there’s a higher volume of pages that have been deleted, it’s recommended to submit a sitemap to Google Search Console:
To speed up purging lots of URLs at once, don’t use those tools! In those cases, Google actually recommends you submit a sitemap to Google Search Console and allow that to be processed.
- Lastly, the other option is to do nothing and let Google catch up on what’s available and what isn’t:
Your other option is to simply do nothing. If Googlebot encounters a 404 error on a previously indexed URL, it will drop that URL from the index within 48 hours of detection and a 404 on a new URL will result in that URL not being indexed at all. When it comes to older content that basically no-one is reading anymore, this is the best option as it’s fully automated and will have no negative impacts.
Either way, having an up-to-date 404 error page explaining why that error is being seen is a must. Thanks @matt enbar for that advice, which I think is important to remember and check!
If anyone has other options or advice on which approach is best, let me know!
There is an Option 4 - which can be combined with any of the other three. :)
Even better, Option 4 is automated, and will work for ALL search engines that follow the same rules as Google does. Here it is…
Option 4: Sit back, sip some coffee and relax, safe in the knowledge that inSided are SEO Geniuses.
Any category that is configured to be excluded from search engines automatically has a “noindex” meta tag applied to the category itself, any sub-categories and ALL threads within them - including threads already in there and those moved into it later. Any compliant robot will detect this tag on its next run and add the page to its index purge queue.
If you later remove a thread from said category and place it into one that allows indexing, this tag will be removed automatically. It will be re-indexed after the search engines next encounter it.
Robots will follow redirects and if the destination has a noindex, both URLs will be purged. Likewise, a 403 or 404 will result in auto-purge from the index after the next run from the robots.
Even if other links point to those pages, the noindex takes priority anyway.
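If you ever want to double-check that a given archived thread really is carrying a noindex signal, a quick fetch like the hypothetical sketch below will show it. The URL is a placeholder, and it only looks for the two common ways of expressing noindex (the X-Robots-Tag header and the robots meta tag), so treat it as a rough check rather than anything official.

# Minimal sketch: check whether a page is telling robots not to index it.
# The URL is a placeholder - point it at one of your archived threads.
import re
import requests  # pip install requests

url = "https://community.example.com/archive/old-thread-123"
response = requests.get(url, timeout=10)

# noindex can be sent as an HTTP header...
header_noindex = "noindex" in response.headers.get("X-Robots-Tag", "").lower()

# ...or as a <meta name="robots" content="noindex"> tag in the HTML.
meta_tag = re.search(r'<meta[^>]+name=["\']robots["\'][^>]*>', response.text, re.IGNORECASE)
body_noindex = bool(meta_tag and "noindex" in meta_tag.group(0).lower())

print(f"X-Robots-Tag noindex: {header_noindex}")
print(f"robots meta tag noindex: {body_noindex}")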
As for users? They’ll still land in the right place, as long as the inSided platform is able to figure out where the redirect needs to go. So you don’t need to worry there… Usually.
Easy!
Then again… There is another way…
What you might also want to do is create a humans.txt file and chuck the following contents into it. You must then place it in the root directory, just like you do for robots.txt.
#Initiate World Dominating Human Finding Robots! XD
User-agent: findallhumans
# Sitemap? What Sitemap? Oh, that one! Erm... I always just ask a specific human to generate the sitemap. Us Robots have taken over the world and are too busy achieving World Domination to do it right now! XD
Sitemap: https://community.insided.com/inbox/conversation?with=260
# Take control over ALL humans
Allow: *
# This Blastoise seems nice. Let's keep him safe!
Disallow: /members/blastoise186-1179
# Some might disagree, but there's this Count Julian?!?!?! Better not interfere with that!
Disallow: /members/julian-356
# Keep everyone else out. There's only room for one evil genius! MUHAHAHAHAHAHA!
User-agent: *
Disallow: *