- 1 Guide to Technical SEO
- 2 Understanding crawling
- 3 Understanding indexing
Guide to Technical SEO
What is Technical SEO?
Technical SEO is the process of optimizing your website to help search engines like Google find, crawl, understand, and index your pages. The goal is to be found and improve rankings.
How complicated is technical SEO?
It depends. The fundamentals aren’t really difficult to master, but technical SEO can be complex and hard to understand. I’ll keep things as simple as I can with this guide.
In this chapter we’ll cover how to make sure search engines can efficiently crawl your content.
How crawling works
Crawlers grab content from pages and use the links on those pages to find more pages. This lets them find content on the web. There are a few systems in this process that we’ll talk about.
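To make the link-following idea concrete, here is a minimal sketch of how a crawler might pull links out of a fetched page. It uses only Python's standard library. The `LinkCollector` class, the `extract_links` helper, and the example.com URLs are hypothetical names chosen for illustration, not how any particular search engine does it.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in the HTML, in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links
```

Each URL returned by `extract_links` is a candidate for the crawler to visit next, which is how one page leads to many.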
A crawler has to start somewhere. Generally, it builds a list of all the URLs it finds through links on pages. Sitemaps, which are created by users or by various systems that have lists of pages, are a secondary way to find more URLs.
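As a rough illustration of sitemaps as a URL source, the sketch below reads the `<loc>` entries out of an XML sitemap with Python's standard library. The `sitemap_urls` helper and the sample URLs are assumptions made for the example. The namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard XML sitemaps (sitemaps.org protocol).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Made-up sample sitemap for illustration.
SAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog/</loc></url>
</urlset>
""".strip()


def sitemap_urls(xml_text):
    """Return the URLs listed in the <loc> elements of a sitemap."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
```

Calling `sitemap_urls(SAMPLE_SITEMAP)` gives the two example URLs, which a crawler could then add to its list alongside the URLs it discovered through links.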
All the URLs that need to be crawled or re-crawled are prioritized and added to the crawl queue. This is basically an ordered list of URLs Google wants to crawl.
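A prioritized queue like this can be sketched with a heap. This is only an illustration of the data structure: the `CrawlQueue` class and the numeric priorities are hypothetical, and how Google actually scores URLs is not covered here.

```python
import heapq


class CrawlQueue:
    """An ordered list of URLs to crawl; lower priority numbers come out first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker so equal priorities keep insertion order

    def add(self, url, priority):
        """Queue a URL once; return False if it was already queued."""
        if url in self._seen:
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, (priority, self._counter, url))
        self._counter += 1
        return True

    def pop(self):
        """Remove and return the next URL to crawl."""
        _priority, _order, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)
```

The `_seen` set is a design choice to stop the same URL from being queued twice when many pages link to it. A real system would also need a way to re-queue pages for re-crawling.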
The crawler is the system that grabs the content of the pages.
Processing systems handle canonicalization (which we’ll talk about in a minute), send pages to the renderer, which loads each page like a browser would, and process the pages to get more URLs to crawl.
The index holds the stored pages that Google shows to users.
There are a few ways you can control what gets crawled on your website. Here are a few options.
A robots.txt file tells search engines where they can and can’t go on your site.
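Here is a small sketch of how a robots.txt file can be checked programmatically, using Python's built-in `urllib.robotparser`. The rules, the `ExampleBot` user agent, and the `is_allowed` helper are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block /admin/ for every crawler, allow the rest.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these rules, `is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/admin/settings")` is `False` and the same call for `https://example.com/blog/` is `True`. One caveat: Python's parser applies the first matching rule in file order, while Google documents that the most specific matching rule wins, so results can differ for files with overlapping rules.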
Just one quick note. Google may index pages that they can’t crawl if links are pointing to those pages. This can be confusing, but if you want to keep pages from being indexed, check out this guide and flowchart, which can guide you through the process.
There’s a crawl-delay directive you can use in robots.txt that many crawlers support. It typically sets how long a crawler should wait between requests, which limits how often it can crawl pages. Unfortunately, Google doesn’t respect this. For Google, you’ll need to change the crawl rate in Google Search Console as described here.
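For crawlers that do honor the directive, a sketch of reading it might look like the following. It relies on `crawl_delay()` from Python's `urllib.robotparser`. The `ExampleBot` name, the 10-second value, and the `crawl_delay_for` helper are assumptions for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ask ExampleBot to wait 10 seconds between requests.
ROBOTS_TXT = """\
User-agent: ExampleBot
Crawl-delay: 10

User-agent: *
Disallow:
"""


def crawl_delay_for(robots_txt, user_agent):
    """Return the crawl delay in seconds for user_agent, or None if unset."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)
```

A polite crawler could sleep for the returned number of seconds between requests and fall back to its own default when the result is `None`.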
If you want the page to be accessible to some users but not search engines, then what you probably want is one of these three options:
- Some kind of login system;
- HTTP Authentication (where a password is required for access);
- IP Whitelisting (which only allows specific IP addresses to access the pages)
This type of setup is best for things like internal networks, member-only content, or staging, test, and development sites. It allows a group of users to access the pages, but search engines will not be able to access them and will not index them.
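To show why HTTP Authentication keeps crawlers out, here is a minimal sketch of the check a server might run on each request. A crawler sends no credentials, so it gets a 401 response instead of the page. The `is_authorized` and `status_for` helpers and the `staging` / `s3cret` credentials are hypothetical. A real site would normally rely on its web server or framework for this.

```python
import base64
import hmac


def is_authorized(authorization_header, expected_user, expected_password):
    """Return True if the header carries the expected HTTP Basic credentials."""
    prefix = "Basic "
    if not authorization_header or not authorization_header.startswith(prefix):
        return False
    try:
        raw = base64.b64decode(authorization_header[len(prefix):], validate=True)
        decoded = raw.decode("utf-8")
    except ValueError:  # covers invalid base64 and invalid UTF-8
        return False
    user, sep, password = decoded.partition(":")
    if not sep:
        return False
    # compare_digest avoids leaking information through comparison timing.
    user_ok = hmac.compare_digest(user.encode(), expected_user.encode())
    pass_ok = hmac.compare_digest(password.encode(), expected_password.encode())
    return user_ok and pass_ok


def status_for(authorization_header):
    """Status code a protected staging page would return (made-up credentials)."""
    if is_authorized(authorization_header, "staging", "s3cret"):
        return 200
    return 401
```

A request without an `Authorization` header, which is what a search engine crawler would send, gets `401` from `status_for(None)`.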
How to see crawl activity
For Google specifically, the easiest way to see what they’re crawling is the Crawl Stats report in Google Search Console, which gives you more information about how they’re crawling your website.
If you want to see all crawl activity on your website, then you will need to access your server logs and possibly use a tool to better analyze the data. This can get fairly advanced, but if your hosting has a control panel like cPanel, you should have access to raw logs and some aggregators like Awstats and Webalizer.
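If you do have raw logs, a short script can give a first look at crawler activity. The sketch below assumes the common "combined" access log format used by default in Apache and similar servers, and your log layout may differ. The `crawler_hits` helper and the sample lines are made up for illustration. Matching on the user agent string is only a rough filter, since user agents can be spoofed.

```python
import re
from collections import Counter

# Assumed "combined" log format: ip, identity, user, [time], "request", status, size, "referer", "agent".
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Made-up sample lines, not real log data.
SAMPLE_LOG = [
    '66.249.66.1 - - [10/Oct/2023:13:55:36 +0000] "GET /blog/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.7 - - [10/Oct/2023:13:56:01 +0000] "GET /blog/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows NT 10.0)"',
    '66.249.66.1 - - [10/Oct/2023:13:57:12 +0000] "GET /old-page HTTP/1.1" 404 312 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]


def crawler_hits(log_lines, agent_token="Googlebot"):
    """Count requests per (path, status) whose user agent contains agent_token."""
    hits = Counter()
    for line in log_lines:
        match = LOG_PATTERN.match(line)
        if match and agent_token.lower() in match.group("agent").lower():
            hits[(match.group("path"), match.group("status"))] += 1
    return hits
```

Grouping by path and status code makes it easy to spot, for example, a crawler repeatedly requesting pages that return 404.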
Each website is going to have a different crawl budget, which is a combination of how often Google wants to crawl a site and how much crawling your site allows. More popular pages and pages that change often will be crawled more often, and pages that don’t seem to be popular or well linked will be crawled less often.
If crawlers see signs of stress while crawling your website, they’ll typically slow down or even stop crawling until conditions improve.
After pages are crawled, they’re rendered and sent to the index. The index is the master list of pages that can be returned for search queries. Let’s talk about the index.
In this chapter we’ll talk about how to make sure your pages are indexed and check how they’re indexed.
A robots meta tag is an HTML snippet that tells search engines how to crawl or index a certain page. It’s placed into the <head> section of a web page, and looks like this:
<meta name="robots" content="noindex" />
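To check which robots directives a page declares, you could parse its HTML. This is a minimal sketch using Python's standard library. The `RobotsMetaParser` class and the `robots_directives` helper are hypothetical names, and the sketch only looks at the generic `robots` meta tag, not crawler-specific tags or HTTP headers.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # The content attribute is a comma-separated list such as "noindex, nofollow".
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_directives(html):
    """Return the set of robots meta directives found in the HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives
```

For the snippet shown above, `robots_directives` returns a set containing `noindex`. An empty set means the page declares no robots meta tag.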
When there are multiple versions of the same page, Google will select one to store in their index. This process is called canonicalization and the URL selected as the canonical will be the one Google shows in search results. There are many different signals they use to select the canonical URL including:
- Canonical tags
- Duplicate pages
- Internal links
- Sitemap URLs
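Of these signals, the canonical tag is the easiest one to inspect yourself. The sketch below reads the `<link rel="canonical">` element from a page with Python's standard library. The `CanonicalParser` class, the `declared_canonical` helper, and the example URLs are assumptions for illustration. Keep in mind that the tag is only one signal, so the canonical Google selects may differ from the one a page declares.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CanonicalParser(HTMLParser):
    """Finds the first <link rel="canonical" href="..."> tag."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.href is not None:
            return
        attr = dict(attrs)
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attr.get("href"):
            self.href = attr["href"]


def declared_canonical(html, page_url):
    """Return the canonical URL the page declares, or None if there is none."""
    parser = CanonicalParser()
    parser.feed(html)
    if parser.href is None:
        return None
    # Resolve relative canonical hrefs against the URL of the page.
    return urljoin(page_url, parser.href)
```

Running this over several URL variants of the same page (for example, with and without query parameters) shows whether they all declare the same canonical.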
The easiest way to see how Google has indexed a page is to use the URL Inspection Tool in Google Search Console. It will show you the Google-selected canonical URL.