How To Get Google To Crawl and Index Your Website Fully

Apr 28, 2020

Businesses depend a great deal on how well their websites rank in Google Search. To help business owners improve their rankings in search engine results pages (SERPs), this brief guide covers the essential terms to know, the factors that influence Google's crawling, indexing, and ranking, and a step-by-step walkthrough for getting a website crawled and indexed.

Essential Terms to Know

Crawling – There is no central registry of every page on the web, so Google created a crawler (a program also known as a web spider or Googlebot) that continually searches for new and updated pages. Googlebot discovers new sites through different sources such as hyperlinks on already-known pages, submitted sitemaps, and more.

Googlebot – Google's main crawler is known as Googlebot, and it works in two steps: it first discovers a URL and then scans (fetches) it. It comes in two types – Googlebot Desktop and Googlebot Smartphone (mobile). The crawling process starts with a previously generated list of URLs. As the crawler follows the links present on webpages, it records new pages, modified pages, and broken links.

Indexing – After the web spider has crawled a web page, the Google search algorithm analyzes it, going through the page's HTML code and evaluating different elements such as written content, images, videos, meta information, and more. This process is known as indexing.

Serving or Ranking – The Google search algorithm weighs many criteria to serve the best possible results for the user. These criteria include the user's location, device type, and language, as well as webpage loading speed, mobile-friendliness, the usefulness of the content, freshness of the content, and more.

Important Google Crawling Factors

Sitemap – Typically an XML file (plain text and RSS/Atom formats are also accepted), the sitemap helps the web spider crawl the website more efficiently. It lists the webpages of the site along with additional data such as the last modification date, alternate language versions, and more.

Crawl Requests – For a new website, it may take a week or two for Google to discover the site and crawl it. To speed up the process, the web admin can request indexing in Search Console. A single request is enough, as submitting the same URL repeatedly does not make the web spider crawl it any sooner.

URL Structure – Google prefers web pages whose URLs are presented in a human-friendly way, i.e., structured so that they are easy for a person to read and remember – for example, https://www.google.com/search/howsearchworks/.

robots.txt – Some web pages are not relevant to a human visitor and are generally created dynamically by the content management system; most of the time, such pages do not fulfill any business goal. This is where the robots.txt file comes in. Using this file, a web admin can add Disallow rules for the paths that the web spider shouldn't crawl. (Note that robots.txt only controls crawling; to keep an already-crawlable page out of the index, a noindex robots meta tag or X-Robots-Tag header is used instead, as Google does not support noindex directives inside robots.txt.)
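
As a minimal sketch – the example.com domain and paths are placeholders for illustration – a robots.txt that keeps the crawler away from dynamically generated pages could look like this:

    # robots.txt served at https://example.com/robots.txt
    User-agent: *
    # Block CMS-generated pages that serve no business goal
    Disallow: /search/
    Disallow: /calendar/
    # The sitemap location can also be declared here
    Sitemap: https://example.com/sitemap.xml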

hreflang – For relevancy, the search algorithm considers the user's location and language before showing results. To build a website that follows the necessary principles of localization, the hreflang attribute is used. With it, the web spider can learn about alternate web pages that present the same information in different languages or regions.
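
As a brief sketch – the URLs and language codes below are placeholders – the <head> of each language version would carry link elements such as:

    <link rel="alternate" hreflang="en" href="https://example.com/en/page/" />
    <link rel="alternate" hreflang="hi" href="https://example.com/hi/page/" />
    <link rel="alternate" hreflang="x-default" href="https://example.com/" />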

Canonical – For a website with many pages, publishing similar information on more than one URL is often unavoidable. Genuine cases include having the same information on two different pages for separate devices – web and mobile. Unless the preferred URL is declared explicitly, the search algorithm may consider either page to be the more relevant one.
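
As an illustrative sketch with a placeholder URL, the secondary (for example, mobile) page would point to the preferred version by placing a canonical link in its <head>:

    <link rel="canonical" href="https://example.com/products/blue-widget/" />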

Important Google Indexing and Ranking Factors

Structured Data – The search engine results page offers greater visibility to sites that leverage the capabilities of structured data. To understand which vocabulary is required, one may use the schema.org markup and follow Google's structured data guidelines. For better understanding, consider examples such as cast information for a movie, ratings given to a book or movie, product prices, and more.
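
As a hedged sketch of what such markup can look like – the product name, rating, and price are invented for illustration – a JSON-LD block placed in the page's HTML might read:

    <script type="application/ld+json">
    {
      "@context": "https://schema.org",
      "@type": "Product",
      "name": "Blue Widget",
      "aggregateRating": {
        "@type": "AggregateRating",
        "ratingValue": "4.4",
        "reviewCount": "89"
      },
      "offers": {
        "@type": "Offer",
        "priceCurrency": "USD",
        "price": "19.99"
      }
    }
    </script>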

Content Tags – Google maintains an extensive database of information gathered by its crawler. Among the vast array of information, the crawler reads different elements of the HTML code; such tags and attributes include the title tag, heading tags, the alt attribute on images, meta descriptions, and more. Depending on the tags and attributes it finds, Google adjusts the importance and relevance of a web page accordingly.
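
A short sketch of these tags on a hypothetical page (all names and text are placeholders):

    <head>
      <title>Blue Widget – Example Store</title>
      <meta name="description" content="Hand-made blue widgets, shipped worldwide.">
    </head>
    <body>
      <h1>Blue Widget</h1>
      <img src="blue-widget.jpg" alt="A hand-made blue widget on a wooden table">
    </body>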

Validation – Google tries its best to educate SEO professionals and small business owners on improving the ranking of a website. Important areas to validate are crawling, robots.txt, HTML, page speed, and mobile-friendliness, and they are covered by tools such as the URL Inspection tool in Search Console, the robots.txt Tester, PageSpeed Insights, the Mobile-Friendly Test, and the W3C HTML validator.

Stay Away from Penalties – There are numerous ways in which a site may end up spamming the search results, sometimes unknowingly, but it is essential to provide a good user experience to website visitors. There are two things to consider:

  • Unwanted Results – The crawler may index unwanted pages such as internal search results pages, tag archives, calendar pages, and more. Such pages can be kept out of search results with a noindex robots meta tag, or blocked from crawling with Disallow rules in the robots.txt file.
  • Spamming – A web admin is responsible for taking the necessary steps to safeguard the website from malicious attempts to hack it. If the website does get hacked, the original content should be restored immediately. Furthermore, Google may take action against a website if it is found to be involved in activities such as content scraping, hidden content, link schemes, and more.

Step by Step Guide to Crawl and Index a Website

1 – Adding the Domain Name in Search Console

Open Google Search Console and click Start now. This opens the Select property type window, which offers two property types – Domain and URL prefix – elaborated below:

  • Domain – Selecting this property type will give domain-level access to the Search Console.
  • URL prefix – Selecting this property type will provide access limited to a specific protocol, subdomain, and path. For example, http://example.com and https://example.com are treated as two separate properties.

2 – Verification of the Domain Name

Depending on the type of property selected, the available verification methods vary. If the Domain property type is selected, verification is done through a DNS record. If the URL prefix property is selected, verification methods include an HTML file, an HTML tag, Google Analytics, Google Tag Manager, and a DNS record. All the methods are briefly explained below:

  • DNS record – To verify using this method, the web admin has to create a new TXT record in the DNS configuration and add the generated code, which takes the format google-site-verification= followed by a unique token (a sample record is sketched after this list).
  • HTML file – Search Console will generate an HTML file, which the web admin has to upload to the server to verify ownership.
  • HTML tag – Search Console will generate a meta tag, which the web admin has to paste inside the <head> section of the home page. The meta tag will be in the format <meta name="google-site-verification" content="" />, with the generated code as the content value.
  • Google Analytics – Assuming that the web admin has added the Google Analytics tracking ID, select this verification method, and follow the on-screen instructions. Alternatively, place the Google Analytics code in the <head> section and follow the on-screen instructions.
  • Google Tag Manager – After placing the <noscript> snippet generated by Google Tag Manager immediately after the opening <body> tag, select this method and follow the on-screen instructions.
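
As a hedged sketch of the DNS method – the token value below is a placeholder; the real one is generated by Search Console – the TXT record in a DNS zone file might look like:

    ; TXT record added at the domain apex for Search Console verification
    example.com.  3600  IN  TXT  "google-site-verification=abc123exampletoken"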

3 – Creating a Sitemap for the Website

A sitemap is a regular file that can either be created manually or generated using a third-party tool; which approach to choose depends largely on the content management system being used to manage the website.

For example, if the website runs on the WordPress content management system, many third-party plugins can be installed to generate a sitemap; common choices include Jetpack and Yoast SEO.

Alternatively, if the web admin doesn't wish to rely on third-party plugins, there is the option of creating the sitemap manually. One has to follow specific guidelines while creating the sitemap; here are a few of them (a minimal example follows the list):

  • The maximum sitemap filesize can be 50 MB
  • A sitemap file can hold a maximum of 50,000 URLs
  • The file is commonly named sitemap.xml; besides XML, Google also accepts plain text and RSS/Atom sitemap formats
  • Separate sitemap files for blog posts, pages, images, news, and videos can be created
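
A minimal sketch of a manually created XML sitemap – the URLs and dates are placeholders:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://example.com/</loc>
        <lastmod>2020-04-28</lastmod>
      </url>
      <url>
        <loc>https://example.com/blog/first-post/</loc>
        <lastmod>2020-04-20</lastmod>
      </url>
    </urlset>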

4 – Adding a Sitemap in the Search Console

In the vertical menu on the left of Search Console, select Sitemaps under Index. The Add a new sitemap box will appear at the top, showing the domain name (such as https://example.com/) followed by an Enter sitemap URL field. The web admin has to type the path to the sitemap file there and click the Submit button on the right.

5 – Get Google to Crawl and Index the Website

Once the sitemap file has been submitted to Google Search Console, there are two different methods to get Google to crawl and index the website. Both are explained briefly:

  • For a small number of URLs – The URL Inspection tool can be used to request indexing of individual pages.
  • For a larger number of URLs – One of the two ways to request indexing is by submitting the sitemap; given enough time, the web spider will visit the website. Alternatively, the web admin can use the ping method, passing the sitemap location as a parameter to http://www.google.com/ping. For example – http://www.google.com/ping?sitemap=<location of sitemap> (a concrete example follows the list). Once done successfully, a Sitemap Notification Received message will be displayed.
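
As an illustrative sketch with a placeholder sitemap location, the full ping request looks like this and can simply be opened in a browser or fetched with any HTTP client:

    http://www.google.com/ping?sitemap=https://example.com/sitemap.xml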

6 – Check the Reports and Identify Errors

The Overview page in Google Search Console provides easy access to the overall health of the entire website. The Overview has three different report sections – Performance, Coverage, and Enhancements.

  • Performance Report – It primarily shows four key performance indicators (KPIs): Total Clicks, Total Impressions, Average CTR, and Average Position. Search Console also allows the data to be filtered further by Queries, Pages, Countries, Devices, Search Appearance, and Dates.
  • Coverage Report – This report provides insights into four different areas – Errors, Valid with warnings, Valid, and Excluded. It enables the web admin to see pages that are, for example, discovered but currently not indexed, excluded by a canonical tag, Soft 404, Not Found (404), affected by a crawl anomaly, and more.
  • Enhancement Reports – The Enhancements section is further divided into additional reports such as Mobile Usability, AMP, Sitelinks searchbox, and Speed. The Mobile Usability report flags errors like text too small to read, clickable elements too close together, content wider than the screen, and more.

Conclusion

The search engine giant, Google, continually makes changes that push online content creators to produce meaningful information for humans. In this guide, we went through three primary areas: crawling, indexing, and a step-by-step process to get a website crawled and indexed.