How crawling works
Crawling URLs is a task carried out by a computer program known as a crawler or a spider. The job of the crawler is to visit web pages and extract the HTML content it finds. One of the primary things a crawler looks for is links.
Every web page has a single unique identifier, its URL. Enter the URL into your browser address bar, and you’ll go to the web page. Web pages themselves consist of content that’s marked up in HTML.
HTML is a machine-readable language, so an external program like a crawler can visit a URL, extract the HTML, and access the content in a structured manner. Importantly, it can differentiate between text and hyperlinks.
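To make that concrete, here’s a minimal sketch, using only Python’s standard library and a placeholder URL, of how a program can fetch the HTML behind a URL in the same way a crawler does:

```python
from urllib.request import urlopen

# Placeholder URL; any publicly reachable web page works the same way.
url = "https://example.com/"

# Fetch the raw HTML for the page, exactly as a crawler would.
with urlopen(url) as response:
    html = response.read().decode("utf-8", errors="replace")

print(html[:200])  # the first few hundred characters of markup
```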
When a crawler examines the HTML code for a page like this one, which contains the article you’re reading, it will find that each paragraph is set off by a piece of code called the paragraph element, or p-tag, at the beginning and at the end. This identifies a block of paragraph text: the p-tag at the start opens the paragraph element, and the p-tag at the end closes it. Although you don’t see this code unless you inspect the page, the crawler sees it and understands that this page contains text content that’s designed for visitors to read.
Links are also visible to crawlers, and interpreted by them, because of their HTML code. Programmers code links with an anchor element at the beginning and at the end. Links also include an “attribute” (the href) that provides the destination of the hyperlink, and “anchor text.” Anchor text is the linked text seen by readers, often displayed in browsers in blue with an underline.
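To illustrate, here’s a hedged sketch: a made-up fragment of HTML containing one paragraph and one link, plus a few lines of Python (using the standard library’s html.parser) that separate the readable text from the link destination:

```python
from html.parser import HTMLParser

# A made-up fragment of page markup: one paragraph containing one link.
page = '<p>Crawlers read text and follow <a href="https://example.com/guide">links like this one</a>.</p>'

class TextAndLinkExtractor(HTMLParser):
    """Collects plain text and link destinations separately."""
    def __init__(self):
        super().__init__()
        self.text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # The href attribute holds the destination of the hyperlink.
            self.links.append(dict(attrs).get("href"))

    def handle_data(self, data):
        self.text.append(data)

parser = TextAndLinkExtractor()
parser.feed(page)

print("Text:", "".join(parser.text))  # Crawlers read text and follow links like this one.
print("Links:", parser.links)         # ['https://example.com/guide']
```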
It’s a straightforward task for a crawler to process this block of HTML and separate out the text from the link. However, on a single web page, there’s a lot more than a paragraph and a link. To see this sort of data yourself, visit any web page in your browser, right-click anywhere on the screen, then click “View Source” or “View Page Source.” On most pages, you’ll find hundreds of lines of code.
For every web page that a crawler encounters, it parses the HTML, which means it breaks the HTML up into its component parts for further processing. The crawler extracts all the links it finds on a given page, then schedules them for crawling. In effect, it builds itself a little feedback loop:
Crawl URL → Find links to URLs → Schedule URLs for crawling → Crawl URL
So you can give a crawler a single URL as a source to start crawling from, and it will keep going until it stops finding new URLs to crawl—this could be thousands or even millions of URLs later.
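Putting those pieces together, a minimal sketch of that feedback loop might look like the following. It starts from a hypothetical seed URL and leaves out everything a production crawler needs, such as politeness delays, robots.txt checks, and error handling:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href attribute of every anchor element on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

seed = "https://example.com/"  # hypothetical starting point
queue = deque([seed])          # URLs scheduled for crawling
seen = {seed}                  # URLs already discovered

while queue:
    url = queue.popleft()

    # Crawl URL: fetch the page's HTML.
    with urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")

    # Find links to URLs: parse the HTML and pull out every href.
    parser = LinkExtractor()
    parser.feed(html)

    # Schedule URLs for crawling: queue anything not seen before.
    for href in parser.links:
        link = urljoin(url, href)
        if link not in seen:
            seen.add(link)
            queue.append(link)
```

The seen set is what eventually stops the loop: the crawler only halts when it runs out of URLs it hasn’t already scheduled.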
In short, crawling is a method of discovery. Search engines determine what’s out there by sending out web crawlers to find web pages, using links as signposts for the next place to look.
This is why internal links on your website are important, as they allow search engine crawlers to discover all the pages on your site. Through external links, they’ll discover other websites as they explore the network of interconnected pages that make up the internet.