A spider is a computer program used by a search engine to systematically browse and index websites on the World Wide Web.
Spiders go by many names, such as web crawler, bot, web spider, ant, and automatic indexer. Regardless of the name used, spiders have one goal: to browse websites and create a thorough index of the content and links present on each site.
The way a web crawler goes about browsing websites depends on a combination of four policies:
- Selection: Search engines cannot keep up with the sheer size of the web or with the constant stream of new content that appears on it. As a result, search engines, and the spiders they send out, must prioritize which sites to crawl and in what order.
- Revisit: Some websites are updated regularly, while others are only updated every few years if at all. Search engines have to determine the time interval at which they should revisit a webpage and update the indexed record of the site.
- Politeness: Because spiders don't need the time a human would to process a webpage, they can request pages much more quickly than a normal website visitor. An impolitely programmed spider can send multiple requests every second and cripple a web server in the process. Search engines must program their spiders to be appropriately polite and to space their requests at time intervals a web server can handle.
- Parallelization: All search engines run multiple web crawling processes at the same time, so they must have policies in place to maximize the effectiveness of their spiders and to avoid having different processes crawl the same sites.
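The first three policies can be sketched in a few lines of code. The following is a minimal, simplified illustration, not how any real search engine implements crawling: it uses a hypothetical in-memory link map in place of real HTTP fetching so it runs without network access, a FIFO frontier as a stand-in selection policy, a visited set as a crude revisit policy, and a fixed delay for politeness.

```python
import time
from collections import deque

# Hypothetical in-memory "web" standing in for real HTTP fetches,
# so the sketch runs without network access.
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b"],
    "http://example.com/b": [],
}

def crawl(seed, delay=0.01, max_pages=10):
    """Breadth-first crawl illustrating selection (frontier order),
    revisiting (a visited set), and politeness (a delay between requests)."""
    frontier = deque([seed])  # selection policy: crawl in FIFO order
    visited = set()           # revisit policy: skip already-indexed pages
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        links = FAKE_WEB.get(url, [])  # stand-in for fetch + link extraction
        order.append(url)
        frontier.extend(links)
        time.sleep(delay)  # politeness policy: space requests out in time
    return order

print(crawl("http://example.com/"))
# → ['http://example.com/', 'http://example.com/a', 'http://example.com/b']
```

A production crawler would also honor robots.txt, keep a per-host request budget, and distribute the frontier across many parallel processes, which is where the fourth policy comes in.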
Every search engine prioritizes and crafts these strategies a little differently. The result is that the search results generated by one search engine will always look a little different from those generated by another.
Frequently Asked Questions
Why are spiders used to build search engine databases as opposed to manually built databases?
Spiders can request pages from a server incredibly quickly, sending multiple requests per second if not programmed to be polite, as noted above. Humans, on the other hand, need several seconds or minutes to scan a webpage and create a directory record based on what they see. In addition, spiders produce more detailed and extensive records, and are much cheaper than a human database-building workforce.
At one time, websites were found by looking in web directories, which were manually compiled databases in which websites were organized by subject. However, once spider-indexed search engines came along, directories simply couldn't match the massive size of spider-built databases or the detail of their searchable records, and directories quickly fell by the wayside.
Are there any dangers associated with spiders automatically crawling a site?
Spiders are very thorough, to the point that they can unintentionally expose security vulnerabilities. A practice known as Google Hacking involves making very specific advanced queries in a search engine to identify websites with certain security vulnerabilities. The identified websites are then targeted with hacking techniques based on the vulnerabilities exposed.