Notes on Web Crawlers for 2020

date: April 9th, 2020

There are numerous robots continuously crawling web sites. These notes focus on the robots that are of most relevance and interest to web developers. There is some slant towards the English language and the US. For the most part, frequently observed spam bots, malware, and the like are omitted.

Googlebot: Google's primary crawler.
Googleweblight: A Google service that transcodes web pages for mobile devices. Pages are optimized for load time and data usage.
Bingbot: This is Bing's crawler. In the U.S., Bing holds roughly 10%-30% market share, at least when including Yahoo.
Msnbot: This is run by Microsoft, and appears sometimes in log files.
YandexBot: A leading search engine in Russia.
Seznam: Czech search engine which now rivals Google in the Czech Republic.
DuckDuckBot: The main crawler for Duck Duck Go(oose). Duck Duck Go pitches itself as the privacy search engine. In other words, they don't track you or maintain data about you as other general-purpose search engines do.
DuckDuckBot-Favicons-Bot: Another bot from Duck Duck Go that appears to look just for favicons.
Baidu: The biggest search engine in China, serving what is easily argued to be the largest sector of internet users. I see Baidu a lot in log files, and the spiders originate in China. I can't find an IP range for them, and forward DNS lookups always fail. In other words, I can't verify that these spiders really belong to Baidu.
Sogou: Another major Chinese search engine. They are verifiable. However, they can have very busy crawlers. They don't use 8-bit IP blocks, and they crawl from a large number of addresses.
facebookexternalhit: "The Facebook Crawler scrapes the HTML of a website that was shared on Facebook ... ...IP addresses change often." See https://developers.facebook.com/docs/sharing/webmasters/crawler
Pinterestbot: This is Pinterest's bot. It is verifiable with r-DNS/f-DNS lookups and uses 8-bit IP blocks.
Yahoo: I have not seen Yahoo or Slurp in a long time. These days, Yahoo search is powered by Bing.
Naver: The most widely used search engine in South Korea.
Daum: A major Korean search engine.
Qwant: French search engine.
Exabot: French search engine.
Qihoo: Chinese search engine.
Soso: Chinese search engine.
ia_archiver: Amazon claims that "Alexa crawls the web in order to identify and classify web content and to discover backlinks". They make no mention of their AI product "Alexa".
AppleBot: It appears that AppleBot is being used for searches from Siri and Spotlight. Details are scant though.
BingPreview: "...used to generate page snapshots".
AdIdxBot: Used by Bing ads.
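
Several entries above mention verifying a crawler with r-DNS/f-DNS lookups. As a rough illustration of what that check can look like, here is a minimal Python sketch: reverse-resolve the visiting IP to a host name, confirm the host name ends in a domain the operator is known to crawl from, then forward-resolve that host name and confirm it maps back to the same IP. The function name `verify_crawler`, its parameters, and the `TRUSTED_SUFFIXES` table are hypothetical names made up for this example. The suffixes listed are assumptions that should be checked against each operator's own documentation. The forward lookup shown here only handles IPv4.

```python
import socket

# Hypothetical table for illustration only. Confirm the current domains
# in each operator's documentation before relying on them.
TRUSTED_SUFFIXES = {
    "Googlebot": (".googlebot.com", ".google.com"),
    "Bingbot": (".search.msn.com",),
}


def verify_crawler(ip, suffixes, reverse=None, forward=None):
    """Return True if ip passes a reverse-then-forward DNS check.

    reverse and forward are optional resolver callables so the check
    can be exercised without network access. By default they use the
    system resolver (IPv4 only).
    """
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])

    # Step 1: reverse DNS (PTR) lookup on the visiting address.
    try:
        host = reverse(ip)
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return False

    # Step 2: the host name must sit under one of the expected domains.
    host = host.rstrip(".").lower()
    if not host.endswith(tuple(suffixes)):
        return False

    # Step 3: forward DNS lookup must map the host name back to the same IP.
    try:
        addresses = forward(host)
    except OSError:
        return False
    return ip in addresses
```

In use, something like `verify_crawler(ip_from_log, TRUSTED_SUFFIXES["Googlebot"])` would be called for an address pulled from a log file. A spider whose reverse or forward lookup fails, as described for Baidu above, simply comes back as unverified.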