[Web] About Robots.txt

by llHoYall 2020. 10. 7.

robots.txt is an international recommendation that allows or restricts search robots from collecting a site's web pages.

The robots.txt file must always be located in the root directory of the site and must be written as a plain text file that complies with the robot exclusion standard.

Search robots used for illegal purposes may not comply with the robots.txt rules.

Therefore, information that needs to be protected should be blocked by other means.

If you build scrapers or crawlers, make sure you check these rules so that nothing unsavory happens.
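For example, here is a minimal sketch (Python 3, standard library only) that checks a URL against a site's robots.txt with the urllib.robotparser module before fetching it:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.google.com/robots.txt")
rp.read()  # download and parse the robots.txt file

# Ask whether a given user agent may fetch a given URL
print(rp.can_fetch("*", "https://www.google.com/search"))  # expected: False, /search is disallowed
print(rp.can_fetch("*", "https://www.google.com/"))        # expected: True, the root is not blocked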

Locations of robots.txt

Let's check it out on Google, the most famous website.

https://www.google.com/robots.txt

As you can see, the robots.txt file is located in the root directory of the site.

Rules of robots.txt

  • The file must be named robots.txt.
  • There must be only one robots.txt file on the site.
  • A robots.txt file can apply to subdomains or to non-standard ports.
  • Comments are any content after a # symbol.
  • robots.txt file must be a UTF-8 encoded text file.
  • robots.txt file consists of one or more groups.
  • Each group consists of multiple rules or directives, one directive per line.
  • A group gives the following information (an annotated sketch follows this list):
    • Who the group applies to (the user agent),
    • which directories or files that agent can access, and/or
    • which directories or files that agent cannot access.
  • Groups are processed from top to bottom, and a user agent can match only one rule set, which is the first, most-specific rule that matches a given user agent.
  • The default assumption is that a user agent can crawl any page or directory not blocked by a Disallow: rule.
  • Rules are case-sensitive.
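
To make the group structure concrete, here is a short, hypothetical robots.txt (the crawler name, paths, and sitemap URL are only illustrative):

# Group 1: applies only to Googlebot
User-agent: Googlebot
Disallow: /nogooglebot/

# Group 2: applies to every other crawler
User-agent: *
Allow: /

Sitemap: http://www.example.com/sitemap.xml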

Directives of robots.txt

User-agent:

It is required; there must be one or more per group.

The name of a search engine robot (web crawler) that the rule applies to.

This is the first line for any rule.
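
A single group may also list several user agents that share the same rules. A hypothetical sketch (the crawler names and path are only illustrative):

User-agent: Googlebot
User-agent: bingbot
Disallow: /not-for-these-bots/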

Disallow:

There should be at least one Disallow: or Allow: entry per rule.

A directory or page, relative to the root domain, that should not be crawled by the user agent.

If a page, it should be the full page name as shown in the browser.

If a directory, it should end in a / symbol.

Supports the * (wildcard) for a path prefix, suffix, or entire string.
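
For instance, the following sketch (the paths are hypothetical) blocks a directory and, for crawlers that support the $ end-of-URL marker, every URL ending in .pdf:

User-agent: *
Disallow: /private/
Disallow: /*.pdf$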

Allow:

There should be at least one Disallow: or Allow: entry per rule.

A directory or page, relative to the root domain, that should be crawled by the user agent.

This is used to override Disallow: to allow crawling of a subdirectory or page in a disallowed directory.

If a page, it should be the full page name as shown in the browser.

If a directory, it should end in a / symbol.

Supports the * (wildcard) for a path prefix, suffix, or entire string.
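
For example, this sketch (the directory and file names are hypothetical) blocks a directory but re-allows a single page inside it:

User-agent: *
Disallow: /private/
Allow: /private/public-page.html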

Sitemap:

It is optional; there can be zero or more per file.

The location of a sitemap for this website.

Must be a fully-qualified URL.

Sitemaps are a good way to indicate which content a search engine should crawl, as opposed to which content it can or cannot crawl.

Host:

Some crawlers support a Host: directive, allowing websites with multiple mirrors to specify their preferred domain.

Crawl-delay:

The Crawl-delay: value is supported by some crawlers to throttle their visits to the host.

Since this value is not part of the standard, its interpretation is dependent on the crawler reading it.

For example, Bing defines crawl-delay as the size of a time window (from 1 to 30 seconds) during which BingBot will access a web site only once.
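
As a sketch of how a polite crawler might honor this directive, Python's standard urllib.robotparser exposes the value through crawl_delay() (the user agent and URLs below are hypothetical):

import time
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")  # hypothetical site
rp.read()

delay = rp.crawl_delay("bingbot") or 0   # None when no Crawl-delay is given
for path in ["/page1", "/page2"]:        # hypothetical pages to fetch
    if rp.can_fetch("bingbot", "https://www.example.com" + path):
        # ... fetch the page here ...
        time.sleep(delay)                # throttle between requests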

Examples of robots.txt

Disallow crawling of the entire website.

User-agent: *
Disallow: /

Disallow crawling of a directory and its contents

User-agent: *
Disallow: /calendar/
Disallow: /junk/

Allow access to a single crawler

User-agent: Googlebot-news
Allow: /

User-agent: *
Allow: /

Allow access to all but a single crawler

User-agent: AdsBot-Google
Disallow: /

User-agent: *
Allow: /

Disallow crawling of a single webpage

User-agent: *
Disallow: /private_file.html

Block a specific image from Google Images

User-agent: Googlebot-Image
Disallow: /images/dogs.jpg

Block all images on your site from Google Images

User-agent: Googlebot-Image
Disallow: /

Disallow crawling of files of a specific file type

User-agent: Googlebot
Disallow: /*.gif$

Disallow crawling of the entire site, but show AdSense ads on those pages

User-agent: *
Disallow: /

User-agent: Mediapartners-Google
Allow: /

Match URLs that end with a specific string

User-agent: Googlebot
Disallow: /*.xls$

Apply the Sitemap

Sitemap: http://www.example.com/sitemap.xml

Apply the Host

Host: hosting.example.com

Apply the Crawl-delay

User-agent: bingbot
Allow: /
Crawl-delay: 10
