What is robots.txt and how do you create it?

Anyone who builds websites will inevitably come across the robots.txt file at some point. Here at GIGA you can find out what this file is for and what should be in it.

robots.txt is a strange name. Obviously it has something to do with robots, but why do you need this file at all, and what can it do? Using a few examples, we explain how you can use it to control the “Internet robots”, at least the reputable ones. We also explain what you can do about bots that ignore robots.txt.


What is robots.txt and what does it do?

In the “root”, i.e. the main directory, of almost every website you will find a file named robots.txt. It is usually created and stored there by the website’s administrator. Its purpose is to control “bots”, the automatic crawlers of the various search engines.

The point is to give these bots rules about what they may and may not crawl: which files they are allowed to read, which are off limits and, above all, which bots are welcome and which you do not want on your site. You can also use it to specify how much time has to pass between requests so that weak servers are not overloaded.

  • robots.txt is a plain text file.
  • The robots.txt file must be in the top directory of a website, e.g. under www.domain.de/robots.txt.
  • Reputable bots look for it there, read it and follow the instructions that are addressed to them.
  • robots.txt can allow or forbid access to the pages in general, or grant or deny access only to certain bots.
  • If a bot has read the file and found no instructions that apply to it, it assumes that it can freely crawl (“spider”) the site.
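Because the file always sits in the top directory, a crawler can derive its address from any page URL. The following is a minimal Python sketch of that step; the function name robots_txt_url and the example page path are assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Build the robots.txt address for the site that page_url belongs to."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    # Hypothetical page somewhere below the main directory.
    print(robots_txt_url("https://www.domain.de/blog/artikel.html"))
```

Whatever the depth of the page, the result points to the main directory of the same host.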

To address a bot specifically, you have to know its name. This only works if the bot observes robots.txt instructions at all. The bots of the major search engines such as Google (Googlebot), Bing (bingbot) or Apple (Applebot) do. Others state their name and even the address of an information page, but disregard prohibitions and rules.

What do the rules in robots.txt do?

The robots.txt file only knows a few rules. The official standard (RFC 9309) defines just user-agent, allow and disallow; crawl-delay and sitemap are widespread extensions that not every crawler supports:

Command | Example | Effect
user-agent | User-agent: Googlebot | The name of the crawler that the following rules apply to.
allow | Allow: / | Identifies paths that the crawler is allowed to visit.
disallow | Disallow: / | Identifies paths that the crawler may not visit.
crawl-delay | Crawl-delay: 10 | The amount of time (in seconds) that the crawler has to wait between individual requests.
sitemap | Sitemap: https://domain.de/sitemap.xml | Exact location and name of a sitemap. If this directive is missing, crawlers may look for a sitemap in the main directory.
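To see how a program reads the crawl-delay and sitemap entries, here is a short sketch using Python’s standard module urllib.robotparser. The robots.txt content and the helper name read_extras are assumptions for illustration; site_maps() requires Python 3.8 or newer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for this sketch.
ROBOTS_TXT = """\
User-agent: bingbot
Crawl-delay: 10

Sitemap: https://domain.de/sitemap.xml
"""


def read_extras(robots_txt: str, user_agent: str):
    """Return (crawl delay in seconds or None, list of sitemap URLs or None)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent), parser.site_maps()


if __name__ == "__main__":
    print(read_extras(ROBOTS_TXT, "bingbot"))
    print(read_extras(ROBOTS_TXT, "Googlebot"))  # no delay is set for this bot
```

The sitemap entry is independent of any user-agent group, so it is returned for every crawler, while the delay only applies to the bot it was written for.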

For the directives and the bot names (user-agent), it does not matter whether you write them in uppercase (Disallow) or lowercase (disallow). Path and file names, however, are case-sensitive, because with the rule

disallow: /fotos/

the crawler is still allowed to access the directory /Fotos/.
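The difference between directives and paths can be checked with Python’s urllib.robotparser. This is only a sketch: the file name rom.jpg and the capitalized directory /Fotos/ are assumed examples.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(rules, user_agent: str, url: str) -> bool:
    """Parse the given robots.txt lines and ask whether url may be fetched."""
    parser = RobotFileParser()
    parser.parse(rules)
    return parser.can_fetch(user_agent, url)


# Directive and bot name are written in lowercase on purpose.
RULES = ["user-agent: googlebot", "disallow: /fotos/"]

if __name__ == "__main__":
    # Same spelling as in the rule: blocked.
    print(can_fetch(RULES, "Googlebot", "https://www.domain.de/fotos/rom.jpg"))
    # Different capitalization of the path: not covered by the rule.
    print(can_fetch(RULES, "Googlebot", "https://www.domain.de/Fotos/rom.jpg"))
```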

Find out the user agent

If you want to address search engine crawlers directly, you have to find out the name they respond to in robots.txt: the “user-agent”.

You can find these identifiers in the log files of your server. They look like this, for example:

Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

The important part of this identifier is “bingbot”. The crawler “listens” to this name when you address it in robots.txt.

Here are a few alphabetically ordered search engines and their crawlers that take robots.txt into account:

Search engine | Crawler name
Ahrefs | AhrefsBot
Apple | Applebot
Google | Googlebot
Yahoo! | Slurp
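If you want to search your log files for such identifiers automatically, a simple approach is to look for the known crawler names inside the logged user-agent string. The following Python sketch assumes a small hand-maintained list (KNOWN_CRAWLERS) and a hypothetical helper name; it says nothing about whether the visitor really is that crawler, since the identifier can be faked.

```python
from typing import Optional

# Crawler names mentioned in this article; extend the list as needed.
KNOWN_CRAWLERS = ("AhrefsBot", "Applebot", "bingbot", "Googlebot", "Slurp")


def crawler_token(log_user_agent: str) -> Optional[str]:
    """Return the robots.txt name found in a logged user-agent string, if any."""
    lowered = log_user_agent.lower()
    for token in KNOWN_CRAWLERS:
        # Compare case-insensitively, as bot names are not case-sensitive.
        if token.lower() in lowered:
            return token
    return None


if __name__ == "__main__":
    entry = "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
    print(crawler_token(entry))
```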

Examples of robots.txt statements

Remember: if a crawler ignores the instructions, even the best robots.txt rules are useless. If you want to fend off such bots, you have to block them in the “.htaccess” file if necessary. There are also bots that pretend to be normal browsers and some that pretend to be reputable bots.

Example 1: Block access for every bot except the Googlebot:

User-agent: *           # The placeholder * addresses all crawlers
Disallow: /             # Forbids everything from the main directory down

User-agent: Googlebot   # Explicitly addresses Googlebot
Allow: /                # Allows it all paths
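Before you upload rules like these, you can test them locally. The sketch below feeds Example 1 to Python’s urllib.robotparser and asks which bots may fetch a page; the page URL and the helper name may_fetch are assumptions, and other crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Example 1 without the explanatory notes.
EXAMPLE_1 = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
"""


def may_fetch(user_agent: str, url: str = "https://www.domain.de/seite.html") -> bool:
    """Check a user agent against the rules of Example 1."""
    parser = RobotFileParser()
    parser.parse(EXAMPLE_1.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for bot in ("Googlebot", "bingbot", "Applebot"):
        print(bot, may_fetch(bot))
```

A group that names a bot takes precedence over the * group, which is why only Googlebot gets through here.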

Example 2: Forbid certain paths for all crawlers:

User-agent: *
Disallow: /fotos/
Disallow: /admin/
Disallow: /stats/
Disallow: /archiv/
Disallow: /plugins/

Example 3: The crawler is only allowed to access a file every 5 seconds. Googlebot ignores crawl-delay, so this example addresses bingbot, which honors it:

User-agent: bingbot
Crawl-delay: 5

Example 4: The sitemap has a special name and is in its own directory:

Sitemap: https://domain.de/pfad/unsere-sitemap.xml

robots.txt Tricks & Advice

First of all, you shouldn’t rely (only) on robots.txt to keep certain contents of a website out of search engines. Not all crawlers adhere to the instructions, and if, for example, another web page links directly to your contact form that you have blocked in robots.txt, it can still be indexed. In such a case it is better to put a meta tag in the HTML code of the page and not block the page in robots.txt, because a crawler only sees the tag if it is allowed to fetch the page:

<meta name="robots" content="noindex, nofollow">
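To check whether a page really delivers this tag, you can parse its HTML. Here is a rough sketch with Python’s standard html.parser module; the class and function names are assumptions, and the HTML snippet is only a stand-in for a real page.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of all <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Split "noindex, nofollow" into single, normalized directives.
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_directives(html: str) -> set:
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
    print(robots_directives(page))
```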

“Wildcards” can also be used in robots.txt. We have already mentioned the placeholder * (for “everything”). But you can also forbid access to certain file types, for example.

This is how you forbid the reading of all PDF files:

Disallow: /*.pdf$

This is how you can prohibit the crawling of URLs that contain a parameter:

Disallow: /*?

If you have several paths that start with the same word, you can also use wildcards here. For example, you block /fotos-italy/, /fotos-spain/ and /fotos-sweden/ with the rule:

Disallow: /fotos*/
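How such wildcard rules are matched can be illustrated with a small Python sketch that translates a rule into a regular expression: * stands for any sequence of characters, and a $ at the end anchors the rule to the end of the URL. The function name rule_matches and the test paths are assumptions; real crawlers may handle special cases differently.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches the given URL path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything literally, then let each * match any characters.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        # With $, the rule must cover the path up to its very end.
        return re.fullmatch(regex, path) is not None
    # Without $, a rule is a prefix match from the start of the path.
    return re.match(regex, path) is not None


if __name__ == "__main__":
    print(rule_matches("/*.pdf$", "/docs/preise.pdf"))
    print(rule_matches("/*?", "/seite?id=1"))
    print(rule_matches("/fotos*/", "/fotos-italy/rom.jpg"))
```

The path passed in should include the query string, otherwise a rule such as /*? can never match.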

Incidentally, if you want to insert a comment in robots.txt, write it after a hash sign (#):

# Here is a comment

And if crawlers don’t obey the rules, even though they claim to, then you have to block them in some other way. To do this, you should look for the IP or IP range of the “BadBot” and then block it in .htaccess. But that’s another chapter.
