The following article is our English interpretation of a post originally published in Chinese on the Baidu Webmaster Help pages. If you'd like to read the original article, you can find it here: https://ziyuan.baidu.com/college/courseinfo?id=267&page=10
The robots.txt file is a vital communication tool between your website and search engine crawlers (also known as spiders or bots). It allows you to control which parts of your site search engines can or cannot access. Through this file, you can specify:
- Which pages or directories should not be crawled
- Which areas are open to specific crawlers only
When a search engine like Baidu sends its spider (Baiduspider) to your site, it first looks for a robots.txt file in the root directory. This file tells the crawler what it’s allowed to access and what it should ignore.
⚠️ Important: You only need a robots.txt file if you want to restrict access to parts of your website.
If you want search engines to crawl your entire site freely, you don’t need this file at all.
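If you do keep a robots.txt file in place but want everything crawled, an empty Disallow value has the same effect as having no file at all. A minimal sketch:

User-agent: *
Disallow:
# An empty Disallow value blocks nothing, so the whole site stays crawlable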
Where Should You Place the robots.txt File?
The file must be located in the root directory of your site.
Example: If your website is www.example.com, Baiduspider will try to access www.example.com/robots.txt. If the file exists, the crawler reads the rules inside to determine its crawling behavior.
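As a quick illustration (example.com is just a placeholder domain), the file is only honored when it sits at the root of the host:

www.example.com/robots.txt         # read by crawlers
www.example.com/blog/robots.txt    # ignored: a robots.txt inside a subdirectory is not consulted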
You can use Baidu’s robots management tool to:
- Create, validate, or update your robots.txt file
- Check whether the file is currently active and effective on Baidu
Baidu’s tool supports file validation for robots.txt files up to 48KB and URL paths up to 250 characters.
You can also test robots settings for unverified sites through the tool.
robots.txt File Syntax & Structure
A typical robots.txt file is made up of one or more records, each separated by a blank line (using CR, CR/NL, or NL line endings).
Each line follows the format:
<directive>: <value>
You can also include comments using #, just like in UNIX.
Each record generally includes:
- One or more User-agent lines (specifying which crawler the rules apply to)
- Followed by one or more Disallow or Allow directives
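Putting this together, a minimal sketch of a two-record file might look like the following; the bot name Baiduspider comes from the article, while the /tmp/ and /cgi-bin/ paths are illustrative placeholders:

# Record 1: rules that apply only to Baiduspider
User-agent: Baiduspider
Disallow: /tmp/

# Record 2: rules for every other crawler
User-agent: *
Disallow: /tmp/
Disallow: /cgi-bin/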
Directives Explained
- User-agent:
Specifies which crawler (bot) the rules apply to. User-agent: * means the rules apply to all bots. You can also target a specific bot, such as User-agent: Baiduspider.
- Disallow:
Tells the crawler not to access certain URLs. You can list full or partial paths.
  - Example: Disallow: /help blocks access to /help.html, /help/index.html, etc.
  - Disallow: with an empty value allows access to everything (the default behavior if no Disallow rules exist).
  - A robots.txt file must include at least one Disallow directive to be valid.
- Allow:
Specifies which URLs can be crawled, even if they fall under a Disallow path.
  - Example: Allow: /public allows access to /public/info.html, /public/index.html, etc.
  - Use Allow in combination with Disallow to fine-tune access; a combined sketch of both directives follows this list.
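As a sketch of how Disallow and Allow interact, assume a site where everything under /help should be blocked except the /help/public subtree (these paths are illustrative, not taken from the original article). Baidu's guidance notes that the order of Allow and Disallow lines matters, with the first matching line deciding the outcome, so the narrower Allow exception is listed before the broader Disallow:

User-agent: Baiduspider
Allow: /help/public      # exception: URLs under /help/public stay crawlable
Disallow: /help          # blocks /help.html, /help/index.html, etc., apart from the Allow above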
Wildcard Support (* and $)
Baiduspider supports wildcards in URL matching:
- * matches any number of characters (including none)
- $ matches the end of a URL

Example: Disallow: /*.pdf$ blocks all URLs ending in .pdf.
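A short sketch combining both wildcards; the goal of blocking PDF files and URLs with query strings is an illustrative assumption:

User-agent: *
Disallow: /*.pdf$    # any URL ending in .pdf
Disallow: /*?*       # any URL containing a "?" (often used to block dynamic pages)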
Case Sensitivity
Baiduspider strictly follows the robots.txt standard and performs case-sensitive matching.
Make sure your rules accurately match the capitalization of the URLs or directories you want to control.
If not properly configured, your robots.txt file may not be effective.
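For instance, assuming an /Admin/ directory (a placeholder path):

User-agent: *
# Blocks /Admin/login.html but NOT /admin/login.html, because matching is case-sensitive
Disallow: /Admin/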
robots.txt Use Case Examples
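Each block below is a separate, alternative robots.txt file sketching a common scenario; the directory names are placeholders rather than values taken from the original article.

# 1. Block all crawlers from the entire site
User-agent: *
Disallow: /

# 2. Allow all crawlers full access (equivalent to having no robots.txt at all)
User-agent: *
Disallow:

# 3. Block only Baiduspider; other crawlers remain unaffected
User-agent: Baiduspider
Disallow: /

# 4. Block specific private directories for all crawlers
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/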