The following article is our English interpretation of a post originally published in Chinese on the Baidu Webmaster Help pages. If you'd like to read the original article, you can find it here: https://ziyuan.baidu.com/college/courseinfo?id=267&page=10
The robots.txt file is a vital communication tool between your website and search engine crawlers (also known as spiders or bots). It allows you to control which parts of your site search engines can or cannot access. Through this file, you can specify:
- Which pages or directories should not be crawled
- Which areas are open to specific crawlers only
When a search engine like Baidu sends its spider (Baiduspider) to your site, it first looks for a robots.txt file in the root directory. This file tells the crawler what it’s allowed to access and what it should ignore.
⚠️ Important: You only need a robots.txt file if you want to restrict access to parts of your website.
If you want search engines to crawl your entire site freely, you don’t need this file at all.
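If you do keep a robots.txt file in place but want everything crawled, an empty Disallow value has the same effect as having no file at all. A minimal sketch:

User-agent: *
Disallow:
# An empty Disallow value blocks nothing, so the whole site stays crawlable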
Where Should You Place the robots.txt File?
The file must be located in the root directory of your site.
Example: If your website is www.example.com, Baiduspider will try to access www.example.com/robots.txt. If the file exists, the crawler reads the rules inside to determine its crawling behavior.
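As a quick illustration (example.com is just a placeholder domain), the file is only honored when it sits at the root of the host:

www.example.com/robots.txt         # read by crawlers
www.example.com/blog/robots.txt    # ignored: a robots.txt inside a subdirectory is not consulted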
You can use Baidu’s robots management tool to:
- Create, validate, or update your robots.txt file
- Check whether the file is currently active and effective on Baidu
Baidu’s tool supports file validation for robots.txt files up to 48KB and URL paths up to 250 characters.
You can also test robots settings for unverified sites through the tool.
robots.txt File Syntax & Structure
A typical robots.txt file is made up of one or more records, each separated by a blank line (using CR, CR/NL, or NL line endings).
Each line follows the format:
<directive>: <value>
You can also include comments using #, just like in UNIX.
Each record generally includes:
- One or more User-agent lines (specifying which crawler the rules apply to)
- Followed by one or more Disallow or Allow directives
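Putting this together, a minimal sketch of a two-record file might look like the following; the bot name Baiduspider comes from the article, while the /tmp/ and /cgi-bin/ paths are illustrative placeholders:

# Record 1: rules that apply only to Baiduspider
User-agent: Baiduspider
Disallow: /tmp/

# Record 2: rules for every other crawler
User-agent: *
Disallow: /tmp/
Disallow: /cgi-bin/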
Directives Explained
- User-agent:
Specifies which crawler (bot) the rules apply to. User-agent: * means the rules apply to all bots. You can also target a specific bot, such as User-agent: Baiduspider.
- Disallow:
Tells the crawler not to access certain URLs. You can list full or partial paths.
  - Example: Disallow: /help blocks access to /help.html, /help/index.html, etc.
  - Disallow: with an empty value allows access to everything (the default behavior if no Disallow rules exist).
  - A robots.txt file must include at least one Disallow directive to be valid.
- Allow:
Specifies which URLs can be crawled, even if they fall under a Disallow path.
  - Example: Allow: /public allows access to /public/info.html, /public/index.html, etc.
  - Use Allow in combination with Disallow to fine-tune access; a combined sketch of both directives follows this list.
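As a sketch of how Disallow and Allow interact, assume a site where everything under /help should be blocked except the /help/public subtree (these paths are illustrative, not taken from the original article). Baidu's guidance notes that the order of Allow and Disallow lines matters, with the first matching line deciding the outcome, so the narrower Allow exception is listed before the broader Disallow:

User-agent: Baiduspider
Allow: /help/public      # exception: URLs under /help/public stay crawlable
Disallow: /help          # blocks /help.html, /help/index.html, etc., apart from the Allow above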
Wildcard Support (* and $)
Baiduspider supports wildcards in URL matching:
- * matches any number of characters (including none)
- $ matches the end of a URL

Example: Disallow: /*.pdf$ blocks all URLs ending in .pdf.
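A short sketch combining both wildcards; the goal of blocking PDF files and URLs with query strings is an illustrative assumption:

User-agent: *
Disallow: /*.pdf$    # any URL ending in .pdf
Disallow: /*?*       # any URL containing a "?" (often used to block dynamic pages)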
Case Sensitivity
Baiduspider strictly follows the robots.txt standard and performs case-sensitive matching.
Make sure your rules accurately match the capitalization of the URLs or directories you want to control.
If not properly configured, your robots.txt file may not be effective.
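For instance, assuming an /Admin/ directory (a placeholder path):

User-agent: *
# Blocks /Admin/login.html but NOT /admin/login.html, because matching is case-sensitive
Disallow: /Admin/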
robots.txt Use Case Examples
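Each block below is a separate, alternative robots.txt file sketching a common scenario; the directory names are placeholders rather than values taken from the original article.

# 1. Block all crawlers from the entire site
User-agent: *
Disallow: /

# 2. Allow all crawlers full access (equivalent to having no robots.txt at all)
User-agent: *
Disallow:

# 3. Block only Baiduspider; other crawlers remain unaffected
User-agent: Baiduspider
Disallow: /

# 4. Block specific private directories for all crawlers
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/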