The following article is our English interpretation of a post originally published in Chinese on the Baidu Webmaster Help pages. If you'd like to check out the original article, you can find it here: https://ziyuan.baidu.com/college/courseinfo?id=267&page=9
The Crawl Diagnostics Tool allows webmasters to view how Baiduspider (Baidu’s search bot) sees their website, helping diagnose whether the content Baiduspider retrieves matches what you expect.
- Each website is allowed up to 70 diagnostic requests per week.
- The tool displays up to the first 200KB of content visible to Baiduspider.
What Can the Crawl Diagnostics Tool Do?
Currently, this tool offers several key functions:
1. Verify if Crawled Content Matches Expectations
Some content—like pricing details on product pages—is often loaded using JavaScript, which Baiduspider may not interpret. As a result, key information may be missing in search results. You can use this tool to re-test pages after making adjustments.
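As a rough illustration of this check, the sketch below looks for expected text in the raw HTML a server returns, before any JavaScript runs. The function name `missing_snippets` and the sample markup are hypothetical, and treating 200KB as 200 * 1024 bytes is an assumption; the tool itself may measure the limit differently.

```python
# Assumption: "200KB" is taken as 200 * 1024 bytes.
MAX_VISIBLE_BYTES = 200 * 1024


def missing_snippets(raw_html: bytes, expected: list[str]) -> list[str]:
    """Return the expected snippets that are absent from the first 200KB of raw HTML."""
    # Only the leading portion of the response is considered, mirroring the tool's display limit.
    visible = raw_html[:MAX_VISIBLE_BYTES].decode("utf-8", errors="ignore")
    return [snippet for snippet in expected if snippet not in visible]


# Hypothetical product page whose price is filled in later by JavaScript.
page = (
    b"<html><body><h1>Widget</h1>"
    b'<span id="price"></span><script>loadPrice()</script>'
    b"</body></html>"
)
print(missing_snippets(page, ["Widget", "$19.99"]))
```

Any snippet reported as missing is a candidate for server-side rendering, since it may not be present in what a crawler retrieves.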
2. Check for Hidden Spam Links or Malicious Content
If your site has been hacked, it might contain hidden links (often visible only to crawlers like Baiduspider). This tool can detect such unauthorized content.
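One way to review crawled HTML for unauthorized links is to list every linked host and compare it against the hosts you expect. The sketch below uses only the Python standard library; `LinkCollector`, `unexpected_link_hosts`, and the sample hosts are hypothetical names chosen for illustration, not part of the Baidu tool.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag, whether or not it is visually hidden."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def unexpected_link_hosts(html: str, allowed_hosts: set[str]) -> set[str]:
    """Return hosts linked from the page that are not in the allowed set."""
    collector = LinkCollector()
    collector.feed(html)
    hosts = {urlsplit(href).hostname for href in collector.hrefs}
    # Relative links have no hostname and are skipped.
    return {host for host in hosts if host and host not in allowed_hosts}


sample = (
    '<a href="/about">About</a>'
    '<div style="display:none"><a href="http://spam.example.net/x">x</a></div>'
    '<a href="https://example.com/">Home</a>'
)
print(unexpected_link_hosts(sample, {"example.com"}))
```

A host showing up here is only a prompt for manual review; legitimate outbound links will appear too unless they are added to the allowed set.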
3. Test Connectivity Between Your Site and Baidu
If Baiduspider is accessing your site using outdated or incorrect IP information, the tool can alert Baidu to update it.
Common Crawl Errors Explained
1. URL Format Issues
- Baidu supports URLs up to 1024 characters in length.
- Overly long URLs may not be indexed. Simplify URLs where possible without breaking functionality.
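A length check like this can be run over a sitemap or URL export before submitting pages. This is a minimal sketch; the helper name `url_too_long` is hypothetical, and it counts characters as the limit above is stated.

```python
MAX_URL_LENGTH = 1024  # limit stated above, counted in characters


def url_too_long(url: str) -> bool:
    """Return True when the URL exceeds the 1024-character limit."""
    return len(url) > MAX_URL_LENGTH


urls = [
    "https://example.com/shoes?color=red&size=7",
    "https://example.com/search?q=" + "x" * 1024,
]
for url in urls:
    print(len(url), url_too_long(url))
```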
2. Redirect Errors
- If a URL redirects too many times (more than 5), or the final destination URL is too long, crawling fails.
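The redirect rule can be sketched against a simple map of source to target URLs, for example one exported from your server configuration. The function `follow_redirects` and the map format are hypothetical, and applying the 1024-character limit to the final URL is an assumption; the source only says the destination must not be "too long".

```python
MAX_REDIRECTS = 5      # more than 5 hops is treated as a failure
MAX_URL_LENGTH = 1024  # assumption: same limit as for ordinary URLs


def follow_redirects(start: str, redirects: dict[str, str]) -> tuple[str, int]:
    """Follow a redirect map and return (final_url, hop_count).

    Raises ValueError if the chain exceeds MAX_REDIRECTS or the final URL is too long.
    """
    url, hops = start, 0
    while url in redirects:
        hops += 1
        if hops > MAX_REDIRECTS:
            raise ValueError(f"more than {MAX_REDIRECTS} redirects starting at {start}")
        url = redirects[url]
    if len(url) > MAX_URL_LENGTH:
        raise ValueError(f"final URL is longer than {MAX_URL_LENGTH} characters")
    return url, hops


print(follow_redirects("/old", {"/old": "/interim", "/interim": "/new"}))
```

Redirect loops are caught by the same hop limit, since a loop never reaches a URL without a further redirect.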
3. Server Connection Errors
These errors happen when:
- The server responds too slowly
- Baiduspider is unintentionally blocked
Common errors include:
- Timeout
- Connection failed
- Connection refused
- No response
- Connection reset
- Truncated header or response
How to Fix:
- Reduce redundant dynamic URLs. Identical content across different URLs (e.g., ?size=7&color=red vs. ?color=red&size=7) can slow response time.
- Ensure your server is stable and well-configured. Contact your hosting provider if problems persist.
- Check if Baiduspider IPs are accidentally blocked by firewalls, DoS protection, CMS misconfigurations, or DNS issues. Coordinate with your hosting provider if needed.
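For the redundant dynamic URL case, one common approach is to emit query parameters in a fixed order so that equivalent URLs collapse into one. The sketch below sorts parameters alphabetically with Python's standard library; `canonical_url` is a hypothetical helper, and it assumes parameter order does not change what your application serves.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical_url(url: str) -> str:
    """Return the URL with its query parameters sorted by name, then value."""
    parts = urlsplit(url)
    params = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(parts._replace(query=urlencode(params)))


a = canonical_url("https://example.com/shoes?size=7&color=red")
b = canonical_url("https://example.com/shoes?color=red&size=7")
print(a)
print(a == b)
```

Generating internal links through a helper like this keeps the number of distinct URLs the crawler sees closer to the number of distinct pages.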
4. Robots.txt Blocking
If the diagnostic tool shows a crawl failure due to robots.txt restrictions:
- Review your robots.txt file to ensure you’re not unintentionally blocking important pages.
- If it’s an error, correct the file immediately to prevent drops in indexing and traffic.
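Before relying on the diagnostic tool alone, a robots.txt file can be tested locally. The sketch below uses Python's standard `urllib.robotparser`; the helper name `baiduspider_allowed` and the sample rules are hypothetical, and Python's matching is a simple prefix match that may not agree with Baiduspider's own interpretation in every case (for example, with wildcard patterns).

```python
from urllib.robotparser import RobotFileParser


def baiduspider_allowed(robots_txt: str, url: str) -> bool:
    """Return True if the given robots.txt text permits Baiduspider to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("Baiduspider", url)


# Hypothetical rules: Baiduspider is kept out of /private/ only.
rules = """\
User-agent: Baiduspider
Disallow: /private/

User-agent: *
Disallow:
"""
print(baiduspider_allowed(rules, "https://example.com/private/page.html"))
print(baiduspider_allowed(rules, "https://example.com/products/"))
```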
5. DNS Issues
DNS errors occur when:
- Your DNS server is down
- There’s a routing issue between Baiduspider and your domain
Solutions:
- Use the diagnostic tool on key pages (e.g., homepage) to confirm accessibility.
- For persistent issues, contact your DNS provider (often your hosting company).
- Make sure your server correctly returns 404 or 500 for invalid hostnames.
6. 404 Not Found
If Baiduspider requests a page that has been deleted or renamed without a redirect, the server returns a 404. Common causes:
- Broken or outdated links
- Page name changes without 301 redirects
7. Access Denied
This typically happens when:
- Your site requires login for content access
- Proxy authentication is required
- Your host blocks Baiduspider
Make sure public-facing content is accessible without login and Baiduspider is not being blocked at the server level.
8. Parameter Errors
Malformed requests or requests that violate server-side rules will cause the server to reject them. Ensure URL parameters are structured properly.
9. Socket Read/Write Errors
These occur when there’s a failure in TCP communication between Baiduspider and your server. Check server logs and firewall settings.
10. Failed HTTP Header or Content Retrieval
Sometimes your server receives Baiduspider’s request but only returns partial data, causing the crawl to fail. This could be a truncated HTTP header or incomplete response body.