Crawl Diagnostics

The following article is our English interpretation of a post originally published in Chinese on Baidu's Webmaster Help pages. If you'd like to check out the original article, you can find it here: https://ziyuan.baidu.com/college/courseinfo?id=267&page=9

The Crawl Diagnostics Tool lets webmasters see how Baiduspider (Baidu's search bot) views their website, helping them diagnose whether the content Baiduspider retrieves matches what they expect.

  • Each website is allowed up to 70 diagnostic requests per week.
  • The tool displays up to the first 200KB of content visible to Baiduspider.

What Can the Crawl Diagnostics Tool Do?

Currently, this tool offers several key functions:

1. Verify if Crawled Content Matches Expectations

Some content, such as pricing details on product pages, is often loaded with JavaScript, which Baiduspider may not execute. As a result, key information can be missing from search results. You can use this tool to re-test pages after making adjustments.
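
As a quick check outside the tool, you can also fetch the raw HTML the way a crawler would and confirm that critical text is present before any JavaScript runs. A minimal sketch in Python; the URL, the expected text, and the user-agent string are placeholders to replace with your own values:

    import urllib.request

    URL = "https://www.example.com/product/123"   # hypothetical product page
    EXPECTED = "¥199"                              # text that should be crawlable
    UA = ("Mozilla/5.0 (compatible; Baiduspider/2.0; "
          "+http://www.baidu.com/search/spider.html)")

    req = urllib.request.Request(URL, headers={"User-Agent": UA})
    html = urllib.request.urlopen(req, timeout=10).read().decode("utf-8", "replace")

    if EXPECTED in html:
        print("Key content is present in the raw HTML.")
    else:
        print("Key content is missing; it is likely injected by JavaScript.")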

2. Check for Hidden Spam Links or Malicious Content

If your site has been hacked, it might contain hidden links (often visible only to crawlers like Baiduspider). This tool can detect such unauthorized content.
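
One way to look for injected content yourself is to fetch the same page with a browser user-agent and with a crawler user-agent and compare the links, since hacked pages sometimes serve extra links only to crawlers. A rough sketch, with a hypothetical URL and placeholder user-agent strings:

    import urllib.request
    from html.parser import HTMLParser

    class LinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = set()

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links.update(v for k, v in attrs if k == "href" and v)

    def links_for(url, ua):
        req = urllib.request.Request(url, headers={"User-Agent": ua})
        html = urllib.request.urlopen(req, timeout=10).read().decode("utf-8", "replace")
        parser = LinkCollector()
        parser.feed(html)
        return parser.links

    URL = "https://www.example.com/"   # hypothetical page to audit
    BROWSER = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
    SPIDER = ("Mozilla/5.0 (compatible; Baiduspider/2.0; "
              "+http://www.baidu.com/search/spider.html)")

    extra = links_for(URL, SPIDER) - links_for(URL, BROWSER)
    print("Links served only to the crawler:", extra or "none")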

3. Test Connectivity Between Your Site and Baidu

If Baiduspider is accessing your site using outdated or incorrect IP information, the tool can alert Baidu to update it.
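
Relatedly, if you are unsure whether a visitor claiming to be Baiduspider is genuine, a common check is a reverse-then-forward DNS lookup: real Baiduspider requests generally reverse-resolve to hostnames under baidu.com or baidu.jp (confirm the exact rule against Baidu's current documentation). A minimal sketch:

    import socket

    def is_probably_baiduspider(ip):
        """Reverse-resolve the IP, check the hostname suffix, then forward-confirm."""
        try:
            host = socket.gethostbyaddr(ip)[0]
        except OSError:
            return False
        if not host.endswith((".baidu.com", ".baidu.jp")):
            return False
        try:
            # The hostname should resolve back to the same IP address.
            return ip in socket.gethostbyname_ex(host)[2]
        except OSError:
            return False

    print(is_probably_baiduspider("203.0.113.7"))   # replace with an IP from your access log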


Common Crawl Errors Explained

1. URL Format Issues

  • Baidu supports URLs up to 1024 characters in length.
  • Overly long URLs may not be indexed. Simplify URLs where possible without breaking functionality.
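
If you want to audit existing URLs against that limit in bulk, a few lines of scripting are enough. A minimal sketch, assuming your URLs are listed one per line in a hypothetical urls.txt:

    MAX_LEN = 1024   # length limit mentioned above

    with open("urls.txt", encoding="utf-8") as f:
        for line in f:
            url = line.strip()
            if len(url) > MAX_LEN:
                print(f"{len(url)} chars, too long: {url[:80]}...")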

2. Redirect Errors

  • If a URL redirects too many times (more than 5), or the final destination URL is too long, crawling fails.
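
You can trace a redirect chain yourself to count how many hops Baiduspider would have to follow. A rough sketch using only the standard library, following redirects manually so each hop is visible (the starting URL is a placeholder):

    import urllib.error
    import urllib.parse
    import urllib.request

    class NoRedirect(urllib.request.HTTPRedirectHandler):
        def redirect_request(self, *args, **kwargs):
            return None   # make urllib raise HTTPError instead of following 3xx

    opener = urllib.request.build_opener(NoRedirect())
    url, hops = "https://www.example.com/old-page", 0   # hypothetical starting URL

    while hops <= 10:
        try:
            resp = opener.open(url, timeout=10)
            print(f"Reached {url} ({resp.status}) after {hops} redirect(s)")
            break
        except urllib.error.HTTPError as e:
            location = e.headers.get("Location")
            if e.code in (301, 302, 303, 307, 308) and location:
                url = urllib.parse.urljoin(url, location)
                hops += 1
            else:
                print(f"Stopped at {url} with status {e.code}")
                break
    else:
        print("More than 10 redirects; well past the limit described above.")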

3. Server Connection Errors

These errors happen when:

  • The server responds too slowly
  • Baiduspider is unintentionally blocked

Common errors include:

  • Timeout
  • Connection failed
  • Connection refused
  • No response
  • Connection reset
  • Truncated header or response

How to Fix:

  • Reduce redundant dynamic URLs. Identical content served under different URLs (e.g., ?size=7&color=red vs. ?color=red&size=7) can slow response times; see the canonicalization sketch after this list.
  • Ensure your server is stable and well-configured. Contact your hosting provider if problems persist.
  • Check if Baiduspider IPs are accidentally blocked by firewalls, DoS protection, CMS misconfigurations, or DNS issues. Coordinate with your hosting provider if needed.
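
On the first point, one common fix is to canonicalize the query string so that equivalent parameter orders collapse to a single URL, and then to link only to the canonical form internally. A minimal sketch of the idea; the sorting rule is an assumption to adapt to whatever your site treats as equivalent:

    from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

    def canonicalize(url):
        """Return the URL with its query parameters sorted into a fixed order."""
        parts = urlsplit(url)
        query = urlencode(sorted(parse_qsl(parts.query)))
        return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))

    a = canonicalize("https://www.example.com/shoe?size=7&color=red")
    b = canonicalize("https://www.example.com/shoe?color=red&size=7")
    print(a == b)   # True: both variants collapse to one canonical URL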

4. Robots.txt Blocking

If the diagnostic tool shows a crawl failure due to robots.txt restrictions:

  • Review your robots.txt file to ensure you’re not unintentionally blocking important pages (a quick check is sketched after this list).
  • If it’s an error, correct the file immediately to prevent drops in indexing and traffic.
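
You can confirm what your robots.txt actually allows Baiduspider to fetch with the standard-library parser. A minimal sketch with hypothetical URLs:

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser("https://www.example.com/robots.txt")
    rp.read()

    for url in ("https://www.example.com/", "https://www.example.com/products/42"):
        verdict = "allowed" if rp.can_fetch("Baiduspider", url) else "BLOCKED"
        print(f"{verdict}: {url}")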

5. DNS Issues

DNS errors occur when:

  • Your DNS server is down
  • There’s a routing issue between Baiduspider and your domain

Solutions:

  • Use the diagnostic tool on key pages (e.g., homepage) to confirm accessibility.
  • For persistent issues, contact your DNS provider (often your hosting company).
  • Make sure your server correctly returns 404 or 500 for invalid hostnames (a quick check is sketched below).
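
The last point is worth spot-checking: if a hostname you never configured (common with wildcard DNS or a catch-all virtual host) still answers 200 with your normal content, crawlers may index junk hostnames. A rough sketch with a hypothetical bogus hostname:

    import urllib.error
    import urllib.request

    # A subdomain you never set up; with wildcard DNS it may still resolve.
    bogus = "http://no-such-subdomain-12345.example.com/"

    try:
        resp = urllib.request.urlopen(bogus, timeout=10)
        print(f"Unexpected {resp.status}: the bogus hostname serves real content")
    except urllib.error.HTTPError as e:
        print(f"Good: the bogus hostname answered with {e.code}")
    except urllib.error.URLError as e:
        print(f"Good: the bogus hostname did not resolve or connect ({e.reason})")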

6. 404 Not Found

If Baiduspider requests a page that has been deleted or renamed without a redirect, the server returns a 404. Common causes:

  • Broken or outdated links
  • Page name changes without 301 redirects (a verification sketch follows this list)
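
After renaming pages, you can verify that every old URL answers with a 301 pointing at its new location before Baiduspider recrawls it. A minimal sketch; the URL pairs are placeholders, and the opener deliberately refuses to follow redirects so the 3xx status stays visible:

    import urllib.error
    import urllib.request

    class NoRedirect(urllib.request.HTTPRedirectHandler):
        def redirect_request(self, *args, **kwargs):
            return None   # surface the 3xx instead of silently following it

    opener = urllib.request.build_opener(NoRedirect())
    moves = {"https://www.example.com/old-name": "https://www.example.com/new-name"}

    for old, new in moves.items():
        try:
            opener.open(old, timeout=10)
            print(f"{old} did not redirect at all")
        except urllib.error.HTTPError as e:
            loc = e.headers.get("Location")
            ok = e.code == 301 and loc == new
            print(f"{old} -> {e.code} {loc} ({'OK' if ok else 'check this'})")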

7. Access Denied

This typically happens when:

  • Your site requires login for content access
  • Proxy authentication is required
  • Your host blocks Baiduspider

Make sure public-facing content is accessible without login and Baiduspider is not being blocked at the server level.
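
A quick way to spot an accidental block is to request a few public pages with Baiduspider's user-agent string and confirm they return 200 rather than 401 or 403. Note that many blocks are IP-based, which a test from your own machine will not catch. A minimal sketch with placeholder URLs:

    import urllib.error
    import urllib.request

    UA = ("Mozilla/5.0 (compatible; Baiduspider/2.0; "
          "+http://www.baidu.com/search/spider.html)")

    for url in ("https://www.example.com/", "https://www.example.com/products/42"):
        req = urllib.request.Request(url, headers={"User-Agent": UA})
        try:
            print(url, urllib.request.urlopen(req, timeout=10).status)
        except urllib.error.HTTPError as e:
            print(url, e.code, "<- blocked or login required?")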

8. Parameter Errors

The server rejects requests that are malformed or that violate its rules. Ensure URL parameters are structured and encoded properly.
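
In practice, "structured properly" mostly means that reserved and non-ASCII characters in parameter values are percent-encoded rather than pasted into the query string raw. A minimal sketch of building a query string safely (the parameters are made up):

    from urllib.parse import urlencode

    params = {"q": "running shoes", "color": "red/blue", "page": 2}
    print("https://www.example.com/search?" + urlencode(params))
    # -> https://www.example.com/search?q=running+shoes&color=red%2Fblue&page=2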

9. Socket Read/Write Errors

These occur when there’s a failure in TCP communication between Baiduspider and your server. Check server logs and firewall settings.
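
If you suspect problems at the TCP level, a raw-socket request can separate "connection refused", "connection reset", and timeouts from HTTP-level errors. A rough sketch against a placeholder host:

    import socket

    HOST, PORT = "www.example.com", 80   # placeholder host

    try:
        with socket.create_connection((HOST, PORT), timeout=10) as s:
            request = f"HEAD / HTTP/1.1\r\nHost: {HOST}\r\nConnection: close\r\n\r\n"
            s.sendall(request.encode("ascii"))
            print("First bytes of response:", s.recv(64))
    except ConnectionRefusedError:
        print("Connection refused (port closed or firewalled)")
    except ConnectionResetError:
        print("Connection reset by peer")
    except socket.timeout:
        print("Timed out (server too slow or packets dropped)")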

10. Failed HTTP Header or Content Retrieval

Sometimes your server receives Baiduspider’s request but only returns partial data, causing the crawl to fail. This could be a truncated HTTP header or incomplete response body.
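
A truncated response is usually visible as a mismatch between the promised Content-Length and the bytes actually delivered, which Python surfaces as an IncompleteRead error. A minimal sketch with a placeholder URL:

    import http.client
    import urllib.request

    url = "https://www.example.com/some-page"   # placeholder URL

    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            declared = resp.headers.get("Content-Length")
            body = resp.read()
        print(f"Received {len(body)} bytes; Content-Length header said {declared}")
    except http.client.IncompleteRead as e:
        # The server closed the connection before sending the promised bytes.
        print(f"Truncated: got {len(e.partial)} bytes, {e.expected} more expected")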