Conversation with Gemini

You said

who is baiduspider, and why is it hammering my web server?

Gemini said

Baiduspider is the official web-crawling bot for Baidu, the dominant search engine in China. Just as Google uses Googlebot to index the web, Baidu uses Baiduspider to find and rank content for its users.

If it feels like it’s "hammering" your server, you aren't alone. Baiduspider is notorious among system administrators for its aggressive crawling patterns.

Why is it hitting your server so hard?

There are three common reasons why Baiduspider might be causing high load:

  1. High Crawl Frequency: By default, Baiduspider can be much more aggressive than Google or Bing. It may attempt to crawl thousands of pages in a very short window, which can spike CPU usage or exhaust database connections.

  2. Lack of "Crawl-Delay" Support: Historically, Baidu ignored the Crawl-delay directive in robots.txt (unlike Bing or Yandex). While some recent reports suggest they are starting to support it, it is often unreliable compared to other bots.

  3. Aggressive Mobile & Render Crawling: Baidu uses separate crawlers for desktop, mobile, and "render" (JavaScript execution). If your site is complex, these multiple bots hitting simultaneously can feel like a small DDoS attack.
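
Before changing anything, it is worth quantifying the load. A quick way (a sketch, assuming a combined-format access log where the User-Agent is the sixth quote-delimited field) is to count requests per User-Agent:

```shell
# top_agents: print request counts per User-Agent from a
# combined-format access log (UA is the 6th "-delimited field).
top_agents() {
  awk -F'"' '{ print $6 }' "$1" | sort | uniq -c | sort -rn | head
}
# Example (log path is an assumption; adjust for your server):
#   top_agents /var/log/nginx/access.log
```

If Baiduspider tops that list by a wide margin, the steps below will help.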


How to Tame the Spider

If the traffic is affecting your site performance, you have a few options ranging from "polite request" to "hard block."

1. The robots.txt Method (The Polite Way)

You can try to slow it down or block it entirely from certain directories. To block it completely:

Plaintext
User-agent: Baiduspider
Disallow: /

Note: If you want Chinese traffic/SEO, do not use the block above.
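
If you would rather slow it down than block it, you can also try a Crawl-delay directive; as noted above, Baidu's support for it is unreliable, so treat this as a best-effort hint (the 5-second value is an arbitrary example):

```
User-agent: Baiduspider
Crawl-delay: 5
```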

2. Use Baidu Webmaster Tools (The Official Way)

Baidu provides a Crawl Frequency (抓取频次) tool within their Webmaster platform.

  • You can log in and use a slider to manually limit the maximum number of requests per second.

  • The Catch: The interface is entirely in Chinese, and you must verify ownership of your site first.

3. Server-Level Blocking (The "Enough is Enough" Way)

Since you're comfortable with Unix-like systems and tools like ipfw or .htaccess rules, you can drop the hammer at the gate.

Via Apache (.htaccess):

Apache
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC]
RewriteRule .* - [F,L]

Via Nginx:

Nginx
if ($http_user_agent ~* "Baiduspider") {
    return 403;
}

4. Verify it's actually Baidu

Sometimes "bad bots" spoof their User-Agent to look like Baiduspider to bypass simple filters. You can verify a real Baidu bot by performing a reverse DNS lookup on the IP address. A legitimate Baidu bot will resolve to a domain ending in *.baidu.com or *.baidu.jp.
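
This check can be sketched in shell. The hostname-suffix test below is the core of it; for full confidence, also do a forward lookup of the returned name and confirm it resolves back to the original IP (forward-confirmed reverse DNS). The helper name and example IP are illustrative:

```shell
# is_baidu_host: true if a PTR hostname ends in .baidu.com or .baidu.jp.
# A bare "fakebaidu.com" or "baidu.com.evil.net" will NOT match.
is_baidu_host() {
  case "$1" in
    *.baidu.com|*.baidu.jp) return 0 ;;
    *) return 1 ;;
  esac
}
# Example (requires network; `host` prints "... domain name pointer <name>."):
#   name=$(host 180.76.15.1 | awk '/pointer/ { print $NF }' | sed 's/\.$//')
#   is_baidu_host "$name" && echo "genuine Baiduspider"
```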


Would you like me to help you craft a specific ipfw rule or a rate-limiting configuration for your web server to throttle this traffic without blocking it entirely?

You said

Yes

Gemini said

Since you're managing this on a system where you already utilize ipfw, we can handle this directly at the firewall level. This is often more efficient than letting the traffic reach your web server (Nginx/Apache) only to be rejected there.

You have two main approaches depending on whether you want a hard block or a throttle.

Option 1: The Hard Block (Lowest Overhead)

If you don't care about SEO in the Chinese market, the most resource-efficient method is to drop the packets entirely. Note that ipfw operates at the packet level, so it cannot inspect the HTTP User-Agent header at all; you block by source address instead.

Fortunately, Baidu crawls from well-known IP ranges, so you can block their primary ASNs (Autonomous System Numbers) or known CIDR blocks.

Bash
# Example: Blocking a common Baidu netblock
ipfw add 1000 deny ip from 180.76.0.0/16 to any

Option 2: Rate Limiting with ipfw (The "Drip Feed")

If you want to allow the bot but prevent it from "hammering" your resources, you can use pipes to limit the bandwidth or states to limit the connection count.

To limit Baidu to a very thin pipe (e.g., 10Kbit/s):

  1. Create the pipe: ipfw pipe 1 config bw 10Kbit

  2. Direct the traffic (assuming you've identified their IP range): ipfw add pipe 1 tcp from 180.76.0.0/16 to any dst-port 80,443 (note the tcp keyword: ipfw only accepts dst-port with a port-carrying protocol)
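
The "states" approach mentioned above can be sketched with ipfw's limit option, which caps concurrent dynamic states per source address. This is a sketch: the rule number and the cap of 4 connections are arbitrary, and your ruleset needs a check-state rule earlier for dynamic rules to be evaluated:

```
# Cap each Baidu source address to 4 concurrent TCP connections.
# `limit` implies keep-state; ensure a check-state rule runs earlier.
ipfw add 1200 allow tcp from 180.76.0.0/16 to any dst-port 80,443 setup limit src-addr 4
```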


Option 3: Nginx Rate Limiting (The "Smart" Way)

If you prefer to keep the logic within your web server configuration, Nginx's limit_req module is excellent for this. You can define a specific zone for Baidu and give it a very low "burst" allowance.

Nginx
# Map the User-Agent to a rate-limit key: empty by default,
# "baidu" for Baiduspider (both directives live in the 'http' block)
map $http_user_agent $anti_spider {
    default "";
    ~*Baiduspider "baidu";
}

# Requests with an empty key are exempt, so only Baiduspider is limited
limit_req_zone $anti_spider zone=baidu_limit:10m rate=1r/m;

# Apply to your 'server' or 'location' block
server {
    ...
    limit_req zone=baidu_limit burst=2 nodelay;
}

With burst=2 nodelay, this forces Baiduspider down to roughly one request per minute: a burst of two requests is absorbed immediately, and anything beyond that is rejected (with a 503 by default).

A Note on Your Hardware

If your web server is running on the same machine as your other services (like your solar monitoring or media tools), aggressive crawling can actually cause "jitter" in local data collection if the CPU spikes. Implementing the block at the firewall (ipfw) level is the best way to ensure your system's overall responsiveness remains stable.

Would you like me to help you find a current list of Baidu's IP ranges to make the ipfw rules more accurate?

You said

Yes

Gemini said

To effectively rein in Baiduspider at the firewall level, you can target the primary address blocks owned by Baidu Netcom Science and Technology.

Based on publicly reported network data, the majority of their crawling activity originates from a few large "superblocks." Blocking or throttling these will cover the vast majority of their bots.

Primary Baiduspider IP Ranges (CIDR)

These three blocks are reported to cover the bulk of their typical crawling infrastructure:

  • 180.76.0.0/16 (The most common range for Baiduspider)

  • 220.181.0.0/16 (Primary Beijing data center)

  • 116.179.32.0/20 (Mobile and render-specific bots)


Implementation for ipfw

Since you're comfortable with ipfw, here is how you can implement these. I recommend placing these high up in your ruleset (e.g., in the 1000–2000 range) to save CPU cycles.

To Block Completely:

Bash
# Block the major Baidu netblocks
ipfw add 1100 deny tcp from 180.76.0.0/16 to any dst-port 80,443
ipfw add 1110 deny tcp from 220.181.0.0/16 to any dst-port 80,443
ipfw add 1120 deny tcp from 116.179.32.0/20 to any dst-port 80,443

To Throttle (The "Polite" Way):

If you still want your site indexed but don't want the performance hit, use a dummynet pipe. This forces the spider to wait, effectively slowing its crawl rate without a hard 403 error.

Bash
# 1. Create a 56k-style tiny pipe
ipfw pipe 10 config bw 50Kbit

# 2. Funnel Baidu into that pipe
ipfw add pipe 10 tcp from 180.76.0.0/16 to any dst-port 80,443
ipfw add pipe 10 tcp from 220.181.0.0/16 to any dst-port 80,443
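
If you keep the CIDR list in one place, a small script can generate the matching rules. This sketch only prints them (the rule numbers and the choice of deny over a pipe are assumptions), so you can review the output before piping it to sh as root:

```shell
# gen_rules: print one ipfw deny rule per CIDR block, numbered from 1100.
BAIDU_CIDRS="180.76.0.0/16 220.181.0.0/16 116.179.32.0/20"
gen_rules() {
  rule=1100
  for cidr in $BAIDU_CIDRS; do
    echo "ipfw add $rule deny tcp from $cidr to any dst-port 80,443"
    rule=$((rule + 10))
  done
}
gen_rules   # preview only; apply as root with: gen_rules | sh
```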

Verification Tip

Before you apply a permanent block, you can quickly verify if a specific "hammering" IP is legitimate by running this in your terminal: host [THE_IP_ADDRESS]

A real Baidu bot will resolve back to a hostname ending in .baidu.com or .baidu.jp. If it resolves to something generic like a VPS provider, it's a fake bot and you should definitely block it.

Would you like me to show you how to automate this check in a shell script so your server can auto-ban "fake" spiders?
