Rate Limits

Rate limits help ensure fair usage and maintain service quality for all users. This guide explains HelpingAI's rate limiting system and how to work within these limits effectively.

Rate Limit Tiers

Free Tier

  • Requests per minute: 60
  • Tokens per minute: 10,000
  • Concurrent requests: 5

Pro Tier

  • Requests per minute: 3,000
  • Tokens per minute: 500,000
  • Concurrent requests: 50

Enterprise Tier

  • Requests per minute: Custom (typically 10,000+)
  • Tokens per minute: Custom (typically 2,000,000+)
  • Concurrent requests: Custom (typically 200+)

Understanding Rate Limits

Request-Based Limits

Limits the number of API calls you can make per minute, regardless of size.

python
# Each of these counts as 1 request
client.chat.completions.create(
    model="Dhanishtha-2.0-preview",
    messages=[{"role": "user", "content": "Hi"}]
)

client.chat.completions.create(
    model="Dhanishtha-2.0-preview",
    messages=[{"role": "user", "content": "This is a much longer message..."}]
)

Token-Based Limits

Limits the total number of tokens (input + output) processed per minute.

python
# This request uses roughly 10 tokens in total (input + output)
response = client.chat.completions.create(
    model="Dhanishtha-2.0-preview",
    messages=[{"role": "user", "content": "Hello!"}]  # ~2 input tokens
)
# Response: "Hello! How can I help you today?"  # ~8 output tokens
# Total: ~10 tokens

Concurrent Request Limits

Limits how many requests can be processed simultaneously.
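
You can enforce this cap on the client side by bounding how many requests run at once. A minimal sketch using an asyncio semaphore, assuming an async-capable client (as in the request-queuing example later in this guide); the 5-request cap matches the Free tier:

python
import asyncio

# Cap in-flight requests at the Free tier limit of 5
semaphore = asyncio.Semaphore(5)

async def bounded_request(messages):
    # Only 5 coroutines can hold the semaphore at once; the rest wait here
    async with semaphore:
        return await client.chat.completions.create(
            model="Dhanishtha-2.0-preview",
            messages=messages
        )

# Launch many requests; at most 5 hit the API at any moment
results = await asyncio.gather(*[
    bounded_request([{"role": "user", "content": f"Request {i}"}])
    for i in range(20)
])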

Rate Limit Headers

Every API response includes rate limit information in the headers:

http
HTTP/1.1 200 OK
X-RateLimit-Limit-Requests: 60
X-RateLimit-Remaining-Requests: 59
X-RateLimit-Reset-Requests: 1640995200
X-RateLimit-Limit-Tokens: 10000
X-RateLimit-Remaining-Tokens: 9950
X-RateLimit-Reset-Tokens: 1640995200

Header Meanings

Header                          | Description
X-RateLimit-Limit-Requests      | Maximum requests per minute
X-RateLimit-Remaining-Requests  | Requests remaining in the current window
X-RateLimit-Reset-Requests      | Unix timestamp when the request limit resets
X-RateLimit-Limit-Tokens        | Maximum tokens per minute
X-RateLimit-Remaining-Tokens    | Tokens remaining in the current window
X-RateLimit-Reset-Tokens        | Unix timestamp when the token limit resets
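
Because the reset headers are Unix timestamps, you can compute how long to pause before retrying. A small sketch, assuming you have access to the raw response headers as a dictionary (how you obtain them depends on your HTTP client or SDK version):

python
import time

def seconds_until_reset(headers):
    """Seconds until the request-rate window resets, from the Unix-timestamp header."""
    reset_at = int(headers.get("X-RateLimit-Reset-Requests", 0))
    return max(0, reset_at - int(time.time()))

# Example with the headers shown above
headers = {
    "X-RateLimit-Remaining-Requests": "0",
    "X-RateLimit-Reset-Requests": "1640995200",
}
if int(headers["X-RateLimit-Remaining-Requests"]) == 0:
    time.sleep(seconds_until_reset(headers))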

Handling Rate Limits

429 Error Response

When you exceed rate limits, you'll receive a 429 status code:

json
{
  "error": {
    "message": "Rate limit exceeded. Please try again in 30 seconds.",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded"
  }
}

Python Example with Retry Logic

python
import time
from helpingai import HelpingAI
from helpingai.exceptions import RateLimitError

client = HelpingAI(api_key="your-api-key")

def make_request_with_retry(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="Dhanishtha-2.0-preview",
                messages=messages
            )
            return response
            
        except RateLimitError as e:
            if attempt < max_retries - 1:
                # Exponential backoff: wait 1s, 2s, 4s, ... between attempts
                wait_time = 2 ** attempt
                print(f"Rate limited. Waiting {wait_time} seconds...")
                time.sleep(wait_time)
            else:
                raise e
    
    return None

# Usage
response = make_request_with_retry([
    {"role": "user", "content": "Hello!"}
])

JavaScript Example with Retry Logic

javascript
import { HelpingAI } from 'helpingai';

const client = new HelpingAI({
  apiKey: 'your-api-key'
});

async function makeRequestWithRetry(messages, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const response = await client.chat.completions.create({
        model: 'Dhanishtha-2.0-preview',
        messages: messages
      });
      return response;
      
    } catch (error) {
      if (error.status === 429 && attempt < maxRetries - 1) {
        const waitTime = Math.pow(2, attempt) * 1000; // Exponential backoff
        console.log(`Rate limited. Waiting ${waitTime}ms...`);
        await new Promise(resolve => setTimeout(resolve, waitTime));
      } else {
        throw error;
      }
    }
  }
}

// Usage
const response = await makeRequestWithRetry([
  {role: 'user', content: 'Hello!'}
]);

Rate Limiting Strategies

1. Request Queuing

Implement a queue to manage requests within rate limits:

python
import asyncio
from collections import deque
import time

class RateLimitedClient:
    def __init__(self, client, requests_per_minute=60):
        self.client = client
        self.requests_per_minute = requests_per_minute
        self.request_times = deque()
        
    async def make_request(self, **kwargs):
        await self._wait_if_needed()
        
        response = await self.client.chat.completions.create(**kwargs)
        self.request_times.append(time.time())
        
        return response
    
    async def _wait_if_needed(self):
        now = time.time()
        
        # Remove requests older than 1 minute
        while self.request_times and now - self.request_times[0] > 60:
            self.request_times.popleft()
        
        # If we're at the limit, wait
        if len(self.request_times) >= self.requests_per_minute:
            wait_time = 60 - (now - self.request_times[0])
            if wait_time > 0:
                await asyncio.sleep(wait_time)

# Usage
rate_limited_client = RateLimitedClient(client, requests_per_minute=60)
response = await rate_limited_client.make_request(
    model="Dhanishtha-2.0-preview",
    messages=[{"role": "user", "content": "Hello!"}]
)

2. Token Budget Management

Track token usage to stay within limits:

python
import time
from collections import deque

class TokenBudgetManager:
    def __init__(self, tokens_per_minute=10000):
        self.tokens_per_minute = tokens_per_minute
        self.token_usage = deque()
    
    def estimate_tokens(self, text):
        # Rough estimation: ~4 characters per token
        return len(text) // 4
    
    def can_make_request(self, messages, max_tokens=150):
        # Estimate input tokens
        input_tokens = sum(self.estimate_tokens(msg['content']) for msg in messages)
        estimated_total = input_tokens + max_tokens
        
        # Check current usage
        now = time.time()
        current_usage = sum(
            tokens for timestamp, tokens in self.token_usage
            if now - timestamp < 60
        )
        
        return current_usage + estimated_total <= self.tokens_per_minute
    
    def record_usage(self, usage):
        self.token_usage.append((time.time(), usage.total_tokens))
        
        # Clean old entries
        now = time.time()
        self.token_usage = deque([
            (timestamp, tokens) for timestamp, tokens in self.token_usage
            if now - timestamp < 60
        ])

# Usage
budget_manager = TokenBudgetManager(tokens_per_minute=10000)

messages = [{"role": "user", "content": "Hello!"}]

if budget_manager.can_make_request(messages):
    response = client.chat.completions.create(
        model="Dhanishtha-2.0-preview",
        messages=messages
    )
    budget_manager.record_usage(response.usage)
else:
    print("Would exceed token budget, waiting...")

3. Batch Processing

Process multiple requests efficiently:

python
async def batch_process(requests, batch_size=10, delay=1.0):
    results = []
    
    for i in range(0, len(requests), batch_size):
        batch = requests[i:i + batch_size]
        
        # Process batch concurrently
        tasks = [
            client.chat.completions.create(**request)
            for request in batch
        ]
        
        batch_results = await asyncio.gather(*tasks, return_exceptions=True)
        results.extend(batch_results)
        
        # Delay between batches to respect rate limits
        if i + batch_size < len(requests):
            await asyncio.sleep(delay)
    
    return results

# Usage
requests = [
    {
        "model": "Dhanishtha-2.0-preview",
        "messages": [{"role": "user", "content": f"Request {i}"}]
    }
    for i in range(100)
]

results = await batch_process(requests, batch_size=10, delay=6.0)

Monitoring Rate Limits

Real-time Monitoring

python
from datetime import datetime

def monitor_rate_limits(response):
    """Extract and display rate limit information"""
    if hasattr(response, '_headers'):
        headers = response._headers
        
        print("Rate Limit Status:")
        print(f"Requests: {headers.get('x-ratelimit-remaining-requests', 'N/A')}/{headers.get('x-ratelimit-limit-requests', 'N/A')}")
        print(f"Tokens: {headers.get('x-ratelimit-remaining-tokens', 'N/A')}/{headers.get('x-ratelimit-limit-tokens', 'N/A')}")
        
        reset_time = headers.get('x-ratelimit-reset-requests')
        if reset_time:
            reset_datetime = datetime.fromtimestamp(int(reset_time))
            print(f"Resets at: {reset_datetime}")

# Usage
response = client.chat.completions.create(
    model="Dhanishtha-2.0-preview",
    messages=[{"role": "user", "content": "Hello!"}]
)
monitor_rate_limits(response)

Dashboard Monitoring

Monitor your usage through the HelpingAI dashboard:

  • Real-time rate limit status
  • Historical usage patterns
  • Usage alerts and notifications
  • Rate limit increase requests

Best Practices

1. Implement Exponential Backoff

python
import random

def exponential_backoff(attempt, base_delay=1, max_delay=60):
    """Calculate delay with jitter"""
    delay = min(base_delay * (2 ** attempt), max_delay)
    jitter = random.uniform(0, delay * 0.1)  # Add 10% jitter
    return delay + jitter
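
A usage sketch combining this helper with a retry loop, reusing the client, time, and RateLimitError imports from the Python retry example above:

python
# Retry with exponential backoff plus jitter
last_error = None
for attempt in range(5):
    try:
        response = client.chat.completions.create(
            model="Dhanishtha-2.0-preview",
            messages=[{"role": "user", "content": "Hello!"}]
        )
        break
    except RateLimitError as e:
        last_error = e
        # Sleep ~1s, ~2s, ~4s, ... before the next attempt
        time.sleep(exponential_backoff(attempt))
else:
    raise last_error  # every attempt was rate limited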

2. Use Streaming for Long Responses

Streaming doesn't reduce token usage but provides better user experience:

python
stream = client.chat.completions.create(
    model="Dhanishtha-2.0-preview",
    messages=[{"role": "user", "content": "Write a long story"}],
    stream=True,
    max_tokens=1000
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

3. Optimize Token Usage

  • Use shorter, more specific prompts
  • Set appropriate max_tokens limits
  • Use lower temperature for factual responses
  • Implement conversation trimming for long chats (see the sketch below)
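
Conversation trimming is the easiest of these to overlook. A minimal sketch that keeps the system prompt and only the most recent turns; the trim_history helper and conversation_history list are illustrative, not part of the SDK:

python
def trim_history(messages, max_turns=10):
    """Keep the system message (if any) plus the last max_turns messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_turns:]

# conversation_history: your accumulated list of {"role": ..., "content": ...} dicts
response = client.chat.completions.create(
    model="Dhanishtha-2.0-preview",
    messages=trim_history(conversation_history),
    max_tokens=150
)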

4. Cache Responses

Cache responses for repeated queries:

python
import hashlib
import json

class ResponseCache:
    def __init__(self):
        self.cache = {}
    
    def get_cache_key(self, messages, **kwargs):
        # Create hash of request parameters
        request_data = {
            'messages': messages,
            **kwargs
        }
        return hashlib.md5(json.dumps(request_data, sort_keys=True).encode()).hexdigest()
    
    def get(self, messages, **kwargs):
        key = self.get_cache_key(messages, **kwargs)
        return self.cache.get(key)
    
    def set(self, messages, response, **kwargs):
        key = self.get_cache_key(messages, **kwargs)
        self.cache[key] = response

# Usage
cache = ResponseCache()

def cached_request(messages, **kwargs):
    # Check cache first
    cached_response = cache.get(messages, **kwargs)
    if cached_response:
        return cached_response
    
    # Make request if not cached
    response = client.chat.completions.create(
        model="Dhanishtha-2.0-preview",
        messages=messages,
        **kwargs
    )
    
    # Cache the response
    cache.set(messages, response, **kwargs)
    return response

Upgrading Your Limits

When to Upgrade

Consider upgrading when you:

  • Consistently hit rate limits
  • Need higher throughput for production
  • Require lower latency
  • Want priority support

How to Upgrade

  1. Pro Tier: Upgrade through the dashboard
  2. Enterprise: Contact enterprise@helpingai.co
  3. Temporary Increases: Contact support for special events

Custom Rate Limits

Enterprise customers can request:

  • Higher request/token limits
  • Custom concurrent request limits
  • Burst allowances for peak usage
  • Regional rate limit allocation

Troubleshooting

Common Issues

"Rate limit exceeded" errors:

  • Check your current tier limits
  • Implement retry logic with backoff
  • Consider upgrading your plan

Inconsistent rate limiting:

  • Rate limits are per-minute rolling windows
  • Multiple API keys share the same limits
  • Check for other applications using your key

Unexpected token usage:

  • Monitor actual vs. estimated token usage
  • Consider conversation context accumulation
  • Use hideThink=true to reduce reasoning tokens (see the hedged sketch below)
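
If reasoning output is driving up your usage, the hideThink option mentioned above may help. A hedged sketch, assuming your SDK version accepts it as a keyword argument on chat.completions.create (check the SDK reference for the exact spelling and placement):

python
response = client.chat.completions.create(
    model="Dhanishtha-2.0-preview",
    messages=[{"role": "user", "content": "Summarize this ticket."}],
    hideThink=True  # assumption: forwarded to the API as a request parameter
)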

Getting Help


Need higher limits? Upgrade your plan or contact enterprise sales for custom solutions.