Audience: Backend engineers, CTOs
Reading time: 20 minutes
Telegram bots look simple until you have real users. A weekend project becomes a production system, and suddenly you're dealing with rate limit errors at 2 AM, messages arriving out of order, webhook timeouts that silently drop updates, and a state management model that made sense for 100 users but breaks at 50,000.
We've built and operated Telegram bots at various scales - from internal tools handling a few hundred messages per day to public bots processing 20,000+ messages per hour. Here's what we learned that's not in the Telegram API documentation.
The Fundamental Problem: Telegram's Delivery Guarantees
Before diving into architecture, it's worth being precise about what Telegram guarantees and what it doesn't.
What Telegram guarantees: your webhook will be called with each update. If your webhook returns a non-2xx response, Telegram retries the delivery, and undelivered updates are kept on Telegram's servers for up to 24 hours.
What Telegram doesn't guarantee: the order of delivery, exactly-once delivery, or that your webhook is the only one processing a given update.
The retry behavior sounds helpful until you realize: if your bot goes down for 30 minutes, Telegram accumulates 30 minutes of updates and then delivers them all at once when you come back. If your bot usually handles 50 messages per minute, recovering from a 30-minute outage means handling 1,500 messages in a burst - plus the ongoing normal traffic.
Without architectural preparation for this, the recovery burst overwhelms the bot, it starts returning 500 errors, Telegram retries, and you get a cascade failure that's harder to recover from than the original outage.
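To make the recovery arithmetic concrete, here is a rough sketch of how long a backlog takes to drain while normal traffic keeps arriving. The function name `drainMinutes` and the 200 messages/minute capacity are illustrative assumptions, not measurements from our bots.

```typescript
// Minutes needed to clear a backlog while new traffic keeps arriving.
// Returns Infinity when capacity does not exceed the arrival rate,
// which is the cascade-failure case described above.
function drainMinutes(
  backlog: number,
  arrivalPerMin: number,
  capacityPerMin: number
): number {
  const spare = capacityPerMin - arrivalPerMin;
  if (spare <= 0) return Infinity;
  return backlog / spare;
}

// 30-minute outage at 50 messages/minute = 1,500 queued updates.
const outageBacklog = 30 * 50;

// Assumed capacity of 200 messages/minute: 1,500 / (200 - 50) = 10 minutes.
console.log(drainMinutes(outageBacklog, 50, 200)); // 10

// Capacity equal to normal traffic: the backlog never drains.
console.log(drainMinutes(outageBacklog, 50, 50)); // Infinity
```

The takeaway is that recovery time depends on spare capacity, not total capacity, so a bot sized exactly for its normal load cannot recover from an outage on its own.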
The Architecture That Handles Scale
The pattern that works across the bots we've built:
```
Telegram API
     |
     v
Webhook endpoint (Express/Fastify)
     |
     v  (enqueue immediately, return 200)
Message Queue (BullMQ / Redis)
     |
     v
Worker pool (N workers, configurable)
     |
     ├── Business logic
     ├── State management (Redis/PostgreSQL)
     └── Telegram API calls (rate-limited)
```
The webhook endpoint does exactly one thing: validate the update, enqueue it, return 200. It takes under 5ms. Telegram gets its 200 response immediately, and the actual processing happens asynchronously.
This solves the recovery burst problem: incoming updates queue up in Redis, workers process them at a controlled rate. The queue absorbs the burst; the workers process at whatever rate your business logic and rate limits allow.
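The "validate the update" step can stay very cheap. A minimal sketch, assuming the only thing worth checking before enqueueing is that the body is an object with an integer update_id (the `MinimalUpdate` type and `isTelegramUpdate` name are ours, not part of any library):

```typescript
// Smallest shape we rely on before enqueueing; everything else is
// left for the worker to interpret.
interface MinimalUpdate {
  update_id: number;
  [key: string]: unknown;
}

// Type guard: true only for non-null objects with an integer update_id.
function isTelegramUpdate(body: unknown): body is MinimalUpdate {
  if (typeof body !== "object" || body === null) return false;
  const id = (body as { update_id?: unknown }).update_id;
  return typeof id === "number" && Number.isInteger(id);
}

console.log(isTelegramUpdate({ update_id: 42, message: {} })); // true
console.log(isTelegramUpdate({ message: {} })); // false
```

Deeper validation (message fields, callback data) belongs in the worker, where a slow check does not delay the 200 response.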
The Webhook Handler
```typescript
import { FastifyInstance } from 'fastify';
import { Queue } from 'bullmq';

// `redis` (an ioredis connection) and the TelegramUpdate type are
// assumed to be defined elsewhere in the project.
const updateQueue = new Queue('telegram-updates', { connection: redis });

export function registerWebhook(app: FastifyInstance) {
  app.post<{ Params: { token: string }; Body: TelegramUpdate }>(
    '/webhook/:token',
    async (req, reply) => {
      // Validate webhook token (simple but effective)
      if (req.params.token !== process.env.WEBHOOK_SECRET) {
        return reply.status(403).send();
      }

      const update = req.body;

      // Enqueue with deduplication key
      await updateQueue.add('process', update, {
        // BullMQ deduplicates by jobId; custom ids must not contain ':'
        jobId: `update-${update.update_id}`,
        attempts: 3,
        backoff: { type: 'exponential', delay: 1000 }
      });

      return reply.status(200).send({ ok: true });
    }
  );
}
```
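Using update_id as the job ID gives you deduplication: if Telegram retries delivery of the same update (it does this), the second enqueue is a no-op as long as the first job still exists in Redis. Note that this deduplicates enqueues, not executions: a job that fails is attempted up to three times, so handlers still need to be idempotent for each update to take effect exactly once.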
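One hardening step for the token check in the webhook handler is a constant-time comparison, so response timing does not leak how much of the secret matched. A sketch using Node's built-in crypto module; `secretsMatch` is a hypothetical helper name, and hashing both sides first is our way of satisfying timingSafeEqual's equal-length requirement:

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Compare two secrets in constant time. Hashing both inputs first
// gives equal-length buffers, which timingSafeEqual requires.
function secretsMatch(provided: string, expected: string): boolean {
  const a = createHash("sha256").update(provided).digest();
  const b = createHash("sha256").update(expected).digest();
  return timingSafeEqual(a, b);
}

console.log(secretsMatch("s3cret", "s3cret")); // true
console.log(secretsMatch("s3cre", "s3cret")); // false
```

In the handler, this would replace the direct string comparison against WEBHOOK_SECRET.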
The Worker
```typescript
import { Worker, Job } from 'bullmq';
import { Bot } from 'grammy'; // or the equivalent in telegraf (Node.js) or aiogram (Python)

// `redis`, `logger`, and TelegramUpdate are assumed to be defined elsewhere.
const bot = new Bot(process.env.BOT_TOKEN!);
// grammY needs the bot info loaded before handleUpdate() is called directly
await bot.init();

const worker = new Worker('telegram-updates', async (job: Job) => {
  const update = job.data as TelegramUpdate;

  try {
    await bot.handleUpdate(update);
  } catch (error) {
    // Log with context for debugging
    logger.error({
      update_id: update.update_id,
      chat_id: update.message?.chat?.id,
      error: error instanceof Error ? error.message : String(error)
    });
    throw error; // BullMQ will retry
  }
}, {
  connection: redis,
  concurrency: 10 // Process 10 updates simultaneously
});
```
Rate Limits: The Real Numbers
Telegram's documented rate limits:
- 30 messages per second globally across all chats
- 1 message per second to a specific chat
- 20 messages per minute to a specific group
These limits get you in two ways: gradual violation and burst violation.
Gradual violation happens when you're sending notifications to many users. If you have 10,000 subscribers and send a broadcast, that's 10,000 sendMessage calls. At 30/sec, that takes 5.5 minutes minimum. If you try to send faster, you'll get 429 errors and need to retry, extending the time further.
Burst violation happens when multiple users interact simultaneously and your bot sends responses. 50 users click a button at the same time - 50 sendMessage calls in under a second - rate limit errors for some of them.
Rate-Limited Sender
```typescript
import PQueue from 'p-queue';

// Global rate limiter: 25 messages/second (buffer below 30)
const globalQueue = new PQueue({ intervalCap: 25, interval: 1000 });

// Per-chat rate limiter: 1 message/second per chat
// Note: this map grows with the number of chats; evict idle queues
// in long-running processes.
const chatQueues = new Map<number, PQueue>();

function getChatQueue(chatId: number): PQueue {
  if (!chatQueues.has(chatId)) {
    chatQueues.set(chatId, new PQueue({ intervalCap: 1, interval: 1000 }));
  }
  return chatQueues.get(chatId)!;
}

// `telegram` is the bot's API client (assumed to be defined elsewhere).
async function sendMessage(chatId: number, text: string, options = {}) {
  const chatQueue = getChatQueue(chatId);

  return chatQueue.add(() =>
    globalQueue.add(() =>
      telegram.sendMessage(chatId, text, options)
    )
  );
}
```
Two-level rate limiting: per-chat queue ensures one message per second per chat, global queue ensures total throughput stays under 30/sec.
For broadcasts, go lower - 20/sec gives headroom for interactive messages from users happening simultaneously.
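The broadcast numbers above are easy to recompute for your own subscriber count. A small sketch (`broadcastSeconds` is an illustrative helper, and it assumes every send succeeds on the first attempt):

```typescript
// Minimum wall-clock seconds to send one message to each recipient
// at a fixed send rate, ignoring retries.
function broadcastSeconds(recipients: number, perSecond: number): number {
  if (perSecond <= 0) throw new RangeError("perSecond must be positive");
  return Math.ceil(recipients / perSecond);
}

// 10,000 subscribers at the 30/sec ceiling: 334 s, about 5.6 minutes.
console.log(broadcastSeconds(10_000, 30)); // 334

// Same broadcast at the more conservative 20/sec: 500 s, about 8.3 minutes.
console.log(broadcastSeconds(10_000, 20)); // 500
```

Dropping from 30/sec to 20/sec costs under three extra minutes on a 10,000-user broadcast, which is usually a cheap price for keeping interactive replies out of 429 territory.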
Handling 429 Errors
Even with rate limiting, you'll get 429 errors occasionally (other bot instances, transient spikes). The right response is to wait and retry, not to skip the message:
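```typescript
// `sleep` is assumed to be a small helper such as
// (ms: number) => new Promise((resolve) => setTimeout(resolve, ms)).
async function sendWithRetry(chatId: number, text: string, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await sendMessage(chatId, text);
    } catch (error: any) {
      // The error shape depends on the client library; this matches
      // clients that expose the raw Telegram response on `error.response`.
      if (error.response?.error_code === 429) {
        const retryAfter = error.response.parameters?.retry_after ?? 5;
        await sleep(retryAfter * 1000);
        continue;
      }
      throw error; // Non-rate-limit error, don't retry here
    }
  }
  throw new Error(`Failed to send after ${maxRetries} attempts`);
}
```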
retry_after in the 429 response tells you exactly how long to wait. Use it.
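Where retry_after lives on the thrown error differs between client libraries. A hedged sketch of a helper that checks the two shapes we are aware of: the Telegram payload nested under `response` (as Telegraf's error does) or placed directly on the error (as grammY's GrammyError does). The helper name and the 5-second fallback are our assumptions.

```typescript
// The fields of a Telegram error response that matter for rate limiting.
interface TelegramErrorPayload {
  error_code?: number;
  parameters?: { retry_after?: number };
}

// Returns the number of seconds to wait for a 429 error,
// or null when the error is not a rate-limit error.
function retryAfterSeconds(error: unknown, fallback = 5): number | null {
  if (typeof error !== "object" || error === null) return null;
  const e = error as TelegramErrorPayload & { response?: TelegramErrorPayload };
  // Use the nested payload when present, otherwise the error itself.
  const payload = e.response ?? e;
  if (payload.error_code !== 429) return null;
  const value = payload.parameters?.retry_after;
  return typeof value === "number" && value > 0 ? value : fallback;
}

console.log(
  retryAfterSeconds({ response: { error_code: 429, parameters: { retry_after: 7 } } })
); // 7
console.log(retryAfterSeconds({ error_code: 403 })); // null
```

A retry loop can then branch on `retryAfterSeconds(error) !== null` without caring which client produced the error.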
Message Ordering: Why It's Harder Than It Looks
Users expect that if they send two messages quickly, the bot processes them in order. With concurrent workers, this isn't guaranteed.
User sends: "set language to English" then immediately "what's my language?"
If worker 2 processes message 2 before worker 1 finishes processing message 1, the response to "what's my language?" returns the old language. The user is confused.
The solution: process messages from the same chat serially, not in parallel.
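```typescript
// When enqueueing, add a serial group per chat_id.
// Note: groups are a BullMQ Pro feature (QueuePro / WorkerPro) and are
// not available in the open-source bullmq package.
await updateQueue.add('process', update, {
  jobId: `update-${update.update_id}`,
  // Messages from the same chat share one group
  group: { id: `chat-${getChatId(update)}` }
});

// Per-group concurrency is configured on the worker side, for example
// new WorkerPro('telegram-updates', processor, { connection: redis, group: { concurrency: 1 } });
// check the BullMQ Pro docs for the exact options in your version.
```
BullMQ groups (a BullMQ Pro feature) ensure that messages from the same chat process sequentially, while messages from different chats still process in parallel. This is the key architectural decision that prevents ordering bugs.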
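To show the idea without any queue library, here is a minimal single-process sketch that chains tasks per chat key on a promise. `KeyedSerialQueue` is a hypothetical class of ours. It only serializes work inside one Node.js process, so it illustrates the technique but does not replace queue-level grouping once you run several worker instances.

```typescript
// Runs tasks with the same key one after another; tasks with
// different keys are independent and can overlap.
class KeyedSerialQueue {
  // Last scheduled task per key, stored in a form that never rejects.
  private tails = new Map<string, Promise<unknown>>();

  run<T>(key: string, task: () => Promise<T>): Promise<T> {
    const prev = this.tails.get(key) ?? Promise.resolve();
    // Start this task only after the previous one for the key settles.
    const next = prev.then(() => task());
    // A failed task must not block later tasks for the same key.
    const tail = next.catch(() => undefined);
    this.tails.set(key, tail);
    // Drop the entry once the key is idle so the map does not grow forever.
    tail.then(() => {
      if (this.tails.get(key) === tail) this.tails.delete(key);
    });
    return next;
  }
}

// Usage: both updates from chat 1 are handled in arrival order.
const serial = new KeyedSerialQueue();
serial.run("chat-1", async () => console.log("set language to English"));
serial.run("chat-1", async () => console.log("what's my language?"));
```

The second call for "chat-1" does not start until the first has finished, even if the first one awaits a slow database write.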
State Management: Redis vs PostgreSQL
Every non-trivial bot needs to remember things: what step of a conversation flow a user is on, their preferences, pending confirmations.
Two approaches, different tradeoffs.
Redis for conversation state:
```typescript
// `redis` is an ioredis client (assumed to be defined elsewhere).
interface ConversationState {
  step: 'awaiting_name' | 'awaiting_email' | 'confirm' | 'done';
  data: Record<string, unknown>; // value type assumed; narrow it for your flow
}

async function getState(chatId: number): Promise<ConversationState | null> {
  const raw = await redis.get(`state:${chatId}`);
  return raw ? JSON.parse(raw) : null;
}

async function setState(chatId: number, state: ConversationState) {
  // TTL of 1 hour - if user abandons the flow, state expires
  await redis.set(`state:${chatId}`, JSON.stringify(state), 'EX', 3600);
}

async function clearState(chatId: number) {
  await redis.del(`state:${chatId}`);
}
```
Redis state is fast (sub-millisecond reads) and automatically expires. The downside: if Redis restarts without persistence configured, all in-progress conversations are lost. For most bots, this is acceptable - users restart the flow.
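Whatever store holds the state, it helps to keep the transition logic as a pure function so it can be tested without Redis. A sketch using the same step names as above; the `advance` function, the naive "@" email check, and the "yes" confirmation keyword are illustrative assumptions.

```typescript
type Step = "awaiting_name" | "awaiting_email" | "confirm" | "done";

interface FlowState {
  step: Step;
  data: Record<string, string>;
}

// Pure transition: given the current state and the user's text,
// return the next state. Invalid input leaves the state unchanged.
function advance(state: FlowState, text: string): FlowState {
  const input = text.trim();
  switch (state.step) {
    case "awaiting_name":
      return input
        ? { step: "awaiting_email", data: { ...state.data, name: input } }
        : state;
    case "awaiting_email":
      // Deliberately naive check, for illustration only.
      return input.includes("@")
        ? { step: "confirm", data: { ...state.data, email: input } }
        : state;
    case "confirm":
      // Anything other than "yes" restarts the flow.
      return input.toLowerCase() === "yes"
        ? { ...state, step: "done" }
        : { step: "awaiting_name", data: {} };
    case "done":
      return state;
  }
}

let flow: FlowState = { step: "awaiting_name", data: {} };
flow = advance(flow, "Ada");
flow = advance(flow, "ada@example.com");
flow = advance(flow, "yes");
console.log(flow.step); // done
```

The worker then becomes: load state, call the transition function, save the result. That keeps the Redis calls at the edges and the flow logic easy to unit test.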
PostgreSQL for durable state:
For bots where losing in-progress state is unacceptable (booking flows, payments, multi-day processes), store state in PostgreSQL:
```sql
CREATE TABLE conversation_states (
  chat_id bigint PRIMARY KEY,
  step text NOT NULL,
  data jsonb DEFAULT '{}',
  updated_at timestamptz DEFAULT now(),
  expires_at timestamptz
);

CREATE INDEX idx_states_expires ON conversation_states (expires_at)
  WHERE expires_at IS NOT NULL;
```
A scheduled job cleans expired states: DELETE FROM conversation_states WHERE expires_at < now().
Our default: Redis for temporary conversation state, PostgreSQL for business data that needs to survive restarts. Don't use PostgreSQL for every state transition - the write overhead adds up at scale.
Horizontal Scaling
A single Node.js worker process handles roughly 500-1,000 messages per minute before CPU becomes a bottleneck (depending on what processing happens per message). To go higher, run multiple worker instances.
Two things to ensure before scaling horizontally:
Idempotent message processing. If two workers race and both try to process the same update (shouldn't happen with BullMQ job IDs, but defense in depth): the second attempt should be safe. Use database transactions with INSERT ... ON CONFLICT DO NOTHING for any state changes.
Shared state lives in Redis/PostgreSQL, not in memory. Any state stored in the Node.js process memory is invisible to other instances. User state, rate limit counters, active session data - all must be in the shared store.
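The idempotency requirement can be illustrated with an in-memory stand-in for a table that has a unique constraint on update_id. `ProcessedUpdates` and `handleOnce` are hypothetical names. In a real deployment the insert would be an INSERT ... ON CONFLICT DO NOTHING in the same database transaction as the state change, so the marker and the effect commit together.

```typescript
// In-memory stand-in for a table with a UNIQUE constraint on update_id.
class ProcessedUpdates {
  private seen = new Set<number>();

  // Mirrors INSERT ... ON CONFLICT DO NOTHING:
  // returns true only when the row was actually inserted.
  insertIfAbsent(updateId: number): boolean {
    if (this.seen.has(updateId)) return false;
    this.seen.add(updateId);
    return true;
  }
}

// Applies the side effect only for the first attempt at a given update.
function handleOnce(
  store: ProcessedUpdates,
  updateId: number,
  effect: () => void
): boolean {
  if (!store.insertIfAbsent(updateId)) return false; // duplicate, skip
  effect();
  return true;
}

const processed = new ProcessedUpdates();
let credits = 0;
handleOnce(processed, 1001, () => { credits += 10; });
handleOnce(processed, 1001, () => { credits += 10; }); // second attempt is a no-op
console.log(credits); // 10
```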
Kubernetes is one deployment option; for most bots it's overkill until you're running 10+ worker instances. ECS Fargate with auto-scaling based on queue depth is simpler and works well:
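```yaml
# CloudWatch alarm: scale up when queue depth > 500
# (BullMQ does not publish this metric itself; it is assumed to be
# pushed to CloudWatch as a custom metric under this name.)
- MetricName: ApproximateNumberOfMessages
  Threshold: 500
  ComparisonOperator: GreaterThanThreshold
  Statistic: Maximum
```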
When queue depth exceeds 500 pending messages, add a worker. When it drops below 50, scale down. This handles burst traffic automatically without constant over-provisioning.
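The scale-up and scale-down thresholds form a simple hysteresis rule. A sketch of that decision as a pure function; the `ScalingPolicy` shape and the minimum and maximum worker counts are assumptions for illustration, and in practice the auto-scaling service evaluates this rule for you.

```typescript
interface ScalingPolicy {
  scaleUpAbove: number;   // add a worker when depth exceeds this
  scaleDownBelow: number; // remove a worker when depth drops below this
  minWorkers: number;
  maxWorkers: number;
}

// One scaling step: move by at most one worker and stay within bounds.
function desiredWorkers(
  current: number,
  queueDepth: number,
  policy: ScalingPolicy
): number {
  let target = current;
  if (queueDepth > policy.scaleUpAbove) target = current + 1;
  else if (queueDepth < policy.scaleDownBelow) target = current - 1;
  return Math.min(policy.maxWorkers, Math.max(policy.minWorkers, target));
}

// Thresholds from the text; worker bounds are assumed.
const policy: ScalingPolicy = {
  scaleUpAbove: 500,
  scaleDownBelow: 50,
  minWorkers: 2,
  maxWorkers: 8,
};

console.log(desiredWorkers(2, 600, policy)); // 3 (burst: add a worker)
console.log(desiredWorkers(3, 200, policy)); // 3 (between thresholds: hold)
console.log(desiredWorkers(3, 10, policy));  // 2 (quiet: remove a worker)
```

The gap between 50 and 500 is what prevents flapping: a queue hovering around a single threshold would otherwise add and remove workers every evaluation period.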
Monitoring: What to Watch
Three metrics that matter most for production bots.
Queue depth. Messages waiting to be processed. If this grows over time, workers can't keep up - add workers or optimize processing. Alert when depth > 1,000 and keeps growing for 5 minutes.
Processing latency p95. Time from message received to response sent. Users notice latency above 3-4 seconds. Alert when p95 > 5 seconds. Sudden spikes usually mean a slow external API call or database query.
Error rate by error type. Separate tracking for: Telegram API errors (429, 5xx), business logic errors (caught exceptions), unhandled errors (worker crashes). 429 errors tell you about rate limit violations; business logic errors tell you about input validation gaps; unhandled errors tell you about bugs.
```typescript
// Structured logging for each processed update
// (`error` is the caught exception, or undefined on success)
logger.info({
  update_id: update.update_id,
  chat_id: getChatId(update),
  update_type: getUpdateType(update),
  processing_ms: Date.now() - startTime,
  outcome: error ? 'error' : 'success',
  error_type: error?.constructor?.name
});
```
With structured logs, you can build dashboards in CloudWatch or Grafana without adding another metrics service.
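If you want to sanity-check a dashboard, or compute the alerts yourself from logged values, both rules fit in a few lines. A sketch assuming a nearest-rank percentile and one queue-depth sample per minute; the function names are ours.

```typescript
// Nearest-rank percentile over a list of samples (p in 0..100).
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new RangeError("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// True when the latest depth is above the threshold and every sample
// in the window is larger than the one before it.
function queueKeepsGrowing(depths: number[], threshold: number): boolean {
  if (depths.length < 2) return false;
  const rising = depths.every((d, i) => i === 0 || d > depths[i - 1]);
  return rising && depths[depths.length - 1] > threshold;
}

// processing_ms values from the structured logs
const latencies = [120, 180, 150, 210, 5200];
console.log(percentile(latencies, 95) > 5000); // true: latency alert

// One depth sample per minute over five minutes
console.log(queueKeepsGrowing([800, 1000, 1100, 1300, 1500], 1000)); // true
console.log(queueKeepsGrowing([1500, 1400, 1200], 1000)); // false: draining
```

The second check is why a deep but shrinking queue should not page anyone: after a recovery burst the depth is high, but the workers are catching up.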
Distributed Locking for Critical Operations
Some bot operations must not run concurrently even across multiple worker instances. A user hits "Pay" twice quickly - you must charge them once, not twice. A user triggers a long-running process - only one instance should run it.
Redis-based locking handles this:
```typescript
import Redlock from 'redlock';

// `redis`, `db`, and `createPayment` are assumed to be defined elsewhere.
const redlock = new Redlock([redis], {
  retryCount: 3,
  retryDelay: 200
});

async function processPayment(userId: string, orderId: string) {
  const lockKey = `payment:${userId}:${orderId}`;

  let lock;
  try {
    // Acquire the lock with a 10-second TTL
    // (throws if it cannot be acquired after the retries)
    lock = await redlock.acquire([lockKey], 10000);

    // Check if payment already processed (idempotency)
    const existing = await db.query(
      'SELECT id FROM payments WHERE order_id = $1',
      [orderId]
    );
    if (existing.rows.length > 0) return existing.rows[0];

    // Process payment
    return await createPayment(userId, orderId);

  } finally {
    if (lock) await lock.release();
  }
}
```
Redlock works across multiple Redis instances for high availability. For most bots, a single Redis with the above pattern is sufficient - the real protection is the idempotency check inside the lock, not the lock alone.
Webhook vs Polling: When to Use Each
Telegram supports two update delivery modes: webhooks (Telegram calls your server) and long polling (your server calls Telegram).
Use webhooks for production. Telegram pushes each update as soon as it arrives, no process has to hold a getUpdates connection open, incoming calls spread across instances behind a load balancer, and Telegram's retry logic handles temporary downtime.
Use long polling for development. No need for a public HTTPS endpoint, works behind NAT, easier to debug with local tools. Switch to webhooks before going to production.
The transition is straightforward:
```typescript
// Development: polling
const bot = new Bot(TOKEN);
bot.start(); // Uses getUpdates long polling

// Production: webhook
await bot.api.setWebhook('https://yourdomain.com/webhook/SECRET', {
  max_connections: 100, // Telegram's concurrent webhook calls
  allowed_updates: ['message', 'callback_query', 'inline_query']
});
```
allowed_updates is important for performance: if your bot doesn't use inline queries, don't subscribe to them. Fewer update types means fewer webhook calls and less processing overhead.
Common Mistakes We've Seen (and Made)
Processing updates synchronously in the webhook handler. The most common mistake on first implementation. Works fine at 10 messages per minute, fails at 200 messages per minute when one slow database query blocks everything else.
Ignoring retry_after on 429 errors. Immediately retrying after a 429 makes things worse. The retry_after value (in seconds, returned in the parameters field of the error response) is the minimum wait time before Telegram will accept the next request. Respecting it is not optional - ignoring it extends the rate limit period.
Storing per-user state in process memory. Works great for one instance, breaks immediately when you add a second. All shared state must live outside the process.
Not testing with real Telegram behavior. Telegram's test environment doesn't simulate rate limits or retry behavior accurately. Load testing with a real bot token against a test group gives a much more accurate picture of what happens under pressure. We run load tests by replaying 30 minutes of captured production traffic at 3x speed - this surfaces latency issues and rate limit violations before they happen to real users.
Using a single Redis instance for both queue and state without persistence. If Redis restarts, your message queue evaporates. Configure AOF persistence (appendonly yes) for the Redis instance that stores your BullMQ queues. For the state cache (conversation state with short TTL), RDB is sufficient. Don't conflate queue persistence with state persistence - different durability requirements.
Forgetting about bot blocking. Users block bots. When you try to send a message to a user who blocked the bot, you get a 403 error (bot was blocked by the user). You need to handle this: remove the user from broadcast lists, mark them inactive in your database. Otherwise you're repeatedly trying to message users who don't want your messages.
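Detecting the blocked case can be a small predicate that a broadcast loop consults before retrying. A sketch assuming the error exposes Telegram's error_code and description either directly or under `response`, depending on the client library; `isBlockedByUser` and `pruneBlocked` are hypothetical helper names.

```typescript
interface TelegramApiError {
  error_code?: number;
  description?: string;
}

// True when Telegram reports that the user has blocked the bot.
function isBlockedByUser(error: unknown): boolean {
  if (typeof error !== "object" || error === null) return false;
  const e = error as TelegramApiError & { response?: TelegramApiError };
  const payload = e.response ?? e;
  return (
    payload.error_code === 403 &&
    (payload.description ?? "").includes("bot was blocked by the user")
  );
}

// Removes chats whose last send failed because the user blocked the bot.
function pruneBlocked(
  subscribers: number[],
  failures: Map<number, unknown>
): number[] {
  return subscribers.filter((id) => !isBlockedByUser(failures.get(id)));
}

const failures = new Map<number, unknown>([
  [2, { error_code: 403, description: "Forbidden: bot was blocked by the user" }],
  [3, { error_code: 429, parameters: { retry_after: 4 } }],
]);
console.log(pruneBlocked([1, 2, 3], failures)); // [ 1, 3 ]
```

Chat 3 stays on the list because a 429 is temporary; only the 403 block is treated as a signal to stop messaging that user.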
Key Numbers to Design Around
When planning capacity, these are the numbers we work from:
- Telegram rate limit: 30 global messages/sec, 1/sec per chat
- Average message processing time (simple response): 50-150ms
- Average message processing time (with DB query): 100-300ms
- Single Node.js worker: 500-800 messages/minute comfortable ceiling
- Redis queue overhead per message: ~2ms
- Cost of running 2 ECS workers (t3.small) 24/7: approximately $30/month
The architecture described here has run without significant issues at 20,000 messages per hour in production. At that scale, we run 4 worker instances, each with concurrency 10, and queue depth stays under 200 messages during normal operation.
If you're building a Telegram bot and want to get the architecture right before you have a scaling problem - describe what you're building.
Or email us: mail@leval.pro