
## Implement Immediate Email Retries with Enhanced Error Handling #17

**Open** — az4mm wants to merge 1 commit into `main` from `azam/mail`


## Conversation

@az4mm az4mm commented Jan 29, 2026

### Description

This PR refactors the email queue processing service to implement immediate consecutive retries when email sending fails, replacing the previous approach of exponential backoff with scheduled future retries.

### Changes

backend/src/services/email-queue.service.ts

  • New sendEmailWithRetry() function: Encapsulates the retry logic for sending a single email with up to 3 consecutive attempts (configurable via MAX_RETRY_ATTEMPTS)
  • Removed exponential backoff: Emails are now retried immediately without delays between attempts, simplifying the retry flow
  • Simplified queue processing:
    • Removed attempts < max_attempts filter from the initial query since retries happen inline
    • Removed attempts = attempts + 1 increment during queue pickup (now tracked via the retry function)
    • Eliminated scheduled retry logic with backoff minutes calculation
  • Improved status handling: Emails now go directly to sent or failed status after processing, with no intermediate pending reschedules
  • Better logging: Added attempt number tracking in success/warning/error log messages for easier debugging

### Motivation

The previous implementation used exponential backoff, which:

  1. Required multiple cron runs to fully process failed emails
  2. Added complexity with scheduled future retries
  3. Made it harder to track email delivery issues in real-time

The new immediate retry approach:

  • Resolves transient failures quickly within a single processing cycle
  • Provides clearer success/failure outcomes
  • Reduces the time between initial send attempt and final status

### Technical Details

| Aspect | Before | After |
| --- | --- | --- |
| Retry Strategy | Exponential backoff (5, 10, 20, 30 min) | Immediate consecutive retries |
| Max Attempts | Configurable per email | 3 attempts (constant) |
| Failed Email Handling | Reschedule to pending | Mark as failed immediately |
| Attempt Tracking | Incremented on queue pickup | Tracked in retry function |

### Testing

  • Verified successful email sends on first attempt
  • Verified retry behavior when transient failures occur
  • Verified emails marked as failed after 3 unsuccessful attempts
  • Verified timeout handling still resets emails to pending
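The retry scenarios above can be exercised without a live mailer by injecting a stubbed sender. The following is only a sketch: the `sendWithRetry` helper and the `flakySender` stub are illustrations written for this note, not code from the PR.

```typescript
// Minimal re-implementation of the retry loop with an injectable sender,
// so the transient-failure case can be tested without SMTP.
type SendResult = { success: boolean; error?: string };

const MAX_RETRY_ATTEMPTS = 3;

async function sendWithRetry(
  send: () => Promise<SendResult>
): Promise<{ success: boolean; attempts: number; error?: string }> {
  let lastError = '';
  for (let attempt = 1; attempt <= MAX_RETRY_ATTEMPTS; attempt++) {
    try {
      const result = await send();
      if (result.success) return { success: true, attempts: attempt };
      lastError = result.error || 'Unknown send error';
    } catch (err: any) {
      lastError = err.message || 'Unknown error';
    }
  }
  // All attempts exhausted
  return { success: false, attempts: MAX_RETRY_ATTEMPTS, error: lastError };
}

// Transient failure: fails twice, then succeeds on the third attempt.
let calls = 0;
const flakySender = async (): Promise<SendResult> =>
  ++calls < 3 ? { success: false, error: 'ECONNRESET' } : { success: true };

sendWithRetry(flakySender).then((r) => {
  console.log(r.success, r.attempts); // true 3
});
```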

Copilot AI review requested due to automatic review settings January 29, 2026 20:29
@vercel

vercel Bot commented Jan 29, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| cab-management-system | Ready | Preview, Comment | Jan 29, 2026 8:29pm |
| cab-management-system-b59a | Ready | Preview, Comment | Jan 29, 2026 8:29pm |


Copilot AI left a comment


Pull request overview

This PR refactors the email queue processing service to implement immediate consecutive retries (up to 3 attempts) when email sending fails, replacing the previous exponential backoff strategy that scheduled retries for future cron runs.

Changes:

  • Introduced sendEmailWithRetry() function that performs up to 3 immediate retry attempts without delays between them
  • Removed exponential backoff logic and scheduled retry mechanism from queue processing
  • Simplified the queue selection query by removing the attempts < max_attempts filter since retries now happen inline
Comments suppressed due to low confidence (2)

backend/src/services/email-queue.service.ts:151

  • With immediate retries and parallel processing via Promise.all(), the total time to process a batch can increase significantly. If all 25 emails (default batch size) fail and each requires 3 retry attempts, and each attempt takes (for example) 2 seconds, the total time would be 25 * 3 * 2 = 150 seconds, far exceeding the default 25-second timeout.

This defeats the purpose of the timeout protection and will likely result in serverless function timeouts. The previous exponential backoff approach handled this better by deferring retries to future cron runs. Consider:

  1. Implementing a timeout budget per email (e.g., MAX_EXECUTION_TIME / batch_size)
  2. Reducing MAX_RETRY_ATTEMPTS when operating in parallel mode
  3. Switching to sequential processing when retries are needed
  4. Passing the time budget to sendEmailWithRetry() to limit retry attempts based on remaining time
```typescript
const sendPromises = result.rows.map(async (email) => {
  // Check if we're running out of time
  if (Date.now() - startTime > MAX_EXECUTION_TIME) {
    console.log(`⏰ Timeout approaching, skipping email ${email.id}`);
    await pool.query(`
      UPDATE email_queue
      SET status = 'pending'
      WHERE id = $1
    `, [email.id]);
    return { success: false, skipped: true };
  }

  // Send with immediate retries
  const sendResult = await sendEmailWithRetry(email);

  if (sendResult.success) {
    // Mark as sent
    await pool.query(`
      UPDATE email_queue
      SET status = 'sent',
          sent_at = NOW(),
          attempts = $1
      WHERE id = $2
    `, [sendResult.attempts, email.id]);
    return { success: true };

  } else {
    // Mark as failed after all retries exhausted
    await pool.query(`
      UPDATE email_queue
      SET status = 'failed',
          last_error = $1,
          attempts = $2
      WHERE id = $3
    `, [sendResult.error, sendResult.attempts, email.id]);
    return { success: false };
  }
});

// Wait for all emails to complete
const results = await Promise.all(sendPromises);
```
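Suggestion (4) could look roughly like the sketch below. `sendEmailWithBudget`, its `deadline` parameter, and the injected `send` callback are assumptions for illustration, not code from this PR:

```typescript
// Sketch: pass an absolute deadline into the retry helper so it stops
// retrying between attempts once the time budget is exhausted, instead of
// risking a hard serverless termination mid-send.
type SendResult = { success: boolean; error?: string };

async function sendEmailWithBudget(
  send: () => Promise<SendResult>,
  deadline: number,   // epoch ms, e.g. startTime + MAX_EXECUTION_TIME
  maxAttempts = 3
): Promise<{ success: boolean; attempts: number; timedOut: boolean; error?: string }> {
  let lastError = '';
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (Date.now() >= deadline) {
      // Budget exhausted: report how many attempts actually ran.
      return { success: false, attempts: attempt - 1, timedOut: true, error: lastError };
    }
    try {
      const result = await send();
      if (result.success) return { success: true, attempts: attempt, timedOut: false };
      lastError = result.error || 'Unknown send error';
    } catch (err: any) {
      lastError = err.message || 'Unknown error';
    }
  }
  return { success: false, attempts: maxAttempts, timedOut: false, error: lastError };
}
```

The caller could then reset `timedOut` emails to `pending` (as the existing timeout branch does) while recording the partial attempt count.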

backend/src/services/email-queue.service.ts:108

  • The query no longer filters by attempts < max_attempts, but the database has an index defined specifically for this condition (idx_email_queue_pending in migration 017_create_email_queue_table.sql:43). This index will no longer be utilized efficiently, which could degrade query performance as the email_queue table grows. The index definition is:
```sql
CREATE INDEX IF NOT EXISTS idx_email_queue_pending
  ON email_queue(priority, scheduled_for)
  WHERE status = 'pending' AND attempts < max_attempts;
```

Consider either:

  1. Creating a new migration to drop the old index and create a new one without the attempts < max_attempts condition
  2. Adding the filter back to the query to maintain index compatibility

Without this, PostgreSQL will perform a full sequential scan on the table instead of using the index.

```typescript
const result = await pool.query(`
  UPDATE email_queue
  SET status = 'sending'
  WHERE id IN (
    SELECT id FROM email_queue
    WHERE status = 'pending'
      AND scheduled_for <= NOW()
    ORDER BY priority ASC, scheduled_for ASC
    LIMIT $1
    FOR UPDATE SKIP LOCKED
  )
  RETURNING *
`, [limit]);
```
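Option 1 above would be a follow-up migration along these lines. This is only a sketch — the migration itself does not exist in the PR, and only the index name and table definition are taken from the review comment:

```sql
-- Hypothetical migration: align the partial index predicate with the new
-- query, which no longer filters on attempts < max_attempts.
DROP INDEX IF EXISTS idx_email_queue_pending;

CREATE INDEX IF NOT EXISTS idx_email_queue_pending
  ON email_queue(priority, scheduled_for)
  WHERE status = 'pending';
```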


Comment on lines +28 to +35
```typescript
async function sendEmailWithRetry(email: any): Promise<{
  success: boolean;
  attempts: number;
  error?: string;
}> {
  let lastError = '';

  for (let attempt = 1; attempt <= MAX_RETRY_ATTEMPTS; attempt++) {
```

Copilot AI Jan 29, 2026


The max_attempts field from the database is no longer being respected. The database schema allows per-email configuration of maximum retry attempts via the max_attempts column (default: 3), but the code now uses a hardcoded MAX_RETRY_ATTEMPTS = 3 constant and ignores the database value.

This breaks the existing API contract where emails can be queued with custom retry limits. If any part of the system relies on setting different max_attempts values for different email types (e.g., 5 attempts for critical transactional emails vs 1 attempt for marketing emails), this functionality is now lost.

Consider either:

  1. Passing email.max_attempts to sendEmailWithRetry() and using that value instead of the constant
  2. Documenting that this is an intentional breaking change and updating any code that sets custom max_attempts values

Comment on lines 112 to 120
```diff
 // Check if we're running out of time
 if (Date.now() - startTime > MAX_EXECUTION_TIME) {
   console.log(`⏰ Timeout approaching, skipping email ${email.id}`);
   // Reset to pending for next run
   await pool.query(`
     UPDATE email_queue
-    SET status = 'pending', attempts = attempts - 1
+    SET status = 'pending'
     WHERE id = $1
   `, [email.id]);
   return { success: false, skipped: true };
```

Copilot AI Jan 29, 2026


When an email is skipped due to timeout, the code resets its status to 'pending' but doesn't update the attempts field. In the new implementation, the attempts field serves as a record of how many times the email has been attempted. When a timeout occurs:

  1. The email is marked as 'sending' and picked up from the queue (line 98)
  2. Before any attempts are made by sendEmailWithRetry(), a timeout check occurs (line 113)
  3. The email is reset to 'pending' without any attempt tracking (line 115-119)

On the next cron run, the same email will be picked up again and go through the same flow, potentially creating an infinite loop for emails that consistently trigger timeouts. Consider tracking partial attempts or implementing a separate mechanism to handle timeout scenarios to prevent emails from being stuck in an infinite retry loop.
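One way to break that loop is to cap consecutive timeout skips. The sketch below assumes a hypothetical `skip_count` column and `MAX_TIMEOUT_SKIPS` constant — neither exists in the PR:

```typescript
// Sketch: after MAX_TIMEOUT_SKIPS consecutive timeout skips, give up on the
// email instead of resetting it to 'pending' forever.
const MAX_TIMEOUT_SKIPS = 5; // assumed cap, not part of the PR

function statusAfterTimeoutSkip(skipCount: number): 'pending' | 'failed' {
  // skipCount is the value *after* incrementing for the current skip.
  return skipCount >= MAX_TIMEOUT_SKIPS ? 'failed' : 'pending';
}

// Usage in the timeout branch (query shape follows the PR; skip_count is hypothetical):
// await pool.query(`
//   UPDATE email_queue
//   SET status = $1, skip_count = skip_count + 1
//   WHERE id = $2
// `, [statusAfterTimeoutSkip(email.skip_count + 1), email.id]);
```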

Comment on lines +35 to +54
```typescript
for (let attempt = 1; attempt <= MAX_RETRY_ATTEMPTS; attempt++) {
  try {
    const result = await sendEmail({
      to: email.to_email,
      subject: email.subject,
      html: email.body_html,
    });

    if (result.success) {
      console.log(`✅ Email sent: ${email.id} to ${email.to_email} (attempt ${attempt})`);
      return { success: true, attempts: attempt };
    }

    lastError = result.error || 'Unknown send error';
    console.warn(`⚠️ Email attempt ${attempt} failed: ${email.id} - ${lastError}`);

  } catch (error: any) {
    lastError = error.message || 'Unknown error';
    console.warn(`⚠️ Email attempt ${attempt} failed: ${email.id} - ${lastError}`);
  }
```

Copilot AI Jan 29, 2026


Immediate consecutive retries without any delay between attempts may cause issues with rate-limited email providers or transient network failures. Most email service providers (SMTP servers, SES, SendGrid, etc.) implement rate limiting, and sending 3 requests in rapid succession could:

  1. Trigger rate limit violations, causing all 3 attempts to fail when a small delay might have succeeded
  2. Exhaust API quotas unnecessarily
  3. Not give sufficient time for transient network issues to resolve

Consider adding a small delay (e.g., 1-2 seconds) between retry attempts to improve the success rate of retries. This is especially important for serverless environments where cold starts and network initialization can cause initial failures that resolve quickly.
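A delayed variant of the retry loop could look like this sketch. `sendWithDelayedRetry`, `RETRY_DELAY_MS`, and the injected `send` callback are assumptions for illustration, not the PR's code:

```typescript
// Sketch: same retry loop, but with a short fixed pause between attempts to
// give rate limiters and transient network issues time to recover.
const RETRY_DELAY_MS = 1000; // assumed default

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function sendWithDelayedRetry(
  send: () => Promise<{ success: boolean; error?: string }>,
  maxAttempts = 3,
  delayMs = RETRY_DELAY_MS
): Promise<{ success: boolean; attempts: number; error?: string }> {
  let lastError = '';
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const result = await send();
      if (result.success) return { success: true, attempts: attempt };
      lastError = result.error || 'Unknown send error';
    } catch (err: any) {
      lastError = err.message || 'Unknown error';
    }
    // Pause before the next attempt, but not after the last one.
    if (attempt < maxAttempts) await sleep(delayMs);
  }
  return { success: false, attempts: maxAttempts, error: lastError };
}
```

With a 1-second delay and 3 attempts this adds at most 2 seconds per email, which would still need to fit inside the batch's overall time budget.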

Comment on lines 112 to +124
```diff
 // Check if we're running out of time
 if (Date.now() - startTime > MAX_EXECUTION_TIME) {
   console.log(`⏰ Timeout approaching, skipping email ${email.id}`);
   // Reset to pending for next run
   await pool.query(`
     UPDATE email_queue
-    SET status = 'pending', attempts = attempts - 1
+    SET status = 'pending'
     WHERE id = $1
   `, [email.id]);
   return { success: false, skipped: true };
 }

-try {
-  const sendResult = await sendEmail({
-    to: email.to_email,
-    subject: email.subject,
-    html: email.body_html,
-  });
-
-  if (sendResult.success) {
-    await pool.query(`
-      UPDATE email_queue
-      SET status = 'sent', sent_at = NOW()
-      WHERE id = $1
-    `, [email.id]);
-    console.log(`✅ Email sent: ${email.id} to ${email.to_email}`);
-    return { success: true };
-  } else {
-    throw new Error(sendResult.error || 'Unknown send error');
-  }
-
-} catch (error: any) {
-  console.error(`❌ Email failed: ${email.id}`, error.message);
-
-  const newStatus = email.attempts >= email.max_attempts ? 'failed' : 'pending';
-  const backoffMinutes = Math.min(5 * Math.pow(2, email.attempts - 1), 30);
+// Send with immediate retries
+const sendResult = await sendEmailWithRetry(email);
```

Copilot AI Jan 29, 2026


The timeout check occurs before calling sendEmailWithRetry(), but the actual email sending with retries can take significant time. If sendEmailWithRetry() makes 3 attempts and each attempt times out after several seconds, the total time could exceed MAX_EXECUTION_TIME, causing the serverless function to be terminated mid-execution.

This could leave emails in a 'sending' status with partial attempts recorded in the database. Consider:

  1. Moving the timeout check inside the retry loop to cancel between attempts
  2. Passing remaining time budget to sendEmailWithRetry() so it can decide whether to attempt another retry
  3. Implementing proper cleanup for emails that get terminated mid-execution

```typescript
 * @param email - Email record from database
 * @returns Result object with success status and final attempt count
 */
async function sendEmailWithRetry(email: any): Promise<{
```

Copilot AI Jan 29, 2026


The email parameter is typed as any, which loses type safety. Based on the database schema (backend/migrations/017_create_email_queue_table.sql), the email record should have specific fields like id, to_email, subject, body_html, attempts, max_attempts, etc.

Consider defining an interface for the email queue record to improve type safety and code maintainability:

```typescript
interface EmailQueueRecord {
  id: string;
  to_email: string;
  subject: string;
  body_html: string;
  attempts: number;
  max_attempts: number;
  // ... other fields
}
```

Then update the function signature to use this type instead of any.

Comment on lines +33 to +59
```typescript
let lastError = '';

for (let attempt = 1; attempt <= MAX_RETRY_ATTEMPTS; attempt++) {
  try {
    const result = await sendEmail({
      to: email.to_email,
      subject: email.subject,
      html: email.body_html,
    });

    if (result.success) {
      console.log(`✅ Email sent: ${email.id} to ${email.to_email} (attempt ${attempt})`);
      return { success: true, attempts: attempt };
    }

    lastError = result.error || 'Unknown send error';
    console.warn(`⚠️ Email attempt ${attempt} failed: ${email.id} - ${lastError}`);

  } catch (error: any) {
    lastError = error.message || 'Unknown error';
    console.warn(`⚠️ Email attempt ${attempt} failed: ${email.id} - ${lastError}`);
  }
}

// All attempts exhausted
console.error(`❌ Email failed after ${MAX_RETRY_ATTEMPTS} attempts: ${email.id}`);
return { success: false, attempts: MAX_RETRY_ATTEMPTS, error: lastError };
```

Copilot AI Jan 29, 2026


The lastError variable is initialized as an empty string, which means if all retry attempts fail but no error message is captured, sendResult.error could be an empty string. When this is stored in the database at line 142, it will save an empty string to the last_error field, which is less helpful for debugging than storing null or a meaningful default message.

Consider initializing lastError to a more descriptive default like 'All retry attempts failed with unknown error' or checking if lastError is empty before storing it in the database.

Comment on lines +123 to +145
```diff
 // Send with immediate retries
 const sendResult = await sendEmailWithRetry(email);

 if (sendResult.success) {
   // Mark as sent
   await pool.query(`
     UPDATE email_queue
     SET status = 'sent',
         sent_at = NOW(),
         attempts = $1
     WHERE id = $2
   `, [sendResult.attempts, email.id]);
   return { success: true };

 } else {
   // Mark as failed after all retries exhausted
   await pool.query(`
     UPDATE email_queue
-    SET status = $1,
-        last_error = $2,
-        scheduled_for = NOW() + INTERVAL '${backoffMinutes} minutes'
+    SET status = 'failed',
+        last_error = $1,
+        attempts = $2
     WHERE id = $3
-  `, [newStatus, error.message, email.id]);
-
+  `, [sendResult.error, sendResult.attempts, email.id]);
```

Copilot AI Jan 29, 2026


The retryFailedEmails() function at line 220-241 resets attempts = 0 when retrying failed emails, but this doesn't align with the new retry logic. When these emails are picked up by processEmailQueue(), they will go through sendEmailWithRetry() which will attempt 3 more times regardless of previous attempts.

This means an email could potentially be retried 6+ times total (3 attempts in first run, reset to pending via retryFailedEmails(), then 3 more attempts), which contradicts the documented behavior of having a maximum of 3 attempts.

The retryFailedEmails() function should either:

  1. Be updated to work with the new immediate retry approach
  2. Be removed if manual retries are no longer supported in the new design
  3. Keep the attempts count to track total lifetime attempts across manual retries
