Adobe Commerce Incident Response Playbook: What Your Maintenance Partner Should Have

The quality of an Adobe Commerce maintenance partner is revealed during incidents, not during routine work. A store down on Black Friday morning, a checkout broken after a deployment, a stock sync failure that takes two hours to detect: these are the moments when the agency either earns its retainer or proves it shouldn’t have been hired. The difference between partners who handle incidents well and partners who handle them badly almost always comes down to whether they have a real playbook ready or are improvising under stress.

This piece walks through what an Adobe Commerce incident response playbook should contain: the documentation, the procedures, the escalation paths, and the post-incident discipline that distinguish serious partners from inadequate ones. It is written for engineering leaders and operations directors who want to evaluate their current maintenance partner’s incident posture, or to set expectations for a new one. The patterns below come from Bemeir’s incident response practice and from postmortems on incidents handled by partners we’ve replaced.

What an Incident Actually Is

The first discipline is defining incidents. An incident is any production event that materially affects the merchant’s ability to operate the store or that materially degrades customer experience. The definition has to be specific:

  • Severity 1 (Sev1): Production fully down. Checkout completely broken. Customer data exposure. Mass-scale fraud detected. Response time: under 15 minutes.
  • Severity 2 (Sev2): Significant degradation. Checkout partially broken or slow. Product detail page (PDP) errors on a subset of the catalog. Silent integration failure with downstream impact. Response time: under 30 minutes.
  • Severity 3 (Sev3): Moderate degradation. Admin functionality broken. Specific feature degraded. Performance regression detectable but not breaking. Response time: under 2 hours.
  • Severity 4 (Sev4): Minor degradation. Cosmetic issues. Edge-case bugs. Response time: next business day.

Each level has a different response time, a different escalation pattern, and a different communication cadence. The playbook should specify all three.
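
One way to keep these definitions unambiguous is to store them as data rather than prose. The sketch below is a minimal illustration, not a prescribed implementation: the `SeverityPolicy` class and `response_breached` function are hypothetical names, and only the response times come from the list above.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    name: str
    summary: str
    # None means "next business day" rather than a fixed clock interval.
    response_within: Optional[timedelta]


# Response times mirror the severity list; the summaries are abbreviated.
SEVERITY_POLICIES = {
    1: SeverityPolicy("Sev1", "Production down or checkout completely broken", timedelta(minutes=15)),
    2: SeverityPolicy("Sev2", "Significant degradation", timedelta(minutes=30)),
    3: SeverityPolicy("Sev3", "Moderate degradation", timedelta(hours=2)),
    4: SeverityPolicy("Sev4", "Minor degradation", None),
}


def response_breached(severity: int, elapsed: timedelta) -> bool:
    """Return True if the first response took longer than the policy allows."""
    limit = SEVERITY_POLICIES[severity].response_within
    if limit is None:
        # Sev4 is tracked against the business calendar, not a timer.
        return False
    return elapsed >= limit
```

A table like this can drive both alert routing and SLA reporting, so the numbers in the contract and the numbers in the tooling cannot drift apart.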

The On-Call Structure

A maintenance partner that takes incident response seriously runs an on-call rotation. Engineers are scheduled to be reachable during specified windows, with a primary on-call and a secondary backup. The on-call schedule is published, the contact methods are explicit, and the response time SLA is binding.

The structure should include:

  • A primary on-call engineer reachable by phone, SMS, and the escalation system (PagerDuty, Opsgenie, etc.)
  • A secondary backup who picks up if primary doesn’t respond within 5 minutes
  • A team lead who is notified for Sev1 incidents and who can mobilize additional resources
  • An account-level escalation path to leadership for incidents lasting more than 4 hours
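
The escalation rules in this list can be sketched as a single function. This is an illustration under stated assumptions: the role labels and the `who_to_notify` name are hypothetical, and in practice the same rules would be configured in the escalation system rather than hand-coded.

```python
from datetime import timedelta

SECONDARY_AFTER = timedelta(minutes=5)   # primary has not acknowledged
LEADERSHIP_AFTER = timedelta(hours=4)    # incident is still open


def who_to_notify(severity: int, elapsed: timedelta, acknowledged: bool) -> list[str]:
    """Return the roles that should have been paged by this point in the incident."""
    roles = ["primary"]
    if not acknowledged and elapsed >= SECONDARY_AFTER:
        roles.append("secondary")
    if severity == 1:
        # The team lead is notified for every Sev1, regardless of elapsed time.
        roles.append("team_lead")
    if elapsed > LEADERSHIP_AFTER:
        roles.append("leadership")
    return roles
```

For example, an unacknowledged Sev1 six minutes after the alert yields the primary, the secondary, and the team lead.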

Bemeir’s maintenance practice makes the on-call schedule visible to retainer clients, because transparency about who is on call is part of the trust relationship. The same pattern is documented in the Adobe Commerce maintenance practice.

The First Five Minutes

The first five minutes of an incident determine whether it becomes a 30-minute issue or a 4-hour disaster. The playbook should specify what happens in those first five minutes:

  1. Acknowledge the alert (via the escalation system, with a confirmation that a human has seen it)
  2. Open the incident channel (a dedicated Slack channel or war room with the merchant and the agency)
  3. Confirm the symptoms with at least two independent observations
  4. Establish severity and broadcast it to the channel
  5. Begin diagnosis with the most likely root causes based on the symptom pattern

The five-minute checklist is short by design. Lengthy procedures during the first five minutes increase the time to action and reduce the chance of containing the incident before it escalates.
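
Step 3 is the one most often skipped under pressure, so it helps to make the rule mechanical. The sketch below assumes a hypothetical `Observation` record and treats two reports from the same source as one observation; the source labels are examples only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    source: str   # e.g. "synthetic-monitor" or "merchant-report" (hypothetical labels)
    symptom: str  # e.g. "checkout-5xx"


def is_confirmed(symptom: str, observations: list[Observation], required: int = 2) -> bool:
    """A symptom counts as confirmed once `required` distinct sources report it."""
    sources = {o.source for o in observations if o.symptom == symptom}
    return len(sources) >= required
```

Counting distinct sources rather than raw reports avoids declaring an incident because one flaky monitor fired twice.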

The Common Failure Modes and Their First Diagnoses

A mature playbook includes pre-written first diagnoses for the common failure modes. The engineer responding to an incident shouldn’t have to invent the diagnosis. They should follow a structured path that has been refined across previous incidents.

| Symptom | First diagnosis path | First mitigation |
|---|---|---|
| Site fully down (HTTP 5xx) | Check Adobe Commerce app, database, Redis, OpenSearch in that order | Restart application, scale up if resource-bound |
| Checkout broken | Check payment gateway status, tax service, shipping service | Disable failing service, fail over to backup |
| Admin slow or inaccessible | Check indexer status, cron status, file system load | Disable problem indexer, scale admin nodes |
| PDP errors on subset of catalog | Check inventory feed, attribute set updates, recent deploys | Rollback recent deploy, restore from backup |
| Email not sending | Check mail provider status, queue worker status | Restart queue workers, verify provider auth |
| Search returning no results | Check OpenSearch cluster health, index status | Trigger reindex, verify cluster nodes |
| Performance regression | Check recent deploys, traffic spike, third-party tag changes | Scale infrastructure, identify recent change |

The table is a starting point. A real playbook has dozens of these entries, each refined from a previous incident.
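
Entries like these are easier to maintain, and easier to surface in an incident channel, when they live in a structured form. The following sketch encodes three rows of the table; the symptom keys, the `RunbookEntry` class, and the `first_steps` function are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunbookEntry:
    checks: tuple[str, ...]       # ordered: the first item is checked first
    mitigations: tuple[str, ...]


RUNBOOK = {
    "site_down_5xx": RunbookEntry(
        checks=("Adobe Commerce app", "database", "Redis", "OpenSearch"),
        mitigations=("restart application", "scale up if resource-bound"),
    ),
    "checkout_broken": RunbookEntry(
        checks=("payment gateway status", "tax service", "shipping service"),
        mitigations=("disable failing service", "fail over to backup"),
    ),
    "search_no_results": RunbookEntry(
        checks=("OpenSearch cluster health", "index status"),
        mitigations=("trigger reindex", "verify cluster nodes"),
    ),
}


def first_steps(symptom: str) -> RunbookEntry:
    """Look up the pre-written diagnosis path for a known symptom."""
    try:
        return RUNBOOK[symptom]
    except KeyError:
        raise LookupError(f"No runbook entry for {symptom!r}; diagnose manually") from None
```

An unknown symptom raises an error on purpose: a missing entry is a signal that the post-incident review should add one.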

Communication Discipline During Incidents

The merchant needs to know what’s happening. Communication during an incident should be:

  • Frequent: status updates every 15 minutes for Sev1, every 30 minutes for Sev2
  • Specific: what’s happening, what’s been tried, what’s being tried next
  • Honest about uncertainty: “we don’t yet know the root cause” is acceptable; manufactured confidence is not
  • Posted in a single channel: don’t fragment communication across email, Slack, and SMS
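
The update cadence is simple enough to automate as a reminder. This is a minimal sketch: the `next_update_due` function is a hypothetical helper, and because only Sev1 and Sev2 intervals are stated above, other severities return no deadline.

```python
from datetime import datetime, timedelta
from typing import Optional

# Intervals come from the cadence above; Sev3 and Sev4 are not specified there.
UPDATE_INTERVAL = {
    1: timedelta(minutes=15),
    2: timedelta(minutes=30),
}


def next_update_due(severity: int, last_update: datetime) -> Optional[datetime]:
    """Return when the next status update is due, or None if no cadence is defined."""
    interval = UPDATE_INTERVAL.get(severity)
    if interval is None:
        return None
    return last_update + interval
```

A bot in the incident channel could call this after each update and nudge the agency lead when the deadline passes.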

The agency lead should manage communication, not the engineer doing the work. The engineer is fully occupied with the technical diagnosis. The agency lead translates the engineer’s findings into customer-friendly status updates, communicates with the merchant’s leadership, and manages any external communication.

Adobe Commerce-Specific Diagnostics

The playbook should include Adobe Commerce-specific diagnostic commands and queries:

  • The standard `bin/magento` commands for cache, indexer, cron, and module status
  • The standard log locations and what each log captures
  • The standard database queries for checking quote, order, customer, and stock state
  • The standard cache and queue clearing patterns
  • The standard rollback procedure for deployments
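
As an illustration of the first item, the sketch below gathers read-only status output in one pass. It assumes the script runs from the Magento root on a host where `bin/magento` is executable; `collect_status` and `run_command` are hypothetical helper names. Cron is left out because its health is usually judged from the `cron_schedule` table and logs rather than from a single status command.

```python
import subprocess
from typing import Callable

# Read-only status commands; none of these change store state.
STATUS_COMMANDS = {
    "cache": ["bin/magento", "cache:status"],
    "indexer": ["bin/magento", "indexer:status"],
    "modules": ["bin/magento", "module:status"],
    "maintenance": ["bin/magento", "maintenance:status"],
}


def run_command(cmd: list[str]) -> str:
    """Run one command and return its combined stdout and stderr."""
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
    return result.stdout + result.stderr


def collect_status(runner: Callable[[list[str]], str] = run_command) -> dict[str, str]:
    """Collect the output of every status command, keyed by a short label."""
    return {name: runner(cmd) for name, cmd in STATUS_COMMANDS.items()}
```

The `runner` parameter exists so the collection logic can be exercised without a live store, for example by passing a function that returns canned output.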

For stores on Adobe Commerce Cloud, the playbook adds platform-specific commands for the cloud CLI, environment variable management, and deployment rollback. The Adobe Commerce on Cloud documentation covers the platform interface.

Integration Incident Handling

Most Adobe Commerce incidents involve integrations. The playbook should include integration-specific procedures for:

  • Payment gateway incidents (Stripe, Braintree, Adyen, PayPal)
  • Tax engine incidents (Avalara, Vertex, TaxJar)
  • Shipping carrier incidents (UPS, FedEx, USPS)
  • ERP incidents (NetSuite, SAP, Microsoft Dynamics)
  • PIM incidents (Akeneo, Salsify, Pimcore)
  • Marketing automation incidents

For each, the playbook should specify how to identify whether the failure is on the integration partner’s side or on the Adobe Commerce side, how to communicate with the integration partner during the incident, and what fallback behavior is acceptable for the merchant’s customers.
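
A rough first-pass heuristic for the "whose side is it" question might look like the following. It is a sketch under assumptions, not a definitive rule: the three inputs (the partner's status page, a direct probe of the partner API from outside the store, and the store's own integration calls) and the `classify_fault` name are all illustrative.

```python
from enum import Enum


class FaultSide(Enum):
    PARTNER = "partner"
    STORE = "store"
    UNKNOWN = "unknown"


def classify_fault(partner_status_ok: bool, direct_probe_ok: bool, store_calls_ok: bool) -> FaultSide:
    """Guess which side of an integration is failing from three independent signals."""
    if store_calls_ok:
        # The store's integration calls succeed, so the fault lies elsewhere.
        return FaultSide.UNKNOWN
    if not partner_status_ok or not direct_probe_ok:
        # The partner reports trouble, or is unreachable even when the store is bypassed.
        return FaultSide.PARTNER
    # The partner looks healthy from outside, yet the store's calls fail.
    return FaultSide.STORE
```

The result is a starting hypothesis for the responding engineer, who still needs to confirm it before contacting the partner or rolling anything back.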

The Post-Incident Review

Every incident ends with a written post-incident review (PIR). The PIR is not a blame document. It is a learning document with a specific structure:

  • Incident timeline: what happened and when
  • Root cause: the actual underlying cause, not just the immediate trigger
  • Impact: customers affected, revenue impact, brand impact
  • Response: what worked well, what worked badly
  • Action items: specific changes to prevent recurrence
  • Owners and deadlines for each action item

The PIR is delivered to the merchant within 5 business days of the incident’s resolution. It is reviewed jointly. The action items get tracked through completion.
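
The five-business-day deadline can be computed mechanically when the incident is closed. The sketch below skips weekends only; public holidays are ignored, which is a simplifying assumption, and the function names are hypothetical.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Advance `days` weekdays from `start`, skipping Saturdays and Sundays."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday is 0, Friday is 4
            remaining -= 1
    return current


def pir_due(resolved_on: date) -> date:
    """Latest delivery date for the PIR: five business days after resolution."""
    return add_business_days(resolved_on, 5)
```

An incident resolved on a Friday is therefore due the following Friday, not the following Wednesday.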

Bemeir’s Adobe Commerce maintenance engagements treat the PIR as a contract deliverable, because the agency that doesn’t write PIRs doesn’t learn from incidents, and the maintenance relationship that doesn’t learn from incidents repeats them.

Annual Incident Review

Once a year, all incidents are reviewed together. The review identifies patterns:

  • Are certain failure modes recurring?
  • Are certain integrations producing disproportionate incident volume?
  • Are response times trending up or down?
  • Are action items from previous PIRs being completed?
  • Is the on-call structure handling load appropriately?

The annual review feeds back into the playbook. Patterns that recur get pre-written diagnostic paths. Integrations that produce many incidents get architectural attention. The playbook is a living document, not a one-time deliverable.
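
The first two review questions reduce to counting. A minimal sketch, assuming each incident is logged with a failure mode and, where relevant, the integration involved; `IncidentRecord` and `recurring` are hypothetical names.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IncidentRecord:
    failure_mode: str
    integration: Optional[str] = None  # None when no integration was involved


def recurring(records: list[IncidentRecord], field: str, threshold: int = 2) -> dict[str, int]:
    """Count values of `field` across incidents and keep those seen at least `threshold` times."""
    values = (getattr(r, field) for r in records)
    counts = Counter(v for v in values if v is not None)
    return {value: n for value, n in counts.items() if n >= threshold}
```

Calling `recurring(records, "failure_mode")` surfaces candidates for new pre-written diagnostic paths, and `recurring(records, "integration")` surfaces integrations that may need architectural attention.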

Industry References

Industry best practices for incident response are well-documented. Google’s Site Reliability Engineering book is the canonical reference for incident management at scale. The Incident Management for Operations book by Rob Schnepp covers the equivalent for smaller teams. The PagerDuty Incident Response documentation is publicly available and provides specific procedural templates.

The patterns from these sources apply to Adobe Commerce as much as to any other production system. The playbook should integrate them rather than treating eCommerce as somehow exempt from operational discipline.

What the Maintenance Partner Should Hand You

If you’re evaluating a maintenance partner today, ask to see their incident response playbook. The serious partners will share it (with confidential client details redacted). The ones who can’t share a playbook either don’t have one or have one that wouldn’t survive scrutiny. The playbook is the deliverable that proves incident response is real rather than aspirational.

The same playbook discipline applies across the broader maintenance practice: Bemeir’s Adobe Commerce work and the team’s Shopify Plus, Shopware, and BigCommerce maintenance engagements all use the same fundamental incident structure, adapted to platform-specific diagnostics. Production incidents are inevitable. The question isn’t whether they happen but whether they get handled well when they do, and the playbook is what determines the answer.
