Why Restartability Is a Feature, Not a Hack

If your job can't be killed mid-run, it's not production-ready

Systems that can't be safely interrupted and restarted are fragile by design. Restartability through idempotency and external state is a core reliability principle, not a recovery hack.

A batch job processes a file with one million records. It runs for two hours. If the job crashes after processing 500,000 records, it restarts from the beginning. All one million records are reprocessed. Some are processed twice. The job is not idempotent, so processing records twice corrupts the results.

To avoid reprocessing, a checkpoint system is added: every 100,000 records, the position is saved. If the job crashes, it resumes from the last checkpoint. This is treated as a technical hack: a workaround to avoid expensive reprocessing. The system is fragile; checkpoint state can become inconsistent with the actual data. Recovery from checkpoint is manual and error-prone.

A different batch job is designed from the start to be restartable. Every record has a unique ID and a sequential position. The system processes records sequentially and idempotently: processing the same record twice produces the same result in the database. The state is outside the system: the database tracks which records have been processed. If the job crashes, it restarts, checks the database, and resumes from the next unprocessed record. Restartability is not a hack. It’s the primary design principle.

The second system is simpler, more reliable, and easier to operate. Restartability was cheaper to build in up front than to bolt on later as a hack.
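
The second design can be sketched in a few lines. This is a minimal illustration, using SQLite as the external state; the table layout and the `process_record` stub are hypothetical stand-ins, not from any real system:

```python
import sqlite3

def process_record(record_id: int) -> str:
    """Stand-in for the real work; must be safe to repeat (idempotent)."""
    return f"result-{record_id}"

def run_job(conn: sqlite3.Connection, records: list[int]) -> int:
    """Process records idempotently, tracking progress in the database.

    Restarting after a crash is safe: already-processed records are
    skipped because their IDs are in the `processed` table. Returns the
    number of records handled on this run.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed (id INTEGER PRIMARY KEY, result TEXT)"
    )
    handled = 0
    for record_id in sorted(records):  # deterministic order
        # External state, not memory, decides what remains to be done.
        done = conn.execute(
            "SELECT 1 FROM processed WHERE id = ?", (record_id,)
        ).fetchone()
        if done:
            continue
        result = process_record(record_id)
        # INSERT OR IGNORE makes the write itself idempotent.
        conn.execute(
            "INSERT OR IGNORE INTO processed (id, result) VALUES (?, ?)",
            (record_id, result),
        )
        conn.commit()  # progress is durable after each record
        handled += 1
    return handled
```

Running the job a second time processes nothing, because the database already records every ID. That skip-if-done check is the entire restart logic.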

What Restartability Actually Means

A restartable system can be interrupted at any point and restarted without losing correctness or requiring manual recovery.

“Interrupted at any point” means the system can fail, be killed, be shut down, or be suspended at any time. There’s no safe stopping point. There’s no “finish this operation then shut down gracefully.” Interruption can happen immediately.

“Restarted without losing correctness” means the system produces the same result regardless of whether it runs once from start to finish or is interrupted and restarted multiple times.

“Without requiring manual recovery” means the system automatically determines what work remains and completes it. No human has to examine state, figure out what happened, and tell the system what to do next.

Designing for restartability requires several properties:

  • Idempotency. Operations can be repeated safely. Running the same operation twice produces the same result as running it once.
  • External state. The system’s progress is tracked outside the system (in a database, file, or external store), not in memory or local state.
  • Deterministic ordering. The system processes units of work in a deterministic order so that after restart, it can resume from a known position.
  • No side effects from incomplete operations. If an operation is interrupted partway through, it doesn’t leave the system in a half-changed state.

Systems without these properties are not restartable. They require checkpointing (saving and resuming state) or careful orchestration to avoid reprocessing.

Why Checkpoint-Based Systems Are Fragile

Many systems attempt to make themselves restartable by adding checkpointing: periodically saving the current state so the system can resume from a checkpoint if it crashes.

Checkpoint-based restartability seems like a hack because it is a hack. The system was designed to run in a single linear execution. Checkpointing is bolted on top to handle the case where the system fails.

Checkpoint systems are fragile for several reasons:

State consistency. The system’s internal state (variables, data structures, memory) must be checkpointed. The checkpoint must capture all state necessary to resume correctly. Missing a piece of state means resuming from the checkpoint produces incorrect results. Ensuring all state is checkpointed is difficult; it’s easy to forget some piece.

Version compatibility. If code is deployed between the crash and the restart, the checkpoint format might not be compatible with the new code. Resuming from a checkpoint created by old code might fail or produce wrong results in new code.

Checkpoint corruption. Checkpoints can become inconsistent with the system’s actual state. A database transaction might partially complete, with some changes persisted and others not. The checkpoint might be saved mid-transaction. Resuming produces inconsistent state.

Atomicity of checkpoint. Saving a checkpoint is itself an operation that can fail. The checkpoint write might partially complete, leaving the checkpoint in an inconsistent state. Trying to resume from a corrupted checkpoint fails.

Rollback complexity. If something goes wrong during recovery, rolling back requires understanding what the checkpoint contained and how to undo its effects. There’s no clean way to invalidate the checkpoint and restart fresh.

Checkpoint-based systems are not restartable in the pure sense. They’re resumable from specific points, with significant caveats and potential for failure.

Idempotent Design Is Restartable By Nature

Systems designed for idempotency are naturally restartable.

An idempotent operation produces the same result regardless of how many times it’s executed. Writing a record with a specific ID to a database can be made idempotent: with a unique constraint on the ID and an insert-or-ignore (or upsert) write, running the write a second time succeeds but changes nothing. An idempotent transaction can be retried safely.

If all operations in a system are idempotent and the system tracks which operations have completed, the system is restartable: restart the system, check which operations completed, execute the remaining operations. There’s no checkpoint. There’s no resumption from a specific point. There’s just “run all incomplete operations.”

The key is external state. The system doesn’t store progress in memory. It stores progress in the database or external state, where it’s durable, queryable, and verifiable.

These designs are all restartable by querying external state:

  • A data processing system marks each processed record in the database; after restart, it queries for unprocessed records and continues processing.
  • A deployment system idempotently applies configuration changes; querying the current state reveals what’s been deployed, and rerunning the deployment applies only the missing changes.
  • A message queue consumer marks each consumed message as processed; after restart, it queries the queue for unprocessed messages and resumes.

These systems are restartable by design, not by bolting on checkpointing.

The Relationship Between Restartability and Simplicity

Restartable systems are often simpler than non-restartable systems.

A non-restartable system must be carefully orchestrated to avoid failures. Operators must monitor it, watch for failures, and recover manually when something goes wrong. The system requires a specific order of operations and careful state management.

A restartable system can be restarted at any time without concern for its state. If a restart fails, restart it again. The system is self-healing: it always converges to the correct state eventually. Operators don’t need to monitor closely or manually recover. They can restart the system and trust it to complete.

A batch job that is not restartable must be monitored continuously. If it fails, the operator must determine where it failed, what work was completed, what work remains, and manually resume it. The operator might restart it without understanding the cleanup needed, leading to duplicate processing or data corruption.

A batch job that is restartable can be restarted blindly. The system checks what work is complete and resumes from the next item. The operator doesn’t need to understand the internal state. The system is simpler to operate because restartability is built in.

Designing for Restartability From the Start

Restartability should be a design requirement, not a hack added later.

Use external state for progress tracking. Don’t store progress in memory. Use a database, file, or message queue to track which items have been processed. This state must be durable and queryable: after restart, the system queries the state and knows what to do next.

Design operations to be idempotent. Every operation should be safe to retry. This usually means: record the ID of the operation, check if it’s already been processed, if not process it, if yes skip it. The operation is idempotent if processing it twice produces the same final state as processing it once.

Order work deterministically. Process items in a deterministic order: by ID, by timestamp, by priority. This ensures that after restart, the system resumes from a known position.

Keep state simple. The less state the system carries, the simpler restart logic is. Avoid complex in-memory data structures. Use simple, queryable external state.

Separate concerns. The progress of processing and the actual processing should be separate. One system tracks what’s been processed and what remains. Another system does the processing. If the processing system crashes, restarting it checks the progress tracker and resumes.

Test restartability. Write tests that verify: crash the job at various points, restart it, verify it completes correctly. Simulate crashes in the middle of operations. Verify the system recovers.
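
A restart test can be as simple as injecting a crash partway through and restarting blindly. A sketch, again using SQLite for external state; `flaky_job` and its `done` table are illustrative, not from any real system:

```python
import sqlite3

def flaky_job(conn, records, crash_after=None):
    """Process records idempotently; optionally raise partway through
    (crash_after = how many records to handle before the simulated crash)."""
    conn.execute("CREATE TABLE IF NOT EXISTS done (id INTEGER PRIMARY KEY)")
    handled = 0
    for rid in sorted(records):
        if conn.execute("SELECT 1 FROM done WHERE id=?", (rid,)).fetchone():
            continue  # already processed before the crash
        if crash_after is not None and handled >= crash_after:
            raise RuntimeError("simulated crash")
        conn.execute("INSERT OR IGNORE INTO done (id) VALUES (?)", (rid,))
        conn.commit()
        handled += 1
    return handled

def test_restart_after_crash():
    conn = sqlite3.connect(":memory:")
    records = list(range(10))
    try:
        flaky_job(conn, records, crash_after=4)  # dies mid-run
    except RuntimeError:
        pass
    # Restart blindly: the job must finish exactly the remaining work.
    remaining = flaky_job(conn, records)
    total = conn.execute("SELECT COUNT(*) FROM done").fetchone()[0]
    assert remaining == 6 and total == 10
```

Varying `crash_after` across the whole range of positions turns this into a sweep that exercises every interruption point.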

Examples of Restartable Designs

Several patterns enable restartability:

Distributed transactions with external state. A payment system processes transactions idempotently. Each transaction has a unique ID. The system checks if the transaction has been processed. If not, it processes it. If yes, it skips it. The transaction state (processed, failed, pending) lives in the database, not in memory.
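
A minimal sketch of that check-then-process pattern; the dict stands in for a transactional store with a unique constraint on the transaction ID, and the field names are assumptions:

```python
def process_payment(store: dict, txn_id: str, amount: int) -> str:
    """Process a payment at most once, keyed by its unique transaction ID.
    `store` stands in for a durable database table."""
    if txn_id in store:
        # Already processed: skip the charge, return the recorded status.
        return store[txn_id]["status"]
    # ... charge the card here ...
    store[txn_id] = {"amount": amount, "status": "processed"}
    return "processed"
```

Retrying a transaction after a crash is now harmless: the second call finds the record and returns without charging again.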

Event sourcing. The system appends immutable events to an event log. The current state is derived from the events. If the system crashes, it replays the event log to rebuild state. Appends are idempotent when events carry unique IDs: appending an event that’s already in the log is a no-op, so the derived state is unchanged.
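
A sketch of the idea, assuming events carry unique IDs and the derived state is a running balance (both assumptions for illustration):

```python
def append_event(log: list, seen: set, event_id: str, event: dict) -> None:
    """Append an event only if its ID hasn't been seen: appending the
    same event twice leaves the log, and so the derived state, unchanged."""
    if event_id in seen:
        return
    seen.add(event_id)
    log.append(event)

def current_balance(log: list) -> int:
    """Derive state by replaying the log from the beginning."""
    return sum(e["delta"] for e in log)
```

After a crash, the system rebuilds `current_balance` by replay; duplicate appends during recovery are absorbed by the ID check.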

Message queue consumers. A consumer pulls messages from a queue, processes them, and marks them as consumed. If the consumer crashes mid-processing, restarting it queries the queue for unprocessed messages and resumes. The message queue tracks which messages have been consumed.
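
A sketch of the consumer loop; the `consumed` set stands in for the broker’s ack/offset state, which is what survives a restart:

```python
def consume(messages, consumed: set, handler) -> int:
    """Process each message at most once per the external `consumed` record.
    A crash after processing but before acking means redelivery, so
    `handler` must be idempotent. Returns messages handled this run."""
    handled = 0
    for msg_id, payload in messages:
        if msg_id in consumed:
            continue          # already acked before a previous crash
        handler(payload)
        consumed.add(msg_id)  # ack only after successful processing
        handled += 1
    return handled
```

Restarting the consumer against the same message list completes whatever the previous run left unacked and does nothing more.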

File-based batch processing. A batch job reads records from a file sequentially and writes results to a database. An external record (in another file or database) tracks the position in the input file. If the job crashes, restarting it reads the position and resumes from the next record.
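
A sketch of this pattern, with a position file as the external record and a list standing in for the database writes; note the position is saved after each record, so the write it guards must itself be idempotent in case of a crash between the two steps:

```python
import os

def run_file_batch(input_path: str, pos_path: str, sink: list) -> int:
    """Process lines from input_path, persisting the next line number to
    pos_path after each record. On restart, resume from the saved position."""
    start = 0
    if os.path.exists(pos_path):
        with open(pos_path) as f:
            start = int(f.read().strip() or 0)
    handled = 0
    with open(input_path) as f:
        for lineno, line in enumerate(f):
            if lineno < start:
                continue  # already processed before the crash
            sink.append(line.strip())       # stand-in for the database write
            with open(pos_path, "w") as pf:
                pf.write(str(lineno + 1))   # durable progress marker
            handled += 1
    return handled
```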

Schedulers with task state. A task scheduler tracks the state of each task: pending, running, completed, failed. If the scheduler crashes, restarting it queries the state and resumes incomplete tasks. Tasks are idempotent: running the same task twice produces the same result.
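
A sketch of the resume loop; the in-memory dict stands in for a durable task table, and the two-state model (pending/completed) is a simplification of the four states above:

```python
def run_scheduler(tasks: dict, runner) -> list:
    """Run every task not yet completed, in deterministic order.
    `tasks` maps task_id -> state; `runner` must be idempotent."""
    started = []
    for task_id in sorted(tasks):
        if tasks[task_id] == "completed":
            continue
        runner(task_id)                  # safe to repeat after a crash
        tasks[task_id] = "completed"
        started.append(task_id)
    return started
```

Restarting the scheduler re-runs only the tasks whose state never reached "completed".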

When to Accept Non-Restartable Design

Not all systems need to be restartable. Some systems can afford the complexity of non-restartability.

Short-lived processes that rarely fail. A process that always completes in seconds is unlikely to fail mid-execution. The cost of adding restartability might outweigh the benefit.

Stateless services backed by external stores. A stateless API service has no progress to track: any state lives in a database, not in the service. If it crashes, restarting it immediately makes it available again. Restartability is implicit because there’s no state to recover.

Real-time systems with tight latency requirements. A system with microsecond latency requirements might not have time to query external state and resume. Non-restartability might be acceptable if the system is replicated and failure of individual instances is tolerated.

Interactive systems where state is held by the user. A web form where users enter data across multiple steps doesn’t need system restartability. If the system crashes, the user’s browser session is lost, but that’s acceptable. The user interaction is the state that survives restarts.

For these systems, non-restartable design is a reasonable choice. But many systems can afford the complexity and would benefit from restartability.

The Cost of Not Having Restartability

Systems without restartability have hidden costs:

Operational complexity. When the system fails, operators must manually recover. This requires understanding the system’s state, which is complex if state is internal. Recovery is error-prone.

Unavoidable downtime. A system that cannot resume mid-operation must either run to completion or lose its work in progress. Restarting requires redoing all work. For long-running operations, this is expensive.

Cascading failures. A system that fails and requires manual recovery might fail again if recovery is not done correctly. If the recovery process itself is fragile, the system becomes less reliable, not more.

Low confidence in operations. Operators of non-restartable systems are nervous about restarts. They’re never sure if restarting will complete correctly or require manual intervention. This leads to reluctance to restart, which leads to systems running longer than they should, which leads to more degradation.

Difficult to scale. If a system must be restarted end-to-end when it fails, scaling to larger datasets or more frequent runs becomes expensive. Long-running jobs that process days of data are nightmare scenarios: if anything fails 23.95 hours in, the entire job restarts.

Restartability as Infrastructure Feature

Some platforms provide restartability as a feature, removing the burden from individual systems.

Kubernetes and container orchestration. If a container crashes, the orchestrator automatically restarts it. Restartability is implicit: the container should be designed so that restarting from a clean state is safe. This encourages idempotent design.

Workflow engines. Some workflow engines (Apache Airflow, Temporal) provide restartability as a built-in feature. Tasks are designed to be idempotent. The engine tracks which tasks have completed and resumes incomplete ones.

Distributed databases. Some databases provide crash recovery: if the database crashes, restarting it recovers to a consistent state. This is restartability at the database level.

Message brokers. Message brokers like Kafka or RabbitMQ track which messages have been consumed. Consumers can restart and resume from the last consumed message.

Using these platforms reduces the burden of designing restartability, but it doesn’t eliminate it. The system must still be designed to work with restartability: use idempotent operations, track progress externally, avoid complex internal state.

Restartability and Reliability

Restartability is not the same as reliability, but they’re deeply related.

A reliable system rarely fails. A restartable system can fail frequently and still produce correct results.

Reliability is about avoiding failures. Restartability is about managing failures.

In practice, the best systems combine both: they’re designed to be reliable (failures are rare) and restartable (failures that do occur are managed well).

A system that’s unreliable but restartable might fail every day but produce correct results because it restarts and recovers. A system that’s reliable but not restartable might fail once a year but require week-long recovery when it does.

For most systems, restartability is more valuable than reliability. A system that fails predictably and recovers automatically is easier to operate than a system that rarely fails but requires manual recovery when it does.

The Mindset Shift

Building restartable systems requires a mindset shift from “prevent failures” to “manage failures.”

Non-restartable systems are built with the assumption that failures are exceptional and should be prevented. The system is designed to run in specific conditions and with careful state management. Failures are treated as catastrophic.

Restartable systems are built with the assumption that failures are normal and inevitable. The system is designed to handle interruption gracefully. Failures are just a restart away.

This mindset shift is difficult. Engineers want to prevent failures. Accepting that they’re inevitable requires different thinking.

But systems designed with this mindset are fundamentally simpler and more reliable. They expect to be restarted. They’re designed to recover automatically. They don’t require careful choreography or manual intervention.

Choose Restartability

Restartability is not a hack or a workaround. It’s a design principle that produces simpler, more reliable systems.

For most long-running systems, batch jobs, and infrastructure components, restartability should be a core requirement. Design for it from the start. Use idempotency. Track progress externally. Test restart scenarios.

The cost upfront is minimal. The benefit is systems that are simpler to operate, more resilient to failure, and easier to scale.

Choose restartability by design, not by accident.