How We Ship Software Every Day (And Sleep at Night)
Shipping software every day isn't about speed or heroics. It's about designing systems that allow engineers to build, deploy, fail, recover, and learn without fear.
This post walks through a normal day in a production-grade DevOps environment, including the moments when everything goes wrong.
Monday, 09:12 - I start my day
I have a clear requirement:
"Add a new rule validation to the contract flow."
I create a branch:
git checkout -b feat/contract-rule-validation
I start coding.
While writing the code:
- the rule is isolated in the domain layer
- unit tests cover both the happy path and edge cases
- I'm not thinking about the pipeline; the system already expects me to do the bare minimum properly
When I'm done:
git commit -m "feat(domain): add contract rule validation"
git push
09:26 - branch pipeline kicks in (without getting in my way)
While I grab a coffee, the pipeline runs:
- cached build
- unit tests
- linting
- fast SAST scan
- breaking change detection (API contracts)
In 4 minutes, I get the result:
All green.
No blocked environments. No queues. No waiting.
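For readers who want something concrete: a branch pipeline like this boils down to a handful of fast jobs. Here's a minimal sketch, assuming GitHub Actions and a Node.js service; the SAST tool and the contract-check script are illustrative stand-ins, not a prescription.

```yaml
# Sketch of a branch pipeline: fast feedback only, nothing that blocks a shared
# environment. Tool and script names are illustrative.
name: branch-ci
on:
  push:
    branches-ignore: [main]

jobs:
  fast-feedback:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm                      # cached build: dependencies restored between runs
      - run: npm ci
      - run: npm run lint                 # linting
      - run: npm test                     # unit tests: happy path and edge cases
      - name: Fast SAST scan
        run: pipx run semgrep scan --config auto --error   # illustrative SAST tool
      - name: Breaking change detection
        run: npm run check:api-contracts  # hypothetical script that diffs API contracts
```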
09:35 - I open the Pull Request
The moment I open it:
- the PR pipeline starts
- an ephemeral environment is created automatically
No approvals. No requests. No Slack messages.
I get an automated comment on the PR:
Ephemeral environment ready:
09:45 - my code is already running
In that environment:
- a container with my code
- an isolated database
- mocked dependencies
- logs, metrics and distributed traces fully wired
QA validates the real flow.
Product clicks through the feature.
Another engineer tests an edge case.
Meanwhile, in parallel:
- integration tests are running
- contract tests are running
- selective E2E tests are running
Everything happens concurrently.
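If you're wondering what an "ephemeral environment" means in practice, it can be as simple as a per-PR compose stack (or its Kubernetes equivalent). A minimal sketch, with image names, ports and the mock service as placeholders:

```yaml
# Sketch of a per-PR ephemeral environment (Docker Compose flavor; image names,
# ports and the mock service are placeholders).
services:
  contract-api:
    image: registry.example.com/contract-api:pr-1234    # container built from the PR
    environment:
      DATABASE_URL: postgres://app:app@db:5432/contracts
      PARTNER_API_URL: http://mocks:8080                 # external dependencies are mocked
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317   # traces and metrics wired up
    ports:
      - "8080:8080"
    depends_on: [db, mocks, otel-collector]

  db:
    image: postgres:16                                   # isolated, throwaway database
    environment:
      POSTGRES_USER: app
      POSTGRES_PASSWORD: app
      POSTGRES_DB: contracts

  mocks:
    image: wiremock/wiremock:latest                      # mocked third-party dependencies

  otel-collector:
    image: otel/opentelemetry-collector:latest           # logs, metrics and traces
```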
10:30 - real human review
A tech lead joins the PR.
He doesn't comment on formatting.
He doesn't argue about naming.
He doesn't ask for basic tests.
Instead, he asks:
"Should this rule live here, or in bounded context X?"
We discuss.
I adjust the design.
I push another commit:
refactor(domain): align rule with contract context
The pipeline runs again.
The ephemeral environment updates automatically.
11:15 - PR approved and merged
I merge into main.
There's no ceremony.
No one "authorizes" a deployment.
11:16 - an immutable artifact is born
The main pipeline does the following:
- final build
- full test suite
- SBOM generation
- container vulnerability scanning
- artifact signing
- automatic versioning (e.g. 1.12.0)
This container is now the law.
The same artifact will go all the way to production.
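Roughly, the main pipeline looks like the sketch below. It assumes GitHub Actions with syft for the SBOM, trivy for scanning and cosign for signing; the registry, the version number and the omitted login/install steps are illustrative.

```yaml
# Sketch of the main-branch artifact pipeline. Assumes syft, trivy and cosign
# are available on the runner; registry login and tool installation omitted.
name: main-release
on:
  push:
    branches: [main]

jobs:
  build-sign-publish:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
      id-token: write                     # keyless signing with cosign
    env:
      IMAGE: ghcr.io/example/contract-api:1.12.0   # version computed by the pipeline, not typed by a human
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm test           # full test suite
      - name: Build and push the image
        run: |
          docker build -t "$IMAGE" .
          docker push "$IMAGE"
      - name: Generate SBOM
        run: syft "$IMAGE" -o spdx-json > sbom.spdx.json
      - name: Vulnerability scan
        run: trivy image --exit-code 1 --severity HIGH,CRITICAL "$IMAGE"
      - name: Sign the artifact
        run: cosign sign --yes "$IMAGE"
```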
11:25 - automatic deployment to DEV
Without me asking for anything:
- the artifact is deployed to DEV
- smoke tests run
- metrics start flowing
DEV is noisy.
It's continuous integration of everything.
13:00 - automatic promotion to STAGING
The system decides, not a human.
Since everything passed:
- the same container is promoted to STAGING
- realistic database
- near-production data
- full E2E suite
- DAST
- performance sanity checks
I move on to another task.
I'm not "waiting for staging."
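"The system decides" is not magic; it's ordinary pipeline wiring. The staging job simply depends on the DEV checks and reuses the exact same image reference, so promotion never waits for a human. A rough sketch of the two jobs (deploy scripts are placeholders):

```yaml
# Sketch of automatic DEV -> STAGING promotion: the staging job depends on the
# DEV checks and reuses the exact same image. Deploy scripts are placeholders.
jobs:
  deploy-dev:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./deploy.sh dev ghcr.io/example/contract-api:1.12.0
      - run: ./smoke-tests.sh dev          # smoke tests gate the promotion

  promote-staging:
    needs: deploy-dev                      # runs only if DEV deploy and smoke tests passed
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./deploy.sh staging ghcr.io/example/contract-api:1.12.0   # same artifact, no rebuild
      - run: ./run-e2e.sh staging          # full E2E suite
      - run: ./run-dast.sh staging         # DAST, e.g. a ZAP baseline scan
      - run: ./perf-sanity.sh staging      # performance sanity checks
```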
16:40 - ready for production
STAGING is green.
Release notes are already generated.
Versioning is locked.
No one asks in Slack:
"Can we deploy?"
The pipeline creates a release candidate marked as ready.
Tuesday, 10:00 - production, no drama
The release is a single click (or fully automated, depending on the product).
Production rollout starts like this:
- 5% of pods receive traffic (canary)
- metrics under observation:
  - error rate
  - latency
  - saturation
  - critical logs
After 10 minutes:
- all good → 25%
- then 50%
- then 100%
If anything goes wrong:
- automatic rollback
- no one wakes up at 3 AM
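This progressive rollout can be written down declaratively. The sketch below assumes Argo Rollouts; the weights, pause durations and the analysis template name are illustrative.

```yaml
# Sketch of the canary strategy (Argo Rollouts). Weights, pauses and the
# analysis template name are illustrative.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: contract-api
spec:
  replicas: 10
  selector:
    matchLabels:
      app: contract-api
  template:
    metadata:
      labels:
        app: contract-api
    spec:
      containers:
        - name: contract-api
          image: ghcr.io/example/contract-api:1.12.0
  strategy:
    canary:
      analysis:
        templates:
          - templateName: canary-health    # error rate, latency, saturation
        startingStep: 1                    # start checking once the canary has traffic
      steps:
        - setWeight: 5                     # 5% of pods receive traffic
        - pause: { duration: 10m }         # metrics under observation
        - setWeight: 25
        - pause: { duration: 10m }
        - setWeight: 50
        - pause: { duration: 10m }
        - setWeight: 100
```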
The day everything went wrong
Tuesday, 10:07 - production starts burning
The deployment started as usual:
- 5% canary
- green metrics in the first minutes
Then:
- HTTP 500s start rising
- P95 latency doubles
- a specific flow explodes
No one "noticed by chance."
10:08 - the system notices before a human
Alerts fire automatically:
- error rate > SLO
- correlation with version 1.12.0
- canary tag detected
Slack receives:
Canary degradation detected
Service: contract-api
Version: 1.12.0
Action: rollback initiated
No one decides. The system acts.
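The alert itself is nothing exotic. A Prometheus-style rule along these lines is enough; the metric names, labels and the 1% threshold are placeholders for whatever your SLO actually says.

```yaml
# Sketch of the canary SLO alert (Prometheus alerting rule). Metric names,
# labels and the threshold are placeholders.
groups:
  - name: contract-api-canary
    rules:
      - alert: CanaryDegradationDetected
        expr: |
          sum(rate(http_requests_total{service="contract-api", track="canary", status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{service="contract-api", track="canary"}[5m]))
            > 0.01
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Canary degradation detected: contract-api error rate above SLO"
```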
10:09 - automatic rollback
The pipeline:
- cuts traffic to the canary
- rolls back to version 1.11.3 (last healthy)
- keeps old pods warm
- traffic stabilizes
Impact:
- ~1-2 minutes of partial errors
- zero full downtime
- most users never notice
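What makes the rollback automatic is the analysis attached to the canary: if the metric check fails, the rollout aborts and traffic shifts back to the last healthy version without anyone touching anything. A sketch of that check, assuming Argo Rollouts with a Prometheus provider (address, query and threshold are illustrative):

```yaml
# Sketch of the analysis that aborts a bad canary automatically
# (Argo Rollouts AnalysisTemplate; address, query and threshold illustrative).
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: canary-health
spec:
  metrics:
    - name: error-rate
      interval: 1m
      failureLimit: 1                      # one bad reading is enough to abort
      successCondition: result[0] < 0.01   # stay under 1% errors
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{service="contract-api", track="canary", status=~"5.."}[2m]))
              /
            sum(rate(http_requests_total{service="contract-api", track="canary"}[2m]))
```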
10:11 - now the real work begins
The failure is contained.
Now it's engineering.
An incident channel is created automatically:
#incident-contract-api-2026-02-05
Initial automated message includes:
- start time
- affected version
- impacted metrics
- rollback executed
- current status: stabilized
10:15 - investigation with data, not guesses
I join the channel.
First thing I do:
- open distributed traces
- filter by version 1.12.0
- follow the failing request
What I see is clear:
- the new rule
- a specific input
- an unhandled condition
- an unhandled exception
This is not a mystery.
Not intermittent.
It's a real bug.
10:25 - hypothesis confirmed
I correlate data:
- structured logs
- error metrics per route
- correlation with feature flag (enabled in prod)
Conclusion:
The new rule assumes a field that doesn't exist in ~2% of legacy contracts.
The pipeline didn't fail.
Tests didn't cover a historical edge case.
This happens.
The system absorbed the impact.
10:35 - the fix starts calmly, without bureaucracy
I create a branch:
git checkout -b fix/contract-rule-null-case
I fix the issue:
- handle the legacy scenario
- add a unit test
- add a specific integration test
Commit:
git commit -m "fix(domain): handle legacy contract edge case"
Push.
10:45 - fast pipeline (hotfix path)
Because it's a fix:
- prioritized pipeline
- focused test subset
- less generic E2E
- heavy focus on the broken regression path
An ephemeral environment spins up with masked real data.
I validate the broken flow.
QA validates.
The lead validates quickly.
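The hotfix path doesn't need a separate pipeline. A branch-name condition that swaps the full suite for a focused regression run is usually enough; a rough sketch in GitHub Actions syntax, with the test filters as placeholders:

```yaml
# Sketch of a prioritized hotfix path: fix/* branches run a focused regression
# suite instead of the full one. Script names and the filter are placeholders.
jobs:
  tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - name: Focused regression suite (hotfix branches)
        if: startsWith(github.head_ref, 'fix/')
        run: npm test -- --grep "contract rule"       # hypothetical filter on the broken path
      - name: Full suite (everything else)
        if: ${{ !startsWith(github.head_ref, 'fix/') }}
        run: npm test && npm run e2e
```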
11:30 - merge and new artifact
Merged into main.
New version:
1.12.1
Pipeline runs:
- build
- tests
- scans
- signing
11:40 - production again, with extra caution
This time:
- 1% canary
- metrics observed for longer
- tighter alert thresholds
Everything stays green.
Traffic ramps up:
- 5%
- 25%
- 100%
No rollback.
No drama.
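Concretely, "extra caution" is just the same canary strategy with smaller steps and longer observation windows, something like this (values illustrative):

```yaml
# Tighter canary steps for the redeploy (values illustrative).
steps:
  - setWeight: 1               # 1% first, instead of 5%
  - pause: { duration: 30m }   # metrics observed for longer
  - setWeight: 5
  - pause: { duration: 15m }
  - setWeight: 25
  - pause: { duration: 15m }
  - setWeight: 100
```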
12:10 - post-mortem (no blame, no theater)
The system automatically creates:
- a post-mortem draft
- a populated timeline
- attached metrics
- linked commits
Short meeting (30-45 minutes).
Conclusion:
- legitimate failure
- historical data not represented in tests
- pipeline reacted correctly
- automatic rollback saved the day
Action items:
- new dataset for legacy contract testing
- new preventive metric
- adjusted canary thresholds for this service
The mature truth
At the end of the day, this wasn't a special week or an exceptional incident.
It was just another normal cycle of building, shipping, operating and learning.
Some days everything flows smoothly.
Other days production reminds you that real systems are messy, full of history and impossible to fully simulate.
The difference isn't whether things break; they always will.
The difference is whether your delivery system turns those moments into stress and heroics, or into something routine and manageable.
In this story, no one had to stop the world to fix production.
Work didn't halt.
Trust in the system didn't disappear.
The pipeline absorbed the impact, created space for investigation, and allowed the fix to move forward with the same discipline as any other change.
That's what maturity looks like in practice.
Not perfection, not speed for its own sake, but the ability to ship continuously, recover quickly and learn without fear.
That's how we ship software every day.
And that's why we can afford to sleep at night.