Memory Management
1. What this document is about
This document addresses memory allocation, garbage collection behavior, and resource lifecycle management in .NET applications operating under sustained production load. It focuses on the CLR's memory subsystem and how allocation patterns, GC tuning, and object lifetime decisions affect system throughput, latency distribution, and operational cost.
It applies to:
- Server-side APIs handling concurrent requests
- Background processors with long-running operations
- Cloud-native workloads in containerized environments
- Applications with strict P99 latency requirements
- Systems experiencing memory pressure or LOH fragmentation
It does not apply to:
- Unmanaged memory management in C/C++
- Memory safety vulnerabilities (buffer overflows, use-after-free)
- Basic .NET programming concepts
- Language-specific features unrelated to runtime behavior
This document is about reasoning and trade-offs, not about micro-optimizations or cargo-cult "performance tips".
2. Why this matters in real systems
Memory management becomes a critical concern when systems operate outside the happy path assumptions of default GC behavior.
Typical pressure points:
Sustained throughput: A payment processing API handling 50,000 requests/minute starts experiencing intermittent 300ms spikes. Profiling reveals Gen2 collections blocking request threads. The default workstation GC cannot keep pace with allocation rate.
Container limits: A microservice deployed with a 512MB memory limit crashes with OutOfMemoryException during traffic bursts, despite average memory usage of 280MB. The GC cannot reclaim memory fast enough when allocation spikes occur within cgroup constraints.
Long-running processes: A background job processor runs for 72+ hours, gradually slowing from 1000 items/sec to 200 items/sec. LOH fragmentation prevents allocation of large buffers, forcing expensive Gen2 compactions.
Latency-sensitive workloads: A real-time notification service maintains WebSocket connections to 100K clients. GC pauses exceeding 50ms cause connection timeouts and reconnection storms.
What tends to break when this is ignored:
- P99 latency degrades by 10-100x during GC pressure
- Memory costs increase 2-4x due to conservative allocation patterns
- Systems fail unpredictably under load spikes that shouldn't be fatal
- Kubernetes kills pods for exceeding memory limits during otherwise healthy operation
Simpler approaches stop working when allocation rate exceeds ~500MB/sec, when object lifetimes span multiple GC generations, or when heap size approaches container memory limits.
3. Core concept (mental model)
The .NET GC is a generational, compacting collector with an optional background (concurrent) mode for full collections. Think of it as a conveyor belt system with three stages.
The conveyor belt model:
[Stage 0: Fast Lane]      →   [Stage 1: Middle Belt]       →   [Stage 2: Long-Term Storage]
Gen0 (Eden)                   Gen1 (Survivor)                  Gen2 (Tenured)
Collect every                 Collect every                    Collect only when
~1-2MB allocated              ~10-20 Gen0 cycles               pressure builds
Most objects die              Intermediate objects             Long-lived objects
immediately                   promoted if they survive         live here indefinitely
Objects start in Gen0. If they survive a Gen0 collection, they're promoted to Gen1. Surviving Gen1 moves them to Gen2. The GC assumes most objects die young (generational hypothesis), so it collects Gen0 frequently and cheaply.
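A quick way to see promotion in action is GC.GetGeneration, which reports the generation an object currently lives in. A minimal sketch (the forced collections are for demonstration only, and exact output can vary by runtime and GC mode):

var obj = new object();
Console.WriteLine(GC.GetGeneration(obj)); // 0: freshly allocated in Gen0

GC.Collect(0);                            // force a Gen0 collection (demo only, never in production code)
Console.WriteLine(GC.GetGeneration(obj)); // 1: survived Gen0, promoted to Gen1

GC.Collect(1);                            // force a Gen0+Gen1 collection
Console.WriteLine(GC.GetGeneration(obj)); // 2: survived Gen1, promoted to Gen2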
The Large Object Heap (LOH) is a separate conveyor:
Objects ≥85,000 bytes bypass generational collection entirely and go directly to a special heap that is collected along with Gen2 but not compacted by default. This avoids copying costs for large buffers but introduces fragmentation risk.
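Because LOH objects are treated as part of Gen2, you can observe the threshold directly with GC.GetGeneration; a small sketch (exact threshold behavior can vary slightly by runtime version):

var small = new byte[80_000];  // below the ~85,000-byte threshold: normal Gen0 allocation
var large = new byte[90_000];  // above the threshold: allocated straight onto the LOH

Console.WriteLine(GC.GetGeneration(small)); // typically 0
Console.WriteLine(GC.GetGeneration(large)); // typically 2, because the LOH is collected with Gen2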
Key mental model:
- Gen0/Gen1 collections are fast (sub-millisecond) because they scan small memory regions
- Gen2 collections are expensive (10-100ms+) because they scan the entire heap
- The GC tries to avoid Gen2 collections as long as possible
- Your allocation patterns determine how often expensive collections occur
By the end of this section, you should think: "The GC is optimized for short-lived objects. My job is to align allocation patterns with this assumption or explicitly manage what doesn't fit."
4. How it works (step-by-step)
Step 1 — Allocation request
When code executes new MyObject(), the CLR checks if Gen0 has enough contiguous space. Gen0 uses a simple bump allocator — just increment a pointer. This is why managed allocation can be faster than malloc.
Why this exists: Bump allocation is O(1) and cache-friendly. The GC assumes you'll allocate many objects quickly, so it optimizes for throughput over fragmentation.
Invariant: Gen0 must always have space. If not, trigger a collection.
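A conceptual sketch of the fast path, not the real CLR implementation, just the shape of bump allocation:

// Conceptual model only: a "Gen0 segment" where allocation is a bounds check plus a pointer bump.
class BumpAllocator
{
    private readonly byte[] _segment = new byte[2 * 1024 * 1024]; // illustrative Gen0 budget
    private int _next;

    public bool TryAllocate(int size, out int offset)
    {
        if (_next + size > _segment.Length)
        {
            offset = -1;
            return false;      // "Gen0" is full: the real runtime would trigger a collection here
        }
        offset = _next;
        _next += size;         // O(1): just bump the pointer
        return true;
    }
}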
Step 2 — Gen0 collection trigger
When Gen0 fills (typically 1-2MB threshold), the GC pauses application threads and scans live objects. This is a "stop-the-world" event, but the pause is usually <1ms because Gen0 is small.
Why this exists: The GC can only reclaim memory by identifying dead objects. It must stop mutation to avoid concurrent modification issues during graph traversal.
Assumption: Most Gen0 objects are already dead. Collection finds few survivors to copy.
Step 3 — Mark and promote
The GC traces from GC roots (stack variables, static fields, active threads) to mark live objects. Survivors are copied to Gen1 or Gen2 depending on promotion policy. Dead objects are implicitly freed by not being copied.
Why copying instead of mark-sweep: Copying compacts memory, preventing fragmentation and keeping allocation a simple pointer bump in the young generations.
Step 4 — Gen1 and Gen2 collections
Gen1 collections trigger after ~10-20 Gen0 cycles. Gen2 triggers only when memory pressure builds or explicit GC.Collect() is called. Gen2 collection scans the entire managed heap.
Why this exists: The generational hypothesis holds — very few objects live long enough to reach Gen2. By deferring Gen2 collection, the GC amortizes the cost over many fast Gen0 collections.
Invariant: Gen2 size grows until memory pressure forces a full collection.
LOH allocation path
Step 5 — Large object handling
Objects ≥85KB bypass Gen0-Gen2 and go directly to the LOH. The LOH does not compact by default (expensive to copy large blocks). It uses a free list allocator similar to traditional malloc.
Why this exists: Copying a 10MB buffer is too expensive to do frequently. The LOH trades fragmentation risk for allocation speed.
Fragmentation scenario: Allocate and free a 90MB buffer, then request 100MB. The 100MB allocation cannot fit in the freed 90MB slot, so the heap grows even though free space exists. Over hours of mixed-size allocations the LOH accumulates holes too small to reuse, and the waste persists until the LOH is explicitly compacted.
Server GC vs Workstation GC
Workstation GC:
- Single dedicated GC thread
- Optimized for low latency in interactive apps
- Smaller Gen0 budget
Server GC:
- Multiple GC threads (one per logical processor)
- Larger Gen0 budget per heap
- Higher throughput, higher pause times
- Each logical core gets its own Gen0/Gen1/Gen2 heap segment
Why two modes: UI apps need responsiveness (short pauses). Servers need throughput (more work between pauses).
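One common way to select the mode is runtimeconfig.json (or the equivalent ServerGarbageCollection MSBuild property). A hedged example; note that ASP.NET Core project templates already default to server GC:

{
  "runtimeOptions": {
    "configProperties": {
      "System.GC.Server": true,
      "System.GC.Concurrent": true
    }
  }
}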
5. Minimal, realistic example
Scenario: High-throughput API with memory pooling
This ASP.NET Core API processes image uploads. Without pooling, each request allocates a 1MB byte array, causing frequent Gen2 collections under load.
// Bad: Allocates 1MB per request, Gen2 collections every ~500 requests
public class ImageProcessor
{
    public async Task<byte[]> ProcessImageAsync(Stream input)
    {
        var buffer = new byte[1024 * 1024]; // 1MB allocation: every request puts a new buffer on the LOH
        int bytesRead = await input.ReadAsync(buffer, 0, buffer.Length);
        // ... process the first bytesRead bytes ...
        return CompressImage(buffer);
    }
}
// Good: Pools buffers, reduces Gen2 pressure
public class ImageProcessor
{
    private static readonly ArrayPool<byte> _bufferPool = ArrayPool<byte>.Shared;

    public async Task<byte[]> ProcessImageAsync(Stream input)
    {
        var buffer = _bufferPool.Rent(1024 * 1024); // Rent from pool; the returned array may be larger than requested
        try
        {
            int bytesRead = await input.ReadAsync(buffer, 0, buffer.Length);
            // ... process the first bytesRead bytes ...
            return CompressImage(buffer);
        }
        finally
        {
            _bufferPool.Return(buffer); // Return to pool so the next request can reuse it
        }
    }
}
How this maps to the concept:
- Without pooling: Every request creates a 1MB byte array. Since 1MB > 85KB, it goes to the LOH. At 1000 req/sec, you allocate 1GB/sec to the LOH, forcing frequent Gen2 collections.
- With pooling: Buffers are reused. Only the initial allocations hit the LOH. Subsequent requests rent existing buffers, eliminating allocation pressure. Gen2 collections drop from every ~10 seconds to every ~10 minutes.
Production impact: At 5000 req/sec, this change reduced P99 latency from 450ms to 12ms and cut memory usage from 4GB to 800MB.
6. Design trade-offs
| Approach | Throughput | Latency (P99) | Memory Overhead | Complexity | Failure Mode |
|---|---|---|---|---|---|
| Default GC (Workstation) | Medium | Low (optimized) | Low | Minimal | Degrades under sustained load |
| Server GC | High | Medium-High | High (per-core heaps) | Minimal | Long pauses under memory pressure |
| Object Pooling | Very High | Very Low | Medium (pool overhead) | Medium | Memory leaks if pools grow unbounded |
| Struct + Stackalloc | Very High | Very Low | None (stack) | High (unsafe code) | Stack overflow if misused |
| Span<T> + Memory<T> | High | Low | Low | Medium | Requires careful lifetime management |
| Manual GC Tuning | Variable | Variable | Variable | High | Fragile across runtime versions |
What you gain:
- Pooling: Eliminates allocation cost, predictable latency
- Server GC: Higher throughput in multi-core environments
- Span<T>: Zero-copy slicing, reduced allocations (see the sketch at the end of this section)
What you give up:
- Pooling: Must manage pool size, risk memory leaks
- Server GC: Higher pause times, more memory overhead
- Span<T>: Cannot be stored in fields, async-hostile
What you're implicitly accepting:
- Pooling: Retained memory even when load drops
- Server GC: GC pause times may exceed 100ms during Gen2 collections
- Manual tuning: Behavior may change across .NET versions
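As a concrete illustration of the Span<T> and stackalloc rows above, a minimal sketch (requires a span-aware runtime, .NET Core 2.1 or later; names and sizes are illustrative):

// Sum comma-separated integers without allocating any intermediate strings.
static int SumCsv(ReadOnlySpan<char> line)
{
    int total = 0;
    while (!line.IsEmpty)
    {
        int comma = line.IndexOf(',');
        ReadOnlySpan<char> field = comma >= 0 ? line.Slice(0, comma) : line;
        total += int.Parse(field);                        // span-based overload, no Substring allocations
        line = comma >= 0 ? line.Slice(comma + 1) : ReadOnlySpan<char>.Empty;
    }
    return total;
}

// Small, short-lived buffers can live on the stack and never touch the GC heap.
Span<byte> scratch = stackalloc byte[256];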
7. Common mistakes and misconceptions
"Calling GC.Collect() to "help" the GC"
Why it happens:
- Developers see high memory usage and think forcing a collection will help.
Problem caused:
- GC.Collect() triggers a full Gen2 collection, pausing all threads for 50-500ms. This destroys throughput and causes latency spikes. The GC is tuned to collect optimally based on allocation pressure — forcing it undermines heuristics.
Avoidance:
- Never call GC.Collect() in production code except in controlled scenarios (e.g., after bulk data import, before snapshot for diagnostics). Trust the GC.
Assuming "more memory = faster"
Why it happens:
- If the system has 32GB RAM, why not use it?
Problem caused:
- Larger heaps mean larger Gen2 collections. A 16GB heap takes seconds to scan during full GC. This increases worst-case latency.
Avoidance:
- Tune GCHeapHardLimit or container limits to force more frequent Gen2 collections at smaller heap sizes. Counterintuitively, less memory can mean better P99 latency.
Ignoring LOH fragmentation
Why it happens:
- Developers allocate large buffers without considering fragmentation.
Problem caused:
- After hours of operation, LOH becomes fragmented. New allocations fail even when total free memory is available, forcing expensive Gen2 compactions or OutOfMemoryException.
Avoidance:
- Use
ArrayPool<T>for buffers >85KB. Enable LOH compaction explicitly when needed (GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce).
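The one-time compaction mentioned above looks like this in code (a hedged sketch: the setting applies to the next blocking full collection and then resets, so reserve it for controlled moments such as after a bulk load):

using System.Runtime;

GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
GC.Collect(); // the next blocking Gen2 collection compacts the LOH, then the mode resets to Default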
Misunderstanding "managed memory" in monitoring
Why it happens:
- Monitoring shows "managed memory = 400MB" but container is using 1.2GB.
Problem caused:
- GC heap size != process memory. Native allocations (P/Invoke, unmanaged libraries, runtime overhead) don't appear in GC metrics. Containers get killed for exceeding limits while GC thinks there's headroom.
Avoidance:
- Monitor the process working set (WorkingSet64), not just GC heap size, as sketched below. Set GCHeapHardLimit to leave 20-30% headroom for native allocations.
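A minimal sketch of comparing the two numbers from inside the process (the alerting thresholds are yours to choose):

using System;
using System.Diagnostics;

long gcHeapBytes     = GC.GetTotalMemory(forceFullCollection: false);
long workingSetBytes = Process.GetCurrentProcess().WorkingSet64; // what the OS and the container actually see

Console.WriteLine($"GC heap:     {gcHeapBytes / (1024 * 1024)} MB");
Console.WriteLine($"Working set: {workingSetBytes / (1024 * 1024)} MB");
// A growing gap points at native allocations (P/Invoke, unmanaged libraries, runtime overhead).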
Over-pooling
Why it happens:
- "Pooling is good, so pool everything."
Problem caused:
- Pools that never shrink become memory leaks. A traffic spike causes pool to grow to 10GB, then traffic drops but memory stays allocated.
Avoidance:
- Only pool objects that are (a) expensive to allocate, (b) allocated frequently, and (c) have predictable lifetimes. Implement pool trimming logic (e.g., trim to 80% capacity if idle for 60s).
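A hedged sketch of those rules as code; the class name, capacity, and trim policy are illustrative rather than a library API:

using System.Collections.Concurrent;

sealed class BoundedPool<T> where T : class, new()
{
    private readonly ConcurrentBag<T> _items = new();
    private readonly int _maxSize;

    public BoundedPool(int maxSize) => _maxSize = maxSize;

    public T Rent() => _items.TryTake(out var item) ? item : new T();

    public void Return(T item)
    {
        if (_items.Count < _maxSize)
            _items.Add(item);      // drop excess instead of growing without bound
    }

    // Call periodically (e.g. from a timer) when the pool has been idle for a while.
    public void Trim(double keepFraction = 0.8)
    {
        int target = (int)(_items.Count * keepFraction);
        while (_items.Count > target && _items.TryTake(out _)) { }
    }
}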
Mixing long-lived and short-lived objects
Why it happens:
- Caching objects that reference short-lived data.
Problem caused:
- Short-lived objects get promoted to Gen2 because they're referenced by long-lived cache entries. Gen2 fills with garbage that should've died in Gen0.
Avoidance:
- Use WeakReference<T> for caches, as sketched below. Break reference chains between long-lived and short-lived data. Consider separate caches for different lifetime categories.
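A minimal sketch of the WeakReference<T> pattern; LoadReport is a hypothetical stand-in for your own expensive load:

using System;
using System.Collections.Generic;

var cache = new Dictionary<string, WeakReference<byte[]>>();

byte[] GetReport(string key)
{
    if (cache.TryGetValue(key, out var weak) && weak.TryGetTarget(out var data))
        return data;                                  // still alive: the GC has not reclaimed it yet

    data = LoadReport(key);                           // hypothetical expensive load
    cache[key] = new WeakReference<byte[]>(data);     // cache it without extending its lifetime
    return data;
}

byte[] LoadReport(string key) => new byte[1024];      // placeholder for the real loader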
8. Operational and production considerations
What to monitor
Critical metrics:
% Time in GC // Should be <5%. Above 10% indicates GC pressure
Gen0/Gen1/Gen2 count // Gen2 should be rare (<1/min under normal load)
Gen2 heap size // Tracks memory growth. Alert if growing unbounded
LOH size // Fragmentation risk. Should be stable or grow slowly
Allocation rate (MB/sec) // Baseline normal. Spikes indicate allocation bugs
GC pause time (P50/P99) // Should align with latency SLOs
In .NET, collect via:
GC.CollectionCount(0) // Gen0 collections
GC.CollectionCount(1) // Gen1 collections
GC.CollectionCount(2) // Gen2 collections
GC.GetTotalMemory(false) // Heap size without forcing collection
GC.GetGCMemoryInfo() // Detailed heap stats
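On recent .NET versions, GC.GetGCMemoryInfo() exposes several of the metrics above in a single call; a short sketch:

var info = GC.GetGCMemoryInfo();
Console.WriteLine($"Heap size:        {info.HeapSizeBytes / (1024 * 1024)} MB");
Console.WriteLine($"Fragmented:       {info.FragmentedBytes / (1024 * 1024)} MB");
Console.WriteLine($"Pause time (%):   {info.PauseTimePercentage}");
Console.WriteLine($"Gen2 collections: {GC.CollectionCount(2)}");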
For production observability:
- EventSource events: the Microsoft-Windows-DotNETRuntime provider
- PerfView traces for allocation stacks
- dotnet-counters for live metrics: dotnet-counters monitor -p <pid> --counters System.Runtime
What degrades first
Under memory pressure:
- P99 latency spikes (Gen2 collections)
- Throughput drops (more time in GC)
- Memory fragmentation (LOH cannot allocate)
- OutOfMemoryException (terminal state)
Under high allocation rate:
- Gen0 collection frequency increases (OK if pauses stay short)
- Gen1 starts filling faster (objects promoted before dying)
- Gen2 collections become frequent (promotion pressure)
- System becomes unresponsive (constant GC pauses)
What becomes expensive
In cloud environments:
- Memory costs scale linearly with heap size
- CPU costs increase with GC overhead (% time in GC)
- Latency SLO violations trigger autoscaling, increasing costs
In Kubernetes:
- Pods killed for exceeding memory limits (OOMKilled)
- Horizontal scaling triggered by false memory pressure (GC hasn't run yet)
- Node memory fragmentation if many pods GC simultaneously
Operational risks
Risk: GC pauses during critical operations
Transaction commits, database writes, or external API calls that occur during GC pause will timeout.
Mitigation:
- Use GCSettings.IsServerGC to check which GC mode is active
- Set System.GC.Concurrent to true in runtimeconfig.json to enable background GC
- Implement retry logic with exponential backoff
Risk: Memory leaks in pooled resources
Pools that grow but never shrink consume memory indefinitely.
Mitigation:
- Implement pool trimming (e.g., ArrayPool has built-in trimming)
- Monitor pool metrics separately
- Set max pool sizes based on expected concurrency
Risk: Container OOM kills
GC doesn't know about cgroup limits. It assumes all system memory is available.
Mitigation:
// In runtimeconfig.json (set one of the two limits, not both)
{
  "runtimeOptions": {
    "configProperties": {
      "System.GC.HeapHardLimit": 536870912,      // absolute cap: 512MB in bytes
      "System.GC.HeapHardLimitPercent": 75       // or: 75% of the container limit
    }
  }
}
Observability signals
Green (healthy):
- Gen2 collections <1/minute
- % Time in GC <3%
- Allocation rate stable
- P99 latency meets SLO
Yellow (watch):
- Gen2 collections 1-5/minute
- % Time in GC 3-8%
- LOH size growing >10% per hour
- P99 latency approaching SLO
Red (critical):
- Gen2 collections >10/minute
- % Time in GC >10%
- Allocation rate spiking (>2x baseline)
- OutOfMemoryException events
- P99 latency violating SLO
9. When NOT to use this
Don't over-optimize for GC in these scenarios:
- Low-throughput applications (<100 req/sec): Default GC settings handle this trivially. Pooling and tuning add complexity without measurable benefit.
- Short-lived processes (batch jobs <5 minutes): Process exits before Gen2 pressure builds. Memory leaks don't matter. Focus on correctness.
- Memory-unconstrained environments: If you have 128GB RAM and use 2GB, GC tuning is premature. Optimize when resource limits become visible.
- Development and testing: Default settings expose more bugs. Production-tuned GC can hide memory leaks during development.
- UI applications with <1000 objects: Workstation GC is already optimized for this. Server GC would increase latency.
Harmful scenarios:
- Prematurely pooling everything: Pools add complexity and can introduce use-after-return bugs. Pool only after profiling shows allocation is actually a bottleneck.
- Manually tuning GC without profiling: Changing GC knobs based on intuition often makes things worse. The default heuristics are well-tuned for most workloads.
- Disabling concurrent GC: Setting System.GC.Concurrent to false eliminates background collection, increasing pause times. Only disable it if profiling proves background GC threads interfere with the workload (extremely rare).
- Using GC.Collect() as a "fix": If you need to call GC.Collect() to prevent OOM, you have a memory leak, not a GC tuning problem.
When simple alternatives suffice:
| Problem | Overkill | Sufficient |
|---|---|---|
| High allocation rate from LINQ queries | Custom pooling, unsafe code | Rewrite hot paths to use for loops |
| Large objects causing LOH fragmentation | Custom memory manager | ArrayPool<byte>.Shared |
| Container OOM kills | Rewrite in unmanaged C++ | Set GCHeapHardLimit to 75% of container memory |
10. Key takeaways
- The GC optimizes for short-lived objects. Align allocation patterns with this assumption: allocate in Gen0, die in Gen0. Long-lived objects should be initialized once and reused.
- Gen2 collections are the enemy of latency. Every design decision should be evaluated through the lens of "does this increase Gen2 pressure?" Pooling, struct usage, and Span<T> all reduce Gen2 collections.
- LOH fragmentation is silent until it kills you. Large buffer allocations (>85KB) bypass generational GC and fragment over time. ArrayPool prevents this. Monitor LOH size — if it grows unbounded, you have a problem.
- Server GC trades latency for throughput. Use it in server workloads, but expect 50-200ms pauses during Gen2 collections. If P99 latency matters more than throughput, workstation GC may be better.
- Container memory limits require explicit GC configuration. The GC doesn't know about cgroups. Set GCHeapHardLimit to 70-80% of container memory or Kubernetes will kill your pods during traffic spikes.
- Profile before optimizing. PerfView, dotnet-trace, and dotnet-counters show actual allocation stacks and GC behavior. Intuition about "expensive" operations is often wrong. Measure first.
- GC tuning is a last resort, not a first step. Fix allocation patterns (pooling, Span<T>, avoiding closures) before touching GC knobs. The default settings are good. Custom tuning is fragile across runtime versions and workload changes.
11. High-Level Overview
Visual representation of .NET memory management, highlighting stack-scoped execution, heap allocation, GC root discovery, survivorship, and pressure-driven reclamation.