Memory Management
1. What this document is about
This document addresses memory allocation, garbage collection behavior, and resource lifecycle management in .NET applications operating under sustained production load. It focuses on the CLR's memory subsystem and how allocation patterns, GC tuning, and object lifetime decisions affect system throughput, latency distribution, and operational cost.
It applies to:
- Server-side APIs handling concurrent requests
- Background processors with long-running operations
- Cloud-native workloads in containerized environments
- Applications with strict P99 latency requirements
- Systems experiencing memory pressure or LOH fragmentation
It does not apply to:
- Unmanaged memory management in C/C++
- Memory safety vulnerabilities (buffer overflows, use-after-free)
- Basic .NET programming concepts
- Language-specific features unrelated to runtime behavior
This document is about reasoning and trade-offs, not about micro-optimizations or cargo-cult "performance tips".
2. Why this matters in real systems
Memory management becomes a critical concern when systems operate outside the happy path assumptions of default GC behavior.
Typical pressure points:
Sustained throughput: A payment processing API handling 50,000 requests/minute starts experiencing intermittent 300ms spikes. Profiling reveals Gen2 collections blocking request threads. The default workstation GC cannot keep pace with allocation rate.
Container limits: A microservice deployed with a 512MB memory limit crashes with OutOfMemoryException during traffic bursts, despite average memory usage of 280MB. The GC cannot reclaim memory fast enough when allocation spikes occur within cgroup constraints.
Long-running processes: A background job processor runs for 72+ hours, gradually slowing from 1000 items/sec to 200 items/sec. LOH fragmentation prevents allocation of large buffers, forcing expensive Gen2 compactions.
Latency-sensitive workloads: A real-time notification service maintains WebSocket connections to 100K clients. GC pauses exceeding 50ms cause connection timeouts and reconnection storms.
What tends to break when this is ignored:
- P99 latency degrades by 10-100x during GC pressure
- Memory costs increase 2-4x due to conservative allocation patterns
- Systems fail unpredictably under load spikes that shouldn't be fatal
- Kubernetes kills pods for exceeding memory limits during otherwise healthy operation
Simpler approaches stop working when allocation rate exceeds ~500MB/sec, when object lifetimes span multiple GC generations, or when heap size approaches container memory limits.
3. Core concept (mental model)
The .NET GC is a generational, compacting collector with an optional background (concurrent) mode for full collections. Think of it as a conveyor belt system with three stages.
The conveyor belt model:
[Stage 0: Fast Lane]      →   [Stage 1: Middle Belt]       →   [Stage 2: Long-Term Storage]
Gen0 (Eden)                   Gen1 (Survivor)                  Gen2 (Tenured)
Collect every                 Collect every                    Collect only when
~1-2MB allocated              ~10-20 Gen0 cycles               pressure builds
Most objects die              Intermediate objects             Long-lived objects
immediately                   promoted if they survive         live here indefinitely
Objects start in Gen0. If they survive a Gen0 collection, they're promoted to Gen1. Surviving Gen1 moves them to Gen2. The GC assumes most objects die young (generational hypothesis), so it collects Gen0 frequently and cheaply.
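A quick way to see promotion in action is GC.GetGeneration, which reports the generation an object currently lives in. A minimal sketch (the forced collections are for demonstration only, and exact output can vary by runtime and GC mode):

var obj = new object();
Console.WriteLine(GC.GetGeneration(obj)); // 0: freshly allocated in Gen0

GC.Collect(0);                            // force a Gen0 collection (demo only, never in production code)
Console.WriteLine(GC.GetGeneration(obj)); // 1: survived Gen0, promoted to Gen1

GC.Collect(1);                            // force a Gen0+Gen1 collection
Console.WriteLine(GC.GetGeneration(obj)); // 2: survived Gen1, promoted to Gen2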
The Large Object Heap (LOH) is a separate conveyor:
Objects ≥85,000 bytes bypass generational collection entirely and go directly to a special heap that is collected along with Gen2 but not compacted by default. This avoids copying costs for large buffers but introduces fragmentation risk.
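Because LOH objects are treated as part of Gen2, you can observe the threshold directly with GC.GetGeneration; a small sketch (exact threshold behavior can vary slightly by runtime version):

var small = new byte[80_000];  // below the ~85,000-byte threshold: normal Gen0 allocation
var large = new byte[90_000];  // above the threshold: allocated straight onto the LOH

Console.WriteLine(GC.GetGeneration(small)); // typically 0
Console.WriteLine(GC.GetGeneration(large)); // typically 2, because the LOH is collected with Gen2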
Key mental model:
- Gen0/Gen1 collections are fast (sub-millisecond) because they scan small memory regions
- Gen2 collections are expensive (10-100ms+) because they scan the entire heap
- The GC tries to avoid Gen2 collections as long as possible
- Your allocation patterns determine how often expensive collections occur
By the end of this section, you should think: "The GC is optimized for short-lived objects. My job is to align allocation patterns with this assumption or explicitly manage what doesn't fit."
4. How it works (step-by-step)
Step 1 — Allocation request
When code executes new MyObject(), the CLR checks if Gen0 has enough contiguous space. Gen0 uses a simple bump allocator — just increment a pointer. This is why managed allocation can be faster than malloc.
Why this exists: Bump allocation is O(1) and cache-friendly. The GC assumes you'll allocate many objects quickly, so it optimizes for throughput over fragmentation.
Invariant: Gen0 must always have space. If not, trigger a collection.
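A conceptual sketch of the fast path, not the real CLR implementation, just the shape of bump allocation:

// Conceptual model only: a "Gen0 segment" where allocation is a bounds check plus a pointer bump.
class BumpAllocator
{
    private readonly byte[] _segment = new byte[2 * 1024 * 1024]; // illustrative Gen0 budget
    private int _next;

    public bool TryAllocate(int size, out int offset)
    {
        if (_next + size > _segment.Length)
        {
            offset = -1;
            return false;      // "Gen0" is full: the real runtime would trigger a collection here
        }
        offset = _next;
        _next += size;         // O(1): just bump the pointer
        return true;
    }
}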
Step 2 — Gen0 collection trigger
When Gen0 fills (typically 1-2MB threshold), the GC pauses application threads and scans live objects. This is a "stop-the-world" event, but the pause is usually <1ms because Gen0 is small.
Why this exists: The GC can only reclaim memory by identifying dead objects. It must stop mutation to avoid concurrent modification issues during graph traversal.
Assumption: Most Gen0 objects are already dead. Collection finds few survivors to copy.
Step 3 — Mark and promote
The GC traces from GC roots (stack variables, static fields, active threads) to mark live objects. Survivors are copied to Gen1 or Gen2 depending on promotion policy. Dead objects are implicitly freed by not being copied.
Why copying instead of mark-sweep: Copying compacts memory, preventing fragmentation and keeping allocation a simple pointer bump in the young generations.
Step 4 — Gen1 and Gen2 collections
Gen1 collections trigger after ~10-20 Gen0 cycles. Gen2 triggers only when memory pressure builds or explicit GC.Collect() is called. Gen2 collection scans the entire managed heap.
Why this exists: The generational hypothesis holds — very few objects live long enough to reach Gen2. By deferring Gen2 collection, the GC amortizes the cost over many fast Gen0 collections.
Invariant: Gen2 size grows until memory pressure forces a full collection.
LOH allocation path
Step 5 — Large object handling
Objects ≥85KB bypass Gen0-Gen2 and go directly to the LOH. The LOH does not compact by default (expensive to copy large blocks). It uses a free list allocator similar to traditional malloc.
Why this exists: Copying a 10MB buffer is too expensive to do frequently. The LOH trades fragmentation risk for allocation speed.
Fragmentation scenario: Allocate and free a 90MB buffer, then request 100MB. The 100MB allocation cannot fit in the freed 90MB slot, so the heap grows even though free space exists. Over hours of mixed-size allocations the LOH accumulates holes too small to reuse, and the waste persists until the LOH is explicitly compacted.
Server GC vs Workstation GC
Workstation GC:
- Single dedicated GC thread
- Optimized for low latency in interactive apps
- Smaller Gen0 budget
Server GC:
- Multiple GC threads (one per logical processor)
- Larger Gen0 budget per heap
- Higher throughput, higher pause times
- Each logical core gets its own Gen0/Gen1/Gen2 heap segment
Why two modes: UI apps need responsiveness (short pauses). Servers need throughput (more work between pauses).
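One common way to select the mode is runtimeconfig.json (or the equivalent ServerGarbageCollection MSBuild property). A hedged example; note that ASP.NET Core project templates already default to server GC:

{
  "runtimeOptions": {
    "configProperties": {
      "System.GC.Server": true,
      "System.GC.Concurrent": true
    }
  }
}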
5. Minimal, realistic example
Scenario: High-throughput API with memory pooling
This ASP.NET Core API processes image uploads. Without pooling, each request allocates a 1MB byte array, causing frequent Gen2 collections under load.
// Bad: Allocates 1MB per request, Gen2 collections every ~500 requests
public class ImageProcessor
{
    public async Task<byte[]> ProcessImageAsync(Stream input)
    {
        var buffer = new byte[1024 * 1024]; // 1MB allocation: every request puts a new buffer on the LOH
        int bytesRead = await input.ReadAsync(buffer, 0, buffer.Length);
        // ... process the first bytesRead bytes ...
        return CompressImage(buffer);
    }
}
// Good: Pools buffers, reduces Gen2 pressure
public class ImageProcessor
{
    private static readonly ArrayPool<byte> _bufferPool = ArrayPool<byte>.Shared;

    public async Task<byte[]> ProcessImageAsync(Stream input)
    {
        var buffer = _bufferPool.Rent(1024 * 1024); // Rent from pool; the returned array may be larger than requested
        try
        {
            int bytesRead = await input.ReadAsync(buffer, 0, buffer.Length);
            // ... process the first bytesRead bytes ...
            return CompressImage(buffer);
        }
        finally
        {
            _bufferPool.Return(buffer); // Return to pool so the next request can reuse it
        }
    }
}
How this maps to the concept:
- Without pooling: Every request creates a 1MB byte array. Since 1MB > 85KB, it goes to the LOH. At 1000 req/sec, you allocate 1GB/sec to the LOH, forcing frequent Gen2 collections.
- With pooling: Buffers are reused. Only the initial allocations hit the LOH. Subsequent requests rent existing buffers, eliminating allocation pressure. Gen2 collections drop from every ~10 seconds to every ~10 minutes.
Production impact: At 5000 req/sec, this change reduced P99 latency from 450ms to 12ms and cut memory usage from 4GB to 800MB.
6. Design trade-offs
| Approach | Throughput | Latency (P99) | Memory Overhead | Complexity | Failure Mode |
|---|---|---|---|---|---|
| Default GC (Workstation) | Medium | Low (optimized) | Low | Minimal | Degrades under sustained load |
| Server GC | High | Medium-High | High (per-core heaps) | Minimal | Long pauses under memory pressure |
| Object Pooling | Very High | Very Low | Medium (pool overhead) | Medium | Memory leaks if pools grow unbounded |
| Struct + Stackalloc | Very High | Very Low | None (stack) | High (unsafe code) | Stack overflow if misused |
| Span<T> + Memory<T> | High | Low | Low | Medium | Requires careful lifetime management |
| Manual GC Tuning | Variable | Variable | Variable | High | Fragile across runtime versions |
What you gain:
- Pooling: Eliminates allocation cost, predictable latency
- Server GC: Higher throughput in multi-core environments
- Span<T>: Zero-copy slicing, reduced allocations (see the sketch at the end of this section)
What you give up:
- Pooling: Must manage pool size, risk memory leaks
- Server GC: Higher pause times, more memory overhead
- Span<T>: Cannot be stored in fields, async-hostile
What you're implicitly accepting:
- Pooling: Retained memory even when load drops
- Server GC: GC pause times may exceed 100ms during Gen2 collections
- Manual tuning: Behavior may change across .NET versions
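As a concrete illustration of the Span<T> and stackalloc rows above, a minimal sketch (requires a span-aware runtime, .NET Core 2.1 or later; names and sizes are illustrative):

// Sum comma-separated integers without allocating any intermediate strings.
static int SumCsv(ReadOnlySpan<char> line)
{
    int total = 0;
    while (!line.IsEmpty)
    {
        int comma = line.IndexOf(',');
        ReadOnlySpan<char> field = comma >= 0 ? line.Slice(0, comma) : line;
        total += int.Parse(field);                        // span-based overload, no Substring allocations
        line = comma >= 0 ? line.Slice(comma + 1) : ReadOnlySpan<char>.Empty;
    }
    return total;
}

// Small, short-lived buffers can live on the stack and never touch the GC heap.
Span<byte> scratch = stackalloc byte[256];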
7. Common mistakes and misconceptions
"Calling GC.Collect() to "help" the GC"
Why it happens:
- Developers see high memory usage and think forcing a collection will help.
Problem caused:
- GC.Collect() triggers a full Gen2 collection, pausing all threads for 50-500ms. This destroys throughput and causes latency spikes. The GC is tuned to collect optimally based on allocation pressure — forcing it undermines heuristics.
Avoidance:
- Never call GC.Collect() in production code except in controlled scenarios (e.g., after bulk data import, before snapshot for diagnostics). Trust the GC.
Assuming "more memory = faster"
Why it happens:
- If the system has 32GB RAM, why not use it?
Problem caused:
- Larger heaps mean larger Gen2 collections. A 16GB heap takes seconds to scan during full GC. This increases worst-case latency.
Avoidance:
- Tune GCHeapHardLimit or container limits to force more frequent Gen2 collections at smaller heap sizes. Counterintuitively, less memory can mean better P99 latency.
Ignoring LOH fragmentation
Why it happens:
- Developers allocate large buffers without considering fragmentation.
Problem caused:
- After hours of operation, LOH becomes fragmented. New allocations fail even when total free memory is available, forcing expensive Gen2 compactions or OutOfMemoryException.
Avoidance:
- Use
ArrayPool<T>for buffers >85KB. Enable LOH compaction explicitly when needed (GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce).
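The one-time compaction mentioned above looks like this in code (a hedged sketch: the setting applies to the next blocking full collection and then resets, so reserve it for controlled moments such as after a bulk load):

using System.Runtime;

GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
GC.Collect(); // the next blocking Gen2 collection compacts the LOH, then the mode resets to Default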
Misunderstanding "managed memory" in monitoring
Why it happens:
- Monitoring shows "managed memory = 400MB" but container is using 1.2GB.
Problem caused:
- GC heap size != process memory. Native allocations (P/Invoke, unmanaged libraries, runtime overhead) don't appear in GC metrics. Containers get killed for exceeding limits while GC thinks there's headroom.
Avoidance:
- Monitor the process working set (WorkingSet64), not just GC heap size, as sketched below. Set GCHeapHardLimit to leave 20-30% headroom for native allocations.
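A minimal sketch of comparing the two numbers from inside the process (the alerting thresholds are yours to choose):

using System;
using System.Diagnostics;

long gcHeapBytes     = GC.GetTotalMemory(forceFullCollection: false);
long workingSetBytes = Process.GetCurrentProcess().WorkingSet64; // what the OS and the container actually see

Console.WriteLine($"GC heap:     {gcHeapBytes / (1024 * 1024)} MB");
Console.WriteLine($"Working set: {workingSetBytes / (1024 * 1024)} MB");
// A growing gap points at native allocations (P/Invoke, unmanaged libraries, runtime overhead).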
Over-pooling
Why it happens:
- "Pooling is good, so pool everything."
Problem caused:
- Pools that never shrink become memory leaks. A traffic spike causes pool to grow to 10GB, then traffic drops but memory stays allocated.
Avoidance:
- Only pool objects that are (a) expensive to allocate, (b) allocated frequently, and (c) have predictable lifetimes. Implement pool trimming logic (e.g., trim to 80% capacity if idle for 60s).
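A hedged sketch of those rules as code; the class name, capacity, and trim policy are illustrative rather than a library API:

using System.Collections.Concurrent;

sealed class BoundedPool<T> where T : class, new()
{
    private readonly ConcurrentBag<T> _items = new();
    private readonly int _maxSize;

    public BoundedPool(int maxSize) => _maxSize = maxSize;

    public T Rent() => _items.TryTake(out var item) ? item : new T();

    public void Return(T item)
    {
        if (_items.Count < _maxSize)
            _items.Add(item);      // drop excess instead of growing without bound
    }

    // Call periodically (e.g. from a timer) when the pool has been idle for a while.
    public void Trim(double keepFraction = 0.8)
    {
        int target = (int)(_items.Count * keepFraction);
        while (_items.Count > target && _items.TryTake(out _)) { }
    }
}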
Mixing long-lived and short-lived objects
Why it happens:
- Caching objects that reference short-lived data.
Problem caused:
- Short-lived objects get promoted to Gen2 because they're referenced by long-lived cache entries. Gen2 fills with garbage that should've died in Gen0.
Avoidance:
- Use WeakReference<T> for caches, as sketched below. Break reference chains between long-lived and short-lived data. Consider separate caches for different lifetime categories.
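A minimal sketch of the WeakReference<T> pattern; LoadReport is a hypothetical stand-in for your own expensive load:

using System;
using System.Collections.Generic;

var cache = new Dictionary<string, WeakReference<byte[]>>();

byte[] GetReport(string key)
{
    if (cache.TryGetValue(key, out var weak) && weak.TryGetTarget(out var data))
        return data;                                  // still alive: the GC has not reclaimed it yet

    data = LoadReport(key);                           // hypothetical expensive load
    cache[key] = new WeakReference<byte[]>(data);     // cache it without extending its lifetime
    return data;
}

byte[] LoadReport(string key) => new byte[1024];      // placeholder for the real loader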
8. Operational and production considerations
What to monitor
Critical metrics:
% Time in GC // Should be <5%. Above 10% indicates GC pressure
Gen0/Gen1/Gen2 count // Gen2 should be rare (<1/min under normal load)
Gen2 heap size // Tracks memory growth. Alert if growing unbounded
LOH size // Fragmentation risk. Should be stable or grow slowly
Allocation rate (MB/sec) // Baseline normal. Spikes indicate allocation bugs
GC pause time (P50/P99) // Should align with latency SLOs
In .NET, collect via:
GC.CollectionCount(0) // Gen0 collections
GC.CollectionCount(1) // Gen1 collections
GC.CollectionCount(2) // Gen2 collections
GC.GetTotalMemory(false) // Heap size without forcing collection
GC.GetGCMemoryInfo() // Detailed heap stats
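On recent .NET versions, GC.GetGCMemoryInfo() exposes several of the metrics above in a single call; a short sketch:

var info = GC.GetGCMemoryInfo();
Console.WriteLine($"Heap size:        {info.HeapSizeBytes / (1024 * 1024)} MB");
Console.WriteLine($"Fragmented:       {info.FragmentedBytes / (1024 * 1024)} MB");
Console.WriteLine($"Pause time (%):   {info.PauseTimePercentage}");
Console.WriteLine($"Gen2 collections: {GC.CollectionCount(2)}");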
For production observability:
- EventSource events: the Microsoft-Windows-DotNETRuntime provider
- PerfView traces for allocation stacks
- dotnet-counters for live metrics: dotnet-counters monitor -p <pid> --counters System.Runtime
What degrades first
Under memory pressure:
- P99 latency spikes (Gen2 collections)
- Throughput drops (more time in GC)
- Memory fragmentation (LOH cannot allocate)
- OutOfMemoryException (terminal state)
Under high allocation rate:
- Gen0 collection frequency increases (OK if pauses stay short)
- Gen1 starts filling faster (objects promoted before dying)
- Gen2 collections become frequent (promotion pressure)
- System becomes unresponsive (constant GC pauses)
What becomes expensive
In cloud environments:
- Memory costs scale linearly with heap size
- CPU costs increase with GC overhead (% time in GC)
- Latency SLO violations trigger autoscaling, increasing costs
In Kubernetes:
- Pods killed for exceeding memory limits (OOMKilled)
- Horizontal scaling triggered by false memory pressure (GC hasn't run yet)
- Node memory fragmentation if many pods GC simultaneously
Operational risks
Risk: GC pauses during critical operations
Transaction commits, database writes, or external API calls that occur during GC pause will timeout.
Mitigation:
- Use GCSettings.IsServerGC to check which GC mode is active
- Set System.GC.Concurrent to true in runtimeconfig.json to enable background GC
- Implement retry logic with exponential backoff
Risk: Memory leaks in pooled resources
Pools that grow but never shrink consume memory indefinitely.
Mitigation:
- Implement pool trimming (e.g., ArrayPool has built-in trimming)
- Monitor pool metrics separately
- Set max pool sizes based on expected concurrency
Risk: Container OOM kills
GC doesn't know about cgroup limits. It assumes all system memory is available.
Mitigation:
// In runtimeconfig.json (set one of the two limits, not both)
{
  "runtimeOptions": {
    "configProperties": {
      "System.GC.HeapHardLimit": 536870912,      // absolute cap: 512MB in bytes
      "System.GC.HeapHardLimitPercent": 75       // or: 75% of the container limit
    }
  }
}
Observability signals
Green (healthy):
- Gen2 collections <1/minute
- % Time in GC <3%
- Allocation rate stable
- P99 latency meets SLO
Yellow (watch):
- Gen2 collections 1-5/minute
- % Time in GC 3-8%
- LOH size growing >10% per hour
- P99 latency approaching SLO
Red (critical):
- Gen2 collections >10/minute
- % Time in GC >10%
- Allocation rate spiking (>2x baseline)
- OutOfMemoryException events
- P99 latency violating SLO
9. When NOT to use this
Don't over-optimize for GC in these scenarios:
- Low-throughput applications (<100 req/sec): Default GC settings handle this trivially. Pooling and tuning add complexity without measurable benefit.
- Short-lived processes (batch jobs <5 minutes): Process exits before Gen2 pressure builds. Memory leaks don't matter. Focus on correctness.
- Memory-unconstrained environments: If you have 128GB RAM and use 2GB, GC tuning is premature. Optimize when resource limits become visible.
- Development and testing: Default settings expose more bugs. Production-tuned GC can hide memory leaks during development.
- UI applications with <1000 objects: Workstation GC is already optimized for this. Server GC would increase latency.
Harmful scenarios:
- Prematurely pooling everything: Pools add complexity and can introduce use-after-return bugs. Pool only after profiling shows allocation is actually a bottleneck.
- Manually tuning GC without profiling: Changing GC knobs based on intuition often makes things worse. The default heuristics are well-tuned for most workloads.
- Disabling concurrent GC: Setting System.GC.Concurrent to false eliminates background collection, increasing pause times. Only disable it if profiling proves background GC threads interfere with the workload (extremely rare).
- Using GC.Collect() as a "fix": If you need to call GC.Collect() to prevent OOM, you have a memory leak, not a GC tuning problem.
When simple alternatives suffice:
| Problem | Overkill | Sufficient |
|---|---|---|
| High allocation rate from LINQ queries | Custom pooling, unsafe code | Rewrite hot paths to use for loops |
| Large objects causing LOH fragmentation | Custom memory manager | ArrayPool<byte>.Shared |
| Container OOM kills | Rewrite in unmanaged C++ | Set GCHeapHardLimit to 75% of container memory |
10. Key takeaways
- The GC optimizes for short-lived objects. Align allocation patterns with this assumption: allocate in Gen0, die in Gen0. Long-lived objects should be initialized once and reused.
- Gen2 collections are the enemy of latency. Every design decision should be evaluated through the lens of "does this increase Gen2 pressure?" Pooling, struct usage, and Span<T> all reduce Gen2 collections.
- LOH fragmentation is silent until it kills you. Large buffer allocations (>85KB) bypass generational GC and fragment over time. ArrayPool prevents this. Monitor LOH size — if it grows unbounded, you have a problem.
- Server GC trades latency for throughput. Use it in server workloads, but expect 50-200ms pauses during Gen2 collections. If P99 latency matters more than throughput, workstation GC may be better.
- Container memory limits require explicit GC configuration. The GC doesn't know about cgroups. Set GCHeapHardLimit to 70-80% of container memory or Kubernetes will kill your pods during traffic spikes.
- Profile before optimizing. PerfView, dotnet-trace, and dotnet-counters show actual allocation stacks and GC behavior. Intuition about "expensive" operations is often wrong. Measure first.
- GC tuning is a last resort, not a first step. Fix allocation patterns (pooling, Span<T>, avoiding closures) before touching GC knobs. The default settings are good. Custom tuning is fragile across runtime versions and workload changes.
11. High-Level Overview
Visual representation of .NET memory management, highlighting stack-scoped execution, heap allocation, GC root discovery, survivorship, and pressure-driven reclamation.