(Part 1) “Data Can Move Itself” — Cold and Hot Migration with Adaptive Thresholds: How to Maximize Long-Term Experience

Prologue

No matter how intelligent foreground placement decisions are, they cannot by themselves withstand the wear of time and shifting workload demands. To ensure that devices are not just “fast in the moment” but “stable in the long run,” two mechanisms matter most: background cold/hot migration (data finds its appropriate home on its own) and adaptive thresholds (the system adjusts itself automatically, like a thermostat). Today, we’ll break down these two optimizations that “happen quietly.”

01 Why Do We Need “Background Relocation”?

Foreground placement decisions address “where should this write land?” However, data temperature changes over time: initially hot data may “cool down,” while previously cold regions can suddenly become active again.

If left unattended:

  • Performance tier becomes clogged with cold data, leaving no room for urgent new workloads;
  • Capacity tier holds recently warmed-up data, causing repeated latency penalties on access;
  • Write Amplification Factor (WAF) and garbage collection pressure increase, degrading steady-state performance.

The goal of background hot/cold migration is to periodically rebalance data placement—without disturbing foreground operations.


02 How Do We Judge “Hot” vs. “Cold”? Two Signals Suffice

Two simple yet powerful signals: access frequency (how often) + recency of last access (how fresh).

  • Hot: High frequency or recent access → likely to be accessed again soon.
  • Cold: Low frequency and long time since last access → safe to demote.

In practice, each address range (or object) maintains a heat score. Higher frequency and greater recency → higher score; aging and idleness → lower score. Crucially, the score incorporates time-based decay: untouched entries gradually “cool off,” eliminating the need for manual cleanup or resets.
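To make this concrete, here is a minimal sketch of one way such a decaying heat score could be maintained. It is an illustrative model, not the device’s actual implementation; the class name, half-life default, and access weighting are assumptions.

```python
import math
import time


class HeatScore:
    """Per-extent heat tracker combining access frequency with recency.

    The score decays exponentially while the extent sits idle, so cold
    entries "cool off" on their own without any manual reset pass.
    """

    def __init__(self, half_life_s: float = 3600.0):
        # Decay constant chosen so the score halves after `half_life_s`
        # seconds of inactivity (one hour here, purely illustrative).
        self.decay = math.log(2) / half_life_s
        self.score = 0.0
        self.last_touch = time.monotonic()

    def _decayed(self, now: float) -> float:
        # Apply time-based cooling for the idle period since the last access.
        return self.score * math.exp(-self.decay * (now - self.last_touch))

    def touch(self, weight: float = 1.0) -> None:
        # Each access first cools the old score, then adds fresh heat;
        # frequent and recent accesses therefore dominate the total.
        now = time.monotonic()
        self.score = self._decayed(now) + weight
        self.last_touch = now

    def current(self) -> float:
        # Read the present heat without recording an access.
        return self._decayed(time.monotonic())
```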


03 How Does Relocation Work? A Four-Step, Low-Impact Workflow

  1. **Scan & Tag:** During idle periods, traverse the heat-score table to identify candidates for promotion/demotion, with quotas to prevent over-migration in one go.
  2. **Opportunistic Relocation:** Execute moves during low-load windows; process in small batches and throttle bandwidth to guarantee foreground latency SLAs.
  3. **Mapping Update:** Upon successful copy, mark the original location as invalid and update the logical-to-physical mapping to point to the new address.
  4. **Reclaim & Reorganize:** Release the old physical blocks into the free pool, integrating them into garbage collection (GC) and wear-leveling (WL) workflows for long-term health.

Think of it as a nighttime janitorial crew: no disruption during business hours, quietly clearing pathways and optimizing flow overnight.
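Tying the four steps together, a minimal sketch of one relocation pass is shown below. The `is_idle`, `copy_extent`, `remap`, and `reclaim` callables are hypothetical hooks standing in for the device’s real scan, copy, mapping, and GC internals, and the quota and thresholds are illustrative.

```python
from typing import Callable, Dict


def migration_pass(heat_table: Dict[int, "HeatScore"],
                   is_idle: Callable[[], bool],
                   copy_extent: Callable[[int, str], int],
                   remap: Callable[[int, int], int],
                   reclaim: Callable[[int], None],
                   quota: int = 64,
                   promote_at: float = 0.8,
                   demote_at: float = 0.2) -> None:
    """One low-impact relocation pass: scan & tag, relocate, remap, reclaim."""
    # 1. Scan & tag: rank extents by current heat and cap the batch size
    #    so a single pass can never over-migrate.
    ranked = sorted(heat_table.items(), key=lambda kv: kv[1].current())
    demote = [ext for ext, h in ranked if h.current() < demote_at][:quota]
    promote = [ext for ext, h in reversed(ranked) if h.current() > promote_at][:quota]

    moves = [(ext, "capacity") for ext in demote] + \
            [(ext, "performance") for ext in promote]
    for extent, target_tier in moves:
        # 2. Opportunistic relocation: stop the moment foreground load returns.
        if not is_idle():
            break
        new_addr = copy_extent(extent, target_tier)

        # 3. Mapping update: point the logical extent at the new address,
        #    which invalidates the old copy.
        old_addr = remap(extent, new_addr)

        # 4. Reclaim & reorganize: hand the old blocks back to GC / wear leveling.
        reclaim(old_addr)
```

Because the loop checks `is_idle()` before every move and is bounded by `quota`, its worst-case interference with foreground I/O stays small and predictable.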


04 Preventing “Ping-Pong”: Three Safeguards Against Thrashing

Poorly tuned migration can cause thrashing (e.g., promote → cool → demote → heat up → promote again). To avoid this, we enforce three guardrails:

  1. **Hysteresis & Threshold Bands:** Promotion/demotion thresholds include buffer zones and require sustained state changes before triggering, preventing reactions to transient spikes or noise (see the sketch after this list).
  2. **Migration Budgeting:** Enforce per-window caps on relocation volume (e.g., max IOPS, block count). Better to move slowly than compete with the foreground.
  3. **Black/White Lists:** Critical metadata, logs, etc., are whitelisted for promotion only; large sequential cold archives are blacklisted from promotion, reducing unnecessary shuffling.
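Guardrails 1 and 2 are simple enough to sketch directly. The band values, window count, and per-window cap below are illustrative assumptions, not tuned numbers.

```python
from collections import deque


class HysteresisGate:
    """Guardrail 1: promote only after the heat stays above the high band for
    N consecutive windows, demote only after it stays below the low band for
    N consecutive windows; anything in between is left alone."""

    def __init__(self, promote_at: float = 0.8, demote_at: float = 0.2,
                 windows: int = 3):
        self.promote_at = promote_at
        self.demote_at = demote_at
        self.history = deque(maxlen=windows)

    def decide(self, score: float) -> str:
        self.history.append(score)
        if len(self.history) < self.history.maxlen:
            return "hold"                               # not enough evidence yet
        if all(s > self.promote_at for s in self.history):
            return "promote"
        if all(s < self.demote_at for s in self.history):
            return "demote"
        return "hold"                                   # inside the buffer band


class MigrationBudget:
    """Guardrail 2: a hard per-window cap on how many extents may move."""

    def __init__(self, max_moves_per_window: int = 64):
        self.max_moves = max_moves_per_window
        self.used = 0

    def allow_move(self) -> bool:
        # Grant a move only while the window's budget is not exhausted.
        if self.used >= self.max_moves:
            return False
        self.used += 1
        return True

    def start_new_window(self) -> None:
        self.used = 0
```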

05 Adaptive Thresholds: A “Thermostat” That Self-Adjusts

Cold/hot migration alone isn’t enough. The system must self-tune based on real-time conditions—closing the loop via three key KPI categories:

| Dimension | Key Metrics |
| --- | --- |
| Latency | Avg / median latency, p99/p999 tail latency, jitter |
| Resource | Performance-tier occupancy, tier hit ratios (read/write) |
| Health | WAF, erase-count distribution, bad-block growth rate |
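For concreteness, these three KPI categories could be carried in one snapshot structure that the control loop samples each window; a minimal sketch with illustrative field names:

```python
from dataclasses import dataclass


@dataclass
class KpiSnapshot:
    """One sampling window of the three KPI categories (fields illustrative)."""
    # Latency
    avg_latency_ms: float
    p99_latency_ms: float
    p999_latency_ms: float
    # Resource
    fast_tier_occupancy: float        # 0.0 .. 1.0
    read_hit_ratio: float
    write_hit_ratio: float
    # Health
    waf: float                        # write amplification factor
    max_erase_count: int
    bad_block_growth_per_day: float
```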

When anomalies are detected, the system automatically adapts:

  • Placement Thresholds: Congested fast tier? Raise thresholds so only the hottest workloads earn promotion. Fast tier underutilized? Lower thresholds to admit more “warm” candidates (see the sketch after this list).
  • Scoring Weights: Dynamically adjust the weighting of factors such as randomness, concurrency, and host hints. Example: when small random I/O surges, increase the weight on “randomness” to prioritize responsiveness.
  • Background Pace: Slow down migration/GC during peak load; accelerate during idle periods, always prioritizing system stability.
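As a minimal sketch of the thermostat idea, the placement-threshold adjustment could look like the function below, consuming the `KpiSnapshot` from the earlier sketch; the target occupancy, latency SLO, step size, and bounds are illustrative assumptions.

```python
def adapt_promotion_threshold(threshold: float,
                              kpi: "KpiSnapshot",
                              target_occupancy: float = 0.75,
                              latency_slo_ms: float = 5.0,
                              step: float = 0.05) -> float:
    """Thermostat-style nudge of the promotion threshold from live KPIs."""
    if kpi.fast_tier_occupancy > target_occupancy or kpi.p99_latency_ms > latency_slo_ms:
        # Congested fast tier or tail latency at risk: raise the bar so
        # only the hottest workloads earn promotion.
        threshold += step
    elif kpi.fast_tier_occupancy < target_occupancy - 0.15:
        # Plenty of headroom: lower the bar and admit more "warm" candidates.
        threshold -= step
    # Keep the threshold inside sane bounds so the loop can always recover.
    return min(0.95, max(0.05, threshold))
```

Scoring-weight and background-pace adjustments would follow the same pattern: read the snapshot, compare against targets, and nudge one knob per window so the control loop stays stable.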

This creates a self-regulating storage fabric: intelligent, resilient, and always optimizing for the user experience.
