Ultimate Guide to Local Inference

Overview

This document details the data flow in LLM inference systems, covering model loading, GPU communication, memory hierarchy, and inference processes.

1. Model Loading Process

Loading Model from Disk to VRAM


The process involves several steps:

  1. Initial Storage (Disk)

    • Models are stored on disk (SATA or NVMe SSDs, or HDDs)
    • The on-disk format contains the model weights along with any associated configuration or code
  2. Transfer Process

    • Model is loaded into RAM first
    • Subsequently transferred to GPU VRAM
    • Direct disk-to-VRAM loading is not the standard path (outside of technologies such as NVIDIA GPUDirect Storage), so the CPU and system RAM act as intermediaries
  3. Memory Management

    • CPU/RAM serves as an intermediary
    • Model weights are loaded in chunks to manage memory efficiently (see the sketch after this list)
    • VRAM allocation is optimized for GPU processing
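
Below is a minimal host-side sketch of this staged, chunked transfer. It assumes the weights have already been read from disk into an ordinary RAM buffer; the buffer sizes, the single pinned staging buffer, and the 64 MB chunk size are illustrative choices, not any particular loader's implementation.

```cpp
// Chunked transfer: disk buffer (RAM) -> pinned staging buffer -> VRAM.
#include <cuda_runtime.h>
#include <algorithm>
#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    const size_t total_bytes = 512ull << 20;   // pretend the model weights are 512 MB
    const size_t chunk_bytes = 64ull << 20;    // copy in 64 MB chunks

    std::vector<char> weights_in_ram(total_bytes);  // stand-in for the file already read from disk

    char *staging = nullptr, *d_weights = nullptr;
    cudaHostAlloc((void**)&staging, chunk_bytes, cudaHostAllocDefault); // pinned RAM for fast DMA
    cudaMalloc((void**)&d_weights, total_bytes);                        // VRAM allocation

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (size_t offset = 0; offset < total_bytes; offset += chunk_bytes) {
        size_t n = std::min(chunk_bytes, total_bytes - offset);
        std::memcpy(staging, weights_in_ram.data() + offset, n);  // RAM -> pinned RAM
        cudaMemcpyAsync(d_weights + offset, staging, n,
                        cudaMemcpyHostToDevice, stream);          // pinned RAM -> VRAM
        cudaStreamSynchronize(stream);  // single staging buffer, so wait before reusing it
    }

    printf("copied %zu bytes of weights into VRAM\n", total_bytes);

    cudaFree(d_weights);
    cudaFreeHost(staging);
    cudaStreamDestroy(stream);
    return 0;
}
```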

Inference Execution Flow

When a prompt is received (e.g., “Hello, How are you?”), the system follows this process:

  1. CPU Decision Point. The CPU decides between two execution paths:

    • Execute computation entirely on CPU
    • Offload model layers to GPU for parallel computation
  2. Execution Process


    • Input data transfers from CPU to GPU
    • GPU executes kernels on model weights
    • Results are copied from VRAM to RAM (if needed)
    • Post-processing occurs on CPU
  3. Synchronization Considerations. Critical synchronization points exist between kernel execution and any later step that consumes its results:

    • GPU kernels run asynchronously by default
    • The CPU continues execution immediately after a kernel launch
    • Explicit synchronization (or a blocking copy back to RAM) prevents the CPU from reading results before the GPU has finished (see the sketch below)
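
The sketch below illustrates that synchronization point. The kernel and sizes are placeholders; the point is only that the launch returns immediately and the host must synchronize (or issue a blocking copy) before trusting the results.

```cpp
// Kernel launches are asynchronous; the CPU must synchronize before using results.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float *x, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * factor + 1.0f;   // stand-in for a real model kernel
}

int main() {
    const int n = 1 << 20;
    float *d_x = nullptr;
    cudaMalloc((void**)&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);  // returns immediately; GPU works in background

    cudaDeviceSynchronize();   // option 1: explicit barrier before the host reads results

    float first = 0.0f;        // option 2: a blocking copy to RAM also synchronizes implicitly
    cudaMemcpy(&first, d_x, sizeof(float), cudaMemcpyDeviceToHost);
    printf("first element after kernel: %.1f\n", first);   // 1.0, i.e. 0 * 2 + 1

    cudaFree(d_x);
    return 0;
}
```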

2. GPU Communication Patterns

Flow 1: Direct GPU-to-GPU Communication


  • Technology Used:
    • NVIDIA: NVLink/NVSwitch
    • AMD: Infinity Fabric Link (xGMI)
  • Implementation Requirements:
    • NVIDIA: NCCL (NVIDIA Collective Communications Library)
    • AMD: RCCL (ROCm Communication Collectives Library)
  • Benefits:
    • Higher bandwidth
    • Lower latency
    • Bypasses system RAM
    • Direct memory access between GPUs (see the peer-copy sketch below)
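
A minimal sketch of the direct path, assuming two GPUs (devices 0 and 1) that can reach each other over NVLink or PCIe peer-to-peer. Production stacks would typically use NCCL/RCCL collectives on top of this; the sketch only shows the underlying peer copy.

```cpp
// Direct GPU-to-GPU copy using peer access (NVLink if available, otherwise PCIe P2P).
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, /*device=*/0, /*peerDevice=*/1);

    const size_t bytes = 256ull << 20;
    float *d0 = nullptr, *d1 = nullptr;

    cudaSetDevice(0);
    cudaMalloc((void**)&d0, bytes);
    if (can_access) cudaDeviceEnablePeerAccess(1, 0);  // GPU 0 may address GPU 1's memory

    cudaSetDevice(1);
    cudaMalloc((void**)&d1, bytes);
    if (can_access) cudaDeviceEnablePeerAccess(0, 0);

    // GPU 0 -> GPU 1 without bouncing through system RAM (when peer access is enabled).
    cudaMemcpyPeer(d1, /*dstDevice=*/1, d0, /*srcDevice=*/0, bytes);

    printf("peer access: %s\n", can_access ? "direct" : "unavailable (driver stages through RAM)");
    cudaSetDevice(0); cudaFree(d0);
    cudaSetDevice(1); cudaFree(d1);
    return 0;
}
```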

Flow 2: GPU-RAM-GPU Communication


  • Traditional Path:
    • Data moves from GPU 1 → System RAM → GPU 2 (see the staged-copy sketch after this list)
  • Performance Limitations:
    1. PCIe Bandwidth Constraints (x16 link, per direction):
      • PCIe 3.0: ~16 GB/s
      • PCIe 4.0: ~32 GB/s
      • PCIe 5.0: ~64 GB/s
    2. RAM Speed Limitations:
      • DDR4-3200: 25.6 GB/s per channel
      • DDR5-4800: 38.4 GB/s per channel
      • HBM2 (GPU-side, for comparison): 307.2 GB/s per stack
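
For contrast with the direct path above, here is a sketch of the staged route: the same transfer bounced through a pinned buffer in system RAM, paying the PCIe and RAM costs listed above once per hop. Buffer sizes are illustrative.

```cpp
// Staged path: GPU 0 -> pinned system RAM -> GPU 1 (two PCIe hops).
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256ull << 20;
    float *d0 = nullptr, *d1 = nullptr, *h_bounce = nullptr;

    cudaSetDevice(0);
    cudaMalloc((void**)&d0, bytes);
    cudaSetDevice(1);
    cudaMalloc((void**)&d1, bytes);

    cudaHostAlloc((void**)&h_bounce, bytes, cudaHostAllocDefault);  // pinned system RAM

    cudaSetDevice(0);
    cudaMemcpy(h_bounce, d0, bytes, cudaMemcpyDeviceToHost);  // hop 1: GPU 0 -> RAM over PCIe
    cudaSetDevice(1);
    cudaMemcpy(d1, h_bounce, bytes, cudaMemcpyHostToDevice);  // hop 2: RAM -> GPU 1 over PCIe

    cudaFreeHost(h_bounce);
    cudaSetDevice(0); cudaFree(d0);
    cudaSetDevice(1); cudaFree(d1);
    return 0;
}
```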

3. Memory Hierarchy


When offloading model layers to GPU, data traverses through multiple memory tiers, each with different access speeds and capacities. Understanding this hierarchy is crucial for optimizing LLM inference performance.

Memory Components

  1. High Bandwidth Memory (HBM/VRAM) - 16 GB

    • Primary storage for model weights and activations
    • Connected directly to GPU via wide memory bus
    • Bandwidth: ~900 GB/s - 1.2 TB/s
    • Higher latency compared to caches (~400-600 cycles)
    • Stores the complete model during inference
  2. L2 Cache - 6144 KB

    • Unified cache shared across all Streaming Multiprocessors (SMs)
    • Acts as buffer between HBM and L1 caches
    • Latency: ~193 cycles
    • Caches frequently accessed model weights
    • Helps reduce HBM access frequency
  3. L1/Shared Memory Tier

    • L1 Data Cache

      • Per-SM cache for fast data access
      • Typically 128 KB per SM
      • Latency: ~28 cycles
    • Shared Memory

      • Programmer-managed scratchpad memory
      • Same speed as L1 cache
      • Used for inter-thread communication within SM
      • Crucial for reduction operations (sum, softmax)
      • Size configurable with L1 cache (shared pool)
    • L1 Instruction Cache

      • Stores frequently executed instructions
      • Reduces instruction fetch latency
  4. Register File - 256 KB per SM (64 K 32-bit registers)

    • Fastest memory tier (~1 cycle access latency)
    • Local to each thread
    • Used for immediate computations
    • Limited quantity per thread; the actual sizes of every tier for a given GPU can be queried at runtime (see the sketch below)
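
The capacities quoted above vary between GPUs. The sketch below simply queries them, plus an estimated peak memory bandwidth, from the CUDA device properties of whatever GPU is installed.

```cpp
// Query the memory hierarchy sizes of the installed GPU.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);

    printf("GPU:                   %s\n", p.name);
    printf("Global memory (VRAM):  %.1f GB\n", p.totalGlobalMem / 1e9);
    printf("L2 cache:              %d KB\n", p.l2CacheSize / 1024);
    printf("Shared memory per SM:  %zu KB\n", p.sharedMemPerMultiprocessor / 1024);
    printf("Registers per SM:      %d x 32-bit\n", p.regsPerMultiprocessor);
    printf("SM count:              %d\n", p.multiProcessorCount);

    // Theoretical peak DRAM bandwidth: 2 transfers/clock * memory clock * bus width in bytes.
    double gbps = 2.0 * p.memoryClockRate * 1e3 * (p.memoryBusWidth / 8.0) / 1e9;
    printf("Peak memory bandwidth: ~%.0f GB/s\n", gbps);
    return 0;
}
```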

Warp Shuffle vs Shared Memory

Warp Shuffle Operations

  • Direct register-to-register communication within a warp
  • Used for fast reduction operations within warps
  • Zero memory access needed
  • Common uses:
    • Warp-level reductions
    • Matrix multiplication accumulation
    • Softmax normalization
  • Limitations: only works among the 32 threads of a single warp (see the warp-reduction sketch below)
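
A minimal warp-level sum using shuffles, for illustration; the input of 32 ones and the kernel names are arbitrary.

```cpp
// Warp-level sum: 32 threads combine their values entirely in registers.
#include <cuda_runtime.h>
#include <cstdio>

__device__ float warp_reduce_sum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffffu, val, offset);  // add the value held by lane (id + offset)
    return val;                                              // final sum ends up in lane 0
}

__global__ void warp_sum(const float *in, float *out) {
    float v = warp_reduce_sum(in[threadIdx.x]);  // exactly one warp: threads 0..31
    if (threadIdx.x == 0) *out = v;
}

int main() {
    float h_in[32], h_out = 0.0f, *d_in, *d_out;
    for (int i = 0; i < 32; ++i) h_in[i] = 1.0f;

    cudaMalloc((void**)&d_in, sizeof(h_in));
    cudaMalloc((void**)&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    warp_sum<<<1, 32>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("warp sum = %.1f\n", h_out);   // expect 32.0

    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```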

Shared Memory Operations

  • Visible to all threads in thread block
  • Used for:
    • Cross-warp communication
    • Block-level reductions
    • Cooperative data loading
  • Higher latency than warp shuffle
  • More flexible but requires explicit synchronization
  • Better for larger data-sharing patterns (see the block-level reduction sketch below)
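
The sketch below combines both mechanisms into a block-wide sum, which is the pattern behind reductions such as softmax denominators: shuffles within each warp, then shared memory plus __syncthreads() to combine the per-warp partials. Sizes and names are illustrative.

```cpp
// Block-wide sum: shuffle within each warp, shared memory to combine warps.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

__global__ void block_sum(const float *in, float *out, int n) {
    __shared__ float warp_partials[32];          // one slot per possible warp in the block

    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32;
    int warp = threadIdx.x / 32;

    float v = (tid < n) ? in[tid] : 0.0f;

    for (int offset = 16; offset > 0; offset >>= 1)      // 1) intra-warp, registers only
        v += __shfl_down_sync(0xffffffffu, v, offset);

    if (lane == 0) warp_partials[warp] = v;               // 2) one partial per warp
    __syncthreads();                                      //    make the writes visible block-wide

    if (warp == 0) {                                      // 3) first warp reduces the partials
        int num_warps = (blockDim.x + 31) / 32;
        v = (lane < num_warps) ? warp_partials[lane] : 0.0f;
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffffu, v, offset);
        if (lane == 0) atomicAdd(out, v);                 // accumulate across blocks
    }
}

int main() {
    const int n = 1024;
    std::vector<float> ones(n, 1.0f);
    float *d_in, *d_out, h_out = 0.0f;

    cudaMalloc((void**)&d_in, n * sizeof(float));
    cudaMalloc((void**)&d_out, sizeof(float));
    cudaMemcpy(d_in, ones.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));

    block_sum<<<n / 256, 256>>>(d_in, d_out, n);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("total = %.1f\n", h_out);   // expect 1024.0

    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```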

Latency Hierarchy

  • Register access: ~1 cycle
  • L1 cache: ~28 cycles
  • L2 cache: ~193 cycles
  • Global memory (VRAM inside GPU): ~400-600 cycles

4. Inference Process

Inference Request Flow


  1. Input Processing

    • Text input is received
    • The tokenizer converts the text into token IDs
    • Token IDs are packed into input tensors for the model
  2. Model Pipeline (see the embedding-lookup sketch below)

    • Embedding lookup maps each token ID to its embedding vector
    • The transformer layers process the resulting activations
    • The final linear (LM head) layer produces logits
    • Logits are mapped back to the vocabulary to choose the next token
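
As a concrete example of the first GPU stage, here is a sketch of an embedding lookup: each thread block gathers the embedding-table row for one token ID into the activation buffer that the transformer layers then consume. The vocabulary size, hidden dimension, and token IDs are placeholders.

```cpp
// Embedding lookup: one thread block copies one embedding-table row per token.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void embedding_lookup(const float *table,    // [vocab, hidden]
                                 const int *token_ids,  // [num_tokens]
                                 float *activations,    // [num_tokens, hidden]
                                 int hidden) {
    int token = blockIdx.x;            // one block per input token
    int row   = token_ids[token];
    for (int j = threadIdx.x; j < hidden; j += blockDim.x)
        activations[token * hidden + j] = table[row * hidden + j];
}

int main() {
    const int vocab = 32000, hidden = 128, num_tokens = 6;    // illustrative sizes
    int h_ids[num_tokens] = {1, 22172, 29892, 920, 526, 366}; // placeholder token IDs

    float *d_table, *d_act;
    int *d_ids;
    cudaMalloc((void**)&d_table, (size_t)vocab * hidden * sizeof(float));
    cudaMalloc((void**)&d_act,   (size_t)num_tokens * hidden * sizeof(float));
    cudaMalloc((void**)&d_ids,   sizeof(h_ids));
    cudaMemset(d_table, 0, (size_t)vocab * hidden * sizeof(float)); // real weights would be loaded here
    cudaMemcpy(d_ids, h_ids, sizeof(h_ids), cudaMemcpyHostToDevice);

    embedding_lookup<<<num_tokens, 128>>>(d_table, d_ids, d_act, hidden);
    cudaDeviceSynchronize();
    printf("embedded %d tokens into %d-dimensional vectors\n", num_tokens, hidden);

    cudaFree(d_table); cudaFree(d_act); cudaFree(d_ids);
    return 0;
}
```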

KV Cache Management


  1. Cache Verification

    • Check for existing KV pairs
    • Evaluate cache hit/miss
  2. Cache Operations

    • Compute new KV pairs if needed
    • Store in VRAM cache
    • Update cache for attention computation
  3. Memory Operations (see the cache sketch after this list)

    • VRAM allocation and reuse for the cached keys and values
    • High-bandwidth access patterns, since the cached tensors are read at every decoding step
    • I/O bottleneck handling: at long contexts, KV-cache reads dominate memory traffic
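
Below is a minimal host-side sketch of such a cache, under simplifying assumptions: one key/value buffer pair per layer, preallocated in VRAM for a fixed maximum sequence length, with each decoding step appending the newly computed key/value row. The struct and field names are hypothetical, not any particular engine's API.

```cpp
// Per-layer KV cache held in VRAM; new keys/values are appended each step
// instead of being recomputed for the whole prefix.
#include <cuda_runtime.h>
#include <vector>

struct LayerKVCache {
    float *d_keys   = nullptr;   // [max_seq_len, kv_dim] in VRAM
    float *d_values = nullptr;   // [max_seq_len, kv_dim] in VRAM
    int    seq_len  = 0;         // tokens cached so far
    int    kv_dim   = 0;
    int    max_seq_len = 0;

    void init(int max_len, int dim) {
        kv_dim = dim; max_seq_len = max_len;
        cudaMalloc((void**)&d_keys,   (size_t)max_len * dim * sizeof(float));
        cudaMalloc((void**)&d_values, (size_t)max_len * dim * sizeof(float));
    }

    // Positions [0, seq_len) are already filled (cache hit) and can be read by
    // attention directly; a new token (cache miss) appends one row per tensor.
    void append(const float *d_new_key, const float *d_new_value) {
        size_t row_bytes = (size_t)kv_dim * sizeof(float);
        cudaMemcpy(d_keys   + (size_t)seq_len * kv_dim, d_new_key,   row_bytes,
                   cudaMemcpyDeviceToDevice);
        cudaMemcpy(d_values + (size_t)seq_len * kv_dim, d_new_value, row_bytes,
                   cudaMemcpyDeviceToDevice);
        ++seq_len;
    }

    void release() { cudaFree(d_keys); cudaFree(d_values); }
};

int main() {
    const int num_layers = 32, kv_dim = 1024, max_seq_len = 4096;  // illustrative sizes
    std::vector<LayerKVCache> cache(num_layers);
    for (auto &c : cache) c.init(max_seq_len, kv_dim);
    // ... during decoding, each layer would call cache[layer].append(k_new, v_new) ...
    for (auto &c : cache) c.release();
    return 0;
}
```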

Performance Considerations

VRAM vs RAM Speed Comparison

GPU HBM/VRAM delivers roughly 900 GB/s to 1.2 TB/s of bandwidth, compared with about 25.6 GB/s (DDR4-3200) to 38.4 GB/s (DDR5-4800) per channel of system RAM, with PCIe in between at roughly 16-64 GB/s. This order-of-magnitude gap is why keeping the weights and the KV cache resident in VRAM, and minimizing traffic over PCIe and system RAM, dominates local inference performance.
