HiReLC: A Hierarchical Reinforcement Learning Framework for Joint Neural Network Pruning and Quantization Compression
This paper presents HiReLC, a hierarchical ensemble reinforcement learning framework for the automated joint quantization and structured pruning of deep neural networks. The approach decomposes the compression search space across two levels of abstraction: low-level agents (LLAs) operate independently, selecting multi-discrete action configurations per module that span bit-width, pruning retention ratio, quantization type, and granularity; high-level agents (HLAs) coordinate global budget allocation via ensemble voting guided by Fisher-information-based sensitivity estimation. To reduce the computational cost of policy evaluation, the framework introduces an iterative active-learning loop that employs a lightweight MLP surrogate model for reward shaping and a logit-MSE surrogate during cold start, ultimately performing rigorous evaluation through post-compression fine-tuning. Experiments show that HiReLC achieves parameter-storage compression ratios ranging from 5.99× to 6.72× across Vision Transformer and CNN benchmarks, with accuracy gains up to 3.83% in select settings and degradation of 0.55%–5.62% in others, validating both hierarchical policy decomposition and sensitivity-aware guidance.
Background and Context
The deployment of deep neural networks in resource-constrained environments faces significant hurdles due to the immense computational costs and storage requirements associated with modern architectures. Traditional model compression techniques have historically treated pruning and quantization as separate, sequential processes. This decoupled approach fails to capture the complex, non-linear coupling relationships between structural sparsity and numerical precision, often resulting in suboptimal compression ratios or severe accuracy degradation. The core problem lies in the inability of conventional methods to jointly optimize these parameters, leading to inefficient search spaces and a trade-off that rarely achieves the best possible balance between model size and performance.
To address this fundamental limitation, the HiReLC framework introduces a hierarchical ensemble reinforcement learning approach designed for the automated joint quantization and structured pruning of deep neural networks. Unlike previous monolithic optimization strategies, HiReLC decomposes the vast compression search space into two distinct levels of abstraction: low-level and high-level agents. This architectural shift aims to mitigate the curse of dimensionality inherent in joint optimization problems. By separating the granular configuration of individual network modules from the global allocation of computational budgets, the framework seeks to navigate the search space more efficiently, ensuring that both compression efficiency and model accuracy are preserved.
The significance of this approach extends beyond theoretical novelty, offering a practical route to automating model compression workflows. By employing an architecture-agnostic modular controller, HiReLC can be applied across different neural network structures, including Convolutional Neural Networks (CNNs) and Vision Transformers. This generality matters for industrial adoption because it reduces the need for manual, architecture-specific tuning. The framework’s design philosophy centers on reducing the human effort required to obtain high-performance compressed models, thereby shortening the path from training to edge inference.
Deep Analysis
At the technical core of HiReLC is a dual-layer reinforcement learning system that orchestrates the compression process through coordinated agent interactions. Low-Level Agents (LLAs) operate independently within each network module, selecting multi-discrete action configurations. These actions encompass a wide range of parameters, including bit-width selection, pruning retention ratios, quantization types, and granularity levels. This fine-grained control allows the system to tailor the compression strategy to the specific characteristics of each module, rather than applying a uniform reduction across the entire network. The multi-discrete nature of the action space enables a highly customized approach to model optimization, capturing the unique sensitivity and redundancy of different layers.
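The multi-discrete action described above can be pictured as one index per compression dimension. The following is a minimal sketch, not the authors' implementation: the option tables (`BIT_WIDTHS`, `RETENTION_RATIOS`, `QUANT_TYPES`, `GRANULARITIES`) and the names `ModuleConfig` and `decode_action` are hypothetical, chosen only to illustrate how an LLA's action could map to a per-module configuration.

```python
from dataclasses import dataclass

# Hypothetical option tables: the actual choices used by HiReLC are not
# specified in this article and may differ.
BIT_WIDTHS = (2, 4, 6, 8)
RETENTION_RATIOS = (0.5, 0.625, 0.75, 0.875, 1.0)
QUANT_TYPES = ("symmetric", "asymmetric")
GRANULARITIES = ("per_tensor", "per_channel")
TABLES = (BIT_WIDTHS, RETENTION_RATIOS, QUANT_TYPES, GRANULARITIES)


@dataclass(frozen=True)
class ModuleConfig:
    """Compression settings for a single network module."""
    bit_width: int
    retention_ratio: float
    quant_type: str
    granularity: str


def decode_action(action):
    """Map a multi-discrete action (one index per dimension) to a config."""
    if len(action) != len(TABLES):
        raise ValueError(f"expected {len(TABLES)} indices, got {len(action)}")
    values = []
    for index, table in zip(action, TABLES):
        if not 0 <= index < len(table):
            raise ValueError(f"index {index} out of range for {table}")
        values.append(table[index])
    return ModuleConfig(*values)


def action_space_size():
    """Number of distinct configurations available to one module."""
    size = 1
    for table in TABLES:
        size *= len(table)
    return size
```

With these illustrative tables each module has 80 possible configurations, so a network with many modules has a joint space that grows exponentially in the module count. That growth is the motivation for letting each LLA handle one module while a separate level handles the global budget.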
Complementing the LLAs, High-Level Agents (HLAs) are responsible for coordinating the global budget allocation across the network. The HLAs use an ensemble voting mechanism guided by Fisher-information-based sensitivity estimation. This statistical measure indicates which network layers are most sensitive to perturbations such as quantization error or removed structures. By protecting these critical layers, that is, by compressing them less aggressively than more redundant layers, the HLAs aim to maintain overall model accuracy even under aggressive global compression. This sensitivity-aware guidance is a key differentiator, as it helps avoid the indiscriminate reduction of parameters that could lead to catastrophic accuracy loss.
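One way to picture sensitivity-guided voting is sketched below. It is an illustration under stated assumptions rather than the paper's algorithm: sensitivity is approximated by the empirical Fisher trace (the mean over samples of the summed squared gradients), each HLA proposes a bit-width per layer, and the tie-breaking rule that favors higher precision for layers at or above the median sensitivity is an assumption introduced here. The names `fisher_sensitivity` and `vote_budgets` are hypothetical.

```python
from collections import Counter
from statistics import median


def fisher_sensitivity(grad_samples):
    """Empirical Fisher trace for one layer.

    grad_samples: list of per-sample gradient vectors (lists of floats).
    Returns the mean over samples of the sum of squared gradients.
    """
    if not grad_samples:
        raise ValueError("need at least one gradient sample")
    total = sum(sum(g * g for g in grads) for grads in grad_samples)
    return total / len(grad_samples)


def vote_budgets(proposals, sensitivities):
    """Majority vote over per-layer bit-width proposals.

    proposals: one dict per high-level agent, mapping layer -> bit-width.
    sensitivities: dict mapping layer -> sensitivity score.
    Ties go to the highest tied bit-width for layers at or above the
    median sensitivity and to the lowest otherwise (an assumed rule).
    """
    cutoff = median(sensitivities.values())
    decided = {}
    for layer, sensitivity in sensitivities.items():
        counts = Counter(proposal[layer] for proposal in proposals)
        top = max(counts.values())
        tied = [bits for bits, count in counts.items() if count == top]
        decided[layer] = max(tied) if sensitivity >= cutoff else min(tied)
    return decided
```

In this sketch a layer with large squared gradients keeps more precision whenever the ensemble is split, which is one simple way to encode the idea that sensitive layers should be compressed less aggressively.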
To mitigate the prohibitive computational costs associated with evaluating reinforcement learning policies, HiReLC incorporates an iterative active-learning loop. This loop alternates between surrogate-based optimization and rigorous post-compression fine-tuning. During the cold-start phase, the framework employs a logit-MSE surrogate to accelerate initial policy convergence. Subsequently, a lightweight Multi-Layer Perceptron (MLP) surrogate model is used for reward shaping, approximating the performance of compression strategies without the need for full training cycles. This strategy significantly reduces the computational overhead while maintaining the integrity of the final evaluation, which is always grounded in actual post-compression fine-tuning results.
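The loop described above can be sketched as follows. This is a simplified illustration, not the framework's code: a nearest-neighbor predictor stands in for the MLP surrogate to keep the example dependency-free, random sampling stands in for the RL policy, and the names `logit_mse`, `NearestNeighborSurrogate`, and `active_learning_loop` are hypothetical. The callable `expensive_reward` represents post-compression fine-tuning and evaluation, and `cold_start_score` represents a cheap proxy such as negative logit-MSE.

```python
import random


def logit_mse(reference_logits, compressed_logits):
    """Mean squared error between two batches of logits (lists of rows)."""
    total, count = 0.0, 0
    for ref_row, comp_row in zip(reference_logits, compressed_logits):
        for ref, comp in zip(ref_row, comp_row):
            total += (ref - comp) ** 2
            count += 1
    if count == 0:
        raise ValueError("logits must be non-empty")
    return total / count


class NearestNeighborSurrogate:
    """Stand-in for the MLP surrogate: returns the reward of the closest
    previously evaluated configuration."""

    def __init__(self):
        self.data = []

    def fit(self, configs, rewards):
        self.data = list(zip(configs, rewards))

    def predict(self, config):
        def distance(item):
            return sum((a - b) ** 2 for a, b in zip(item[0], config))
        return min(self.data, key=distance)[1] if self.data else 0.0


def active_learning_loop(sample_config, expensive_reward, cold_start_score,
                         rounds=3, pool_size=20, top_k=3, seed=0):
    """Alternate cheap ranking with a small number of expensive evaluations."""
    rng = random.Random(seed)
    surrogate = NearestNeighborSurrogate()
    configs, rewards = [], []
    for _ in range(rounds):
        pool = [sample_config(rng) for _ in range(pool_size)]
        # Cold start: no measured rewards yet, so rank with the cheap proxy.
        scorer = surrogate.predict if rewards else cold_start_score
        pool.sort(key=scorer, reverse=True)
        for config in pool[:top_k]:
            configs.append(config)
            rewards.append(expensive_reward(config))  # ground-truth signal
        surrogate.fit(configs, rewards)  # refit on all measured results
    best = max(range(len(rewards)), key=rewards.__getitem__)
    return configs[best], rewards[best], len(rewards)
```

Only `rounds * top_k` configurations are ever passed to the expensive evaluator, while the remaining candidates in each pool are screened by the cheap scorer. The returned reward always comes from a real evaluation rather than a surrogate prediction.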
Industry Impact
The experimental validation of HiReLC demonstrates its efficacy across a variety of mainstream benchmarks, including Vision Transformers and CNNs. The framework achieves parameter-storage compression ratios ranging from 5.99× to 6.72×, a substantial reduction that highlights its potential for deploying large models on edge devices. These results are notable given the diversity of the test cases, indicating that the hierarchical approach holds up across different architectural paradigms. The ability to achieve such compression ratios without manual intervention represents a meaningful step forward in the automation of model optimization workflows.
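To make the notion of a parameter-storage compression ratio concrete, the sketch below computes it from per-module settings. It rests on assumptions that the article does not confirm: a 32-bit baseline, storage that scales linearly with the pruning retention ratio, and no accounting for quantization metadata such as scales and zero-points. The function name and the example numbers are illustrative and are not taken from the paper's experiments.

```python
def storage_compression_ratio(modules, baseline_bits=32):
    """Ratio of original to compressed parameter storage.

    modules: iterable of (num_params, retention_ratio, bit_width) tuples.
    Assumes a dense baseline at `baseline_bits` per parameter and ignores
    quantization metadata (scales, zero-points).
    """
    original_bits = 0
    compressed_bits = 0.0
    for num_params, retention_ratio, bit_width in modules:
        original_bits += num_params * baseline_bits
        compressed_bits += num_params * retention_ratio * bit_width
    if compressed_bits == 0:
        raise ValueError("compressed model has no stored parameters")
    return original_bits / compressed_bits


# Two hypothetical modules of one million parameters each, both at 8 bits,
# keeping 75% and 50% of their structures respectively.
example_ratio = storage_compression_ratio([
    (1_000_000, 0.75, 8),
    (1_000_000, 0.50, 8),
])
```

In this example the original storage is 64,000,000 bits and the compressed storage is 10,000,000 bits, giving a ratio of 6.4×. Quantizing from 32 to 8 bits alone yields 4×, so a ratio near 6× implies that pruning, lower bit-widths, or both contribute beyond uniform 8-bit quantization.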
In terms of accuracy, HiReLC shows mixed behavior. In select settings, the compressed models achieved accuracy gains of up to 3.83% compared to their uncompressed counterparts. This counterintuitive improvement suggests that compression may act as a regularizer, potentially improving generalization by removing redundant parameters. In other configurations, accuracy degraded by 0.55% to 5.62%. Losses near the low end of that range are likely acceptable for many practical applications when weighed against the storage savings; losses near the high end are substantial and would need to be judged against the requirements of the target deployment. The reported compression ratios measure parameter storage, so any gains in inference speed would depend on hardware and kernel support for the chosen bit-widths and sparsity patterns.
Ablation studies further support the importance of the hierarchical policy decomposition and sensitivity-aware guidance. Comparisons with single-layer agent approaches and with methods lacking sensitivity guidance indicate that HiReLC achieves a better balance between compression ratio and accuracy retention. These findings suggest that separating low-level configuration from high-level budget allocation has practical value for joint compression and is not merely a theoretical construct. The results provide an empirical basis for the adoption of hierarchical reinforcement learning in automated machine learning pipelines.
Outlook
HiReLC has clear implications for the broader AI industry, particularly for edge computing and mobile deployment. By providing an automated tool for model compression, the framework lowers the barrier to deploying sophisticated AI models on resource-constrained hardware. This capability matters for the next generation of intelligent devices, where latency, power consumption, and storage capacity are critical constraints. The architecture-agnostic design of HiReLC should also make it easier to integrate into existing deep learning frameworks, facilitating adoption by both academic researchers and industrial practitioners.
Furthermore, the introduction of iterative active learning and surrogate models in HiReLC sets a new precedent for reducing the computational costs of reinforcement learning in large-scale optimization tasks. This methodology may inspire future research into more efficient automated compression algorithms, potentially extending beyond pruning and quantization to other forms of model optimization. By demonstrating the viability of hierarchical search spaces and sensitivity-guided allocation, HiReLC opens new avenues for exploring the limits of model efficiency.
As the demand for lightweight AI models continues to grow, frameworks like HiReLC can help bridge the gap between high-performance research models and practical, deployable applications. The reported results, roughly 6× parameter-storage compression with accuracy changes ranging from a 3.83% gain to a 5.62% loss, support the potential of automated, hierarchical reinforcement learning for complex optimization problems, while showing that the accuracy cost still varies by setting. This work advances model compression and contributes to the broader goal of making artificial intelligence more accessible, efficient, and sustainable across diverse computing environments.