Optimizing Inference for Small Models in Memory-Limited GPU Environments
This presentation explores optimizing inference for small models in memory-limited GPU environments. Departing from the conventional one-model-per-GPU serving strategy, we propose replicating a small model multiple times on a single GPU to improve inference efficiency. Our empirical studies support this approach and offer insights into inference strategies for small models in resource-constrained settings.
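As a rough illustration of the replication idea, the sketch below creates several independent copies of a small PyTorch model on one GPU and launches each incoming batch on its own CUDA stream so the replicas' kernels can overlap on the device. The model architecture, replica count, and batch sizes are illustrative assumptions, not the presentation's actual setup; whether kernels truly overlap depends on per-replica GPU occupancy.

```python
# Hypothetical sketch: serve several replicas of a small model on one GPU,
# one CUDA stream per replica. All sizes and the model itself are assumed.
import torch
import torch.nn as nn

NUM_REPLICAS = 4   # assumed replica count; tune so all copies fit in memory
BATCH = 32

device = torch.device("cuda")

def make_small_model() -> nn.Module:
    # Stand-in "small model"; substitute the model actually being served.
    return nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Independent weight copies, one per replica, all resident on the same GPU.
replicas = [make_small_model().to(device).eval() for _ in range(NUM_REPLICAS)]
streams = [torch.cuda.Stream() for _ in range(NUM_REPLICAS)]

@torch.no_grad()
def run_concurrent(batches):
    """Round-robin batches across replicas, launching each forward pass on
    that replica's CUDA stream so work from different replicas can overlap."""
    outputs = [None] * len(batches)
    for i, x in enumerate(batches):
        r = i % NUM_REPLICAS
        with torch.cuda.stream(streams[r]):
            outputs[i] = replicas[r](x.to(device, non_blocking=True))
    torch.cuda.synchronize()  # wait for every stream before reading results
    return outputs

if __name__ == "__main__":
    requests = [torch.randn(BATCH, 512, pin_memory=True) for _ in range(8)]
    results = run_concurrent(requests)
    print([tuple(r.shape) for r in results])
```

Under this assumed setup, replication trades extra weight memory (affordable when the model is small) for higher device utilization, since a single small model's kernels typically cannot saturate the GPU on their own.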