Model latency is critical to a successful roll-out of a production machine learning model: no one wants to wait, least of all customers. These notes cover tools for investigating model performance and detecting bottlenecks within the model graph.
Speculative Decoding with vLLM using Gemma
Improving LLM inference with speculative decoding using Gemma
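To make the idea concrete before diving into vLLM and Gemma, here is a minimal toy sketch of greedy speculative decoding. It is not vLLM's implementation and the "models" are hypothetical stand-ins (simple next-token functions): a cheap draft model proposes a block of tokens, and the target model verifies them, accepting the matching prefix and correcting the first mismatch.

```python
# Toy sketch of greedy speculative decoding. The "models" below are
# hypothetical stand-ins (deterministic next-token functions), not
# Gemma or vLLM; they only illustrate the draft-and-verify loop.

def draft_model(tokens):
    # Cheap approximation: next token is last token + 1, modulo 10.
    return (tokens[-1] + 1) % 10

def target_model(tokens):
    # "Ground truth": same rule, except it emits 0 after a 7.
    if tokens[-1] == 7:
        return 0
    return (tokens[-1] + 1) % 10

def speculative_decode(prompt, num_tokens, k=4):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < num_tokens:
        # 1. Draft model proposes k tokens autoregressively (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_model(tokens + draft))
        # 2. Target model verifies the proposals; in a real system this
        #    is a single batched forward pass, which is the speed-up.
        accepted = []
        for i in range(k):
            expected = target_model(tokens + accepted)
            if draft[i] == expected:
                accepted.append(draft[i])
            else:
                accepted.append(expected)  # correct the mismatch, stop
                break
        else:
            # All k proposals accepted: target model adds a bonus token.
            accepted.append(target_model(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:len(prompt) + num_tokens]

print(speculative_decode([5], 8))
# → [5, 6, 7, 0, 1, 2, 3, 4, 5]
```

Note the key property: the output is identical to decoding with the target model alone; the draft model only changes how many target-model calls are needed, not the result.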