Model latency is critical to a successful production roll-out: no one wants to wait, least of all customers interacting with a machine learning model. These notes cover tools for investigating model performance and detecting bottlenecks within the model graph.
Speculative Decoding with vLLM
Improving LLM inference with speculative decoding
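To make the idea concrete before touching vLLM, here is a minimal toy sketch of the speculative decoding loop itself: a cheap draft model proposes several tokens, the expensive target model verifies them, and the longest agreeing prefix (plus one corrected token) is accepted. The `draft_next` and `target_next` functions below are hypothetical stand-ins for real models, chosen only to make the accept/reject logic visible; this is a greedy-decoding simplification, not vLLM's actual implementation.

```python
def draft_next(seq):
    # Hypothetical fast draft model: next token = last token + 1, capped at 9.
    return min(seq[-1] + 1, 9)

def target_next(seq):
    # Hypothetical slow target model: same rule, but wraps 9 -> 0,
    # so the two models occasionally disagree.
    return (seq[-1] + 1) % 10

def speculative_decode(prompt, max_new_tokens, k=4):
    seq = list(prompt)
    produced = 0
    while produced < max_new_tokens:
        # 1) The draft model proposes up to k tokens autoregressively (cheap).
        proposal = []
        ctx = seq[:]
        for _ in range(min(k, max_new_tokens - produced)):
            tok = draft_next(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2) The target model verifies the proposals (a single batched pass
        #    in a real system); keep the longest prefix both models agree on.
        accepted = []
        ctx = seq[:]
        for tok in proposal:
            expected = target_next(ctx)
            if tok == expected:
                accepted.append(tok)
                ctx.append(tok)
            else:
                # 3) On the first mismatch, take the target's token instead;
                #    each verification pass still yields at least one token.
                accepted.append(expected)
                ctx.append(expected)
                break
        seq.extend(accepted)
        produced += len(accepted)
    return seq[len(prompt):]

print(speculative_decode([1, 2], 6))  # -> [3, 4, 5, 6, 7, 8]
```

When the draft model agrees with the target (as in the run above), six tokens are produced with only two verification passes instead of six sequential target calls, which is where the latency win comes from.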