Quantization
1. Introduction
Machine learning models keep getting larger, and larger models tend to perform better. However, these models need a large amount of memory just to be loaded, and consumer hardware usually does not have that much. Model compression shrinks a large model into a smaller one at the cost of a small, often negligible, loss in accuracy. The compressed models can then be optimized to run on consumer hardware and can make use of NPUs for high performance....
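
To make the memory pressure concrete, here is a rough back-of-the-envelope sketch of how much memory the weights alone would take at different numeric precisions. The 7-billion-parameter model size, the list of precisions, and the helper name `weight_memory_gib` are illustrative assumptions, not figures from this article, and the estimate ignores activations and runtime overhead.

```python
# Bytes needed to store one parameter at some common precisions
# (illustrative list; real formats may add per-block metadata).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Approximate memory in GiB needed to hold only the model weights."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    num_params = 7_000_000_000  # hypothetical 7B-parameter model
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_memory_gib(num_params, dtype):.1f} GiB")
```

Under these assumptions, halving the number of bits per parameter halves the weight footprint, which is why lower-precision storage can bring a model within reach of consumer hardware.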