
Modern neural networks are growing not only in size and complexity but also in inference time. One of the most effective compression techniques -- channel pruning -- combats this trend by removing channels from convolutional weights to reduce resource consumption. However, removing channels is non-trivial for multi-branch segments of a model, where it can introduce extra memory copies at inference time. These copies increase latency -- so much so that the pruned model is even slower than the original, unpruned model. As a workaround, existing pruning works constrain certain channels to be pruned together. This fully eliminates inference-time memory copies, but, as we show, these constraints significantly impair accuracy. To solve both challenges, our insight is to enable unconstrained pruning by reordering channels to minimize memory copies. Using this insight, we design UCPE, a generic algorithm that prunes models with any pruning pattern. Critically, by removing constraints from existing pruning heuristics, we improve ImageNet top-1 accuracy for post-training pruning by 2.1 points on average -- benefiting pruned DenseNet (+16.9), EfficientNetV2 (+7.9), and ResNet (+6.2). Furthermore, our UCPE algorithm reduces latency by up to 52.8% compared with naive unconstrained pruning, nearly fully eliminating memory copies at inference time.
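To make the memory-copy issue concrete, below is a minimal PyTorch sketch -- an illustration of the general idea only, not the UCPE algorithm. It shows how keeping different channel subsets on the two sides of a branch merge forces a gather (an inference-time memory copy), and how reordering channels so that the kept ones are contiguous turns that gather into a zero-copy slice. The tensor shapes and kept-channel indices are hypothetical.

```python
import torch

C = 8                                    # channels before pruning
x = torch.randn(1, C, 16, 16)            # activation entering a multi-branch segment

# Hypothetical, unconstrained pruning decisions: each branch keeps a
# different channel subset, so their outputs no longer line up channel-wise.
keep_main = torch.tensor([0, 2, 5, 7])   # channels kept by the main branch
keep_skip = torch.tensor([0, 3, 5, 6])   # channels kept by the skip branch

# Advanced indexing gathers the selected channels into freshly allocated
# tensors -- these are the inference-time memory copies described above.
main = x[:, keep_main]                   # copy, shape (1, 4, 16, 16)
skip = x[:, keep_skip]                   # another copy

# If the weights are instead permuted offline so that the kept channels
# occupy a contiguous prefix (here, pretend indices 0..3 after reordering),
# the same selection becomes a plain slice, which is a view: no data moves.
x_reordered = x                          # stand-in for the offline permutation
main_fast = x_reordered[:, :4]           # view, shape (1, 4, 16, 16)

assert main_fast.data_ptr() == x.data_ptr()   # slice shares storage: no copy
assert main.data_ptr() != x.data_ptr()        # gather allocated new memory
```

In a real multi-branch segment the kept channels of every branch cannot always be made contiguous at once, so some copies may remain; the reordering idea described in the abstract is about choosing a channel order that minimizes how many are left.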

Related readings and updates.

PDP: Parameter-free Differentiable Pruning is All You Need

DNN pruning is a popular way to reduce the size of a model, improve inference latency, and minimize power consumption on DNN accelerators. However, existing approaches might be too complex, expensive, or ineffective to apply to a variety of vision/language tasks and DNN architectures, or to honor structured pruning constraints. In this paper, we propose an efficient yet effective train-time pruning scheme, Parameter-free Differentiable…

Error-driven Pruning of Language Models for Virtual Assistants

Language models (LMs) for virtual assistants (VAs) are typically trained on large amounts of data, resulting in prohibitively large models that require excessive memory and/or cannot be used to serve user requests in real time. Entropy pruning results in smaller models but with significant degradation of effectiveness in the tail of the user request distribution. We customize entropy pruning by allowing for a keep list of infrequent n-grams that…