A probabilistic framework for pruning transformers via a finite admixture of keys

Nguyen, M. Tan; Nguyen, Tam; Bui, Long; Do, Hai; Nguyen, Duy Khuong; Le, Duy Dung; Tran, The Hung; Ho, Nhat; Osher, Stan J.; Baraniuk, Richard G.

dc.contributor.author	Nguyen, M. Tan
dc.contributor.author	Nguyen, Tam
dc.contributor.author	Bui, Long
dc.contributor.author	Do, Hai
dc.contributor.author	Nguyen, Duy Khuong
dc.contributor.author	Le, Duy Dung
dc.contributor.author	Tran, The Hung
dc.contributor.author	Ho, Nhat
dc.contributor.author	Osher, Stan J.
dc.contributor.author	Baraniuk, Richard G.
dc.date.accessioned	2025-02-22T18:42:35Z
dc.date.available	2025-02-22T18:42:35Z
dc.date.issued	2023-04-11
dc.identifier.uri	https://vinspace.edu.vn/handle/VIN/567
dc.description.abstract	Pairwise dot product-based self-attention is key to the success of transformers, which achieve state-of-the-art performance across a variety of applications in language and vision but are costly to compute. It has been shown that most attention scores and keys in transformers are redundant and can be removed without loss of accuracy. In this paper, we develop a novel probabilistic framework for pruning attention scores and keys in transformers. We first formulate an admixture model of attention keys whose input data to be clustered are attention queries. We show that attention scores in self-attention correspond to the posterior distribution of this model when attention keys admit a uniform prior distribution. We then relax this uniform prior constraint and let the model learn these priors from data, resulting in a new Finite Admixture of Keys (FiAK). The learned priors are used for pruning away redundant attention scores and keys in the baseline transformers, improving the diversity of attention patterns that the models capture. We corroborate the efficiency of transformers pruned with FiAK on the ImageNet object classification and WikiText-103 language modeling tasks. Our experiments demonstrate that transformers pruned with FiAK yield similar or better accuracy than the baseline dense transformers while being much more efficient in terms of memory and computational cost.	en_US
dc.language.iso	en_US	en_US
dc.subject	transformers	en_US
dc.subject	admixture models	en_US
dc.subject	pruning	en_US
dc.title	A probabilistic framework for pruning transformers via a finite admixture of keys	en_US
dc.type	Article	en_US
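
Illustrative sketch (not part of the deposited record, and not the authors' implementation): the abstract states that attention scores correspond to a posterior distribution when keys have a uniform prior, and that learned priors are then used to prune keys. The Python sketch below shows one way to read that statement. It assumes scaled dot-product logits, one prior weight per key, and a simple threshold rule for pruning. The function name `attention_with_priors`, the `priors` list, and the `threshold` parameter are hypothetical and chosen only for illustration.

```python
import math


def attention_with_priors(queries, keys, priors, threshold=0.0):
    """Posterior-style attention scores with one prior weight per key.

    queries:   list of query vectors (lists of floats)
    keys:      list of key vectors with the same dimension as the queries
    priors:    list of non-negative weights, one per key (a hypothetical
               stand-in for the learned priors described in the abstract)
    threshold: keys whose prior is <= threshold are pruned (assumed rule)
    """
    dim = len(keys[0])
    scale = 1.0 / math.sqrt(dim)

    # Keep only the keys whose prior exceeds the threshold.
    kept = [j for j, p in enumerate(priors) if p > threshold]
    if not kept:
        raise ValueError("all keys were pruned")

    scores = []
    for q in queries:
        # Scaled dot-product logit for each surviving key.
        logits = {
            j: scale * sum(a * b for a, b in zip(q, keys[j])) for j in kept
        }
        # Subtract the maximum logit for numerical stability.
        top = max(logits.values())
        # Unnormalized posterior: prior times exponentiated logit.
        weights = {j: priors[j] * math.exp(logits[j] - top) for j in kept}
        total = sum(weights.values())
        # Pruned keys receive a score of exactly zero.
        scores.append(
            [weights[j] / total if j in weights else 0.0
             for j in range(len(keys))]
        )
    return scores


queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Uniform priors: the prior cancels and the rows are ordinary softmax scores.
dense = attention_with_priors(queries, keys, [1 / 3, 1 / 3, 1 / 3])

# Zero prior on the third key: that key is pruned from every row.
pruned = attention_with_priors(queries, keys, [0.5, 0.5, 0.0])
```

With uniform priors the prior factor cancels in the normalization, so each row reduces to standard softmax attention. A key whose prior falls at or below the threshold is dropped before any dot product is computed, which is where memory and compute savings would come from under these assumptions.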

Files in this item

Name: FiAK_ICASSP.pdf
Size: 681.4 KB
Format: PDF

This item appears in the following Collection(s)

Le Duy Dung, PhD [4]
Assistant Professor, Computer Science program, College of Engineering and Computer Science
