(108.00, 754.57) (126.82, 754.57) (126.82, 763.12) (108.00, 763.12)       /F81 IBM	<|special_separator|>
(399.97, 754.57) (504.00, 754.57) (504.00, 763.12) (399.97, 763.12)       /F81 Granite Language Models	<|special_separator|>
(108.25, 698.11) (121.70, 698.11) (121.70, 706.66) (108.25, 706.66)       /F81 2.1	<|special_separator|>
(132.16, 698.11) (201.23, 698.52) (201.23, 705.37) (132.16, 706.66)       /F81 DENSE MODELS	<|special_separator|>
(108.00, 676.91) (504.00, 676.91) (504.00, 685.46) (108.00, 685.46)       /F81 Granite 3.0 2B and 8B dense models share a similar architecture as popular language models like	<|special_separator|>
(108.00, 665.95) (504.35, 665.95) (504.35, 674.50) (108.00, 674.50)       /F81 Llama and our previous Granite Code models Mishra et al. (2024), ensuring strong compatibility	<|special_separator|>
(107.64, 654.99) (328.26, 654.99) (328.26, 663.54) (107.64, 663.54)       /F81 with open-source inference and fine-tuning pipelines.	<|special_separator|>
(333.83, 654.99) (504.83, 654.99) (504.83, 663.54) (333.83, 663.54)       /F81 We use Grouped Query Attention (GQA;	<|special_separator|>
(107.64, 644.03) (504.00, 644.03) (504.00, 652.58) (107.64, 652.58)       /F81 Ainslie et al. 2023) with 8 key-value heads to get a good balance between memory cost and model	<|special_separator|>
(108.00, 633.07) (504.00, 633.07) (504.00, 641.62) (108.00, 641.62)       /F81 performance, and Rotary Position Embedding (RoPE; Su et al. 2024) to model the relative position	<|special_separator|>
(108.00, 622.11) (174.25, 622.11) (174.25, 630.66) (108.00, 630.66)       /F81 between tokens.	<|special_separator|>
(180.10, 622.11) (504.00, 622.11) (504.00, 630.66) (180.10, 630.66)       /F81 For the MLP layers, Granite 3.0 Dense models use SwiGLU as the activation	<|special_separator|>
(108.00, 611.15) (504.00, 611.15) (504.00, 619.71) (108.00, 619.71)       /F81 function. Before each MLP and attention layer, we use RMSNorm to normalize the layer's input. We	<|special_separator|>
(108.00, 600.20) (504.00, 600.20) (504.00, 608.75) (108.00, 608.75)       /F81 also share parameters between the input embedding and the output linear transform. This reduces	<|special_separator|>
(108.00, 589.24) (504.00, 589.24) (504.00, 597.79) (108.00, 597.79)       /F81 the size of the model, and we have observed that the tying of these embeddings have zero, or even a	<|special_separator|>
(108.00, 578.28) (263.94, 578.28) (263.94, 586.83) (108.00, 586.83)       /F81 positive impact on model performance.	<|special_separator|>
(108.25, 551.57) (121.70, 551.57) (121.70, 560.13) (108.25, 560.13)       /F81 2.2	<|special_separator|>
(132.16, 551.57) (264.00, 551.99) (264.00, 558.83) (132.16, 560.13)       /F81 MIXTURE-OF-EXPERT MODELS	<|special_separator|>
(108.00, 530.37) (504.00, 530.37) (504.00, 538.92) (108.00, 538.92)       /F81 Granite 3.0 1B and 3B MoE models use similar architecture as Granite Dense models, with the	<|special_separator|>
(108.00, 519.41) (456.02, 519.41) (456.02, 527.97) (108.00, 527.97)       /F81 MLP layers substituted with MoE layers. A Mixture of Experts (MoE) layer comprises	<|special_separator|>
(458.52, 519.58) (466.52, 519.58) (466.52, 528.28) (458.52, 528.28)       /F31 N	<|special_separator|>
(470.10, 519.41) (504.00, 519.41) (504.00, 527.97) (470.10, 527.97)       /F81 modules	<|special_separator|>
(108.00, 508.62) (112.88, 508.62) (112.88, 517.33) (108.00, 517.33)       /F31 f	<|special_separator|>
(112.88, 507.69) (116.85, 507.69) (116.85, 513.79) (112.88, 513.79)       /F27 1	<|special_separator|>
(117.35, 508.62) (144.36, 508.62) (144.36, 517.33) (117.35, 517.33)       /F31 , . . . , f	<|special_separator|>
(144.36, 507.69) (150.68, 507.69) (150.68, 513.79) (144.36, 513.79)       /F30 N	<|special_separator|>
(154.42, 508.45) (201.24, 508.45) (201.24, 517.01) (154.42, 517.01)       /F81 and a router	<|special_separator|>
(203.74, 508.62) (208.49, 508.62) (208.49, 517.33) (203.74, 517.33)       /F31 g	<|special_separator|>
(208.85, 508.62) (212.72, 508.62) (212.72, 517.33) (208.85, 517.33)       /F28 (	<|special_separator|>
(212.72, 508.62) (217.36, 508.62) (217.36, 517.33) (212.72, 517.33)       /F31 e	<|special_separator|>
(220.14, 508.76) (222.91, 508.76) (222.91, 517.33) (220.14, 517.33)       /F34 |	<|special_separator|>
(225.69, 508.61) (231.74, 508.61) (231.74, 517.36) (225.69, 517.36)       /F55 x	<|special_separator|>
(231.74, 508.62) (235.61, 508.62) (235.61, 517.33) (231.74, 517.33)       /F28 )	<|special_separator|>
(235.61, 508.45) (299.02, 508.45) (299.02, 517.01) (235.61, 517.01)       /F81 . Given an input	<|special_separator|>
(301.52, 508.61) (307.56, 508.61) (307.56, 517.36) (301.52, 517.36)       /F55 x	<|special_separator|>
(310.06, 508.45) (504.35, 508.45) (504.35, 517.01) (310.06, 517.01)       /F81 to the MoE layer, the router predicts a probability	<|special_separator|>
(108.00, 497.50) (187.33, 497.50) (187.33, 506.05) (108.00, 506.05)       /F81 distribution over the	<|special_separator|>
(189.83, 497.66) (197.83, 497.66) (197.83, 506.37) (189.83, 506.37)       /F31 N	<|special_separator|>
(201.41, 497.50) (344.25, 497.50) (344.25, 506.05) (201.41, 506.05)       /F81 modules. Of these, we select the top	<|special_separator|>
(346.74, 497.66) (351.93, 497.66) (351.93, 506.37) (346.74, 506.37)       /F31 k	<|special_separator|>
(354.74, 497.50) (412.03, 497.50) (412.03, 506.05) (354.74, 506.05)       /F81 experts. When	<|special_separator|>
(414.52, 497.66) (441.33, 497.66) (441.33, 506.37) (414.52, 506.37)       /F31 k < N	<|special_separator|>
(442.42, 497.50) (504.00, 497.50) (504.00, 506.05) (442.42, 506.05)       /F81 , we are using a	<|special_separator|>
(108.00, 486.54) (504.00, 486.54) (504.00, 495.09) (108.00, 495.09)       /F81 Sparse Mixture of Experts (SMoE; Shazeer et al. 2017). For this series of Granite MoE models, we	<|special_separator|>
(108.00, 475.58) (257.13, 475.58) (257.13, 484.13) (108.00, 484.13)       /F81 use a linear layer to model the router:	<|special_separator|>
(231.00, 456.29) (235.52, 456.29) (235.52, 465.04) (231.00, 465.04)       /F55 s	<|special_separator|>
(238.28, 456.30) (246.03, 456.30) (246.03, 465.01) (238.28, 465.01)       /F28 =	<|special_separator|>
(248.80, 456.29) (260.64, 456.29) (260.64, 465.04) (248.80, 465.04)       /F55 W	<|special_separator|>
(260.64, 455.38) (281.90, 455.38) (281.90, 461.47) (260.64, 461.47)       /F27 router	<|special_separator|>
(282.40, 456.29) (288.44, 456.29) (288.44, 465.04) (282.40, 465.04)       /F55 x	<|special_separator|>
(288.44, 456.30) (291.21, 456.30) (291.21, 465.01) (288.44, 465.01)       /F31 ,	<|special_separator|>
(493.05, 456.14) (504.67, 456.14) (504.67, 464.69) (493.05, 464.69)       /F81 (1)	<|special_separator|>
(203.67, 434.58) (208.42, 434.58) (208.42, 443.29) (203.67, 443.29)       /F31 g	<|special_separator|>
(208.78, 434.58) (212.65, 434.58) (212.65, 443.29) (208.78, 443.29)       /F28 (	<|special_separator|>
(212.65, 434.58) (217.29, 434.58) (217.29, 443.29) (212.65, 443.29)       /F31 e	<|special_separator|>
(220.06, 434.72) (222.83, 434.72) (222.83, 443.29) (220.06, 443.29)       /F34 |	<|special_separator|>
(225.59, 434.57) (231.64, 434.57) (231.64, 443.32) (225.59, 443.32)       /F55 x	<|special_separator|>
(231.64, 434.58) (246.03, 434.58) (246.03, 443.29) (231.64, 443.29)       /F28 ) =	<|special_separator|>
(248.80, 444.55) (256.27, 444.55) (256.27, 450.93) (248.80, 450.93)       /F21 {	<|special_separator|>
(256.27, 440.36) (296.18, 440.36) (296.18, 449.07) (256.27, 449.07)       /F28 softmax (	<|special_separator|>
(296.18, 440.20) (311.43, 440.20) (311.43, 448.75) (296.18, 448.75)       /F81 Top	<|special_separator|>
(311.43, 440.36) (316.62, 440.36) (316.62, 449.07) (311.43, 449.07)       /F31 k	<|special_separator|>
(318.59, 440.36) (322.46, 440.36) (322.46, 449.07) (318.59, 449.07)       /F28 (	<|special_separator|>
(322.46, 440.35) (326.98, 440.35) (326.98, 449.10) (322.46, 449.10)       /F55 s	<|special_separator|>
(326.98, 440.36) (334.73, 440.36) (334.73, 449.07) (326.98, 449.07)       /F28 ))	<|special_separator|>
(334.73, 437.94) (337.55, 437.94) (337.55, 444.04) (334.73, 444.04)       /F30 i	<|special_separator|>
(339.71, 440.36) (342.48, 440.36) (342.48, 449.07) (339.71, 449.07)       /F31 ,	<|special_separator|>
(352.44, 440.35) (356.96, 440.35) (356.96, 449.10) (352.44, 449.10)       /F55 s	<|special_separator|>
(356.96, 439.44) (359.78, 439.44) (359.78, 445.53) (356.96, 445.53)       /F30 i	<|special_separator|>
(363.04, 440.50) (369.69, 440.50) (369.69, 449.07) (363.04, 449.07)       /F34 ∈	<|special_separator|>
(372.45, 440.20) (387.71, 440.20) (387.71, 448.75) (372.45, 448.75)       /F81 Top	<|special_separator|>
(387.71, 440.36) (392.89, 440.36) (392.89, 449.07) (387.71, 449.07)       /F31 k	<|special_separator|>
(394.87, 440.36) (398.74, 440.36) (398.74, 449.07) (394.87, 449.07)       /F28 (	<|special_separator|>
(398.74, 440.35) (403.26, 440.35) (403.26, 449.10) (398.74, 449.10)       /F55 s	<|special_separator|>
(403.26, 440.36) (407.13, 440.36) (407.13, 449.07) (403.26, 449.07)       /F28 )	<|special_separator|>
(295.50, 429.40) (300.48, 429.40) (300.48, 438.11) (295.50, 438.11)       /F28 0	<|special_separator|>
(300.48, 429.40) (303.25, 429.40) (303.25, 438.11) (300.48, 438.11)       /F31 ,	<|special_separator|>
(352.44, 429.39) (356.96, 429.39) (356.96, 438.14) (352.44, 438.14)       /F55 s	<|special_separator|>
(356.96, 428.48) (359.78, 428.48) (359.78, 434.57) (356.96, 434.57)       /F30 i	<|special_separator|>
(364.15, 429.40) (369.13, 429.40) (369.13, 438.11) (364.15, 438.11)       /F31 /	<|special_separator|>
(363.04, 429.54) (369.69, 429.54) (369.69, 438.11) (363.04, 438.11)       /F34 ∈	<|special_separator|>
(372.45, 429.24) (387.71, 429.24) (387.71, 437.79) (372.45, 437.79)       /F81 Top	<|special_separator|>
(387.71, 429.40) (392.89, 429.40) (392.89, 438.11) (387.71, 438.11)       /F31 k	<|special_separator|>
(394.87, 429.40) (398.74, 429.40) (398.74, 438.11) (394.87, 438.11)       /F28 (	<|special_separator|>
(398.74, 429.39) (403.26, 429.39) (403.26, 438.14) (398.74, 438.14)       /F55 s	<|special_separator|>
(403.26, 429.40) (407.13, 429.40) (407.13, 438.11) (403.26, 438.11)       /F28 )	<|special_separator|>
(493.05, 434.42) (504.67, 434.42) (504.67, 442.97) (493.05, 442.97)       /F81 (2)	<|special_separator|>
(107.64, 408.00) (132.47, 408.00) (132.47, 416.55) (107.64, 416.55)       /F81 where	<|special_separator|>
(135.29, 408.16) (147.14, 408.16) (147.14, 416.90) (135.29, 416.90)       /F55 W	<|special_separator|>
(147.14, 407.24) (168.39, 407.24) (168.39, 413.34) (147.14, 413.34)       /F27 router	<|special_separator|>
(171.71, 408.00) (336.41, 408.00) (336.41, 416.55) (171.71, 416.55)       /F81 is the expert embedding matrix of shape	<|special_separator|>
(339.23, 408.17) (343.10, 408.17) (343.10, 416.87) (339.23, 416.87)       /F28 (	<|special_separator|>
(343.10, 408.17) (364.31, 408.17) (364.31, 416.87) (343.10, 416.87)       /F31 N,D	<|special_separator|>
(364.32, 407.13) (376.33, 407.13) (376.33, 413.11) (364.32, 413.11)       /F81 emb	<|special_separator|>
(376.82, 408.17) (380.70, 408.17) (380.70, 416.87) (376.82, 416.87)       /F28 )	<|special_separator|>
(380.70, 408.00) (418.90, 408.00) (418.90, 416.55) (380.70, 416.55)       /F81 , and Top	<|special_separator|>
(418.90, 408.17) (424.09, 408.17) (424.09, 416.87) (418.90, 416.87)       /F31 k	<|special_separator|>
(427.23, 408.00) (504.00, 408.00) (504.00, 416.55) (427.23, 416.55)       /F81 is the operator that	<|special_separator|>
(108.00, 397.04) (164.45, 397.04) (164.45, 405.60) (108.00, 405.60)       /F81 selects the top	<|special_separator|>
(166.94, 397.21) (172.12, 397.21) (172.12, 405.91) (166.94, 405.91)       /F31 k	<|special_separator|>
(174.93, 397.04) (218.93, 397.04) (218.93, 405.60) (174.93, 405.60)       /F81 logits from	<|special_separator|>
(221.42, 397.20) (225.94, 397.20) (225.94, 405.94) (221.42, 405.94)       /F55 s	<|special_separator|>
(225.94, 397.04) (414.05, 397.04) (414.05, 405.60) (225.94, 405.60)       /F81 . The final output of the SMoE is then given by	<|special_separator|>
(257.72, 367.36) (262.60, 367.36) (262.60, 376.06) (257.72, 376.06)       /F31 y	<|special_separator|>
(265.73, 367.36) (273.47, 367.36) (273.47, 376.06) (265.73, 376.06)       /F28 =	<|special_separator|>
(279.91, 380.38) (286.22, 380.38) (286.22, 386.48) (279.91, 386.48)       /F30 N	<|special_separator|>
(276.24, 372.75) (290.63, 372.75) (290.63, 379.12) (276.24, 379.12)       /F21 ∑	<|special_separator|>
(276.50, 356.29) (280.28, 356.29) (280.28, 362.39) (276.50, 362.39)       /F30 e	<|special_separator|>
(280.28, 356.29) (290.37, 356.29) (290.37, 362.39) (280.28, 362.39)       /F27 =1	<|special_separator|>
(292.29, 367.36) (297.04, 367.36) (297.04, 376.06) (292.29, 376.06)       /F31 g	<|special_separator|>
(297.40, 367.36) (301.27, 367.36) (301.27, 376.06) (297.40, 376.06)       /F28 (	<|special_separator|>
(301.28, 367.36) (305.92, 367.36) (305.92, 376.06) (301.28, 376.06)       /F31 e	<|special_separator|>
(308.68, 367.50) (311.45, 367.50) (311.45, 376.06) (308.68, 376.06)       /F34 |	<|special_separator|>
(314.22, 367.35) (320.26, 367.35) (320.26, 376.09) (314.22, 376.09)       /F55 x	<|special_separator|>
(320.26, 367.36) (324.14, 367.36) (324.14, 376.06) (320.26, 376.06)       /F28 )	<|special_separator|>
(326.35, 367.50) (329.12, 367.50) (329.12, 376.06) (326.35, 376.06)       /F34 ·	<|special_separator|>
(331.33, 367.36) (336.21, 367.36) (336.21, 376.06) (331.33, 376.06)       /F31 f	<|special_separator|>
(336.21, 366.43) (339.99, 366.43) (339.99, 372.53) (336.21, 372.53)       /F30 e	<|special_separator|>
(340.49, 367.36) (344.36, 367.36) (344.36, 376.06) (340.49, 376.06)       /F28 (	<|special_separator|>
(344.36, 367.35) (350.41, 367.35) (350.41, 376.09) (344.36, 376.09)       /F55 x	<|special_separator|>
(350.41, 367.36) (354.28, 367.36) (354.28, 376.06) (350.41, 376.06)       /F28 )	<|special_separator|>
(493.05, 367.19) (504.67, 367.19) (504.67, 375.75) (493.05, 375.75)       /F81 (3)	<|special_separator|>
(107.53, 337.61) (131.80, 337.61) (131.80, 346.16) (107.53, 346.16)       /F81 When	<|special_separator|>
(134.84, 337.78) (139.59, 337.78) (139.59, 346.48) (134.84, 346.48)       /F31 g	<|special_separator|>
(139.95, 337.78) (143.82, 337.78) (143.82, 346.48) (139.95, 346.48)       /F28 (	<|special_separator|>
(143.82, 337.78) (148.46, 337.78) (148.46, 346.48) (143.82, 346.48)       /F31 e	<|special_separator|>
(152.25, 337.92) (155.02, 337.92) (155.02, 346.48) (152.25, 346.48)       /F34 |	<|special_separator|>
(158.81, 337.77) (164.86, 337.77) (164.86, 346.51) (158.81, 346.51)       /F55 x	<|special_separator|>
(164.86, 337.78) (189.04, 337.78) (189.04, 346.48) (164.86, 346.48)       /F28 ) = 0	<|special_separator|>
(189.04, 337.61) (191.58, 337.61) (191.58, 346.16) (189.04, 346.16)       /F81 ,	<|special_separator|>
(194.76, 337.78) (199.64, 337.78) (199.64, 346.48) (194.76, 346.48)       /F31 f	<|special_separator|>
(199.64, 336.85) (203.42, 336.85) (203.42, 342.95) (199.64, 342.95)       /F30 e	<|special_separator|>
(203.92, 337.78) (207.79, 337.78) (207.79, 346.48) (203.92, 346.48)       /F28 (	<|special_separator|>
(207.79, 337.77) (213.84, 337.77) (213.84, 346.51) (207.79, 346.51)       /F55 x	<|special_separator|>
(213.84, 337.78) (217.72, 337.78) (217.72, 346.48) (213.84, 346.48)       /F28 )	<|special_separator|>
(220.76, 337.61) (504.00, 337.61) (504.00, 346.16) (220.76, 346.16)       /F81 will not need to be evaluated, thus reducing computation cost during	<|special_separator|>
(108.00, 326.65) (472.21, 326.65) (472.21, 335.20) (108.00, 335.20)       /F81 training and inference. The key designs of the Granite MoE models are summarized below:	<|special_separator|>
(108.00, 301.28) (216.00, 301.28) (216.00, 310.24) (108.00, 310.24)       /F90 Dropless Token Routing.	<|special_separator|>
(225.96, 301.30) (504.00, 301.30) (504.00, 309.85) (225.96, 309.85)       /F81 Since each token selects experts independently, some experts could	<|special_separator|>
(108.00, 290.34) (505.25, 290.34) (505.25, 298.89) (108.00, 298.89)       /F81 receive more tokens than others. In previous MoE models, like Switch Transformer (Fedus et al.,	<|special_separator|>
(108.00, 279.38) (504.00, 279.38) (504.00, 287.93) (108.00, 287.93)       /F81 2022) and Deepseek-V2 (Liu et al., 2024a), a capacity cap is set for each expert or device, and the	<|special_separator|>
(108.00, 268.42) (504.35, 268.42) (504.35, 276.97) (108.00, 276.97)       /F81 extra tokens that exceed the cap are dropped. As observed in Gale et al. (2023), this cap negatively	<|special_separator|>
(108.00, 257.46) (504.00, 257.46) (504.00, 266.01) (108.00, 266.01)       /F81 affects the model training stability and loss. In our training, we use ScatterMoE (Tan et al., 2024), a	<|special_separator|>
(108.00, 246.50) (457.59, 246.50) (457.59, 255.05) (108.00, 255.05)       /F81 dropless MoE implementation, to avoid token dropping and improve training efficiency.	<|special_separator|>
(108.00, 221.13) (199.34, 221.13) (199.34, 230.08) (108.00, 230.08)       /F90 Fine-grained Experts.	<|special_separator|>
(209.30, 221.14) (504.00, 221.14) (504.00, 229.69) (209.30, 229.69)       /F81 Recent studies (Krajewski et al., 2024; Dai et al., 2024) suggest that setting	<|special_separator|>
(108.00, 210.18) (504.00, 210.18) (504.00, 218.74) (108.00, 218.74)       /F81 the size of experts in MoE to mirror the feed-forward layer is not optimal. Instead, increasing the	<|special_separator|>
(108.00, 199.22) (504.00, 199.22) (504.00, 207.78) (108.00, 207.78)       /F81 expert granularity, number of experts, and number of activated experts could increase the possible	<|special_separator|>
(108.00, 188.26) (504.00, 188.26) (504.00, 196.82) (108.00, 196.82)       /F81 combinations of experts and result in better model performance. Following these observations, we use	<|special_separator|>
(108.00, 177.31) (505.24, 177.31) (505.24, 185.86) (108.00, 185.86)       /F81 fine-grained experts and a larger number of activated experts in Granite 3.0 MoE models. Specifically,	<|special_separator|>
(107.64, 166.35) (160.48, 166.35) (160.48, 174.90) (107.64, 174.90)       /F81 we use a top-	<|special_separator|>
(160.48, 166.51) (165.67, 166.51) (165.67, 175.22) (160.48, 175.22)       /F31 k	<|special_separator|>
(168.47, 166.35) (463.45, 166.35) (463.45, 174.90) (168.47, 174.90)       /F81 of 8 out of 32 and 40 experts respectively for the 1B and 3B MoE models.	<|special_separator|>
(108.00, 140.97) (198.16, 140.97) (198.16, 149.93) (108.00, 149.93)       /F90 Load Balancing Loss.	<|special_separator|>
(208.12, 140.99) (504.00, 140.99) (504.00, 149.54) (208.12, 149.54)       /F81 To avoid routing tokens repeatedly to the same expert and wasting the extra	<|special_separator|>
(108.00, 130.03) (504.67, 130.03) (504.67, 138.58) (108.00, 138.58)       /F81 capacity in other experts, we use the frequency-based auxiliary loss introduced in Fedus et al. (2022)	<|special_separator|>
(271.57, 100.48) (278.44, 100.48) (278.44, 109.05) (271.57, 109.05)       /F34 L	<|special_separator|>
(278.44, 099.42) (281.94, 099.42) (281.94, 105.51) (278.44, 105.51)       /F30 b	<|special_separator|>
(285.21, 100.34) (292.96, 100.34) (292.96, 109.05) (285.21, 109.05)       /F28 =	<|special_separator|>
(295.72, 100.34) (303.73, 100.34) (303.73, 109.05) (295.72, 109.05)       /F31 N	<|special_separator|>
(310.14, 113.37) (316.45, 113.37) (316.45, 119.46) (310.14, 119.46)       /F30 N	<|special_separator|>
(306.48, 105.73) (320.87, 105.73) (320.87, 112.11) (306.48, 112.11)       /F21 ∑	<|special_separator|>
(307.22, 089.16) (310.04, 089.16) (310.04, 095.25) (307.22, 095.25)       /F30 i	<|special_separator|>
(310.04, 089.16) (320.12, 089.16) (320.12, 095.25) (310.04, 095.25)       /F27 =1	<|special_separator|>
(322.53, 100.34) (327.40, 100.34) (327.40, 109.05) (322.53, 109.05)       /F31 f	<|special_separator|>
(327.40, 099.42) (330.22, 099.42) (330.22, 105.51) (327.40, 105.51)       /F30 i	<|special_separator|>
(330.72, 100.34) (337.12, 100.34) (337.12, 109.05) (330.72, 109.05)       /F31 P	<|special_separator|>
(337.12, 099.42) (339.94, 099.42) (339.94, 105.51) (337.12, 105.51)       /F30 i	<|special_separator|>
(493.05, 100.18) (504.67, 100.18) (504.67, 108.73) (493.05, 108.73)       /F81 (4)	<|special_separator|>
(107.64, 071.03) (132.47, 071.03) (132.47, 079.58) (107.64, 079.58)       /F81 where	<|special_separator|>
(135.09, 071.19) (143.10, 071.19) (143.10, 079.90) (135.09, 079.90)       /F31 N	<|special_separator|>
(146.81, 071.03) (247.74, 071.03) (247.74, 079.58) (146.81, 079.58)       /F81 is the number of experts,	<|special_separator|>
(250.40, 071.19) (255.28, 071.19) (255.28, 079.90) (250.40, 079.90)       /F31 f	<|special_separator|>
(255.28, 070.27) (258.10, 070.27) (258.10, 076.37) (255.28, 076.37)       /F30 i	<|special_separator|>
(261.22, 071.03) (467.22, 071.03) (467.22, 079.58) (261.22, 079.58)       /F81 is the fraction of tokens dispatched to expert i, and	<|special_separator|>
(469.84, 071.19) (476.24, 071.19) (476.24, 079.90) (469.84, 079.90)       /F31 P	<|special_separator|>
(476.24, 070.27) (479.06, 070.27) (479.06, 076.37) (476.24, 076.37)       /F30 i	<|special_separator|>
(482.18, 071.03) (504.00, 071.03) (504.00, 079.58) (482.18, 079.58)       /F81 is the	<|special_separator|>
(108.00, 060.07) (318.04, 060.07) (318.04, 068.62) (108.00, 068.62)       /F81 fraction of the router probability allocated for expert	<|special_separator|>
(320.52, 060.24) (323.95, 060.24) (323.95, 068.94) (320.52, 068.94)       /F31 i	<|special_separator|>
(323.96, 060.07) (504.00, 060.07) (504.00, 068.62) (323.96, 068.62)       /F81 . Intuitively, this loss penalises over-usage of	<|special_separator|>
(303.33, 030.18) (308.32, 030.18) (308.32, 038.74) (303.33, 038.74)       /F81 4