AIBullish · arXiv CS AI · 14h ago · 6/10
Advancing Polish Language Modeling through Tokenizer Optimization in the Bielik v3 7B and 11B Series
Researchers have optimized the Bielik v3 language models (7B and 11B parameters) by replacing universal tokenizers with a Polish-specific vocabulary, addressing inefficiencies in how Polish morphology is represented. This optimization reduces token fertility (the average number of tokens produced per word), lowers inference costs, and expands the effective context window, while maintaining multilingual capability through advanced training techniques including supervised fine-tuning and reinforcement learning.
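The fertility effect can be illustrated with a minimal sketch. The toy tokenizers and the tiny vocabulary below are hypothetical, not the Bielik tokenizers; a real measurement would load the actual tokenizer (e.g. via Hugging Face `transformers`) and average over a Polish corpus.

```python
def fertility(tokenize, text):
    """Token fertility: average number of tokens produced per word."""
    words = text.split()
    tokens = [t for w in words for t in tokenize(w)]
    return len(tokens) / len(words)

# Toy "universal" tokenizer: breaks every word into 3-character chunks,
# mimicking a vocabulary that lacks whole Polish word forms.
def universal_tokenize(word):
    return [word[i:i + 3] for i in range(0, len(word), 3)]

# Toy "Polish-optimized" tokenizer: keeps words found in its (tiny,
# illustrative) vocabulary intact, falling back to chunking otherwise.
POLISH_VOCAB = {"język", "polski", "model"}
def polish_tokenize(word):
    return [word] if word.lower() in POLISH_VOCAB else universal_tokenize(word)

text = "język polski model"
print(fertility(universal_tokenize, text))  # 2.0 tokens per word
print(fertility(polish_tokenize, text))     # 1.0 token per word
```

Lower fertility means fewer tokens for the same text, which directly cuts inference cost and lets more text fit in a fixed-size context window.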