AINeutralarXiv – CS AI · 6h ago5/10
🧠
Toten: Knowledge-Based Ontological Tokenization Of Physical Quantities And Technical Notation In Brazilian Portuguese
TOTEN is a new tokenization framework for Brazilian Portuguese that uses formal ontologies to semantically preserve physical quantities, units, and technical notation instead of fragmenting them like standard statistical methods. The system significantly outperforms existing baselines in numerical reconstruction and dimensional equivalence, achieving 0.775-0.904 accuracy compared to 0.627-0.703 for competing approaches.