🧠 AI | Neutral | Importance 5/10

TOTEN: Knowledge-Based Ontological Tokenization of Physical Quantities and Technical Notation in Brazilian Portuguese

arXiv – CS AI | Antonio de Sousa Leitão Filho; Allan Kardec Duailibe Barros Filho; Fabrício Saul Lima; Selby Mykael Lima dos Santos; Rejani Bandeira Vieira Sousa
🤖AI Summary

TOTEN is a new tokenization framework for Brazilian Portuguese that uses formal ontologies to keep physical quantities, units, and technical notation intact as semantic units, instead of fragmenting them as standard statistical methods do. On external corpora it outperforms existing baselines in numerical reconstruction and dimensional equivalence, reaching 0.775-0.904 accuracy compared to 0.627-0.703 for competing approaches.

Analysis

TOTEN addresses a fundamental limitation in natural language processing: standard tokenization methods like Byte-Pair Encoding treat technical and scientific text as generic language, fragmenting dimensioned quantities and symbolic expressions into meaningless subword units. This framework instead grounds tokenization in a formal ontology of engineering entities, coupling it with external validation systems for dimensional analysis, typography, and morphology. The approach shifts from statistical compression to declarative classification, preserving semantic structure in technical documents.
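To make the contrast with subword fragmentation concrete, here is a minimal sketch of atomic quantity tokenization. It is not TOTEN's implementation: the unit list, the regular expressions, and the `tokenize` function are all hypothetical, and a short hard-coded list stands in for the formal ontology the paper describes. It only illustrates the idea of keeping a number and its unit together as one labeled token, including the decimal comma used in Brazilian Portuguese.

```python
import re

# Hypothetical, tiny unit inventory; a real ontology would be far larger.
UNITS = ["m/s²", "m/s", "km/h", "kPa", "MPa", "kg", "°C", "mm", "m", "N", "V"]

# Longest units first so "m/s²" wins over "m/s" and "m".
_UNIT_ALT = "|".join(re.escape(u) for u in sorted(UNITS, key=len, reverse=True))

# Number with optional decimal comma, optional space, then a known unit
# that is not immediately followed by another letter (so "5 metros" is skipped).
_QUANTITY = re.compile(rf"\d+(?:,\d+)?\s?(?:{_UNIT_ALT})(?![A-Za-zÀ-ÿ])")

# Everything else: runs of word characters, or single punctuation marks.
_FALLBACK = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Return (token, label) pairs, keeping each quantity as one token."""
    tokens = []
    pos = 0
    for match in _QUANTITY.finditer(text):
        # Text before the quantity is split generically.
        for piece in _FALLBACK.findall(text[pos:match.start()]):
            tokens.append((piece, "TEXT"))
        tokens.append((match.group(), "QUANTITY"))
        pos = match.end()
    for piece in _FALLBACK.findall(text[pos:]):
        tokens.append((piece, "TEXT"))
    return tokens


if __name__ == "__main__":
    for token in tokenize("A aceleração é 9,81 m/s² no local."):
        print(token)
```

A statistical subword tokenizer may split "9,81 m/s²" into several pieces; the sketch keeps it as a single `QUANTITY` token while leaving ordinary words such as "5 metros" untouched.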

The research represents progress in domain-specific NLP where general-purpose tokenizers fail. Engineering, scientific, and technical documentation requires precise handling of quantities, units, and notation—areas where TOTEN demonstrates marked improvement. The system's robustness derives from coupling with established oracles (Pint for dimensional analysis, Unicode standards, Portuguese morphology rules), making it reproducible and maintainable rather than reliant on opaque learned patterns.
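The kind of check a dimensional-analysis oracle performs can be sketched with the standard library alone. The example below does not use Pint's API; it is a stand-in under stated assumptions, with a hypothetical hand-written `UNIT_TABLE` and hypothetical helper names (`resolve`, `same_dimension`, `equivalent`). Each unit is reduced to a scale factor and a vector of exponents over length, mass, and time, so two quantities are equivalent when both the exponents and the scaled values match.

```python
import re
from fractions import Fraction

# Hypothetical table: symbol -> (factor to SI base units,
# exponents over (length, mass, time)). A real system would delegate
# this to a units library rather than maintain it by hand.
UNIT_TABLE = {
    "m": (Fraction(1), (1, 0, 0)),
    "km": (Fraction(1000), (1, 0, 0)),
    "s": (Fraction(1), (0, 0, 1)),
    "h": (Fraction(3600), (0, 0, 1)),
    "kg": (Fraction(1), (0, 1, 0)),
    "N": (Fraction(1), (1, 1, -2)),
}


def resolve(unit_expr):
    """Reduce a product/quotient such as 'kg*m/s/s' to (factor, dims)."""
    factor, dims = Fraction(1), (0, 0, 0)
    # Each match is (operator, symbol); the first symbol has no operator.
    for op, symbol in re.findall(r"([*/]?)([A-Za-z]+)", unit_expr):
        unit_factor, unit_dims = UNIT_TABLE[symbol]
        sign = -1 if op == "/" else 1
        factor *= unit_factor ** sign
        dims = tuple(d + sign * u for d, u in zip(dims, unit_dims))
    return factor, dims


def same_dimension(unit_a, unit_b):
    """True when both unit expressions have identical dimension exponents."""
    return resolve(unit_a)[1] == resolve(unit_b)[1]


def equivalent(value_a, unit_a, value_b, unit_b):
    """True when both quantities denote the same physical magnitude."""
    factor_a, dims_a = resolve(unit_a)
    factor_b, dims_b = resolve(unit_b)
    return dims_a == dims_b and value_a * factor_a == value_b * factor_b


if __name__ == "__main__":
    print(equivalent(36, "km/h", 10, "m/s"))
    print(same_dimension("N", "kg*m/s/s"))
    print(same_dimension("km/h", "kg"))
```

Using `Fraction` keeps the conversion factors exact, so the equality test does not depend on floating-point rounding.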

The evaluation methodology is rigorous, combining internal benchmarking (EngQuant with 800 physically validated samples) and external validation across four Brazilian Portuguese corpora. Statistical significance testing with McNemar's test and Holm correction for multiple comparisons strengthens the claims. The 0.780 vs. 0.340 result on the internal benchmark against the best baseline (Quantulum3) is a substantial gap, though it narrows on external corpora (0.904 vs. 0.703), suggesting domain generalization challenges.
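For readers unfamiliar with the statistics named above, the following is a minimal standard-library sketch of an exact two-sided McNemar test on paired outcomes, followed by Holm's step-down adjustment. The function names are hypothetical, the counts in the usage lines are invented for illustration and are not results from the paper, and the summary does not say whether the authors used the exact or the chi-squared variant of the test.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: items system A got right and system B got wrong.
    c: items system A got wrong and system B got right.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) probability of a split at least as lopsided as k.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def holm(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Smallest p is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity of the adjusted values.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


if __name__ == "__main__":
    # Made-up discordant counts for three hypothetical system comparisons.
    raw = [mcnemar_exact(40, 12), mcnemar_exact(25, 18), mcnemar_exact(30, 9)]
    print(raw)
    print(holm(raw))
```

McNemar's test suits this setting because both systems are scored on the same items, so only the items where they disagree carry information; Holm's procedure then controls the family-wise error rate across several corpus-level comparisons.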

For developers building technical document processing systems in Portuguese, TOTEN offers a practical alternative to generic tokenizers. The framework's dependency on external oracles and Portuguese-specific morphology means broader adoption requires adaptation for other languages and domains. Organizations processing engineering documentation, technical specifications, or scientific literature in Portuguese could benefit from this specialized approach.

Key Takeaways
  • TOTEN achieves 0.775-0.904 numerical reconstruction accuracy versus 0.627-0.703 for state-of-the-art baselines on external corpora
  • Knowledge-based ontological tokenization preserves semantic structure of physical quantities and technical notation better than statistical methods
  • Framework couples declarative classification with external oracles (Pint, Unicode, RSLP) for deterministic, reproducible results
  • Rigorous evaluation includes an internal benchmark (EngQuant, N=800) and four external Brazilian Portuguese corpora, with statistical significance testing
  • The system achieves perfect ontological atomicity across all contrasts while matching the authoritative Pint oracle on dimensional equivalence