#dataset-engineering News & Analysis

3 articles tagged with #dataset-engineering. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

FormalASR: End-to-End Spoken Chinese to Formal Text

Researchers present FormalASR, compact end-to-end models that convert spoken Chinese directly into formal written text, eliminating the need for post-processing with large language models. Built on newly created datasets and fine-tuned versions of Qwen3-ASR, the solution achieves significant error reduction while enabling lightweight on-device deployment.

AI · Bullish · arXiv – CS AI · May 27 · 7/10

LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

Researchers introduce LocateAnything, a new vision-language model framework that uses Parallel Box Decoding to detect and localize objects simultaneously rather than sequentially, improving both inference speed and accuracy. The team curated a 138-million-sample dataset and demonstrated significant performance improvements across multiple benchmarks.

AI · Neutral · arXiv – CS AI · Mar 3 · 6/10 · 6

Summer-22B: A Systematic Approach to Dataset Engineering and Training at Scale for Video Foundation Model

Researchers documented their experience training Summer-22B, a video foundation model developed from scratch using 50 million clips. The report details engineering challenges, dataset curation methods, and architectural decisions, emphasizing that dataset engineering consumed the majority of development effort.