chai

Dribs

Plan to add a new language like Tibetan to an LLM (e.g., Apertus).

Reality check: does Apertus already “support” Tibetan? Apertus is a fully open Swiss LLM (8B and 70B) that claims coverage of 1,000+ languages. That likely means some coverage through byte- or Unicode-level tokenization, not that the model is good at Tibetan. Expect inefficient tokenization and thin training data, so plan on adaptation. ([Amazon Web Services, Inc.][1])
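One way to make the tokenization concern concrete is to measure tokens per Tibetan syllable. The sketch below is a minimal illustration, not a measurement of Apertus itself: `byte_fallback_tokenize` is a hypothetical worst-case stand-in that emits one token per UTF-8 byte, and `tokens_per_syllable` is a helper name invented here. To check a real model, you would pass its own tokenizer's encode function in place of the stand-in.

```python
from typing import Callable, List

TSHEG = "\u0f0b"  # Tibetan intersyllabic mark

def byte_fallback_tokenize(text: str) -> List[bytes]:
    """Hypothetical worst case: one token per UTF-8 byte."""
    return [bytes([b]) for b in text.encode("utf-8")]

def tokens_per_syllable(text: str, tokenize: Callable[[str], list]) -> float:
    """Average token count per tsheg-delimited syllable."""
    syllables = [s for s in text.split(TSHEG) if s.strip()]
    if not syllables:
        return 0.0
    return len(tokenize(text)) / len(syllables)

# Two syllables ("bod skad"), each followed by a tsheg: 8 code points.
sample = "\u0f56\u0f7c\u0f51\u0f0b\u0f66\u0f90\u0f51\u0f0b"

# Every code point in the Tibetan block takes 3 bytes in UTF-8,
# so pure byte fallback costs 24 tokens for 2 syllables.
print(tokens_per_syllable(sample, byte_fallback_tokenize))  # 12.0
```

A real tokenizer should land well below this worst case if Tibetan was represented when its vocabulary was built; a value close to it suggests the vocabulary has few Tibetan merges.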
Get clean Tibetan data (legally). Start with open corpora and pipelines: BUDA/BDRC (large Tibetan archives; check access terms), OpenPecha (tools, corpora, and the Botok/pybo tokenizers), plus minority-language corpora such as MC². Curate for Unicode correctness (Tibetan block, U+0F00–U+0FFF), normalize punctuation (tsheg and shad), deduplicate, and filter OCR noise. ([library.bdrc.io][2])
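The curation steps above can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not the BDRC or OpenPecha pipeline: the function names, the 0.8 Tibetan-character ratio, the 5-character minimum, the choice of NFC, and the folding of the non-breaking tsheg (U+0F0C) into the regular tsheg (U+0F0B) are all illustrative choices to tune against real data. Deduplication here is exact-match only; near-duplicate detection would need something extra.

```python
import hashlib
import re
import unicodedata
from typing import Iterable, Iterator

REPEATED_TSHEG = re.compile("\u0f0b{2,}")
SPACE_RUN = re.compile(r"\s+")

def normalize(line: str) -> str:
    """Normalize Unicode form, tsheg variants, and whitespace (assumed policy)."""
    line = unicodedata.normalize("NFC", line)
    line = line.replace("\u0f0c", "\u0f0b")    # non-breaking tsheg -> tsheg
    line = REPEATED_TSHEG.sub("\u0f0b", line)  # collapse doubled tsheg
    return SPACE_RUN.sub(" ", line).strip()

def tibetan_ratio(line: str) -> float:
    """Share of non-space characters inside the Tibetan block U+0F00-U+0FFF."""
    visible = [c for c in line if not c.isspace()]
    if not visible:
        return 0.0
    return sum(1 for c in visible if "\u0f00" <= c <= "\u0fff") / len(visible)

def clean(lines: Iterable[str], min_ratio: float = 0.8,
          min_chars: int = 5) -> Iterator[str]:
    """Yield normalized, mostly-Tibetan lines, dropping exact duplicates."""
    seen = set()
    for raw in lines:
        line = normalize(raw)
        if len(line) < min_chars or tibetan_ratio(line) < min_ratio:
            continue  # too short, or likely OCR noise / other-language text
        digest = hashlib.sha1(line.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield line

raw_lines = [
    "\u0f56\u0f7c\u0f51\u0f0b\u0f66\u0f90\u0f51\u0f0b\u0f0d",
    # Same text with a doubled tsheg and a non-breaking tsheg.
    "\u0f56\u0f7c\u0f51\u0f0b\u0f0b\u0f66\u0f90\u0f51\u0f0c\u0f0d",
    "hello world 123",
    "\u0f56\u0f7c\u0f51 abc xyz foo",
]
cleaned = list(clean(raw_lines))
```

With these sample lines only the first survives: the second normalizes to the same string and is dropped as a duplicate, and the last two fall below the Tibetan-character ratio.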
