freebsd-ports/textproc/py-tokenizers/pkg-descr
Provides an implementation of today's most used tokenizers, with a
focus on performance and versatility.

Main features:
- Train new vocabularies and tokenize, using today's most used
  tokenizers.
- Extremely fast (both training and tokenization), thanks to the Rust
  implementation. Takes less than 20 seconds to tokenize a GB of text
  on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for research and production.
- Normalization comes with alignment tracking. It's always possible
  to get the part of the original sentence that corresponds to a given
  token.
- Does all the pre-processing: truncate, pad, and add the special
  tokens your model needs.
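
A minimal sketch of the features above, assuming the Python API from
the upstream quick tour ("corpus.txt" is a placeholder for any
plain-text training corpus; exact signatures may vary by release):

  from tokenizers import Tokenizer
  from tokenizers.models import BPE
  from tokenizers.trainers import BpeTrainer
  from tokenizers.pre_tokenizers import Whitespace

  # Train a new BPE vocabulary ("corpus.txt" is a placeholder path).
  tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
  tokenizer.pre_tokenizer = Whitespace()
  trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]"])
  tokenizer.train(files=["corpus.txt"], trainer=trainer)

  # Pre-processing: truncate long inputs, pad short ones with [PAD].
  tokenizer.enable_truncation(max_length=128)
  tokenizer.enable_padding(pad_token="[PAD]",
                           pad_id=tokenizer.token_to_id("[PAD]"))

  # Encode a sentence; offsets map each token back to the original text.
  encoding = tokenizer.encode("Tokenizers are fast and versatile.")
  print(encoding.tokens)   # sub-word tokens
  print(encoding.ids)      # vocabulary ids
  print(encoding.offsets)  # (start, end) character spans in the input

The truncation and padding settings persist on the tokenizer object, so
later encode() calls return model-ready input without extra steps.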

WWW: https://github.com/huggingface/tokenizers