Provides an implementation of today's most used tokenizers, with a
focus on performance and versatility.

Main features:
- Train new vocabularies and tokenize, using today's most used
  tokenizers.
- Extremely fast (both training and tokenization), thanks to the Rust
  implementation. Takes less than 20 seconds to tokenize a GB of text
  on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for research and production.
- Normalization comes with alignment tracking. It's always possible
  to get the part of the original sentence that corresponds to a given
  token.
- Does all the pre-processing: truncate, pad, and add the special
  tokens your model needs.

Minimal usage sketches for training, alignment tracking, and
pre-processing follow below.

WWW: https://github.com/huggingface/tokenizers
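
A minimal sketch of training a new vocabulary and tokenizing with it,
assuming the library's Python bindings; "corpus.txt" is a placeholder
path, and the exact token splits depend on the training data:

    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.trainers import BpeTrainer
    from tokenizers.pre_tokenizers import Whitespace

    # Build a BPE tokenizer and train it from plain-text files.
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]"])
    tokenizer.train(files=["corpus.txt"], trainer=trainer)  # placeholder file

    output = tokenizer.encode("Hello, y'all! How are you?")
    print(output.tokens)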
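
A sketch of the alignment tracking mentioned above, reusing the
tokenizer from the previous sketch: every token carries (start, end)
character offsets into the original sentence, so the corresponding
substring can always be recovered.

    sentence = "Hello, y'all! How are you?"
    encoding = tokenizer.encode(sentence)
    # offsets are (start, end) pairs into the original string
    for token, (start, end) in zip(encoding.tokens, encoding.offsets):
        print(token, "->", sentence[start:end])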
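
A sketch of the built-in pre-processing, again continuing from the
trained tokenizer above; padding and truncation are configured once on
the tokenizer and then applied to every encoded batch. The "[PAD]"
token and the max length of 16 are illustrative choices.

    # Pad to the longest sequence in the batch, cap length at 16 tokens.
    tokenizer.enable_padding(pad_id=tokenizer.token_to_id("[PAD]"),
                             pad_token="[PAD]")
    tokenizer.enable_truncation(max_length=16)

    batch = tokenizer.encode_batch(["A short sentence.", "Another one."])
    print(batch[0].ids)             # padded/truncated token ids
    print(batch[0].attention_mask)  # 1 for real tokens, 0 for padding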