Tiktokenizer.js

A custom tokenizer visualizer written in pure JavaScript that mirrors OpenAI's GPT-2/GPT-3 Byte Pair Encoding (BPE) tokenizer to show how text is split into subword units. The encoder.json and vocab.bpe files provided by OpenAI are used here, so the token IDs exactly match the official GPT-2/GPT-3 BPE representation. Currently, the dropdown at the top is just a placeholder for adding more tokenization schemes in the future. Try different inputs, including ASCII, emojis, and non-English languages!
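At its core, GPT-2 style BPE loads a token-to-ID map from encoder.json and an ordered list of merge rules from vocab.bpe, then repeatedly merges the lowest-ranked adjacent pair of symbols in each word until no merge rule applies. The snippet below is a simplified Node.js sketch of that merge loop, not the project's actual code: it assumes the two OpenAI files sit next to the script and it skips the byte-to-unicode remapping and the GPT-2 splitting regex, so it only behaves correctly for plain ASCII words.

```javascript
const fs = require("fs");

// encoder.json maps each BPE token string to its integer ID.
const encoder = JSON.parse(fs.readFileSync("encoder.json", "utf8"));

// vocab.bpe lists merges in priority order, one "left right" pair per line.
const merges = fs.readFileSync("vocab.bpe", "utf8")
  .split("\n")
  .slice(1)                                  // skip the "#version" header line
  .filter(line => line.trim().length > 0)
  .map(line => line.split(" "));

// Lower rank = that pair is merged earlier.
const bpeRanks = new Map(merges.map((pair, i) => [pair.join(" "), i]));

// Apply BPE merges to one word, starting from individual characters.
function bpe(word) {
  let parts = word.split("");
  while (parts.length > 1) {
    // Find the adjacent pair with the lowest merge rank.
    let bestRank = Infinity, bestIdx = -1;
    for (let i = 0; i < parts.length - 1; i++) {
      const rank = bpeRanks.get(parts[i] + " " + parts[i + 1]);
      if (rank !== undefined && rank < bestRank) {
        bestRank = rank;
        bestIdx = i;
      }
    }
    if (bestIdx === -1) break;               // no applicable merges left
    // Merge that pair in place and keep searching.
    parts.splice(bestIdx, 2, parts[bestIdx] + parts[bestIdx + 1]);
  }
  return parts;
}

// Look up the IDs of the merged subwords in encoder.json.
function encodeWord(word) {
  return bpe(word).map(token => encoder[token]);
}

// With the official files present, this prints the GPT-2 IDs for "hello".
console.log(encodeWord("hello"));
```

With the official encoder.json and vocab.bpe in place, this merge loop reproduces the subword splits you see in the visualizer for simple ASCII words; the full project additionally handles the byte-level remapping needed for emojis and non-English text.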
Did you notice any difference from the official tokenizer? 😎 Check out the GitHub Repo for this project. See more projects here.