---
title: "Count tokens with the Gemma 2 Tokenizer in Rust"
description: "Quickly count tokens in Large Language Models (LLMs) like Gemini and Gemma using Rust. This efficient method avoids slow network calls, leveraging the `tokenizers` crate for local processing. The code example demonstrates token counting with minimal dependencies, even handling `aarch64` architecture challenges. Get started with fast, local token counting now!\n"
slug: "count-tokens-with-the-gemma-2-tokenizer-in-rust"
created: 2024-11-22T20:46:00Z
updated: 2024-11-22T20:46:00Z
tags:
  - "rust"
  - "gemma"
  - "tokenizer"
  - "artificial-intelligence"
  - "machine-learning"
ai_assisted: false
---

For those working with Large Language Models, counting the tokens in an input is a frequent task. As [Gemini and Gemma share the same tokenizer][5] (at least for now), it is quite useful to be able to count tokens locally, without making network calls [to an endpoint][6], which can be much slower.

In Rust, this can be achieved with the [`tokenizers`][7] crate. The sample code below is a minimalistic version of [this sample code][2]. It removes the need for the [`candle-examples`][4] crate but still uses the [`hf_hub`](https://crates.io/crates/hf-hub) crate to manage the download of the model files, although those could be downloaded manually too.

```rust
use hf_hub::{api::sync::ApiBuilder, Repo, RepoType};
use tokenizers::Tokenizer;

const HF_TOKEN: &str = "YOUR_TOKEN_HERE";
const MODEL_ID: &str = "google/gemma-2-2b";
const MODEL_REVISION: &str = "main";

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let api = ApiBuilder::new().with_token(Some(HF_TOKEN.to_string())).build()?;
    let repo = api.repo(Repo::with_revision(
        MODEL_ID.to_string(),
        RepoType::Model,
        MODEL_REVISION.to_string(),
    ));

    // Downloads tokenizer.json on first use; later runs reuse the local cache.
    let tokenizer_filename = repo.get("tokenizer.json")?;
    let tokenizer = Tokenizer::from_file(tokenizer_filename).unwrap();

    let prompt = "Why is the sky blue?";
    let tokens = tokenizer
        .encode(prompt, true)
        .unwrap()
        .get_ids()
        .to_vec();

    println!("Token count: {}", tokens.len());
    Ok(())
}
```

The `hf_hub` crate is smart and caches the downloaded files, so later runs skip the download. Initializing the tokenizer from the cached file still takes about 600ms, but it only needs to happen once per application; after that, counting tokens is quite fast, generally under 1ms.
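
One way to pay that initialization cost only once is to keep the tokenizer in a lazily initialized static. The sketch below uses `std::sync::OnceLock` from the standard library to illustrate the pattern. `StandInTokenizer`, `INIT_CALLS`, and `count_tokens` are hypothetical names introduced for illustration, and the stand-in counts whitespace-separated words rather than real Gemma tokens, so that the snippet compiles without external crates. In a real program the `load` body would presumably be the `Tokenizer::from_file` call from the sample above, assuming the tokenizer type can be shared across threads.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Hypothetical stand-in for `tokenizers::Tokenizer`, so this sketch
/// needs no external crates.
struct StandInTokenizer;

/// Counts how often `load` ran, to show that initialization happens once.
static INIT_CALLS: AtomicUsize = AtomicUsize::new(0);

/// Process-wide tokenizer, initialized lazily on first use.
static TOKENIZER: OnceLock<StandInTokenizer> = OnceLock::new();

impl StandInTokenizer {
    /// Stand-in for the expensive load step of the real tokenizer.
    fn load() -> Self {
        INIT_CALLS.fetch_add(1, Ordering::SeqCst);
        StandInTokenizer
    }

    /// Assumption for illustration only: counts whitespace-separated words,
    /// which is NOT how the Gemma tokenizer splits text.
    fn count(&self, text: &str) -> usize {
        text.split_whitespace().count()
    }
}

/// Counts tokens, initializing the shared tokenizer on the first call only.
fn count_tokens(text: &str) -> usize {
    TOKENIZER.get_or_init(StandInTokenizer::load).count(text)
}
```

Every caller goes through `count_tokens`, so only the first call pays for `load`; later calls reuse the same instance, even when they come from different threads.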

## `aarch64-pc-windows-msvc` issues with `candle-examples`
The [original example][2] that this code is based on depends on the [`candle-examples`][4] crate, which fails to build on `aarch64` architectures. The failure comes from one of its dependencies, the [`gemm-f16`][3] crate; workarounds are described in [this issue][1].

For `aarch64-pc-windows-msvc`, adding the configuration below to the `.cargo/config.toml` file should do the trick:

```toml
[build]
rustflags = [
    "-Ctarget-feature=+fp16,+fhm"
]
```

[1]: https://github.com/sarah-quinones/gemm/issues/31
[2]: https://github.com/Kamalabot/cratesploring/blob/main/candle_explorer/gemma-tokenizer/src/main.rs
[3]: https://crates.io/crates/gemm-f16
[4]: https://crates.io/crates/candle-examples
[5]: https://medium.com/google-cloud/a-gemini-and-gemma-tokenizer-in-java-e18831ac9677
[6]: https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/count-tokens
[7]: https://crates.io/crates/tokenizers
[8]: https://crates.io/crates/candle-examples
