From PyTorch to Browser: Creating a Web-Friendly AI Model

Motivation

There is a wide range of AI models readily available to web developers through tools like MediaPipe and Transformers.js and, more recently, the Built-in AI APIs, all of which offer friendly APIs. These tools cover many common tasks, from text tasks like text classification or language detection, to vision tasks like image segmentation, and even large language models.

However, developers will sometimes have specific needs that are not covered by readily available models, or those models might not be available in a web friendly format.

In this article, I explore building a model from scratch using PyTorch, and exporting it to a browser friendly format that is compatible with Google's LiteRT library.

Disclaimer: I'm not a Python developer or an AI engineer. This is just an exercise to understand the process of building and deploying a model, end to end.

Picking a problem and outlining a solution

Ideally, such an experiment should happen on a problem that is at least adjacent to the real world. The inspiration for this one comes from a colleague trying to understand the sentiment of messages from a mailing list - whether the messages were positive, negative or neutral. This is also not too far away from another problem I heard from a developer, who had their own specific, somewhat more lenient rules for toxicity detection.

The Google AI Embeddings API has an option that optimizes embeddings for classification, and this looked like a good opportunity to experiment with building a classification model on top of the embeddings generated by that API.

The last thing needed to get started is a dataset, as finding good datasets is crucial for building good models. Fortunately, dataset hubs like Kaggle or Hugging Face host various datasets we can use and, for this particular problem, I chose this Kaggle dataset for sentiment analysis on YouTube comments. It contains 17872 comments, each one classified as positive, negative, or neutral.

Preparing the dataset

With the dataset selected and downloaded, the next step is transforming it into a format that can be used by our model. Most importantly, in this case, the comments from the dataset must be transformed into the embeddings we are going to use as the input for the model.

As this is a time consuming process, the solution is to pre-process the data and save the embeddings into a separate file, using the Google GenAI Python library:

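import csv
import json

from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR API KEY")
with open("YoutubeCommentsDataSet.csv", "r", encoding="utf-8") as csvfile:
    csvreader = csv.DictReader(csvfile)
    # Append mode: delete the output file before re-running from scratch,
    # otherwise the same rows are written twice.
    with open("YouTubeCommentsEmbeddings.jsonl", "a") as embeddingsfile:
        for row in csvreader:
            # One API call per comment, with embeddings optimized for classification.
            result = client.models.embed_content(
                model='text-embedding-004',
                contents=row['Comment'],
                config=types.EmbedContentConfig(task_type="CLASSIFICATION")
            )
            result = {
                "embeddings": result.embeddings[0].values,
                "sentiment": row['Sentiment'],
            }
            # One JSON object per line (JSON Lines format).
            json_result = json.dumps(result)
            embeddingsfile.write(json_result + "\n")
        embeddingsfile.flush()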

The code above loads all comments from YoutubeCommentsDataSet.csv, transforms the comments into embeddings using the Google Gen AI SDK, and saves the embeddings and the sentiment into a new file, YouTubeCommentsEmbeddings.jsonl.
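Because the output file is opened in append mode and the script makes one API call per comment, an interrupted run is worth resuming rather than restarting. The sketch below is one possible way to do that, not part of the original script: count_lines() and embed_rows() are hypothetical helpers, and embed_fn stands in for a function wrapping client.models.embed_content(). It assumes earlier runs only ever wrote complete lines, in the same order as the CSV.

```python
import csv
import json
import os


def count_lines(path):
    """Return how many rows were already written, or 0 if the file is missing."""
    if not os.path.exists(path):
        return 0
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for _ in f)


def embed_rows(csv_path, out_path, embed_fn):
    """Append one JSON line per CSV row, skipping rows written by earlier runs.

    embed_fn is a hypothetical callable (text -> list of floats) that would
    wrap client.models.embed_content() in the real script.
    """
    already_done = count_lines(out_path)
    written = 0
    with open(csv_path, "r", encoding="utf-8", newline="") as csvfile, \
            open(out_path, "a", encoding="utf-8") as out:
        for index, row in enumerate(csv.DictReader(csvfile)):
            if index < already_done:
                continue  # this row was embedded by a previous run
            record = {
                "embeddings": embed_fn(row["Comment"]),
                "sentiment": row["Sentiment"],
            }
            out.write(json.dumps(record) + "\n")
            written += 1
    return written
```

Calling embed_rows() a second time with the same paths writes nothing new, since every row index is below the number of lines already in the output file.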

Training the model

Loading the previously generated data

Before training the model, the file created in the previous step must be loaded. While doing that, the positive, neutral and negative sentiment values are also mapped to 0, 1, and 2.

import json

import torch

# Map each sentiment label to the class index the model will predict.
sentiments_dict = {'positive': 0, 'neutral': 1, 'negative': 2}
dataset = []
with open('YouTubeCommentsEmbeddings.jsonl', 'r') as f:
  for line in f:
    data = json.loads(line)
    embeddings = torch.tensor(data['embeddings'])
    sentiment = torch.tensor(sentiments_dict[data['sentiment']])
    dataset.append((embeddings, sentiment))
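Before training, it can be useful to check how the labels are distributed, since the share of the most common class is the accuracy a model would get by always predicting that class. The sketch below is an optional sanity check, not part of the original code: label_counts() and majority_baseline() are hypothetical helpers that only read the sentiment field of each line.

```python
import json
from collections import Counter


def label_counts(jsonl_path):
    """Count how many rows carry each sentiment label."""
    counts = Counter()
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            counts[json.loads(line)["sentiment"]] += 1
    return counts


def majority_baseline(counts):
    """Accuracy obtained by always predicting the most common label."""
    return max(counts.values()) / sum(counts.values())
```

Any label that shows up in the counts but is missing from sentiments_dict would raise a KeyError in the loading code above, so this check also helps catch unexpected values early.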

Splitting into training and validation datasets

With the initial dataset loaded, a good practice is to split it into a training dataset and a validation dataset. The first is used to train the model, while the second is used to check the model's accuracy on data it hasn't seen, and to verify that it isn't simply memorizing the training dataset, which would lead to poor performance in the real world.

num_samples = len(dataset)
num_validation = int(0.2 * num_samples)
shuffled_indices = torch.randperm(num_samples)
train_indices = shuffled_indices[:-num_validation]
validation_indices = shuffled_indices[-num_validation:]

training_dataset = [dataset[i] for i in train_indices]
validation_dataset = [dataset[i] for i in validation_indices]

train_loader = torch.utils.data.DataLoader(training_dataset, batch_size=64, shuffle=True)
validation_loader = torch.utils.data.DataLoader(validation_dataset, batch_size=64, shuffle=False)

Data loaders are also created, with a batch size of 64. The training loader has the shuffle parameter set to True, ensuring that the order of the inputs is different in each epoch.
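To make the numbers concrete, the arithmetic behind the split can be worked out by hand. This is only a sketch: it assumes all 17872 comments were successfully embedded, and it relies on the DataLoader keeping the last, smaller batch, which is its default behavior.

```python
import math

num_samples = 17872  # assumption: every comment in the dataset was embedded
batch_size = 64

# Same 80/20 split as in the code above.
num_validation = int(0.2 * num_samples)
num_training = num_samples - num_validation

# The last batch of each loader may be smaller than batch_size, hence ceil().
training_batches = math.ceil(num_training / batch_size)
validation_batches = math.ceil(num_validation / batch_size)

print(num_training, num_validation)          # 14298 3574
print(training_batches, validation_batches)  # 224 56
```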

The training loop

With the training and validation sets ready, it's now time to train the model. The training loop is fairly standard for training neural networks: the output is calculated by invoking the model with model(x_train), and the loss is calculated from the predicted values and the expected results with loss_fn(y_predicted, y_train). Then, any remaining gradients are cleared with optimizer.zero_grad(), new gradients are generated with loss.backward(), and the model weights are updated with optimizer.step().

def training_loop(n_epochs, model, optimizer, loss_fn, training_loader, validation_loader):
  for epoch in range(1, n_epochs + 1):
    for x_train, y_train in training_loader:
      y_predicted = model(x_train)
      loss = loss_fn(y_predicted, y_train)

      optimizer.zero_grad()
      loss.backward()
      optimizer.step()

    # Disable grad for calculating validation metrics, since backpropagation
    # is not needed and this should improve performance.
    with torch.no_grad():
      correct = 0
      total = 0
      for x_val, y_val in validation_loader:
        outputs = model(x_val)
        # The predicted class is the index of the highest score.
        _, predicted = torch.max(outputs, dim=-1)
        correct += int((predicted == y_val).sum())
        total += x_val.shape[0]
      # The loss printed here is the one from the last training batch of the epoch.
      print('Epoch: %d, Loss: %f, Accuracy: %f' % (epoch, float(loss), correct / total))

After each epoch, the training loop calculates and prints the accuracy using the validation dataset.
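For readers who want to see what the loss and the accuracy actually compute, here is a minimal pure-Python sketch of both, for illustration only. cross_entropy() and accuracy() are hypothetical helpers, not PyTorch functions. nn.CrossEntropyLoss applies the same per-example formula to the raw model outputs and, by default, averages it over the batch.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy loss for one example: -log(softmax(logits)[target])."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[target]


def accuracy(batch_logits, targets):
    """Fraction of examples whose highest score is at the expected index."""
    correct = sum(
        1
        for logits, target in zip(batch_logits, targets)
        if max(range(len(logits)), key=logits.__getitem__) == target
    )
    return correct / len(targets)
```

With three equal scores, such as cross_entropy([0.0, 0.0, 0.0], 1), the loss is log(3), which is what a model that has learned nothing would score. The loss shrinks as the score of the expected class grows relative to the others.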

The model, optimizer and hyperparameters

The model has an input size of 768, which is the size of the embeddings array created by the Google AI embeddings API, and an output size of 3, one for each possible class. The hidden layer size is 512. Stochastic Gradient Descent (SGD) is used as the optimizer, and Cross Entropy as the loss function.

        
from collections import OrderedDict

from torch import nn, optim

seq_model = nn.Sequential(OrderedDict([
    ('hidden_linear_0', nn.Linear(768, 512)),
    ('hidden_activation_0', nn.ReLU()),
    ('output_linear', nn.Linear(512, 3))
]))

optimizer = optim.SGD(seq_model.parameters(), lr=1e-2)

training_loop(
    n_epochs = 200,
    model = seq_model,
    optimizer = optimizer,
    loss_fn = nn.CrossEntropyLoss(),
    training_loader = train_loader,
    validation_loader = validation_loader,
)
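Since the model is meant to be downloaded by a browser, its size matters. The following back-of-the-envelope calculation counts the parameters of the two linear layers. It assumes the weights are stored as 4-byte float32 values, so the exported file should be roughly this size plus some format overhead.

```python
layers = [(768, 512), (512, 3)]  # (inputs, outputs) of each nn.Linear layer

# Each linear layer stores a weight matrix plus one bias per output.
parameters = sum(n_in * n_out + n_out for n_in, n_out in layers)
size_mb = parameters * 4 / 1_000_000  # assumption: 4 bytes per float32 weight

print(parameters)         # 395267
print(round(size_mb, 2))  # 1.58
```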

Training results

The accuracy on the validation set stabilizes around 83% in fewer than 200 epochs (full passes over the training dataset)! While there's probably a lot of room for improvement, this seems to be at the top end of the existing notebooks for this dataset shared on Kaggle, which range between 65% and 85%.

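Once the training is finished, the weights can be easily saved with torch.save(). Note that, despite the .safetensors extension used here, torch.save() writes PyTorch's own pickle-based checkpoint format rather than the safetensors format, which is why the file is later read back with torch.load():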

torch.save(seq_model.state_dict(), 'ytsentiment.safetensors')

Running the model on the web

The model is now trained and the weights saved into ytsentiment.safetensors. Unfortunately, this format cannot be run directly on the web. But not all is lost - there are formats that are web friendly, and one of them is the TFLite format, used by LiteRT.

Note: LiteRT is the new name for TensorFlow Lite. At the time of writing, the LiteRT documentation doesn't mention web libraries. However, the TensorFlow Lite library is still available on NPM and can handle the TFLite format.

Converting the PyTorch model to TFLite

The LiteRT team provides a library, ai-edge-torch, that makes converting PyTorch models to TFLite straightforward. The process consists of creating an instance of the same model used for training and loading the previously trained weights into it, then calling ai_edge_torch.convert(), passing the model and a tuple with a sample input so the library can trace the model. Finally, the converted model is saved to disk with edge_model.export().

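from collections import OrderedDict

import ai_edge_torch
import torch
from torch import nn

# Recreate the exact architecture used for training, then load the weights.
model = nn.Sequential(OrderedDict([
    ('hidden_linear_0', nn.Linear(768, 512)),
    ('hidden_activation_0', nn.ReLU()),
    ('output_linear', nn.Linear(512, 3))
]))
model.load_state_dict(torch.load("ytsentiment.safetensors", weights_only=True))

# The sample input only needs the right shape: one embedding of size 768.
sample_inputs = (torch.randn(1, 768),)
edge_model = ai_edge_torch.convert(model.eval(), sample_inputs)
edge_model.export('ytsentiment.tflite')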

Running the converted model in the browser

The model is now compatible with the TensorFlow Lite library: it can be loaded with tflite.loadTFLiteModel(), and inference is executed with model.predict().

We'll use "Hello, your video is amazing!" as an example input. Because the model was trained on embeddings, rather than on text, the input first needs to be converted into embeddings, using the same embedding model as before, but this time via the Google Gen AI JavaScript library:

import { GoogleGenAI } from '@google/genai';

const genAi = new GoogleGenAI({ apiKey: 'YOUR API KEY HERE' });

const sampleInput = "Hello, your video is amazing!";
const embedResult = await genAi.models.embedContent({
    model: 'text-embedding-004',
    contents: [sampleInput],
});
// One array of embedding values per input.
const embeddings = embedResult.embeddings.map(embedding => embedding.values);

The embeddings are generated with models.embedContent(), which allows batching the generation of embeddings by providing an array of inputs to contents. The result object contains a list of embedding results, one for each input provided. The values for each embedding can be accessed with embedding.values. The result is then mapped into a 2D array: the first level holds one entry per input, and the second holds the embedding values themselves.

Since the model needs a tensor to run inference instead of a JavaScript array, tf.tensor2d() is used to convert the 2D JavaScript array into a 2D tensor, which is then passed to the model when calling model.predict():

const embeddingTensor = tf.tensor2d(embeddings);
const outputTensor = await model.predict(embeddingTensor);

The output of the prediction is another 2D tensor, containing one element for each input, each of which is an array with the logits, the scores given by the model for each possible class. The tensor can be converted to a regular JavaScript array by calling .array():

console.log(JSON.stringify(await outputTensor.array()));
// Outputs "[[4.526814937591553,0.1881929636001587,-4.69814395904541]]"

As seen in the output, the array contains only one item in the first level, matching the number of inputs passed to the model, and 3 items on the second level, which are the scores for each class: the 1st item is the score for positive, the 2nd for neutral, and the 3rd for negative. The class with the highest score is the most likely one, according to the model.

A neat trick to convert the array of results into an array of classes is to use the tf.argMax() function, which transforms the array of scores into an array with the index of the highest score for each input:

const labels = ['Positive', 'Neutral', 'Negative'];
const argmax = await tf.argMax(outputTensor, 1).array();
const results = argmax.map(i => labels[i]);
console.log(results);
// Outputs "['Positive']"

Viewing results as probabilities

While the logits are enough to find the most likely class, developers may want to view those scores as probabilities, which the raw numbers above clearly are not: they don't fall between 0 and 1 and don't add up to 1. This can be solved by feeding the model output into the tf.softmax() function:

console.log(await tf.softmax(outputTensor, 1).array());
// Outputs "[[0.9870176911354065, 0.012885026633739471, 0.00009726943972054869]]"

The model gave a probability of 98.7% that "Hello, your video is amazing!" is a Positive comment. Seems to check out.
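The same numbers can be double-checked offline, away from the browser. The sketch below is a plain-Python reimplementation of softmax and argmax, for illustration only, applied to the logits printed earlier. The softmax() helper is hypothetical and is not the TensorFlow.js function.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that add up to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


labels = ["Positive", "Neutral", "Negative"]
# Logits copied from the model output shown earlier in the article.
logits = [4.526814937591553, 0.1881929636001587, -4.69814395904541]

probabilities = softmax(logits)
# argmax: index of the highest score.
best = max(range(len(logits)), key=logits.__getitem__)
print(labels[best], round(probabilities[best], 3))  # Positive 0.987
```

Because softmax preserves the order of the scores, taking the argmax of the logits or of the probabilities picks the same class.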

Putting it all together

This is the code for the JavaScript inference all together:

    const input = 'Hello, your video is amazing!';
    const embedResult = await genAi.models.embedContent({
        model: 'text-embedding-004',
        contents: [input],
    });
    const embeddings = embedResult.embeddings.map(embedding => embedding.values);
    const outputTensor = await model.predict(tf.tensor2d(embeddings));
    const argmax = await tf.argMax(outputTensor, 1).array();
    const labels = ['Positive', 'Neutral', 'Negative'];
    const results = argmax.map(i => labels[i]);
    console.log(results);

    const probabilities = await tf.softmax(outputTensor, 1).array();
    console.log(probabilities);

Conclusion

Training a custom model, and doing it well, requires specialized knowledge, which may not be worth the effort for developers who want to focus on web development. At the same time, understanding how models work and, more importantly, how to adapt them to the web can be a powerful tool for AI developers who want to get more usage out of their models, or for web developers who want to take advantage of off-the-shelf models that are not immediately available on the web.

bandarra.me