Realtime Speech to Text
Realtime transcription of dialogue.
11-07-2023

CAS Reflection

I often use the speech-to-text feature on my phone to quickly jot down ideas or notes. It's a useful feature that allows me to articulate my thoughts quickly and easily without having to type them out. Curious, I decided to explore how exactly this feature works and maybe even build my own version of it.


I started by researching the topic and found that the process of converting speech to text is called speech recognition, or speech-to-text. Generally, these systems are built using machine learning models trained on large datasets of recorded speech. One fairly well-known model, OpenAI's "Whisper", is used in many speech-to-text applications and services.


For an efficient speech-to-text system, the model must be able to transcribe speech accurately in real time. Generally, the more data a model is trained on and the more parameters it has, the better the transcription quality. However, this comes at a cost: training large models requires a lot of computational power and energy, which can have a significant environmental impact. Consider the pricing of a single model like Whisper [1].

Pricing for Whisper. Image source: OpenAI.

Aside from the usage cost, the environmental cost of training large models is also a concern. A recent study found that the CO2 emissions of training a large model like BLOOM were equivalent to around 60 flights between London and New York [2]. So while speech-to-text systems are incredibly useful, it is important to consider the environmental impact of the models that power them. I didn't want to contribute to that impact, so I decided to handle the speech-to-text conversion on the client side without using a large model.


The pros of this approach are that it is more environmentally friendly and that it allows for real-time transcription without the need for an internet connection. The cons are that the transcription quality may not be as good as a model like Whisper, and it may not work as well in noisy environments. In most modern technology stacks, speech-to-text can be handled natively without fetching a model from the internet. For example, Apple provides a speech recognition framework that can be used to convert speech to text in real time [3].

Apple speech recognition framework. Image source: Apple.

So, if we are able to access the device's native APIs, we can use its speech recognition framework to convert speech to text in real time. This is the approach I took to build my own speech-to-text system: modern web browsers provide a Web Speech API that exposes exactly this functionality [4], which lets us build a speech-to-text system that works in the browser without the need for an internet connection.
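To get a feel for the underlying API before reaching for a library, here is a minimal sketch of using the browser's SpeechRecognition object directly. It assumes a Chromium-based browser, where the constructor is exposed under the webkit prefix, and the variable names are just for illustration.

// Minimal sketch of the raw Web Speech API (not the code I ended up using).
// Chromium-based browsers expose the constructor as webkitSpeechRecognition.
const SpeechRecognitionImpl =
  window.SpeechRecognition || window.webkitSpeechRecognition;

const recognition = new SpeechRecognitionImpl();
recognition.continuous = true;      // keep listening between phrases
recognition.interimResults = true;  // emit partial results as the user speaks
recognition.lang = 'en-US';

recognition.onresult = (event) => {
  // Join everything recognised so far into one running transcript.
  const transcript = Array.from(event.results)
    .map((result) => result[0].transcript)
    .join(' ');
  console.log(transcript);
};

// start() generally needs to be triggered by a user gesture, e.g. a button click.
recognition.start();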


Native JavaScript bindings are accessible through the SpeechRecognition object, which provides methods to start and stop listening for speech and events that deliver the transcribed text. For my project I used the react-speech-recognition library, which wraps the Web Speech API in a React hook so it can be used easily inside a React component. This let me build a browser-based speech-to-text system quickly. The code snippet below shows how to use the useSpeechRecognition hook to build a simple dictaphone.


import React from 'react';
import SpeechRecognition, { useSpeechRecognition } from 'react-speech-recognition';

const Dictaphone = () => {
  // The hook exposes the live transcript plus the current listening state.
  const {
    transcript,
    listening,
    resetTranscript,
    browserSupportsSpeechRecognition
  } = useSpeechRecognition();

  // The Web Speech API is not available in every browser,
  // so fall back gracefully when it is missing.
  if (!browserSupportsSpeechRecognition) {
    return <span>Browser doesn't support speech recognition.</span>;
  }

  return (
    <div>
      <p>Microphone: {listening ? 'on' : 'off'}</p>
      {/* startListening / stopListening control the underlying recogniser */}
      <button onClick={SpeechRecognition.startListening}>Start</button>
      <button onClick={SpeechRecognition.stopListening}>Stop</button>
      <button onClick={resetTranscript}>Reset</button>
      <p>{transcript}</p>
    </div>
  );
};

export default Dictaphone;

I can then use the Dictaphone component in my application. Without having to fetch a model from the internet, the Web Speech API converts speech to text in real time, giving me a speech-to-text system that is more environmentally friendly and works without the need for an internet connection.
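For completeness, here is a minimal sketch of mounting the component in a React 18 entry point. The file name, element id, and build setup are assumptions for illustration rather than details from my actual project.

// Hypothetical entry point (index.js) mounting the Dictaphone component.
import React from 'react';
import ReactDOM from 'react-dom/client';
import Dictaphone from './Dictaphone';

const root = ReactDOM.createRoot(document.getElementById('root'));
root.render(<Dictaphone />);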


In creating this speech-to-text system, I learned about the environmental impact of training large models and the importance of considering that impact in the technology we use. By handling the conversion on the client side without a large model, I was able to build a system that is more environmentally friendly and transcribes in real time without the need for an internet connection. I hope that in the end I made a difference, however small, in reducing the environmental impact of speech-to-text systems and in raising awareness of the environmental cost of the technology we use.

References