OpenAI Whisper API: Quick Guide

OpenAI recently announced the Whisper API alongside the paradigm-shifting ChatGPT API.

Whisper API’s enormous capabilities have been somewhat overshadowed by ChatGPT; however, this API will see plenty of AI use cases on its own as well as combined with other models and modalities.

In this tutorial, we will explain OpenAI’s Whisper and how it can best be harnessed through the API using the popular programming language Python and its libraries. We will work through a few Whisper API examples and demonstrate its Transcriptions and Translations API endpoints with intuitive use cases of this powerful AI technology that’s available right now.

1- OpenAI Whisper API: Quick Guide

It’s always exciting to see advancements in the world of artificial intelligence, and the introduction of GPT-4 is surely a monumental milestone in the history of AI. With its ability to process image inputs, generate more dependable answers, produce socially responsible outputs, and handle queries more efficiently, this model looks set to revolutionize the field of natural language processing.

Robust Speech Recognition via Large-Scale Weak Supervision is the research paper, authored by AI scientists Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever, that introduced OpenAI Whisper.

Additionally, Whisper is an open-source model which can be analyzed, downloaded, built upon, and used for FREE.

As a Python programmer, I’m eager to see how this latest development will further enhance our ability to interact with machines in a meaningful and efficient way.

 

Whisper’s Transformer Architecture Explained

A speech-to-text application built on a transformer architecture converts recorded speech into text using a deep learning model. The transformer architecture is structured in layers, and each layer has a unique function in processing the speech input for transcription.
 

On the 4th page of the research paper, OpenAI’s Whisper model, which builds on the Transformer architecture from Google Research, is explained simply and elegantly. As usual in transformer AI applications, we can clearly see the following layers:

  • Input layer: In speech-to-text applications based on transformers, the input layer consists of audio containing speech.
  • Encoder layers: In a nutshell, the stack of encoder blocks, which uses sinusoidal positional encodings, analyzes audio in 30-second chunks and extracts position-aware information about the speech within the realm of human language structures.

    The input layer reads the speech signal as a series of data points and converts it to a sequence of vectors, which are fed into the first encoder layer. In this layer, the input vectors are processed and transformed into a set of feature maps, which contain the most relevant information present in the speech input.

    The subsequent encoder layers repeat this process, with each layer building on the previous one to extract further information from the speech input, until the final encoder layer produces the representation that the decoder works from.
  • Decoder layers: Decoder layers play a role in the accurate conversion of audio data to tokens. After the encoder layers, the decoder layers come into play. Here, the decoder attends to the representations produced by the encoder (and to the tokens it has generated so far) and generates the transcription of the speech.
  • Output layer: In Whisper’s case, the output layer produces tokens, which correspond to logical word and punctuation structures. You can familiarize yourself with OpenAI models’ token calculations.
Whisper's transformer architecture explained (Radford et al. page 4)
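To make this flow concrete, here is a tiny conceptual sketch. It is NOT the real Whisper model, only placeholder functions; the constants (16 kHz audio, 30-second chunks, 80-channel log-Mel spectrograms at roughly 100 frames per second) come from the paper, and everything else is purely illustrative.

# Conceptual data-flow sketch of Whisper's encoder-decoder pipeline (placeholders only)
import numpy as np

SAMPLE_RATE = 16_000      # Whisper resamples audio to 16 kHz
CHUNK_SECONDS = 30        # audio is processed in 30-second windows
N_MELS = 80               # 80-channel log-Mel spectrogram

def preprocess(audio):
    """Input layer stand-in: raw samples -> (frames, n_mels) spectrogram features."""
    frames = CHUNK_SECONDS * 100           # ~100 spectrogram frames per second
    return np.zeros((frames, N_MELS))

def encode(features):
    """Encoder stand-in: spectrogram frames -> contextual audio representations."""
    return features

def decode(encoded):
    """Decoder stand-in: autoregressively emits text tokens while attending to
    the encoder output; here we just return an illustrative token stream."""
    return ["<|transcribe|>", "I'm", "a", "cow"]

audio_chunk = np.zeros(SAMPLE_RATE * CHUNK_SECONDS)   # 30 seconds of silence
print(decode(encode(preprocess(audio_chunk))))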

AI development using OpenAI’s Whisper model and Python is rather straightforward. The model can be accessed through API calls to its API endpoints: Transcriptions & Translations.

Let’s get to the practical section of our tutorial and create a few simplistic Python scripts for demonstration and inspiration purposes. You can refer to Holypython’s Learn OpenAI’s Official ChatGPT API : Comprehensive Developer Guide (AI 2023) article.

Whisper API Python Example

Let’s work on the “I’m a cow” song. This funny video about cows has 23M views on YouTube and is published under a Creative Commons license. It can be downloaded to test our model’s speech-to-text capabilities. Use this Download Youtube Videos with Python tutorial or any other tool you’d like.

Also, the song’s lyrics are pronounced in a unique way that might be hard to understand, so it’s a good first test for our model.

First things first, we can address two formalities so we can get to the heart of the Python tutorial. Those formalities are importing the necessary libraries and taking care of authorization.

We will use the openai library, so it needs to be installed (if not installed yet) and imported.

  • It can be installed as: pip install openai (might differ based on your OS setup)
  • It can be imported as: import openai (see below)

For the authorization bit we will use openai’s api_key attribute. It entails simply setting openai.api_key to the key obtained from your platform.openai.com account. (There is a menu called something like View OpenAI API Keys.)

import openai

# Authenticate with the secret key from your platform.openai.com account
OPENAI_API_KEY = 'YOUR_OPENAI_KEY'
openai.api_key = OPENAI_API_KEY

Once you’re done with the openai library and openai.api_key steps, you can conveniently use the .transcribe() method from the Audio class of the openai library.

openai.Audio.transcribe(model="whisper-1", file=file)

We will pass two required parameters to the .transcribe() method: the model name and the opened audio file.

# Open the audio file in binary ("rb") mode and send it to the Whisper API
file = open("home/moomoo.mp3", "rb")
transcript = openai.Audio.transcribe(model="whisper-1", file=file)

print(transcript)

In the Python code above we are utilizing the openai.Audio.transcribe() method with the required parameters model and file. Below you can see a list of parameters that can be used with Whisper API.

Having a good grasp of these parameters is important for tuning the model’s output and creating the best possible user experiences. But first let’s check out the results of our simplistic example.

Here is the response.

Whisper API Output:

{
"text": "I’m a cow, I’m a cow, look at me, look at me I’m a cow, I’m a cow, look at me, look
me, cause they need me Farmers milk me, and they feed me People eat me, children love me, alth
meat, I have big thighs And wow, I am so fat A few months later… Sooooo… I’m a cow, I’m a
}

The result is a JSON object, which is very similar to a Python dictionary with a few differences. The key “text” is where the transcription of our audio file is stored.
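For example, assuming the transcript object returned by the call above, the transcription string can be pulled out directly:

# The response behaves like a dictionary, so the transcription
# can be read from the "text" key.
print(transcript["text"])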
 
After a quick check, the transcription looks 100% accurate. We can conclude that the Whisper API passes the first quality checks.

Whisper API Parameters

file

The audio file to be transcribed or translated, opened from its path for reading. Supported file formats are mp3, mp4, mpeg, mpga, m4a, wav, and webm. File is a required parameter.

model

The name of the AI model that’ll be used. Currently, the only available speech-to-text model from OpenAI is whisper-1. Model is a required parameter.

prompt

Prompt is guidance that can be provided to the AI model as text. This creates an opportunity to influence the model’s style and steer it away from certain behaviors. It can be very useful for specialized applications.

response_format

This is the response format of the model. It is json by default. Other available formats are text, srt, verbose_json, and vtt; srt and vtt are convenient for subtitle workflows.

temperature

Used for narrowing down the focus of the AI model and making it more deterministic, or the opposite. It takes values between 0 and 1. The default value is 0; values near 0 result in highly focused output, while values near 1 introduce randomness into the results.

language

Used for providing the language of the input audio (as an ISO-639-1 code such as en), which improves both performance (reduced latency) and accuracy (reduced error rate).
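To illustrate how these optional parameters fit together, here is a minimal sketch built on the same openai.Audio.transcribe() call used earlier; the file path and the prompt text are just placeholders.

import openai

openai.api_key = "YOUR_OPENAI_KEY"

# Placeholder audio file; replace with your own path.
with open("home/moomoo.mp3", "rb") as audio_file:
    transcript = openai.Audio.transcribe(
        model="whisper-1",
        file=audio_file,
        prompt="Lyrics of a humorous song about a cow.",  # optional guidance text
        response_format="verbose_json",                    # richer JSON with segment details
        temperature=0,                                     # keep the output deterministic
        language="en",                                     # ISO-639-1 code of the spoken language
    )

print(transcript["text"])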

Whisper was announced by Sam Altman toward the end of 2022. Speech-to-text will be a crucial step in the near-future advancements of multimodal AI, and possibly on the way to Artificial General Intelligence, as it provides an auditory bridge between AI and human speech. We have an interesting article if you are curious about the history of speech-to-text technology.

In simpler terms, with this technology you can talk to AI, or you can make AI listen to other products of human speech such as conversations, podcasts, debates, commercials, historical speeches, music videos, news and movies.

Open Sourced Whisper vs. OpenAI Whisper API

I think it’s important to make a clear distinction between the two different ways Whisper can be implemented in projects.

Open-source Whisper refers to the trained Whisper models that OpenAI has also published as open source. OpenAI doesn’t do this for all of its models, so it’s a great privilege to be able to actually download the Whisper models.

That being said, downloading AI models, creating special environments and dealing with the implications of hosting the models by oneself isn’t the ideal route for every developer. It can be less practical and even more costly due to resources, cybersecurity and professional supervision required.

That’s where Whisper’s API endpoints come in, and this tutorial mostly focuses on implementing that technology. The OpenAI Whisper API is the service through which the Whisper model can be accessed on the go and its powers harnessed for a modest cost ($0.006 per audio minute) without worrying about downloading and hosting the models.

It’s good to be aware of the difference in case different model names and features come up.

Whisper vs ChatGPT

Whisper is able to carry out speech-to-text transformations. Although it works perfectly well in solo use cases such as audio and speech transcriptions and translations, it also has the potential to complement AI models with different modalities such as ChatGPT’s GPT-3.5-turbo and GPT-4.

Whisper’s Modality

Whisper is an AI model specially developed for speech-to-text applications.
OpenAI Whisper Modality vs. ChatGPT

Model           Modality
Whisper-1       speech-to-text
GPT-3.5-Turbo   text-to-text only
GPT-4           text-to-text, image-to-text

Whisper Pricing

The Whisper API can be considered on the cheap side. At $0.006 per minute of audio processing, you get slightly more than 2.5 hours of audio processing for $1 US ($1 / $0.006 ≈ 167 minutes, or roughly 2 hours and 47 minutes).

OpenAI Whisper Pricing

Model       Price
Whisper-1   $0.006 per 1-minute audio

Although the current pricing of Whisper is quite competitive as is, I believe there might be a price reduction in the future similar to what happened to the GPT models. GPT-3.5-turbo came with a 10X (or 30X depending on the time frame) price improvement even though it’s a superior model compared to the earlier GPT-3.

Models & API: API Endpoints of Whisper

The whisper-1 model that’s used by the Whisper API works through the Transcriptions or Translations endpoint, depending on the use case. The model name can be assigned as whisper-1, and currently that is the only option through the API. Near the bottom of this page you can see open-source versions of the Whisper model, but they won’t work through the official Whisper API service.

Transcriptions

  • model = whisper-1 (currently the only available model)

To use the transcription endpoint we need to POST to https://api.openai.com/v1/audio/transcriptions. The POST request gets handled automatically when the openai library is used in Python.
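If you ever need to hit the endpoint without the openai library, a rough equivalent with the requests package looks like this (the key and file path are placeholders):

import requests

OPENAI_API_KEY = "YOUR_OPENAI_KEY"

# The endpoint expects a multipart/form-data POST carrying the audio file
# and the model name, authenticated with a Bearer token.
with open("home/moomoo.mp3", "rb") as audio_file:
    response = requests.post(
        "https://api.openai.com/v1/audio/transcriptions",
        headers={"Authorization": f"Bearer {OPENAI_API_KEY}"},
        files={"file": audio_file},
        data={"model": "whisper-1"},
    )

print(response.json())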

Translations

  • model = whisper-1 (currently the only available model)
Similarly, to use the translation endpoint we need to POST to https://api.openai.com/v1/audio/translations. The POST request again gets handled automatically when the openai library is used in Python. See the sketch below.
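Here is a minimal sketch of the translation counterpart using the openai library's Audio.translate() method. The German-language file name is purely hypothetical; Whisper translates speech in any supported source language into English text.

import openai

openai.api_key = "YOUR_OPENAI_KEY"

# Hypothetical non-English audio file; the output is English text.
with open("home/interview_de.mp3", "rb") as audio_file:
    translation = openai.Audio.translate(model="whisper-1", file=audio_file)

print(translation["text"])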

Audio File Size: Whisper can handle up to 25 MB at once.

Whisper’s current model, whisper-1, comes with an audio file size limit of 25 MB.

  • whisper-1: 25 MB
  • new-whisper-model: 250 MB file size limit (?) – this is only speculation based on our experiences with the evolution of OpenAI models up to this point. A new Whisper model will most likely introduce a higher file size limit.

Especially given the recency of OpenAI’s Whisper API announcement, we can expect an updated model after some feedback from early adopters in various industries as well as from new startups founded solely on OpenAI’s models.

Whisper API Audio Limits

Model       Limit
Whisper-1   25 MB file size maximum

For the time being, there is a practical solution around this limitation. Developers and API users can split audio files into chunks below 25 MB using open-source Python libraries such as PyDub and its AudioSegment module, as sketched below.
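Here is a minimal sketch of that splitting approach with PyDub. Splitting by time only approximates the 25 MB limit, so the 10-minute chunk length and file names are illustrative assumptions; verify the resulting file sizes yourself (PyDub also needs ffmpeg installed for mp3 support).

from pydub import AudioSegment

# Load the source audio and cut it into fixed-length chunks.
# PyDub slices by milliseconds; 10-minute mp3 chunks usually stay
# well under 25 MB at typical bitrates, but check the output sizes.
audio = AudioSegment.from_file("home/moomoo.mp3")
chunk_length_ms = 10 * 60 * 1000

for i, start in enumerate(range(0, len(audio), chunk_length_ms)):
    chunk = audio[start:start + chunk_length_ms]
    chunk.export(f"chunk_{i}.mp3", format="mp3")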

Training of Whisper:

Whisper's training took 680K hours of diverse audio combinations

In the 680K-hour training of Whisper, its transformer architecture was fed many languages (98+) for transcription and translation purposes.

In addition to varied audio data containing speech, the model was also given music without any speech so it could learn not to produce false-positive output.

Languages supported by Whisper

Afrikaans,
Arabic,
Armenian,
Azerbaijani,
Belarusian,
Bosnian,
Bulgarian,
Catalan,
Chinese,
Croatian,
Czech,
Danish,
Dutch,
English,
Estonian,
Finnish,
French,
Galician,
German,
Greek,
Hebrew,
Hindi,
Hungarian,
Icelandic,
Indonesian,
Italian,
Japanese,
Kannada,
Kazakh,
Korean,
Latvian,
Lithuanian,
Macedonian,
Malay,
Marathi,
Maori,
Nepali,
Norwegian,
Persian,
Polish,
Portuguese,
Romanian,
Russian,
Serbian,
Slovak,
Slovenian,
Spanish,
Swahili,
Swedish,
Tagalog,
Tamil,
Thai,
Turkish,
Ukrainian,
Urdu,
Vietnamese,
Welsh.

3- Special AI Use Cases for Whisper

Now that we have an extremely precise, capable, multi-language speech-to-text model along with all the other capable AI models that are being released, we can expect to see more and more development in the AI field related to audio, music, video and other multimodal AI applications.

1. Transcription Services: Large businesses and legal, medical, and academic institutions require the transcription of documents daily. With speech-to-text applications, businesses can save their valuable time and money by automating the transcription just by speaking to a device or uploading an audio file.

2. Customer Service: With AI-powered speech-to-text software, customer services teams can convert speech into text and analyze it to determine specific customer demand, complaints or feedback to derive meaningful insights and improve customer retention and customer experience.

3. Education: Speech-to-text transcription can benefit students with learning disabilities, who may have difficulty with writing or keyboarding, by converting speech into text in real-time.

4. Voice-Controlled Applications: Several voice-powered applications, such as voice-controlled virtual assistants like Siri or Alexa, require speech-to-text technology to function effectively.

5. Creating Subtitles/Captioning: Speech-to-text technology is widely used to create subtitles and captions for movies, TV shows, online courses, and presentations.

6. Automated Translation Services: With AI-powered speech-to-text software, converting speech into text in one language can be automated and then translated into multiple languages, streamlining the localization process significantly. Speech-to-Text and in general AI will have a tremendous impact on Travel industry. Check out this interesting article about Future of AI in Tourism.

7. Public Safety: Real-time transcription and analysis of public events, police radio communications, and other sources of audio content can be done with speech-to-text applications, aiding public safety efforts.

8. Healthcare: Doctors and nurses can use speech-to-text software while documenting patient information to save time and prevent human errors.

9. Fast and Convenient Data Entry: This technology can help field agents, reporters, and other workers quickly and conveniently enter crucial data into their systems by speaking their notes rather than typing them.

10. Competitive Advantages for Businesses: Speech-to-text applications can analyze the tone, sentiment, and keywords used in customer interactions, advertising, and social media mentions to drive business insights and identify opportunities for improvement.

It is crucial, and very exciting, to stay on top of technological and scientific developments in the field: doing so keeps us inspired, helps us produce relevant IT products and services that make our world a more efficient and better place for everyone, and pushes us to seek answers to unknown questions.

Whisper provides an additional modality in the AI tech stack via speech-to-text

Whisper Open Source

Whisper Open Source Models

Model      Parameter Size   English-only model   Multilingual model
tiny       39M              yes                  yes
base       74M              yes                  yes
small      244M             yes                  yes
medium     769M             yes                  yes
large      1550M            no                   yes
large-v2   1550M            no                   yes
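For comparison, here is a minimal sketch of running one of these open-source checkpoints locally with the whisper Python package (installable as openai-whisper). The file path is a placeholder, and local inference requires ffmpeg plus enough memory for the chosen checkpoint.

import whisper

# Load an open-source checkpoint from the table above and transcribe locally.
model = whisper.load_model("base")
result = model.transcribe("home/moomoo.mp3")
print(result["text"])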

Summary

In this Python Tutorial, we’ve continued to explore the latest advancements in OpenAI’s AI models with a focus on their speech-to-text model, Whisper. The newly announced Whisper API endpoint is an exciting opportunity for AI development that will allow developers like us to integrate cutting edge speech recognition capabilities into projects that solve real world problems.


Whether it’s building voice assistants, transcribing audio files, or creating accessibility tools for those with hearing impairments, the possibilities that Whisper model and its API unlock are truly limitless. We look forward to leveraging this powerful tool to enhance the quality and functionality of our applications. It will be exciting to see the positive impact it’ll have on our visitors and the broader community.

By staying up-to-date with the latest breakthroughs in AI, we can continue to provide the best possible AI tutorials for our users, and help to drive adoption and innovation in the ever-evolving world of programming.