Unlock Real-Time AI Conversation Co-Pilot for Your Phone

Build a powerful tool that transcribes and analyzes conversations in real time, providing instant suggestions and feedback to improve communication. Enhance your interviews, meetings, and social interactions with this AI-powered assistant.

July 14, 2024

This blog post explores the potential of a real-time AI conversation co-pilot that can assist with tasks like job interviews and user research. The author showcases the development of a web and mobile application that leverages advanced speech-to-text and language models to provide instant transcription and suggestion capabilities, highlighting the benefits of such a tool in enhancing communication and productivity.

Intro to Real-time AI Conversation Co-pilot

Almost a year ago, around March 2023, when ChatGPT had just come out and was the hottest topic in the world, I clearly remember seeing a demo from Arony, who built "Interview Breaker", a ChatGPT tool that helps you crack job interviews. He described it like this: "This week, I built something called the 'Interview Breaker', a proof of concept made with ChatGPT for cracking job interviews. It takes your previous experience, listens in on your conversation with your interviewer, and tells you what to say, filling you in on things you might not know."

In the demo, the tool suggests an answer such as "As a senior architect, when prioritizing what to focus on for a backend service, I prioritize scalability." The demo also makes the point that this sort of tool is going to wreak havoc on the job interview process. Usually, when big technologies like computers or the internet emerge, they change all of the processes that came before them. That means some of these questions might not make sense to ask anymore if we're looking very far into the future.

I thought that was a fantastic idea because, back then, I was going through some job interviews myself, so I would have loved a real-time tool to help me crack them. I did try to build a prototype that used a speech-to-text model to generate the transcript and a large language model to generate answers, but it never worked well in real life. One hard requirement for a real-time interview or conversation co-pilot is low latency. If it takes 30-40 seconds to generate a result, it is not going to work. Unfortunately, that was the case back in March last year, as both the speech-to-text model and the large language model took a long time to run inference. In theory this was a simple project, but in practice it was very hard to turn into a usable product.

However, a couple of months later, I saw another product showcasing a similar scenario with close to real-time performance. Its demo featured an interview question like: "In aerospace engineering, such as in jet engines or spacecraft re-entry, how do you approach these challenges?"

Challenges in Building Real-time Transcript and Fast Inference

There are two key components to building a real-time conversation companion: real-time transcription and fast inference.

Real-time Transcript

Achieving real-time transcription is one of the biggest challenges. Typical speech-to-text models like Whisper are not designed for streaming scenarios, where the audio is processed in small chunks rather than the entire recording.

To overcome this, a common solution is to create a recurring loop that continuously captures small audio chunks (e.g., every 2-5 seconds), sends them to the speech-to-text model, and stitches the small transcripts together. This approach requires optimizations to ensure accuracy, such as comparing timestamps of connecting words to enhance the final transcript.
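The stitching step in that loop can be sketched as follows. This is a minimal sketch, not the implementation used in the demo: it assumes consecutive audio chunks overlap slightly, and it merges on repeated words at the seam rather than on word timestamps, which is a simpler stand-in for the timestamp comparison described above. The names `stitch`, `transcribe_stream`, and the `transcribe` callable are hypothetical.

```python
import re
from typing import Callable, Iterable


def _norm(word: str) -> str:
    """Lowercase a word and drop punctuation so 'Hello,' matches 'hello'."""
    return re.sub(r"[^\w']", "", word).lower()


def stitch(transcript: str, chunk_text: str, max_overlap: int = 8) -> str:
    """Append chunk_text to transcript, dropping words repeated at the seam."""
    old_words = transcript.split()
    new_words = chunk_text.split()
    limit = min(max_overlap, len(old_words), len(new_words))
    # Try the longest candidate overlap first, then shrink it.
    for size in range(limit, 0, -1):
        tail = [_norm(w) for w in old_words[-size:]]
        head = [_norm(w) for w in new_words[:size]]
        if tail == head:
            new_words = new_words[size:]
            break
    return " ".join(old_words + new_words)


def transcribe_stream(chunks: Iterable[bytes],
                      transcribe: Callable[[bytes], str]) -> str:
    """Run the recurring loop: transcribe each chunk and stitch the results.

    `transcribe` is a hypothetical stand-in for the speech-to-text call.
    """
    transcript = ""
    for audio in chunks:
        transcript = stitch(transcript, transcribe(audio))
    return transcript
```

For example, `stitch("tell me about your", "about your last project")` returns `"tell me about your last project"`. Word overlap can misfire when a speaker genuinely repeats a phrase, which is one reason timestamps are the more robust signal.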

Fortunately, speech-to-text technology has evolved rapidly, and there are now solutions that enable real-time transcription, such as using incredibly fast models hosted on platforms like Replicate or deploying lightweight models like WhisperKit directly on mobile devices.

Fast Inference

The second challenge is achieving very fast inference with the large language model to generate suggestions in real-time. To address this:

  1. Choose a fast and small language model: Models like Mistral 7B are much smaller and faster than GPT-4, allowing for quicker response generation with fewer computing resources.

  2. Reduce input size: As the conversation gets longer, the input to the language model can become too large. Techniques like language model summarization can be used to extract only the relevant information and reduce the input size.

  3. Optimize output generation: The output token count can be reduced further, for example by using prompt engineering to ask the model for short, concise answers.
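Points 2 and 3 can be combined in the prompt-building step. The sketch below is one possible shape under stated assumptions: older parts of the conversation are replaced by a summary, only the most recent words are passed verbatim, and the prompt asks for a short answer. The function names, the 120-word window, and the `summarize` callable (a stand-in for a language model summarization call) are all hypothetical.

```python
from typing import Callable, Tuple


def split_history(transcript: str, max_recent_words: int = 120) -> Tuple[str, str]:
    """Split a transcript into (older text, most recent words)."""
    if max_recent_words <= 0:
        raise ValueError("max_recent_words must be positive")
    words = transcript.split()
    if len(words) <= max_recent_words:
        return "", " ".join(words)
    older = " ".join(words[:-max_recent_words])
    recent = " ".join(words[-max_recent_words:])
    return older, recent


def build_prompt(context: str, transcript: str,
                 summarize: Callable[[str], str],
                 max_recent_words: int = 120) -> str:
    """Build a compact prompt: context, summary of older talk, recent words."""
    older, recent = split_history(transcript, max_recent_words)
    parts = [f"Context: {context}"]
    if older:
        # Only pay for a summarization call once the transcript is long.
        parts.append(f"Summary of earlier conversation: {summarize(older)}")
    parts.append(f"Most recent transcript: {recent}")
    # Asking for a short answer keeps the output token count down.
    parts.append("Reply with at most three short bullet points.")
    return "\n".join(parts)
```

In a real app the summary would likely be cached and refreshed every few chunks rather than recomputed on every request, since the summarization call itself adds latency.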

By combining these techniques for real-time transcript and fast inference, it is possible to build a highly responsive real-time conversation companion that can provide valuable suggestions and support during conversations.

Building a Web-based Conversation Co-pilot Demo

To build a web-based conversation co-pilot demo, we'll use a combination of Flask (a Python web framework) and Replicate (a platform for running open-source AI models).

The key components are:

  1. Real-time Transcript: We'll use a fast speech-to-text model from Replicate to generate a real-time transcript of the conversation. This involves continuously capturing small audio chunks, sending them to the speech-to-text model, and stitching the results together.

  2. Fast Inference: We'll use a small, fast language model from Replicate (like Mistral) to generate suggestions and answers based on the transcript in real time. We'll also explore techniques like reducing the input size and summarizing the conversation to improve the speed.

The web app will have the following features:

  • A text input for the user to provide context about the conversation.
  • A "Record" button to start and stop the audio recording.
  • A "Get Suggestion" button to trigger the language model and get suggestions.
  • A real-time display of the transcript.
  • A display of the generated suggestions.

Here's the step-by-step process:

  1. Set up the Flask app:

    • Create the app.py file and import the necessary libraries, including the Replicate Python SDK.
    • Define the Flask routes for the index page and the audio processing endpoint.
    • Set up the AWS S3 bucket and credentials for temporarily storing the audio recordings.
  2. Implement the real-time transcript functionality:

    • Continuously capture audio chunks and transcribe them with a Whisper model hosted on Replicate.
    • Optimize the transcript by handling word boundaries and maintaining context between chunks.
  3. Implement the fast inference functionality:

    • Use the Mistral model on Replicate (or a similar small, fast language model) to generate suggestions based on the full transcript.
    • Explore techniques like reducing the input size and summarizing the conversation to improve the inference speed.
  4. Build the front-end with HTML and JavaScript:

    • Create the index.html file in the templates folder.
    • Define the HTML structure with the text input, record button, and suggestion display.
    • Implement the JavaScript logic to handle the recording, audio upload, and API calls to the Flask backend.
  5. Test and deploy the web app:

    • Run the Flask app locally and test the functionality.
    • Deploy the app to a hosting platform (e.g., Heroku, AWS, or your own server).
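The backend logic behind steps 2 and 3 can be sketched independently of Flask. This is an illustrative outline rather than the app's actual code: the `ConversationSession` class and its method names are hypothetical, the speech-to-text and language model calls are injected as plain callables so the logic can run without network access, and the S3 upload is left out. In the Flask app, the audio processing endpoint would call `add_chunk` and the "Get Suggestion" button would end up calling `get_suggestion`.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ConversationSession:
    """Holds the state one browser session needs on the server side."""

    transcribe: Callable[[bytes], str]   # stand-in for the speech-to-text call
    suggest: Callable[[str], str]        # stand-in for the language model call
    context: str = ""                    # text the user typed into the input box
    pieces: List[str] = field(default_factory=list)

    @property
    def transcript(self) -> str:
        """The transcript so far, built from all non-empty chunk results."""
        return " ".join(self.pieces)

    def add_chunk(self, audio: bytes) -> str:
        """Transcribe one audio chunk and return the updated transcript."""
        text = self.transcribe(audio).strip()
        if text:  # silent chunks produce no text and are skipped
            self.pieces.append(text)
        return self.transcript

    def get_suggestion(self) -> str:
        """Send the context and transcript to the language model."""
        prompt = (
            f"Context: {self.context}\n"
            f"Transcript: {self.transcript}\n"
            "Suggest what to say next."
        )
        return self.suggest(prompt)
```

Keeping the model calls behind callables also makes it easy to test the flow locally with fake functions before wiring in the real Replicate calls.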

By following these steps, you'll be able to build a web-based conversation co-pilot demo that can listen to conversations, generate real-time transcripts, and provide suggestions based on the context.

Leveraging WhisperKit for a Mobile Conversation Co-pilot

After seeing the impressive demo of the web-based conversation co-pilot, I decided to explore the potential of building a mobile version using the open-source WhisperKit framework. WhisperKit provides a Swift package that allows the Whisper speech-to-text model to be deployed directly on iOS devices, enabling real-time transcription with minimal latency.

To get started, I cloned the WhisperKit GitHub repository and opened the example project in Xcode. The project includes a WhisperAX folder, which contains the source code for a sample iOS app that demonstrates the use of WhisperKit.

In the ContentView.swift file, I first defined a few additional state variables to handle the prompt input and the API response summary from the large language model. I then added an input field for the user to customize the prompt, which will be used to provide context to the large language model.

Next, I implemented the getSuggestion() function, which is responsible for sending the transcript and prompt to the Replicate API to generate a response from the Mistral language model. This function handles the streaming nature of the Replicate API, continuously checking the status until the response is complete and then updating the API_response_summary state variable with the generated suggestion.
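The status-checking loop has the same shape in any language. The sketch below is written in Python to match the web backend rather than in Swift, and it is only an outline under assumptions: `fetch` is a hypothetical callable that returns the current prediction as a dictionary, the status names ("succeeded", "failed", "canceled") reflect my understanding of Replicate's prediction object and should be checked against its documentation, and the output is assumed to arrive either as a string or as a list of text fragments.

```python
import time
from typing import Any, Callable, Dict


def wait_for_prediction(fetch: Callable[[], Dict[str, Any]],
                        interval: float = 0.5,
                        timeout: float = 30.0,
                        sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll `fetch` until the prediction finishes, then return its text."""
    deadline = time.monotonic() + timeout
    while True:
        prediction = fetch()
        status = prediction.get("status")
        if status == "succeeded":
            output = prediction.get("output") or []
            # Assumed: output is a string or a list of text fragments.
            return output if isinstance(output, str) else "".join(output)
        if status in ("failed", "canceled"):
            raise RuntimeError(f"prediction ended with status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError("prediction did not finish in time")
        sleep(interval)  # wait before checking the status again
```

A short polling interval keeps the suggestion feeling responsive, while the timeout prevents the UI from waiting forever if a prediction stalls.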

Finally, I added a "Get Suggestion" button that triggers the getSuggestion() function, and displayed the API response summary below the real-time transcript.

The resulting iOS app allows users to start a conversation, see the transcript in real time, and receive instant suggestions from the large language model to help guide the conversation. The use of WhisperKit for the speech-to-text functionality, combined with the integration of the Replicate API, provides a seamless and responsive conversation co-pilot experience directly on the user's mobile device.

This approach unlocks new possibilities for real-time, context-aware conversational assistance, empowering users with intelligent support during important discussions, interviews, and social interactions. By leveraging the latest advancements in speech recognition and large language models, the mobile conversation co-pilot can become a valuable tool for improving communication and productivity.

I'm excited to continue refining and polishing this mobile conversation co-pilot app, and I look forward to sharing it with the community once it's ready for release. Please let me know if you have any interest in trying out the app or providing feedback on its development.

Conclusion

In conclusion, the development of a real-time conversation co-pilot is a complex task that requires addressing several key challenges. The primary challenges include:

  1. Real-Time Transcript Generation: Achieving low-latency, accurate speech-to-text transcription is crucial for providing real-time feedback. Techniques like using a streaming speech recognition model and carefully stitching together the transcripts of consecutive audio chunks are essential.

  2. Fast Large Language Model Inference: Generating relevant suggestions and responses quickly requires using smaller, specialized language models that can provide fast inference times. Techniques like reducing input token size and summarizing the conversation history can help improve performance.

  3. Seamless Integration: Combining the real-time transcript generation and the large language model inference into a cohesive, user-friendly application is crucial for providing a smooth and effective experience.

The demonstration showcased how these challenges can be addressed using a combination of technologies, including the Whisper speech-to-text model, the Mistral language model, and the Replicate platform for easy deployment. The resulting web and mobile applications provide real-time transcription and suggestion generation, showcasing the potential of this technology to enhance various conversational scenarios, such as job interviews, user research interviews, and social interactions.

Overall, the development of a real-time conversation co-pilot is a promising area of research and development, with the potential to significantly improve the quality and effectiveness of human-to-human communication.
