6 Comments

How to implement realtime translation for WebRTC Video meeting?

Hi, IHs. I'm building a real-time translation video conference system and thinking about the network design. I'd like to hear your advice on how you would translate the remote participants' audio.

In a system where 3-10 people will participate:

  • The local user can choose between the remote participant's native audio and a translated version.
  • The translated audio is produced via speech-to-text and then synthesized in the speaker's voice.
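Those two requirements boil down to muting one of two audio sources per listener. A minimal sketch, assuming the native and translated audio are each attached to their own audio element (the names here are illustrative, not from any SDK):

```javascript
// Switch a listener between the remote participant's native audio and the
// translated audio. Both are assumed to be playing into their own <audio>
// elements; exactly one is audible at a time.
function selectAudio(choice, nativeEl, translatedEl) {
  nativeEl.muted = choice !== "native";
  translatedEl.muted = choice !== "translated";
  return choice;
}
```

In the browser, the two elements would get their streams via `srcObject`; the toggle itself is just flipping `muted` flags.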

My hypothetical design is:

  • The remote participant's voice is recorded with MediaRecorder.
  • The local user receives the remote participant's audio via a data channel.
  • The local user sends the remote participant's voice through the Google Speech-to-Text API, Google Translate, and Google Text-to-Speech, then plays the synthesized sound (muting the remote participant's native voice).
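A sketch of the recording leg of that design, assuming you already hold the remote MediaStream and an open RTCDataChannel (all names are illustrative; the reassembly helper would run on whichever side feeds the speech API):

```javascript
// Record a remote MediaStream in short chunks and ship them over a data
// channel. Browser-only sketch: MediaRecorder and RTCDataChannel exist
// only in the browser.
function streamAudioOverDataChannel(remoteStream, dataChannel, timesliceMs = 1000) {
  const recorder = new MediaRecorder(remoteStream, { mimeType: "audio/webm;codecs=opus" });
  recorder.ondataavailable = async (event) => {
    if (event.data.size > 0 && dataChannel.readyState === "open") {
      dataChannel.send(await event.data.arrayBuffer()); // one WebM chunk per timeslice
    }
  };
  recorder.start(timesliceMs); // emit a chunk every timesliceMs
  return recorder;
}

// On the receiving side, the chunks must be stitched back together before
// going to a speech-to-text API. Pure helper, runs anywhere:
function concatChunks(chunks) {
  const total = chunks.reduce((n, c) => n + c.length, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const c of chunks) { out.set(c, offset); offset += c.length; }
  return out;
}
```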

If I could get raw audio data from a MediaStream it would be easier, but I assume MediaStream doesn't allow that.
https://developer.mozilla.org/en-US/docs/Web/API/MediaStream
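For what it's worth, a remote MediaStream can be tapped in the browser: `createMediaStreamSource()` on an AudioContext exposes its raw samples. A hedged sketch (ScriptProcessorNode is deprecated in favor of AudioWorklet, but is shorter to show):

```javascript
// Tap raw PCM out of a (remote) MediaStream with the Web Audio API.
// Browser-only sketch; ScriptProcessorNode is deprecated but still widely
// supported — an AudioWorklet is the modern replacement.
function tapPcm(remoteStream, onSamples) {
  const ctx = new AudioContext();
  const source = ctx.createMediaStreamSource(remoteStream);
  const processor = ctx.createScriptProcessor(4096, 1, 1);
  processor.onaudioprocess = (e) => onSamples(e.inputBuffer.getChannelData(0));
  source.connect(processor);
  processor.connect(ctx.destination);
  return ctx;
}

// Google Speech-to-Text's LINEAR16 encoding wants 16-bit integers, while
// Web Audio hands out floats in [-1, 1]. Pure conversion helper:
function floatTo16BitPCM(float32) {
  const out = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    const s = Math.max(-1, Math.min(1, float32[i]));
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}
```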

Do you agree with my hypothetical design, or do you have a better idea? Any comments welcome. Thank you!

on January 13, 2021

    Hi there!

    Usually it goes this way:

    For the local audio listener
    Browser =(webrtc)=> Mediaserver =(webrtc)=> Browser

    The streamer transmits the stream to a media server, which distributes it to the listeners.

    For the listener of the translated audio
    Browser =(webrtc)=> Mediaserver
    =(rtp)=> ffmpeg =(flac)=> google speech to text => google translate => google text to speech => ffmpeg =(rtp)=> Mediaserver =(webrtc)=> Browser

    The streamer transmits the stream to a media server.

    From the media server, we convert the stream to an RTP stream for FFmpeg, and then to FLAC, which Google Speech-to-Text supports.
    Once we get a response, we translate it and convert the text into audio.
    That audio is fed back into FFmpeg.
    FFmpeg delivers the stream to the media server.
    The users get the translated stream from the media server via HTTP.
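The STT → Translate → TTS hop in the middle could be sketched in Node like this. The client objects mirror the shapes of the official @google-cloud/speech, @google-cloud/translate (v2), and @google-cloud/text-to-speech libraries, but they are injected as parameters, so treat this as a sketch of the flow rather than a drop-in implementation:

```javascript
// One pass through the translation pipeline: a FLAC chunk in, translated
// speech out. The three clients are passed in so any implementation
// (including a stub) with the Google Cloud Node shapes will work.
async function translateLeg(flacBytes, targetLang, { stt, translator, tts }) {
  // 1. Speech-to-Text: FLAC chunk -> recognized text
  const [sttResponse] = await stt.recognize({
    config: { encoding: "FLAC", languageCode: "en-US" },
    audio: { content: flacBytes.toString("base64") },
  });
  const text = sttResponse.results
    .map((r) => r.alternatives[0].transcript)
    .join(" ");

  // 2. Translate the recognized text
  const [translated] = await translator.translate(text, targetLang);

  // 3. Text-to-Speech: translated text -> audio for FFmpeg to re-stream
  const [ttsResponse] = await tts.synthesizeSpeech({
    input: { text: translated },
    voice: { languageCode: targetLang },
    audioConfig: { audioEncoding: "LINEAR16" },
  });
  return { text, translated, audio: ttsResponse.audioContent };
}
```

Running this once per utterance on the server is what lets every listener share a single translated stream instead of each doing their own translation.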

    In the algorithm you’ve suggested, it’s not quite clear why you’d use MediaRecorder and transmit the data via a data channel, since you can run MediaRecorder on the recipient’s side as well. Plus, if several people listen to the translated audio, each listener will translate the audio separately, instead of the audio being translated once and distributed to every listener.

    And you also haven’t mentioned whether you use a media server. If you plan on 3-10 people participating in the stream, we recommend you use one :)


      Thanks for the comment. I used Agora RTC, and it doesn't expose an audio track; that part is encapsulated in their SDK. I'm not sure if it's possible with Agora.


        We’re currently working on a project that uses Agora (Web SDK). The interface lets you extract audio tracks and attach them natively to the audio tag. Which particular SDK do you use? Have you tried getting the track from the audio tag as an alternative?

        https://docs.agora.io/en/All/downloads?platform=All Platforms
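For reference, a sketch of that extraction, assuming Agora Web SDK 4.x, where remote tracks expose getMediaStreamTrack() (check the docs above for your version):

```javascript
// Pull the raw MediaStreamTrack out of an Agora remote user so it can be
// fed to MediaRecorder / Web Audio. Assumes Agora Web SDK 4.x, where
// remote audio tracks expose getMediaStreamTrack().
function extractRemoteAudioTrack(remoteUser) {
  if (!remoteUser.audioTrack) return null; // not subscribed to audio yet
  return remoteUser.audioTrack.getMediaStreamTrack();
}

// Typical wiring (browser-only, shown for context):
// client.on("user-published", async (user, mediaType) => {
//   await client.subscribe(user, mediaType);
//   if (mediaType === "audio") {
//     const track = extractRemoteAudioTrack(user);
//     const stream = new MediaStream([track]); // usable anywhere a MediaStream is
//   }
// });
```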


          Ah, really. I remember you can do it with the Web SDK. I was using the Flutter SDK, and it doesn't let you get an audio track, so I gave up on making transcriptions.


            Maybe try extracting the audio track not straightforwardly, but by somehow getting it from the Flutter elements?
            There must be some way to get audio tracks via the Agora SDK that isn’t described in the API.


              I'm not sure. Maybe, but it could end up being like developing a new SDK.
