6 Comments

How to implement realtime translation for WebRTC Video meeting?

Hi, IHs. I'm building a real-time translation video conference system and thinking about the network design. I'd like to hear your advice on how you would translate the remote participants' audio.

In a system where 3-10 people will participate:

  • The local user can choose between the remote participant's native audio and a translated version.
  • The translated audio is produced via speech-to-text and then synthesized in the speaker's voice.
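Those two requirements boil down to muting one of two audio sources per listener. A minimal sketch, assuming the native and translated audio are each attached to their own audio element (the names here are illustrative, not from any SDK):

```javascript
// Switch a listener between the remote participant's native audio and the
// translated audio. Both are assumed to be playing into their own <audio>
// elements; exactly one is audible at a time.
function selectAudio(choice, nativeEl, translatedEl) {
  nativeEl.muted = choice !== "native";
  translatedEl.muted = choice !== "translated";
  return choice;
}
```

In the browser, the two elements would get their streams via `srcObject`; the toggle itself is just flipping `muted` flags.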

My hypothetical design is:

  • The remote participant's voice is recorded with MediaRecorder.
  • The local user receives the remote participant's audio via a data channel.
  • The local user sends the remote participant's voice through the Google Speech-to-Text API, Google Translate, and Google Text-to-Speech, then plays the synthesized sound (muting the remote participant's native voice).
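A sketch of the recording leg of that design, assuming you already hold the remote MediaStream and an open RTCDataChannel (all names are illustrative; the reassembly helper would run on whichever side feeds the speech API):

```javascript
// Record a remote MediaStream in short chunks and ship them over a data
// channel. Browser-only sketch: MediaRecorder and RTCDataChannel exist
// only in the browser.
function streamAudioOverDataChannel(remoteStream, dataChannel, timesliceMs = 1000) {
  const recorder = new MediaRecorder(remoteStream, { mimeType: "audio/webm;codecs=opus" });
  recorder.ondataavailable = async (event) => {
    if (event.data.size > 0 && dataChannel.readyState === "open") {
      dataChannel.send(await event.data.arrayBuffer()); // one WebM chunk per timeslice
    }
  };
  recorder.start(timesliceMs); // emit a chunk every timesliceMs
  return recorder;
}

// On the receiving side, the chunks must be stitched back together before
// going to a speech-to-text API. Pure helper, runs anywhere:
function concatChunks(chunks) {
  const total = chunks.reduce((n, c) => n + c.length, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const c of chunks) { out.set(c, offset); offset += c.length; }
  return out;
}
```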

If I could get raw audio data from a MediaStream it would be easier, but I assume MediaStream doesn't allow that.
https://developer.mozilla.org/en-US/docs/Web/API/MediaStream
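For what it's worth, a remote MediaStream can be tapped in the browser: `createMediaStreamSource()` on an AudioContext exposes its raw samples. A hedged sketch (ScriptProcessorNode is deprecated in favor of AudioWorklet, but is shorter to show):

```javascript
// Tap raw PCM out of a (remote) MediaStream with the Web Audio API.
// Browser-only sketch; ScriptProcessorNode is deprecated but still widely
// supported — an AudioWorklet is the modern replacement.
function tapPcm(remoteStream, onSamples) {
  const ctx = new AudioContext();
  const source = ctx.createMediaStreamSource(remoteStream);
  const processor = ctx.createScriptProcessor(4096, 1, 1);
  processor.onaudioprocess = (e) => onSamples(e.inputBuffer.getChannelData(0));
  source.connect(processor);
  processor.connect(ctx.destination);
  return ctx;
}

// Google Speech-to-Text's LINEAR16 encoding wants 16-bit integers, while
// Web Audio hands out floats in [-1, 1]. Pure conversion helper:
function floatTo16BitPCM(float32) {
  const out = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    const s = Math.max(-1, Math.min(1, float32[i]));
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}
```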

Do you agree with my hypothetical design, or do you have a better idea? Any comments welcome. Thank you!

on January 13, 2021

    Hi there!

    Usually it goes this way:

    For the local audio listener
    Browser =(webrtc)=> Mediaserver =(webrtc)=> Browser

    The streamer transmits the stream to a media server, which distributes it to the listeners.

    For the listener of the translated audio
    Browser =(webrtc)=> Mediaserver
    =(rtp)=> ffmpeg =(flac)=> google speech to text => google translate => google text to speech => ffmpeg =(rtp)=> Mediaserver =(webrtc)=> Browser

    The streamer transmits the stream to a media server.

    From the media server, we convert the stream to an RTP stream for FFmpeg, and then to FLAC, which Google Speech-to-Text supports.
    Once we get a response, we translate it and convert the text into audio.
    That audio is fed back into FFmpeg.
    FFmpeg delivers the stream to the media server.
    The users get the translated stream from the media server via HTTP.
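The STT → Translate → TTS hop in the middle could be sketched in Node like this. The client objects mirror the shapes of the official @google-cloud/speech, @google-cloud/translate (v2), and @google-cloud/text-to-speech libraries, but they are injected as parameters, so treat this as a sketch of the flow rather than a drop-in implementation:

```javascript
// One pass through the translation pipeline: a FLAC chunk in, translated
// speech out. The three clients are passed in so any implementation
// (including a stub) with the Google Cloud Node shapes will work.
async function translateLeg(flacBytes, targetLang, { stt, translator, tts }) {
  // 1. Speech-to-Text: FLAC chunk -> recognized text
  const [sttResponse] = await stt.recognize({
    config: { encoding: "FLAC", languageCode: "en-US" },
    audio: { content: flacBytes.toString("base64") },
  });
  const text = sttResponse.results
    .map((r) => r.alternatives[0].transcript)
    .join(" ");

  // 2. Translate the recognized text
  const [translated] = await translator.translate(text, targetLang);

  // 3. Text-to-Speech: translated text -> audio for FFmpeg to re-stream
  const [ttsResponse] = await tts.synthesizeSpeech({
    input: { text: translated },
    voice: { languageCode: targetLang },
    audioConfig: { audioEncoding: "LINEAR16" },
  });
  return { text, translated, audio: ttsResponse.audioContent };
}
```

Running this once per utterance on the server is what lets every listener share a single translated stream instead of each doing their own translation.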

    In the algorithm you’ve suggested, it’s not quite clear why you’d use MediaRecorder and transmit the data via a data channel, since you can run MediaRecorder on the recipient’s side as well. Plus, if several people listen to the translated audio, each listener will translate the audio separately, instead of the audio being translated once and distributed to every listener.

    And you also haven’t mentioned whether you use a media server. If you plan on 3-10 people participating in the stream, we recommend you use one :)


      Thanks for the comment. I used Agora RTC, and it doesn't expose an audio track; that part is encapsulated in their SDK. I'm not sure if it's possible with Agora.


        We’re currently working on a project that uses Agora (Web SDK). The interface lets you extract audio tracks and attach them natively to the audio tag. Which particular SDK do you use? Have you tried getting the track from the audio tag as an alternative?

        https://docs.agora.io/en/All/downloads?platform=All Platforms
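For reference, a sketch of that extraction, assuming Agora Web SDK 4.x, where remote tracks expose getMediaStreamTrack() (check the docs above for your version):

```javascript
// Pull the raw MediaStreamTrack out of an Agora remote user so it can be
// fed to MediaRecorder / Web Audio. Assumes Agora Web SDK 4.x, where
// remote audio tracks expose getMediaStreamTrack().
function extractRemoteAudioTrack(remoteUser) {
  if (!remoteUser.audioTrack) return null; // not subscribed to audio yet
  return remoteUser.audioTrack.getMediaStreamTrack();
}

// Typical wiring (browser-only, shown for context):
// client.on("user-published", async (user, mediaType) => {
//   await client.subscribe(user, mediaType);
//   if (mediaType === "audio") {
//     const track = extractRemoteAudioTrack(user);
//     const stream = new MediaStream([track]); // usable anywhere a MediaStream is
//   }
// });
```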


          Ah, really. I remember you can do it with the Web SDK. I was using the Flutter SDK, and it doesn't let you get an audio track, so I gave up on making transcriptions.


            Maybe try extracting the audio track not straightforwardly, but by somehow getting it from the Flutter elements?
            There must be some way to get audio tracks via the Agora SDK that isn’t described in the API.


              I'm not sure. Maybe, but it could end up being like developing a new SDK.
