Hi, IHs. I'm building a real-time translation video conference system and thinking about the network design. I'd like to hear your advice on how you would translate the remote participants' audio.
The system will have 3-10 participants.
My hypothetical design is:
If I could get the audio data from a MediaStream it would be easier, but I assume MediaStream doesn't allow that.
https://developer.mozilla.org/en-US/docs/Web/API/MediaStream
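As a side note, the standard MediaStream API does let you get at the audio: `getAudioTracks()` returns the stream's audio tracks, and a remote stream arrives via an `RTCPeerConnection`'s `track` event. A minimal sketch (browser environment assumed; the `pc` peer connection and what you do with the track are placeholders):

```javascript
// Pull the audio tracks out of a MediaStream. getAudioTracks() is part
// of the standard MediaStream API and works on remote streams too.
function extractAudioTracks(stream) {
  return stream.getAudioTracks();
}

// Typical usage with WebRTC (pc is a hypothetical RTCPeerConnection):
// pc.ontrack = (event) => {
//   const [remoteStream] = event.streams;
//   const [audioTrack] = extractAudioTracks(remoteStream);
//   // audioTrack can now be fed into a MediaRecorder or an AudioContext
// };
```

Whether you can reach this layer depends on the SDK wrapping WebRTC for you, which is what the discussion below turns on.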
Do you agree with my hypothetical design? Or do you have a better idea? Any comments are welcome. Thank you!
Hi there!
Usually it goes this way:
For the local audio listener
Browser =(webrtc)=> Mediaserver =(webrtc)=> Browser
The streamer sends the stream to a media server, which distributes it to the listeners
For the listener of the translated audio
Browser =(webrtc)=> Mediaserver
=(rtp)=> ffmpeg =(flac)=> google speech to text => google translate => google text to speech => ffmpeg =(rtp)=> Mediaserver =(webrtc)=> Browser
The streamer transmits the stream to a media server
From the media server we convert the stream to an RTP stream for FFmpeg, and then to FLAC, a format Google Speech-to-Text supports
Once we get the recognized text back, we run it through Google Translate and convert the translated text into audio with Google Text-to-Speech
We feed that audio back into FFmpeg
FFmpeg delivers the stream to the media server
The users get the translated stream from the media server via WebRTC
In the algorithm you’ve suggested, it’s not quite clear why you'd use MediaRecorder and transmit the data via a Data Channel, since you can enable MediaRecorder on the recipient's side as well. Plus, if several people listen to the translated audio, each listener will translate the audio separately instead of receiving audio that was translated once for everyone.
And you also haven’t mentioned whether you use a media server. If you plan on 3-10 people participating in the stream, we recommend using one :)
Thanks for the comment. I used Agora RTC, and it doesn't expose an audio track; that part is encapsulated in their SDK. Not sure if it's possible with Agora.
We’re currently working on a project that uses Agora (Web SDK). The interface lets you extract audio tracks and attach them natively to an audio tag. Which SDK do you use in particular? Have you tried getting the track from the audio tag as an alternative?
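For reference, here is a sketch of how that extraction looks with the Agora Web SDK 4.x ("NG") API, where a remote audio track exposes `getMediaStreamTrack()`. The handler shape is an assumption for illustration; check it against your SDK version:

```javascript
// Sketch, assuming the Agora Web SDK 4.x API: when a remote user
// publishes, subscribe and pull the underlying native MediaStreamTrack
// out of the SDK's remote audio track wrapper.
async function handleUserPublished(client, user, mediaType) {
  await client.subscribe(user, mediaType);
  if (mediaType !== 'audio') return null;
  // getMediaStreamTrack() returns a standard browser MediaStreamTrack,
  // which can then go into a MediaRecorder or an AudioContext.
  return user.audioTrack.getMediaStreamTrack();
}

// Usage sketch with a real Agora client:
// client.on('user-published', (user, mediaType) =>
//   handleUserPublished(client, user, mediaType));
```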
https://docs.agora.io/en/All/downloads?platform=All Platforms
Ah, really? I remember you can do it with the Web SDK. I was using the Flutter SDK, and it doesn't let you get an audio track, so I gave up on making transcriptions.
Maybe try extracting the audio track not directly, but somehow getting it from the Flutter elements?
There must be some way to get audio tracks via the Agora SDK that isn’t described in the API docs
I'm not sure. Maybe. It could end up meaning developing a new SDK.