I mostly relied on text input before, so I didn't notice this until now, but I recently decided to give the Android app another shot, since latency during voice input was the main barrier for me. Yet it still doesn't feel any smoother or lower-latency than before, and I also can't find the camera mode that processes input near-instantly. It seems like I'm still using Whisper for voice input, and I just got a popup saying these features would come to me "soon". For what it's worth, I'm also a beta tester for the app via the Play Store.
Can anyone explain this behavior, or does anyone have similar experiences with the app? Is it possible that GPT-4o is currently being run with Whisper as an input stage for some users, like they did with GPT-4V and other predecessors? If so, does anyone know why? I was under the impression that handling all the different input types in one central model was partly an efficiency gain, so rolling out the new model while keeping the old system, where separate models convert everything into text before the main model handles it, makes little sense to me from a cost perspective alone.
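To make the comparison concrete, this is roughly what the old pipeline looks like if you reproduce it through the public API. It's just an illustrative sketch on my part (file names and parameter choices are mine), not how the app is actually wired internally:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Step 1: Whisper turns the spoken question into text.
with open("question.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# Step 2: the language model only ever sees the transcribed text.
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": transcript.text}],
)
answer = completion.choices[0].message.content

# Step 3: a separate TTS model reads the answer back out.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=answer,
)
speech.write_to_file("answer.mp3")
```

That's three separate models and three sequential round trips, which is exactly where the latency comes from, and the middle model never hears anything that got lost in transcription.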
Mainly, I'm interested in understanding how exactly GPT-4o works in connection with Whisper, given that no longer needing Whisper was a main selling point of this launch, and in why they'd do it this way even temporarily, since it should cost them more per user.
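For contrast, a native omni-model flow would collapse all of that into a single call. To be clear, this second snippet is hypothetical on my end: the model name and parameters follow the audio-capable chat completions endpoint (gpt-4o-audio-preview) rather than anything confirmed about how the app works, but it shows what "one model handles the audio directly" would look like:

```python
import base64

from openai import OpenAI

client = OpenAI()

# The raw audio goes straight to the model, base64-encoded.
with open("question.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

# One call, one model: audio in, audio (plus text) out.
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",  # assumed audio-capable snapshot, not confirmed for the app
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[{
        "role": "user",
        "content": [{
            "type": "input_audio",
            "input_audio": {"data": audio_b64, "format": "wav"},
        }],
    }],
)

# The spoken reply comes back on the same response object.
with open("answer.wav", "wb") as f:
    f.write(base64.b64decode(completion.choices[0].message.audio.data))
```

If that single-call path is what GPT-4o was built for, running it behind a Whisper front end seems like paying for both systems at once, which is the part I don't get.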