HACKER Q&A
📣 raidicy

Is there any work being done in speech-to-code with deep learning?


Is there any work being done on speech-to-code using deep learning? I have severe RSI which prevents me from coding at all. I have tried speech recognition software such as Vocola and the Windows speech engine, but it required me to speak in a way that always hurt my throat. I have injured my throat multiple times, so I am searching for a solution that is more conversational than command-driven. I have written over 10,000 lines of commands for Vocola, and there are still too many edge cases that require me to continually speak in an abrupt manner that strains my throat.


  👤 daanzu Accepted Answer ✓
Windows Speech Recognition is far from the best, so perhaps your trouble could be partly caused by how you had to speak in order to be understood, rather than the command style? I used to use WSR to code by voice, and it was far more laborious than my current setup.

I develop kaldi-active-grammar [0]. The Kaldi engine is state of the art for command and control. Although I don't have the data and resources for training a model like Microsoft/Nuance/Google, being an open rather than closed system allows me to train models that are far more personalized than the large commercial/generic ones you are used to. For example, see the video of me using it [1], where I can speak in a relaxed manner without having to over enunciate and strain my voice.

Gathering the data for such training does take some time, but the results can be huge [2]. Performing the actual training is currently complicated; I am working on making it portable and more turnkey, but it's not ready yet. However, I am running test training for some people -- contact me if you want to be a guinea pig. (A minimal grammar sketch follows the links below.)

[0] https://github.com/daanzu/kaldi-active-grammar

[1] https://youtu.be/Qk1mGbIJx3s

[2] https://github.com/daanzu/kaldi-active-grammar/blob/master/d...
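
To give a concrete sense of the command-and-control style this supports, here is a minimal sketch of a dragonfly grammar running on the Kaldi engine (kaldi-active-grammar provides the backend for dragonfly2). The model directory and the specific phrases are illustrative assumptions, not the project's documented defaults:

```python
# Minimal sketch: a dragonfly command grammar on the Kaldi engine
# (backend provided by kaldi-active-grammar). "kaldi_model" and the
# phrases below are assumptions for illustration only.
from dragonfly import (Dictation, Grammar, Key, MappingRule, Text,
                       get_engine)

engine = get_engine("kaldi", model_dir="kaldi_model")  # downloaded model
engine.connect()

class CodingRule(MappingRule):
    # Each spoken phrase maps to keystrokes; <name> is free dictation.
    # Formatting dictated names (e.g. snake_case) is omitted for brevity.
    mapping = {
        "new function <name>": Text("def %(name)s():") + Key("enter, tab"),
        "assign <name>": Text("%(name)s = "),
        "save file": Key("c-s"),
    }
    extras = [Dictation("name")]

grammar = Grammar("coding")
grammar.add_rule(CodingRule())
grammar.load()

engine.do_recognition()  # listen until interrupted
```

The point of the personalized model training described above is that phrases like these can be spoken in a relaxed, natural voice and still be recognized reliably.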


👤 apeddle
I came across Serenade (https://serenade.ai/) recently. It's still in beta, but I was very impressed. In the past I've used Vocola and a few other open-source options; Serenade felt much more natural and powerful. The founders are also super hands-on and genuinely seem to care about the problem.

👤 tbabej
While (likely) not directly using deep learning, the following talk [1] by Emily Shea on her code dictation setup (based on Talon Voice) is both insightful and impressive.

EDIT: Actual demo with coding starts at 18:00: https://youtu.be/YKuRkGkf5HU?t=1076

[1] https://www.youtube.com/watch?v=YKuRkGkf5HU


👤 bmc7505
Shameless plug, but I have been working on an open-source IDE plugin [1] for the IntelliJ Platform which attempts to do this. Previously we used an older HMM-based speech toolkit, CMUSphinx [2], but we are currently transitioning to a deep-learning-based speech recognition system. We also tried a number of cloud APIs, including Amazon Lex and Google Cloud Speech, but they were too slow -- offline STT is really important for low-latency UX. For navigation and voice typing, we need something customizable and fairly responsive. Custom grammars would be nice for various contexts and programming languages.

There are a few good OSS offline deep speech libraries, including Mozilla DeepSpeech [3], but their resource footprint is too high. We settled on the currently less mature vosk [4], which is based on Kaldi [5] (a more popular deep speech pipeline) and includes a number of low-footprint, pretrained language models for real-time streaming inference. Research has shown how to deploy efficient deep speech models on CPUs [6], so we're hoping those gains will soon translate to faster performance on commodity laptops. You can follow this issue [7] for updates on our progress. Contributions are welcome! (A rough sketch of vosk's streaming API follows the links below.)

[1]: https://github.com/OpenASR/idear/

[2]: https://cmusphinx.github.io/

[3]: https://github.com/mozilla/DeepSpeech

[4]: https://github.com/alphacep/vosk-api

[5]: https://github.com/kaldi-asr/kaldi

[6]: https://ai.facebook.com/blog/a-highly-efficient-real-time-te...

[7]: https://github.com/OpenASR/idear/issues/52
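
For a concrete flavor of what offline, real-time streaming recognition with vosk looks like, here is a rough sketch modeled on its microphone example. The model path and the phrase list are placeholders, not part of the plugin:

```python
# Rough sketch: offline streaming speech recognition with vosk.
# "model" must be a directory containing a pretrained vosk model.
import json
import queue

import sounddevice as sd
from vosk import KaldiRecognizer, Model

SAMPLE_RATE = 16000
audio = queue.Queue()

def callback(indata, frames, time, status):
    # Push raw microphone bytes over to the recognizer loop.
    audio.put(bytes(indata))

model = Model("model")
# An optional phrase-list "grammar" constrains recognition to commands;
# "[unk]" catches everything else.
rec = KaldiRecognizer(model, SAMPLE_RATE,
                      json.dumps(["go to line", "select word", "[unk]"]))

with sd.RawInputStream(samplerate=SAMPLE_RATE, blocksize=8000,
                       dtype="int16", channels=1, callback=callback):
    while True:
        if rec.AcceptWaveform(audio.get()):
            print(json.loads(rec.Result())["text"])            # final utterance
        else:
            print(json.loads(rec.PartialResult())["partial"])  # live partial
```

The partial results are what make low-latency UX possible: the UI can react while the user is still speaking.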


👤 mkl
I use dictation a bit for prose, but my voice wouldn't be able to handle more than a couple of hours a day of that.

Can you use a touch screen or mouse? I went ~13 years without using a keyboard, and typed with mice (some customised), trackballs, and touch screens, mostly using predictive typing software I wrote. In that time I did a lot of programming, including a whole applied maths PhD.

One of the best mouse setups I came up with, in a variety of versions, was moving the cursor with one hand and clicking with the other. Holding the mouse still while clicking a button accurately is a surprisingly problematic movement. I made a buttonless mouse with just a flat top to rest the side of my hand on, and a bit sticking up to grip. Standalone USB numeric keypads can be remapped to mouse clicks and common keys.

Touch screens can also be very good if set up right, as all the movement can come from the big muscles and joints of your upper arm and shoulder, so your fingers and wrist don't need to do much. The screen needs to be positioned well: not out in front of you, but down close and angled so your arm rests in a comfortable position for long periods.


👤 downerending
Not sure exactly how bad "severe" is, but I had a lot of luck with my RSI by switching to two-fingered typing for a (long) while. It's crucial to keep everything below your elbows utterly relaxed, sort of like a pianist.

Also, I bought a keyboard tray that supported a steep negative tilt, which helped me keep a very anatomical (relaxed and natural) position.

Also, figure out the mouse somehow -- something like the setup above, plus switching sides frequently.

I've no idea if that could help you, but after a few years, I'm largely in remission.

I know this isn't really what you were asking, but I'm somewhat hopeful you can find relief. Good luck.


👤 xenonite
Is the Hacker News process broken? I currently see three comments being downvoted without any apparent reason: https://news.ycombinator.com/item?id=23507041 https://news.ycombinator.com/item?id=23506992 https://news.ycombinator.com/item?id=23507486

👤 setzer22
I've been working on a similar use case at work (going from discursive speech to CLI-like commands, using a semi-rigid language), and I didn't find any off-the-shelf, purely ML-based solution that would work for us.

In my experience, any service claiming to use deep learning produced far worse results than what we could get with simple approaches when faced with non-grammatical sentences (or rather, sentences with a grammar different from English's). Of course, that's because the models are not typically trained with this use case in mind! But the fact that you need a huge amount of data to even slightly alter the expected inputs of the system was, to me, a deal breaker.

For the specific case of programming by voice, Silvius comes to mind. It was built, and is used, by a developer with this same problem. It's a bit wonky having to spell words sometimes with alpha-beta-gamma speech, and it won't work without some customization, but on the other hand it's completely free and open source: https://github.com/dwks/silvius


👤 O_H_E
Related: there was a famous thread here a few months ago that would be very helpful.

Ask HN: I'm a software engineer going blind, how should I prepare? (https://news.ycombinator.com/item?id=22918980)


👤 byteface
On a Mac there is a tool called Voice Control which can trigger custom commands or keyboard shortcuts. You can use it to trigger shortcuts in any IDE, so if your IDE supports custom shortcuts for templating, you're away.

👤 suby
I'm also curious about this.

The best project I've seen for voice coding is Talon Voice, but I doubt anything novel is being done with it and deep learning. I'd suggest trying it out if you haven't. They also have a pretty active Slack channel; you might have some luck asking there whether they know of anything on the horizon.

https://talonvoice.com/


👤 gtmtg
Check out https://serenade.ai - another startup working on this!

👤 cellis
I really hate to see programmers/typists suffering from RSI when it is entirely preventable with the right ergonomics. Having worked on production NLP systems, I have to say I think typing will remain a more effective way of coding for many years to come (for many reasons, but primarily because syntaxes change often and training for syntax and context is hard). I also had RSI for many years, and eventually it started to affect my sports and esports as well.

So first, I switched my mouse to my non-dominant hand (the left, for me), since my dominant hand already has plenty to deal with. I'm also using a workstation that allows me to mount my displays at eye level while sitting or standing; not hunching over is ergonomics 101. Second, I switched from a standard keyboard to a split keyboard. I tried many -- Goldtouch, Kinesis Advantage2, Kinesis Freestyle -- and ultimately settled on the Ultimate Hacking Keyboard.

I could write many more paragraphs on how I customized it and why it won out, but the most important thing is that it is split and it "felt" best, once I mastered the key placements (the arrows are in different places).

Third, I started learning Vim. Vim is really awesome, but up until recently it didn't have great IDE or other editor support. Now it does, so there's no reason not to use it. I mostly use it for quickly jumping around files and going to line numbers.

Fourth, I'm always looking to optimize non-Vim shortcuts in my editor. For example, expand-region (now standard in VSCode) is one of my favorite plugins.

Fifth, I'm very conscious of using my laptop for long stretches of time. Mousing on the built-in touchpad is much more RSI-inducing than using a nice gaming mouse and the UHK keyboard.

All of this is to say that RSI doesn't have to be career ending. If you're doing software work and you have functioning hands and wrists, you should definitely look to optimize typing before turning to speech-to-code. Good luck!


👤 tluyben2
Probably not good for your case, but at the end of the summer we are launching the beta of our product, a visual + speech-controlled programming language. It's very niche, as it's a new language and IDE built from scratch, but so far it's been fun to work on.

👤 mk4
Not coding, but OpenAI's API, now in beta, includes a natural-language-to-Bash demo: https://openai.com/blog/openai-api/

👤 xchaotic
Not sure if this is off topic, but it wasn't RSI that put the nail in the coffin for my programming; it was a spine injury, and I had to have surgery. There are lots of jobs around programming that don't require as much typing, and even when you do type, it's easier to dictate email than code. Basically, you got RSI from coding and generally spending too much time at the keyboard. Maybe at least consider alternatives where you are not spending lots of screen time again.

👤 netman21
I was hoping this was a question about applying NLP to coding tasks, but based on the answers it is about voice to text for the special use case of coders.

I am not a coder, I am a writer. I wonder why all these AI people are trying to create things that will displace my means of earning a living instead of something that will create applications?

Why can't I tell my Mac: "Computer: take this collection of files and extract all the addresses of people in Indiana."
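
(For what it's worth, the request itself is only a few lines of conventional code; the hard part is the language understanding. A hedged sketch of what such a spoken command might compile down to, where the directory name and the address heuristic are invented for illustration:)

```python
# Hedged sketch: what the spoken request above might compile down to.
# The directory name ("collection") and the address heuristic (a street
# number followed by ", IN <zip>") are invented for illustration.
import pathlib
import re

for path in pathlib.Path("collection").glob("*.txt"):
    for line in path.read_text(errors="ignore").splitlines():
        if re.search(r"\d+ .+, IN \d{5}", line):
            print(f"{path.name}: {line.strip()}")
```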


👤 riedel
Here is an article summarizing some nice work: https://www.nature.com/articles/d41586-018-05588-x

I always wanted to learn vimspeak: https://news.ycombinator.com/item?id=5660633


👤 fareesh
I remember watching a video from maybe 10-12 years ago of a guy who used Dragon NaturallySpeaking with some kind of Vim commands and a shorthand language, which enabled him to write code faster than he could with the keyboard. It was a fascinating demo, but the efficiency seemed to come from the short language he had developed for his workflow, like "bop chop hop" etc.

👤 boomersooner
I have similar issues. I use a combination of a Kinesis Advantage, a Penguin mouse, and Dragonfly/DNS. Having a good microphone does make a difference, as does retraining/tweaking the command vocabulary. The biggest thing overall is the ergonomics of desk work -- I take a break every 15 minutes (or try to) by setting timers.

👤 mulmen
I can’t help you with actually converting speech to code but it occurs to me this would be a benefit to everyone. Speaking the words that are represented by the code we write would require a much deeper understanding of what we are doing and why.

Food for thought for sure. Good luck.


👤 tibu
One of my friends was just diagnosed with ALS. Software like the tools listed in this thread can make life functional and enjoyable for their remaining years. Keep up the good work, everyone! I'll definitely check where I could contribute too.

👤 mtrimpe
Have you tried VoiceCode? https://voicecode.io/

That in combination with switching to a Lisp (Clojure) almost made it feasible for me to code with RSI.

I just became a manager instead because I couldn’t work from home and talking like that in the office was a no-go for me.

If that’s your cup of tea you’d be surprised at how happy upper management is to have someone who’s actually good at technology be willing to engage with them.


👤 idontevengohere
I'd love to help! If anyone's working on this in this thread, lemme know :)

👤 j88439h84
The options are Caster and Talon. Talon is closed source.

👤 mpourmpoulis
Though unfortunately I cannot provide the conversational solution you are looking for, I believe there are some steps you can take, and solutions currently available, that could help make your voice programming experience less exhausting, so it might be worth giving them a try:

1) Try to minimize how much you have to speak by leveraging autocompletion as much as possible. For me, TabNine [1] has been a great help in that regard.

2) Try to use snippets as much as possible, both to reduce boilerplate code and because you can simply tab through the various fields (see the sketch after this list). It has been a great help for me that with Sublime it is possible [2], without installing anything, to have all of my snippets inside dragonfly grammars, or even to generate them dynamically [10], providing much-needed structural control over what you write. I know this is more primitive (at least for the time being; there are ideas to improve it) than what you are asking for, but for me it has been enough to make C++ enjoyable again! Unfortunately my pull request to integrate this into Caster [3] has fallen behind, but all of the basic functionality, along with various additional utilities, is there if you want to give it a try. Just be aware of this little bug [4], which applies here as well!

3) Not directly related to code generation, but if you find yourself spending a lot of time and vocal effort on navigation, consider either adding eye tracking to the mix or using one of the (at least three) projects that provide syntactic navigation capabilities. As the author, and more importantly as a user, of PythonVoiceCodingPlugin [5], I have seen quite a bit of difference since I got it up to speed, because a) even though it is command-driven, the commands sound natural and smooth; b) though they can get longer, in practice utterances are usually 3 to 5 (maybe 6) words, which makes them long enough that you do not have to speak abruptly but short enough that you do not have to hurry to finish them before running out of breath; and c) I personally need fewer commands compared to using only keyboard shortcuts, so there is less load on your voice! The other two projects I am aware of in this area are Serenade [6] and VoiceCodeIdea [7], so see if something fits your use case!

4) Use noise input where you can to reduce voice strain. Talon [8][9] is by far the way to go in this field, but you might be able to get inferior but decent results with other engines as well. For instance, DNS 15 Home can recognize some 30+ letter-like "sounds" such as "fffp, pppf, tttf, shhh, ssss, shhp, pppt, xxxx, tttp, kkkp"; you just have to make sure that you use 4 or more letters in your grammar (so, for instance, "ffp" will not work). Recognition accuracy is going to degrade if you overload it too much, but it is still good enough to simplify a lot of common tasks.

5) Give a different engine a try; I was not really that satisfied with WSR either.

6) See if any of the advice from [11] helps, and seek out professional help!
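
To illustrate point 2 concretely: a single utterance can insert a snippet skeleton whose tab-stop fields you then jump through, instead of dictating the boilerplate. This is a minimal sketch using Sublime Text's built-in insert_snippet command, runnable from Sublime's Python console (where view is predefined); wiring it up to a dragonfly voice command is omitted for brevity:

```python
# Minimal sketch: insert a snippet with tab-stop fields ($1, $2, $0)
# via Sublime Text's built-in insert_snippet command. Run from the
# Sublime console, where `view` is the active view; binding this to a
# dragonfly voice command is left out for brevity.
view.run_command("insert_snippet", {
    "contents": "for ${1:item} in ${2:items}:\n\t${0:pass}",
})
```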

I realize that my post diverges from what you originally asked for, but I feel the points raised here might help you lessen the impact of voice strain for the time being, until more robust solutions like the GPT-3 approach mentioned in one of the comments above are up and running. My apologies if this is completely off topic!

[1] https://www.tabnine.com/

[2] https://github.com/mpourmpoulis/CasterSublimeSnippetInterfac...

[3] https://github.com/dictation-toolbox/Caster

[4] https://github.com/mpourmpoulis/PythonVoiceCodingPlugin/issu...

[5] https://packagecontrol.io/packages/PythonVoiceCodingPlugin

[6] https://serenade.ai/

[7] https://plugins.jetbrains.com/plugin/10504-voice-code-idea

[8] https://talonvoice.com/

[9] https://noise.talonvoice.com/

[10] https://github.com/mpourmpoulis/CasterSublimeSnippetInterfac...

[11] https://dictation-toolbox.github.io/dictation-toolbox.org/vo...


👤 redis_mlc
I don't know about your specific RSI case, but moving from Java or C to a scripting language like Perl or Python can be helpful, since it can mean up to 10x fewer lines of code.

Also, talk to an ergonomics specialist about it. And it sounds like laptops are out at this point, unless you have an external keyboard, mouse, and monitor.


👤 Vinceo
Are you sure you have RSI and not TMS (tension myositis syndrome)? It's a condition that causes real physical symptoms (of which wrist pain is a common one) that are not due to pathological or structural abnormalities. Rather, the symptoms are caused by stress and repressed emotions.

Check out this success forum of people who have healed from all kinds of chronic pain symptoms by dealing with stress and changing their mindset:

https://www.tmswiki.org/forum/forums/success-stories-subforu...