HACKER Q&A
📣 throwawaystress

Too much code cleaning, not enough results (data science)


How important is it for data scientists to have clean, modular, reusable code? Here’s my problem: while working on a project, I’ll start off in Jupyter notebooks, toying around with the data, doing some EDA, etc. Eventually I’ll pull out some of that code into functions in a Python file, and call those functions from the notebook. Neat.

The problem is, as I get more and more functions, I want to organize them more, make them more generalizable and consistent, etc. I’ll also get carried away with organizing files and source control, cleaning up my notes, and making documentation to explain what models/data/source files/results exist, what they mean, etc.

And then I realize I’ve been spending less and less time getting results, and more on this “overhead”. I struggle to balance the desire to rush ahead and get results with the compulsion to make the code “beautiful” and to have the project in the cleanest possible state. I’ve seen plenty of other projects with terrible organization, no documentation, and confusing, poorly formatted code. But if I’m not producing value, my neatness doesn’t matter.

All in all, I’m feeling pretty unproductive because of these habits. Any advice?


  👤 lordkrandel Accepted Answer ✓
It depends, so I'm asking you some questions to give you ideas.

How much of this code is going to be read, reused, modified, studied by you or other people?

Is it open source or foundational?

Is it of any interest for the general public?

Could you spend that time on something else that is more productive?

Is this refactor teaching you a new technique?

Can you find or develop an auto-formatter that makes messy code just neat and clean?

If you are building models for a process or phenomenon, can the results be the subject of an article, maybe in the future, to show your techniques and ask for feedback? Notebooks are just great for that.


👤 itqwertz
A good rule to follow is to get it done dirty, add some tests, then refactor. Real-world code is not always pretty or academic quality.
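
To make that order concrete, here is a rough sketch of the three steps in Python. Everything in it is made up for illustration: `summarize_dirty`, `summarize`, `_is_missing`, and `check` are hypothetical names, and the "summary" is just a mean and a count over a list that may contain missing values. The point is only the sequence: a messy version that works, a small test that pins down what it does, then a cleanup that has to pass the same test.

```python
import math


# Step 1: the quick-and-dirty version, roughly as it might leave a notebook.
def summarize_dirty(values):
    total = 0
    n = 0
    for v in values:
        if v is not None and not (isinstance(v, float) and math.isnan(v)):
            total += v
            n += 1
    return {"mean": total / n if n else None, "count": n}


# Step 2: a small test that pins down the current behavior.
# It takes the function as an argument so the same checks can be
# run against the refactored version later.
def check(fn):
    assert fn([1.0, 2.0, 3.0]) == {"mean": 2.0, "count": 3}
    assert fn([1.0, None, float("nan"), 3.0]) == {"mean": 2.0, "count": 2}
    assert fn([]) == {"mean": None, "count": 0}


# Step 3: the refactor, done only once the test exists.
def _is_missing(v):
    """Treat None and float NaN as missing (an assumption for this sketch)."""
    return v is None or (isinstance(v, float) and math.isnan(v))


def summarize(values):
    present = [v for v in values if not _is_missing(v)]
    count = len(present)
    mean = sum(present) / count if count else None
    return {"mean": mean, "count": count}


# Both versions must satisfy the same checks.
check(summarize_dirty)
check(summarize)
```

If the cleanup never happens, you still have working code and a test. If it does happen, the test tells you the refactor did not change the results.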

Automation is also a good way to get rid of monotonous tasks and boilerplate.