Incremental programming is the process of writing code one step at a time and checking the output after each one. It’s common practice for data exploration workflows, and it goes a little something like this:
You’re using pandas to do some Python data analysis. The first thing you do with your new data set is print df.head() to get a sense of what the data looks like. You notice there are a few NaN values in the first five rows, so you print df.info() to see how prevalent they are throughout the data set. Depending on their distribution, you’ll either leave them be or figure out a way to remove them.
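That first exploratory loop looks something like this sketch. The CSV here is a hypothetical stand-in for "your new data set"; any DataFrame with a few NaN values behaves the same way.

```python
import io
import pandas as pd

# Hypothetical data standing in for your new data set.
csv = io.StringIO(
    "name,age,score\n"
    "Ann,34,88.0\n"
    "Ben,,91.5\n"
    "Cal,29,\n"
    "Dee,41,75.0\n"
    "Eve,,\n"
)
df = pd.read_csv(csv)

print(df.head())   # eyeball the first five rows; a few NaNs show up
df.info()          # per-column non-null counts reveal how prevalent they are

# If the NaNs turn out to be sparse, one option is simply dropping those rows.
cleaned = df.dropna()
print(len(df), "->", len(cleaned))
```

Each `print` is a checkpoint: you look at the output, then decide the next step based on what you see.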
Incremental programming is one of those things where using the right tool can make you 10 times faster. Trying to explore and clean a dataset by writing a script and executing the entire thing at once (aka the main.py style of programming) is equivalent to proofreading this blog with only a paper dictionary.
REPL programming, notebooks, and spreadsheets each have massive improvements over the main.py approach, but what if we combined all three?
REPL (Read-Eval-Print Loop) is a programming environment that reads the user’s input, evaluates it, and prints the result back to the user. In fact, if your first experience programming was printing “hello world” to the screen in a Python shell, then your first program used the REPL paradigm.
The benefits of REPL programming are quite obvious. For iterative tasks like data exploration, it’s more natural and efficient to write and execute your program in a piece-wise fashion rather than executing the entire script top to bottom after each iteration. REPL programming lets you apply a single transformation to your data, look at the output, and then apply a new transformation to that already altered state.
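In a REPL session, each statement below would be entered and inspected one at a time; the data and column names are made up for this sketch. The point is that step 2 operates on the already-filtered result of step 1, not on the original data.

```python
import io
import pandas as pd

# Illustrative data for the sketch.
df = pd.read_csv(io.StringIO(
    "city,temp\nOslo,-3\nRome,18\nCairo,31\nLima,22\n"
))

# Step 1: apply one transformation and look at the output...
warm = df[df["temp"] > 0]
print(warm)  # Oslo is gone; three rows remain

# Step 2: ...then apply the next transformation to that already-altered state.
warm = warm.assign(temp_f=warm["temp"] * 9 / 5 + 32)
print(warm)
```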
Executing a single transformation and immediately seeing the result lets you focus on the specific piece of your program that you’re interested in. Like all feedback, feedback when you’re programming is best received immediately and in context. In contrast, in the traditional main.py style of programming, you have to wait dozens of seconds for your program to finish executing before you get any feedback on your code. The workflow gets even choppier if you have to open Excel or Notepad after execution to see your results. Trust me, I’ve been there — it’s not how you want to set up your data exploration process.
Using a REPL programming environment cuts out that context switching and at least gives you a chance of remaining in flow.
Notebooks, like Jupyter Notebooks, Google Colab, and Deepnote, let you execute your code one cell at a time. As a result, notebooks still provide the immediate feedback of a command line REPL, but also a view of and control over the current state of your entire program… a big improvement over the command line, where it’s difficult even to view the previous commands you’ve executed, let alone reorder their execution.
In a notebook, you’re afforded the standard main.py style control over the grouping and order of execution of your program, but with the immediate feedback of REPL programming. Notebooks turn the REPL workflow from a command line-based one to a pretty, ergonomic UI-based workflow that gives you some of the benefits of traditional IDE programming.
In traditional notebooks, REPL programming is still a choppy process because their code-first display obscures the underlying data, which is ultimately the driver of your transformations. Trying to understand the current state of your data requires switching your attention back and forth between your code cell and output cell in order to update your mental model of the data. If the REPL loop were a hamster wheel, then after each full rotation, notebooks and command lines make the hamster get out of the wheel, run along a flat surface, and start from rest on a different wheel. Notebooks and command lines don’t let you build up momentum.
Spreadsheets have two fundamental advantages over notebook-based REPL programming.
- Because spreadsheets let you edit your data in the same place that you visualize it, you’re able to literally stare at the same place on the screen throughout your entire analysis, without any context switching or browser scrolling.
- Spreadsheets make sure that you’re always aware of the current state of your data. You no longer need to keep track of how your transformations altered the state of your data — the spreadsheet does it for you.
Over the past year of talking with hundreds of Excel power users, I’ve learned that what they truly love about spreadsheets is their reactive, visual environment (read: IDE), that makes moving from transformation to transformation so fast. They never have to get off the hamster wheel.
Where spreadsheets fall short is that they obscure the current state of the program. If you’ve ever had the unfortunate luck of trying to untangle an Excel workbook, you know exactly what I’m talking about. There are only so many clicks of the Trace Dependents button (or “Ctrl Alt ]” if you’re a pro) that one can keep track of.
The ultimate iterative workflow: Spreadsheet enabled Notebooks
By integrating spreadsheets into the notebook environment, we get the best of all three worlds:
- REPL programming’s immediate feedback
- The notebook’s view of and control over the full state of the program
- The spreadsheet’s view and direct editing of the current state of the data
Mito is designed specifically to combine these three paradigms.
Mito is a spreadsheet extension to the JupyterLab notebook that lets you manipulate your data directly within the spreadsheet. Mito takes care of generating the equivalent Python code for each edit you make. It provides the immediate feedback and data-first-ness of a spreadsheet, with the traceability and programmatic control of a notebook, for the ultimate iterative workflow. Give it a try to see what I mean. We’d love to hear your thoughts.