mops: Data Systems from Pure Python Functions

Trilliant Health is in the business of extracting insights from data. Much of the value we deliver to our customers lies in the depth and breadth of the analytical processes that we encode into the many, many data pipelines we produce and maintain. All data start out more-or-less raw, but our ML Engineers and subject matter experts keep finding new ways to derive useful knowledge from that data.
The bigger and more complex our data pipelines get, the harder it can be to reason about and operate the system as a whole. Software tools, product lines and even entire businesses have been dedicated to solving different aspects of this problem for the better part of the current millennium.
Today, we’re releasing mops for public use – a Python library developed at Trilliant Health dedicated to a quartet of goals:
- Data transformations must be able to run in disparate computing environments, to satisfy wildly differing resource requirements and their attendant computing models.
- Developers should be able to write pure Python both for our data insights and our control flow.
- Fault tolerance and reproducibility are the sine qua non of any collection of complex data transformations.
- Nothing lasts forever – a good tool is one you can pick up while it’s useful and drop when circumstances change.
These four goals drove our technical design choices. As we've built and used this library, we've found that one of its technical features – the way we cache computational results for fault tolerance and reproducibility – has grown into something much more: it's reshaping how our teams share results, minimize coordination overhead and seamlessly discover and connect their data dependencies. We plan to write more in the future about the downstream impact and organizational benefits we've found from using this library, but this post focuses on the technical foundations of our approach.
So, let’s unpack each of these four technical goals a bit!
Heterogeneous computing environments
At Trilliant Health, most of our core data insights are implemented using Python in one way or another. We have a team of data scientists and machine learning engineers whose lingua franca is Python – it’s easiest for us to maintain a large codebase if we’re all speaking the same language.
But CPython alone isn’t a powerful enough runtime to satisfy all our use cases; some transforms are so data-intensive that a system designed from the ground up for scale, such as PySpark, must be used so that we can express those transforms at a very high level. In other cases, where our transforms are highly complex pure Python or we’re training machine learning models that require very specific computing environments with GPUs, a closer-to-the-machine abstraction like containers running on Kubernetes may be a better fit. We needed a tool that would help us target and integrate these different computing environments without a lot of boilerplate.
Pure Python for everything
Coordinating complex data dependency graphs across heterogeneous environments can be difficult, especially when doing so requires people without a software background to learn completely new frameworks or approaches like Airflow, Luigi or Step Functions. The cognitive overhead of switching back and forth between different ways of expressing our intent can be pretty high. What if you could orchestrate large pipelines of many different Python functions without ever writing anything that isn’t garden-variety Python?
In fact, many of our internal transforms are themselves just Python, with completely standard unit and integration tests, standard debugging approaches, etc. It’s extremely convenient to be able to develop and run the pipeline logic on our local laptops before "upgrading" to the more capable Kubernetes environment. Developers who later maintain the code benefit from the ability to "downgrade" the computing environment to their laptops without changing a single line of code.
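To make that concrete, here is a minimal sketch of a two-step pipeline. The function names and data shapes are purely illustrative, and the `pure.magic` decorator is the one introduced in the getting-started section below; the orchestration itself is nothing but ordinary function calls:

```python
from thds.mops import pure


@pure.magic()
def clean(raw_rows: list[dict]) -> list[dict]:
    # Illustrative step: drop rows that are missing a patient id.
    return [r for r in raw_rows if r.get("patient_id")]


@pure.magic()
def count_by_state(rows: list[dict]) -> dict[str, int]:
    # Illustrative step: aggregate cleaned rows by state.
    counts: dict[str, int] = {}
    for row in rows:
        counts[row["state"]] = counts.get(row["state"], 0) + 1
    return counts


def pipeline(raw_rows: list[dict]) -> dict[str, int]:
    # The "orchestrator" is garden-variety Python: call one function and
    # pass its result to the next. No DAG files, no DSL, no YAML.
    return count_by_state(clean(raw_rows))
```

Because every step is a plain function, the same pipeline can run under a unit test, in a debugger or, with a runtime shim configured, somewhere far more capable.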
From fault tolerance to memoization
Fault tolerance has been a buzzword in computing for ages, but all forms of fault tolerance boil down to the same basic idea: checkpointing. If you can track small chunks of work and whether or not they have completed, then you can avoid re-running things that have already succeeded if your system fails partway through. In today’s cloud-based computing systems, errors are not just a fact of life – they’re quite common! Kubernetes might take your spot instance away from you at any moment. Network errors pop up at the worst times, and your orchestrating process may not know how to pick up where it left off.
Transferring execution between environments naturally requires some form of message passing. The core abstraction in mops is “just write Python functions.” And writing pure functions turns out to give us the necessary message-passing, plus several other benefits, “for free.” A “pure function” is one that produces a result with reference to explicit arguments only – no implicit or hidden state. Once you’ve taken this step architecturally, mops can remember, or memoize, the result of a previously-computed function. Because mops captured your unique-for-all-time inputs in order to pass them to the computation environment, we can efficiently look up whether that result has previously been computed and return it if it’s available. Every single function becomes a checkpoint.
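The mechanics are easy to illustrate. The sketch below is not mops’s actual implementation, just a distillation of the idea: key a result store by the function’s identity plus its serialized inputs, and consult the store before computing anything.

```python
import hashlib
import pickle
from typing import Any, Callable

# Stand-in for a durable blob store; mops would use real remote storage.
_results: dict[str, bytes] = {}


def memoized_call(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    # Key the result by the function's identity and its pickled arguments.
    key_material = pickle.dumps((func.__module__, func.__qualname__, args, kwargs))
    key = hashlib.sha256(key_material).hexdigest()
    if key not in _results:  # compute only on a cache miss...
        _results[key] = pickle.dumps(func(*args, **kwargs))
    return pickle.loads(_results[key])  # ...otherwise the checkpoint wins
```

Swap the in-memory dict for a blob store, and the checkpoints survive process crashes and can be shared across machines.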
Once you’ve adopted this approach, you start reaping organizational benefits, including:
- Since the results are mathematically paired with the inputs that determined them, mops provides data provenance for free.
- Memoization also allows multiple users of the same system to share their computations with each other without explicit coordination. Collaborators on a feature branch can run their separate parts of a very expensive computation graph, and they each automatically get each other’s results. mops recognizes and waits for functions under active computation, even across multiple distributed callers.
We plan to explore these in more depth and paint the full picture of how mops has changed our organizational workflows.
Droppability
One of our touchstones at Trilliant Health is that we’re constantly re-evaluating our processes, our products and our tools. Change for the sake of change, or prematurely optimizing for a change that will never come, is a known pitfall in the wide world of software development, and yet we’ve all seen the systems that had to be rebuilt from the ground up because they couldn’t be evolved to meet the demands of new opportunities. mops has been designed from the beginning to have low integration cost and be extremely “unsticky.” The core abstraction – pure functions – is not going away any time soon, and the value of expressing our data transforms as pure functions includes portability, and also dovetails nicely with our ability to tell a cohesive start-to-finish story about how we produced our results. But mops is a means to an end, and someday it, like any other tool, will need to be “dropped” – and we believe that designing our system around the simple abstraction of functions means that we get the best of our current tools and the abstractions of the future without extra effort.
You can (and should) develop and debug your code without using mops – and when it comes time to transfer execution to Kubernetes or elsewhere, you slot mops in and shift the execution without modifying the code. You can (and should) integrate mops with tooling that you already have for visualizing data flows – or build your own tooling that operates without knowledge of the serialization or distribution layer. It’s more than likely that the mops code will not outlive the domain logic that it helps wrangle – in the end, mops is as much a philosophy as it is a library.

mops is not the first system to solve for some of our stated goals (and it certainly won’t be the last), and some of those goals will be less applicable to differently shaped and sized teams. For instance, non-Python-using teams at Trilliant Health have recently used Restate, which provides a useful UI layer but is not Python-centric and therefore cannot be as lightweight or low-boilerplate as mops.
Why open source?
We’re open-sourcing mops not solely out of the belief that it is a useful tool that may benefit the Python data community, but also because we want to encourage the ecosystem to double down on the shared value of building tools that are droppable – not tools that enter your codebase promising the sun and the moon, but wind up lodged there forever only because of the impossibly high cost of moving on to something better.
And, in the meantime, there’s a lot of opportunity for mops to evolve and improve. Whether that’s community plugins for alternate blob stores, like S3, Redis, Postgres, etc., or alternate remote runtime shims, or improvements to our developer ergonomics and API – your feedback and contributions are welcome.
Get started with mops
You can install mops today with `pip install thds.mops`. It can be used entirely locally – there are local thread and `subprocess.run` runtime shims that will allow you to take advantage of memoization and fault tolerance on shared-memory machines.
```python
from thds.mops import pure


@pure.magic()  # uses `~/.mops` on your local machine for memoization by default
def fibonacci(n: int) -> int:
    if n <= 1:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)


assert 832040 == fibonacci(30)
```
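Run the snippet a second time, even from a fresh process, and the results should come back from the `~/.mops` cache rather than being recomputed.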
But to get the most use out of it, you’ll want to provide your own cloud infrastructure. mops is flexible: anywhere Python already runs, you can plug in your own runtime shim, and mops will handle the serialization, the argument transfer and deserialization, and calling your function in that Python environment. Internally, we maintain a Databricks-focused runtime shim and code deployment strategy. Our Kubernetes API runtime shim is baked in if installed with the `k8s` extra, as `thds.mops[k8s]`. If you have a Kubernetes cluster and Docker images for your code, you’ll be able to use mops to run thousands of Python functions in parallel with just a few lines of configuration.
Its "strongest" requirement is pickle
– in order to transfer your function arguments and result, mops
needs to be able to serialize them, and Python offers one obvious way to do this. It would be possible and even interesting to build around non-pickle
forms of serialization, but so far across several years of use, we’ve found no internal use cases that compelled us to investigate anything more complex. Network overhead tends to dominate serialization cost for small objects, and large objects are usually best managed as out-of-memory readable and writeable streams using existing on-disk formats like parquet
or sqlite
.
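Conceptually, the round trip looks like the sketch below. This is not mops’s wire format, just the bare `pickle` mechanics it builds on: arguments are serialized on the calling side, deserialized wherever the function actually runs, and the result travels back the same way.

```python
import pickle


def transform(xs: list[int]) -> int:
    return sum(x * x for x in xs)


# Calling side: serialize the arguments for transfer.
args_blob = pickle.dumps(([1, 2, 3],))

# Remote side: deserialize, call the function, serialize the result.
(xs,) = pickle.loads(args_blob)
result_blob = pickle.dumps(transform(xs))

# Calling side again: the returned (or memoized) result is just unpickled bytes.
assert pickle.loads(result_blob) == 14
```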
We are releasing mops with bundled support for data transfer and global coordination via Azure Data Lake Storage, which is the blob store we use internally at Trilliant Health. If you don’t use Azure, you’ll want to plug in your own implementation of the `mops.pure.types.BlobStore` interface that can `get` and `put` blobs on your chosen store. In fact, mops has a built-in concurrent computation detection and lease algorithm that is intentionally low-tech, so that just about any blob store imaginable can support its semantics.
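As an illustration of how small that surface area is, here is a toy store backed by a local directory. Beyond the `get` and `put` operations mentioned above, the method signatures here are our assumptions; consult the `mops.pure.types.BlobStore` protocol for the real interface:

```python
from pathlib import Path


class DirectoryBlobStore:
    """Hypothetical BlobStore-style sketch backed by a local directory.

    The real `mops.pure.types.BlobStore` protocol may require different
    signatures and additional methods; this only illustrates the shape.
    """

    def __init__(self, root: str) -> None:
        self._root = Path(root)
        self._root.mkdir(parents=True, exist_ok=True)

    def put(self, key: str, data: bytes) -> None:
        # Write the blob to a file named by its key.
        dest = self._root / key
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_bytes(data)

    def get(self, key: str) -> bytes:
        # Read the blob back by key.
        return (self._root / key).read_bytes()
```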
We’re open to contributions, and mops is intentionally plugin-driven, so you’ll be able to implement and integrate your own runtime shims and blob stores without making any changes to mops itself.
Give it a try!
mops represents our belief that tools can empower teams without constraining their future choices. By standing on the shoulders of pure functions, we’ve created a library that provides immediate technical benefits like fault tolerance and reproducibility. Whether you’re running complex machine learning pipelines across multiple computing environments, or simply want to avoid re-running the code you just ran yesterday, mops can help you get there with low overhead and high portability. But as we’ve hinted, there’s a broader story to tell about how these technical choices reshape the way teams collaborate. We invite you to give mops a spin – the documentation includes examples and technical deep dives – and stay tuned for our upcoming post about how memoization has evolved from a technical feature into an organizational architecture.
- Data Science
- Python
- Machine Learning
- Open Source