df = pandas.read_sql("SELECT * FROM USERS", connection)Which of the following is not reproducible?
Read from a database
Draw a sample at random
coin_tosses <- sample(c("heads","tails"), 10, replace = TRUE)Write to the filesystem
readr::write_csv(data, "~/input.csv")Which of these statements are not reproducible?
None of them are reproducible!
None of them are. This was a trick question!
Reproducibility requires that doing the same thing gives you the same result.
None of those statements - even writing to disk - are reproducible.
Let’s explore why and what you can do about it.
This talk hopes to share some lessons from a data-engineering perspective with data-scientists and analysts.
To begin let’s think about what it means to be reproducible.
Pure functions are reproducible
Pure functions are reproducible.
The output of a pure function only depends on it’s input. The result doesn’t change if it’s calculated a second time or by another person - for the same input you always get the same output. You could replace the function call with it’s return value in your program. This is known as referential transparency. Indeed you could replace the function body with a lookup table that records the relevant output for each input.
A pure function has no side effects. Running it doesn’t change the state of the world, the only consequence is the output value it returns.
Side-effects aren’t reproducible
This is all very well in theory but we can’t continue piping data around in circles ad nauseum. At some point we have to interact with the outside world - read from a database, deploy a website, email a report. In practice we need functions with side-effects.
Side-effects are intended consequences that happen outside of a function or a pipeline’s output. We also use the term “side-effect” to refer to causes that exist outside of a function or a pipeline’s inputs.
These side-effects are what make our pipelines useful allowing them to interact with the world. They’re also what cause our pipelines to become non-reproducible.
This mightn’t be very obvious if it’s the first time you’re hearing this, so let’s look at some examples.
Non-local state makes functions sensitive to context
counter = 0
def show_count():
  print(f'Count is {counter}')
show_count()Count is 0Sometime later…
counter += 1
show_count()Count is 1Here we can see a side-effecting show_count() function. It implicitly depends on the counter variable that is defined in the global scope. This means that it’s result depends on the context in which it is run.
Any time you call show_count() you could get a different result. It isn’t reproducible. If the counter changes, then so does the result.
This function also has no explicit return value. The call to print is itself a side-effect. We can’t use it in a reproducible pipeline.
Let’s refactor this into a pure function.
Explicit inputs/ outputs let us separate code and context
counter = 0
def describe_count(count):
  return f'Count is {count}'
print(describe_count(counter))Count is 0Sometime later…
counter += 1
print(describe_count(counter))Count is 1We define describe_count in terms of the count value. Now the dependency on this variable is explicit. Given the same input, this function will always return the same output. It’s reproducible. Indeed it now returns an explicit output value reproducibly. The call to print happens outside.
The result is essentially the same as before but because it’s now reproducible it’s much easier to reason about. We’ve made the dependency on the context explicit. The focus is on the value of the count, not the variable and it’s place in memory.
This might seem like a toy example but context mutations like this are exactly what’s happening when you reassign a variable or use one of Pandas’ inPlace operations. This problem looks trivial but state mutation is a pernicious source of subtle bugs in notebooks where the execution order and so kernel state may not have evolved linearly.
Mutating state in place may yield savings in computer memory, but it imposes costs on the human capacity to reason about the flow of data through your code.
Pure functions operating on immutable data are reproducible. You lose these guarantees once your pipeline has side-effects and state mutations.
The context isn’t always apparent
toss_coins <- function(n) sample(c("heads","tails"), n, replace = TRUE)
toss_coins(5)[1] "tails" "heads" "heads" "tails" "heads"Sometime later…
toss_coins(5)[1] "heads" "heads" "heads" "tails" "heads"Some side effects are not obvious.
The sample function depends not just on the arguments you pass it but also the state of a random number generator (RNG).
In software the RNG is not truly random but rather a pseudo-random process. It gives highly erratic results but follows a predictable process if you know the starting state. The starting state is seeded by some source that varies such as the date or a hardware source like /dev/random which collects noise from device drivers.
This is a side-effect for our function, causing it’s output to differ each time it’s run.
Each time I generate these slides I get a different set of coin tosses. Although executable, this slide (and a consequence the whole deck) isn’t reproducible.
We can make the context explicit
set.seed(1234) # set state deterministically
toss_coins(5)[1] "tails" "tails" "tails" "tails" "heads"Sometime later…
set.seed(1234) # reset the state again
toss_coins(5)[1] "tails" "tails" "tails" "tails" "heads"We can make this function call reproducible by fixing the initial state of the random number generator to a constant value with set.seed.
This level of purity may be helpful from an engineering perspective (for example this slide has a consistent checksum meaning it can be cached), but it would completely undermine certain analytical procedures. For example in cross-validation where we want to see the test/ train split vary to ensure we’re not over-fitting a statistical model.
We might not always want perfect reproducibility. The key is in making that choice consciously.
I/O is side-effecting
Input is a side-effect
df <- readr::read_csv("/home/robin/data-040422-final.csv")Output is a side-effect
readr::write_csv(result, "~/results/output.csv")A more ubiquitous source of side-effect is I/O (input/ output). This doesn’t just apply to third-party APIs or our own database, even the filesystem is non-local state as far as our programs are concerned.
It’s not uncommon to see pipelines or notebooks start like this.
The filepath is idiosyncratic and this code won’t be reproducible on other people’s machines unless they coincidentally have a user called robin with this file in the home directory. Indeed this probably won’t be reproducible on Robin’s machine at a future date unless care is taken to fix that file in place.
More generally, even if the filepath is dependable, there’s no guarantee that the content of the CSV file itself won’t change.
Even the writing of files is not reproducible.
This function alone can’t guarantee that the destination directory ~/results exists.
These problems become all the more apparent with APIs or databases when network interruptions and other external concerns jeopardise the reproducibility of your program.
Dependency-injection is explicit
import pandas as pd
my_conn = snowflake.connector.connect(...)Instead of relying on global state:
def get_data():
  return pd.read_sql("SELECT * FROM USERS", my_conn)
  
get_data()We can make dependencies explicit:
def get_data(connection):
  return pd.read_sql("SELECT * FROM USERS", connection)
get_data(my_conn)The problem isn’t that these side-effects exist; they’re necessary.
The problem is that when they’re implicit they tend to hide dependencies.
We can make them explicit with techniques like dependency injection. The function receives a database connection, for example, as an input instead of relying on it existing in the programs global state.
Now the function declares it’s requirements. It draws attention to potential side-effects and forces you to think about them.
Execution context is explicit
Command-line arguments
python pipeline.py input.csv output.parquetEnvironment variables
env DB="http://user:pw@localhost:1337" python pipeline.pyConfiguration data
python pipeline.py configuration.yamlLikewise we can pull side-effects through to the very edges of our pipelines, passing configuration in only at the execution context.
This means the pipeline no longer has to be concerned with coordinating state. We may want some graceful error handling, but the pipeline itself should be context-free and reproducible.
Indeed extending this practice into the execution context itself is what leads to the Infrastructure as Code and DevOps movement where machines and services can themselves be provisioned reproducibly from declarative configuration.
Functional pipeline,
configured context
Pursuing a separation of these concerns - what Gary Bernhart has called functional core, imperative shell - leads us to a point where all of the dependencies are captured explicitly and their values gathered together into configuration.
Here we have a pipeline composed of pure-functions with all the necessary side-effects contained to explicitly configured contexts at each end.
This makes it easier to maintain reproducible code. When the code is changed you can see how the pieces fit together and what the consequences of a refactoring are on the rest of the pipeline’s code base. When the infrastructure changes you may be lucky enough to only need to change the configuration and not the code at all. This also helps sub-divide the code into modules as each component explicitly declares its requirements making them easier to test in isolation. The preceding examples makes it trivial to pass in a test database connection.
You’ll note that this hasn’t really fixed the ultimate cause of our problems. As Rick explained, we shouldn’t expect to ever be able to fix the rest of the universe in place. The best we can do is hope to contain the unreliable bits, pushing them to the edges so we can carve out a space to pursue reproducibility.
Lessons from engineering
We can go a step further than the above and strive for reproducibility in inputs and outputs, in our interaction with data outside of the pipeline.
In these final slides I’ll discuss some lessons the data community can learn from software development. In particular version control and continuous delivery.
Versions are values over time
It is impossible to step in the same river twice
Heraclitus c.a. 500 BC, possibly apocryphal
We’ve seen that the functional approach leads us to immutability. Steps in your pipeline can only communicate with one another through their arguments and return values. We aren’t mutating the objects passing through the pipeline or a global state. The intermediate values are immutable.
But we need to change things to do useful work over time as the external context changes, for example as upstream data sources are updated. How do we cope when the upstream source is mutable? We can use versions to control external change.
Versions identify immutable states of data as it evolves over time.
Heraclitus noted that one can’t step in the same river twice. Time marches on. It will be different water and you’re a different person.
We have to distinguish the riverbed and the water running through it. We distinguish between an output and instances of it.
Ultimately the version is identified by the values in the data. This becomes cumbersome and we need a more succinct version identifier. Instead we can identity versions by a name formed of two parts. One is a location or label we can use as a reference. The other is the version or state in time. The named output and versioned instances.
Thus we treat data as artifacts, frozen in time.
The parallel here is software releases, not source version control. You can use e.g. git to version your data but it’s unlikely that line-by-line diff and patches will be very efficient way to store it. You should use version control for the pipeline source code though.
You can deploy files to artifact repositories. Even a simple S3 bucket will let you update a key in place while it records the version history. Some databases support versioning natively or you can create your own snapshots.
This version history let’s us retrieve the exact conditions for a pipeline run and reproduce the results.
Be wary of any upstream source that can mutate over time without providing some means of identifying and distinguishing versions. You can of course defend against this to an extent by keeping a track of the data you receive with hashes or recording copies in caches and things like cassettes to record API transactions.
Be a good data citizen and surface versioning information about your own outputs to downstream consumers so that they might ensure reproducibility in their workflow.
Automation proves reproducibility
If you can run your pipeline locally, even if it’s repeatable, it’s really just an executable analytical pipeline. It’s not demonstrably reproducible until it’s running on emphemeral resources in a build system.
A build system runs your workflow recording the versions of the input data, source code, and output. The run pipeline run itself is immutable and identified for posterity with a build number. Done correctly, we don’t need to reproduce the pipeline. This contract serves as a guarantee that you’ll simply get the same result.
This requires that the process of assembling your dependencies is automated and reproducible itself. No more hunting through Slack or chasing colleagues to find the random excel file that makes the pipeline work. You can’t expect a build server to read the readme and figure it out for themselves. Tacit knowledge must be explicitly codified. You can’t claim that everything’s fine because “it works on my machine”. The build server is a shared consensus on configuration. A canonical source you can refer to to see how things work.
Building on an emphemeral stack further enforces the discipline by preventing you from relying on state. I’d argue that the success of containerisation comes as much from virtualisation as from the fact that Dockerfiles, for example, automate dependency management down to the operating system level.
There are plenty of choices of Continuous Integration or Continous Deployment tools and increasingly data-specific Workflow systems too.
Don’t just say it’s reproducible. Prove it!
How to engineer reproducibility
- Keep the core of your pipeline as pure as possible
- Contain side-effects and make dependencies explicit
- Track (data) version history and automate workflows for reproducible builds
The world won’t let you rap,
and what to do about it
 rick@swirrl.com
robin@infonomics.ltd.uk