“Experiment is the sole source of truth. It alone can teach us something new; it alone can give us certainty.” ― Henri Poincaré

“What we observe is not nature itself, but nature exposed to our method of questioning.” ― Werner Heisenberg


There is a real tension between these two quotes, between truth and our perception of truth. Experiments are the key to science, including theory, which is tested by experiment and whose ideas are prodded from our heads by what we observe. At the same time our observations are imperfect and biased. Both concepts are key to moving science forward and keeping things in perspective. For the person engaged in computational science the challenge is uniquely fraught with conflict. This includes the new concept of computational experiments and their rightful role in advancing knowledge. Their perspective is undoubtedly useful, although letting an artificial view of reality take the role of “truth” is largely inappropriate. That said, the “truth” of experimental observations is also an illusion to the extent that observations are flawed as well; however, these flaws are of an entirely different sort than simulation’s flaws.

Observations are flawed by our limited ability to correctly sense reality, by the distortions introduced through our means of detection, or by the outright changes to reality made through our attempts to observe something. Simulations largely do not suffer from these issues in that we can perfectly observe them, but instead the reality we observe through simulation is itself intrinsically flawed. On the one hand we have a flawed view of the truth, and on the other we have a flawed truth with perfect vision. The key is that neither is perfect, and that both are useful.

Science is fundamentally predicated on experiments. Experiments are the engine of discovery and credibility. Not all experiments serve the same intent, nor should they follow the same protocols. There are many different types of experiments, and it is useful to develop a taxonomy of experiments to keep things organized. Ultimately, since we all want to be better scientists, it might just help us do better science.


The classic experiment is the test of a hypothesis, and it still holds the center of any discussion of science. Every other kind of experiment is a subset of this kind, but it is useful to enrich the discussion with other experiment types. The differing types of experiments are constructed with a particular end in mind, and that end determines which qualities to emphasize. A key example is the notion of a specific validation experiment, where the primary goal is to provide data for ascertaining the credibility of computational simulations.

Measurement is the key to experiments. Measurement is by its very nature imprecise; we cannot measure everything exactly. Moreover, we don’t necessarily measure the right things. Often what we choose to measure is guided by theory, and if the theory is too flawed, we may not measure the important things. In other cases we simply cannot measure what is really important. In other words, the core of measurement is error. We need to be very exacting in our analysis of how much error is associated with an experimental measurement. Too often we aren’t very clear about this. For example, some experiments measure a quantity that actually fluctuates. The tendency is to report the mean value measured, and then some statistical measure of variation like the standard deviation. Rarely, if ever, are the statistical choices made in the experimental analysis justified. Does the quantity actually follow a normal distribution? Apart from the fluctuations, what is the experimental measurement error? Is this error biased?
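To make those questions concrete, here is a minimal Python sketch of what reporting more than a mean and a standard deviation might look like. Everything in it is an assumption for illustration: the function name `summarize_measurements`, the made-up readings, and the `reference` value standing in for some independently known quantity. The skewness and excess kurtosis are only crude shape diagnostics, not a formal test of normality.

```python
import math
import statistics


def summarize_measurements(samples, reference=None):
    """Summarize repeated readings of a single fluctuating quantity.

    `reference` is an optional, independently known value (hypothetical
    here) used only to estimate the bias of the mean.
    """
    n = len(samples)
    mean = statistics.fmean(samples)
    std = statistics.stdev(samples)   # sample standard deviation (n - 1 divisor)
    sem = std / math.sqrt(n)          # standard error of the mean

    # Central moments, used for simple shape diagnostics.
    m2 = sum((x - mean) ** 2 for x in samples) / n
    m3 = sum((x - mean) ** 3 for x in samples) / n
    m4 = sum((x - mean) ** 4 for x in samples) / n

    summary = {
        "n": n,
        "mean": mean,
        "std": std,
        "sem": sem,
        # Both are close to zero for normally distributed data.
        "skewness": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3.0,
    }
    if reference is not None:
        summary["bias"] = mean - reference
    return summary


# Made-up readings, for illustration only; one value is a deliberate outlier.
readings = [9.8, 10.1, 10.0, 10.4, 9.9, 10.2, 12.5, 10.0]
for name, value in summarize_measurements(readings, reference=10.0).items():
    print(f"{name}: {value:.4g}")
```

A large skewness or kurtosis is a hint that quoting a mean and standard deviation alone may mislead, and a bias that is large compared with the standard error suggests a systematic error rather than mere fluctuation.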

Replicate experiments are another area where far too few examples exist. Experiments are often complex and expensive. In addition, they are often neither repeatable nor repeated. This results in certain uncertainties being completely unknown. To borrow Donald Rumsfeld’s famous quip, the repeatability becomes a known unknown that is willfully unexplored. Usually the temptation to do a different experiment is too great to overcome. In this case statistical evidence simply does not exist, even though many of these cases are extremely sensitive to the initial conditions. If one is looking at a system described by a well-posed initial value problem and the initial conditions are impeccably well described, a single experiment might be justified. If all of this does not hold, the single experiment is outright dangerous. For complex systems the situation where the experiment is demonstrably repeatable does not usually present itself. An archetype of the sort of experiment that is not repeatable is the Earth’s climate, and in this case we have no choice.
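The danger of a single run for a sensitive system can be illustrated with a toy. The sketch below uses the logistic map as a stand-in for a system with strong sensitivity to initial conditions; the map, the perturbation size, and the function names are my own illustrative assumptions and do not come from any particular experiment.

```python
def logistic_trajectory(x0, steps, r=4.0):
    """Iterate the logistic map x -> r * x * (1 - x) and return every state."""
    states = [x0]
    for _ in range(steps):
        states.append(r * states[-1] * (1.0 - states[-1]))
    return states


def ensemble_spread(x0, perturbation, replicates, steps):
    """Max-minus-min spread across the replicates at each step."""
    runs = [
        logistic_trajectory(x0 + k * perturbation, steps)
        for k in range(replicates)
    ]
    return [max(column) - min(column) for column in zip(*runs)]


# Five "replicates" whose initial conditions differ by about one part in 10^10.
spread = ensemble_spread(x0=0.2, perturbation=1e-10, replicates=5, steps=60)
for step in (0, 20, 40, 60):
    print(f"step {step:2d}: spread = {spread[step]:.3e}")
```

Any one of these runs, taken alone, looks like a perfectly definite answer. Only the ensemble reveals that the late-time state is not determined by the initial conditions to any useful precision.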

Discovery experiments are where science most classically lives, or at least that is the ideal. A scientist makes a hypothesis about something, and an experiment is devised to test it. If the experiment and related measurements are good enough, a result is produced. The hypothesis is either confirmed (or at least no evidence is found against it), or it is disproven. These experiments are in fact few and far between, but when they can be done (correctly) they are awesome.


Computational experiments are a modern invention, and rightly the source of great controversy. I’d argue strongly they should be even more controversial than they are generally characterized. Generically, a computer code is a model (or a hypothesis), and a problem can be devised based on the model. Calculations can then be done to test the given hypothesis. The problem, most succinctly, is that computational experiments are not proofs in the same sense as a physical experiment. Just as physical experiments have measurement error, computational experiments have computational error, but they also have more problems. The model itself may be incorrect or incomplete. The data used by the code may be incorrect, or the experiment may be set up in a flawed manner. Because of the artificial nature of the computational experiment, the whole enterprise is subject to an extra level of scrutiny. If such scrutiny produces evidence of correctness, the experiment can be taken more seriously, but rarely as seriously as the physical experiment. The benefit of computation is that it is more flexible than nature and most often much cheaper or less dangerous.

Often the statement is made that the computation is a “direct numerical simulation (DNS)” or “first-principles”. Very rarely is this statement actually justified or supported by any evidence. Most often it is false. These labels seem to be an excuse to avoid doing any analysis of the errors associated with the calculation, or worse yet, to claim they are small and unimportant without the slightest amount of justification. This is proof by authority, and it ultimately harms the conduct of science. If one is claiming to do DNS then the burden of proof should be very high. To be blunt, DNS is usually offered with even less proof than admittedly cruder approximations. This isn’t to say that DNS should not be employed as a scientific tool, but rather that its application should be taken with a rather large grain of salt. Scientists should demand more evidence of quality from a proposed DNS, and reject its results if such evidence is not provided. Doing anything less threatens science in general and poses an existential risk to computational science.
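One basic form of such evidence is a resolution study: repeat the calculation at systematically refined resolutions and check that the answer converges at a sensible rate. The sketch below is a deliberately trivial stand-in, with a trapezoidal-rule integral playing the role of the "simulation". The function names and the choice of test problem are my assumptions, and the Richardson-style estimate is only meaningful when all three results lie in the asymptotic range and converge monotonically.

```python
import math


def trapezoid(f, a, b, n):
    """Composite trapezoidal rule with n equal intervals."""
    h = (b - a) / n
    interior = sum(f(a + i * h) for i in range(1, n))
    return h * (0.5 * f(a) + interior + 0.5 * f(b))


def convergence_study(coarse, medium, fine, ratio=2.0):
    """Observed order and error estimate from three refined results.

    Assumes a constant refinement `ratio` between successive resolutions
    and monotone convergence (otherwise the logarithm is not meaningful).
    """
    order = math.log(abs(coarse - medium) / abs(medium - fine)) / math.log(ratio)
    # Richardson-style correction: estimated (converged value - fine value).
    correction = (fine - medium) / (ratio ** order - 1.0)
    return order, correction, fine + correction


# Integral of sin(x) on [0, pi]; resolution doubles between runs.
results = [trapezoid(math.sin, 0.0, math.pi, n) for n in (16, 32, 64)]
order, correction, extrapolated = convergence_study(*results)
print(f"observed order of convergence: {order:.3f}")
print(f"estimated error in fine result: {abs(correction):.3e}")
print(f"extrapolated value: {extrapolated:.8f}")
```

If the observed order is far from the method's theoretical order, or the three results do not converge at all, the calculation is not resolved and no label will make it so. A study like this costs little compared with the claim it supports.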

The concept of validation experiments is a new “invention,” or more properly a refinement of the basic concepts in experimental science. The primary purpose of these experiments is the validation of computer simulations. A simple-minded view would say that any other experiment would serve this purpose. The simple-minded view is correct, but this purpose is served poorly by classic experiments and their standards of reporting results. Most importantly, many details essential to a successful simulation of the experiment are left out of the description. The definition of a validation experiment is more complete, in the sense that it provides the key details needed for a high-fidelity simulation of the precise experimental setup. Usual experimental science often leaves out many details, which can cloud the sense of validation received by comparison, or at the very least leaves substantial uncertainty as to the source of any discrepancies.
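A rough sketch of why those details matter: a comparison between simulation and experiment only says something about the model when the discrepancy stands out from the combined uncertainty of the comparison. The code below assumes three independent standard uncertainties (numerical, simulation input, and experimental) combined by root-sum-square, which is a common convention. The function name `compare`, its parameters, and the numbers are all hypothetical.

```python
import math


def compare(simulation, experiment, u_numerical, u_input, u_experiment):
    """Compare a simulated value with a measured one.

    The three uncertainties are assumed to be independent standard
    uncertainties, so they are combined by root-sum-square.
    """
    error = simulation - experiment
    uncertainty = math.sqrt(u_numerical**2 + u_input**2 + u_experiment**2)
    return {
        "error": error,
        "uncertainty": uncertainty,
        # True when the discrepancy stands out from the combined uncertainty.
        "resolved": abs(error) > uncertainty,
    }


# Invented numbers, for illustration only.
result = compare(
    simulation=102.0,
    experiment=100.0,
    u_numerical=0.5,
    u_input=1.0,
    u_experiment=1.5,
)
print(result)
```

When an experiment is reported without its uncertainties or its setup details, `u_input` and `u_experiment` are simply unknown, and there is no way to tell whether a discrepancy reflects the model or the comparison itself.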

The point of this discussion isn’t to over-complicate things, but rather to clarify the differing intents behind experiments. One doesn’t always “experiment” for the same reason, but for many different reasons. Drawing these distinctions helps focus attention on why things are done and where the emphasis should be. Exploring a scientific hypothesis in the classical sense is different from validating a computer code. These differing purposes call for a refinement of emphasis in the conduct of the experiment. I will note that validation is a form of hypothesis testing, i.e., “is a computer simulation a representation of reality, and to what degree and for what purpose can it be trusted?” Computational experiments are another problem altogether, and require even greater attention to detail.