Data farming: a scientific methodology for data-oriented science

Data farming is a scientific research methodology that utilizes High Performance and High Throughput Computing to generate large amounts of data with computer simulations. Data farming aims at enhancing understanding of a studied process or phenomenon by examining the landscape of potential simulation outcomes and finding surprises and outliers. Such an approach is suitable when the relationship between the input and output of your application cannot be established analytically, and running the application many times, each time with slightly different input parameter values, is the key aspect of the study. Each application execution constitutes a single data point in a multi-dimensional space of possible outputs. However, as a single data point does not provide any useful information, one has to generate many more of them. Only then can meaningful insight be obtained, by analyzing relationships between the generated data points with respect to input parameter values. The more complex the studied phenomenon is, the more simulations need to be run to provide meaningful insights; thus data farming experiments often involve thousands of data points or more.
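To make the idea concrete, the following minimal sketch generates data points by sweeping a tiny input parameter space; the `simulate` function, the parameter names, and their values are hypothetical stand-ins, not part of any particular data farming toolkit.

```python
from itertools import product
import random

def simulate(speed, density):
    """Hypothetical stand-in for a real simulation model."""
    return speed * density + random.gauss(0, 1)

# A tiny input parameter space: every combination of two parameters.
speeds = [1.0, 1.5, 2.0]
densities = [10, 20, 40]

# Each execution yields one data point; many points together reveal how
# the output varies across the input parameter space.
data_points = [
    {"speed": s, "density": d, "output": simulate(s, d)}
    for s, d in product(speeds, densities)
]
```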

A data farming experiment is a straightforward iterative process as depicted in Fig. 1.

Fig. 1: The Data Farming process.
  1. We start by stating questions about some phenomenon that we would like to answer.
  2. We prepare a simulation model, which will help in answering the stated questions.
  3. Then we design an experiment by specifying the input parameter space that we want to explore in order to answer the questions.
  4. We run the simulation model for each data point in our input parameter space and collect the results.
  5. We analyze the collected results and decide whether we can answer the stated questions; if so, our work is done.
  6. Sometimes, however, we will need to extend our experiment with additional data points, create a new experiment from scratch, or use another simulation model.
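As a rough illustration of this loop, the sketch below iterates a toy one-parameter experiment, refining the design around the most interesting region after each analysis step; the model, the analysis, and the refinement rule are all hypothetical placeholders.

```python
import random

def run_model(point):
    """Hypothetical stand-in simulation model."""
    return point["x"] ** 2 + random.gauss(0, 0.1)

def analyze(results):
    """Toy analysis: pick the data point with the largest output."""
    return max(results, key=lambda r: r["output"])

# Step 3: design the experiment (here, a naive grid over one parameter).
design = [{"x": x / 10} for x in range(-10, 11)]

for iteration in range(3):  # a few iterations of the loop
    results = [{"x": p["x"], "output": run_model(p)} for p in design]  # step 4
    best = analyze(results)                                            # step 5
    # Step 6: refine the design around the most interesting region.
    design = [{"x": best["x"] + dx / 100} for dx in range(-10, 11)]
```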

In the context of data farming experiments, we focus on three main phases: input parameter space specification, application execution, and data analysis.

Input parameter space specification

This phase is crucial for the experiment's efficiency, as the number of all possible input value vectors can be very large. Even in a simple case, a simulation with 10 input parameters, each taking only 10 possible values, yields 10^10 (ten billion) combinations, far too many to run exhaustively. Thus, we often use Design of Experiments (DoE) methods to reduce the number of elements in the set of input parameter values while preserving those vectors of input parameter values that can lead to meaningful results. The most commonly used DoE methods include the following (a sampling sketch is given after the list):

  • Near Orthogonal Latin Hypercubes,
  • 2^k (two-level) factorial methods,
  • and various other factorial design variations.
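As an example, a plain Latin Hypercube design (the simpler relative of the near-orthogonal variant named above) can be drawn with SciPy's quasi-Monte Carlo module; the parameter count, sample size, and bounds below are illustrative assumptions.

```python
from scipy.stats import qmc

# Draw 50 well-spread samples from a 10-dimensional unit hypercube, then
# scale each dimension to its parameter range. These 50 points stand in
# for the 10^10 exhaustive combinations mentioned above.
sampler = qmc.LatinHypercube(d=10, seed=42)
unit_samples = sampler.random(n=50)             # shape (50, 10), values in [0, 1)
lower = [0.0] * 10                              # assumed per-parameter minima
upper = [10.0] * 10                             # assumed per-parameter maxima
design = qmc.scale(unit_samples, lower, upper)  # the input vectors to simulate
```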

Application execution

Once we have a specified input parameter space, we know how many times our application should be executed and with what input parameter values. As the number of application executions can be large, e.g., millions of combinations, this step often exploits both High Performance and High Throughput Computing. In practice, this means computer clusters, compute clouds, grid environments, and even individual physical servers.
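Since the individual runs are independent of one another, they are embarrassingly parallel; the sketch below farms them out across local CPU cores with Python's standard process pool, which plays the role a cluster, cloud, or grid scheduler plays at full scale (the `simulate` function and the design are again hypothetical).

```python
from concurrent.futures import ProcessPoolExecutor
import random

def simulate(params):
    """Hypothetical stand-in for launching one simulation run."""
    x, y = params
    return x * y + random.gauss(0, 1)

# A toy design: 1000 input parameter vectors.
design = [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(1000)]

if __name__ == "__main__":
    # Each run is independent, so a simple map over a worker pool suffices;
    # at full scale a batch scheduler distributes the same map over nodes.
    with ProcessPoolExecutor() as pool:
        outputs = list(pool.map(simulate, design))
```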

Application results analysis

After completing all or some of the application executions, we can analyze the collected results. To do so, we often utilize statistical methods, e.g., histogram analysis or regression trees, which are generic enough to be used in various domains. By using the information gathered throughout the experiment, one can answer the questions stated in the initial phase. However, in many cases, a single iteration of such a loop does not give enough insight to answer all the stated questions, e.g., due to a poorly sampled input parameter space. Hence, another iteration of the process loop may be required, but with a different input parameter space, e.g., one which focuses only on a particular, interesting subspace. Besides answering the stated questions, another important aspect of data farming is discovering surprises and outliers. This is especially important for enhancing the simulation model, which in many cases provides only an approximation of the studied phenomenon and can therefore behave unpredictably, e.g., near boundary conditions.
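For instance, a regression tree can expose which input parameters drive an output and where the interesting thresholds lie; the synthetic data and feature names below are invented purely for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)

# Synthetic stand-in for collected results: two input parameters and an
# output with a threshold effect that a tree can recover.
X = rng.uniform(0, 10, size=(500, 2))
y = np.where(X[:, 0] > 5, 2.0 * X[:, 1], 0.5 * X[:, 1]) + rng.normal(0, 0.2, 500)

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["fire_range", "hit_probability"]))
```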

A little bit of data farming history

The first applications of Data Farming concerned verification and enhancement of existing procedures and the analytic culture in the Department of Defense. By performing computer simulations in the Data Farming process, it was possible to gain new understanding of the studied phenomena in an iterative manner while improving the simulation model. One early application considered the question of maneuver versus attrition in combat scenarios. The simulation model involved Red forces as defenders and Blue forces as attackers. Measures of Effectiveness for these simulations included the number of eliminated Blue entities and whether or not the Blue forces were stopped from penetrating the area defended by the Red forces. Input parameters included the fire range of the Red forces, their probability of hit, and the attack strategy of the Blue forces, i.e., heading straight for the objective or engaging in more maneuver-oriented behaviour. By running multiple simulations, it was found that maneuvering paid off only when the Red forces had a long fire range and a high hit probability, while going straight for the objective was better when the fire range and hit probability were small. Moreover, based on this information, analysts decided to enhance the simulation model with aggression and sensor range parameters for the Red forces and to run the next set of simulations. Based on the new results, they could deduce that the Red forces should be more aggressive in order to increase their effectiveness.

A more recent application of the Data Farming methodology can be found in the EUSAS project, which stands for "European Urban Simulation for Asymmetric Scenarios". It was financed by 20 nations under the Joint Investment Program Force Protection of the European Defence Agency (EDA). The main purpose of the project was to support the training process of security forces, such as military and police, in asymmetric scenarios, i.e., scenarios where the number of participants on each side is different. The project aimed at improving the tactics, techniques and procedures of security forces by running and analyzing multi-agent simulations that utilize advanced psychological models of human behaviour. The project focused on scenarios in which security and military forces must deal with crowd control in urban environments. A sample scenario concerns controlling access to a military camp during elections in a mission abroad. This type of scenario can have several variations, e.g., involving different numbers of civilians waiting in front of the camp entrance. The military forces were tasked with keeping the entrances open by calming and dispersing the crowd. Input parameters of the scenario included the locations of civilian groups, the prestige of civilian leaders, the initial anger level of the civilians, their readiness for aggression, and the rules of engagement for the security forces, e.g., how to react to civilian aggression. Measures of Effectiveness of this simulation included the number of injuries and other damage incurred during the simulation, as well as emotion-related parameters. Generally, the stress level on both sides is high, and a higher fear level can be observed in the local population. In such a case, direct contact with security forces can lead to an unpredicted escalation of aggression, which can result in injuries and casualties if the security forces act inadequately to the situation. To study the different cases, thousands of simulations were run. The results show that it is essential for security forces to assess the aggression level of civilians and to react very rapidly to any escalation. In particular, when civilians start to show extremely aggressive behaviour, it is more effective to fire a warning shot, which lowers the aggression level through fear, than to attempt negotiation.