This page contains some advice on how to choose parameters for your tuning.
Note that some parameters change the employed tuning algorithm itself. These are not discussed here; hints on selecting the best tuning algorithm for your application can be found on the page presenting the different algorithms.
Introductory Remarks: Overall Tuning Runtime Estimation
To estimate a rough upper bound for the overall tuning runtime, please download the prepared Excel file and follow the instructions in this section.
In general, the overall tuning runtime mainly depends on three factors: the number of performed target algorithm evaluations, their runtime, and the tuner's Degree of Parallelism.
While you can limit the target algorithm's runtime by providing a cpu timeout, the number of performed target algorithm evaluations depends on multiple parameters. A rough upper bound is given by

numGens * (popSize / 2) * Average Number of Instances per Generation.

Here, the Average Number of Instances per Generation can be determined by

[(startInstanceNumber + endInstanceNumber) / 2 * (goalGen + 1) + endInstanceNumber * (numGens - goalGen - 1)] / numGens.
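This bound can be sketched numerically. The helper below assumes the instance count ramps up linearly from a start value to endInstanceNumber until a goal generation and stays constant afterwards; the parameter names and example numbers are illustrative, not guaranteed to match the tuner's flags exactly:

```python
# Rough upper bound on the number of target algorithm evaluations.
# Parameter names (num_gens, pop_size, start/end instance numbers, goal_gen)
# mirror the command line flags; treat this as an illustrative sketch.

def avg_instances_per_generation(num_gens, goal_gen, start_instances, end_instances):
    """Instances grow linearly until the goal generation, then stay constant."""
    ramp_up = (start_instances + end_instances) / 2 * (goal_gen + 1)
    plateau = end_instances * (num_gens - goal_gen - 1)
    return (ramp_up + plateau) / num_gens

def max_evaluations(num_gens, pop_size, goal_gen, start_instances, end_instances):
    """numGens * (popSize / 2) * Average Number of Instances per Generation."""
    avg = avg_instances_per_generation(num_gens, goal_gen, start_instances, end_instances)
    return num_gens * (pop_size / 2) * avg

# Example: 100 generations, population 128, instances ramping from 5 to 100
# until generation 74.
print(max_evaluations(100, 128, 74, 5, 100))  # 412000.0
```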
Moreover, the OPTANO Algorithm Tuner will start up to maxParallelEvaluations evaluations per computing node at the same time and can even make use of multiple computing nodes if executed in distributed fashion. Hence, the tuner's Degree of Parallelism is given by

maxParallelEvaluations * Number of Nodes.

Here, the Number of Nodes can be determined by Number of Worker Nodes + 1 if local evaluations on the master node are allowed (see --allowLocalEvaluations for details).
Finally, a rough upper bound for the overall tuning runtime is given by:

numGens * (popSize / 2) * Average Number of Instances per Generation * cpuTimeout / Degree of Parallelism
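Assuming every evaluation runs into the cpu timeout, the estimate can be sketched as follows. The evaluation bound is the product numGens * (popSize / 2) * Average Number of Instances per Generation; all concrete numbers are illustrative:

```python
# Rough upper bound on the overall tuning runtime (in seconds).

def degree_of_parallelism(max_parallel_evaluations, worker_nodes, allow_local_evaluations):
    """maxParallelEvaluations evaluations run per node; the master node
    counts as an additional node if local evaluations are allowed."""
    nodes = worker_nodes + (1 if allow_local_evaluations else 0)
    return max_parallel_evaluations * nodes

def runtime_upper_bound(evaluations_bound, cpu_timeout, parallelism):
    # Pessimistically, every evaluation hits the cpu timeout and the work
    # is spread evenly over all parallel evaluation slots.
    return evaluations_bound * cpu_timeout / parallelism

# 3 worker nodes plus the master, 4 parallel evaluations each.
parallelism = degree_of_parallelism(4, worker_nodes=3, allow_local_evaluations=True)
print(parallelism)  # 16
print(runtime_upper_bound(evaluations_bound=412_000, cpu_timeout=10, parallelism=parallelism))  # 257500.0
```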
In practice, the actual tuning runtime is often less than half of this estimated tuning runtime for several reasons:
1) Already evaluated genome-instance pairs will not be evaluated again in subsequent generations.
1) Enabling racing can significantly reduce the number of performed target algorithm evaluations.
1) Good configurations will likely finish (at least some of) the instances before reaching the cpu timeout.
--maxParallelEvaluations specifies how many instances of your target algorithm may be executed in parallel. To set it, you should be aware of
- the maximum memory usage of your target algorithm
- the number of cores your target algorithm uses
- the available memory on each of your computing nodes
- the number of cores of each of your computing nodes
The parameter should be chosen such that sufficient cores and memory are available for the selected number of parallel runs.
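The resource considerations above can be turned into a simple calculation; all figures are illustrative assumptions, not measured values:

```python
# A simple way to bound --maxParallelEvaluations by a node's resources:
# take the minimum of what the cores and what the memory allow.

def max_parallel_evaluations(node_cores, node_memory_gb, cores_per_run, memory_gb_per_run):
    by_cores = node_cores // cores_per_run
    by_memory = int(node_memory_gb // memory_gb_per_run)
    return min(by_cores, by_memory)

# 32-core node with 64 GB RAM; target algorithm uses 2 cores and 6 GB per run.
print(max_parallel_evaluations(32, 64, 2, 6))  # 10, memory is the bottleneck
```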
The number of generations usually depends on how long you want to spend on tuning. We do not recommend lowering it too drastically from its default value of 100. In addition to
--numGens, you can also set
--evaluationLimit to directly bound the number of target algorithm runs.
The default population size is 128; for best results, your choice should be of a similar magnitude.
The default mini tournament size is 8; again, for best results, your choice should be of a similar magnitude.
The maximum number of instances used for tuning should be sufficiently high; otherwise, the found parameter combinations will be 'specialized' to the instances you chose.
These parameters apply to the model-based crossover operator. They become relevant (i.e. the ML model is trained during iterations) when --engineeredProportion > 0, --enableSexualSelection=true, or --trainModel=true.
This is not an exhaustive list. The parameters below are those that require the most thought. For a complete list of parameters, see here.
If this parameter is set to a value greater than 0, some of the new offspring will be "genetically engineered" instead of being produced by the default crossover operator. Given two parents, a targeted sampling is performed during which thousands of potential offspring are generated and scored using the predictions of a machine learning model. From all potential offspring, the genome with the best predicted score (combined with some penalization[*] for being "too similar" to the current population) is used in the next iteration.
*): Via the parameter --maxRanksCompensatedByDistance, you can specify a "bonus score" that boosts the score of "very unique" genomes, i.e. genomes that do not share many features with the current population. See the documentation of that parameter for further information.
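The targeted sampling can be sketched as follows; `crossover` and `model_score` are illustrative stand-ins for the tuner's internal crossover operator and machine learning model, not its actual API:

```python
# Sketch of "genetic engineering": sample many candidate offspring from two
# parents and keep the one with the best model-predicted score.
import random

def crossover(parent_a, parent_b):
    # Uniform crossover over the parents' gene dictionaries (illustrative).
    return {gene: random.choice([parent_a[gene], parent_b[gene]]) for gene in parent_a}

def engineer_offspring(parent_a, parent_b, model_score, num_candidates=1000):
    candidates = [crossover(parent_a, parent_b) for _ in range(num_candidates)]
    # Lower score = better predicted performance, mirroring a rank-based score.
    return min(candidates, key=model_score)
```

In the tuner, the score passed to the final selection would additionally include the distance-based penalization described in the footnote.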
By default, the non-competitive mates for the crossover (classical and model-based) are selected uniformly at random. By setting
--enableSexualSelection=true, the random selection is replaced by a so-called roulette-wheel selection that uses the predicted performance of the non-competitive mates as weights. The predicted performance is obtained by querying the machine learning model that is otherwise used for the model-based crossover operator.
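Roulette-wheel selection is fitness-proportionate sampling: better-rated mates are picked more often, but every candidate keeps a non-zero chance. A minimal sketch, with purely illustrative weights:

```python
# Roulette-wheel (fitness-proportionate) selection sketch.
import random

def roulette_wheel_select(candidates, weights):
    # random.choices draws proportionally to the given weights.
    return random.choices(candidates, weights=weights, k=1)[0]

random.seed(42)
mates = ["genome_a", "genome_b", "genome_c"]
predicted_performance = [0.7, 0.2, 0.1]  # illustrative model predictions
picks = [roulette_wheel_select(mates, predicted_performance) for _ in range(1000)]
print(picks.count("genome_a"))  # roughly 700 of the 1000 draws
```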
Note: You can even enable sexual selection if you perform a run that uses
--engineeredProportion=0. The model will be trained/updated in the same way that it would be during the model-based crossover. If you do not want to use sexual selection (and model-based crossover) at the beginning of your tuning, but still want to leave yourself the option to do so on a
--continue, you can set
--trainModel=true when you start the tuning. If this parameter is set to
true, the machine learning model will always be trained/updated, so that you can
--continue with a model that has the "correct" state.
maxRanksCompensatedByDistance and featureSubsetRatioForDistance
During the model-based crossover, many offspring candidates of two parents are scored using a machine learning model that predicts the expected performance. Since this method is very greedy, you can specify a bonus for parameterizations that are "very different", when compared to the current population:
For each potential offspring, its distance to each of the members of the current population is computed, using the specified
--distanceMetric. The aggregated distances (e.g. average, 3-NN, ...) are normalized over all potential offspring and then multiplied by
--maxRanksCompensatedByDistance. The resulting value in the range of
[0, maxRanksCompensatedByDistance] is subtracted from the predicted score for each potential offspring. This means genomes with a mediocre predicted score may still be favored over genomes with better predicted score that are "very similar" to the current population.
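A minimal sketch of this penalization, assuming the aggregated distances are min-max normalized over the candidate set and that a lower score means a better rank:

```python
# Distance bonus sketch: normalized distances to the current population are
# scaled by maxRanksCompensatedByDistance and subtracted from the predicted
# score (lower score = better rank). Numbers are illustrative.

def apply_distance_bonus(predicted_scores, distances, max_ranks_compensated):
    lo, hi = min(distances), max(distances)
    span = (hi - lo) or 1.0  # avoid division by zero if all distances agree
    bonuses = [(d - lo) / span * max_ranks_compensated for d in distances]
    return [score - bonus for score, bonus in zip(predicted_scores, bonuses)]

# Candidate 2 has a worse predicted score but is far from the population,
# so it ends up ranked best after the bonus is applied.
scores = apply_distance_bonus([3.0, 4.0], distances=[0.1, 0.9], max_ranks_compensated=1.6)
print(scores)
```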
Note: The distance computations may have a huge impact on the duration of the crossover-phase. You can disable the computations by setting
--maxRanksCompensatedByDistance=0 (not recommended).
Alternatively, you can reduce the number of features that are considered during the distance computation via the parameter
--featureSubsetRatioForDistance. Considering fewer features results in a shorter computation time (recommended).
You can select a distance metric from the following options:

- Manhattan distance: Average of the sum of absolute differences between each of the features. Also known as Minkowski distance (with p = 1) or Taxicab geometry.
- Hamming distance: Counts the number of differing features between the potential offspring and the current population. The sum of the k = 3 smallest Hamming distances (i.e. to the "closest" genomes) is reported as the distance for a potential offspring. Two features are considered to be different if their relative difference exceeds
--hammingDistanceRelativeThreshold (i.e. a "small" number).
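The relative-threshold comparison can be sketched as follows; the genomes' double-representations are taken as plain lists of floats, and the default threshold of 0.01 is an illustrative assumption:

```python
# Relative-threshold Hamming distance sketch: two feature values count as
# different only if their relative difference exceeds the threshold.

def relative_hamming_distance(genome_a, genome_b, threshold=0.01):
    distance = 0
    for a, b in zip(genome_a, genome_b):
        scale = max(abs(a), abs(b))
        relative_difference = abs(a - b) / scale if scale else 0.0
        if relative_difference > threshold:
            distance += 1
    return distance

# Only the last feature differs by more than 1 %.
print(relative_hamming_distance([100.0, 5.0, 2.0], [100.5, 5.0, 3.0]))  # 1
```

The reported distance for a candidate would then be the sum of its 3 smallest such distances to members of the current population.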
This value controls the ratio of the genome's double-representation that will be "shown" to the trees used in the model-based crossover operator. The length of the double-representation depends on your parameter tree and the chosen categorical encoding: the length of a categorical domain's double-representation varies between encodings. E.g., for the
OrdinalEncoding, all categorical feature lengths will be 1. For the
OneHotEncoding, the length will be equal to the number of values in the respective categorical domain (i.e. 1 column per possible domain value).
A rule of thumb (according to the literature) is to train trees using feature counts of about #features/3 or sqrt(#features). Since you need to specify a ratio, those numbers are divided by the number of features, yielding a
--forestFeaturesPerSplitRatio of 1/3 or 1/sqrt(#features). We decided to use ratios because this makes it easier to provide "meaningful" default values without actually knowing the exact number of features.
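The rule of thumb can be turned into concrete values like this; the feature count of 24 is purely illustrative:

```python
import math

# Features considered per tree split, given a --forestFeaturesPerSplitRatio.
# n_features is the length of the genome's double-representation, which
# depends on the parameter tree and the chosen categorical encoding.

def features_per_split(n_features, ratio):
    return max(1, round(ratio * n_features))

n_features = 24  # illustrative
print(features_per_split(n_features, 1 / 3))                      # #features / 3 -> 8
print(features_per_split(n_features, 1 / math.sqrt(n_features)))  # sqrt(#features) -> 5
```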