Prerequisites for cube build size and time optimization
- Build the cube for only 1-2 base partitions. For example, if the cube is partitioned by day, build 1-2 days; if it is partitioned by week, build 1-2 weeks; and so on.
- You should have at least 10-20 different business queries available to execute against the newly built cube so that you can capture query performance numbers.
Optimizations through cube design
Check the cube build summary and then analyze the following:
- Explore the possibility of reducing the number of dimensions by:
- Combining dimensions: At the Register File level, whenever possible, combine dimensions that have only a few attributes (1-3) into a single dimension. Merging such dimensions at the register file level reduces the number of dimensions at the cube level.
- Multiple hierarchies: Consider using multiple hierarchies if you need two types of time data for different purposes, such as year-month-day and year-quarter-month-day, or two types of location data, such as division-region-district-location and state-county-city-location.
- Dimension merging: At cube build time, the dimensions and combinations of dimensions are pre-aggregated/materialized and stored on disk. If the number of dimensions in a cube is too high, the materialization becomes too large for the existing architecture to accommodate, leading to long build times, increased size on disk, and higher read times while querying.
You can merge two or more dimensions related to the same or different datasets into a single dimension, provided the subset of facts to which these datasets relate is the same. This saves disk space, reduces the build time needed to materialize the cube, and improves query performance.
- Revisit distinct count measures
- Check whether any distinct count measures can be removed if they are not needed.
- Check whether accurate counts can be converted to approximate counts.
- Explore the possibility of using Boundary Based Distinct Counts wherever possible for high cardinalities.
- Also, explore whether Sum or Count functions can be used instead of Distinct Count to meet the same requirement.
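To illustrate why approximate counts are so much cheaper than accurate distinct counts, here is a minimal K-Minimum-Values (KMV) sketch in Python. This is a generic illustration of the approximate-counting trade-off, not Kyvos's implementation; the function name and the choice of k are assumptions.

```python
import hashlib

def kmv_estimate(items, k=256):
    """Approximate distinct count with a K-Minimum-Values sketch.

    Only the k smallest normalized hash values are kept, so memory stays
    O(k) no matter how high the cardinality; an exact distinct count
    must remember every value it has ever seen.
    """
    max_hash = float(2 ** 64)
    smallest = set()
    for item in items:
        digest = hashlib.blake2b(str(item).encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big") / max_hash  # normalize to [0, 1)
        smallest.add(h)
        if len(smallest) > k:
            smallest.remove(max(smallest))  # keep only the k smallest
    if len(smallest) < k:
        return len(smallest)  # sketch saw every distinct value: exact
    return int((k - 1) / max(smallest))  # classic KMV estimator

low = kmv_estimate(range(100))        # exact for low cardinality: 100
high = kmv_estimate(range(100_000))   # close to 100000, using only ~256 floats
```

The estimator is exact below k distinct values and within a few percent above it, which is the kind of accuracy/size trade-off an approximate distinct count measure makes.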
Optimization through aggregation strategy
On the Aggregation Strategy tab in cube designer, you can modify the aggregation properties to control the dimension combinations and materializations.
- Selective dimension materialization: Use this property to control dimension combinations to be pre-aggregated in a cube build. You can specify the dimension names for materialization using the property dialog. In each dimension combination, dimensions are kept in a specific order (defined in property kyvos.build.dimension.order).
When you choose dimension combinations with this property, only the combinations starting with the selected dimension(s) are materialized, which reduces cube size and build time.
Reducing aggregation can impact query performance, so cautiously identify which dimensions can be selectively materialized.
- Selective hierarchy materialization: Use this property to specify the highest level of a dimension to be materialized, allowing you to reduce the cube build time and size. Levels higher than the specified level will be aggregated at run time. Dimensions not specified in this property will be materialized based on default settings. You can also specify to materialize individual levels if needed. The property value comes into effect after a full cube build.
Consider a TIME dimension with a 4-level hierarchy: YEAR, QTR, MONTH, DAY. If you select QTR, Kyvos pre-aggregates only QTR and DAY (DAY being the lowest level). Any query containing YEAR is served from QTR, and any query containing MONTH is served from DAY, with the aggregation performed at run time. To materialize YEAR as well, you can select it individually.
- Recommendation Aggregates Configurations: Click Recommend me to view a recommended aggregation strategy, including base and subpartitions, recommendation details, and reasons. These recommendations are based on how the data is being queried; Kyvos automatically recommends aggregates based on its internal logic to improve performance and displays the number of recommendations. See Using aggregates for additional details.
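The effect of selective dimension materialization on the combination space can be sketched in Python. The dimension names, their order, and the selection below are illustrative assumptions; the point is that combinations are ordered (per kyvos.build.dimension.order), so only combinations whose first dimension is selected get pre-aggregated.

```python
from itertools import combinations

# Illustrative dimension order and selection (assumptions, not real cube metadata)
dims = ["TIME", "PRODUCT", "GEO", "CUSTOMER"]
selected = {"TIME"}  # dimensions chosen for selective materialization

# All non-empty ordered dimension combinations a build could materialize
all_combos = [c for r in range(1, len(dims) + 1) for c in combinations(dims, r)]

# Only combinations starting with a selected dimension are pre-aggregated
materialized = [c for c in all_combos if c[0] in selected]

# 15 possible combinations shrink to the 8 that begin with TIME, e.g.
# ("TIME", "GEO") is pre-aggregated while ("PRODUCT", "GEO") is answered
# at query time from other materializations.
```

This is why the selection cuts both build time and size but can slow queries that never touch the selected dimensions.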
Optimization through Spark properties
Users should have prior knowledge of the available cluster resources (nodes, cores, memory) and should observe the time taken by each level during the cube build. This helps identify bottlenecks and further optimize build time by using the Spark properties below to increase task parallelism and exploit Spark's distributed environment within the confines of the given resources.
- kyvos.build.spark.levelJob.tasks: Use this property to configure the number of tasks for the reducer stage of Level1 and Level_DistCount jobs in full and/or incremental cube builds when executing through the Spark engine. Set this property to any positive integer; that number of tasks will then be launched for the reducer stage to increase parallelism, provided the required resources are available on the cluster.
The default value is determined automatically based on data loads.
- spark.dynamicAllocation.minExecutors: Use this Spark property to set the lower bound for the number of executors when dynamic allocation is enabled. See details.
- spark.yarn.executor.memoryOverhead: Use this Spark property to set the amount of off-heap memory (in megabytes) allocated per executor. This memory accounts for VM overheads, interned strings, other native overheads, and so on, and tends to grow with executor size (typically 6-10%). See details.
- spark.driver.memory: Use this property to set the amount of memory to be used for the driver process, i.e., where SparkContext is initialized. (e.g., 1g, 2g).
- kyvos.spark.executor.memory.level1: Use this property to set the Spark executor memory for Level1 job(s) launched during full and/or incremental cube builds.
- kyvos.spark.executor.cores.level1: Use this property to set the Spark executor cores for Level1 job(s) launched during full and/or incremental cube builds.
- spark.executor.memory: Use this property to set the amount of memory to be used per executor process (e.g., 2g, 8g). See details.
- spark.executor.cores: Use this property to set the number of cores used by each executor. This property applies to YARN and standalone mode only. In standalone mode, setting this parameter allows an application to run multiple executors on the same worker if there are enough cores on that worker; otherwise, only one executor per application runs on each worker. See details.
- spark.dynamicAllocation.maxExecutors: Use this property to set the upper bound for the number of executors if the dynamic allocation is enabled. See details.
For Azure Databricks-based environments, you need to explicitly define or modify the Spark properties in the Databricks Advanced Spark Configuration.
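As a starting point, the Spark properties above can be collected into one configuration. Every value below is an illustrative assumption for a mid-sized cluster, not a recommendation; the helper simply applies the 6-10% overhead rule of thumb mentioned above.

```python
def suggested_memory_overhead_mb(executor_memory_gb, fraction=0.10):
    """Off-heap overhead per executor; the text above suggests 6-10% of executor memory."""
    return int(executor_memory_gb * 1024 * fraction)

# Illustrative values only; size these to your own cluster resources and data volume.
spark_build_conf = {
    "spark.driver.memory": "4g",
    "spark.executor.memory": "8g",
    "spark.executor.cores": "4",
    "spark.yarn.executor.memoryOverhead": str(suggested_memory_overhead_mb(8)),
    "spark.dynamicAllocation.minExecutors": "2",
    "spark.dynamicAllocation.maxExecutors": "20",
    "kyvos.build.spark.levelJob.tasks": "64",  # any positive integer; default is automatic
}
```

Raising maxExecutors and levelJob.tasks increases parallelism only while the cluster actually has free cores and memory, which is why observing per-level build times first matters.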
Optimization through MapReduce properties
- mapreduce.job.reduces: Use this property to set the number of tasks for the reduce stage of cube build jobs. Set the value based on the data volume to be processed, the capacity of the cluster, and the memory allocated to the job.
- mapreduce.map.java.opts: Use this property to define the heap size allocated to the map tasks of MapReduce jobs executed from Kyvos. Set the value based on the amount of data to be processed and the number and cardinality of dimensions.
- mapreduce.map.memory.mb: Use this property to define the amount of memory allocated to the map tasks of MapReduce jobs executed from Kyvos. Set the value based on the amount of data to be processed and the number and cardinality of dimensions. See details.
- mapreduce.reduce.java.opts: Use this property to define the heap size allocated to the reduce tasks of MapReduce jobs executed from Kyvos. Set the value based on the amount of data to be processed and the number and cardinality of dimensions.
- mapreduce.reduce.memory.mb: Use this property to define the amount of memory allocated to the reduce tasks of MapReduce jobs executed from Kyvos. Set the value based on the amount of data to be processed and the number and cardinality of dimensions.
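The MapReduce properties can be sketched the same way. The container sizes and reducer count below are illustrative assumptions; the 80% heap-to-container ratio is a common Hadoop rule of thumb, not a value from this document.

```python
def heap_opts(container_mb, fraction=0.8):
    """JVM heap for a task container; ~80% of container memory is a common
    Hadoop rule of thumb (an assumption here, not from this document)."""
    return f"-Xmx{int(container_mb * fraction)}m"

# Illustrative values only; size these to your data volume, dimension
# cardinalities, and cluster capacity.
mapreduce_conf = {
    "mapreduce.job.reduces": "64",
    "mapreduce.map.memory.mb": "4096",
    "mapreduce.map.java.opts": heap_opts(4096),
    "mapreduce.reduce.memory.mb": "8192",
    "mapreduce.reduce.java.opts": heap_opts(8192),
}
```

Keeping the heap below the container size leaves room for off-heap JVM overhead, so the task is not killed for exceeding its memory limit.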
Optimization through cluster configuration
- For cloud-based environments, configure autoscaling and define minimum and maximum worker nodes according to the computational load. Used judiciously, this ensures proper utilization of cloud resources and helps optimize cube build time.
- See the cloud best practices for additional details.