  1. Configure Cluster and Query Engine Scheduling to save cost and use cloud resources only when needed. 
    You can create a schedule to: 
    1. Shut down the cluster for any time interval 
    2. Start the cluster for any time interval 
    3. Schedule query engines for any time interval 
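Such a schedule can be driven from any external scheduler (cron, Azure Automation, and the like) through the Databricks Clusters API. A minimal sketch, assuming a hypothetical workspace URL, access token, and cluster ID:

```python
import json
import urllib.request

HOST = "https://example.cloud.databricks.com"  # assumption: your workspace URL
TOKEN = "dapi-xxxx"                            # assumption: a personal access token
CLUSTER_ID = "0101-120000-abcd1234"            # assumption: the target cluster ID

def clusters_endpoint(action: str) -> str:
    """Build the Clusters API 2.0 URL for 'start' or 'delete' (terminate)."""
    return f"{HOST}/api/2.0/clusters/{action}"

def call(action: str) -> None:
    """POST the cluster ID to the given Clusters API action."""
    req = urllib.request.Request(
        clusters_endpoint(action),
        data=json.dumps({"cluster_id": CLUSTER_ID}).encode(),
        headers={"Authorization": f"Bearer {TOKEN}"},
        method="POST",
    )
    urllib.request.urlopen(req)

# e.g. run call("start") from a morning cron job and call("delete")
# from an evening one so the cluster stays off overnight.
```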
  2. Auto Scaling and Auto Termination Policy on Databricks Cluster to save cost and achieve high cluster utilization. 
    1. Auto Termination: You can also set auto termination for a cluster. During cluster creation, you can specify an inactivity period in minutes after which you want the cluster to terminate. If the difference between the current time and the last command run on the cluster is more than the inactivity period specified, Databricks automatically terminates that cluster. 
      A cluster is considered inactive when all commands on the cluster, including Spark jobs, Structured Streaming, and JDBC calls, have finished executing. This does not include commands run by SSH-ing into the cluster and running bash commands. 
      Standard clusters are configured to terminate automatically after 120 minutes. You can modify the default value as needed. 
    2. Auto Scaling: When you create a Databricks cluster, you can either provide a fixed number of workers for the cluster or provide a minimum and maximum number of workers for the cluster.  
      When you provide a fixed size cluster, Databricks ensures that your cluster has the specified number of workers. When you provide a range for the number of workers, Databricks chooses the appropriate number of workers required to run your job. This is referred to as autoscaling. 
      With autoscaling, Databricks dynamically reallocates workers to account for the characteristics of your job. Certain parts of your pipeline may be more computationally demanding than others, and Databricks automatically adds additional workers during these phases of your job (and removes them when they’re no longer needed).  
      Autoscaling makes it easier to achieve high cluster utilization because you don’t need to provision the cluster to match a workload. This applies especially to workloads whose requirements change over time, but it can also apply to a one-time shorter workload whose provisioning requirements are unknown.
      Autoscaling thus offers two advantages: 
  • Workloads can run faster compared to a constant-sized under-provisioned cluster. 
  • Autoscaling clusters can reduce overall costs compared to a statically sized cluster. 

You can modify the Auto Scaling configuration later as well. 
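Both settings live in the cluster specification that is sent to the Clusters API (POST /api/2.0/clusters/create) or configured in the cluster UI. A minimal sketch of such a spec; the name, runtime version, and node type below are assumptions, so adjust them for your workspace:

```python
# Cluster spec sketch: autoscaling range plus auto termination.
cluster_spec = {
    "cluster_name": "nightly-etl",         # assumption: any name you like
    "spark_version": "13.3.x-scala2.12",   # assumption: a runtime your workspace supports
    "node_type_id": "Standard_DS3_v2",     # assumption: an example Azure node type
    # A min/max range instead of a fixed "num_workers" enables autoscaling.
    "autoscale": {"min_workers": 2, "max_workers": 8},
    # Terminate after 60 minutes of inactivity instead of the 120-minute default.
    "autotermination_minutes": 60,
}
```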

  3. Enable Logging for DBFS and provide a location to Persist Event Logs, Driver Logs and Executor Logs to analyze later.

To enable logging, navigate to Cluster -> Advanced Options -> Logging; thereafter you can specify the DBFS path. 

Event Logs, Driver Logs and Executor Logs will get persisted at this location.  
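The same setting can be expressed in the cluster spec through the `cluster_log_conf` field; a sketch, with an assumed DBFS path:

```python
# cluster_log_conf fragment for the cluster spec (the path is an assumption).
log_conf = {
    "cluster_log_conf": {
        "dbfs": {"destination": "dbfs:/cluster-logs/nightly-etl"}
    }
}
# Databricks delivers driver logs under <destination>/<cluster-id>/driver,
# executor logs under <destination>/<cluster-id>/executor,
# and event logs under <destination>/<cluster-id>/eventlog.
```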

  4. Ganglia Metrics

Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters. 

You can navigate to Metrics from the Databricks UI and then analyze the live metrics or periodic historical snapshots for CPU, Disk I/O, Memory, and several other parameters in case you observe node slowness, lost nodes, or stuck jobs. You can also download the metrics snapshots for future reference. 

  5. Resume cube build in the event of any failure.

You can use the Resume Cube Build feature, when a cube build is resumable, to save cloud cost.  

If you resume a failed build, the steps that were successfully completed are skipped, reducing the build time. There are several ways to resume a build after it fails to complete.  

  • From View Job Histories, right-click a failed job and choose Resume job. This option is available when the build failed after some of the steps were successfully completed.  
  • When you add a job and have selected a cube that failed to build on the previous attempt, you may see an option to resume from last failure. You are prompted to confirm that you want to resume the build.  
  6. Use Delta tables instead of Hive tables directly on Parquet.

Query performance 

As data grows exponentially in size, being able to get meaningful information out of your data becomes crucial. Using several techniques, Delta boasts query performance of 10 to 100 times faster than with Apache Spark on Parquet. 

  • Data Indexing – Delta creates and maintains indexes on the tables. 
  • Data Skipping – Delta maintains file statistics on the data subset so that only relevant portions of the data are read in a query. 
  • Compaction – Delta manages file sizes of the underlying Parquet files for the most efficient use. 
  • Data Caching – Delta automatically caches highly accessed data to improve run times for commonly run queries. 

Data reliability 

The end users of the data must be able to rely on the accuracy of the data. Delta uses various techniques to achieve data reliability. 

  • ACID transactions – Delta employs an all or nothing approach for data consistency. 
  • Snapshot isolation – Ensures that multiple writers can write to a dataset simultaneously without interfering with jobs that are reading the dataset. 
  • Improve data integrity through schema enforcements. 
  • Checkpoints to ensure data is delivered and read only once even if there are multiple incoming and outgoing streams. 
  • Upserts and deletes support – Being able to handle late arriving and changing records and cases where records should be deleted. 
  • Data versioning capabilities allow organizations to rollback and reprocess data as necessary. 
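As an illustration of the upsert support, a Delta table can absorb late-arriving and changed records with a single atomic MERGE. A sketch, assuming hypothetical `sales` and `sales_updates` tables, to be run where a SparkSession is available (e.g. a Databricks notebook or job):

```python
# Hypothetical table names -- `sales` must already be a Delta table,
# e.g. created with: df.write.format("delta").saveAsTable("sales")
merge_sql = """
MERGE INTO sales AS t
USING sales_updates AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
"""

def upsert_late_records(spark):
    """Upsert late/changed records into the Delta table in one ACID transaction."""
    spark.sql(merge_sql)
```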
  7. Ensure that all cubes that are not eligible for querying have their cuboid replication type set to None.
  8. Query Engines, the BI Server, and ADLS storage must be in the same region.
  9. Ensure that there is enough local disk space available on the Query Engine to replicate the built cubes.
  10. For environments that do not have sufficient local disk available (local disk smaller than the cube size), create a segment, create a dedicated metadata folder, and allocate the production cubes to this segment and the rest of the cubes to the default segment.