Power Query - Optimization with models and dataflows
Following on from the previous optimization articles, let's take a look at some of the other tools available. This time, we'll be looking at improving report loading performance. Note that this performance enhancement is independent of the optimizations covered in the previous articles and can be carried out at any time.
First of all, as a data acquisition engine, Power Query allows several types of manipulation. However, performing these manipulations at this stage is not always the optimal solution.
Among the operations that are often suboptimal at this stage, we have, for example:
- Certain operations related to data manipulation and transformation.
- Some raw data acquisitions.
- Decentralization of data across reports.
- Redundant transformation effort.
- Data inconsistencies caused by asynchronous maintenance.
Some of these operations are costly enough to slow down report generation. It is sometimes better to perform them in the model rather than at the source. What's more, centralizing certain operations, especially those reused in several reports, brings gains in performance, maintenance and uniformity across all the reports affected by these transformations. This makes it worthwhile to share the burden of data transformation.
With this in mind, I'd like to introduce you to model manipulation and dataflows.
Processing in the model
The first observation is that you don't have to do everything in Power Query. Some of the work can be handed over to the model using the DAX (Data Analysis Expressions) language. In Excel, it is also possible to use formulas and tables. Code (VBA or .NET) would be another option, but I won't dwell on it in this article.
Excel
In Excel, always work with tables rather than individual cells to improve performance. Tables enable better reference management. By default, when you load your Power Query data or any other data source, Excel automatically organizes it into a table. You can also convert existing cell ranges into tables.


This gives you access to your column headings in your formulas, and formulas entered in a table column automatically apply to the entire column.

It's also a good idea to always give your tables a meaningful name.

In this way, you can refer directly to your table and its columns in your formulas, even if its dimensions change.
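For example, assuming a table named Sales with Amount and Region columns (hypothetical names for this sketch), structured references look like this:

```
=SUM(Sales[Amount])
=SUMIFS(Sales[Amount], Sales[Region], "East")
```

Both formulas keep working as rows are added or removed, because they track the table itself rather than a fixed cell range.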

In addition, you may wish to activate Power Pivot in Excel to facilitate model manipulation and benefit from the advantages of DAX. It ships with Excel but is deactivated by default. To activate it:
- Go to File > Options.
- In the dialog window, go to the Add-ins section.
- In the Manage drop-down list, select COM Add-ins and click Go.
- Check "Microsoft Power Pivot for Excel".
This step is optional.

DAX in general
Model and table operations are more efficient for operations directly linked to records or their aggregates. These are calculated only when necessary; the rest of the time, the model uses the dataset already stored in memory. Some examples of operations best left to the model are:
- full copy of a table,
- table resulting from a filter on another table,
- information calculated from another column (calculated column),
- measures based on one or more filters,
- aggregate calculations,
- recursive hierarchy management (hierarchy path).
All these operations require few resources compared with their Power Query equivalents. Bear in mind that DAX only calculates the data required by the current display filter, unlike Power Query, which always processes all the records on each acquisition.
As an example, for a table copy made in Power Query, all the data is loaded into the model twice when the data is acquired, even if the copy is only a query reference internally. The same copy in DAX loads the data only once on acquisition and uses the in-memory information for the copy. The same applies to sub-tables that filter or aggregate a raw data table.
In a real-life situation, I was able to reach a result in DAX up to fifty times faster than its pure Power Query version. Depending on the operations to be performed, the question to ask is: "Does the operation have to be done at the source to alter the dataset, or is it information required for a display?" In the former case, Power Query is the ideal candidate; in the latter, it may be preferable to pass the torch to the model for performance reasons.
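To make these patterns concrete, here is a minimal DAX sketch. It assumes a Sales table with Region and Amount columns and an Employees table with EmployeeID and ManagerID columns; all names and the tax rate are hypothetical.

```
-- Calculated table: a full copy of Sales (data stays in memory, loaded once)
SalesCopy = Sales

-- Calculated table: the result of a filter on another table
EastSales = FILTER ( Sales, Sales[Region] = "East" )

-- Calculated column on Sales: information derived from another column
AmountWithTax = Sales[Amount] * 1.15

-- Measure: an aggregate evaluated only for the current filter context
East Total = CALCULATE ( SUM ( Sales[Amount] ), Sales[Region] = "East" )

-- Calculated column on Employees: recursive hierarchy path via PATH
ManagerPath = PATH ( Employees[EmployeeID], Employees[ManagerID] )
```

Unlike their Power Query equivalents, the calculated tables above reference the data already loaded in memory instead of acquiring it a second time.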
Dataflows
For users with a Power BI service Pro license, there's another tool available: dataflows. These enable the creation of datasets available to the entire workspace. The various advantages of dataflows are as follows:
- Configured with Power Query (same M language).
- Centralized data transformation.
- Recovery in the event of a refresh error.
- Fast, high-performance report loading.
- The connection burden is shifted to the dataflow and does not affect report performance.
- Only one connection is required to access the dataflow's datasets.
- Can be used as a data source in Power BI, Excel and other applications.
The main advantage is therefore that datasets can be centralized in a single dataflow, which reports connect to as if it were a conventional database. From a reporting point of view, the dataflow source gives us access to the exposed tables. What's more, table results are not recalculated at each report refresh. This offers increased performance gains for complex queries requiring multiple transformations.
As this is a centralized source, any modification to the dataflow will be reflected in all reports using it as a data source. When several reports use the same data, this also makes it possible to standardize certain transformations between reports. On the other hand, if you "break" your dataflow, all the reports using this source will be impacted.
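To illustrate what "connecting as if it were a database" looks like, here is a hedged sketch of the M code the Power BI dataflows connector's navigator typically generates; the workspace and dataflow IDs are placeholders, and the exact navigation fields may vary with the connector version.

```
let
    // List the dataflows the current account can access
    Source = PowerBI.Dataflows(null),
    // Drill into a workspace, then a dataflow, then an exposed table
    Workspace = Source{[workspaceId = "00000000-0000-0000-0000-000000000000"]}[Data],
    Dataflow = Workspace{[dataflowId = "11111111-1111-1111-1111-111111111111"]}[Data],
    Sales = Dataflow{[entity = "Sales", version = ""]}[Data]
in
    Sales
```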
How do you create them?
First of all, the dataflow editor also uses Power Query. In most situations, simply copy your M code into the dataflow editor. However, there are two important points to bear in mind:
- Some functions are not available in dataflows.
- Outputs can only be tables of primitive values; no other type can be exposed.
Unsupported functions include third-party functions and certain proprietary functions (such as the VSTS function suite for connecting to Azure DevOps). It is still possible to lighten the report's burden by moving the supported part of a query into a dataflow.
In terms of output data type, anything that returns a single primitive value (variable, constant, parameter), a set (list, record) or a function will be converted to a table if the query's load option is enabled. You must therefore ensure that objects and anything else that should not be converted to a table are not exported/exposed, so that the dataflow remains functional.
Furthermore, only primitive types are allowed as exported values in the table. It is therefore not possible to have a table containing sets or functions. Unfortunately, this means that a dataflow cannot be used to create a centralized, shareable function library.
During the dataflow validation stage, the engine will ensure that each output is a table, that each column has a defined primitive type, and that columns contain no errors (errors are replaced by the value null).
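As a minimal sketch of a dataflow-friendly query (the inline source and the column names are hypothetical), every column receives an explicit primitive type and row-level errors are replaced before validation:

```
let
    // Hypothetical inline source; any connector supported in dataflows works here
    Source = Table.FromRecords({
        [OrderID = 1, Amount = 10.5, OrderDate = #date(2023, 1, 15)],
        [OrderID = 2, Amount = 20.0, OrderDate = #date(2023, 2, 3)]
    }),
    // Give every column a defined primitive type so the dataflow validates
    Typed = Table.TransformColumnTypes(Source, {
        {"OrderID", Int64.Type},
        {"Amount", type number},
        {"OrderDate", type date}
    }),
    // Replace any row-level errors with null, mirroring the validation behavior
    NoErrors = Table.ReplaceErrorValues(Typed, {{"Amount", null}})
in
    NoErrors
```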
For more information on configuration, see the Microsoft documentation.
When should I use them?
There are several use cases. If you need to reuse queries, host your transformations in the cloud and simplify data modeling, you'll want to use dataflows. If you're working primarily on a single, specific report, Power Query directly within the report will suffice.
Dataflows are a powerful tool for data preparation and modeling. Here are some points to consider when deciding when to use a dataflow rather than a Power Query directly in the report:
- Reuse and sharing: dataflows are designed to be reusable by a large number of users. If you have queries that are part of several data models, dataflows make it easier to share and reuse them.
- Cloud hosting: the queries are executed and hosted on the Power BI service. This centralizes data preparation and avoids duplication of effort.
- Modeling and performance: if you need to analyze large quantities of data and present them in reports and dashboards with relevant performance indicators, dataflows can be more efficient. They prepare unified data for modeling and optimize performance.
- Entities and relationships: dataflows enable entities to be mapped, facilitating access to more complex data sources and the linking of different data sources.
The combined benefits also mean that the loading burden is shared. As reports are dissociated from data transformations, report loading is very fast. Data refreshes take place in the background, and a fail-safe mechanism prevents a report from breaking if a source is not accessible: the report simply serves the latest version that was successfully refreshed.
Still curious?
The subjects of DAX, Excel, Power Query and Power BI are covered by several specialists in the field. For those of you who would like additional examples, concrete ideas, or are simply eager to learn more, I suggest you take a look at the videos from Chandeep, Bas Dohmen and Guy in a Cube.
If you would like to learn more about a new topic, visit our Blog to find several articles. This library is growing, so check back regularly or subscribe to our LinkedIn feed to be kept informed of new content.