The Clearly Podcast

Dealing with Large Datasets

Summary

This week we look at big datasets. To be clear, we are simply talking about large volumes of data, not "Big Data" as defined by volume, variety, and velocity.

We discuss options such as using Analysis Services to offload data processing workloads from Power BI. Premium Per User is now in public preview, so this could offer a more cost-effective long-term solution for managing large datasets in Power BI.

DirectQuery offers a last-resort option, but it is typically the least performant, so it should be used with caution: slow performance will put users off the tool.

We discuss monitoring performance and managing a changing landscape. A report with a low volume of data today could become a large dataset over time and need a change of approach.

Finally, we come to data quality, and why it should be fixed at source rather than in the reporting tool.

You can download Power BI Desktop from here.

If you already use Power BI, or are considering it, we strongly recommend you join your local Power BI user group here.

Transcript

Andy: Hey everyone, welcome to the Clearly Podcast Episode 6: Dealing with Large Datasets. Today, we’ll discuss how to manage large datasets effectively, including the tools and techniques that can help.

Shailan: First off, a small dataset is something that can fit into Excel. A large dataset, on the other hand, is something with over 100 million rows or transactions. For instance, we’ve dealt with datasets where thousands of transactions per second were coming from various locations.

Tom: It's not just the number of rows, but also the number of columns. For example, a dataset with 100 million rows and only three columns might still be manageable. However, when you have 20 or 30 columns, even 10 million rows can become challenging.

Andy: Indeed, large datasets require different management techniques. With smaller datasets, you can handle data quality at a transactional level. But with millions of rows, that’s impractical. Tom, what are the main tools for managing big datasets?

Tom: First, question whether you need all the data. Reduce the dataset before importing it into tools like Power BI. If you do need every row and column, consider using SQL Server Analysis Services or Premium capacity in Power BI. DirectQuery should be a last resort due to performance issues.
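Tom's advice to reduce the dataset before import can be sketched in plain Python. This is only an illustration with hypothetical field names; in practice you would push the same filtering into the source SQL query or a Power Query step so the unwanted data never leaves the source system.

```python
# Hypothetical raw transaction records, as they might arrive from a source system.
raw_rows = [
    {"date": "2021-01-04", "store": "North", "product": "A", "qty": 3, "price": 9.99, "clerk_id": 17, "till_id": 2},
    {"date": "2021-01-04", "store": "South", "product": "B", "qty": 1, "price": 4.50, "clerk_id": 12, "till_id": 1},
    {"date": "2020-12-30", "store": "North", "product": "A", "qty": 2, "price": 9.99, "clerk_id": 17, "till_id": 3},
]

# Only the columns the report actually uses (assumed for this sketch).
NEEDED = ("date", "store", "product", "qty", "price")

def reduce_rows(rows, since="2021-01-01"):
    """Drop unused columns and out-of-scope rows before loading into the model."""
    return [
        {k: r[k] for k in NEEDED}
        for r in rows
        if r["date"] >= since  # string comparison is safe for ISO dates
    ]

trimmed = reduce_rows(raw_rows)
print(len(trimmed))  # 2 rows survive; clerk_id and till_id are gone
```

The same principle applies whatever the tooling: filter rows and drop columns as far upstream as possible, so the reporting layer only ever sees what it needs.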

Andy: Being minimalist is key. Only carry the data you absolutely need; this reduces the load and helps maintain performance.

Shailan: Using the right visuals is crucial. Some visuals can lag with large datasets. Also, perform calculations at the query level rather than on-the-fly in the report to enhance performance.
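Shailan's point about calculating at the query level can be sketched as follows. This is a minimal Python stand-in for a source-side aggregation (the table and figures are invented for illustration): the total is computed once, upstream, rather than recomputed row by row every time a visual renders.

```python
from collections import defaultdict

# Hypothetical transaction feed: (date, store, revenue).
transactions = [
    ("2021-01-04", "North", 29.97),
    ("2021-01-04", "South", 4.50),
    ("2021-01-05", "North", 19.98),
]

def aggregate_by_day(rows):
    """Summarise revenue per day once at the query level,
    instead of leaving millions of rows for the report to scan per visual."""
    totals = defaultdict(float)
    for date, _store, revenue in rows:
        totals[date] += revenue
    return dict(totals)

daily = aggregate_by_day(transactions)
# daily maps each date to its total revenue, e.g. "2021-01-05" -> 19.98
```

In Power BI terms, this is the difference between importing a pre-aggregated query and asking each visual to sum the raw transaction table on the fly.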

Andy: Responsiveness is critical. Users won’t use reports that take too long to load. We once reduced the load time of a report from 7 minutes to just a few seconds by optimizing the model and visuals.

Tom: Treat reporting as an iterative process. Regularly review and optimize your reports as data grows and business needs change.

Shailan: Implement a data quality dashboard to highlight anomalies, duplications, and errors. This helps in maintaining data quality over time.
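The checks behind the data quality dashboard Shailan describes can be sketched like this. It is a simplified illustration (field names and thresholds are assumptions): count duplicate keys and missing required values, and surface those counts rather than silently fixing them in the report.

```python
def quality_summary(rows, key_fields, required_fields):
    """Count duplicate keys and missing required values,
    feeding the kind of figures a data quality dashboard would display."""
    seen, duplicates, missing = set(), 0, 0
    for r in rows:
        key = tuple(r.get(f) for f in key_fields)
        if key in seen:
            duplicates += 1
        seen.add(key)
        if any(r.get(f) in (None, "") for f in required_fields):
            missing += 1
    return {"rows": len(rows), "duplicates": duplicates, "missing": missing}

# Hypothetical customer records with deliberate problems.
records = [
    {"id": 1, "customer": "Acme"},
    {"id": 1, "customer": "Acme"},  # duplicate id
    {"id": 2, "customer": ""},      # missing customer name
]
summary = quality_summary(records, ["id"], ["customer"])
# {'rows': 3, 'duplicates': 1, 'missing': 1}
```

Surfacing counts like these on a dashboard makes the anomalies visible so they can be fixed at source, which is the point Andy makes next.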

Andy: Data quality should be an ongoing process, not a one-time event.

Tom: Absolutely. Users should report any issues they encounter to help maintain data quality and performance.

Andy: If there’s one takeaway from today, it’s to be ruthless with the data you carry. Only use what you absolutely need.

Shailan: And ensure you implement and maintain proper query and calculation optimizations.

Tom: Regularly review your reports and keep an eye on data usage statistics.

Andy: For more information, visit our website or Microsoft's resources on big data. We also have relevant blogs and upcoming podcasts on related topics. Thank you for joining us. We’ll be back on December 6th with more on Power BI row-level security. Until then, stay safe and take care. Goodbye!