Building a Data Lakehouse With Databricks

A Tail Wind Talker Webinar

Presenters: Andy Walz, Ross McNeely, Andy Bolk, Frank Matteson

Andy Walz:

Welcome to another Tail Wind Talker webinar. Today’s Talker is titled Building a Successful Data Lakehouse with Databricks. I’m Andy Walz and I’m part of the sales team here at Tail Wind. I’ll do a brief introduction before we get started. For those of you that haven’t worked with us, Tail Wind is a strategic information management and data analytics consulting firm.

We partner with a handful of select organizations in the data and analytics space, specifically Microsoft, Snowflake, AWS, and Databricks. We help our clients design, build, and deploy successful data and analytics solutions. We’ve led numerous data projects for clients in every industry, from mid-sized companies to the Fortune 100. Many of our engagements begin with assessments or roadmaps to demonstrate ROI.

If you’d like to learn more about our services, please give us a call anytime, or you can visit our website to learn more about us. We should have plenty of time to take questions today.

I’ll briefly introduce our presenters today, Ross McNeely and Andy Bolk. Ross is our Vice President of IT here at Tail Wind. He started with us over eight years ago. He’s been a Principal Consultant, Practice Manager, and a top data architect for us during his time here.

Ross has led many of our key projects for clients across the country in Healthcare, Financial Services, Manufacturing, Retail, and more. Andy Bolk is a Solutions Architect at Tail Wind. He’s been here over seven years, and he’s also led many of our largest and most complex projects. We have Frank Matteson on the line with us today to help field any questions that you might have. Frank is a Practice Manager in Data and Analytics. So with that, I will hand it over to you, Ross.

Ross McNeely:

Hi, welcome to the Tail Wind Talker. Today we will be talking about building a data Lakehouse with Databricks. The focus is on the Lakehouse paradigm shift. After covering the Lakehouse concept, I will dig into the components so that you are comfortable with the architecture.

We will also have Andy Bolk talk through a Databricks use case before we dive deep into building a Lakehouse. It’s a great use case because it shows just how fast you can get up and running with Databricks. I’ll first describe the “new normal” when you have a Lakehouse. We’ll also talk about key benefits like simplifying your architecture and creating a Single Source of Truth (SSOT), and ultimately how the data lake and the data warehouse combine to bring you a Lakehouse.

The “new normal” with the Lakehouse: the Lakehouse expands the types of projects and reduces the risk, allowing more “moonshot projects” as the new normal. Projects previously labeled as unfeasible can be reclassified as high impact. So, the Lakehouse concept gives us more flexibility. It allows us to do more things with structured, semi-structured, and unstructured data. That’s a real big impact.

Impact Feasibility Graph

Before we start talking more in depth on the Lakehouse, I want to provide a quick use case that is easy to implement. I’ll hand it over to Andy Bolk.

Andy Bolk:

We are currently engaged with a financial company, and one of our smaller projects that we’ve been working on is pulling down daily loan rates from Bloomberg. We’ve been working with Bloomberg data, and they provided us with a Python script that we can use to download some rates from them.

Those data files are saved off into our data lake, and then once we get the data into the data lake, we need to process it. So how do we do that? We decided to use Databricks for this project, and once you have the data stored in the data lake and the Databricks instance set up, you are ready to proceed.

Data Lake Diagram

You’ll need to create a connection from Databricks so that it can access the data within the data lake. There are a few steps that need to be done to make this happen. The first one needs to be handled by a person who has an administrative role in Azure.

If you are just a contributor or an owner within your subscription or resource group, you may not be able to do this. You’ll need to create an Azure Active Directory application registration, which is basically a service principal that allows Databricks to gain access to the data. The next thing you’ll need to do, as far as best practices go, is also make use of Key Vault.

In Key Vault you’ll store the credentials for that service principal you just created. Then, once you get further into Databricks, you’ll create what’s called a secret scope within Databricks. That basically gives Databricks access to the Key Vault to retrieve those credentials for the service principal.

The final step, as far as setting up access, is to go into the data lake and assign the appropriate permissions for that service principal. Now that we have all that taken care of, we can start developing the code. You’ll go into Databricks and open up a new notebook.
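The setup steps above can be sketched in code. This is a minimal sketch, not our client's actual configuration: the storage account name, secret scope name, and secret key are hypothetical placeholders, and the conf keys are the standard Hadoop ABFS OAuth settings for ADLS Gen2.

```python
# Sketch of wiring Databricks to ADLS Gen2 with a service principal.
# "mystorageaccount", "my-scope", and "sp-client-secret" are hypothetical
# placeholders -- substitute your own names.

STORAGE_ACCOUNT = "mystorageaccount"  # hypothetical storage account name

def adls_oauth_conf(client_id: str, client_secret: str, tenant_id: str) -> dict:
    """Build the Spark conf entries for OAuth access to ADLS Gen2."""
    suffix = f"{STORAGE_ACCOUNT}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{suffix}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{suffix}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{suffix}": client_id,
        f"fs.azure.account.oauth2.client.secret.{suffix}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{suffix}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }

# In a notebook you would pull the secret from the Key Vault-backed scope and
# apply each entry, for example:
#   secret = dbutils.secrets.get(scope="my-scope", key="sp-client-secret")
#   for k, v in adls_oauth_conf(app_id, secret, tenant_id).items():
#       spark.conf.set(k, v)
```

The credentials themselves never appear in the notebook; only the secret-scope lookup does, which is the point of the Key Vault step.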

We already have our data downloaded from Bloomberg within the data lake. So the first step that we did in our case was to pull it into a DataFrame using a Python script, which gets the current data from the file into Databricks so that we can start working with it. The next step we took was to create a temporary view; being a SQL guy, this allows us to write SQL queries against the data.

From my personal standpoint, it’s a little bit easier to work with when we’re using SQL. The next step after that is to compare it to the data that’s in our Delta table. This is a more permanent table within Databricks, and it contains all the historical information.
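A sketch of those first notebook steps: read the daily file into a DataFrame, then register it as a temporary view for SQL. The path, view name, and file format are hypothetical, not the actual Bloomberg feed details.

```python
# Sketch: read the daily rates file into a DataFrame and expose it as a
# temporary view so the rest of the notebook can use plain SQL.
# The path and view name are hypothetical placeholders.

def load_daily_rates(spark,
                     path="abfss://lake@mystorageaccount.dfs.core.windows.net/raw/rates.csv"):
    """Read the daily rates file and register it as a temp view for SQL."""
    df = (spark.read
          .option("header", "true")       # first row holds column names
          .option("inferSchema", "true")  # let Spark guess the data types
          .csv(path))
    df.createOrReplaceTempView("daily_rates")
    return df

# In a notebook cell you could then query it with SQL:
#   spark.sql("SELECT * FROM daily_rates").show()
```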

Between the temporary view and the Delta table, we’re basically doing a merge statement, which a lot of you are probably familiar with from SQL. It manages any changes that come through on the latest daily file. Finally, once we have all the data processing taken care of, in our case we created a Power BI report that connects to the Delta table, so our users can go in and see the current rates received from Bloomberg.
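The merge step might look something like the following. This is a hedged sketch, not the client's actual code: the table name `loan_rates`, the view name `daily_rates`, and the columns are hypothetical, but the MERGE shape is standard Delta Lake SQL.

```python
# Sketch of the daily upsert: merge the day's temp view into the permanent
# Delta table. Table and column names are hypothetical placeholders.

MERGE_DAILY_RATES = """
MERGE INTO loan_rates AS target
USING daily_rates AS source
ON  target.instrument = source.instrument
AND target.rate_date  = source.rate_date
WHEN MATCHED THEN
  UPDATE SET target.rate = source.rate
WHEN NOT MATCHED THEN
  INSERT (instrument, rate_date, rate)
  VALUES (source.instrument, source.rate_date, source.rate)
"""

def merge_daily_rates(spark):
    """Run the upsert: changed rows are updated, new rows are inserted."""
    spark.sql(MERGE_DAILY_RATES)
```

Because the Delta table keeps all history and the view holds only the latest file, this one statement handles both new rates and corrections to prior rates.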

There are many ways this could have been accomplished. We chose Databricks and used the data to produce a Power BI report. But in the future, we are anticipating that there may be a need to do more in-depth analytics, and since we already have the data within Databricks, that makes those analytics a lot easier.

So with that, I will hand it back over to Ross.

Ross McNeely:

Great. Thank you, Andy.

That’s a great example, and I really enjoy hearing about quick wins. It’s a good example of just using some of the base components and doing something quick for an easy win. And as Andy mentioned, this sets them up for future success if they want to get into some machine learning and data science concepts.

What I want to do next is talk through what a traditional data roadmap looks like at our clients. We do typical integration work. We’re going to have a data platform. Maybe we have a concept of a data lake to begin with.

Data Roadmap

Then we’re going to get into marts and data warehousing, so you can see this entire flow here, and it’s quite involved. What I really want to start to focus on is: how do we simplify this architecture?

The complexity and opportunity for error occur because business intelligence developers and data scientists are developing from different sources. They also have different goals for delivery, which often prevents sharing logic.

You have lots of hops, you have different components, you have different specialties, different products, even though you’re all in the cloud you still have a complex process and architecture here.

How can we make the data lake work for both data science and business intelligence? We can start to build a raw data zone where we’re going to bring all data in, and it can now be discoverable. Then the data can be processed with transformations and sit in the cleansed zone.

Data Lakehouse Paradigm

We now have that bronze, silver notion where the data has been cleaned up. We can now add some structure to it, which we’ll go through. And then ultimately you can have some more advanced, curated data that may be in the form of a mart or data warehouse schema. That is the Lakehouse paradigm. By using Delta Lake to manage the Lakehouse, we can now support both business intelligence and data science / machine learning.

Data Lakehouse Paradigm With Delta Lake

To support business intelligence, Delta Lake needs to support ACID transactions. When we write data, we need a guarantee that it is going to get committed. We also need the ability to add some performance over the data lake with indexes. Schema validation is also critical for business intelligence needs. Delta Lake enables us to have the schema concept from data warehousing on top of the data that we have in a data lake.

Delta Lake is going to take over, and we’re going to use the tools and capabilities within Delta Lake to create the bronze (raw), silver (standardized), and gold (curated) zones off of our data lake. That’s going to give us the ability to capture everything, apply transformations and business logic, and ultimately provide aggregated data or maybe dimensional data in a star schema. Also, data science works well on Delta Lake, since Databricks uses standard file formats that allow Python scripts to be written against them.

Raw Standardized Curated Data

What are some other Delta Lake (Lakehouse) capabilities? If there are missing columns, we can just accept them and land the data. If the existing data type was a NullType, the new data type is allowed; and if the column is not NullType, you can accept an upcast, for example from TinyInt to Int, as an evolution change. We also have the flexibility at validation to pass nulls for missing columns.
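In code, schema evolution is typically opted into at write time. A minimal sketch, assuming a hypothetical table path; `mergeSchema` is the standard Delta write option that lets the table evolve rather than reject a write carrying new columns.

```python
# Sketch of Delta Lake schema evolution: append a DataFrame that may carry
# new columns, with mergeSchema telling Delta to evolve the table schema
# instead of failing the write. The table path is a hypothetical placeholder.

def append_with_evolution(df, table_path="/delta/loan_rates"):
    """Append data, letting Delta add any new columns it finds."""
    (df.write
       .format("delta")
       .mode("append")
       .option("mergeSchema", "true")  # accept new columns / allowed upcasts
       .save(table_path))
```

Without the option, a mismatched schema fails the write, which is the validation behavior; with it, the allowed evolutions described above are applied automatically.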

Another key benefit from the data warehouse world is upserts and deletes. Updating files in a data lake is very difficult and process-heavy. Delta Lake allows developers to write typical SQL statements for inserts, updates, merges, and deletes.

Update And Delete Commands In Delta Lake
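As a sketch of the commands on that slide, these are ordinary SQL statements run against a Delta table; the table name, column names, and predicates here are hypothetical.

```python
# Sketch of Delta Lake update and delete statements, run via spark.sql().
# Table, columns, and predicates are hypothetical placeholders.

UPDATE_STMT = "UPDATE loan_rates SET rate = 0.0525 WHERE instrument = 'ABC'"
DELETE_STMT = "DELETE FROM loan_rates WHERE rate_date < '2015-01-01'"

def apply_corrections(spark):
    """Run an in-place update and a retention delete on the Delta table."""
    spark.sql(UPDATE_STMT)
    spark.sql(DELETE_STMT)
```

Under the hood Delta rewrites the affected Parquet files transactionally, which is what makes these familiar statements possible on a data lake.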

A little more depth around performance tuning helps to show the completeness of Delta Lake. There is automatic optimization with Delta Lake, and the ability to design additional enhancements with the Z-ordering technique, which groups related data into the same set of files. This has a big benefit in terms of data skipping. It basically allows for seek operations versus having to scan all the petabytes of the Lakehouse.

Automatic Optimization Command In Delta Lake
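As a sketch of those commands: `OPTIMIZE ... ZORDER BY` compacts files and co-locates related rows, and auto optimize can be enabled as table properties. The table and column names are hypothetical.

```python
# Sketch of Delta Lake performance-tuning commands, run via spark.sql().
# Table and column names are hypothetical placeholders.

# Compact small files and co-locate rows by a commonly filtered column,
# so queries can skip files whose min/max stats rule them out.
OPTIMIZE_STMT = "OPTIMIZE loan_rates ZORDER BY (rate_date)"

# Or let Databricks optimize writes and compact automatically:
AUTO_OPTIMIZE_STMT = """
ALTER TABLE loan_rates SET TBLPROPERTIES (
  delta.autoOptimize.optimizeWrite = true,
  delta.autoOptimize.autoCompact = true
)
"""
```

Z-ordering by the column you filter on most is what turns full scans into the seek-like, data-skipping reads Ross describes.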

Now I’m going to turn it over to Frank for questions.

Frank Matteson:

Yeah, this is Frank again, the Practice Manager. I’m just looking through the Q&A here. A couple of questions coming in…

Question: How do we implement a data Lakehouse on top of an existing data lake that contains many different raw file formats?

Answer: There are a number of different ways. Say you already have a data lake in place, and we’re trying to fit it into the data Lakehouse structure that Ross presented; let’s just call what you have the bronze or silver zone.

The benefit of Delta Lake is that once you’ve got data into it, it starts maintaining offsets for you, so it knows what’s been processed, what hasn’t been processed, and what new data has come in. In this case your data already exists, just not in the Delta format; it’s maybe Parquet files, CSV files, Avro files, or whatever else.

Databricks has a feature called Auto Loader. This assumes you’re using Azure for your data lake; if that’s the case, you can use Auto Loader to tie into Event Grid on that ADLS subscription or account. It uses the event queue and subscription to know when new files have landed, so when they land, Databricks will be aware that a new file has arrived.

And now you can process that data through the rest of the zones, whether you’re doing batch processing or stream processing through Databricks, which is amazing for stream processing now. Continue on with building out that bronze, silver, and gold as if it were in Delta, even with your raw format.
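The Auto Loader pattern Frank describes can be sketched as follows. The path and file format are hypothetical; `cloudFiles` is the Auto Loader source format, and `cloudFiles.useNotifications` switches it from directory listing to the Event Grid notification queue.

```python
# Sketch of Auto Loader: stream new files out of an existing ADLS data lake
# as they land, so they can flow into the bronze/silver/gold zones.
# The path and file format are hypothetical placeholders.

def raw_rates_stream(spark,
                     path="abfss://lake@mystorageaccount.dfs.core.windows.net/raw/"):
    """Return a streaming DataFrame that picks up new files as they arrive."""
    return (spark.readStream
            .format("cloudFiles")                          # Auto Loader source
            .option("cloudFiles.format", "csv")            # format of raw files
            .option("cloudFiles.useNotifications", "true") # Event Grid queue
            .load(path))
```

The returned stream would then be written into the bronze Delta table with `writeStream`, and each downstream zone can consume it the same way.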

Question:  How does the data lake support schema changes?

Answer: I think Ross touched on this a little bit with schema evolution. That’s another really great benefit of using Delta. Once you’re in that format, you can enable schema evolution in your Spark code, if you’re using that.

So for instance, there are options to enable it (I forget exactly what the syntax is), and then, as long as you’re following the laws of data types, it accepts the data.

So for instance, if you’re trying to do something like convert an attribute from a string to an integer, that will obviously not be supported and will still cause an error. But if you’re adding a new attribute, or the attribute change is supported, like a length change, then it will automatically pick it up.

It’ll apply those changes directly to the Delta table, and there are no code changes needed by the developer. So it’s a really cool feature.

Question: What are the costs of using Delta lake?

Answer: Yeah, there are a couple of things about pricing. Delta Lake itself is open source, so you can use Delta without Databricks. You aren’t required to use Databricks, but there are benefits: some performance gains only come with their version.

Another great thing about this format is that there are lots of integration tools out there. If you want to move away from Databricks, you can take that data with you; you still own your data. That’s the great benefit of Databricks over some of the other tools: once you’ve put it into the Delta lake, you don’t have to continue using it. Your data is in your storage, and the files are still just Parquet files. So if you wanted to download them, move them, or do something other than continue using Delta Lake, you can do that if you so choose. You own that data, you know where it is, and you have access to it. It’s not in some proprietary format; it’s open source.

So there are great benefits around that, and there’s no cost to using Delta Lake itself, but there is obviously cost to the cluster, the compute, and the storage in your ADLS. You can see the pricing right in Databricks’ and Azure’s pricing tools.

Need help scoping a data project?

Contact us today for a free onsite consultation.