Yeah, this is Frank again, the practice manager. I'm just looking through the Q&A here; a couple of questions are coming in…
Question: How do we implement a data lakehouse on top of an existing data lake that contains many different raw file formats?
Answer: There are a number of different ways. If you already have a data lake in place and you want to map it onto the lakehouse structure Ross presented, think of that existing data as, let's say, your bronze or silver zone.
The benefit of Delta Lake is that once you've got data into it, it starts maintaining offsets for you, so it knows what's been processed, what hasn't, and what new data has come in. But in this case your data already exists outside the Delta format, maybe as Parquet, CSV, or Avro files.
Databricks has a feature called Auto Loader that helps here. Assuming you're using Azure for your data lake, you can use Auto Loader to tie into Event Grid on that ADLS account: it uses an event queue and subscription to know when new files have landed, so as soon as a file arrives, Databricks is aware of it.
From there you can process that data through the rest of the zones, whether you're doing batch processing or stream processing through Databricks, which is excellent for streaming now, and continue building out bronze, silver, and gold as if the data had started in Delta, even though it landed in a raw format. A rough sketch of that ingestion step is shown below.
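As an illustration only, here is a minimal Auto Loader sketch in PySpark that picks up existing raw Parquet files and lands them in a bronze Delta table. The storage account, container names, paths, and checkpoint location are placeholders, and it assumes the code runs inside Databricks where `spark` is already defined:

```python
# Minimal Auto Loader sketch (runs inside a Databricks notebook or job).
# All paths and account names below are placeholders for illustration.

raw_path = "abfss://raw@mydatalake.dfs.core.windows.net/sales/"          # existing raw files (Parquet/CSV/Avro)
bronze_path = "abfss://lake@mydatalake.dfs.core.windows.net/bronze/sales/"
checkpoint_path = "abfss://lake@mydatalake.dfs.core.windows.net/_checkpoints/bronze_sales/"

bronze_stream = (
    spark.readStream
        .format("cloudFiles")                            # Auto Loader
        .option("cloudFiles.format", "parquet")          # format of the existing raw files
        .option("cloudFiles.useNotifications", "true")   # Event Grid/queue notifications; needs Azure permissions,
                                                         # omit to fall back to directory listing
        .load(raw_path)
)

(
    bronze_stream.writeStream
        .format("delta")
        .option("checkpointLocation", checkpoint_path)   # tracks which files have already been processed
        .trigger(availableNow=True)                      # batch-style run: process all new files, then stop
        .start(bronze_path)
)
```

The checkpoint is what gives you the "it knows what's been processed" behavior mentioned above: rerunning the same stream only picks up files that arrived since the last run.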
Question: How does the data lake support schema changes?
Answer: I think Ross touched on this a little bit with schema evolution. That's another really great benefit of using Delta: once your data is in that format, you can enable schema evolution in your Spark code.
So for instance, there's an option, and I forget the exact syntax, to say that you support schema evolution. Then, as long as you're following the rules of data types, it will accept the data.
For instance, if you're trying to do something like converting an attribute from a string to an integer, that obviously won't be supported and will still cause an error. But if you're adding a new attribute, or the change is a supported one like a length change, it will automatically be picked up.
Those changes get applied directly to the Delta table, with no code changes needed from the developer, so it's a really cool feature. A rough example of what that option looks like in Spark follows.
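As a sketch of the syntax referred to above (the table path, column names, and sample row are hypothetical), appending a batch that carries a new column to an existing Delta table with `mergeSchema` enabled lets Delta add that column automatically:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already available in a Databricks notebook

# Hypothetical incoming batch that now carries an extra column, "discount_pct",
# which the existing Delta table does not have yet.
incoming = spark.createDataFrame(
    [(1, "widget", 9.99, 5.0)],
    ["order_id", "product", "price", "discount_pct"],
)

(
    incoming.write
        .format("delta")
        .mode("append")
        .option("mergeSchema", "true")        # allow the new column to be added to the table schema
        .save("/mnt/lake/silver/orders")      # placeholder table path
)
```

An incompatible change, such as appending a string value into an existing integer column, would still fail with an error rather than being merged.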
Question: What are the costs of using Delta lake?
Answer: Yeah, there are a couple of things to say about pricing. Delta Lake itself is open source, so you're not required to use Databricks in order to use Delta, though there are benefits: some performance gains only come with Databricks' version. Another great thing about the format being open source is that there are lots of integration tools out there, and if you ever want to move away from Databricks, you can take your data with you.

You still own your data, and that's a great benefit of Databricks over some of the other tools: once you've put it into the Delta lake, you don't have to continue using Databricks. Your data sits in your own storage, and it's still just Parquet files. If you want to download them, move them, or do something other than keep using Delta, you can; you own that data, you know where it is, and you have access to it. It's not in some proprietary format.

So there's no cost to the format itself, but there is obviously a cost for the cluster compute and for the storage in your ADLS account. You can see the pricing right in Databricks' and Azure's pricing tools. A quick sketch of the "you still own your data" point follows.
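To illustrate that the data stays portable, here is a small sketch of reading the same Delta table with plain open-source Spark outside Databricks. It assumes `pyspark` and the open-source `delta-spark` package are installed, and the table path is a placeholder:

```python
# Reading a Delta table with open-source Spark, outside Databricks.
# Assumes: pip install pyspark delta-spark
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = (
    SparkSession.builder
        .appName("read-delta-outside-databricks")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Placeholder path to the same table written earlier; underneath it is
# just Parquet files plus a _delta_log transaction log directory.
df = spark.read.format("delta").load("/data/lake/silver/orders")
df.show()
```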