Identify the major differences between a traditional data warehouse and a data mart. Explain how the traditional data warehousing process differs from that of a newly designed data warehouse delivered in less than 90 days.
Key differences: data warehouses vs. data lakes
A data warehouse and a data lake are two related but fundamentally different technologies. A data warehouse stores structured data, whereas a data lake is a centralized repository that allows you to store any data at any scale. Compared to a data warehouse, a data lake offers more storage options, is more complex, and serves different use cases. The key points of difference are given below.
Data sources
Both data lakes and warehouses can have unlimited data sources. However, data warehousing requires you to design your schema before you can save the data. You can only load structured data into the system. Conversely, data lakes have no such requirements. They can store unstructured and semi-structured data, such as web server logs, clickstreams, social media, and sensor data.
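The contrast between designing a schema up front and storing data as-is can be sketched in a few lines. This is only an illustration under stated assumptions: the in-memory SQLite table stands in for a warehouse, a plain Python list stands in for a lake, and the `orders` table, the `page_view` event shape, and all function names are hypothetical.

```python
import json
import sqlite3

# Schema-on-write: the table layout must exist before any row is stored.
# (SQLite is a stand-in for a warehouse; the orders table is hypothetical.)
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, customer TEXT NOT NULL, amount REAL NOT NULL)"
)

WAREHOUSE_COLUMNS = {"order_id", "customer", "amount"}


def load_into_warehouse(conn, record):
    """Insert a record only if it matches the predefined columns exactly."""
    if set(record) != WAREHOUSE_COLUMNS:
        raise ValueError(f"record does not fit the orders schema: {sorted(record)}")
    conn.execute("INSERT INTO orders VALUES (:order_id, :customer, :amount)", record)


# Schema-on-read: raw lines are kept as-is; structure is applied when reading.
# (A list is a stand-in for lake storage.)
lake = []


def load_into_lake(store, raw_line):
    """Append the raw text without inspecting it."""
    store.append(raw_line)


def read_page_views(store):
    """Interpret raw lines at read time, skipping anything that is not a page view."""
    views = []
    for line in store:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # free-form log text stays in the lake but is ignored here
        if isinstance(event, dict) and event.get("type") == "page_view" and "url" in event:
            views.append(event["url"])
    return views


load_into_warehouse(warehouse, {"order_id": 1, "customer": "acme", "amount": 120.0})
load_into_lake(lake, '{"type": "page_view", "url": "/pricing"}')
load_into_lake(lake, "127.0.0.1 - - GET /index.html 200")
load_into_lake(lake, '{"type": "sensor", "temp_c": 21.5}')
```

The warehouse loader rejects anything that does not fit the predefined columns, while the lake accepts all three lines and leaves interpretation to whoever reads them later.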
Preprocessing
A data warehouse typically requires preprocessing before storage. Extract, Transform, Load (ETL) tools are used to clean, filter, and structure data sets beforehand. In contrast, data lakes hold any data. You have the flexibility to choose if you want to perform preprocessing or not. Organizations typically use Extract, Load, Transform (ELT) tools. They load the data in the lake first and transform it only when required.
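The difference in ordering between ETL and ELT can be shown with a small sketch. The record fields, the `transform` rule, and the function names below are assumptions made for illustration, and plain lists stand in for warehouse and lake storage.

```python
def transform(row):
    """Clean one raw record: normalize the customer name and parse the amount."""
    return {
        "customer": row["customer"].strip().lower(),
        "amount": float(row["amount"]),
    }


def etl(raw_rows):
    """Extract, Transform, Load: only cleaned rows ever reach storage."""
    warehouse = []
    for row in raw_rows:
        warehouse.append(transform(row))  # transform happens before the load
    return warehouse


def elt_load(raw_rows):
    """Extract, Load: store copies of the raw rows untouched."""
    return [dict(row) for row in raw_rows]


def elt_transform(lake):
    """Transform: applied later, only when an analysis needs it."""
    return [transform(row) for row in lake]


# Hypothetical raw input with inconsistent formatting.
RAW = [
    {"customer": " Acme ", "amount": "120.50"},
    {"customer": "Globex", "amount": "75"},
]

warehouse_rows = etl(RAW)         # cleaned at load time
lake_rows = elt_load(RAW)         # raw at load time
on_demand = elt_transform(lake_rows)  # cleaned at read time
```

Both paths end with the same cleaned rows; what differs is when the cleaning cost is paid and whether the raw form is still available afterwards.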
Data quality
Data in a warehouse tends to be more reliable because it is processed before it is stored. Functions such as de-duplication, sorting, summarizing, and verification can be performed in advance to ensure data accuracy. In a data lake, duplicate, erroneous, or unverified data may end up in storage if no checks are performed ahead of time.
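A minimal sketch of the kind of up-front checks described above (verification, de-duplication, and sorting) might look like the following. The `order_id` and `amount` fields and the validity rule are hypothetical; real pipelines would apply their own rules.

```python
def verify(row):
    """Accept a row only if it has an integer id and a non-negative numeric amount."""
    return (
        isinstance(row.get("order_id"), int)
        and isinstance(row.get("amount"), (int, float))
        and row["amount"] >= 0
    )


def curate(rows):
    """Verify, de-duplicate by order_id (first occurrence wins), and sort."""
    seen = set()
    clean = []
    rejected = []
    for row in rows:
        if not verify(row):
            rejected.append(row)
            continue
        if row["order_id"] in seen:
            continue  # drop duplicates silently
        seen.add(row["order_id"])
        clean.append(row)
    clean.sort(key=lambda r: r["order_id"])
    return clean, rejected


# Hypothetical incoming batch with a duplicate and two bad records.
incoming = [
    {"order_id": 3, "amount": 40.0},
    {"order_id": 1, "amount": 10.0},
    {"order_id": 3, "amount": 40.0},
    {"order_id": 2, "amount": -5.0},
    {"order_id": "x", "amount": 7.0},
]

clean_rows, rejected_rows = curate(incoming)
```

Without a step like `curate`, all five incoming rows would be stored as they arrived, which is the situation a lake without ahead-of-time checks ends up in.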
Performance
A data warehouse is designed for the fastest query performance. Business users prefer data warehouses so they can generate reports more efficiently. In contrast, data lake architecture prioritizes storage volume and cost over performance. You get a much higher storage volume at a lower cost, and you can still access data at reasonable speeds.
Characteristics | Data Warehouse | Data Lake |
---|---|---|
Data | Relational data from transactional systems, operational databases, and line of business applications | All data, including structured, semi-structured, and unstructured |
Schema | Often designed prior to the data warehouse implementation but also can be written at the time of analysis (schema-on-write or schema-on-read) | Written at the time of analysis (schema-on-read) |
Price/Performance | Fastest query results using local storage | Query results getting faster using low-cost storage and decoupling of compute and storage |
Data quality | Highly curated data that serves as the central version of the truth | Any data that may or may not be curated (i.e., raw data) |
Users | Business analysts, data scientists, and data developers | Business analysts (using curated data), data scientists, data developers, data engineers, and data architects |
Analytics | Batch reporting, BI, and visualizations | Machine learning, exploratory analytics, data discovery, streaming, operational analytics, big data, and profiling |