Data Lake with AWS
Talend's definitions for data lake and data warehouse: "A data lake is a vast pool of raw data, the purpose for which is not yet defined. A data warehouse is a repository for structured, filtered data that has already been processed for a specific purpose."
In the AWS context, a set of services supports a data strategy:
- Amazon Redshift
- S3
- Aurora
- EMR: Launched in 2009, it is a managed service for running Spark, Hadoop, Hive, Presto, HBase... It offers per-second pricing, with savings of 50%-80% when using Amazon EC2 Spot and Reserved Instances.
- DynamoDB
- Athena
- Glue
- OpenSearch
- Lake Formation
- SageMaker
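To make the EMR pricing note concrete, here is a small arithmetic sketch in Python. The node count, hourly rate, and one-minute billing minimum are illustrative assumptions, not quoted AWS prices; only the 50%-80% saving range comes from the list above.

```python
def cluster_cost(node_count, hourly_rate_usd, runtime_seconds,
                 discount=0.0, minimum_seconds=60):
    """Estimate the cost of a cluster that is billed per second.

    All inputs are hypothetical; check current pricing before relying on them.
    """
    # Per-second billing, with an assumed minimum billable duration.
    billable_seconds = max(runtime_seconds, minimum_seconds)
    on_demand = node_count * hourly_rate_usd * billable_seconds / 3600
    return on_demand * (1 - discount)


# Hypothetical job: 10 nodes at $0.36 per node-hour, running for 30 minutes.
on_demand = cluster_cost(10, 0.36, 1800)
low_saving = cluster_cost(10, 0.36, 1800, discount=0.5)
high_saving = cluster_cost(10, 0.36, 1800, discount=0.8)

print(f"on-demand:       ${on_demand:.2f}")
print(f"with 50% saving: ${low_saving:.2f}")
print(f"with 80% saving: ${high_saving:.2f}")
```

Because billing is per second, a 30-minute job pays for half an hour rather than a full hour, and the discount then applies on top of that shorter billable time.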
Big Data
The 5 V's of big data are:
- Volume: data at the terabyte, petabyte, and even exabyte scale.
- Variety: data comes from a wide range of sources and formats.
- Velocity: data needs to be collected, stored, processed, and analyzed within a short period of time.
- Veracity: the data can be trusted. Ensure data integrity across the entire data chain, keeping it secure and free from compromise.
- Value: extract usefulness from the data by querying it and generating reports.
Data engineering capabilities
- Continuous or scheduled data ingestion: ingest petabytes of data with auto-evolving schemas, read from files or streaming sources.
- Declarative ETL pipelines: intent-driven development with automatic data lineage.
- Data quality validation and monitoring: support for defining data quality and integrity controls.
- Fault tolerance and automatic recovery.
- Data pipeline observability: data lineage, data flow diagrams, and job monitoring.
- Support for both batch and streaming processing.
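As an illustration of the data quality validation capability, the following is a minimal sketch of declarative quality controls in plain Python. The `Expectation` class, the rule names, and the record fields (`order_id`, `amount`) are hypothetical and do not belong to any specific AWS service; the idea is only that rules are declared as data and applied uniformly, with failing records set aside together with the reasons.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Expectation:
    """A named rule that a record must satisfy (hypothetical helper)."""
    name: str
    check: Callable[[dict], bool]


# Rules are declared once, separately from the code that applies them.
EXPECTATIONS = [
    Expectation("order_id is present",
                lambda r: r.get("order_id") is not None),
    Expectation("amount is non-negative",
                lambda r: isinstance(r.get("amount"), (int, float))
                and r["amount"] >= 0),
]


def validate(records, expectations):
    """Split records into valid ones and (record, failed rule names) pairs."""
    valid, rejected = [], []
    for record in records:
        failed = [e.name for e in expectations if not e.check(record)]
        if failed:
            rejected.append((record, failed))
        else:
            valid.append(record)
    return valid, rejected


records = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": None, "amount": 5},
    {"order_id": 2, "amount": -1},
]
valid, rejected = validate(records, EXPECTATIONS)
print(f"{len(valid)} valid, {len(rejected)} rejected")
for record, reasons in rejected:
    print(record, "failed:", reasons)
```

Keeping rejected records with the names of the rules they failed supports the monitoring side of the capability: the counts per rule can be tracked over time.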
Lake Formation
AWS Lake Formation is a service that makes it easy to set up a secure data lake in days instead of months. A data lake is a centralized, curated, and secured repository that stores all the data, structured and unstructured, both in its original form and prepared for analytics.
If we already use S3, we typically begin by registering the existing S3 buckets that contain our data. Otherwise, Lake Formation can create new buckets for the data lake and import data into them. It adds its own permissions model, with fine-grained access control down to the column, row, or cell level.
Amazon S3 forms the storage layer for Lake Formation.
AWS Lake Formation is integrated with AWS Glue, which we can use to create a data catalog that describes the available datasets and their appropriate business applications. Lake Formation lets us define policies and control data access with simple “grant and revoke permissions to data” sets at granular levels.
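To show what a column-level grant can look like, here is a minimal Python sketch that builds a request shaped like the parameters of the Lake Formation `GrantPermissions` API. The helper `build_column_grant`, the role ARN, and the database, table, and column names are hypothetical; the snippet only builds the dictionary and does not call AWS.

```python
def build_column_grant(principal_arn, database, table, columns,
                       permissions=("SELECT",)):
    """Build keyword arguments for a column-level Lake Formation grant.

    Hypothetical helper: it only assembles the request, nothing is sent.
    """
    if not columns:
        raise ValueError("list at least one column for a column-level grant")
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": list(permissions),
    }


request = build_column_grant(
    principal_arn="arn:aws:iam::111122223333:role/analyst",  # hypothetical role
    database="sales",                                         # hypothetical names
    table="orders",
    columns=["order_id", "amount"],
)
print(request)
```

With credentials configured, the same dictionary could be passed to boto3 as `boto3.client("lakeformation").grant_permissions(**request)`, and `revoke_permissions(**request)` would withdraw the grant.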
Features
- Ingestion and cleaning
    - AWS Glue
    - Serverless Spark
    - Blueprints: predefined templates to ingest data in one shot or as an incremental load.
    - ML transforms: deduplication
- Security
    - Data catalog
    - Centralized permissions
    - Real-time monitoring
    - Auditing
- Analytics and ML integration
    - Redshift Spectrum
    - EMR
    - Glue
    - Athena
    - QuickSight
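The difference between a one-shot and an incremental load, mentioned under Blueprints, can be sketched in a few lines of Python. This is not how Lake Formation implements blueprints; it is an illustration under the assumption that the source has a monotonically increasing bookmark column (here a hypothetical `id`), so each run only picks up rows beyond the last value seen.

```python
def incremental_load(source_rows, bookmark_column, last_bookmark=None):
    """Return rows newer than the bookmark, plus the updated bookmark.

    With last_bookmark=None this behaves like a one-shot (full) load.
    """
    new_rows = [
        row for row in source_rows
        if last_bookmark is None or row[bookmark_column] > last_bookmark
    ]
    # Keep the previous bookmark when nothing new arrived.
    new_bookmark = max((row[bookmark_column] for row in new_rows),
                       default=last_bookmark)
    return new_rows, new_bookmark


source = [{"id": 1}, {"id": 2}, {"id": 3}]

# First run: no bookmark yet, so everything is ingested.
first_batch, bookmark = incremental_load(source, "id")

# A new row arrives in the source; the next run ingests only that row.
source.append({"id": 4})
second_batch, bookmark = incremental_load(source, "id", bookmark)

print(len(first_batch), len(second_batch), bookmark)
```

Persisting the bookmark between runs is what makes the load incremental; losing it would simply fall back to a full reload.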