What type of infrastructure service stores and manages corporate data and provides capabilities for analyzing the data?

Many Amazon Web Services (AWS) customers require a data storage and analytics solution that offers more agility and flexibility than traditional data management systems. A data lake is a new and increasingly popular way to store and analyze data because it allows companies to manage multiple data types from a wide variety of sources, and store this data, structured and unstructured, in a centralized repository.

The AWS Cloud provides many of the building blocks required to help customers implement a secure, flexible, and cost-effective data lake. These include AWS managed services that help ingest, store, find, process, and analyze both structured and unstructured data. To support our customers as they build data lakes, AWS offers Data Lake on AWS, which deploys a highly available, cost-effective data lake architecture on the AWS Cloud along with a user-friendly console for searching and requesting datasets.

Data Lake on AWS automatically configures the core AWS services necessary to easily tag, search, share, transform, analyze, and govern specific subsets of data across a company or with other external users. The Guidance deploys a console that users can access to search and browse available datasets for their business needs. It also includes a federated template that allows you to launch a version of the solution that is ready to integrate with Microsoft Active Directory.

The diagram below presents the data lake architecture you can build using the example code on GitHub.

What type of infrastructure service stores and manages corporate data and provides capabilities for analyzing the data?

What type of infrastructure service stores and manages corporate data and provides capabilities for analyzing the data?

 Click to enlarge

The code configures a suite of AWS Lambda microservices (functions), Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) for robust search capabilities, Amazon Cognito for user authentication, AWS Glue for data transformation, and Amazon Athena for analysis.

Data Lake on AWS leverages the security, durability, and scalability of Amazon S3 to manage a persistent catalog of organizational datasets, and Amazon DynamoDB to manage corresponding metadata. Once a dataset is cataloged, its attributes and descriptive tags are available to search on. Users can search and browse available datasets in the console, and create a list of data they require access to. It keeps track of the datasets a user selects and generates a manifest file with secure access links to the desired content when the user checks out.

What is a data lake?

Store all your data in one centralized repository at any scale

Learn about data lakes and analytics on AWS

What is a data lake?

A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.

What type of infrastructure service stores and manages corporate data and provides capabilities for analyzing the data?

Why do you need a data lake?

Organizations that successfully generate business value from their data, will outperform their peers. An Aberdeen survey saw organizations who implemented a Data Lake outperforming similar companies by 9% in organic revenue growth. These leaders were able to do new types of analytics like machine learning over new sources like log files, data from click-streams, social media, and internet connected devices stored in the data lake. This helped them to identify, and act upon opportunities for business growth faster by attracting and retaining customers, boosting productivity, proactively maintaining devices, and making informed decisions.

Data Lakes compared to Data Warehouses – two different approaches

Depending on the requirements, a typical organization will require both a data warehouse and a data lake as they serve different needs, and use cases.

A data warehouse is a database optimized to analyze relational data coming from transactional systems and line of business applications. The data structure, and schema are defined in advance to optimize for fast SQL queries, where the results are typically used for operational reporting and analysis. Data is cleaned, enriched, and transformed so it can act as the “single source of truth” that users can trust.

A data lake is different, because it stores relational data from line of business applications, and non-relational data from mobile apps, IoT devices, and social media. The structure of the data or schema is not defined when data is captured. This means you can store all of your data without careful design or the need to know what questions you might need answers for in the future. Different types of analytics on your data like SQL queries, big data analytics, full text search, real-time analytics, and machine learning can be used to uncover insights.

As organizations with data warehouses see the benefits of data lakes, they are evolving their warehouse to include data lakes, and enable diverse query capabilities, data science use-cases, and advanced capabilities for discovering new information models. Gartner names this evolution the “Data Management Solution for Analytics” or “DMSA.”

Characteristics Data Warehouse Data Lake
Data Relational from transactional systems, operational databases, and line of business applications Non-relational and relational from IoT devices, web sites, mobile apps, social media, and corporate applications
Schema Designed prior to the DW implementation (schema-on-write) Written at the time of analysis (schema-on-read)
Price/Performance Fastest query results using higher cost storage Query results getting faster using low-cost storage
Data Quality
Highly curated data that serves as the central version of the truth Any data that may or may not be curated (ie. raw data)
Users Business analysts Data scientists, Data developers, and Business analysts (using curated data)
Analytics Batch reporting, BI and visualizations Machine Learning, Predictive analytics, data discovery and profiling

The essential elements of a Data Lake and Analytics solution

As organizations are building Data Lakes and an Analytics platform, they need to consider a number of key capabilities including:

Data movement

Data Lakes allow you to import any amount of data that can come in real-time. Data is collected from multiple sources, and moved into the data lake in its original format. This process allows you to scale to data of any size, while saving time of defining data structures, schema, and transformations.

Securely store, and catalog data

Data Lakes allow you to store relational data like operational databases and data from line of business applications, and non-relational data like mobile apps, IoT devices, and social media. They also give you the ability to understand what data is in the lake through crawling, cataloging, and indexing of data. Finally, data must be secured to ensure your data assets are protected.

Analytics

Data Lakes allow various roles in your organization like data scientists, data developers, and business analysts to access data with their choice of analytic tools and frameworks. This includes open source frameworks such as Apache Hadoop, Presto, and Apache Spark, and commercial offerings from data warehouse and business intelligence vendors. Data Lakes allow you to run analytics without the need to move your data to a separate analytics system.

Machine Learning

Data Lakes will allow organizations to generate different types of insights including reporting on historical data, and doing machine learning where models are built to forecast likely outcomes, and suggest a range of prescribed actions to achieve the optimal result.

The value of a Data Lake

The ability to harness more data, from more sources, in less time, and empowering users to collaborate and analyze data in different ways leads to better, faster decision making. Examples where Data Lakes have added value include:

Improved customer interactions

A Data Lake can combine customer data from a CRM platform with social media analytics, a marketing platform that includes buying history, and incident tickets to empower the business to understand the most profitable customer cohort, the cause of customer churn, and the promotions or rewards that will increase loyalty.

Improve R&D innovation choices

A data lake can help your R&D teams test their hypothesis, refine assumptions, and assess results—such as choosing the right materials in your product design resulting in faster performance, doing genomic research leading to more effective medication, or understanding the willingness of customers to pay for different attributes.

Increase operational efficiencies

The Internet of Things (IoT) introduces more ways to collect data on processes like manufacturing, with real-time data coming from internet connected devices. A data lake makes it easy to store, and run analytics on machine-generated IoT data to discover ways to reduce operational costs, and increase quality.

The challenges of Data Lakes

The main challenge with a data lake architecture is that raw data is stored with no oversight of the contents. For a data lake to make data usable, it needs to have defined mechanisms to catalog, and secure data. Without these elements, data cannot be found, or trusted resulting in a “data swamp." Meeting the needs of wider audiences require data lakes to have governance, semantic consistency, and access controls.

Deploying Data Lakes in the cloud

Data Lakes are an ideal workload to be deployed in the cloud, because the cloud provides performance, scalability, reliability, availability, a diverse set of analytic engines, and massive economies of scale. ESG research found 39% of respondents considering cloud as their primary deployment for analytics, 41% for data warehouses, and 43% for Spark. The top reasons customers perceived the cloud as an advantage for Data Lakes are better security, faster time to deployment, better availability, more frequent feature/functionality updates, more elasticity, more geographic coverage, and costs linked to actual utilization.

Build your Data Lakes in the cloud on AWS

AWS provides the most secure, scalable, comprehensive, and cost-effective portfolio of services that enable customers to build their data lake in the cloud, analyze all their data, including data from IoT devices with a variety of analytical approaches including machine learning. As a result, there are more organizations running their data lakes and analytics on AWS than anywhere else with customers like NETFLIX, Zillow, NASDAQ, Yelp, iRobot, and FINRA trusting AWS to run their business critical analytics workloads. Learn more.

More resources on the data lake

Learn more about data lakes from industry analysts.

Aberdeen: Angling for Insight in Today's Data Lake

ESG: Embracing a Data-centric Culture Anchored by a Cloud Data Lake
What type of infrastructure service stores and manages corporate data and provides capabilities for analyzing the data?

451: The Cloud-Based Approach to Achieving Business Value From Big Data
What type of infrastructure service stores and manages corporate data and provides capabilities for analyzing the data?

A Perfect Match: Big Data and the Cloud in Action
What type of infrastructure service stores and manages corporate data and provides capabilities for analyzing the data?

Get started with AWS

What type of infrastructure service stores and manages corporate data and provides capabilities for analyzing the data?

Sign up for an AWS account

Instantly get access to the AWS Free Tier

What type of infrastructure service stores and manages corporate data and provides capabilities for analyzing the data?

Build a secure data lake in days

Read about AWS Lake Formation

What type of infrastructure service stores and manages corporate data and provides capabilities for analyzing the data?

Start building with AWS

Read about deploying data lakes on AWS

Get started with data lakes on AWS

Learn about Data Lakes and Analytics on AWS Deploy a data lake with AWS Quick Starts

Have more questions?

Contact us