Well-Architected Framework
The AWS Well-Architected Framework is a set of principles for designing and operating workloads in the cloud.
These principles are documented as 5 pillars:
Operational Excellence
Security
Cost Optimization
Reliability
Performance Efficiency
General Design Principles
Stop guessing capacity needs - scale up and down as required.
Automate everything - automated systems ensure consistency and reliability.
Test at scale - test an accurate replica of production on-demand.
Adapt and evolve - adapt the architecture as needed to meet new challenges.
Be data driven - drive decisions through data.
Game days - practice, practice, practice.
The Five Pillars
https://aws.amazon.com/blogs/apn/the-5-pillars-of-the-aws-well-architected-framework/
Operational Excellence
Does your architecture work? Will it continue to work?
There are six design principles for operational excellence in the cloud:
Perform operations as code
Annotate documentation
Make frequent, small, reversible changes
Refine operations procedures frequently
Anticipate failure
Learn from all operational failures (and successes)
Prioritize to align with business priorities
What is the business goal?
What are the critical pieces needed to meet that goal?
Any compliance restrictions/requirements?
Dependencies between services?
Design your architecture to support business priorities
Is the design observable?
Is the entire design code? Can it be redeployed in the event of a failure?
Are your logs and observations actionable? Can you derive value from the data you're collecting?
Is your workload ready to go live?
Are your processes consistent?
Is operational code properly managed?
Are tests in place?
Are you anticipating failure?
Ensure your workloads are actually working
Metrics indicate health of each service
Metrics show overall health
Are you monitoring business metrics too?
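One way to make the per-service and overall health checks above concrete is a small rollup like the following. This is a minimal sketch, not a prescribed method: the `ServiceMetrics` fields, the thresholds, and the service names are all hypothetical, and in practice the numbers would come from whatever monitoring system you use.

```python
from dataclasses import dataclass


@dataclass
class ServiceMetrics:
    """Hypothetical per-service snapshot; field names are illustrative."""
    name: str
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float  # 99th percentile latency in milliseconds


def service_is_healthy(m: ServiceMetrics,
                       max_error_rate: float = 0.01,
                       max_p99_ms: float = 500.0) -> bool:
    """A service is healthy when both metrics sit within their thresholds."""
    return m.error_rate <= max_error_rate and m.p99_latency_ms <= max_p99_ms


def overall_health(services: list[ServiceMetrics]) -> dict:
    """Roll per-service health up into one workload-level summary."""
    unhealthy = [s.name for s in services if not service_is_healthy(s)]
    return {"healthy": not unhealthy, "unhealthy_services": unhealthy}


# Made-up sample data: "search" exceeds the assumed 1% error-rate threshold.
snapshot = [
    ServiceMetrics("checkout", error_rate=0.002, p99_latency_ms=310.0),
    ServiceMetrics("search", error_rate=0.040, p99_latency_ms=180.0),
]
print(overall_health(snapshot))
# → {'healthy': False, 'unhealthy_services': ['search']}
```

A business metric (orders per minute, for example) could be added as another field with its own threshold, so that technical and business health are reported together.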
Responding to events
Anticipate planned and unplanned events
Respond in code
Connect observations with 3rd party tools as needed
Learn from success or failure
Post-event, have runbooks changed?
Are teams evaluating their processes?
Test assumptions
Experiment early and often to find better solutions
Cost Optimization
Spend only what you have to. Deliver business value for the lowest price point.
There are five design principles for cost optimization in the cloud:
Adopt a consumption model
Measure overall efficiency
Stop spending money on data center operations
Analyze and attribute expenditure
Use managed services to reduce cost of ownership
Use the appropriate resources and configurations
Provision for current needs with an eye to the future
"Right size" to the smallest resource that meets the needs
Use data to choose purchase options
Optimize by geography
Default to managed services
Optimize data transfer
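The "right size" idea above can be sketched as picking the cheapest option that still meets the workload's requirements. The catalog below is entirely hypothetical (the names, sizes, and prices are made up for illustration and are not real instance types or real pricing).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InstanceOption:
    """Hypothetical catalog entry; names and prices are made up."""
    name: str
    vcpus: int
    memory_gib: float
    hourly_cost: float


def right_size(options, needed_vcpus, needed_memory_gib) -> Optional[InstanceOption]:
    """Return the cheapest option that satisfies both requirements, or None."""
    candidates = [o for o in options
                  if o.vcpus >= needed_vcpus and o.memory_gib >= needed_memory_gib]
    return min(candidates, key=lambda o: o.hourly_cost, default=None)


catalog = [
    InstanceOption("small", 2, 4.0, 0.05),
    InstanceOption("medium", 4, 8.0, 0.10),
    InstanceOption("large", 8, 16.0, 0.20),
]
choice = right_size(catalog, needed_vcpus=3, needed_memory_gib=6.0)
print(choice.name)  # → medium
```

The requirements themselves should come from measured utilization rather than guesses, which is where "use data to choose" comes in.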
Matching supply and demand
Know how much you're spending and where
Understand your stakeholders
Implement a governance model
Attribute cost to teams/projects
Tag AWS resources
Track lifecycle of the resources
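Attributing cost to teams through tags can be illustrated with a short aggregation. This is a sketch under assumptions: the `line_items` structure and the `team` tag key are hypothetical and do not reflect the format of any particular billing export.

```python
from collections import defaultdict


def cost_by_tag(line_items, tag_key):
    """Sum cost per value of `tag_key`; untagged resources land in 'untagged'."""
    totals = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, "untagged")
        totals[owner] += item["cost"]
    return dict(totals)


# Made-up billing rows for illustration.
line_items = [
    {"resource": "db-1", "cost": 120.0, "tags": {"team": "payments"}},
    {"resource": "web-1", "cost": 40.0, "tags": {"team": "storefront"}},
    {"resource": "web-2", "cost": 40.0, "tags": {"team": "storefront"}},
    {"resource": "tmp-vm", "cost": 15.5, "tags": {}},
]
print(cost_by_tag(line_items, "team"))
# → {'payments': 120.0, 'storefront': 80.0, 'untagged': 15.5}
```

Keeping an explicit "untagged" bucket makes gaps in tagging visible, which helps when enforcing a governance model.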
Continuously work to maximize value delivered
Align utilization with requirements
Report and validate findings
Evaluate new services for value
Continue to push for managed services, if they're cost-effective
Reliability
There are five design principles for reliability in the cloud:
Test recovery procedures
Automatically recover from failure
Scale horizontally to increase aggregate system availability
Stop guessing capacity, reduce idle resources
Manage change in automation
Will this system work consistently and recover quickly?
Recover from issues automatically
Scale horizontally first for resiliency
Reduce idle resources
Manage change through automation
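As one small illustration of recovering from issues automatically, the sketch below retries a failing call with exponential backoff. It is a common pattern rather than something the framework prescribes; `call_with_retries`, `flaky`, and the delay values are hypothetical names and numbers chosen for the example.

```python
import time


def call_with_retries(operation, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call `operation`, retrying on any exception with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...


attempts = {"count": 0}


def flaky():
    """Hypothetical operation that fails twice before succeeding."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


# Record the delays instead of sleeping so the example runs instantly.
delays = []
result = call_with_retries(flaky, sleep=delays.append)
print(result, delays)  # → ok [0.5, 1.0]
```

A production version would usually retry only on errors known to be transient and add random jitter to the delays.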
Understand the default and requested limits
Are you planning beyond current limits for a resource?
Will you scale past specific resource limits?
Can those limits be lifted?
Can you plan around those limits?
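The questions above amount to comparing current usage against each limit and reacting before the limit is hit. A minimal sketch, assuming you already have usage and limit numbers in hand (the resource names and values below are made up, and limits are assumed to be positive):

```python
def limits_at_risk(usage, limits, threshold=0.8):
    """Return {resource: utilization} for resources at or above `threshold`."""
    at_risk = {}
    for resource, limit in limits.items():
        used = usage.get(resource, 0)
        utilization = used / limit  # assumes limit > 0
        if utilization >= threshold:
            at_risk[resource] = round(utilization, 2)
    return at_risk


# Hypothetical limits and current usage.
limits = {"instances": 20, "elastic_ips": 5, "vpcs": 5}
usage = {"instances": 17, "elastic_ips": 2, "vpcs": 5}
print(limits_at_risk(usage, limits))
# → {'instances': 0.85, 'vpcs': 1.0}
```

Anything this returns is a candidate for a limit increase request or for a design change that avoids the limit.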
Networking
IP address space management (are you considering IPv6?)
Subnet structures
Resilient topologies
Ability to handle sudden increase in traffic
Provide consistent performance (latency) regardless of load
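IP address space and subnet planning can be checked before anything is deployed. The sketch below uses Python's standard `ipaddress` module; the CIDR ranges and the helper names `plan_subnets` and `overlaps_any` are assumptions for illustration, and all ranges are assumed to be the same IP version.

```python
import ipaddress
from itertools import islice


def plan_subnets(vpc_cidr, new_prefix, count):
    """Carve the first `count` equally sized subnets out of `vpc_cidr`."""
    network = ipaddress.ip_network(vpc_cidr)
    subnets = list(islice(network.subnets(new_prefix=new_prefix), count))
    if len(subnets) < count:
        raise ValueError(f"{vpc_cidr} cannot hold {count} /{new_prefix} subnets")
    return subnets


def overlaps_any(candidate_cidr, existing_cidrs):
    """True when `candidate_cidr` overlaps any CIDR already in use."""
    candidate = ipaddress.ip_network(candidate_cidr)
    return any(candidate.overlaps(ipaddress.ip_network(c)) for c in existing_cidrs)


subnets = plan_subnets("10.0.0.0/16", new_prefix=20, count=3)
print([str(s) for s in subnets])
# → ['10.0.0.0/20', '10.0.16.0/20', '10.0.32.0/20']
print(overlaps_any("10.0.16.0/24", ["10.0.0.0/20", "10.0.32.0/20"]))
# → False
```

The same functions accept IPv6 ranges, so an IPv6 plan can be validated the same way.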
Ensure your application is ready for business use
Can users access your application?
Can you deploy without issues?
Can you defer fixing an issue until planned downtime?
Can your application withstand partial outages?
Performance Efficiency
There are five design principles for performance efficiency in the cloud:
Democratize advanced technologies
Go global in minutes
Use serverless architectures
Experiment more often
Mechanical sympathy
Remove bottlenecks, reduce waste
Let AWS do the work whenever possible
Reduce latency through regions and AWS Edge
Serverless whenever possible, then containers, and only then fall back to instances
Experiment as new services are released
Think about the user, not your tech stack
Is this the optimal solution for this workload?
What type of compute best suits?
Which data store is ideal for this workload?
Does your network design complement compute and data store choices?
Continuously ensure choices work for your workloads
Is infrastructure stored as code?
Are deployments simple and automated?
Can benchmarks be taken automatically?
Does load testing interfere with production?
Monitoring
Use active and passive monitoring where appropriate
Understand the 5 phases of monitoring - generation, aggregation, real-time processing, storage, analytics
Create actionable metrics
Security
There are six design principles for security in the cloud:
Implement a strong identity foundation
Enable traceability
Apply security at all layers
Automate security best practices
Protect data in transit and at rest
Prepare for security events
Does this system work only as intended?
Identities have the least privileges required
Know who did what and when
Security is woven into the fabric of the system
Automate security tasks
Encrypt all data at rest and in transit
Prepare for the worst
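Least privilege is easier to enforce when it is checked automatically. As a rough sketch, the function below scans a policy document in the IAM JSON policy layout (`Statement`, `Effect`, `Action`) and flags Allow statements that grant wildcard actions. The function name and the sample policy are hypothetical, and the check is deliberately narrow: it ignores `NotAction`, conditions, and wildcard resources.

```python
def find_wildcard_grants(policy):
    """Return (statement_index, action) pairs for Allow statements with wildcard actions."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may appear without a list
        statements = [statements]
    findings = []
    for index, statement in enumerate(statements):
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # Action may be a string or a list
            actions = [actions]
        for action in actions:
            if "*" in action:
                findings.append((index, action))
    return findings


# Made-up policy for illustration.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::example-bucket/*"},
        {"Effect": "Allow", "Action": ["ec2:*", "iam:PassRole"], "Resource": "*"},
    ],
}
print(find_wildcard_grants(policy))  # → [(1, 'ec2:*')]
```

A check like this could run in a deployment pipeline so that overly broad grants are caught before they reach production.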
Look for abnormal behavior in your logs
Capture and analyze logs
Regularly audit controls and configurations (AWS CloudFormation drift, AWS Config)
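Looking for abnormal behavior can start with something as simple as flagging counts that sit far from the average. The sketch below applies a basic standard-deviation test to hourly event counts; the data, the function name, and the threshold of 2.0 are assumptions for illustration, and real detection usually needs something more robust to seasonality and to outliers skewing the mean.

```python
from statistics import mean, pstdev


def find_anomalies(counts, threshold=2.0):
    """Return indexes whose count is more than `threshold` std devs from the mean."""
    if len(counts) < 2:
        return []
    mu = mean(counts)
    sigma = pstdev(counts)
    if sigma == 0:  # all values identical: nothing stands out
        return []
    return [i for i, c in enumerate(counts) if abs(c - mu) / sigma > threshold]


# Made-up counts of failed logins per hour, extracted from logs.
failed_logins_per_hour = [4, 6, 5, 3, 5, 4, 6, 5, 48, 5]
print(find_anomalies(failed_logins_per_hour))  # → [8]
```

The flagged hour is a starting point for investigation, not proof of an incident.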
Defense in depth
Establish trust boundaries
Protect the network in/out
Protect all hosts
Configure services to meet security posture needs
Enforce service level protection
Classify and protect data
How sensitive is the data?
Who should have access to the data and when?
Encrypt in transit and at rest
Back up your data and test the backups
Contain and recover from an unplanned event
Do you have a plan to tag affected resources?
Can you adjust permissions to allow for containment?
Can you redeploy to recover quickly?
Did you learn from the incident and adjust?