Well-Architected Framework

  • The AWS Well-Architected Framework is a set of design principles and best practices for building and running workloads in the cloud.

  • These principles are organized into five pillars:

    • Operational Excellence

    • Security

    • Cost Optimization

    • Reliability

    • Performance Efficiency

General Design Principles

  • Stop guessing capacity needs - scale up and down as required

  • Automate everything - automated systems ensure consistency and reliability.

  • Test at scale - test an accurate replica of production on demand.

  • Adapt and evolve - adapt the architecture as needed to meet new challenges.

  • Be data driven - drive decisions through data.

  • Game days - practice, practice, practice.

The Five Pillars

https://aws.amazon.com/blogs/apn/the-5-pillars-of-the-aws-well-architected-framework/

  • Operational Excellence

    • Does your architecture work? Will it continue to work?

    • There are six design principles for operational excellence in the cloud:

      • Perform operations as code

      • Annotate documentation

      • Make frequent, small, reversible changes

      • Refine operations procedures frequently

      • Anticipate failure

      • Learn from all operational failures (and successes)

    • Prioritize to align with business priorities

      • What is the business goal?

      • What are the critical pieces needed to meet that goal?

      • Any compliance restrictions/requirements?

      • Dependencies between services?

    • Design your architecture to support business priorities

      • Is the design observable?

      • Is the entire design code? Can it be redeployed in the event of a failure?

      • Are your logs and observations actionable? Can you derive value from the data you're collecting?

    • Is your workload ready to go live?

      • Are your processes consistent?

      • Is operational code properly managed?

      • Are tests in place?

      • Are you anticipating failure?

    • Ensure your workloads are actually working

      • Metrics indicate health of each service

      • Metrics show overall health

      • Are you monitoring business metrics too?
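
The relationship between per-service metrics and overall health can be sketched as a simple roll-up. This is only an illustration: the `ServiceMetrics` type, the service names, and the 5% error-rate threshold are assumptions, not something the framework prescribes.

```python
from dataclasses import dataclass


@dataclass
class ServiceMetrics:
    name: str
    requests: int
    errors: int


# Hypothetical threshold: a service is unhealthy above a 5% error rate.
MAX_ERROR_RATE = 0.05


def is_healthy(m: ServiceMetrics) -> bool:
    """A service with no traffic is treated as healthy in this sketch."""
    if m.requests == 0:
        return True
    return m.errors / m.requests <= MAX_ERROR_RATE


def overall_health(services: list[ServiceMetrics]) -> dict:
    """Roll per-service health up into one overall view."""
    unhealthy = [m.name for m in services if not is_healthy(m)]
    return {"healthy": not unhealthy, "unhealthy_services": unhealthy}


services = [
    ServiceMetrics("checkout", requests=1000, errors=10),  # 1% errors
    ServiceMetrics("search", requests=500, errors=50),     # 10% errors
]
report = overall_health(services)
```

A business metric (orders per minute, for example) could be added to the same report so that technical and business health are read side by side.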

    • Responding to events

      • Anticipate planned and unplanned events

      • Respond in code

      • Connect observations with 3rd party tools as needed
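
"Respond in code" can be pictured as a runbook expressed as a dispatch table. The sketch below is a minimal illustration; the event types, handler names, and returned strings are hypothetical, and a real handler would call automation tooling instead of returning text.

```python
def restart_service(event):
    """Hypothetical response: restart the affected service."""
    return f"restart {event['service']}"


def scale_out(event):
    """Hypothetical response: add capacity to the affected service."""
    return f"scale out {event['service']}"


# Hypothetical mapping from event type to an automated response.
RUNBOOK = {"service_down": restart_service, "high_load": scale_out}


def respond(event):
    """Run the matching handler, or escalate when no runbook entry exists."""
    handler = RUNBOOK.get(event["type"])
    if handler is None:
        return f"escalate {event['type']} to on-call"
    return handler(event)
```

Keeping the mapping in code means a post-event review can change the response through the same review and deployment process as any other change.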

    • Learn from success or failure

      • Post-event, have runbooks changed?

      • Are teams evaluating their processes?

      • Test assumptions

      • Experiment early and often to find better solutions

  • Cost Optimization

    • Spend only what you have to. Deliver business value at the lowest price point.

    • There are five design principles for cost optimization in the cloud:

      • Adopt a consumption model

      • Measure overall efficiency

      • Stop spending money on data center operations

      • Analyze and attribute expenditure

      • Use managed services to reduce cost of ownership

    • Use the appropriate resources and configurations

      • Provision for current needs with an eye to the future

      • "Right size" to the smallest resource that meets the need

      • Use data to choose purchase options

      • Optimize by geography

      • Default to managed services

      • Optimize data transfer
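
Right-sizing can be sketched as picking the cheapest option that still meets the requirement. The catalog below is entirely hypothetical (the size names, capacities, and prices are made up for illustration and are not real AWS offerings or prices).

```python
# Hypothetical size catalog: (name, vCPUs, memory GiB, hourly price).
CATALOG = [
    ("small", 2, 4, 0.05),
    ("medium", 4, 8, 0.10),
    ("large", 8, 16, 0.20),
]


def right_size(needed_vcpus, needed_memory_gib, catalog=CATALOG):
    """Return the cheapest catalog entry that satisfies both requirements."""
    candidates = [
        entry for entry in catalog
        if entry[1] >= needed_vcpus and entry[2] >= needed_memory_gib
    ]
    if not candidates:
        return None  # nothing fits; scale out or revisit the requirements
    return min(candidates, key=lambda entry: entry[3])[0]
```

For example, `right_size(3, 6)` selects `"medium"` from this catalog, because `"small"` does not meet either requirement and `"large"` costs more.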

    • Matching supply and demand

    • Know how much you're spending and where

      • Understand your stakeholders

      • Implement a governance model

      • Attribute cost to teams/projects

      • Tag AWS resources

      • Track lifecycle of the resources
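
Attributing cost through tags can be sketched as a simple aggregation. The line items, the `team` tag key, and the `UNTAGGED` bucket below are assumptions for illustration; real data would come from a billing or cost report.

```python
from collections import defaultdict

# Hypothetical billing line items; real data would come from a cost report.
line_items = [
    {"resource": "db-1", "cost": 120.0, "tags": {"team": "payments"}},
    {"resource": "web-1", "cost": 80.0, "tags": {"team": "storefront"}},
    {"resource": "web-2", "cost": 40.0, "tags": {"team": "storefront"}},
    {"resource": "tmp-7", "cost": 15.0, "tags": {}},  # untagged resource
]


def cost_by_tag(items, tag_key):
    """Sum cost per tag value; untagged spend is grouped under 'UNTAGGED'."""
    totals = defaultdict(float)
    for item in items:
        owner = item["tags"].get(tag_key, "UNTAGGED")
        totals[owner] += item["cost"]
    return dict(totals)


totals = cost_by_tag(line_items, "team")
```

Surfacing the `UNTAGGED` total explicitly makes gaps in the tagging policy visible instead of hiding them in an overall figure.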

    • Continuously work to maximize value delivered

      • Align utilization with requirements

      • Report and validate findings

      • Evaluate new services for value

      • Continue to push for managed services when they're cost-effective

  • Reliability

    • There are five design principles for reliability in the cloud:

      • Test recovery procedures

      • Automatically recover from failure

      • Scale horizontally to increase aggregate system availability

      • Stop guessing capacity, reduce idle resources

      • Manage change in automation

    • Will this system work consistently and recover quickly?

      • Recover from issues automatically

      • Scale horizontally first for resiliency

      • Reduce idle resources

      • Manage change through automation
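
One small, common building block of automatic recovery is retrying a transient failure with exponential backoff. The sketch below is one possible approach, not the framework's prescribed mechanism; the function name, attempt count, and delays are assumptions.

```python
import time


def call_with_retries(operation, attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call operation(); on failure wait base_delay * 2**n and try again."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is injectable so the behavior can be tested without waiting. Production code would usually also add jitter and catch only errors known to be transient.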

    • Understand the default and requested limits

      • Are you planning beyond current limits for a resource?

      • Will you scale past specific resource limits?

      • Can those limits be lifted?

      • Can you plan around those limits?
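
The questions above amount to comparing projected usage with a limit while keeping some headroom. A minimal sketch follows, assuming a hypothetical 20% headroom policy and made-up return labels.

```python
def limit_check(current, projected_growth, limit, headroom=0.2):
    """Classify projected usage against a limit.

    headroom is the fraction of the limit kept in reserve (an assumption).
    """
    projected = current + projected_growth
    if projected > limit:
        return "exceeds limit"
    if projected > limit * (1 - headroom):
        return "request increase"
    return "ok"
```

With a limit of 100, a projection of 85 returns `"request increase"`, which leaves time to ask for the limit to be raised or to plan around it before it is reached.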

    • Networking

      • IP address space management (are you considering IPv6?)

      • Subnet structures

      • Resilient topologies

      • Ability to handle sudden increase in traffic

      • Provide consistent performance (latency) regardless of load
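
IP address space and subnet planning can be explored with Python's standard `ipaddress` module before anything is provisioned. The `10.0.0.0/16` block, the `/20` subnet size, and the zone labels below are hypothetical choices for illustration.

```python
import ipaddress

# Hypothetical address plan: one /16 block split into /20 subnets.
vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = list(vpc.subnets(new_prefix=20))

# Assign the first subnets to hypothetical zones "a", "b", "c".
plan = {zone: str(net) for zone, net in zip(["a", "b", "c"], subnets)}

# Remaining subnets stay unallocated as room for growth.
spare = len(subnets) - len(plan)
```

Leaving spare subnets unallocated is one way to keep room for new zones or tiers without renumbering later.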

    • Ensure your application is ready for business use

      • Can users access your application?

      • Deploy without an issue

      • Can you defer an issue to a planned downtime window?

      • Can your application withstand partial outages?

  • Performance Efficiency

    • There are five design principles for performance efficiency in the cloud:

      • Democratize advanced technologies

      • Go global in minutes

      • Use serverless architectures

      • Experiment more often

      • Mechanical sympathy

    • Remove bottlenecks, reduce waste

      • Let AWS do the work whenever possible

      • Reduce latency through regions and AWS Edge

      • Serverless whenever possible, then containers, and only then fall back to instances

      • Experiment as new services are released

      • Think about the user, not your tech stack

    • Is this the optimal solution for this workload?

      • What type of compute best suits?

      • Which data store is ideal for this workload?

      • Does your network design complement compute and data store choices?

    • Continuously ensure choices work for your workloads

      • Is infrastructure stored as code?

      • Are deployments simple and automated?

      • Can benchmarks be taken automatically?

      • Does load testing interfere with production?

    • Monitoring

      • Use active and passive monitoring where appropriate

      • Understand the 5 phases of monitoring - generation, aggregation, real-time processing, storage, analytics

      • Create actionable metrics
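
The five monitoring phases listed above can be walked through with a toy in-memory pipeline. Everything here is an assumption made for illustration: the latency samples, the 300 ms alert threshold, and the dict standing in for a metrics store.

```python
from collections import defaultdict
from statistics import mean

# 1. Generation: hypothetical raw latency samples (service, milliseconds).
samples = [("api", 120), ("api", 80), ("db", 15), ("api", 400), ("db", 25)]

# 2. Aggregation: group samples by service.
by_service = defaultdict(list)
for service, latency_ms in samples:
    by_service[service].append(latency_ms)

# 3. Real-time processing: flag any sample above a hypothetical threshold.
ALERT_MS = 300
alerts = [(s, ms) for s, ms in samples if ms > ALERT_MS]

# 4. Storage: keep the aggregated data (a dict stands in for a real store).
store = {service: list(values) for service, values in by_service.items()}

# 5. Analytics: derive a summary metric from the stored data.
average_ms = {service: mean(values) for service, values in store.items()}
```

The alert in step 3 is actionable because it names a specific service and sample. The average in step 5 is better suited to trend analysis than to paging someone.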

  • Security

    • There are six design principles for security in the cloud:

      • Implement a strong identity foundation

      • Enable traceability

      • Apply security at all layers

      • Automate security best practices

      • Protect data in transit and at rest

      • Prepare for security events

    • Does this system work only as intended?

      • Identities have the least privileges required

      • Know who did what and when

      • Security is woven into the fabric of the system

      • Automate security tasks

      • Encrypt all data at rest and in transit

      • Prepare for the worst
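
Least privilege can be partly checked in code by scanning policy documents for overly broad grants. The sketch below assumes the standard IAM JSON policy layout (`Statement`, `Effect`, `Action`, `Resource`) and only flags a bare `*`. The example policy and bucket name are hypothetical, and this is far from a complete policy analyzer.

```python
def wildcard_statements(policy):
    """Return Allow statements whose Action or Resource is a bare '*'."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be an object
        statements = [statements]

    def as_list(value):
        return value if isinstance(value, list) else [value]

    flagged = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = as_list(stmt.get("Action", []))
        resources = as_list(stmt.get("Resource", []))
        if "*" in actions or "*" in resources:
            flagged.append(stmt)
    return flagged


# Hypothetical policy document for illustration.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::example-bucket/*"},
        {"Effect": "Allow", "Action": "*", "Resource": "*"},
    ],
}
flagged = wildcard_statements(policy)
```

A check like this could run in a deployment pipeline so that broad grants are caught before they reach an account, which also fits the "automate security tasks" point.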

    • Look for abnormal behavior in your logs

      • Capture and analyze logs

      • Regularly audit controls and configurations (AWS CloudFormation drift, AWS Config)
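
Looking for abnormal behavior can start with something as simple as counting failures per identity. The event shape, user names, and threshold of three failures below are assumptions; real audit logs carry many more fields and usually need time windows as well.

```python
from collections import Counter

# Hypothetical, simplified audit events (real logs carry many more fields).
events = (
    [{"user": "mallory", "ok": False}] * 5
    + [{"user": "alice", "ok": False}]
    + [{"user": "alice", "ok": True}] * 3
)


def suspicious_users(events, max_failures=3):
    """Return users with more failed events than max_failures, sorted."""
    failures = Counter(e["user"] for e in events if not e["ok"])
    return sorted(user for user, count in failures.items() if count > max_failures)


flagged_users = suspicious_users(events)
```

Here only `mallory` is flagged: a single failure from `alice` stays below the threshold.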

    • Defense in depth

      • Establish trust boundaries

      • Protect the network in/out

      • Protect all hosts

      • Configure services to meet security posture needs

      • Enforce service level protection

    • Classify and protect data

      • How sensitive is the data?

      • Who should have access to the data and when?

      • Encrypt in transit and at rest

      • Back up your data, test backups

    • Contain and recover from an unplanned event

      • Do you have a plan to tag affected resources?

      • Can you adjust permissions to allow for containment?

      • Can you redeploy to recover quickly?

      • Did you learn from the incident and adjust?