# Well-Architected Framework

* The AWS Well-Architected Framework is a set of principles for designing and operating workloads in the cloud.
* These principles are organized into five pillars:
  * Operational Excellence
  * Security
  * Cost Optimization
  * Reliability
  * Performance Efficiency

### General Design Principles

* **Stop guessing capacity needs** - scale up and down as required.
* **Automate everything** - automated systems ensure consistency and reliability.
* **Test at scale** - test an accurate replica of production on-demand.
* **Adapt and evolve** - adapt the architecture as needed to meet new challenges.
* **Be data driven** - drive decisions through data.
* **Game days** - practice, practice, practice.
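
As a rough illustration of "stop guessing capacity needs", the sketch below sizes a fleet from an observed per-instance metric and a target value. It is a simplified proportional rule, not the exact algorithm of any AWS scaling policy; the function name, the metric values, and the `min_size`/`max_size` defaults are assumptions made for the example.

```python
import math


def desired_capacity(current_instances: int, observed_metric: float,
                     target_metric: float, min_size: int = 1,
                     max_size: int = 20) -> int:
    """Proportional scaling rule: resize so the per-instance metric lands near the target."""
    if current_instances < 1 or target_metric <= 0:
        raise ValueError("need at least one instance and a positive target")
    # Scale the fleet by the ratio of observed load to target load,
    # rounding up so the fleet is never undersized.
    raw = math.ceil(current_instances * observed_metric / target_metric)
    # Clamp to the configured bounds (hypothetical defaults).
    return max(min_size, min(max_size, raw))


if __name__ == "__main__":
    # 4 instances at 90% CPU against a 60% target -> scale out.
    print(desired_capacity(4, observed_metric=90.0, target_metric=60.0))
    # 4 instances at 15% CPU against a 60% target -> scale in.
    print(desired_capacity(4, observed_metric=15.0, target_metric=60.0))
```

The same rule scales in both directions, so capacity follows demand instead of a fixed up-front estimate.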

### The Five Pillars

<https://aws.amazon.com/blogs/apn/the-5-pillars-of-the-aws-well-architected-framework/>

* **Operational Excellence**
  * Does your architecture work? Will it continue to work?
  * There are six design principles for operational excellence in the cloud:
    * Perform operations as code
    * Annotate documentation
    * Make frequent, small, reversible changes
    * Refine operations procedures frequently
    * Anticipate failure
    * Learn from all operational failures (and successes)
  * Prioritize to align with business priorities
    * What is the business goal?
    * What are the critical pieces needed to meet that goal?
    * Any compliance restrictions/requirements?
    * Dependencies between services?
  * Design your architecture to support business priorities
    * Is the design observable?
    * Is the entire design code? Can it be redeployed in the event of a failure?
    * Are your logs and observations actionable? Can you derive value from the data you're collecting?
  * Is your workload ready to go live?
    * Are your processes consistent?
    * Is operational code properly managed?
    * Are tests in place?
    * Are you anticipating failure?
  * Ensure your workloads are actually working
    * Metrics indicate health of each service
    * Metrics show overall health
    * Are you monitoring business metrics too?
  * Responding to events
    * Anticipate planned and unplanned events
    * Respond in code
    * Connect observations with 3rd party tools as needed
  * Learn from success or failure
    * Post-event, have runbooks changed?
    * Are teams evaluating their processes?
    * Test assumptions
    * Experiment early and often to find better solutions
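
The idea that metrics should indicate the health of each service as well as overall health can be sketched as a small roll-up function. This is a minimal illustration, not an AWS API: the `ServiceMetrics` type, the thresholds, and the three status labels are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class ServiceMetrics:
    name: str
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float   # 95th percentile latency in milliseconds


def is_healthy(m: ServiceMetrics, max_error_rate: float = 0.01,
               max_p95_ms: float = 500.0) -> bool:
    """A service is healthy when both metrics are within their (hypothetical) thresholds."""
    return m.error_rate <= max_error_rate and m.p95_latency_ms <= max_p95_ms


def overall_health(services: list[ServiceMetrics]) -> tuple[str, list[str]]:
    """Roll per-service health up into one status plus the list of unhealthy services."""
    unhealthy = [s.name for s in services if not is_healthy(s)]
    if not unhealthy:
        status = "healthy"
    elif len(unhealthy) < len(services):
        status = "degraded"
    else:
        status = "down"
    return status, unhealthy


if __name__ == "__main__":
    fleet = [
        ServiceMetrics("api", error_rate=0.002, p95_latency_ms=180.0),
        ServiceMetrics("checkout", error_rate=0.05, p95_latency_ms=220.0),
    ]
    print(overall_health(fleet))
```

Returning the names of the unhealthy services alongside the status keeps the observation actionable: an alert can point at the failing service rather than only reporting that something is wrong.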

* **Cost Optimization**
  * Spend only what you have to. Deliver business value for the lowest price point.
  * There are five design principles for cost optimization in the cloud:
    * Adopt a consumption model
    * Measure overall efficiency
    * Stop spending money on data center operations
    * Analyze and attribute expenditure
    * Use managed services to reduce cost of ownership
  * Use the appropriate resources and configurations
    * Provision for current needs with an eye to the future
    * "Right size" to the smallest resource that meets the need
    * Use data to choose purchase options
    * Optimize by geography
    * Default to managed services
    * Optimize data transfer
  * Matching supply and demand
  * Know how much you're spending and where
    * Understand your stakeholders
    * Implement a governance model
    * Attribute cost to teams/projects
    * Tag AWS resources
    * Track lifecycle of the resources
  * Continuously work to maximize value delivered
    * Align utilization with requirements
    * Report and validate findings
    * Evaluate new services for value
    * Continue to push for managed services, if they're cost-effective
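
Attributing cost to teams through resource tags can be sketched with plain data. The example below assumes billing line items have already been exported into dictionaries; the field names (`resource_id`, `cost`, `tags`), the `team` tag key, and the `untagged` bucket are hypothetical and do not reflect any particular AWS report format.

```python
from collections import defaultdict
from decimal import Decimal


def attribute_costs(line_items, tag_key="team"):
    """Sum cost per value of tag_key; resources without the tag go to 'untagged'."""
    totals = defaultdict(Decimal)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, "untagged")
        # Decimal avoids floating-point rounding drift when summing money.
        totals[owner] += Decimal(item["cost"])
    return dict(totals)


if __name__ == "__main__":
    items = [
        {"resource_id": "i-aaa", "cost": "12.50", "tags": {"team": "search"}},
        {"resource_id": "i-bbb", "cost": "7.25", "tags": {"team": "search"}},
        {"resource_id": "vol-ccc", "cost": "3.00", "tags": {}},
    ]
    for owner, total in attribute_costs(items).items():
        print(owner, total)
```

A growing `untagged` total is a useful governance signal in its own right, since it measures spend that nobody currently owns.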

* **Reliability**
  * There are five design principles for reliability in the cloud:
    * Test recovery procedures
    * Automatically recover from failure
    * Scale horizontally to increase aggregate system availability
    * Stop guessing capacity, reduce idle resources
    * Manage change in automation
  * Will this system work consistently and recover quickly?
    * Recover from issues automatically
    * Scale horizontally first for resiliency
    * Reduce idle resources
    * Manage change through automation
  * Understand the default and requested limits
    * Are you planning beyond current limits for a resource?
    * Will you scale past specific resource limits?
    * Can those limits be lifted?
    * Can you plan around those limits?
  * Networking
    * IP address space management (are you considering IPv6?)
    * Subnet structures
    * Resilient topologies
    * Ability to handle sudden increase in traffic
    * Provide consistent performance (latency) regardless of load
  * Ensure your application is ready for business use
    * Can users access your application?
    * Can you deploy without issues?
    * Can you defer an issue to planned downtime?
    * Can your application withstand partial outages?
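
IP address space management and subnet structure can be planned before anything is deployed. The sketch below uses Python's standard `ipaddress` module to carve a VPC range into equal blocks, one per tier per availability zone. The CIDR range, zone names, tier names, and the `plan_subnets` helper are assumptions for illustration only.

```python
import ipaddress


def plan_subnets(vpc_cidr, availability_zones, tiers, new_prefix):
    """Assign one equal-sized, non-overlapping block to each (tier, zone) pair."""
    vpc = ipaddress.ip_network(vpc_cidr)
    # subnets() yields consecutive blocks of the requested prefix length.
    blocks = vpc.subnets(new_prefix=new_prefix)
    plan = {}
    for tier in tiers:
        for az in availability_zones:
            try:
                plan[(tier, az)] = next(blocks)
            except StopIteration:
                raise ValueError("VPC range too small for the requested layout") from None
    return plan


if __name__ == "__main__":
    plan = plan_subnets("10.0.0.0/16", ["az-a", "az-b"], ["public", "private"], 20)
    for (tier, az), block in plan.items():
        print(tier, az, block, block.num_addresses)
```

Any blocks left unassigned remain free for later growth, which is one way to plan around address limits rather than discover them during a traffic spike.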

* **Performance Efficiency**
  * There are five design principles for performance efficiency in the cloud:
    * Democratize advanced technologies
    * Go global in minutes
    * Use serverless architectures
    * Experiment more often
    * Mechanical sympathy
  * Remove bottlenecks, reduce waste
    * Let AWS do the work whenever possible
    * Reduce latency through regions and AWS Edge
    * Serverless whenever possible, then containers, and only then fall back to instances
    * Experiment as new services are released
    * Think about the user, not your tech stack
  * Is this the optimal solution for this workload?
    * Which type of compute best suits the workload?
    * Which data store is ideal for this workload?
    * Does your network design complement compute and data store choices?
  * Continuously ensure choices work for your workloads
    * Is infrastructure stored as code?
    * Are deployments simple and automated?
    * Can benchmarks be taken automatically?
    * Does load testing interfere with production?
  * Monitoring
    * Use active and passive monitoring where appropriate
    * Understand the 5 phases of monitoring - generation, aggregation, real-time processing, storage, analytics
    * Create actionable metrics
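
Taking benchmarks automatically and turning them into an actionable metric can be sketched as follows. The example times a callable, computes a nearest-rank percentile, and compares it with a latency budget. The helper names, the run count, and the budget value are assumptions; a real benchmark would also need warm-up runs and an environment isolated from production.

```python
import math
import time


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def benchmark(fn, runs=50):
    """Call fn repeatedly and return per-call wall-clock durations in milliseconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append((time.perf_counter() - start) * 1000.0)
    return durations


def within_budget(durations_ms, p95_budget_ms):
    """Actionable check: does the 95th percentile stay inside the latency budget?"""
    return percentile(durations_ms, 95) <= p95_budget_ms


if __name__ == "__main__":
    timings = benchmark(lambda: sum(range(10_000)), runs=20)
    print("p95 ms:", percentile(timings, 95))
    print("within budget:", within_budget(timings, p95_budget_ms=50.0))
```

A pass/fail result like `within_budget` can gate a deployment pipeline, which makes the metric something a team acts on rather than only records.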

* **Security**
  * There are six design principles for security in the cloud:
    * Implement a strong identity foundation
    * Enable traceability
    * Apply security at all layers
    * Automate security best practices
    * Protect data in transit and at rest
    * Prepare for security events
  * Does this system work only as intended?
    * Identities have the least privileges required
    * Know who did what and when
    * Security is woven into the fabric of the system
    * Automate security tasks
    * Encrypt all data at rest and in transit
    * Prepare for the worst
  * Look for abnormal behavior in your logs
    * Capture and analyze logs
    * Regularly audit controls and configurations (AWS CloudFormation drift, AWS Config)
  * Defense in depth
    * Establish trust boundaries
    * Protect the network in/out
    * Protect all hosts
    * Configure services to meet security posture needs
    * Enforce service level protection
  * Classify and protect data
    * How sensitive is the data?
    * Who should have access to the data and when?
    * Encrypt in transit and at rest
    * Back up your data and test the backups
  * Contain and recover from an unplanned event
    * Do you have a plan to tag affected resources?
    * Can you adjust permissions to allow for containment?
    * Can you redeploy to recover quickly?
    * Did you learn from the incident and adjust?
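
Checking that identities have the least privileges required can be partly automated. The sketch below scans an IAM-style policy document for `Allow` statements that use wildcard actions or a wildcard resource. It is a deliberately simple lint and an assumption-laden illustration: it ignores `NotAction`, `NotResource`, and conditions, and the bucket name in the sample policy is hypothetical.

```python
def _as_list(value):
    """Policy fields may hold a single value or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def find_broad_statements(policy):
    """Return Allow statements that grant wildcard actions or a wildcard resource."""
    findings = []
    for index, stmt in enumerate(_as_list(policy.get("Statement"))):
        if stmt.get("Effect") != "Allow":
            continue
        actions = _as_list(stmt.get("Action"))
        resources = _as_list(stmt.get("Resource"))
        # "*" grants everything; "service:*" grants every action in one service.
        wildcard_actions = [a for a in actions if a == "*" or a.endswith(":*")]
        wildcard_resources = [r for r in resources if r == "*"]
        if wildcard_actions or wildcard_resources:
            findings.append({
                "statement": index,
                "actions": wildcard_actions,
                "resources": wildcard_resources,
            })
    return findings


if __name__ == "__main__":
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": "s3:*", "Resource": "*"},
            {"Effect": "Allow", "Action": ["s3:GetObject"],
             "Resource": "arn:aws:s3:::example-bucket/*"},
        ],
    }
    for finding in find_broad_statements(policy):
        print(finding)
```

Running a check like this in a pipeline is one way to automate a security task, though a finding still needs human review because some wildcards are intentional.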
