
Fear and Loathing in Security Dashboards

Recently a colleague asked for my help in understanding why he was seeing a specific security alert on a dashboard. The message said that his database instance was “exposed to a broad public IP range.” He disagreed with this assessment because it misrepresented the configuration context. While the database had a public IP, only one port was available, and it was behind a proxy. The access to this test instance was also restricted to “authorized” IP address ranges. I explained that this kind of information is what security practitioners like to know as they evaluate risk, but then thought, “is this a reasonable alert for a user or just more noise?” When did security dashboards become like the news, more information than we can reasonably take in, overloading our cognitive faculties and creating stress?

I have a complicated relationship with security dashboards. Though I understand different teams need a quick view of what they need to prioritize, findings are broadly categorized as high, medium, and low without much background. This approach can create confusion and disagreements between groups because those categories are generally aligned to the Vienna Convention on Road Signs and Signals. Green is good, red is bad, and yellow means caution. The problem is that a lot of findings end up red and yellow, with categorization dependent upon how well the security team has tuned alerts and your organizational risk tolerance. Most nuance is lost.

The other problem is that this data categorization isn’t only seen as a prioritization technique. It can communicate danger. As humans, we have learned to associate red on a dashboard with some level of threat. This might be why some people develop fanariphobia, a fear of traffic lights. Is this an intentional design choice? Historically, Protection Motivation Theory (PMT), which explains how humans are motivated to protect themselves when threatened, has been used as a standard technique within the domain of cybersecurity to justify the use of fear appeals. But what if this doesn’t work as well as we think it does? A recent academic paper reviewed literature in this space and found conflicting data on the value of fear appeals in promoting voluntary security behaviors. It often backfires, leading to a reduction in desired responses. What does work? The researchers identify Stewardship Theory as a more efficacious approach leading to improved security behaviors by employees. They define it as “a covenantal relationship between the individual and the organization” which “connects both employee and employer to work toward a common goal, characterized by moral commitment between employees and the organization.”

Am I suggesting you should throw your security dashboards away? No, but I think we can agree that they’re a limited view, which can exacerbate conflict between teams. Instead of being the end of a conversation, they should be the beginning, a dialog tool that encourages a collaborative discussion between teams about risk.


Dancing with the Cloud

Recently, I’ve written about the dangers posed by technology fallacies and one of the most frustrating for me involves discussions of “best in class.” In my experience, this mindset causes technology teams to get themselves wrapped up in too many pointless discussions followed by never-ending proof-of-concept work all in search of that non-existent perfect tool. The truth is that most organizations don’t need the best, they need “good enough” so they can get on with business. But this delusion has more serious consequences within the cloud. When you choose tools solely based on the requirement of being “best in class,” you could compromise the integrity of your cloud architecture. Without considering the context, your selection could violate a sacred principle of data center design – minimizing the size of a failure domain.

For many, cloud’s abstraction of the underlying network reduced complexity in what often seemed like arcane and mystical knowledge. It also allowed development teams to work unencumbered by the fear of seemingly capricious network engineers. However, the downside of infrastructure-as-a-service (IaaS) is that this same obfuscation allows those without a background in traditional data center design to make critical errors in fault tolerance. We’ve all heard the horror stories about cloud applications failing because they were located in a single availability zone (AZ) or region. Even worse are the partial outages caused by application dependencies that cross an AZ or region. While this could still happen with physical data centers, it was more obvious because you could see the location of a system in a rack, and your network engineers would call you out when you didn’t build in fault tolerance and thwarted their planned maintenance windows.
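The single-AZ failure mode is easy to audit once you have an inventory. As a sketch (the inventory format and field names here are invented for illustration; in practice you would build the list from your provider’s describe-instances API), a few lines can flag applications confined to one zone:

```python
from collections import defaultdict

def find_single_az_apps(instances):
    """Flag applications whose instances all sit in one availability zone.

    `instances` is a list of dicts shaped like a hypothetical inventory
    query result: {"app": ..., "instance_id": ..., "az": ...}.
    Returns a dict mapping each single-AZ app to its lone AZ.
    """
    azs_by_app = defaultdict(set)
    for inst in instances:
        azs_by_app[inst["app"]].add(inst["az"])
    return {app: next(iter(azs)) for app, azs in azs_by_app.items() if len(azs) == 1}

inventory = [
    {"app": "billing", "instance_id": "i-01", "az": "us-east-1a"},
    {"app": "billing", "instance_id": "i-02", "az": "us-east-1a"},
    {"app": "search",  "instance_id": "i-03", "az": "us-east-1a"},
    {"app": "search",  "instance_id": "i-04", "az": "us-east-1b"},
]

print(find_single_az_apps(inventory))  # {'billing': 'us-east-1a'}
```

A report like this won’t catch cross-AZ dependencies, but it makes the most obvious failure-domain violations visible before an outage does.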

Today, it’s also common for organizations to use a variety of software-as-a-service (SaaS) applications with no knowledge of which underlying cloud providers are being used or how those services are designed. While this simplicity is often beneficial for the business because it increases velocity in service delivery, it can also create blind spots that violate the same failure domain principles as with IaaS. Only the most persistent technologists can unmask the inner workings of these applications to determine how they’re deployed and whether they align with their organization’s infrastructure design. Unfortunately, the business doesn’t always understand this nuance and can fall prey to the “best in class” fallacy, leading to brittle environments due to large failure domains. By exchanging their on-premise, sluggish systems for SaaS, the organization often accepts a different set of problems associated with risk.

Ultimately, when choosing capabilities, it’s a better recommendation to “dance with the cloud that brought you.” Instead of worrying about “best in class,” you want to select the technologies that are closer to your infrastructure by leveraging the services of your cloud provider where feasible. This translates to a better technology architecture for your organization because the cloud provider is making sure that their managed services are low-latency, resilient and highly available. While it may not always be possible, by taking the failure domain principle into consideration during the selection and implementation of solutions, you’ll achieve better service delivery for your organization.


Cloud-Native Consumption Principles

The promises of Cloud are alluring. Organizations are told they can reduce costs through a flexible consumption-based model, which minimizes waste in over-provisioning, while also achieving velocity in the development of new digital products without the dependence on heavy, centralized IT processes. This aligns closely with the goals of a DevOps transformation, which seeks to empower developers to build better software through a distributed operational model that delivers solutions more quickly with less overhead. However, most enterprise cloud journeys begin with a “lift and shift” from the on-premise data center to an IaaS provider. This seems like the easiest and fastest way to begin acclimating to the new environment by finding and leveraging similarities in deployment and consumption of digital assets. While this path may initially seem to expedite adoption, the migration is soon bogged down by the very issues that prompted the organization to adopt cloud: cumbersome, centralized processes that don’t support developers’ need for automation and speed.

With startups, which don’t have existing processes and organizational hierarchy to realign to a new way of working, applications have no barriers to becoming Cloud-Native. They begin that way. Enterprises weren’t initially built around a Cloud model, so the implementation is often based on Conway’s Law, the design and provisioning mirroring the existing organizational hierarchy. The only difference is that instead of a server team deploying bare-metal or on-premise virtual machines, they build an instance in the cloud. While there are some incremental gains, much of the latency from human middleware and legacy processes remains. After the short honeymoon based on a PoC or pilot projects, the realities of misaligned business processes grind progress to a halt. This also results in higher spend because cloud resources are not meant to be long-running snowflakes, but ephemeral and immutable. Cloud is made for cattle, not pets.

The source of this friction becomes clear. While cloud is referred to as “Infrastructure as a Service,” many assume this is equivalent to data center hosting. However, Cloud is an evolution in the digital delivery model, where bare-metal is abstracted away from the customer, who now consumes resources through web interfaces and APIs. Cloud should be thought of and consumed as a software platform, i.e., Cloud-Native. As defined by the Cloud Native Computing Foundation (CNCF):

Cloud-native technologies empower organizations to build and run scalable applications in modern, dynamic environments such as public, private, and hybrid clouds. Containers, service meshes, microservices, immutable infrastructure, and declarative APIs exemplify this approach.

These techniques enable loosely coupled systems that are resilient, manageable, and observable. Combined with robust automation, they allow engineers to make high-impact changes frequently and predictably with minimal toil.

Therefore, to maximize the value of cloud adoption at scale, it is necessary to become Cloud-Native, and the effort must be tightly coupled to DevOps automation efforts.

In 1967, Paul Baran discussed the creation of a “national computer public utility system,” a metered, time-sharing model. Cloud, and by extension Cloud-Native, is the manifestation of that prediction and, as with other utilities, the consumption of “compute as utility” must be distributed and self-service in order to achieve cost benefits. What about governance and security concerns? Cloud Service Providers (CSPs) have built-in capabilities to establish policy restrictions at the organization, account, resource and/or identity level. Native security controls can be embedded to function seamlessly, providing the automated monitoring, alerting, and enforcement needed to minimize risk and meet audit requirements. By decoupling compliance from control, these capabilities are more efficiently consumed through the platform via policy-as-code integrated into declarative Infrastructure-as-Code (IaC). Conversely, organizational risk increases when manual provisioning, abstraction layers, or traditional controls that are not cloud-ready or Cloud-Native are used in this environment.
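The policy-as-code idea can be illustrated without any particular tool. This is a minimal sketch, not any CSP’s actual policy engine; the resource schema and rules are invented, but the shape is the same whether you use OPA, AWS Service Control Policies, or Azure Policy: declarative resource definitions are evaluated against machine-readable rules in the pipeline, before anything is deployed.

```python
# Toy policy-as-code check: evaluate declarative resource definitions
# against organizational rules before they reach the cloud provider.
# The resource schema and rule set are invented for illustration.

POLICIES = [
    ("storage buckets must not be public",
     lambda r: not (r["type"] == "bucket" and r.get("public", False))),
    ("every resource needs an owner tag",
     lambda r: "owner" in r.get("tags", {})),
]

def evaluate(resources):
    """Return a list of (resource_name, violated_policy) tuples."""
    return [
        (r["name"], description)
        for r in resources
        for description, check in POLICIES
        if not check(r)
    ]

declared = [
    {"type": "bucket", "name": "logs", "public": True, "tags": {"owner": "ops"}},
    {"type": "instance", "name": "web-1", "tags": {}},
]

for name, violation in evaluate(declared):
    print(f"DENY {name}: {violation}")
```

Because the check runs against the declarative definition rather than the live environment, a violation fails the pipeline and the non-compliant resource is never created, which is the decoupling of compliance from operational control described above.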

In an attempt to ease organizations’ struggle with cloud adoption, Azure and AWS have developed Well-Architected Frameworks to promote better cloud consumption and design. Both consist of five pillars to evaluate the quality of a solution delivery: 

  • Operational excellence
  • Security
  • Reliability
  • Performance (Efficiency)
  • Cost optimization

While helpful, these frameworks fail to communicate the urgent need for automation and tight coupling to the application development lifecycle in order to achieve a successful cloud migration. For example, from the AWS Operational Excellence Pillar, “operations as code” is only listed as a design principle to “limit human error and enable consistent responses to events.”

Ultimately, Cloud at scale is best consumed as a software platform through the automated development processes essential to DevOps; otherwise, the costs of side-channel pipeline provisioning and long-running, inefficiently sized workloads soon outweigh the initial benefits.

To summarize, the principles of a Cloud-Native consumption model include:

  • Automated provisioning of all resources as code through pipelines owned by product teams
  • Distributed self-service to achieve velocity and empower business segments
  • “Shift Everywhere” security through Policy-as-code embedded into the Infrastructure-as-Code
  • Decoupling of compliance from operational control through the use of CSP native capabilities to automate governance, monitoring, alerting and enforcement

To be effective, these principles are best operationalized through the unification of any cloud initiative with a DevOps effort. Otherwise, the cloud effort will be crippled by the existing technology bureaucracy.


Your Pets Don’t Belong in the Cloud

At too many organizations, I’ve seen a dangerous pattern when trying to migrate to public Infrastructure as a Service (IaaS), i.e., Cloud. It’s often approached like a colo or a data center hosting service, and the result is eventual failure of the initiative due to massive cost overruns and terrible performance. Essentially, this can be attributed to inexperience on the side of the organization and a cloud provider business model based on consumption. The end result is usually layoffs and reorgs while senior leadership shakes its head: “But it worked for Netflix!”

Based on my experience with various public and hybrid cloud initiatives, I can offer the following advice.

  1. Treat public cloud like an application platform, not traditional infrastructure. That means you should have reference models and Infrastructure-as-Code (IaC) templates for the deployment of architecture and application components that have undergone security and peer reviews in advance. Practice “policy as code” by working with cloud engineers to build security requirements into IaC.
  2. Use public cloud like an ephemeral ecosystem with immutable components. Translation: your “pets” don’t belong there, only cattle. Deploy resources to meet demand and establish expiration dates. Don’t attempt to migrate your monolithic application without significant refactoring to make it cloud-friendly. If you need to change a configuration or resize, then redeploy. Identify validation points in your cloud supply chain where you can catch vulnerable systems/components prior to deployment, because it reduces your attack surface AND it’s cheaper. You should also have monitoring in place (AWS Config or a 3rd-party app) that catches any deviation and automatically remediates. You want cloud infrastructure that is standardized, secure and repeatable.
  3. Become an expert in understanding the cost of services in public cloud. Remember, it’s a consumption model and the cloud provider isn’t going to lose any sleep over customers hemorrhaging money due to bad design.
  4. Hybrid cloud doesn’t mean creating inefficient design patterns based on dependencies between public cloud and on-premise infrastructure. You don’t do this with traditional data centers, so why would you do it with hybrid cloud?
  5. Hire experienced automation engineers/developers to lead your cloud migration and train staff who believe in the initiative. Send the saboteurs home early on or you’ll have organizational chaos.

If software ate the world, it burped out the Cloud. If you don’t approach this initiative with the right architecture, processes and people, there aren’t enough fancy tools in the world to help you clean up the result: organizational indigestion.



The Five Stages of Cloud Grief

Over the last five years as a security architect, I’ve been at organizations in various phases of cloud adoption. During that time, I’ve noticed that the most significant barrier isn’t technical. In many cases, public cloud is actually a step up from an organization’s on-premise technical debt.

One of the main obstacles to migration is emotional, and it can derail a cloud strategy faster than any technical roadblock. This is because our organizations are still filled with carbon units with messy emotions who can quietly sabotage the initiative.

The emotional trajectory of an organization attempting to move to the public cloud can be illustrated through the Five Stages of Cloud Grief, which I’ve based on the Kübler-Ross Grief Cycle.

  1. Denial – Senior Leadership tells the IT organization they’re spending too much money and that they need to move everything to the cloud, because it’s cheaper. The CIO curls into fetal position under his desk. Infrastructure staff eventually hear about the new strategy and run screaming to the data center, grabbing onto random servers and switches. Other staff hug each other and cry tears of joy hoping that they can finally get new services deployed before they retire.
  2. Anger – IT staff shows up at all-hands meeting with torches and pitchforks calling for the CIO’s blood and demanding to know if there will be layoffs. The security team predicts a compliance apocalypse. Administrative staff distracts them with free donuts and pizza.
  3. Depression – CISO tells everyone cloud isn’t secure and violates all policies. Quietly packs a “go” bag and stocks bomb shelter with supplies. Infrastructure staff are forced to take cloud training, but continue to miss project timeline milestones while they refresh their resumes and LinkedIn pages.
  4. Bargaining – After senior leadership sets a final “drop dead” date for cloud migration, IT staff complain that they don’t have enough resources. New “cloud ready” staff is hired and enter the IT Sanctum Sanctorum like the Visigoths invading Rome. Information Security team presents threat intelligence report that shows $THREAT_ACTOR_DU_JOUR has pwned public cloud.
  5. Acceptance – 75% of cloud migration goal is met, but since there wasn’t a technical strategy or design, the Opex is higher and senior leadership starts wearing diapers in preparation for the monthly bill. Most of the “cloud ready” staff has moved on to the next job out of frustration and the only people left don’t actually understand how anything works.



Security Group Poop

One of the most critical elements of an organization’s security posture in AWS is the configuration of security groups. In some of my architectural reviews, I often see rules that are confusing, overly permissive and without any clear business justification for the access allowed. Basically, the result is a big, steaming pile of security turds.
While I understand many shops don’t have dedicated network or infrastructure engineers to help configure their VPCs, AWS has created some excellent documentation to make it a bit easier to deploy services there. You can and should plow through the entirety of this information. But for those with short attention spans or very little time, I’ll point out some key principles and “best practices” that you must grasp when configuring security groups.
  • A VPC automatically comes with a default security group and each instance created in that VPC will be associated with it, unless you create a new security group.
  • “Allow” rules are explicit, “deny” rules are implicit. With no rules, the default behavior is “deny.” If you want to authorize ingress or egress access you add a rule, if you remove a rule, you’re revoking access.
  • By default, a new security group has no inbound rules (so all inbound traffic is denied) and a single outbound rule permitting all traffic. It is a “best practice” to remove this default outbound rule, replacing it with more granular rules that allow only the outbound traffic specifically needed for the functionality of the systems and services in the VPC.
  • Security groups are stateful. This means that if you allow inbound traffic to an instance on a specific port, the return traffic is automatically allowed, regardless of outbound rules.
  • Typical use-cases requiring explicit outbound rules for application functionality include:
    • ELB/ALBs – If the default outbound rule has been removed from the security group containing an ELB/ALB, an outbound rule must be configured to forward traffic to the instances hosting the service(s) being load balanced.
    • If the instance must forward traffic to a system/service outside the configured security group.
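To make the “overly permissive” complaint concrete, here’s a toy linter for ingress rules. The rule format is a simplified assumption, not the actual `describe-security-groups` output, but the checks mirror what a reviewer looks for: rules open to the whole internet and rules with suspiciously broad CIDR ranges.

```python
import ipaddress

def lint_ingress_rules(rules):
    """Flag overly permissive security group ingress rules.

    Rule dicts here are a simplified stand-in for what you'd pull from
    the AWS API; the field names are assumptions for this sketch.
    """
    findings = []
    for rule in rules:
        net = ipaddress.ip_network(rule["cidr"])
        if net.prefixlen == 0:
            findings.append(f"port {rule['port']}: open to the entire internet ({rule['cidr']})")
        elif net.prefixlen < 16:
            findings.append(f"port {rule['port']}: very broad range ({rule['cidr']})")
    return findings

rules = [
    {"port": 443,  "cidr": "0.0.0.0/0"},    # fine for a public web tier, a finding anywhere else
    {"port": 22,   "cidr": "10.0.0.0/8"},   # broad internal range for SSH
    {"port": 5432, "cidr": "10.1.2.0/24"},  # scoped database access
]

for f in lint_ingress_rules(rules):
    print(f)
```

The thresholds are judgment calls tied to your environment; the point is that every rule should have to justify its breadth against a business need, in code, before it ships.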
AWS documentation, including security group templates, covering multiple use-cases:
Security groups are more effective when layered with Network ACLs, providing an additional control to help protect your resources in the event of a misconfiguration. But there are some important differences to keep in mind according to AWS:
  • Security group: operates at the instance level (first layer of defense). Network ACL: operates at the subnet level (second layer of defense).
  • Security group: supports allow rules only. Network ACL: supports allow rules and deny rules.
  • Security group: is stateful; return traffic is automatically allowed, regardless of any rules. Network ACL: is stateless; return traffic must be explicitly allowed by rules.
  • Security group: all rules are evaluated before deciding whether to allow traffic. Network ACL: rules are processed in number order when deciding whether to allow traffic.
  • Security group: applies to an instance only if someone specifies the security group when launching the instance, or associates it with the instance later on. Network ACL: automatically applies to all instances in the subnets it’s associated with (a backup layer of defense, so you don’t have to rely on someone specifying the security group).
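The “all rules” versus “number order” distinction trips people up, so here is a deliberately simplified simulation (matching on port only; real rules also match protocol and CIDR, and NACLs evaluate direction separately):

```python
def sg_allows(rules, packet):
    """Security group semantics: all allow rules are evaluated and traffic
    passes if ANY rule matches. There are no deny rules."""
    return any(r["port"] == packet["port"] for r in rules)

def nacl_allows(rules, packet):
    """Network ACL semantics: rules are processed in ascending rule-number
    order and the FIRST match wins; an unmatched packet hits the implicit
    final deny."""
    for r in sorted(rules, key=lambda r: r["num"]):
        if r["port"] == packet["port"] or r["port"] == "*":
            return r["action"] == "allow"
    return False

sg = [{"port": 443}, {"port": 22}]
nacl = [
    {"num": 100, "port": 22, "action": "deny"},   # wins over rule 200 for port 22
    {"num": 200, "port": "*", "action": "allow"},
]

print(sg_allows(sg, {"port": 22}))      # True
print(nacl_allows(nacl, {"port": 22}))  # False: rule 100 matched first
print(nacl_allows(nacl, {"port": 443})) # True: falls through to rule 200
```

Notice that reordering security group rules changes nothing, while renumbering NACL rules can flip a decision entirely, which is why NACL rule numbers deserve the same review rigor as the rules themselves.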
Additionally, the AWS Security Best Practices document, makes the following recommendations:
  • Always use security groups: They provide stateful firewalls for Amazon EC2 instances at the hypervisor level. You can apply multiple security groups to a single instance, and to a single ENI.
  • Augment security groups with Network ACLs: They are stateless but they provide fast and efficient controls. Network ACLs are not instance-specific so they can provide another layer of control in addition to security groups. You can apply separation of duties to ACLs management and security group management.
  • For large-scale deployments, design network security in layers. Instead of creating a single layer of network security protection, apply network security at external, DMZ, and internal layers. 

For those who believe the purchase of some vendor magic beans (i.e. a product) will instantly fix the problem, get ready for disappointment. You’re not going to be able to configure that tool properly for enforcement until you comprehend how security groups work and what the rules should be for your environment.



Why You Shouldn’t Be Hosting Public DNS

As a former Unix engineer who managed my share of critical network services, one of the first things I do when evaluating an organization is to validate the health of infrastructure components such as NTP, RADIUS, and DNS. I’m often shocked by what I find. Although most people barely understand how these services work, when they break, it can create some troublesome technical issues or even a full meltdown. This is especially true of DNS.

Most problems with DNS implementations are caused by the fact that so few people actually understand how the protocol is supposed to work, including vendors. The kindest thing one can say about DNS is that it’s esoteric. In my IT salad days, I implemented and was responsible for managing the BIND 9.x infrastructure at an academic institution. I helped write and enforce the DNS request policy, cleaned up and policed the namespace, built and hardened the servers, compiled the BIND binaries and essentially guarded the architecture for over a decade. I ended up in this role because no one else wanted it. I took a complete mess of a BIND 4.x deployment and proceeded to untangle a ball of string the size of New Zealand. The experience was an open source rite of passage, helping to make me the engineer and architect I am today. I also admit to being a BIND fangirl, mostly because it’s the core software of most load-balancers and IPAM systems.

This history makes what I’m about to recommend even more shocking. Outside of service providers, I no longer believe that organizations should run their own public DNS servers. Most enterprises get along fine using Active Directory for internal authentication and name resolution, using a DNS provider such as Neustar, Amazon or Akamai to resolve external services. They don’t need to take on the risk associated with managing external authoritative DNS servers or even load-balancing most public services.

The hard truth is that external DNS is best left to the experts who have time for the care and feeding of it. One missed security patch, a mistyped entry, a system compromise; any of these could have a significant impact to your business. And unless you’re an IT organization, wouldn’t it be better to have someone else deal with that headache? Besides, as organizations continue to move their services to the cloud, why would you have the name resolution of those resources tied to some legacy, on-premise server? But most importantly, as DDoS attacks become more prevalent, UDP-based services are an easy target, especially DNS. Personally, I’d rather have a service provider deal with the agony of DDoS mitigation. They’re better prepared with the right (expensive) tools and plenty of bandwidth.

I write this with great sadness and it even feels like I’m relinquishing some of my nerd status. But never fear, I still have a crush on Paul Vixie and will always choose dig over nslookup.


Why You’re Probably Not Ready for SDN

While it may seem as though I spend all my time inventing witty vendor snark to post on social media, it doesn’t pay the bills. So I have a day job as a Sr. Security Architect. But after coming up through the ranks in IT infrastructure, I often consider myself “architect first, security second.” I’m that rare thing: an IT generalist. I actually spend quite a bit of time trying to stay current on all technology, and SDN is one of many topics of interest for me, especially since vendors are now trying to spin it as a security solution.

Software-defined networking (SDN) is still discussed as if it’s the secret sauce of the Internet. This despite Gartner placing it at the bottom of its Networking Hype Cycle due to “SDN fatigue” and the technology’s failure, thus far, to gain much traction in the enterprise.

However, the magical SDN unicorn still manages to rear its head in strategy meetings under the new guise of hyper-convergence and the software-defined data center (SDDC). This is probably due to IT leadership’s continued yearning for cost savings, improved security and the achievement of a truly agile organization. But is SDN, with its added complexity and startling licensing costs, really the answer?
You can read the rest of the article here. And yes, there’s a registration wall.