
Dancing with the Cloud

Recently, I’ve written about the dangers posed by technology fallacies and one of the most frustrating for me involves discussions of “best in class.” In my experience, this mindset causes technology teams to get themselves wrapped up in too many pointless discussions followed by never-ending proof-of-concept work all in search of that non-existent perfect tool. The truth is that most organizations don’t need the best, they need “good enough” so they can get on with business. But this delusion has more serious consequences within the cloud. When you choose tools solely based on the requirement of being “best in class,” you could compromise the integrity of your cloud architecture. Without considering the context, your selection could violate a sacred principle of data center design – minimizing the size of a failure domain.

For many, cloud’s abstraction of the underlying network reduced complexity in what often seemed like arcane and mystical knowledge. It also allowed development teams to work unencumbered by the fear of seemingly capricious network engineers. However, the downside of infrastructure-as-a-service (IaaS) is that this same obfuscation allows those without a background in traditional data center design to make critical errors in fault tolerance. We’ve all heard the horror stories about cloud applications failing because they were only located in a single availability zone (AZ) or region. Even worse are the partial outages that occur due to application dependencies that cross an AZ or region. While this could still happen with physical data centers, it was more obvious because you could see the location of a system in a rack and your network engineers would call you out when you didn’t build in fault tolerance and thwarted their planned maintenance windows.
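To make the failure domain idea concrete, here is a minimal sketch of the kind of check I have in mind. Everything in it is an assumption for illustration: the `(service, zone)` placement pairs, the zone names and the two function names are hypothetical, not something any cloud provider hands you in this form.

```python
from collections import defaultdict


def zones_by_service(placements):
    """Group a list of (service, zone) pairs into {service: {zones}}."""
    zones = defaultdict(set)
    for service, zone in placements:
        zones[service].add(zone)
    return zones


def single_zone_services(placements):
    """Return services whose instances all sit in one availability zone."""
    zones = zones_by_service(placements)
    return sorted(name for name, seen in zones.items() if len(seen) < 2)


def cross_zone_dependencies(placements, dependencies):
    """Return (service, dependency) pairs where the service runs in at
    least one zone that its dependency does not cover."""
    zones = zones_by_service(placements)
    return sorted(
        (service, dep)
        for service, dep in dependencies
        if not zones[service] <= zones[dep]
    )


# Hypothetical inventory: web and cache span two zones, db sits in one.
PLACEMENTS = [
    ("web", "az-a"), ("web", "az-b"),
    ("cache", "az-a"), ("cache", "az-b"),
    ("db", "az-a"),
]
DEPENDENCIES = [("web", "db"), ("web", "cache")]
```

With this sample data, `db` is flagged as living in a single zone and `web -> db` is flagged as a dependency that crosses zones, which is the partial-outage scenario described above.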

Today, it’s also common for organizations to use a variety of software-as-a-service (SaaS) applications with no knowledge of which underlying cloud providers are being used or how those services are designed. While this simplicity is often beneficial for the business because it increases velocity in service delivery, it can also create blind spots that violate the same failure domain principles as with IaaS. Only the most persistent technologists can unmask the inner workings of these applications to determine how they’re deployed and whether they align with their organization’s infrastructure design. Unfortunately, the business doesn’t always understand this nuance and can fall prey to the “best in class” fallacy, leading to brittle environments due to large failure domains. By exchanging their sluggish on-premises systems for SaaS, the organization often accepts a different set of problems associated with risk.

Ultimately, when choosing capabilities, it’s a better recommendation to “dance with the cloud that brought you.” Instead of worrying about “best in class,” you want to select the technologies that are closer to your infrastructure by leveraging the services of your cloud provider where feasible. This translates to a better technology architecture for your organization because the cloud provider is making sure that their managed services are low-latency, resilient and highly available. While it may not always be possible, by taking the failure domain principle into consideration during the selection and implementation of solutions, you’ll achieve better service delivery for your organization.


Trapped by Technology Fallacies

After working in tech at several large companies over a couple of decades, I’ve observed some of the worst fallacies that cause damage to organizations. They don’t arise from malice, but from a scarcity of professional reflection in our field. Technologists often jump to problem solving before spending sufficient time on problem setting, which leads to the creation of inappropriate and brittle solutions. Donald A. Schön discusses this disconnect in his seminal work, The Reflective Practitioner,

…with this emphasis on problem solving, we ignore problem setting, the process by which we define the decision to be made, the ends to be achieved, the means which may be chosen.

Problem solving relies on selecting from a menu of previously established formulas. While many of these tactics can be effective, let’s examine some of the dysfunctional approaches used by technologists that lead to pain for their organizations.
  • Fallacy #1 – Hammer-Nail. Technologists often assume that all problems can be beaten into submission with a technology hammer. It’s like the bride’s father in My Big Fat Greek Wedding, who believes that Windex can be used to cure any ill. Similarly, technologists think that every challenge is just missing a certain type of technology to resolve it. This, even though we generally speak about maturity models in terms of people, process, technology, and culture. I can’t tell you how often I’ve seen someone design and implement a seemingly elegant solution only to have it rejected because it was developed without understanding the context of the problem.
  • Fallacy #2 – Best in Class. I’ve heard this so many times in my career that I just want to stand on a chair and shake my fist in the middle of a Gartner conference. Most organizations don’t need “best in class,” they need “good enough.” The business needs fast and frugal solutions to keep them productive and efficient, but technologists are often too busy navel gazing to listen.
  • Fallacy #3 – Information Technology is the center of the business universe. I once worked for a well-known bank that had an informal motto, “We’re a technology company that happens to be a bank.” The idea was that because they were so reliant on technology, it transformed them into a cool tech company. I used to respond with, “We also use a lot of electricity, does that make us a utility company?” Maybe a little hyperbolic, but I was trying to make the point that IT Doesn’t Matter. When Nicholas Carr used that phrase as the title of his Harvard Business Review article in 2003, he was considering technology in the historical context of other advances such as electricity and telephones, “When a resource becomes essential to competition but inconsequential to strategy, the risks it creates become more important than the advantages it provides.” In the early days of tech, it gave you an edge. Today, when a core system fails, it could sink your business. The best solutions are often invisible to the organization so it can focus on its core competencies.
While technology can be very effective at solving technical problems, most organizational issues are adaptive challenges. In The Practice of Adaptive Leadership, the authors identify this failure to differentiate between the two as the root cause of business difficulties,

The most common cause of failure in leadership is produced by treating adaptive challenges as if they were technical problems. What’s the difference? While technical problems may be very complex and critically important (like replacing a faulty heart valve during cardiac surgery), they have known solutions that can be implemented by current know-how. They can be resolved through the application of authoritative expertise and through the organization’s current structures, procedures, and ways of doing things. Adaptive challenges can only be addressed through changes in people’s priorities, beliefs, habits, and loyalties. Making progress requires going beyond any authoritative expertise to mobilize discovery, shedding certain entrenched ways, tolerating losses, and generating the new capacity to thrive anew.

The end goals that we’re trying to reach can’t be clearly established if we don’t sufficiently reflect on the problem. When we jump to problem solving over problem setting, we’re assuming a level of confidence that hasn’t been earned. We’ve made assumptions in the way systems should work, without thoroughly investigating how they are actually functioning. When Postmodern critic Michel Foucault speaks of “an insurrection of subjugated knowledges,” he’s questioning the certainty of our perceptions when we’ve disqualified information that might be important in gaining a broader perspective. Technologists are more effective when they recognize the inherent expertise of the non-technologists in the businesses they serve and operate as trusted partners who understand change leadership. Instead of serving the “religion of tech,” we should focus on delivering what organizations really need.

Architecture Frameworks: Meaningful or Ridiculous?

Earlier this week someone reached out to me on LinkedIn after listening to a podcast episode I was on where I discussed security architecture and cloud migration. He had been thinking about moving into architecture from security engineering and wanted some suggestions about making that transition successful. Specifically, he wanted to know what I thought of architecture frameworks such as SABSA (Sherwood Applied Business Security Architecture). This discussion caused me to reconsider my thoughts on architecture and the lengthy arguments I’ve had over frameworks.

I should say that I have a love-hate relationship with architecture frameworks. I’m passionate about organized exercises in critical thinking, so the concept of a framework appeals to me. However, in practice, they can turn into pointless intellectual exercises equivalent to clerics arguing how many angels can dance on the head of a pin. In my experience, no one ever seems to be all that happy with architecture frameworks because they’re often esoteric and mired in complexity.

From what I’ve seen across the organizations where I’ve worked, if there is an architecture framework, it is usually some derivative of TOGAF (The Open Group Architecture Framework). This reality doesn’t mean someone within the organization intentionally chose it as the most appropriate for their environment. It’s just that TOGAF has been around long enough (since 1995) to have become pervasive in the practice of architecture and consequently embedded in organizations.

Regardless of what a technology organization is using as their framework, I’ve found that for a security architect to effectively collaborate, you need to align with whatever the other architects are using. That might be based on TOGAF, but it might be something else entirely. You’ll have an easier time plugging security into the practice if you follow their lead. I’ve never actually seen an organization, even a large one with a mature architecture practice, follow anything as detailed as TOGAF or SABSA very strictly though. It’s usually some slimmed-down implementation, and trying to lay SABSA on top of that is generally too heavy and convoluted.

But I admit to not having had much formal architecture training. Frankly, I don’t know many professional architects that have. Maybe that’s why there are a lot of bad architects or possibly it says something about how architects are created and trained, which is informal. However, I have spent significant time studying frameworks such as these to become a thoughtful technologist. I personally find the TOGAF framework and docs helpful when trying to center an architectural conversation on a common taxonomy. Most importantly, I believe in pragmatism: meet the other architects where they are. Try to identify the common framework they’re using to work with your colleagues successfully. Because it’s not about using the best framework, it’s about finding the one that works within the given maturity of an organization.


The Five Stages of Cloud Grief

Over the last five years as a security architect, I’ve been at organizations in various phases of cloud adoption. During that time, I’ve noticed that the most significant barrier isn’t technical. In many cases, public cloud is actually a step up from an organization’s on-premises technical debt.

One of the main obstacles to migration is emotional and can derail a cloud strategy faster than any technical roadblock. This is because our organizations are still filled with carbon units whose messy emotions can lead them to quietly sabotage the initiative.

The emotional trajectory of an organization attempting to move to the public cloud can be illustrated through the Five Stages of Cloud Grief, which I’ve based on the Kübler-Ross Grief Cycle.

  1. Denial – Senior Leadership tells the IT organization they’re spending too much money and that they need to move everything to the cloud, because it’s cheaper. The CIO curls into fetal position under his desk. Infrastructure staff eventually hear about the new strategy and run screaming to the data center, grabbing onto random servers and switches. Other staff hug each other and cry tears of joy hoping that they can finally get new services deployed before they retire.
  2. Anger – IT staff shows up at all-hands meeting with torches and pitchforks calling for the CIO’s blood and demanding to know if there will be layoffs. The security team predicts a compliance apocalypse. Administrative staff distracts them with free donuts and pizza.
  3. Depression – CISO tells everyone cloud isn’t secure and violates all policies. Quietly packs a “go” bag and stocks bomb shelter with supplies. Infrastructure staff are forced to take cloud training, but continue to miss project timeline milestones while they refresh their resumes and LinkedIn pages.
  4. Bargaining – After senior leadership sets a final “drop dead” date for cloud migration, IT staff complain that they don’t have enough resources. New “cloud ready” staff are hired and enter the IT Sanctum Sanctorum like the Visigoths invading Rome. Information Security team presents threat intelligence report that shows $THREAT_ACTOR_DU_JOUR has pwned public cloud.
  5. Acceptance – 75% of cloud migration goal is met, but since there wasn’t a technical strategy or design, the Opex is higher and senior leadership starts wearing diapers in preparation for the monthly bill. Most of the “cloud ready” staff has moved on to the next job out of frustration and the only people left don’t actually understand how anything works.


Infrastructure-as-Code Is Still *CODE*

After working in a DevOps environment for over a year, I’ve become an automation acolyte. The future is here and I’ve seen the benefits when you get it right: improved efficiency, better control and fewer errors. However, I’ve also seen the dark side with Infrastructure-as-Code (IaC). Bad things happen because people forget that it’s still code and it should be subject to the same types of security controls you use in the rest of your SDLC.

That means including automated or manual reviews, threat modeling and architectural risk assessments. Remember, you’re not only looking for mistakes in provisioning your infrastructure or opportunities for cost control. Some of this code might introduce vulnerabilities that could be exploited by attackers. Are you storing credentials in the code? Are you calling scripts or homegrown libraries and has that code been reviewed? Do you have version control in place? Are you using open source tools that haven’t been updated recently? Are your security groups overly permissive?
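As a minimal sketch of what an automated review could start with, here is a tiny scanner for two of the questions above: credentials stored in the code and rules open to the whole Internet. The patterns, the `scan_iac` name and the sample lines are all assumptions for illustration. A real review would lean on a purpose-built scanner and a human reviewer rather than two regular expressions.

```python
import re

# Hypothetical, deliberately simple checks. Each maps a finding label to a
# pattern that is tested against one line of IaC source at a time.
CHECKS = {
    "hard-coded credential": re.compile(
        r"(?i)(password|secret|access_key|token)\w*\s*[=:]\s*[\"'][^\"']+[\"']"
    ),
    "rule open to the world": re.compile(r"\b0\.0\.0\.0/0\b"),
}


def scan_iac(text):
    """Return (line_number, label) for every line that trips a check."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in CHECKS.items():
            if pattern.search(line):
                findings.append((number, label))
    return findings


# Illustrative input only, loosely modeled on Terraform-style syntax.
SAMPLE = "\n".join([
    'resource "aws_security_group_rule" "ssh" {',
    '  cidr_blocks = ["0.0.0.0/0"]',
    "}",
    'db_password = "hunter2"',
    "api_token = var.api_token",
])

if __name__ == "__main__":
    for number, label in scan_iac(SAMPLE):
        print(f"line {number}: {label}")
```

The last sample line is not flagged because the token comes from a variable rather than a quoted literal. That is the behavior you want, and also a reminder of how easily a pattern-based check can be sidestepped.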

IaC is CODE. Why aren’t you treating it that way?


NTP Rules of the Road

There’s nothing more infuriating than watching organizations screw up foundational protocols and NTP seems to be one of the most commonly misconfigured. For some reason, people seem to think the goal is to have “perfect” time, when what is really needed is consistent organizational time. You need everything within a network to be synchronized for troubleshooting and incident management purposes. Otherwise, you’re going to waste a lot of energy identifying root causes and attacks.

It’s recommended to synchronize a few systems or devices at your network perimeter with a public stratum one server, but only if you don’t have your own stratum zero GPS source with a stratum one server attached. I can’t tell you how many times I’ve seen a network team go to the trouble of setting this up, only for the systems people to still point everything to ntp.org.

Everything inside a network should cascade from those perimeter devices, which are usually routers, Active Directory systems or stratum one servers. This design reduces the possibility of internal time drift, the load on public NTP servers and your firewalls, and the organizational risk of opening up unnecessary ports to allow outgoing traffic to the Internet. Over the last few years, some serious vulnerabilities have been identified in the protocol and it can also be used as a data exfiltration channel by attackers.
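One way to catch internal systems that still point at the public pool is to audit their time configuration against a list of approved internal sources. The sketch below assumes the `server`, `pool` and `peer` directives used in ntp.conf and chrony.conf style files. The function name and the hostnames are hypothetical.

```python
def unapproved_ntp_sources(config_text, approved):
    """Return time sources named in the config that are not on the approved list."""
    found = []
    for line in config_text.splitlines():
        # Drop trailing comments, then split the directive into fields.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] in ("server", "pool", "peer"):
            if fields[1] not in approved:
                found.append(fields[1])
    return found


# Hypothetical internal sources and a config that still uses the public pool.
APPROVED = {"ntp1.corp.example", "ntp2.corp.example"}
CONFIG = "\n".join([
    "server ntp1.corp.example iburst",
    "pool pool.ntp.org iburst",
    "# server old.corp.example",
])
```

Run across a fleet, a check like this turns "everything should cascade from the perimeter" from a design statement into something you can verify.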

In addition to the IETF’s draft on NTP “best practices,” the SEI also has an excellent guidance document.

While it’s not realistic to have your own stratum zero device in the cloud, within AWS, it is recommended to use the designated NTP pool specified in their documentation.

Oh, and for the love of all that is holy, please use UTC. I cannot understand why I’m still having this argument with people.
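For anyone still on the fence, this is roughly what "use UTC" looks like in practice: convert at the edge and store or log one unambiguous timestamp. It is a minimal sketch using only the Python standard library, and `to_utc_iso` is a name I made up for the example.

```python
from datetime import datetime, timedelta, timezone


def to_utc_iso(moment):
    """Convert a timezone-aware datetime to an ISO 8601 string in UTC."""
    if moment.tzinfo is None:
        # A naive datetime has no offset, so guessing one would hide errors.
        raise ValueError("naive datetime: attach a timezone before converting")
    return moment.astimezone(timezone.utc).isoformat()


# A local timestamp five hours behind UTC, as a host might report it.
LOCAL = datetime(2024, 3, 10, 1, 30, tzinfo=timezone(timedelta(hours=-5)))

if __name__ == "__main__":
    print(to_utc_iso(LOCAL))  # 2024-03-10T06:30:00+00:00
```

Refusing naive datetimes is a design choice: it forces the offset question to be answered where the timestamp is created, not during an incident.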


Security Group Poop

One of the most critical elements of an organization’s security posture in AWS is the configuration of security groups. In my architectural reviews, I often see rules that are confusing, overly permissive and without any clear business justification for the access allowed. Basically, the result is a big, steaming pile of security turds.
While I understand many shops don’t have dedicated network or infrastructure engineers to help configure their VPCs, AWS has created some excellent documentation to make it a bit easier to deploy services there. You can and should plow through the entirety of this information. But for those with short attention spans or very little time, I’ll point out some key principles and “best practices” that you must grasp when configuring security groups.
  • A VPC automatically comes with a default security group, and each instance launched in that VPC will be associated with it unless you specify a different security group.
  • “Allow” rules are explicit, “deny” rules are implicit. With no rules, the default behavior is “deny.” If you want to authorize ingress or egress access you add a rule, if you remove a rule, you’re revoking access.
  • The default rule for a security group denies all inbound traffic and permits all outbound traffic. It is a “best practice” to remove this default rule, replacing it with more granular rules that allow outbound traffic specifically needed for the functionality of the systems and services in the VPC.
  • Security groups are stateful. This means that if you allow inbound traffic to an instance on a specific port, the return traffic is automatically allowed, regardless of outbound rules.
  • The use-cases requiring inbound and outbound rules for application functionality would be:
    • ELB/ALBs – If the default outbound rule has been removed from the security group containing an ELB/ALB, an outbound rule must be configured to forward traffic to the instances hosting the service(s) being load balanced.
    • If the instance must forward traffic to a system/service outside the configured security group.
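As a rough illustration of reviewing rules against these principles, the sketch below flags any rule that is open to 0.0.0.0/0. It works on a plain dictionary shaped like the security group description the EC2 API returns (`IpPermissions`, `IpPermissionsEgress`, `IpRanges`, `CidrIp`). The sample group and the `open_rules` name are hypothetical, and the check ignores IPv6 ranges and group-to-group references.

```python
def open_rules(group):
    """Return (group_id, direction, protocol, ports) for rules open to 0.0.0.0/0."""
    findings = []
    directions = (("ingress", "IpPermissions"), ("egress", "IpPermissionsEgress"))
    for direction, key in directions:
        for rule in group.get(key, []):
            cidrs = [r.get("CidrIp") for r in rule.get("IpRanges", [])]
            if "0.0.0.0/0" in cidrs:
                protocol = rule.get("IpProtocol")
                # A protocol of "-1" means all protocols and all ports.
                if protocol == "-1":
                    ports = "all"
                else:
                    ports = f"{rule.get('FromPort')}-{rule.get('ToPort')}"
                findings.append((group["GroupId"], direction, protocol, ports))
    return findings


# Hypothetical group: SSH open to the world plus the default allow-all egress.
GROUP = {
    "GroupId": "sg-example",
    "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
        {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
         "IpRanges": [{"CidrIp": "10.0.0.0/16"}]},
    ],
    "IpPermissionsEgress": [
        {"IpProtocol": "-1", "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
    ],
}
```

For this sample, both the world-open SSH rule and the untouched default outbound rule are reported, which lines up with the advice above to replace the default egress rule with something more granular.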
AWS documentation, including security group templates, covering multiple use-cases:
Security groups are more effective when layered with Network ACLs, providing an additional control to help protect your resources in the event of a misconfiguration. But there are some important differences to keep in mind according to AWS:
| Security Group | Network ACL |
|---|---|
| Operates at the instance level (first layer of defense) | Operates at the subnet level (second layer of defense) |
| Supports allow rules only | Supports allow rules and deny rules |
| Is stateful: Return traffic is automatically allowed, regardless of any rules | Is stateless: Return traffic must be explicitly allowed by rules |
| We evaluate all rules before deciding whether to allow traffic | We process rules in number order when deciding whether to allow traffic |
| Applies to an instance only if someone specifies the security group when launching the instance, or associates the security group with the instance later on | Automatically applies to all instances in the subnets it’s associated with (backup layer of defense, so you don’t have to rely on someone specifying the security group) |
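The difference in rule evaluation is easier to see in a toy model. This is a deliberately simplified sketch that only looks at destination ports and ignores protocol, source and direction. The tuple layouts and function names are my own assumptions, not an AWS API.

```python
def security_group_allows(rules, port):
    """Security group model: allow rules only, so any match permits the
    traffic and no match means an implicit deny."""
    return any(low <= port <= high for low, high in rules)


def network_acl_allows(rules, port):
    """Network ACL model: rules are (number, action, low, high) and are
    processed in number order, with the first match deciding the outcome."""
    for _number, action, low, high in sorted(rules):
        if low <= port <= high:
            return action == "allow"
    return False  # nothing matched, so the traffic is denied


# Hypothetical rule sets.
SG_RULES = [(443, 443), (1024, 65535)]
NACL_DENY_FIRST = [(100, "deny", 22, 22), (200, "allow", 0, 65535)]
NACL_ALLOW_FIRST = [(100, "allow", 0, 65535), (200, "deny", 22, 22)]
```

With `NACL_DENY_FIRST`, port 22 is blocked because rule 100 matches before the broad allow. Swap the numbering as in `NACL_ALLOW_FIRST` and the deny rule is never reached, which is exactly the kind of ordering mistake that cannot happen with a security group.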
Additionally, the AWS Security Best Practices document makes the following recommendations:
  • Always use security groups: They provide stateful firewalls for Amazon EC2 instances at the hypervisor level. You can apply multiple security groups to a single instance, and to a single ENI.
  • Augment security groups with Network ACLs: They are stateless but they provide fast and efficient controls. Network ACLs are not instance-specific so they can provide another layer of control in addition to security groups. You can apply separation of duties to ACLs management and security group management.
  • For large-scale deployments, design network security in layers. Instead of creating a single layer of network security protection, apply network security at external, DMZ, and internal layers. 

For those who believe the purchase of some vendor magic beans (i.e. a product) will instantly fix the problem, get ready for disappointment. You’re not going to be able to configure that tool properly for enforcement until you comprehend how security groups work and what the rules should be for your environment.


Fixing a Security Program

I’m still unsettled by how many security programs are so fundamentally broken, even those managed and staffed by people with impressive credentials. But when I talk to some of these individuals, I discover the key issue. Many seem to think the root cause is bad tools. This is like believing the only thing keeping you from writing the next Great American Novel is that you don’t have John Steinbeck’s pen or Dorothy Parker’s typewriter.

In reality, most of the problems found in security programs are caused by inferior processes, inadequate policies, non-existent documentation and insufficient standards. If buying the best tools actually fixed security problems, wouldn’t we already be done? The truth is that too many employed in this field are in love with the mystique of security work. They don’t understand the business side, the drudgery, the grunt work necessary to build a successful program.

For those people, here’s my simple guide.  I’ve broken it down to the following essential tasks:

  1. Find your crap. Everything. Inventory and categorize your organization’s physical and digital assets according to risk. If you don’t have classification standards, then you must create them.
  2. Document your crap. Build run books. Make sure you have diagrams of networks and distributed applications. Create procedure documents such as IR plans. Establish SLOs and KPIs. Create policies and procedures governing the management of your digital assets.
  3. Assess your crap. Examine current state, identify any issues with the deployment or limitations with the product(s). Determine the actual requirements and analyze whether or not the tool actually meets the organization’s needs. This step can be interesting or depressing, depending upon whether or not you’re responsible for the next step.
  4. Fix your crap. Make changes to follow “best practices.” Work with vendors to understand the level-of-effort involved in configuring their products to better meet your needs. The temptation will be to replace the broken tools, but these aren’t $5 screwdrivers. Your organization made a significant investment of time and money and if you want to skip this step by replacing a tool, be prepared to provide facts and figures to back up your recommendation. Only after you’ve done this, can you go to step 6.
  5. Monitor your crap. If someone else knows your crap is down or compromised before you do, then you’ve failed. The goal isn’t to be the Oracle of Delphi or some fully omniscient being, but simply more proactive. And you don’t need to have all the logs. Identify the logs that are critical and relevant and start there: Active Directory, firewalls, VPN, IDS/IPS.
  6. Replace the crap that doesn’t work. But don’t make the same mistakes. Identify requirements, design the solution carefully, build out a test environment. Make sure to involve necessary stakeholders. And don’t waste time arguing about frameworks, just use an organized method and document what you do.
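The first few steps amount to keeping an inventory and knowing where the gaps are. Here is a minimal sketch of that idea, assuming a three-level risk classification and hypothetical asset names. Your own classification standard will almost certainly look different.

```python
from dataclasses import dataclass

# Assumed classification levels, ordered from most to least urgent.
RISK_ORDER = {"high": 0, "medium": 1, "low": 2}


@dataclass
class Asset:
    name: str
    kind: str
    risk: str
    documented: bool = False
    monitored: bool = False


def work_queue(assets):
    """Return assets missing documentation or monitoring, highest risk first."""
    gaps = [a for a in assets if not (a.documented and a.monitored)]
    return sorted(gaps, key=lambda a: (RISK_ORDER[a.risk], a.name))


# Hypothetical inventory.
INVENTORY = [
    Asset("wiki", "application", "low"),
    Asset("vpn-gateway", "network", "high", documented=True),
    Asset("domain-controller", "identity", "high", documented=True, monitored=True),
    Asset("build-server", "application", "medium", monitored=True),
]
```

Sorting by risk before anything else keeps the grunt work pointed at what matters most, which is the whole point of classifying assets in step 1.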

Now you have the foundation of any decent information security program. This process isn’t easy and it’s definitely not very sexy. But it will be more effective for your organization than installing new tools every 12 months.

 


The Question of Technical Debt

Not too long ago, I came across an interesting blog post by the former CTO of Etsy, Kellan Elliott-McCrea, which made me rethink my understanding and approach to the concept of technical debt. In it, he opined that technical debt doesn’t really exist and it’s an overused term. While specifically referencing code in his discussion, he makes some valid points that can be applied to information security and IT infrastructure.

In the post, he credits Peter Norvig with the quote, “All code is liability.” This echoes Nicholas Carr’s belief in the increased risk that arises from infrastructure technology due to the decreased advantage as it becomes more pervasive and non-proprietary.

When a resource becomes essential to competition but inconsequential to strategy, the risks it creates become more important than the advantages it provides. Think of electricity. Today, no company builds its business strategy around its electricity usage, but even a brief lapse in supply can be devastating…

Over the years, I’ve collected a fair amount of “war stories” about less than optimal application deployments and infrastructure configurations. Too often, I’ve seen things that make me want to curl up in a fetal position beneath my desk. Web developers failing to close connections or set timeouts to back-end databases, causing horrible latency. STP misconfigurations resulting in network core meltdowns. Data centers built under bathrooms or network hub sites using window unit air conditioners. Critical production equipment that’s end-of-life or not even under support. But is this really technical debt or just the way of doing business in our modern world?

Life is messy and always a “development” project. Maybe the main reason DevOps has gathered such momentum in the IT world is because it reflects the constantly evolving, always shifting, nature of existence. In the real world, there is no greenfield. Every enterprise struggles to find the time and resources for ongoing maintenance, upgrades and improvements. As Elliott-McCrea so beautifully expresses, maybe our need to label this state of affairs as atypical is a cop-out. By turning this daily challenge into something momentous, we make it worse. We accuse the previous leadership and engineering staff  of incompetence. We come to believe that the problem will be fully eradicated through the addition of the latest miracle product. Or we invite some high-priced process junkies in to provide recommendations which often result in inertia.

We end up pathologizing something which is normal, often casting an earlier team as bumbling. A characterization that easily returns to haunt us.

When we take it a step further and turn these conflations into a judgement on the intellect, professionalism, and hygiene of whomever came before us we inure ourselves to the lessons those people learned. Quickly we find ourselves in a situation where we’re undertaking major engineering projects without having correctly diagnosed what caused the issues we’re trying to solve (making recapitulating those issues likely) and having discarded the iteratively won knowledge that had allowed our organization to survive to date.

Maybe it’s time to drop the “blame game” by information security teams when evaluating our infrastructures and applications. Stop crying about technical debt, because there are legitimate explanations for the technology decisions made in our organizations and it’s generally not because someone was inept. We need to realize that IT environments aren’t static and will always be changing and growing. We must transform with them.


Why You’re Probably Not Ready for SDN

While it may seem as though I spend all my time inventing witty vendor snark to post in social media, it doesn’t pay the bills. So I have a day job as a Sr. Security Architect. But after coming up through the ranks in IT infrastructure, I often consider myself “architect first, security second.” I’m that rare thing, an IT generalist. I actually spend quite a bit of time trying to stay current on all technology and SDN is one of many topics of interest for me. Especially since vendors are now trying to spin it as a security solution.

Software-defined networking (SDN) is still discussed as if it’s the secret sauce of the Internet. This despite Gartner placing it at the bottom of its Networking Hype Cycle due to “SDN fatigue” and the technology’s failure, thus far, to gain much traction in the enterprise.

However, the magical SDN unicorn still manages to rear its head in strategy meetings under the new guise of hyper-convergence and the software-defined data center (SDDC). This is probably due to IT leadership’s continued yearning for cost savings, improved security and the achievement of a truly agile organization. But is SDN, with its added complexity and startling licensing costs, really the answer?
You can read the rest of the article here. And yes, there’s a registration wall.