
Top 10 Network Security Mistakes in AWS, and How to Fix Them, Part 2

A Two-part Blog Series and Recorded Access to Our Cloud Security Alliance Webinar

Welcome to Part 2 of our blog series, The Top 10 Network Security Mistakes in AWS, and How to Fix Them. As you may recall from Part 1, we covered the first four mistakes, which were all in the realm of native controls:

  1. Assuming Security Groups and ACLs are enough to protect against attacks and stop exfiltration
  2. Using the default outbound security group of 0.0.0.0/0 (allow any/all)
  3. Leaving east-west security to chance
  4. Thinking you’re secure because you have CSPM

In addition, we hosted a webinar with a deep dive on some of our favorites among the top 10. Roy Long and I had a great discussion. If you’ve not seen it, check out the replay here.

With that said, let’s get back into the top 10. Having covered native controls, the remaining mistakes are in the areas of:

  • Visibility
  • PaaS Security
  • Process and Culture

And we’ve got a bonus item too. More on that in a minute.

Get the right visibility

5) Equating cloud logging with great visibility 

There are a lot of logging options in AWS and all clouds: CloudTrail, CloudWatch, and service-specific logging in S3 and CloudWatch. Lots of logging does not equal relevant visibility into attacks or exfiltration. Does this visibility combine cloud asset information in real time with traffic logs and threat intelligence? Does it allow you to figure out if an attacker from a malicious IP compromised the frontend, moved laterally to the highest-value asset, and downloaded a ransomware toolkit?

The point here is not to say that all of the AWS logging options are worthless; far from it. The goal of logging is to help solve specific problems, in terms of protection first and then incident response. 

Recommendation:

  • Decide on where you need visibility. Areas to consider in addition to the cloud: Ingress, egress, east-west, PaaS flows. 
  • Decide on the highest priority security threats and initiatives of the organization. 
  • Determine how you will get this visibility with a minimum number of tools across multiple accounts and clouds. 

 

6) VPC flow logs and Route 53 DNS queries are just too basic to be of any use

This is an easy blind spot to overlook. Logs of Route 53 DNS queries tell you about intent (where the attackers are looking to connect) and VPC flow logs tell you something about actual behavior (where they actually connected), both good and bad. Amid the flood of logs (see above) it’s important not to lose focus on data that can give you meaningful information. 

Recommendation:

  • While VPC flow logs and DNS queries can give you meaningful information, their volume can create the classic haystack problem. What you want to do is correlate this deluge of information with relevant context: workloads (instances, VPCs, and their metadata) and threat intelligence. This combination of asset information, traffic logs, and threat intelligence allows you to separate the malicious IPs and sites from the merely strange and the benign. 
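
The correlation described above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: it assumes flow log records in the default (version 2) space-separated format, and the asset inventory (ASSETS), the threat-intelligence list (THREAT_INTEL), and the sample records are all hypothetical placeholders that use documentation IP ranges.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network

# Field order of the default (version 2) VPC flow log record.
FIELDS = ["version", "account_id", "interface_id", "srcaddr", "dstaddr",
          "srcport", "dstport", "protocol", "packets", "bytes",
          "start", "end", "action", "log_status"]

# Hypothetical threat-intel feed (a documentation range stands in for real data).
THREAT_INTEL = [ip_network("203.0.113.0/24")]

# Hypothetical asset inventory keyed by network interface ID.
ASSETS = {
    "eni-0aaa": {"name": "web-frontend", "vpc": "vpc-prod"},
    "eni-0bbb": {"name": "orders-db", "vpc": "vpc-prod"},
}


@dataclass
class Finding:
    asset: str
    vpc: str
    remote_ip: str
    remote_role: str  # whether the flagged IP was the "source" or "destination"
    action: str
    bytes: int


def is_malicious(addr: str) -> bool:
    """True if addr parses as an IP and falls inside any threat-intel network."""
    try:
        ip = ip_address(addr)
    except ValueError:  # NODATA/SKIPDATA records carry "-" instead of an address
        return False
    return any(ip in net for net in THREAT_INTEL)


def correlate(lines):
    """Join flow records with asset metadata and keep only threat-intel hits."""
    findings = []
    for line in lines:
        fields = line.split()
        if len(fields) != len(FIELDS):
            continue  # not a default-format record
        rec = dict(zip(FIELDS, fields))
        asset = ASSETS.get(rec["interface_id"])
        if asset is None:
            continue  # interface not in our inventory
        for role, key in (("source", "srcaddr"), ("destination", "dstaddr")):
            if is_malicious(rec[key]):
                findings.append(Finding(asset["name"], asset["vpc"], rec[key],
                                        role, rec["action"], int(rec["bytes"])))
    return findings


# Hypothetical sample records.
FLOW_LOGS = [
    "2 123456789012 eni-0aaa 203.0.113.7 10.0.1.10 44321 443 6 10 840 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0bbb 10.0.2.20 203.0.113.9 50514 443 6 200 900000 1700000100 1700000160 ACCEPT OK",
    "2 123456789012 eni-0bbb 10.0.2.20 198.51.100.5 50515 443 6 5 400 1700000100 1700000160 ACCEPT OK",
    "2 123456789012 eni-0aaa - - - - - - - 1700000200 1700000260 - NODATA",
]

for finding in correlate(FLOW_LOGS):
    print(finding)
```

Out of four records, only the two that touch a flagged range survive, and each one already names the workload and VPC involved. That is the difference between a haystack and a short list to investigate.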

 

PaaS is not automagically secure

7) Relying on the shared security model for PaaS security

AWS’ (and others’) shared security model says that the provider protects its IaaS and PaaS infrastructure, and it’s the customer’s responsibility to turn the right knobs: set up correct IAM permissions, enable encryption, configure logging, etc. It’s AWS’ responsibility to ensure the security of the PaaS itself, i.e. patch it and protect against attacks – and they do that brilliantly. But how do you know you’ve configured and architected your AWS cloud services correctly, especially the services that create VPC endpoints? Are there stranger things taking place? 

Use of AWS breaks down into the following modes of deployment (the cloud perimeter):

  • Account management: True out-of-band services like IAM and the AWS Management Console. CSPM tools help here. 
  • Public PaaS: AWS S3 buckets that host static content like images, files, etc., or Lambda functions and API server endpoints. CSPM and AWS offerings like Config and Macie help you ensure these are configured correctly. 
  • Your VPCs: This is where your IaaS applications live. Often there are 10s or 100s of VPCs, most of them private and connected in a hub-n-spoke design around AWS Transit Gateway. 
  • Private PaaS: These are completely private to your VPCs and hold the crown jewels of any organization: RDS databases, Elasticsearch logs, private S3 buckets that hold sensitive data. 

It’s these last two categories that represent the biggest emerging threat, since it’s assumed that they are private to your VPCs and don’t need special protections. It’s a fallacy to assume that perimeter protections are sufficient. 

Recommendation:

  • Access all your private PaaS through PrivateLink and VPC endpoints, with appropriate security groups to limit access from outside. 
  • A best practice is to put all your PrivateLink and VPC endpoints in a single shared VPC where your common services live. Everyone must access these through a hub-n-spoke design across AWS Transit Gateway. This gives you two benefits: (a) it reduces the sprawl of VPC endpoints spread across 10s of application VPCs, and (b) it enables stricter access control to important PaaS assets like your production RDS. You now get visibility and control of your PaaS, and a single place to manage it. “dev” can only access approved “dev” PaaS, and “prod” can only connect to “prod” PaaS instances, but with a stricter inspection policy. 
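
To make the “dev to dev, prod to prod” rule concrete, here is a minimal sketch of the decision a central inspection point in the shared VPC could apply. The policy table, the environment labels, the inspection profile names, and the endpoint names are all hypothetical; a real deployment would derive them from tags or an inventory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    name: str
    env: str  # environment label of the PaaS endpoint, e.g. "dev" or "prod"


# Hypothetical policy: source environment -> (allowed PaaS environment, inspection profile)
POLICY = {
    "dev": ("dev", "standard"),
    "prod": ("prod", "strict"),
}


def evaluate(source_env: str, target: Endpoint):
    """Return ("allow", profile) or ("deny", None) for a spoke-to-PaaS connection."""
    rule = POLICY.get(source_env)
    if rule is None or rule[0] != target.env:
        return ("deny", None)  # default deny: unknown source or cross-environment access
    return ("allow", rule[1])


rds_dev = Endpoint("rds-dev", "dev")
rds_prod = Endpoint("rds-prod", "prod")

print(evaluate("dev", rds_dev))    # same environment
print(evaluate("dev", rds_prod))   # cross-environment
print(evaluate("prod", rds_prod))  # same environment, stricter profile
```

The point of the default-deny branch is that a spoke VPC with no explicit rule gets nothing, which is only practical to enforce when all endpoints sit behind one shared choke point.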

 

8) I can protect my S3 buckets and RDS database with IAM controls and CSPM

Once you become an expert in IAM it’s a powerful feeling. It allows you to set access policies, permissions, and configuration settings in a clear and concise manner. And AWS SCP and CSPM tools can be used to ensure that they meet compliance requirements and company policies. Werner Vogels, AWS CTO, frequently talks about avoiding single points of failure. Relying on IAM alone to protect your PaaS is one of them. Recent vulnerabilities in cloud services, like Azurescape, and tens of S3 incidents confirm this hypothesis.

One of the common breaches in public clouds today happens when an attacker has somehow landed on a frontend instance (breached the marketing webserver through a WAF misconfiguration, malicious insider…) and then uses the valid IAM privileges of that machine to legitimately connect to other systems, including RDS databases, to exfiltrate data. This is not AWS’ fault: the attacker used the approved IAM credentials of a breached machine to move laterally. 

Recommendation: To protect against misconfigurations, laterally moving attackers, and malicious insiders, take the following steps:

  • Get visibility into your PaaS traffic (see the recommendation above). 
  • For critical applications (production, compliance) implement advanced network security that provides a defense-in-depth approach, i.e. ensure that “prod” VPCs can only connect to prod DynamoDB, or “pci” instances (EC2 or EKS) can only connect to the RDS used for PCI-DSS applications. 
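
As one small, hedged illustration of layering a network-path check on top of identity, the sketch below builds an S3 bucket policy that denies any request not arriving through a specific VPC endpoint, using the aws:SourceVpce condition key. The bucket name and endpoint ID are hypothetical. This is one building block, not the advanced network security described above, and a blanket deny like this also blocks access paths that don’t traverse the endpoint (console access, for example), so treat it as a pattern to adapt rather than a drop-in policy.

```python
import json


def bucket_policy_restricted_to_endpoint(bucket: str, vpce_id: str) -> dict:
    """Build a bucket policy that denies S3 requests not made via the given VPC endpoint."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyAccessOutsideApprovedEndpoint",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                # Valid IAM credentials are not enough: the request must also
                # arrive through the approved endpoint.
                "Condition": {"StringNotEquals": {"aws:SourceVpce": vpce_id}},
            }
        ],
    }


# Hypothetical bucket name and VPC endpoint ID.
policy = bucket_policy_restricted_to_endpoint("example-pci-data", "vpce-0123456789abcdef0")
print(json.dumps(policy, indent=2))
```

With a condition like this in place, stolen or borrowed credentials used from outside the approved path are refused even though the identity itself is valid.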

 

Business process & culture improvements

Yes, the last two are really meant to improve your network security. Tactical recommendations can only take you so far in improving your security posture. To truly get the agility and digital transformation benefits of adopting public clouds, you seriously need to consider these. 

9) Assuming you don’t have the resources to properly secure your environment. Automate all you can, so security teams can focus on what you can’t

All of the above recommendations are possible to implement if you have infinite resources and patience. The real answer, of course, is to build network security into your security operations, DevOps, or overall automation processes. This is relatively easy for the basic network security you’d configure via Security Groups and IAM controls. The real challenge comes with advanced network security that has to support TLS decryption, forward or reverse proxy, auto-scaling, and cloud networking, and do so at scale (10s and 100s of VPCs and many AWS accounts). By automating network security (basic and advanced), you ensure that security teams can focus on what is hardest, continuously innovate, and have cycles for threat hunting and incident response. 

Recommendation: 

  • Select an infrastructure-as-code (IaC) automation tool that gives flexibility in terms of design patterns and workflows, supports multiple clouds, and has an open architecture that enables third-party integrations. While AWS CloudFormation can be great for small teams, mid-sized to large deployments should consider Terraform. More importantly, ensure that your advanced network security solution works smoothly with it, with no compromises. There are too many Terraform plugins (called providers) that seem to automate security, but then require additional support scripts in Python, Ansible, or Go. 
  • Forget the jargon about DevSecOps or SecOps etc; what you are trying to do is bake security into the build, deploy and run process. If this concept is new to your organization or company, then start with a single project or team to build experience and lessons that can be shared for adopting this across your entire organization. 

 

10) Allowing silos to persist. Security is not an island

It is still far too common that application teams are moving at cloud speed, and the security team is still dealing with tickets to open port X to make app1 go live. This is the datacenter siloed approach. 

Recommendation:

  • Security should be an active participant in the architecture and design of applications, not consulted once the design is complete. The best organizations include a security practitioner who participates in design discussions. 
  • Security should be setting the policies and overseeing the implementation. Actual implementation should be in the hands of the DevOps or operations team. 

And, because at Valtix, ours goes to eleven, an additional mistake that we’ve seen far too often:

 

11) Bringing a data center mindset to the cloud

We’ve seen this from both end-user organizations and technology vendors. You’ll see this in everything from control processes to architecture and implementation. A classic example is setting up systems with classic active-passive high availability (HA). The active device handles all the traffic and the passive (aka standby) device monitors it; when the passive device detects a failure in the active device, it takes over. Legacy next-generation firewalls (NGFWs), web application firewalls (WAFs), and load balancers all use this design pattern. 

This approach made sense in traditional data centers and on-premises networks because you had control of the networking infrastructure (especially layer 2) so that you actually had a sub-second failover and you got session-level failover. I’ll spare you the details of why this works on-prem (see GARP) and why you spent 2X the money for the two devices. Public clouds give you no control over layer 2, and hence the active-passive failover implementations are a complete kludge: the failover is more like a disaster recovery (DR) setup where the failover takes several minutes because the passive device has to make cloud API calls to shift traffic from the failed active device to the passive device. The setup for these things is complex (see here and here for examples), the operating model is very brittle with lots of moving parts/scripts and no service-level agreement (SLA) on uptime, and you actually don’t get session failover while paying for 2 devices and getting only 1 in active service at any time. 

Another example: using legacy hardware security modules (HSMs) for secrets management. More on this in a future blog. 

Recommendation:

  • If you go cloud, go all the way, the right way. When architecting for public clouds, evaluate if your existing design pattern has a better replacement that is cloud-native. Ask your cloud provider’s solution architect and vendors for best practices that don’t perpetuate the old designs. 
  • For high availability, redundancy and scalability use the cloud-native constructs: cloud-native load balancing across multiple AZs (and multiple regions if you need that level of uptime and redundancy), health monitoring and telemetry from a control plane with an SLA, built-in self-healing from the control plane which replaces unresponsive instances, auto-scaling from the control plane. The beauty of this architecture is that it works at all scales: small setups use 2 AZ deployments with 1 instance each as a minimum, larger ones just have a higher auto-scaling maximum with more AZs. The control plane scales the infrastructure automatically, on-demand.  
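
The self-healing and auto-scaling behavior described above can be pictured as a reconcile loop run by the control plane. The sketch below is a toy simulation under stated assumptions, not any vendor’s or AWS’ implementation: the Instance type, the AZ names, and the reconcile function are hypothetical, and “replacement” is just a new object rather than a real launch.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import count


@dataclass
class Instance:
    id: str
    az: str
    healthy: bool = True


_new_ids = count(1)  # hypothetical ID source for replacement instances


def reconcile(instances, azs, desired):
    """Drop unhealthy instances, then add replacements in the least-populated AZ
    until the fleet reaches the desired size."""
    fleet = [i for i in instances if i.healthy]
    while len(fleet) < desired:
        per_az = Counter({az: 0 for az in azs})
        per_az.update(i.az for i in fleet)
        target = min(azs, key=lambda az: per_az[az])
        fleet.append(Instance(f"i-new{next(_new_ids)}", target))
    return fleet


azs = ["az-a", "az-b"]
fleet = [Instance("i-1", "az-a"), Instance("i-2", "az-b", healthy=False)]

# Minimum footprint: 2 AZs with 1 instance each. The failed instance is replaced in its AZ.
healed = reconcile(fleet, azs, desired=2)
print([(i.id, i.az) for i in healed])

# Scaling up is the same loop with a larger desired count.
scaled = reconcile(healed, azs, desired=4)
print(Counter(i.az for i in scaled))
```

Notice that there is no standby device and no failover script: a failure and a scale-out are handled by the same loop, which is why the pattern works at both small and large scale.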

 

So there you have it – 10 (11!) common network security mistakes we see frequently enough to make the top 10.

I hope this was helpful. Thanks for reading!
