Something Wicked This Way Comes?

I’ve been through multiple ‘reboots’ in my tech career. I started out initially doing networking on PCs (DOS/Win/Mac/Linux): installing network interface cards (NICs), configuring them, and then moving on to the next machine. It was a very dynamic and transformational period in the tech world. Then, I became a “systems administrator”, which at the time in the mid-90’s connoted a high degree of skill with UNIX and UNIX-like operating systems used for distributed applications like the brand new “World Wide Web” among other things. At the time, I was as enthusiastic as ever, but also now a bit seasoned, which gave me just enough bravado to advance my career into the private sector (after starting out exclusively in academia and higher-ed).

That was the biggest jump I’d experienced in my nascent tech career path, at a time when companies were hiring ALL THE PEOPLE who had remotely even heard something about this thing called the Internet. I was lucky, I had both knowledge and experience. As a grad student, before deciding to become a computer professional, in addition to reading journal articles for my doctoral program of study, I’d print out and read Internet RFCs. This was back when there were less than a few hundred or so of them, and you recognized a lot of the authors’ names because they were writing them all. This Internet thing was undeniably super cool and, like many of my peers, I could already sense that it was going to the world dramatically.

A few years before, I recall having a conversation with my major professor at one point (before leaving the academic world for the computing world), and describing to him this thing called The Well and proposing that we do some research into the influence of computer networks on human social behavior. Mind you, this was around 1990, still several years before the average person owned a computer or means of connecting to the Internet from home. I was envisioning an advanced network of social communication not unlike today’s social media platforms (my chosen field of study was called social cognition, and I focused on impression formation and attitude formation).

I think he laughed at me, because neither one of us at the time had any real grasp of the concept at that time. I had been on the Well, and using BBS’es a lot, was starting to explore FTP and Archie and Gopher. And, I had heard of HCI research (Human Computer Interaction) but was imagining something else as a field of study that didn’t exist, with an added social element that I thought was completely intriguing to consider.  He didn’t really know where to go with my idea, so we dropped it and decided that it was really outside of what we were studying and trying to accomplish with our respective lines of research.

After making the jump to the private sector and privatization of Internet “sites”, most of my work was still in the sysadmin realm, but soon expanded to include both storage and networking on an enterprise scale. I wore whatever hat needed wearing on a given day in a given workplace, and grateful that my self-taught tech skills seem to allow me to progress in my career without many obstacles. As time marched on, problems with staffing became more of a scale issue: hire more and more engineers to get the infrastructure substrate solid and support application development faster.

Around 2005 or so, I began noticing a trend of people applying for systems and software engineering jobs with little to no experience and deep understanding, but plenty of technical training and certifications. Positions were starting to go unfilled but, also many new hires were simply not up to speed and often caused more problems than they solved. It didn’t matter what industry or the company, no workplace was immune to this problem it seemed.

Now, it seems this problem has become both chronic and more acute. I believe that’s part of the rise of DevOps. DevOps itself has resisted hard definitions, but I tend to think of it as basically adopting a more agile methodology for software development projects and infrastructure operations, by tapping people to do both development and operations work. It’s been embraced by organizations large and small, but it still remains early. And finding a good devops engineer, one that can effectively traverse the trees as well as the forest, so to speak, is even more difficult. Positions are still going unfilled and orgs are turning to trying to coach and train existing staff, to get them “up to speed” on devops practices.

And, lo and behold, guess what? It’s not working,  except in the rarest of occasions in my experience. Yes, devops is expanding, but it’s dragging a lot of legacy cruft with it and dilutes the gains of the paradigm shift. I just recently read this article by James Beswick in which he describes almost exactly the same story I’ve been sharing with you here, but with the backdrop of cloud adoption and the role that automation is playing in eliminating human labor. It’s a real thing, I see it happening in many places.

To the extent that the cost-reduction process is the prime driver for cloud adoption, it’s natural to expect some sort of fallout like eliminating staff. However, and this is where my concern spikes a bit, companies seem to be risking the gains offered by devops in the cloud by an attitude of “Well, we can’t find the right person, so let’s just have our existing staff to get training and certifications and all will be well”. In all but the rarest of situations, that isn’t working. Many people are not able to make the jump, and it drags down the entire org which is still managing to operate in the field of the devops/cloud paradigm without realizing the essential benefits (streamlining of workflows, automation and code testing, rapid cut/release cycles, etc.).

It remains to be seen if we are simply at the bottom of the S-curve as Beswick suggests; it’s a compelling analogy, one that’s worked to describe technological bifurcation in the last 40 years or so. But, we clearly have a definite problem in terms of a lack of skills and capabilities in the existing labor supply that is being magnified with an insatiable demand by orgs to move faster and better daily. Is it simply a waiting game, whereby more automation will eventually fill in the skilled labor gaps? Machine learning and state machine processing can effectively replace many humans. Data mining is occurring on a level that is mind-boggling and, to be honest, downright frightening.

For example, Michael Bloomberg’s recent massive investment in a data mining approach to controlling the electorate does not bode well for you and I. While I applaud the goal, I am deeply afraid of the implications for what is left of consumer (i.e., voter) privacy.  Where are the conversations about ethics and individual freedom and privacy? Where is the EFF or other orgs that normally care about such things? Have we all just given up any and all pretense of having privacy in this new world of data mining and individual profiling? We know where this can go, look at China’s recent ratcheting up of control over its citizens via massive investments in data-mining and facial recognition.

From a purely technological perspective, it’s nothing short of amazing how advances in cloud computing are removing barriers, sometimes daily, to processing what has always seemed like an infinite sea of data. Yet, at the same time, I am starting to feel a bit like how I think Bill Joy must have felt in the early 2000’s when he expressed regret over contributing indirectly to the rise of nanotechnology and what that meant for the future of humankind. My fearfulness is very real and legitimate, but at the end of the day, I don’t feel like I can stop the machine. I hope we survive this new period of even more rapid technological advancements in our culture and learn from it.

Huzzah, I’m AWS Certified!

I passed the AWS Certified Solutions Architect – Associate exam on 12/11 and the AWS Certified Developer – Associate exam on 12/15. Woohoo!

awsCertSolArchAssocawsCertDeveloperAssoc

AWS re:Invent 2017 Download

I had a good time at AWS re:Invent 2017 last week, despite being sick as a dog for most of it. Though I caught fewer sessions than I would have liked, the ones I did attend on serverless topics were top notch. Here are some links to my favorites:

ARC401 – Serverless Architectural Patterns and Best Practices

Highlights:

  • Serverless foundations
  • Web applications
  • Data Lakes
  • Stream processing
  • Operations automation (e.g., Tailor, for automating AWS account creation)
  • Excellent review of best practices and new features in Lambda

SRV401 – Become a Serverless Black Belt: Optimizing Your Serverless Applications

Highlights:

  • Optimization Katas
    • Lean Functions
    • Eventful Invocations
    • Coordinated Calls
    • Serviceful Operations
  • Cold start issues in Lambda
  • Instrumenting Lambda with XRay
  • Resource allocation
  • Concurrency vs. latency
  • Compelling customer story from ACloudGuru’s VP of Engineering on going 100% serverless

SRV305 – What’s New in Serverless

Highlights:

  • Announced Serverless Application Repository
  • Reviewed new Lambda console
  • Reviewed new Lambda features
  • Reviewed Cloud9 IDE
  • Reviewed XRay tracing for Lambda
  • New API Gateway features
  • Compelling customer story from FICO’s VP of Engineering

Integration Gotcha: SNS and Lambda

When using SNS pub/sub components, a common integration pattern is to use Lambda to process SNS messages. This can include the use of data blobs as the SNS payload for doing file processing, data transformations, and archiving data in S3 among other things. SNS messages have a large payload limit of 256KB per message, but I recently ran into a situation where I could not reliably deliver messages that were sized well under that limit.

As it turns out, when Lambda is consuming your SNS large payloads in events, you hit a limit within Lambda that is exactly half of the SNS payload limit. For Event (asynchronous) invocations in Lambda, there is a 128KB payload limit. So, if your SNS messages are not being processed by Lambda, check the size of the messages and verify that they are below 128KB. This was a confusing problem until I looked at the CloudWatch console for SNS message deliveries and noticed the errors there.

 

 

Patterns for Kinesis Streams

I’ve recently been working on a streaming component in a project and have been spending a lot of time with both Kinesis Streams and Kinesis Firehose. I tend to think of the two as event queue frameworks, with Firehose also having the ability to forward events to other AWS services like ElasticSearch (for Kibana-style dashboarding) and backup the same data to a S3 bucket. If you don’t need either of those destinations, then most likely you will get plenty of mileage out of working with Streams alone.

Potential uses abound, but one powerful pattern is making Kinesis a destination for CloudWatch Logs streams via subscription filters. By creating a Kinesis stream and making it a CloudWatch log destination in one account, you can readily add CloudWatch subscription filters in other accounts to create a cross-account log sink. Once your CloudWatch Logs are in one or more Kinesis Streams shards, you can process that log data via Lambda and/or possibly forward to Kinesis Firehose for ES/S3 delivery. There’s a great blog post over at Blend about this exact sort of usage, including a link to their GitHub repo for the CloudFormation templates they use to build and deploy the solution.

One of the best overviews I’ve read recently about design and scale-out issues around event queue processing and how Kinesis resolves, by design, a lot of the challenges therein (e.g., data duplication, ABA problems) is by the fine folks over at Instrumental, entitled “Amazon Kinesis: the best event queue you’re not using“. If you are considering using Kinesis at scale, or are already designing/deploying a consumer/producer pattern to be used with Kinesis, I highly recommend you check out the Instrumental blog post.

 

Python’s logging() module in a boto3/botocore context

Python’s logging module provides a powerful framework for adding log statements to code vs. what might be done via using print() statements. It provides a system of logging levels similar to syslog-style levels that can be used to produce both on-screen runtime diagnostics as well as more detailed logs with full debug level insights into per module/submodule behavior.

Managing usage of logging() can be complicated, especially around the hierarchical nature of the log streams that it provides. I have developed a simple boto3 script that integrates logging to illustrate a basic usage that is easy to adopt and, in the end, not much more work than using print() statements. For detailed information on logging beyond what I present here, consult the excellent Python docs on the topic, as well as the links in the References section at the end of this post.

Logging Configuration

The setup for logging() that I am using involves two configuration files, logger_config.yaml and logger_config_debug.yaml. The difference between the two files has to do with the log levels used by the log handlers. By default, the example module deployVpc.py uses the logger_config setup. This config will produce no screen output by default except at the ERROR level and above. It produces a log file, however, that contains messages at the INFO level for the module and at the WARNING level for boto-specific calls.

Note: boto (including botocore) ships with some logging() active at the INFO level. While not as detailed as DEBUG, there’s enough busyness to that level of logging by boto that you will likely want to not see its messages except when troubleshooting or debugging your code. This is the approach I took with the current configuration, by opting to set custom logger definitions for boto and friends, so that the root logger will not by default display boto’s native log level messages.

Let’s take a look at the default logging configuration file I’ve put together, logger_config.yaml:

---
version: 1
disable_existing_loggers: False
formatters:
  simple:
    format: "%(asctime)s %(levelname)s %(module)s %(message)s"
  fancy:
    format: "%(asctime)s|%(levelname)s|%(module)s.%(funcName)s:%(lineno)-2s|%(message)s"
  debug:
    format: "%(asctime)s|%(levelname)s|%(pathname)s:%(funcName)s:%(lineno)-2s|%(message)s"

handlers:
  console:
    class: logging.StreamHandler
    level: DEBUG
    formatter: simple
    stream: ext://sys.stdout

  screen:
    class: logging.StreamHandler
    level: ERROR
    formatter: fancy
    stream: ext://sys.stdout

  logfile:
    class: logging.handlers.RotatingFileHandler
    level: DEBUG
    formatter: debug
    filename: "/tmp/deployVpc.log"
    maxBytes: 1000000
    backupCount: 10
    encoding: utf8

loggers:
  boto:
    level: WARNING
    handlers: [logfile, screen]
    propagate: no
  boto3:
    level: WARNING
    handlers: [logfile, screen]
    propagate: no
  botocore:
    level: WARNING
    handlers: [logfile, screen]
    propagate: no
  deployVpc:
    level: INFO
    handlers: [logfile, screen]
    propagate: no
  __main__:
    level: INFO
    handlers: [logfile, screen]
    propagate: no

root:
  level: NOTSET
  handlers: [console, logfile]

I chose to use YAML for the configuration file as it’s easier to parse, both visually and programmatically. By default, Python uses an INI file format for configuration, but both JSON and YAML are easily supported.

At the top of the file is some basic configuration information. Note the disable_existing_loggers setting. This allows us to avoid timing problems with module-level invocation of loggers. When logging per module/submodule, as those modules are imported early in your main script, they will not find the correct configuration information as it’s yet to be loaded. By setting disable_existing_loggers to False, we avoid that problem.

The remaining file consists of four sections:

  • formatters
  • handlers
  • loggers
  • root logger definition

Formatters

Formatters are used to define the log message string format. Here, I am using three different formatters:

  • simple – very simple and brief
  • fancy – more detail including timestamp for a helpful log entry
  • debug – fancy with module pathname instead of module name, useful for boto messages

By default, I leave simple for the console handler (for root logger), use fancy for the screen handler, and debug for the logfile handler.

Handlers

Handlers are used to define at what level, in what format, and exactly where a particular log message should be generated. I’ve left console in its default configuration, but added a StreamHandler and a RotatingFileHandler. Python’s logging module supports multiple types of handlers including Syslog, SMTP, HTTP, and others. Very flexible and powerful!

  • console – used by the root logger
  • screen – log ERROR level and above using fancy formatting to the screen/stdout
  • logfile – log DEBUG level messages and above using debug formatting to a file in /tmp that gets automatically rotated at 1MB and retention of 10 copies

Loggers

Loggers are referenced in your code whenever a message is generated. The configuration for a given logger is found in this section of the configuration file. In my case, I wanted a separate logger per module/function if necessary, so I’ve made entries at that level. I also include entries for boto and friends so I can adjust their default log levels so I don’t see their detailed information except when and where I want to (i.e., by logging at WARNING instead of INFO or DEBUG for normal operation). A logger entry also defines where log streams should end up. In this case, I send all streams to both my screen handler and my logfile handler.

I also don’t want custom loggers to propagate messages throughout the logging hierarchy (i.e., up to the root logger). So I’ve set propagate to “no”.

Implementing logging in code

Setup

I created a module called loggerSetup.py which is where I do the initialization for defining how logging() will be configured, via the configuration files:

#!/usr/bin/env python
"""Setup logging module for use"""

import os
import logging
import logging.config
import yaml

home = os.path.expanduser('~')
logger_config = home + "/git-repos/rcrelia/aws-mojo/boto3/loggerExample/logger_config.yaml"
logger_debug_config = home + "/git-repos/rcrelia/aws-mojo/boto3/loggerExample/logger_config_debug.yaml"

def configure(default_path=logger_config, default_level=logging.DEBUG, env_key='LOG_CFG'):
    """Setup logging configuration"""
    path = default_path
    value = os.getenv(env_key, None)
    if value:
        path = value
    if os.path.exists(path):
        with open(path, 'rt') as f:
            config = yaml.safe_load(f.read())
        logging.config.dictConfig(config)
    else:
        logging.basicConfig(level=default_level)

def configure_debug(default_path=logger_debug_config, default_level=logging.DEBUG, env_key='LOG_CFG'):
    """Setup logging configuration for debugging"""
    path = default_path
    value = os.getenv(env_key, None)
    if value:
        path = value
    if os.path.exists(path):
        with open(path, 'rt') as f:
            config = yaml.safe_load(f.read())
        logging.config.dictConfig(config)
    else:
        logging.basicConfig(level=default_level)

This module defines two functions: configure() and configure_debug(). This provides another way of running a non-default logging configuration without using the LOG_CFG environment variable (i.e., on a per-module basis). When you setup logging in your module like so:

loggerSetup.configure()
logger = logging.getLogger(__name__)

You would simply edit the first line to use .configure_debug() instead of .configure().

 Usage

Usage is straightforward, simply do the following in each module you wish to use logging(). Refer to the deployVpc.py script for the full syntax and usage around these bits of code.

Note: deployVpc.py requires use of AWS API key access that is stored in a config profile (I used one called ‘aws-mojo’, change to your own favorite profile). It will create a VPC and Internet Gateway in your AWS account. But it will also, by default, remove those objects as well. Caveat emptor…

  1. Import the logging modules and loggerSetup module
import logging, logging.config, loggerSetup
  1. Activate the logging configuration and define your logger for the module
loggerSetup.configure()
logger = logging.getLogger(__name__)

Note: By using __name__ instead of a custom logger name, you can easily re-use this setup code in any module.

  1. Add a logger command to your code using the level of your choice:
logger.info('EC2 Session object created')

That’s all there is to it. Below are some screenshots that show the handler output (screen and logfile) for both the default and debug configurations. Hopefully this will encourage you to look at using Python’s logging() framework for your own projects.

The full source for all of the logging module configuration as well as sample boto script is available over on GitHub in my aws-mojo repository.

Screenshots

Example: Default configuration – output to screen handler (should be no output except ERROR and above)

Default screen handler output

Example: Default configuration – output to logfile handler (should be messages at INFO and above for your code and at WARNING and above for boto library code messaging)

Default logfile handler output

Example: Debug configuration – output to screen handler (should be messages at INFO and above for your code and at WARNING)

Debug screen handler output

Example: Debug configuration – output to logfile handler (should be messages at DEBUG and all levels for your code and boto library code messaging)

Debug logfile handler output

References

Continue reading

Ansible and EC2 Auto Scaling Groups: False-positive idempotency errors and a workaround

When using Ansible to deploy and manage EC2 auto scaling groups (ASGs) in AWS, you may encounter, like I have recently, an issue with idempotency errors that can be somewhat befuddling. Basically, when the ec2_asg module is called, one of its properties, vpc_zone_identifier, is used to define the subnets used by the ASG. A typical ASG configuration is to use two subnets, each one in a different availability zone, for a robust HA configuration, like so:

- name: "create auto scaling group"
  local_action:
    module: ec2_asg
    name: "{{ asg_name }}"
    desired_capacity: "{{ desired_capacity }}"
    launch_config_name: "{{ launch_config }}"
    min_size: 2
    max_size: 3
    desired_capacity: 2
    region: "{{ region }}"
    vpc_zone_identifier: "{{ subnet_ids }}"
    state: present

Upon subsequent Ansible plays, when ec2_asg is called, but no changes are made, you can still experience a changed=true result because of how Ansible is ordering the subnet-id’s used in vpc_zone_identifier versus how AWS is ordering them. This makes the play non-idempotent. How does this happen?

It turns out that Ansible’s ec2_asg module sorts the subnet-ids, while AWS does not when it returns those values. Here is the relevant code from the v2.3.0.0 version of ec2_asg.py, notice the sorting that happens in an attempt to match what AWS provides as an order:

 518 for attr in ASG_ATTRIBUTES:
 519     if module.params.get(attr, None) is not None:
 520         module_attr = module.params.get(attr)
 521         if attr == 'vpc_zone_identifier':
 522             module_attr = ','.join(module_attr)
 523         group_attr = getattr(as_group, attr)
 524         # we do this because AWS and the module may return the same list
 525         # sorted differently
 526         if attr != 'termination_policies':
 527             try:
 528                 module_attr.sort()
 529             except:
 530                 pass
 531             try:
 532                 group_attr.sort()
 533             except:
 534                 pass
 535         if group_attr != module_attr:
 536             changed = True
 537             setattr(as_group, attr, module_attr)
 538

While this is all well and good, AWS does not follow any specific ordering algorithm when it returns values for subnet-ids in the ASG context. So, when AWS returns its subnet-id list for the ec2_asg call, Ansible will sometimes have a different order in its ec2_asg configuration and then incorrectly interpret the difference between the two lists as a change and mark it thusly. If you are counting on your Ansible plays to be perfectly idempotent, this is problematic. There is now an open GitHub issue about this specific problem.

The good news is that the latest development version of ec2_asg, which is also written using boto3, does not exhibit this false-positive idempotency error issue. The devel version of ec2_asg (i.e., unreleased 2.4.0.0) is altogether different than what ships in current stable releases. So, these false-positive idempotency errors can occur in releases up to and including version 2.3.1.0 (I have found it in 2.2.1.0, 2.3.0.0, and 2.3.1.0).  Sometime soon, we should have a version of ec2_asg that behaves idempotently. But what to do until then?

One approach is to write a custom library in Python that you use instead of ec2_asg. While feasible, it would involve a lot of time spent verifying integration with both AWS and existing Ansible AWS modules.

Another approach, and one I took recently, is to simply ask AWS what it has for the order of subnet-ids to be in vpc_zone_identifier and then plug that ordering into what I pass to ec2_asg during each run.

Prior to running ec2_asg, I use the command module to run the AWSCLI autoscaling utility and query for the contents of VPCZoneIdentifier. Then I take those results and use them as the ordered list that I pass into ec2_asg afterward:

- name: "check for ASG subnet order due to idempotency failures with ec2_asg"
  command: 'aws autoscaling describe-auto-scaling-groups --region "{{ region }}" --auto-scaling-group-names "{{ asg_name }}" '
  register: describe_asg
  changed_when: false

- name: "parse the json input from aws describe-auto-scaling-groups"
  set_fact: asg="{{ describe_asg.stdout | from_json }}"

- name: "get vpc_zone_identifier and parse for subnet-id ordering"
  set_fact: asg_subnets="{{ asg.AutoScalingGroups[0].VPCZoneIdentifier.split(',') }}"
  when: asg.AutoScalingGroups

- name: "update subnet_ids on subsequent runs"
  set_fact: my_subnet_ids="{{ asg_subnets }}"
  when: asg.AutoScalingGroups

# now use the AWS-sorted list, my_subnet_ids, as the content of vpc_zone_identifier

- name: "create auto scaling group"
  local_action:
    module: ec2_asg
    name: "{{ asg_name }}"
    desired_capacity: "{{ desired_capacity }}"
    launch_config_name: "{{ launch_config }}"
    min_size: 2
    max_size: 3
    desired_capacity: 2
    region: "{{ region }}"
    vpc_zone_identifier: "{{ my_subnet_ids }}"
    state: present

On each run, the following happens:

  1. A command task runs the AWSCLI to describe the autoscaling group in question. If it’s the first run, an empty array is returned. The result is registered as asg_describe.
  2. The JSON data in asg_describe is copied into a new Ansible fact called “asg”
  3. The subnets in use by the ASG and how they are ordered is determined by extracting the VPCZoneIdentifier attribute from the AutoScalingGroup (asg fact). If it’s the first run, this step is skipped because of the when: clause which limits task execution to runs where the ASG already exists (runs 2 and later). It puts this list into the fact called “asg_subnets”
  4. Using the AWS-ordered list from step 3, Ansible sets a new fact called “my_subnet_ids”, which is then specified as the value to vpc_zone_identifier when ec2_asg is called.

I did a test on the idempotency of the play by running Ansible one hundred times after the ASG was created; at no point did I receive a false-positive change. Prior to this workaround, it would happen every run if I happened to be specifying subnet-ids ordered differently than from what AWS returned in terms of their order.

While this is admittedly somewhat kludgy, at least I can be confident that my plays involving AWS EC2 autoscaling groups will actually behave idempotently when they should. In the meantime, while we wait for the next update to Ansible’s ec2_asg module, this workaround can be used successfully to avoid false positive idempotency errors.

Until next time, have fun getting your Ansible on!