William Hertling's Thoughtstream: SXSW CloudCamp / DevOps Session Notes 2011

CloudCamp

#cloudcamp

Dyn
http://dyn.com

About

Managed DNS and email delivery provider
names to numbers: twitter.com -> 168.143....
4 million clients
200,000 zones
100,000 domains
17 global data centers
Meant for scalability, automation, redundancy

Why CloudCamp

rapidly scale
pay as you go
API automation
operational efficiency

Dynect Platform

Managed DNS
Global anycast network
active failover
load balancing
traffic management
cdn manager
ui and api access
SLA - five 9s

Load balance between your infrastructure and the cloud

split across amazon, rackspace, your servers

How it works: when a client comes in from Australia, don’t send them to a U.S. server; route them to the server that is closest via the network.
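The routing idea above can be sketched as picking the healthy endpoint with the lowest network latency for a client's region. The region names, endpoint names, and latency figures are made up for illustration; this is not a description of how the Dynect platform is implemented.

```python
# Hypothetical latency table (milliseconds) from a client region to each endpoint.
# All names and numbers are illustrative assumptions, not Dyn data.
LATENCY_MS = {
    "australia": {"us-east": 220, "eu-west": 300, "ap-southeast": 35},
    "germany": {"us-east": 95, "eu-west": 20, "ap-southeast": 280},
}


def closest_endpoint(client_region, healthy):
    """Return the healthy endpoint with the lowest latency for this region."""
    candidates = {
        endpoint: ms
        for endpoint, ms in LATENCY_MS[client_region].items()
        if endpoint in healthy
    }
    if not candidates:
        raise RuntimeError("no healthy endpoint available")
    # min() over the dict keys, ranked by their latency value
    return min(candidates, key=candidates.get)
```

Dropping an endpoint from the `healthy` set models active failover: the client is sent to the next-closest endpoint instead.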

Glen Campbell

Rackspace

@glenc

http://glen-campbell.com

glen dot campbell at rackspace dot com

#scalableapps

Was at Yahoo

During Hurricane Katrina, Yahoo traffic went up to 30,000 requests per second and stayed there for 6 days: 10x the traffic expected.
intermittent failovers from various places.
wheeled in 50 extra machines

took a day just to set up, even though the machines were already sitting there
then serving 40,000 requests per second.

Now with Rackspace, you could scale up within minutes: no need to build new machines

One of the goals is that you have no single point of failure.

You can’t have one network service provider, one dns provider, one database. You can never have one of anything.
For Cloud, you can’t rely on a single database server, single file service

Rackspace is part of the OpenStack platform

open source for cloud computing
Our goal is to get Rackspace running on the open source platform by the end of the year.

You can run the openstack.org platform yourself.

http://openstack.org

You want infrastructure that makes it easy to scale, easy to deploy, easy to add services.

Robert Phillips

Sendgrid

Sr. Director Marketing

Work with 20,000 companies
Send over a billion transactional emails per month
your app sends email
problems:

deliverability: inbox vs spam folder
analytics: did the customer receive, open it, click on it
platform: enhance email from open architecture

backend problems

isp rate limits
blacklists
dk/dkim
spf records
content inspection: ratio of images to text

“email isn’t fun”
customers: gowalla, plancast, foursquare
Integration

APIs: SMTP, Web API, SMTP relay
receive emails w/ parse api
pull data w/ event api
manage sub users
dedicated ip addresses
whitelabeling
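A minimal sketch of the SMTP relay style of integration, using only Python's standard library. The relay host, port, and credentials are placeholders the caller must supply; nothing here is taken from SendGrid's documentation.

```python
import smtplib
from email.message import EmailMessage


def build_message(sender, recipient, subject, body):
    """Build a plain-text transactional email."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_via_relay(msg, host, port, username, password):
    """Hand the message to an SMTP relay.

    host, port, username, and password are placeholders for whatever
    the chosen email provider issues; they are assumptions here.
    """
    with smtplib.SMTP(host, port) as smtp:
        smtp.starttls()  # upgrade the connection to TLS before logging in
        smtp.login(username, password)
        smtp.send_message(msg)
```

The app only builds the message and hands it off; deliverability concerns such as rate limits and blacklists become the relay provider's problem.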

Gene Kim

Lessons Learned Creating SuperTribe of Dev and Ops

Morris Worm 1988 - took down 10% of internet
Wrote Tripwire at Purdue in response
Tremendous passion for studying high performers

started as gene’s list of people with good kung fu
people who had the best security, best mean time between failure
codified these practices for Visible Ops Handbook

Not only in security, but also did work full time doing operations

and worked as developer
show me a developer who doesn’t cause problems for ops

Now more than ever, we need great IT operations

Organizations are held up by the ability to get features released, get things through operations.
But it’s not just an operations problem, it’s a development problem and a business problem.

Johnny Diggz

Chief Evangelist for Tropo

Geeks Without Bounds: gwob.org - volunteer

SignalKit: Notifier

Put livechat link on their website
Using 6 lines of code and Tropo service, was able to monitor group chat to send IM
A carpenter doesn’t weld his own hammer or cast his own screwdriver. He uses an off the shelf tool.
Why would a programmer write their own email functionality or notify functionality?

github.com/aaronpk

See tropo demo that sends map position information through IM to a particular web browser

DEVOPS

John Willis

@botchagalupe

Historical culture of dev and ops fighting
The cloud is forcing us to have to get along
You Better Care: It’s about your business
Devops is Velocity

The velocity of innovation
How do we compete today?

By Scale: Scale of users, scale of data, scale of compute power. Businesses can compete on scale.
By Velocity of Innovation: How fast can you react to and execute on new market forces or opportunities.

Doing 30 deploys a day.
It’s about getting ideas to customers really, really fast.
How fast can you go from “Ah ha!” idea moment to the “Ka-ching!” cash moment?

It’s the application lifecycle that separates the two: dev, ops, and test all contribute to the delay.

Once we move to software as a service, everything we thought we knew about competitive advantage has to be rethought. - Tim O’Reilly: Operations: The New Secret Sauce

Operations is no longer just the girl friday you bring in once a week to do a deploy.
It’s critical to the business, to the ability to innovate, deliver value to customer.
If you have crap operations, you might as well shut down your business.

New Face of a Rock Star: John Allspaw - VP Technical Operations at Etsy.

One of the first things you have to do is find one of these guys.

So What’s Your Culture Dog?

“We will encourage you to develop the three great virtues of a programmer: laziness, impatience, and hubris.” - Larry Wall.
We became lazy for the wrong reasons: We should have been lazy so we develop automation, so we don’t have to repeat tasks over and over.
Leadership?

The good guys are going to leave your company if you aren’t showing leadership

Wall of Confusion: Dev vs Ops

Break Down Walls

Force a breakdown in the wall:

Take an Ops guy, and put them in Dev
Take a Dev guy and put them in Ops

Respect each other
Enemies are outside the wall: Work together to beat the bad guys outside the wall, don’t fight each other.

It’s really clear for a small startup
It’s harder for a big company

Fearless Culture

Failure is the New Black: We need to embrace failure. You shouldn’t get fired when you break something.
At Amazon, they’d have a game day where they’d try to take down a data center. It’s not tested until we have a failure in production.

Sense of Urgency

Gotta build a sense of urgency - to be purpose driven.
“We gotta do this, or we’re going to go out of business.”
The Etsy guys, they have fun going to work.

Partners

We are partners in getting things done.
When it becomes a question of survival, you had to become partners or die.

The Smell Test

From Chris Reid
If you are in an organization, and you don’t know what that guy does, there’s been a failure.

Shaman in your organization

These are the guys who know why that flag is passed on that command line.
They are the communicators between the people: they are enablers. Everyone comes to find out what is going on.

Passion:

When you go to work, are you a guy who presses keys on a keyboard, or are you the greatest programmer that ever lived?

Gene Kim

Co-Founder Tripwire, Author Visible Ops and Visible Security

@RealGeneKim

genek at realgenekim dot me

Universal pattern that shows up when your organization needs Devops
Three sets of patterns you can do inside dev and ops
Benchmarked 1,300 organizations - to link controls and performance

High performance organizations exist and they are 4-5 times more productive than ordinary organizations
High performers find and fix security breaches fast
Unplanned work comes at the expense of planned work

Vicious Downward Spiral

Ops Sees:

Way too many fragile applications, prone to failure
It takes too long to find out which bit got flipped
The problem is detected by a salesperson or customer
Too much time required to restore service
Too much time spent firefighting and on unplanned work
Planned project work cannot complete
Frustrated customers leave
Market share goes down
Business misses Wall St commitments
Business makes even larger promises to Wall St

Dev Sees

More urgent, date-driven projects put into the queue
Even more fragile code put into production
More releases have turbulent installs
Release cycles lengthen to amortize cost of deployments
Failing bigger deployments are even more difficult to diagnose
Most senior constrained IT ops resources have less time to fix underlying process problems
Ever-increasing backlog of infrastructure projects that could fix the underlying problems

Operations inside the Dev/Ops Super-Tribe

Increase flow from Dev to Production

Increase throughput
Decrease WIP
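One way to see why throughput and WIP are paired is Little's Law, a standard queueing result (not cited in the talk notes): average cycle time equals work in progress divided by throughput. A small sketch, with made-up numbers:

```python
def average_cycle_time(wip, throughput_per_day):
    """Little's Law: average days an item spends in the system.

    wip is the number of items in progress; throughput_per_day is the
    number of items completed per day. Both are illustrative inputs.
    """
    if throughput_per_day <= 0:
        raise ValueError("throughput must be positive")
    return wip / throughput_per_day


# 40 changes in flight at 4 completed per day: 10 days per change on average.
print(average_cycle_time(40, 4))
# Halving WIP at the same throughput halves the wait.
print(average_cycle_time(20, 4))
```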

Goal to create system of operations that allows...

Zone #1: Decrease cycle time of releases

Create determinism in the release process
Move packaging responsibility to development
Release early and often
Decrease release cycle time

Reduce deployment time from 6 hours to 45 minutes
Refactor deployment process that had 1300 steps spanning 4 weeks

Never fix forward, instead “roll back”, escalating any deviation from plan to Dev
Verify for all handoffs (e.g. correctness, accuracy, timeliness, etc.)
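The "never fix forward" rule above can be sketched as a deploy step that verifies the release and, on any deviation, restores the previous version and escalates rather than patching in place. The function and parameter names are hypothetical; `apply` and `verify` stand in for whatever deployment and check tooling is actually in use.

```python
def deploy(release, previous, apply, verify):
    """Apply a release; on failed verification, roll back and escalate.

    apply(version) installs a version; verify() returns True when the
    running system matches the plan. Both are supplied by the caller.
    """
    apply(release)
    if verify():
        return release
    # Deviation from plan: restore the last known-good version,
    # then escalate to Dev instead of fixing forward in production.
    apply(previous)
    raise RuntimeError(
        f"release {release} failed verification; rolled back to {previous}"
    )
```

Raising after the rollback keeps the failure visible, so the deviation goes back to Dev with the system already in a known state.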

Zone #2: Increase production rigor

Define what work is and where work can come from
Protect the integrity of the work queue: e.g. infrastructure/process improvements
To preserve and increase throughput, elevate preventive projects and maintenance tasks
Document all work, changes and outcomes so that it is repeatable

Contact Gene for slides/resources.

Visible Ops
Visible Security Ops
Lean IT
Web Operations

Ernest Mueller

National Instruments

How we implemented DevOps

30 years old, 5000+ employees, mostly engineers
Robots and stuff: scientific data acquisitions

Before

Traditional siloed IT department

programmers split by business unit
infrastructure split by technology

Large complex Web site with dedicated operations team

100 programmers
6 ops guys (doing support, release, systems engineering, security, performance management)

Low agility: 6 weeks to get a server
Uptime problems with complexity and silos
Grand vision: “Don’t spend a lot of money please”

The Tipping Point

NI decided it was time to make some SaaS products
Some existing product to web integration points, but uncoordinated and poorly maintained
R&D realized they didn’t have web knowledge, started up a new team

Blessing and Curse

Everything was new, so we simultaneously developed:

Team, Process, Systems, Code, Providers, System Automation

(existing processes oriented around annual software products, not frequent web releases)

The Team

We built up our team to fit our role of internal ISV

Application architect
System architect
Operations lead
2 developers
1 automation developer
2 follow-the-sun operations staff

Work with other product developer teams

The Process

Agile
All systems work used the “developer” tools and systems

Revision control: Perforce
Bug Tracking: HP
Specs and reviews: Atlassian Confluence Wiki
Task tracking and burndown: JIRA/Greenhopper

All members collaborate on all aspects of the product
This was the key to making it work - using all the same tools. We could prioritize better, because it was all in one system.
There can be a fear that the systems tasks would always get pushed out

Seemed to be a mistaken impression
But when presented alongside requirements, decision makers seemed to understand the need for systems work.

The Systems

Cloud!
Decided on Amazon EC2
Needed control and agility we wouldn’t be able to get internally: dynamic requirements, fast scaling
Needed Linux and Windows both for software
Currently taking on Microsoft Azure as well

System Automation

We built our own: PIE, the “Programmable Infrastructure Environment”

Looked at Chef/Puppet, and others. What we needed wasn’t quite any of those.

XML System model defines systems, services, code installs, runtime interaction, variable substitutions
PIE autobuilds the system from the model: provisioning, software installs, monitoring integration
Zookeeper as a runtime registry for systems info and eventing
Allows us to start/stop/control/install/autoscale on bunches of dynamic environments
We have our dev environment, test environment, production environment - multiplied across our many products. We could deploy a new environment in a couple of hours.
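To make the model-driven approach concrete, here is a minimal sketch of an XML system model with variable substitution, turned into a list of instances to provision. The element names, attributes, and service names are invented for illustration; PIE's actual schema is not given in these notes.

```python
import xml.etree.ElementTree as ET
from string import Template

# Hypothetical model: the schema is an assumption, not PIE's real format.
MODEL = """
<environment name="${env}">
  <system name="web" count="2">
    <service name="tomcat" port="${web_port}"/>
  </system>
  <system name="registry" count="1">
    <service name="zookeeper" port="2181"/>
  </system>
</environment>
"""


def build_plan(model_xml, variables):
    """Substitute variables, parse the model, and list instances to provision."""
    root = ET.fromstring(Template(model_xml).substitute(variables))
    plan = []
    for system in root.findall("system"):
        services = [
            (svc.get("name"), int(svc.get("port")))
            for svc in system.findall("service")
        ]
        # One plan entry per requested instance of this system
        for index in range(int(system.get("count"))):
            plan.append({
                "env": root.get("name"),
                "host": f"{system.get('name')}-{index}",
                "services": services,
            })
    return plan
```

Calling `build_plan(MODEL, {"env": "test", "web_port": "8080"})` and then again with `"env": "prod"` yields the same topology for two environments, which is the point of building from a model instead of by hand.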

Code

All REST-based web services
Cloud and PIE code mostly in Java, product code mostly in C#/.NET
The developer must create and deliver the PIE XML code that will build, deploy, monitor their own code. They must deliver the XML along with their code. The developer is the only person who knows what their system does.

Providers

CloudKick - monitoring
PagerDuty - paging
DNSmadeEasy
Postmark - email

Results

Win!
Continuous pipeline of products delivered quickly
LabView Web UI Builder (http://ni.com/uibuilder) in release
FPGA Compile Cloud in beta
One big one in the pipeline and others knocking on our door
Using cloud, automation, and collaboration through devops, we’ve been able to deliver the apps quickly and continuously.

Vastly less time than it used to take with the traditional web organization.

Challenges

overcoming the thought that it was impossible

can i do infrastructure tasks using agile? with sprints? by actually trying it, it turns out to be possible.
write our own software provisioning sounds hard, maybe we can’t do it. but when you try it, you can.

building trust between dev and ops

working together in one team and using the same tools really helps. you get transparency. you can’t build the sense of trust when you don’t know what each other are working on.

various customer dev teams, some globally distributed
explaining core web performance/availability/management needs to desktop developers.

Needed to write “why you should log” paper to explain core concepts

maintaining vision through rapid change
figuring out how to apply dev concepts to systems: what does it mean to have unit tests for systems?

Where to Go Next?

improve testing -> monitoring
monitoring = lightweight, repeated integration test in production
culture change is the single most important thing. culture is driven by the demands placed on people: if the demand is “don’t spend a lot of money”, then you get a culture that results from that. if you have them own a product, they operate at a higher level.
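The "monitoring = lightweight, repeated integration test in production" point above can be sketched by reusing one check function both as a test and as a monitor loop. The `fetch` callable, the health URL, and the "ok" body convention are assumptions for illustration.

```python
def check_service(fetch, url):
    """Integration-style check: fetch(url) must return (status, body)."""
    status, body = fetch(url)
    return status == 200 and "ok" in body


def monitor(fetch, url, rounds, alert):
    """Run the same check repeatedly; alert on each failure.

    Returns the number of failed rounds. In production the loop would
    sleep between rounds and page someone via the alert callback.
    """
    failures = 0
    for _ in range(rounds):
        if not check_service(fetch, url):
            failures += 1
            alert(url)
    return failures
```

In a test suite, `check_service` is called once against a test environment; in production, `monitor` runs the identical check on a schedule.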

Rugged DevOps

James Wickett

@wickett

http://theagileadmin.com

You want people to build Rugged software because they desire the benefits of it, not just because they are scared of auditors.
Am I Secure?

latest and greatest vulnerabilities
Justification of

What do you think of security people?

paranoid, jaded,

It’s an us vs. them mentality:

dev vs ops, ops vs security, dev vs security
security professionals often degrade developers
there is interest across the aisle, but often ruined by negative language

As bad as the ratio between ops and developers is (many dev, few ops), it’s even worse for security: 1 security for 1,000 dev
“How Complex Systems Fail”: google it, great paper
Rugged Software Manifesto

Rugged characteristics:

Availability
Longevity
Scalable, Portable
Maintainable and Defensible

Rugged offers affirming values, rather than the Fear/Uncertainty/Doubt of Security

You can sell Rugged as a Feature

Using Rugged product labels

Simple understanding of rugged in various characteristics
Custom lines of code by category
Libraries used and their ruggedness

Rugged is Implicit: Customers expect that their money won’t be stolen, their password won’t be intercepted, etc.
To achieve:

People

Sit near the developers: DevOpsSec
Track Security flaws or bugs in the same bug tracking system
Security guys that are so outnumbered have to make broad statements like “don’t use PHP”, because they don’t have the time/bandwidth to have a full conversation.

Recommended

Visible Ops Security
Web Operations
Beautiful Security
