SXSW CloudCamp / DevOps Session Notes 2011


CloudCamp
#cloudcamp



  • About
    • Managed DNS and email delivery provider
    • names to numbers: twitter.com -> 168.143....
    • 4 million clients
    • 200,000 zones
    • 100,000 domains
    • 17 global data centers
    • Meant for scalability, automation, redundancy
  • Why CloudCamp
    • rapidly scale
    • pay as you go
    • API automation
    • operational efficiency
  • Dynect Platform
    • Managed DNS
    • Global anycast network
    • active failover
    • load balancing
    • traffic management
    • cdn manager
    • ui and api access
    • SLA - five 9s
  • Load balance between your infrastructure and the cloud
    • split across amazon, rackspace, your servers
  • How it works: when a client comes in from Australia, don’t send them to a U.S. server; route them to the server that is closest via the network.
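A minimal sketch of the routing decision described above, assuming we already have a per-client latency measurement for each site. The site names, latency numbers, and the `pick_server` helper are hypothetical; a real anycast network makes this choice at the network layer rather than in application code, so treat this only as an illustration of "closest via the network" plus failover.

```python
from typing import Dict, Optional, Set


def pick_server(latency_ms: Dict[str, float],
                healthy: Optional[Set[str]] = None) -> str:
    """Return the lowest-latency site, skipping sites not marked healthy."""
    candidates = {
        name: ms for name, ms in latency_ms.items()
        if healthy is None or name in healthy
    }
    if not candidates:
        raise ValueError("no healthy servers available")
    return min(candidates, key=candidates.get)


# Hypothetical measurements for a client in Australia.
sydney_client = {"us-east": 210.0, "eu-west": 290.0, "ap-southeast": 18.0}

print(pick_server(sydney_client))                                  # ap-southeast
# Failover: if the nearest site is down, fall back to the next closest.
print(pick_server(sydney_client, healthy={"us-east", "eu-west"}))  # us-east
```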
Glen Campbell
Rackspace
@glenc 
glen dot campbell at rackspace dot com
#scalableapps
  • Was at Yahoo
    • During Hurricane Katrina, Yahoo traffic went up to 30,000 requests per second and stayed there for 6 days: 10x the traffic expected.
    • intermittent failovers from various places.
    • wheeled in 50 extra machines
      • took a day just to set up, even though the machines were already sitting there
      • then serving 40,000 requests per second.
    • Now with Rackspace, you could scale up within minutes: no need to build new machines
  • One of the goals is that you have no single point of failure.
    • You can’t have one network service provider, one dns provider, one database. You can never have one of anything.
    • For Cloud, you can’t rely on a single database server, single file service
  • Rackspace is part of the OpenStack platform
    • open source for cloud computing
    • Our goal is to get Rackspace running on the open source platform by the end of the year.
  • You can run the openstack.org platform yourself.
  • You want infrastructure that makes it easy to scale, easy to deploy, easy to add services.
Robert Phillips
Sendgrid
Sr. Director Marketing
  • Work with 20,000 companies
  • Send over a billion transactional emails per month
  • your app sends email
  • problems:
    • deliverability: inbox vs spam folder
    • analytics: did the customer receive, open it, click on it
    • platform: enhance email from open architecture
  • backend problems
    • isp rate limits
    • blacklists
    • dk/dkim
    • spf records
    • content inspection: ratio of images to text
  • “email isn’t fun”
  • customers: gowalla, plancast, foursquare
  • Integration
    • APIs: SMTP, Web API, SMTP relay
    • receive emails w/ parse api
    • pull data w/ event api
    • manage sub users
    • dedicated ip addresses
    • whitelabeling
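To make the "spf records" item under backend problems more concrete, here is a rough sketch of reading an SPF TXT record. The record text and the `parse_spf` helper are made up for illustration. It only splits the record into qualifier, mechanism, and value, skips modifiers such as `redirect=`, and does not evaluate the record against a sending IP.

```python
from typing import List, NamedTuple

# SPF qualifiers and what they ask the receiver to do; "+" is the default.
QUALIFIERS = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}


class Term(NamedTuple):
    qualifier: str  # pass / fail / softfail / neutral
    mechanism: str  # e.g. "ip4", "include", "all"
    value: str      # text after ":" (empty if none)


def parse_spf(txt: str) -> List[Term]:
    """Split an SPF TXT record into its mechanisms (modifiers are skipped)."""
    parts = txt.split()
    if not parts or parts[0].lower() != "v=spf1":
        raise ValueError("not an SPF record")
    terms = []
    for part in parts[1:]:
        if "=" in part:  # modifier such as redirect=..., ignored in this sketch
            continue
        qualifier = "pass"
        if part[0] in QUALIFIERS:
            qualifier = QUALIFIERS[part[0]]
            part = part[1:]
        mechanism, _, value = part.partition(":")
        terms.append(Term(qualifier, mechanism, value))
    return terms


# Hypothetical record for a domain that sends through a relay provider.
record = "v=spf1 ip4:192.0.2.0/24 include:_spf.example.com ~all"
for term in parse_spf(record):
    print(term)
```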
Gene Kim
Lessons Learned Creating SuperTribe of Dev and Ops
  • Morris Worm 1988 - took down 10% of internet
  • Wrote Tripwire at Purdue in response
  • Tremendous passion for studying high performers
    • started as Gene’s list of people with good kung fu
    • people who had the best security, best mean time between failures
    • codified these practices for Visible Ops Handbook
  • Worked not only in security, but also full time in operations
    • and worked as developer
    • show me a developer who doesn’t cause problems for ops
  • Now more than ever, we need great IT operations
    • Organizations are held up by the ability to get features released, get things through operations.
    • But it’s not just an operations problem, it’s a development problem and a business problem.
Johnny Diggz
Chief Evangelist for Tropo
Geeks Without Bounds: gwob.org - volunteer
SignalKit: Notifier
  • Put livechat link on their website
  • Using 6 lines of code and Tropo service, was able to monitor group chat to send IM
  • A carpenter doesn’t weld his own hammer or cast his own screwdriver. He uses an off the shelf tool.
  • Why would a programmer write their own email functionality or notify functionality?
github.com/aaronpk
  • See tropo demo that sends map position information through IM to a particular web browser
DEVOPS
John Willis
@botchagalupe
  • Historical culture of dev and ops fighting
  • The cloud is forcing us to have to get along
  • You Better Care: It’s about your business
  • Devops is Velocity
    • The velocity of innovation
    • How do we compete today?
      • By Scale: Scale of users, scale of data, scale of compute power. Businesses can compete on scale.
      • By Velocity of Innovation: How fast can you react to and execute on new market forces or opportunities.
        • Doing 30 deploys a day.
        • It’s about getting ideas to customers really, really fast.
        • How fast can you go from “Ah ha!” idea moment to the “Ka-ching!” cash moment?
          • It’s the application lifecycle that separates the two: dev, ops, and test all contribute to the delay.
    • Once we move to software as a service, everything we thought we knew about competitive advantage has to be rethought. - Tim O’Reilly: Operations: The New Secret Sauce
      • Operations is no longer just the girl friday you bring in once a week to do a deploy.
      • It’s critical to the business, to the ability to innovate, deliver value to customer.
      • If you have crap operations, you might as well shut down your business.
  • New Face of a Rock Star: John Allspaw - VP Technical Operations at Etsy.
    • One of the first things you have to do is find one of these guys. 
  • So What’s Your Culture Dog?
    • “We will encourage you to develop the three great virtues of a programmer: laziness, impatience, and hubris.” - Larry Wall
    • We became lazy for the wrong reasons: We should have been lazy so we develop automation, so we don’t have to repeat tasks over and over.
    • Leadership?
      • The good guys are going to leave your company if you aren’t showing leadership
    • Wall of Confusion: Dev vs Ops
  • Break Down Walls
    • Force a breakdown in the wall:
      • Take an Ops guy, and put them in Dev
      • Take a Dev guy and put them in Ops
    • Respect each other
    • Enemies are outside the wall: Work together to beat the bad guys outside the wall, don’t fight each other.
      • It’s really clear for a small startup 
      • It’s harder for a big company
    • Fearless Culture
      • Failure is the New Black: We need to embrace failure. You shouldn’t get fired when you break something. 
      • At Amazon, they’d have a game day where they’d try to take down a data center. It’s not tested until we have a failure in production.
    • Sense of Urgency
      • Gotta build a sense of urgency - to be purpose driven.
      • “We gotta do this, or we’re going to go out of business.”
      • The Etsy guys, they have fun going to work. 
    • Partners
      • We are partners in getting things done.
      • When it becomes a question of survival, you had to become partners or die.
    • The Smell Test
      • From Chris Reid
      • If you are in an organization, and you don’t know what that guy does, there’s been a failure.
    • Shaman in your organization
      • There are the guys who know why that flag is passed on that command line.
      • They are the communicators between the people: they are enablers. Everyone comes to find out what is going on.
    • Passion:
      • When you go to work, are you a guy who presses keys on a keyboard, or are you the greatest programmer that ever lived?
Gene Kim
Co-Founder Tripwire, Author Visible Ops and Visible Security
@RealGeneKim
genek at realgenekim dot me
  • Universal pattern that shows up when your organization needs Devops
  • Three sets of patterns you can do inside dev and ops 
  • Benchmarked 1,300 organizations - to link controls and performance
    • High performance organizations exist and they are 4-5 times more productive than ordinary organizations
    • High performers find and fix security breaches fast
    • Unplanned work comes at the expense of planned work
  • Vicious Downward Spiral
    • Ops Sees:
      • Way too many fragile applications, prone to failure
      • It takes too long to find out which bit got flipped
      • The problem is detected by a salesperson or customer
      • Too much time required to restore service
      • Too much time spent on firefighting and unplanned work
      • Planned project work cannot complete
      • Frustrated customers leave
      • Market share goes down
      • Business misses Wall St commitments
      • Business makes even larger promises to Wall St
    • Dev Sees
      • More urgent, date-driven projects put into the queue
      • Even more fragile code put into production
      • More releases have turbulent installs
      • Release cycles lengthen to amortize cost of deployments
      • Failing bigger deployments are even more difficult to diagnose
      • Most senior constrained IT ops resources have less time to fix underlying process problems
      • Ever-increasing backlog of infrastructure projects that could fix the underlying problems
  • Operations inside the Dev/Ops Super-Tribe
    • Increase flow from Dev to Production
      • Increase throughput
      • Decrease WIP
    • Goal to create system of operations that allows...
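One way to see why "increase throughput" and "decrease WIP" both speed up flow is Little's Law (average cycle time = WIP / throughput). The notes do not name this law, so it is brought in here as an assumption, and the numbers below are invented purely for illustration.

```python
def cycle_time(wip: float, throughput: float) -> float:
    """Little's Law: average cycle time = work in progress / throughput."""
    if throughput <= 0:
        raise ValueError("throughput must be positive")
    return wip / throughput


# Hypothetical team: 20 changes in flight, 5 changes reach production per week.
print(cycle_time(wip=20, throughput=5))   # 4.0 weeks from Dev to Production
# Halving WIP at the same throughput halves the wait.
print(cycle_time(wip=10, throughput=5))   # 2.0
# Doubling throughput at the same WIP has the same effect.
print(cycle_time(wip=20, throughput=10))  # 2.0
```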
  • Zone #1: Decrease cycle time of releases
    • Create determinism in the release process
    • Move packaging responsibility to development
    • Release early and often
    • Decrease release cycle time
      • Reduce deployment time from 6 hours to 45 minutes
      • Refactor deployment process that had 1300 steps spanning 4 weeks
    • Never fix forward, instead “roll back”, escalating any deviation from plan to Dev
    • Verify for all handoffs (e.g. correctness, accuracy, timeliness, etc.)
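The "never fix forward, roll back" rule in Zone #1 could be sketched roughly as follows. This is not from the talk: the `deploy` function, the step tuples, and the log messages are hypothetical. Each step carries its own undo action. On the first deviation, the completed steps are undone in reverse order and the problem is escalated instead of patched in place.

```python
from typing import Callable, List, Tuple

# A step is (name, apply, undo); all three are supplied by the caller.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def deploy(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Apply steps in order; on any failure, roll back and escalate."""
    log: List[str] = []
    done: List[Tuple[str, Callable[[], None]]] = []
    for name, apply, undo in steps:
        try:
            apply()
        except Exception as exc:
            log.append(f"failed {name}: {exc}")
            # Undo completed steps, most recent first, rather than fixing forward.
            for prev_name, prev_undo in reversed(done):
                prev_undo()
                log.append(f"rolled back {prev_name}")
            log.append("escalated to Dev")
            return False, log
        log.append(f"applied {name}")
        done.append((name, undo))
    return True, log
```

A deterministic release process makes this practical: if every step is scripted and reversible, rolling back is cheaper than debugging a half-applied release in production.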
  • Zone #2: Increase production rigor
    • Define what work is and where work can come from
    • Protect the integrity of the work queue: e.g. infrastructure/process improvements
    • To preserve and increase throughput, elevate preventive projects and maintenance tasks
    • Document all work, changes and outcomes so that it is repeatable
  • Contact Gene for slides/resources.
    • Visible Ops
    • Visible Ops Security
    • Lean IT
    • Web Operations
Ernest Mueller
National Instruments
How we implemented DevOps
  • NI
    • 30 years old, 5000+ employees, mostly engineers
    • Robots and stuff: scientific data acquisition
  • Before
    • Traditional siloed IT department
      • programmers split by business unit
      • infrastructure split by technology
    • Large complex Web site with dedicated operations team
      • 100 programmers
      • 6 ops guys (doing support, release, systems engineering, security, performance management)
    • Low agility: 6 weeks to get a server
    • Uptime problems with complexity and silos
    • Grand vision: “Don’t spend a lot of money please”
  • The Tipping Point
    • NI decided it was time to make some SaaS products
    • Some existing product to web integration points, but uncoordinated and poorly maintained
    • R&D realized they didn’t have web knowledge, started up a new team
  • Blessing and Curse
    • Everything was new, so we simultaneously developed:
      • Team, Process, Systems, Code, Providers, System Automation
    • (existing processes oriented around annual software products, not frequent web releases)
  • The Team
    • We built up our team to fit our role of internal ISV
      • Application architect
      • System architect
      • Operations lead
      • 2 developers
      • 1 automation developer
      • 2 follow-the-sun operations staff
    • Work with other product developer teams
  • The Process
    • Agile
    • All systems work used the “developer” tools and systems
      • Revision control: Perforce
      • Bug Tracking: HP
      • Specs and reviews: Atlassian Confluence Wiki
      • Task tracking and burndown: JIRA/Greenhopper
    • All members collaborate on all aspects of the product
    • This was the key to making it work - using all the same tools. We could prioritize better, because it was all in one system.
    • There can be a fear that the systems tasks would always get pushed out
      • This seemed to be a mistaken impression
      • But when presented alongside requirements, decision makers seemed to understand the need for systems work.
  • The Systems
    • Cloud!
    • Decided on Amazon EC2
    • Needed control and agility we wouldn’t be able to get internally: dynamic requirements, fast scaling
    • Needed Linux and Windows both for software
    • Currently taking on Microsoft Azure as well
  • System Automation
    • We built our own: PIE, the “Programmable Infrastructure Environment”
      • Looked at Chef/Puppet, and others. What we needed wasn’t quite any of those.
    • XML System model defines systems, services, code installs, runtime interaction, variable substitutions
    • PIE autobuilds the system from the model: provisioning, software installs, monitoring integration
    • Zookeeper as a runtime registry for systems info and eventing
    • Allows us to start/stop/control/install/autoscale on bunches of dynamic environments
    • We have our dev environment, test environment, production environment - multiplied across our many products. We could deploy a new environment in a couple of hours.
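PIE itself is not shown in these notes, so the following is only a guess at the general shape of "an XML model that defines systems, services, code installs, and variable substitutions, from which the environment is built". The element names, attributes, and the `build_plan` function are all hypothetical, and the sketch only produces a list of planned actions rather than provisioning anything.

```python
import xml.etree.ElementTree as ET
from string import Template
from typing import Dict, List

# Hypothetical model; this is NOT the real PIE schema.
MODEL = """
<environment name="test">
  <variable name="region" value="us-east-1"/>
  <system name="web" count="2">
    <service name="nginx"/>
    <install package="uibuilder-${version}"/>
  </system>
</environment>
"""


def build_plan(model_xml: str, overrides: Dict[str, str]) -> List[str]:
    """Turn the XML model into an ordered list of provisioning actions."""
    root = ET.fromstring(model_xml.strip())
    variables = {v.get("name"): v.get("value") for v in root.findall("variable")}
    variables.update(overrides)
    plan = []
    for system in root.findall("system"):
        for i in range(int(system.get("count", "1"))):
            host = f"{root.get('name')}-{system.get('name')}-{i + 1}"
            plan.append(f"provision {host} in {variables['region']}")
            for install in system.findall("install"):
                # Variable substitution: ${version} is filled in at build time.
                package = Template(install.get("package")).substitute(variables)
                plan.append(f"install {package} on {host}")
            for service in system.findall("service"):
                plan.append(f"start {service.get('name')} on {host}")
    return plan


for action in build_plan(MODEL, {"version": "1.0"}):
    print(action)
```

Because the model is plain data, the same file can stamp out dev, test, and production environments by changing only the variable overrides.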
  • Code
    • All REST-based web services
    • Cloud and PIE code mostly in Java, product code mostly in C#/.NET
    • The developer must create and deliver the PIE XML code that will build, deploy, and monitor their own code. They must deliver the XML along with their code. The developer is the only person who knows what their system does.
  • Providers
    • CloudKick - monitoring
    • PagerDuty - paging
    • DNSmadeEasy
    • Postmark - email
  • Results
    • Win!
    • Continuous pipeline of products delivered quickly
    • LabVIEW Web UI Builder (http://ni.com/uibuilder) in release
    • FPGA Compile Cloud in beta
    • One big one in the pipeline and others knocking on our door
    • Using cloud, automation, and collaboration through devops, we’ve been able to deliver the apps quickly and continuously. 
      • Vastly less time than it used to take with the traditional web organization.
  • Challenges
    • overcoming the thought that it was impossible
      • can I do infrastructure tasks using agile? with sprints? by actually trying it, it turns out to be possible.
      • writing our own software provisioning sounds hard, maybe we can’t do it. but when you try it, you can.
    • building trust between dev and ops
      • working together in one team and using the same tools really helps. you get transparency. you can’t build a sense of trust when you don’t know what each other is working on.
    • various customer dev teams, some globally distributed
    • explaining core web performance/availability/management needs to desktop developers.
      • Needed to write “why you should log” paper to explain core concepts
    • maintaining vision through rapid change
    • figuring out how to apply dev concepts to systems: what does it mean to have unit tests for systems?
  • Where to Go Next?
    • improve testing -> monitoring
    • monitoring = lightweight, repeated integration test in production
    • culture change is the single most important thing. culture is driven by the demands placed on people: if the demand is “don’t spend a lot of money”, then you get a culture that results from that. if you have them own a product, they operate at a higher level.
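The idea that monitoring is a lightweight, repeated integration test in production can be sketched as one set of checks used two ways: run once as a test before release, then run on a loop in production with an alert hook. The names `run_checks` and `monitor`, and the check functions passed in, are hypothetical and not from the talk.

```python
import time
from typing import Callable, Dict, List

Check = Callable[[], bool]


def run_checks(checks: Dict[str, Check]) -> List[str]:
    """Run every check once; return the names of the checks that failed."""
    failures = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a crashing check counts as a failure
        if not ok:
            failures.append(name)
    return failures


def monitor(checks: Dict[str, Check],
            rounds: int,
            alert: Callable[[List[str]], None],
            interval_s: float = 0.0) -> None:
    """Repeat the same checks in production and alert on any failure."""
    for _ in range(rounds):
        failed = run_checks(checks)
        if failed:
            alert(failed)
        time.sleep(interval_s)
```

In a test suite you would assert that `run_checks(checks)` returns an empty list. In production, `alert` would be wired to whatever paging provider is in use.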
Rugged DevOps
James Wickett
@wickett
  • You want people to build Rugged software because they desire the benefits of it, not just because they are scared of auditors.
  • Am I Secure?
    • latest and greatest vulnerabilities
    • Justification of 
  • What do you think of security people?
    • paranoid, jaded, 
  • It’s an us vs. them mentality:
    • dev vs ops, ops vs security, dev vs security
    • security professionals often degrade developers
    • there is interest across the aisle, but it is often ruined by negative language
  • As bad as the ratio between ops and developers is (many dev, few ops), it’s even worse for security: 1 security for 1,000 dev
  • “How Complex Systems Fail”: google it, great paper
  • Rugged Software Manifesto
  • Rugged characteristics:
    • Availability
    • Longevity
    • Scalable, Portable
    • Maintainable and Defensible
  • Rugged offers affirming values, rather than the Fear/Uncertainty/Doubt of Security
    • You can sell Rugged as a Feature
  • Using Rugged product labels
    • Simple understanding of rugged in various characteristics
    • Custom lines of code by category
    • Libraries used and their ruggedness
  • Rugged is Implicit: Customers expect that their money won’t be stolen, their password won’t be intercepted, etc.
  • To achieve:
    • People
      • Sit near the developers: DevOpsSec
      • Track Security flaws or bugs in the same bug tracking system
      • Security guys that are so outnumbered have to make broad statements like “don’t use PHP”, because they don’t have the time/bandwidth to have a full conversation.
  • Recommended
    • Visible Ops Security
    • Web Operations
    • Beautiful Security