Dan Slimmon on Monitoring

Car Alarms & Smoke Alarms & Monitoring
Dan Slimmon
@danslimmon
Senior Platform Engineer at Exosite
  • I work in Ops, so I wear a lot of hats
  • One of those is data scientist
    • Learn data analysis and visualization
    • You’ll be right more often and people will believe your right even more often than you are
  • A word problem
    • Plagiarism: 90% chance of positive
    • No Plagiarism: 20% chance of positive
    • Kids plagiarize 30% of the time
    • Given a random paper, what’s the probability that you’ll get a negative result?
      • 0.3*0.9 + 0.7*0.2 = 0.27+0.14=0.41
      • 59% likely to get negative result
    • If you get a positive result, how likely is it to really be plagiarized?
      • 65.8% likely
      • this is terrible.
      • Teachers will stop trusting the test.
  • Sensitivity & Specificity
    • Sensitivity: % of actual positives that are identified as such
    • Specificity: % of negative results that are indicated as such
    • Prevalence: percentage of people with problem
    • http://i.imgur.com/LkxcxLt.png
    • Positive Predictive Value: the probably that something is actually wrong.


  • Car Alarms
    • Go off all the time for reasons that aren’t someone stealing your car.
    • Most people ignore them.
  • Smoke Alarms
    • You get your ass outside and wait for the alarm to go off and the fire trucks.
  • We need monitoring tools that are both highly sensitive and highly specific.
  • Undetected outages are embarrassing, so we tend to focus on sensitivity.
    • That’s good.
    • But be careful with thresholds.
    • Too high, and you miss real problems. Too low, and too many false alarms.
    • There’s only one line with thresholds, so only one knob to adjust.
  • Get more degrees of freedom.
    • Hysteresis is a great way to add degrees of freedom. 
      • State machines
      • Time-series analysis
  • As your uptime increases, you must get more specific.
    • Going back to the chart…our positive predictive value goes down when there’s less actual problems.
  • A lot of nagios configs combine detecting problem with identifying what the problem is.
    • You need to separate those concerns.
    • Baron Schwartz says: Your alerting should tell you whether work is getting done.
    • Knowing that nginx is down doesn’t affect if your site is up. Check to see if you site is up (detecting problem), which is separate from source of problem (nginx isn’t running)
    • Alert on problems, bot on diagnosis.
Links:

No comments: