Observability

The Future Of AI, LLMs, And Observability On Google Cloud

tl;dr: Learn about the current and future states of AI, ML, and LLMs on Google Cloud. This guide distills the top 7 insights and actions from a fireside chat with Google’s Director of AI, Dr. Ali Arsanjani, and Datadog’s VP of Engineering, Sajid Mehmood. It covers everything from upskilling teams to observability best practices to help technical teams keep pace with the rapid advancements in AI.

featured in #560


How To Measure Design System At Scale

tl;dr: “Uber needs an observability system of similar scale for measuring design quality to prevent subpar user experiences, especially when it comes to adopting the existing UI libraries and accessibility best practices packaged under Uber’s design system, Base. Without such an observability system – let’s call it Design System Observability – it could be too late by the time Uber learns, through complaints and public media, that end users are suffering confusing onboarding rides, inconsistent layouts, and frustrating voiceover / talkback sessions.”

featured in #553


Best Practices for Setting Up an Observability Framework

tl;dr: This playbook outlines the specific steps an organization should take when setting up a sustainable observability framework from scratch. Establishing a repeatable observability framework prevents critical visibility gaps, improves availability, reduces operational costs, and streamlines incident management – among other key business outcomes.

featured in #548


Tracing: Structured Logging, But Better In Every Way

- Andy Sammalmaa tl;dr: “If you’re writing log statements, you’re doing it wrong. This is a pretty incendiary statement, and while there has been some good discussion after, I figured it was time to write down why I think logs are bad, why tracing should be used instead, and how we get from one to the other.”
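
A minimal sketch of the shift the post argues for, using OpenTelemetry’s Python API (the post itself isn’t tied to any particular library, and the function and attribute names here are illustrative): the logging version emits disconnected events that must be correlated by hand, while the tracing version captures the same work as a single span carrying a duration and structured attributes.

```python
import logging
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

logging.basicConfig(level=logging.INFO)

# Print finished spans to stdout so the example is self-contained.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("orders")


def process_order_with_logs(order_id: str) -> None:
    # Logging style: two disconnected events that a reader (or a query)
    # must correlate by hand to reconstruct what happened.
    logging.info("processing order %s", order_id)
    time.sleep(0.01)  # stand-in for real work
    logging.info("finished order %s", order_id)


def process_order_with_tracing(order_id: str) -> None:
    # Tracing style: one span captures the whole unit of work, with its
    # duration, attributes, and place in the wider request tree.
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        time.sleep(0.01)  # stand-in for real work


process_order_with_logs("o-123")
process_order_with_tracing("o-123")
```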

featured in #546


So We Shipped An AI Product. Did it Work?

- Phillip Carter tl;dr: “Like many companies, earlier this year we saw an opportunity with LLMs and quickly but thoughtfully started building a capability. About a month later, we released Query Assistant to all customers as an experimental feature. We then iterated on it, using data from production to inform a multitude of additional enhancements, and ultimately took Query Assistant out of experimentation and turned it into a core product offering. However, getting Query Assistant from concept to feature diverted R&D and marketing resources, forcing the question: did investing in LLMs do what we wanted it to do?”

featured in #454


Bottleneck: Resilience And Observability

- Punit Lad, Carl Nygard tl;dr: The authors delve into the intricacies of resilience and observability in rapidly scaling systems. As systems expand, their growing complexity makes failures more likely. Resilience isn’t about averting these failures but about managing them adeptly. Observability is pivotal for comprehending system behavior and rests on three foundational pillars: metrics, logs, and traces. The authors also highlight the challenges posed by observability’s sheer data volume and the role automation plays in managing it.
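
As a concrete, hypothetical illustration of the three pillars, the sketch below uses OpenTelemetry’s Python API plus the standard library to record the same request as a metric, a log, and a trace span. Exporter setup is omitted, so the metric and span calls run as no-ops; all names are assumptions, not taken from the article.

```python
import logging

from opentelemetry import metrics, trace

logging.basicConfig(level=logging.INFO)

tracer = trace.get_tracer("checkout")  # traces: causal, timed context
meter = metrics.get_meter("checkout")  # metrics: cheap aggregates over time
request_counter = meter.create_counter(
    "checkout.requests", description="Checkout requests handled"
)


def handle_request(user_id: str) -> None:
    # Trace: one span per unit of work, queryable by its attributes.
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("user.id", user_id)
        # Metric: a counter increment, aggregated cheaply at scale.
        request_counter.add(1, {"endpoint": "/checkout"})
        # Log: a discrete, human-readable event tied to this request.
        logging.info("handled checkout for user %s", user_id)


handle_request("u-42")
```

Metrics aggregate away per-event cost, while logs and traces grow with traffic; that asymmetry is one reason the data-volume challenge the authors raise usually lands on sampling and retention policy.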

featured in #442


Service Delivery Index: A Driver for Reliability

- Matthew McKeen, Ryan Katkov tl;dr: The article introduces the Service Delivery Index – Reliability (SDI-R), a metric designed to measure and drive service reliability at Slack. As the company grew, the need to move from a reactive to a proactive approach to reliability became evident. SDI-R, a composite metric of successful API calls, content delivery, and user workflows, provides a common understanding of reliability across the organization. It helps teams spot trends, identify regressions, and set customer expectations. The article details how SDI-R evolved, the tools and processes that support it, and the lessons learned.
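
The article doesn’t publish Slack’s exact formula, but a composite index of this shape can be sketched as a weighted average of per-signal success rates. Everything below (the signal names, weights, and numbers) is a hypothetical illustration of the structure, not SDI-R’s real definition.

```python
def sdi_r(success_rates: dict[str, float], weights: dict[str, float]) -> float:
    """Hypothetical composite reliability index: a weighted average of
    per-signal success rates, each expressed as a fraction in [0.0, 1.0]."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(weights[name] * success_rates[name] for name in weights)


# Illustrative inputs only; not Slack's real signals or weights.
rates = {"api_calls": 0.9995, "content_delivery": 0.9990, "user_workflows": 0.9970}
weights = {"api_calls": 0.4, "content_delivery": 0.3, "user_workflows": 0.3}

print(f"SDI-R: {sdi_r(rates, weights):.4%}")  # -> SDI-R: 99.8600%
```

Collapsing several signals into one trend line is what lets a composite like this surface slow regressions that no single signal would trip an alert on.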

featured in #440