/ML

Classifying All Of The PDFs On The Internet

- Santiago Pedroza tl;dr: “I classified the entirety of SafeDocs using a mixture of LLMs, Embeddings Models, XGBoost and just for fun some LinearRegressors. In the process I too created some really pretty graphs!”

featured in #545


Machine Unlearning In 2024

- Ken Liu tl;dr: “As our ML models today become larger and their (pre-)training sets grow to inscrutable sizes, people are increasingly interested in the concept of machine unlearning to edit away undesired things like private data, stale knowledge, copyrighted materials, toxic / unsafe content, dangerous capabilities, and misinformation, without retraining models from scratch.” Ken provides us with an introduction. 

featured in #515


Building A Weather Data Warehouse Part I: Loading A Trillion Rows Of Weather Data Into TimescaleDB

- Ali Ramadhan tl;dr: “I think it would be cool to have historical weather data from around the world to analyze for signals of climate change we’ve already had rather than think about potential future change.” Ali discusses the implementation of this analysis tool. 

featured in #510


Personalizing The DoorDash Retail Store Page Experience

tl;dr: "In this post, we show how we built a personalized shopping experience for our new business vertical stores, which include grocery, convenience, pets, and alcohol, among many others. Following a high-level overview of our recommendation framework, we home in on the modeling details, the challenges we have encountered along the way, and how we addressed those challenges."

featured in #479


Ship Shape

- Kerry Halupka, Rowan Katekar tl;dr: How Canva does hand-drawn shape recognition in the browser using machine learning to convert user-drawn scribbles into vector graphics, keeping classification latency at the forefront of the user experience. "We wanted to make sure the experience was snappy but still accurate. Therefore, we decided to deploy the solution in the browser, which allows for real-time shape recognition and drawing assistance, providing a seamless and interactive user experience. Users can draw shapes and receive immediate feedback without experiencing delays associated with server-based processing."

featured in #474


Navigating The Chaos: Why You Don’t Need Another MLOps Tool

tl;dr: AI/ML development lacks systematic processes, leading to errors and biases in deployed models. The MLOps landscape is fragmented, and teams need to glue together a ton of bespoke and third-party tools to meet basic needs. We don’t think you should, so we're building Openlayer to condense and simplify AI evaluation.

featured in #469


Building In-Video Search

tl;dr: "Suppose it’s Christmas, and you want to create a great Instagram piece out of all the best scenes across Netflix films of people shouting “Merry Christmas”! Or suppose it’s Anya Taylor-Joy’s birthday, and you want to create a highlight reel of all her most iconic and dramatic shots. Creating these involves sifting through hundreds of thousands of movies and TV shows to find the right line of dialogue or the appropriate visual elements (objects, scenes, emotions, actions, etc.). We have built an internal system that allows someone to perform in-video search across the entire Netflix video catalog, and we’d like to share our experience in building this system."

featured in #464


Hey, Computer, Make Me A Font

- Sergey Tselovalnikov tl;dr: “This is a story of my journey learning to build generative ML models from scratch and teaching a computer to create fonts in the process.” FontoGen is a generative ML model project that crafts type fonts based on user descriptions. The author delves into the complexities of text-to-SVG generation and the intricacies of maintaining stylistic uniformity across glyphs. Drawing inspiration from the IconShop paper, a sequence-to-sequence model was employed, using text embeddings from BERT and font embeddings from tokenized glyph shapes.

featured in #454


Is This A Date? Using ML To Identify Date Formats In File Names

tl;dr: “To make it easier for our users to organize and find their files, Dropbox has an automated feature called naming conventions. With this feature, users can set rules around how files should be named, and files uploaded to a specific folder will automatically be renamed to match the preferred convention. For example, files could be renamed to include a keyword or date… We developed a machine learning model that can accurately identify dates in a file name so that files can be renamed more effectively.”

featured in #452


How DoorDash Improves Holiday Predictions Via Cascade ML Approach

- Chad Akkoyun, Zainab Danish tl;dr: DoorDash's engineering team tackled the challenge of accurately forecasting supply and demand during holidays, where traditional tree-based machine learning models like Random Forest and Gradient Boosting faced limitations. The article introduces the "cascade modeling approach" as a solution. This method extends the Gradient Boosting Machine model with a linear model to account for holiday impacts, enhancing forecast accuracy. The cascade approach involves calculating holiday multipliers, preprocessing data, and post-processing forecasts.

featured in #446
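
As a rough illustration of the three cascade steps named in the DoorDash holiday-forecasting summary above (holiday multipliers, preprocessing, post-processing), here is a minimal sketch. It is not DoorDash's implementation: the function names, the toy data, and the use of a plain average in place of a Gradient Boosting Machine are all assumptions made for illustration, and the multiplier is a simple ratio rather than a fitted linear model.

```python
from statistics import mean


def holiday_multipliers(history, baseline):
    """Step 1: average ratio of holiday demand to the non-holiday baseline.

    `history` is a list of (value, holiday_name_or_None) pairs (hypothetical format).
    """
    ratios = {}
    for value, holiday in history:
        if holiday is not None:
            ratios.setdefault(holiday, []).append(value / baseline)
    return {name: mean(values) for name, values in ratios.items()}


def remove_holiday_effect(history, multipliers):
    """Step 2 (preprocess): divide holiday observations by their multiplier
    so the base model trains on holiday-neutral data."""
    return [
        value / multipliers[holiday] if holiday is not None else value
        for value, holiday in history
    ]


def cascade_forecast(base_forecast, holiday, multipliers):
    """Step 3 (post-process): rescale the base forecast on holidays only."""
    return base_forecast * multipliers.get(holiday, 1.0)


# Toy daily demand; the fourth day is a holiday with half the usual volume.
history = [
    (100.0, None),
    (110.0, None),
    (90.0, None),
    (50.0, "thanksgiving"),
    (100.0, None),
]

baseline = mean(value for value, holiday in history if holiday is None)
multipliers = holiday_multipliers(history, baseline)
cleaned = remove_holiday_effect(history, multipliers)

# Assumption: a plain average stands in for the trained GBM here.
base_forecast = mean(cleaned)
forecast = cascade_forecast(base_forecast, "thanksgiving", multipliers)
```

The point of the split is that the base model never has to learn rare holiday spikes or dips from a handful of examples; the holiday effect is estimated separately and applied only to the affected days.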