11 Comments

Awesome post, Logan! I was fascinated the whole way through. It's cool to see the way your discipline both differs from and overlaps with my daily work. Though the tech stacks are wildly different, we are both focused on enabling other developers to build amazing user-facing stuff on top of a set of primitives. I agree that staying nimble and using resources optimally really sets platform teams apart. I'm sure that's all the more true in ML infra where, as you say, resources are precious and experimentation is really at the fore.

Thanks for your insights—you're clearly an amazing technologist 👏

Thanks Drew! Admittedly, I haven't worked on a non-ML platform team, but I think it's really interesting how the focus of a platform affects design decisions. Especially coming from modeling experience, this was something super interesting to me about ML infra.

FWIW, if you ever feel like posting about it, I would personally love to hear more about how you evaluate data center networks and compute architectures when deploying a model.

I like stories where an understanding of the actual computer system allows us to write much more efficient software. Tends to be a bit of a blindspot for me.

This is a really good idea. While not directly what I work on, that's been super interesting to learn about. I'll have to make sure my understanding of it is rock solid first.

Thanks, Logan. I learned a lot. I know that because I don't know much about what you wrote about yet. On to some learning!

I’m glad it was helpful! I’ll be expanding on these topics in future articles 😊

Great article. I think everyone is focused on the fancy algorithms, but not many talk about what goes into making these algorithms operational. Would love to read more about each part of the ML infrastructure.

Agreed. I see a lot about machine learning but everyone seems to gloss over how we bring models to users and the complexities involved in it. I’ll definitely be writing more about it.

Thank you for this insightful piece on machine learning infrastructure!

Your explanation of the complexities involved in deploying and maintaining these systems is enlightening.

I’m curious, in your experience, what are the most common pitfalls organizations encounter when scaling their machine learning infrastructure, and how can they proactively address these challenges?

Prioritization can be a huge pitfall. In my experience, there's always a ton of meaningful work to be done, and great leadership is needed to prioritize and refine important work so a team can make meaningful progress. A second pitfall is not taking the time to understand user needs. The infrastructure needs to be efficient, but it also needs to be easy for modelers to use. User feedback should be gathered regularly and used to drive improvements.

Interesting. So what sort of mechanism is best for gathering that feedback, and is there a frequency that works best? Like on demand, or once a day/week? I would assume there is a best practice or systematic setup that works best.
