Awesome post, Logan! I was fascinated the whole way through. It's cool to see the way your discipline both differs from and overlaps with my daily work. Though the tech stacks are wildly different, we are both focused on enabling other developers to build amazing user-facing stuff on top of a set of primitives. I agree that staying nimble and using resources in an optimal manner really sets platform teams apart. I'm sure that's all the more true in ML infra where, as you say, resources are precious and experimentation is really at the fore.
Thanks for your insights—you're clearly an amazing technologist 👏
Thanks Drew! Admittedly, I haven't worked on a non-ML platform team but I think it's really interesting how the focus of a platform affects design decisions. Especially coming from modeling experience this was something super interesting to me about ML infra.
FWIW, if you ever feel like posting about it, I would personally love to hear more about how you evaluate data center networks and compute architectures when deploying a model.
I like stories where an understanding of the actual computer system allows us to write much more efficient software. Tends to be a bit of a blindspot for me.
This is a really good idea. While not directly what I work on, that's been super interesting to learn about. I'll have to make sure my understanding of it is rock solid first.
Great article. I think everyone is focused on the fancy algorithms, but not many talk about what goes on behind making these algorithms operational. Would love to read more about each part of the ML infrastructure.
Agreed. I see a lot about machine learning but everyone seems to gloss over how we bring models to users and the complexities involved in it. I’ll definitely be writing more about it.
Thank you for this insightful piece on machine learning infrastructure!
Your explanation of the complexities involved in deploying and maintaining these systems is enlightening.
I’m curious, in your experience, what are the most common pitfalls organizations encounter when scaling their machine learning infrastructure, and how can they proactively address these challenges?
Prioritization can be a huge pitfall. In my experience, there's always a ton of meaningful work to be done, and great leadership is needed to prioritize and refine important work so a team can make meaningful progress. A second pitfall is not taking the time to understand user needs. The infrastructure needs to be efficient, but it also needs to be easy for modelers to use. User feedback should be gathered regularly and used to drive improvements.
Interesting. So what sort of mechanism is best for gathering that feedback, and is there a frequency that works best? Like on demand or once a day/week? I would assume there is a best practice or systematic setup that works best.
Thanks, Logan. I learned a lot. I know that because I don't know much about what you wrote about yet. On to some learning!
I’m glad it was helpful! I’ll be expanding on these topics in future articles 😊