New Evals for Better Models, AI Research Papers Made Easier to Understand, Train Your Own Flux LoRA, and More
Machine learning highlights and resources 9-23-24
Here are this past week’s most interesting AI highlights and resources. This week I’ve added an AI papers podcast using NotebookLM. Thanks to all supporters of Society’s Backend! If you’re interested in my full reading list and want to support the community, you can do so for just $1/mo.
Don’t forget to subscribe on YouTube and follow me on X.
Highlights
An interesting study showed the negative impact a PhD can have on mental health. It’s been a hot topic on X this past week, as many people argue a PhD isn’t worth the time spent while many AI-related jobs still list one as a requirement.
Google shipped a new set of evals that specifically measure long-context reasoning performance to challenge frontier models. New evals keep appearing because as models get better, we need better methods to evaluate them. Scale AI is set to launch Humanity’s Last Exam, which aims to be the toughest open-source benchmark for LLMs.
There’s been a lot of talk about AI products being hyped to consumers without actually providing value. Here’s a video of the Salesforce CEO, Marc Benioff, calling out Microsoft. I’ve mentioned this previously with regard to consumer AI products in general.
Anthropic has released a lot of research lately on making AI tools more useful. This week they released research on properly chunking data to make RAG more useful. It’s pretty interesting and I recommend following Anthropic and their researchers for more info like this. They also recently released a course on LLM prompt evaluations.
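The core of Anthropic’s chunking research is prepending document-level context to each chunk before embedding, so chunks that are vague in isolation still retrieve well. Here’s a minimal Python sketch of that idea; the chunk sizes, the summary string, and the bracketed prefix format are illustrative assumptions, not Anthropic’s exact implementation:

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character chunks with overlap."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

def contextualize(chunks, doc_context):
    """Prepend a document-level summary to each chunk before embedding,
    so the retriever can match chunks that lack context on their own."""
    return [f"[Context: {doc_context}]\n{chunk}" for chunk in chunks]

# Toy document; a real pipeline would generate doc_context with an LLM.
report = "Revenue grew 3% quarter over quarter. " * 20
chunks = contextualize(chunk_text(report), "ACME Corp Q2 2024 earnings report")
```

Each contextualized chunk is then embedded and indexed as usual; the only change is what text the embedder sees.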
An interesting Reddit post compared DeepSeek 2.5, Claude 3.5 Sonnet, and GPT-4o for coding. It showed only marginal differences between the models, while the open model (DeepSeek) was 17 to 21 times cheaper than its closed counterparts. This shows the utility of open models. Detailed post here.
An interesting comparison of how OpenAI’s o1 and GPT models perform as calculators showcased the utility of the o1 family. o1-mini seems to perform similarly to o1-preview at a much lower cost. Here’s the discussion post and the o1-mini chatbot to give it a try yourself.
Lots of focus on “thought” in LLMs this week. Check out the papers podcast in the section below to get an AI-generated podcast that goes over the important papers I’ve found.
Papers Podcast
Here’s the paper podcast generated by NotebookLM. Let me know what you think!
Here are the links to each paper individually:
Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries
Training Language Models to Self-Correct via Reinforcement Learning
Agents in Software Engineering: Survey, Landscape, and Vision
MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts
Jobs
I found two interesting jobs this week:
Figure AI posts frequently that they’re hiring in all areas of technology but especially in AI.
OpenAI is hiring ML engineers for multi-agent research.
As a reminder, the ML Road Map has a new section that lists the most-desired ML-related job skills pulled from job listings.
Reading List
ByteDance Steps Up AI Chip Efforts
ByteDance, the parent company of TikTok, is accelerating its efforts to produce its own artificial intelligence chips, aiming for mass production by 2026. The company is collaborating with Taiwan Semiconductor Manufacturing Co. to design two semiconductors, which could reduce its reliance on expensive Nvidia chips for AI development and operations.
This move could give ByteDance a competitive edge in China's AI chatbot market by lowering costs and enhancing technological independence.
How to train Flux LoRA models
The article provides a detailed guide on training LoRA models for Flux, an advanced local diffusion model that can surpass the quality of Stable Diffusion 1.5 and XL. It includes a step-by-step tutorial using a Google Colab notebook and ComfyUI as the GUI, covering everything from image preparation to running the training workflow and testing the LoRA model.
Understanding how to train Flux LoRA models is crucial for those looking to create custom AI-generated art and enhance the capabilities of their AI models.
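At its core, a LoRA freezes the pretrained weights and learns a small low-rank update on top of them. This is a minimal NumPy sketch of that math, not the Flux training workflow itself; the dimensions, rank, and scaling factor are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 8, 8, 2, 4  # rank r << d is the low-rank bottleneck

W = rng.normal(size=(d_out, d_in))       # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, r))                 # trainable up-projection, zero init

def lora_forward(x):
    # Effective weight is W + (alpha / r) * B @ A; only A and B are trained.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# Because B starts at zero, the adapter is a no-op at initialization,
# so the adapted model exactly matches the base model before training.
assert np.allclose(lora_forward(x), W @ x)
```

This is why LoRA files are tiny: only A and B (a few percent of the full weights) are saved and shared.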
Agents in Software Engineering: Survey, Landscape, and Vision
The article surveys the integration of Large Language Models (LLMs) with software engineering (SE), highlighting how agents are used to enhance various SE tasks. It presents a framework for LLM-based agents in SE, consisting of perception, memory, and action modules, and discusses current challenges and future opportunities in this area.
Understanding how LLM-based agents can optimize software engineering tasks is crucial for advancing the efficiency and capabilities of software development.
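To make the survey’s framing concrete, here’s a toy skeleton of an agent with perception, memory, and action modules. The class and method names are my own placeholders, and a real agent would call an LLM where noted:

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    """Toy skeleton of the perception / memory / action framing
    from the survey; module contents are illustrative placeholders."""
    memory: list = field(default_factory=list)

    def perceive(self, observation: str) -> None:
        # Perception module: ingest an observation and store it in memory.
        self.memory.append(observation)

    def act(self, goal: str) -> str:
        # Action module: a real agent would prompt an LLM with the goal
        # plus retrieved memory; here we just assemble that context.
        context = " | ".join(self.memory)
        return f"plan for '{goal}' given: {context}"

agent = Agent()
agent.perceive("tests failing in module X")
plan = agent.act("fix the bug")
```

The survey’s point is that most LLM-based SE agents are variations on this loop, differing mainly in how each module is implemented.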
Building A GPT-Style LLM Classifier From Scratch
Sebastian Raschka's article outlines the process of transforming pretrained large language models (LLMs) into text classifiers through finetuning, specifically using a spam classification example. The excerpt discusses the importance of focusing on the last token for capturing contextual information, modifying the output layer of the model, and freezing non-trainable layers to enhance performance.
Understanding how to finetune LLMs for specific tasks like text classification can significantly improve the efficiency and accuracy of AI applications in various domains.
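The last-token idea can be sketched without any framework: take the final token’s hidden state from a frozen model and train only a small linear head on top of it. The shapes and the softmax head below are illustrative assumptions, not Raschka’s exact code:

```python
import numpy as np

def classify_from_last_token(hidden_states, W_head, b_head):
    """hidden_states: (seq_len, d_model) outputs of a frozen pretrained LLM.
    In a causal model the last token has attended to the whole sequence,
    so its hidden state summarizes the input for classification."""
    h_last = hidden_states[-1]           # (d_model,)
    logits = W_head @ h_last + b_head    # (num_classes,); only the head is trained
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

d_model, num_classes = 16, 2  # e.g. spam vs. not-spam
rng = np.random.default_rng(1)
hidden = rng.normal(size=(10, d_model))  # stand-in for real model outputs
W_head = rng.normal(size=(num_classes, d_model))
probs = classify_from_last_token(hidden, W_head, np.zeros(num_classes))
```

Training then updates only `W_head` and `b_head` (plus optionally the last transformer block), which is far cheaper than full finetuning.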
This is written by Sebastian Raschka. It’s definitely worth checking out his newsletter.
The Impact of PhD Studies on Mental Health—A Longitudinal Population Study
The study examines the mental health impact of PhD studies through the analysis of psychiatric medication prescriptions among PhD students in Sweden. Findings reveal that PhD students have a higher rate of psychiatric medication use compared to individuals with a master's degree, with a marked increase during the course of their PhD studies, peaking in the fifth year before declining.
Understanding the mental health challenges faced by PhD students highlights the need for better support systems within academic environments.
How To Build A SOTA Image Diffusion Model (feat. Suhail Doshi)
In the video, Suhail Doshi discusses the process of building a state-of-the-art image diffusion model, covering the technical aspects and the necessary tools and frameworks. He explains the importance of using high-quality datasets, the role of GPUs in accelerating model training, and the integration of advanced techniques to enhance model performance.
Understanding how to build advanced image diffusion models is crucial for those in the field of machine learning and artificial intelligence, as it can lead to significant improvements in image generation and processing technologies.
Fragmented regulation means the EU risks missing out on the AI era
The article highlights concerns that fragmented and inconsistent AI regulations in the EU are causing Europe to fall behind other regions in AI innovation. It stresses the importance of harmonized regulatory frameworks to foster the development of open and multimodal AI models, which can significantly boost productivity, scientific research, and economic growth. The authors argue that without clear and unified rules, Europe risks missing out on advancements that could otherwise benefit its citizens and economy.
Ensuring consistent AI regulations is crucial for Europe to remain competitive and harness the economic and social benefits of AI technology.
Why Large Language Models Cannot (Still) Actually Reason
Large language models (LLMs) struggle with complex reasoning tasks due to their stochastic nature, computational constraints, and inability to perform genuinely open-ended computations. Strategies like chain of thought prompting and integrating external tools have shown some promise in enhancing LLMs' reasoning capabilities, but they still face significant challenges and limitations.
Understanding the limitations and potential improvements of LLMs in reasoning is crucial for developing more reliable and accurate AI systems in the future.
It’s definitely worth checking out the author’s newsletter.
Why Companies Invest in Open-Source Tech and Research [Markets]
The article discusses why companies invest heavily in open-source software (OSS) within the AI sector, highlighting the benefits for various stakeholders such as developers who gain access to advanced tools and contribute to innovation, businesses that reduce development costs and enhance security, and end-users who get better, more affordable products. It also outlines strategies for companies to integrate OSS into their business models, such as offering support services, developing proprietary applications that integrate with OSS, and forming partnerships to build ecosystems around open-source tools.
The importance of the article lies in its comprehensive overview of how open-source technology drives innovation, reduces costs, and creates value for multiple stakeholders in the tech industry, particularly in AI.
It’s definitely worth checking out the author’s newsletter.
Why Google Will Make a Better Model Than OpenAI’s o1
The article discusses the ongoing rivalry between Google and OpenAI, highlighting Google's efforts to develop a superior AI model to OpenAI's o1. Google's upcoming Gemini 2 model aims to improve reasoning quality, extend context windows, and offer multimodal capabilities. Meanwhile, OpenAI's o1 incorporates advanced reinforcement learning and Chain of Thought (CoT) methodologies.
The significance lies in the potential for Google's advancements to push the boundaries of AI capabilities, fostering innovation and competition in the field of artificial intelligence.