
Deep Papers (Arize AI)
Explore every episode of Deep Papers
Pub. Date | Title | Duration
---|---|---
18 Jan 2023 | ChatGPT and InstructGPT: Aligning Language Models to Human Intention | 00:47:39
Deep Papers is a podcast series featuring deep dives on today’s seminal AI papers and research. Hosted by AI Pub creator Brian Burns and Arize AI founders Jason Lopatecki and Aparna Dhinakaran, each episode profiles the people and techniques behind cutting-edge breakthroughs in machine learning. Read OpenAI's InstructGPT paper here: https://openai.com/blog/instruction-following/ Learn more about AI observability and evaluation, join the Arize AI Slack community, or get the latest on LinkedIn and X.
13 Feb 2023 | Hungry Hungry Hippos - H3 | 00:41:53
20 Mar 2023 | Toolformer: Training LLMs To Use Tools | 00:34:06
21 Jul 2023 | Orca: Progressive Learning from Complex Explanation Traces of GPT-4 | 00:42:03
26 Jul 2023 | Lost in the Middle: How Language Models Use Long Contexts | 00:42:28
This episode is led by Sally-Ann DeLucia and Amber Roberts, who discuss the paper "Lost in the Middle: How Language Models Use Long Contexts."
31 Jul 2023 | Llama 2: Open Foundation and Fine-Tuned Chat Models | 00:30:26
This episode is led by Aparna Dhinakaran (Chief Product Officer, Arize AI) and Michael Schiff (Chief Technology Officer, Arize AI), who discuss the paper "Llama 2: Open Foundation and Fine-Tuned Chat Models."
30 Aug 2023 | Skeleton of Thought: LLMs Can Do Parallel Decoding | 00:43:39
In this paper reading, we explore the ‘Skeleton-of-Thought’ (SoT) approach, which aims to reduce large language model latency while enhancing answer quality.
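The core SoT recipe is to ask the model for a short answer skeleton and then expand each skeleton point in parallel rather than decoding one long answer serially. Below is a minimal sketch of that flow; `call_llm` is a hypothetical stand-in for whatever model client you use, and the canned responses exist only so the example runs.

```python
import asyncio

async def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real (async) model call.
    await asyncio.sleep(0.1)
    if "outline" in prompt:
        return "point one\npoint two\npoint three"
    return f"(expansion of: {prompt.splitlines()[-1]})"

async def skeleton_of_thought(question: str) -> str:
    # Step 1: ask for a terse skeleton of the answer.
    skeleton = await call_llm(
        f"Give a 3-5 point outline (one short phrase per line) for: {question}"
    )
    points = [p.strip() for p in skeleton.splitlines() if p.strip()]
    # Step 2: expand every point concurrently instead of decoding one long answer serially.
    expansions = await asyncio.gather(
        *[call_llm(f"Question: {question}\nExpand this point in 2-3 sentences:\n{p}")
          for p in points]
    )
    return "\n\n".join(expansions)

print(asyncio.run(skeleton_of_thought("Why is the sky blue?")))
```

The latency win comes from the `asyncio.gather` step: the point expansions are independent requests, so wall-clock time is roughly one skeleton call plus one expansion call instead of one long sequential decode.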
29 Sep 2023 | Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior | 00:42:14
17 Oct 2023 | Explaining Grokking Through Circuit Efficiency | 00:36:12
Join Arize Co-Founder & CEO Jason Lopatecki and ML Solutions Engineer Sally-Ann DeLucia as they discuss "Explaining Grokking Through Circuit Efficiency." The paper makes novel predictions about grokking and provides significant evidence in favor of its explanation. Most strikingly, it demonstrates two novel and surprising behaviors: ungrokking, in which a network regresses from perfect to low test accuracy, and semi-grokking, in which a network shows delayed generalization to partial rather than perfect test accuracy.
18 Oct 2023 | RankVicuna: Zero-Shot Listwise Document Reranking with Open-Source Large Language Models | 00:43:49
We discuss RankVicuna, the first fully open-source LLM capable of performing high-quality listwise reranking in a zero-shot setting. While researchers have successfully applied LLMs such as ChatGPT to reranking in an information retrieval context, such work has mostly been built on proprietary models hidden behind opaque API endpoints. This approach yields experimental results that are not reproducible and non-deterministic, threatening the veracity of outcomes that build on such shaky foundations. RankVicuna provides access to a fully open-source LLM and associated code infrastructure capable of performing high-quality reranking.
20 Nov 2023 | Towards Monosemanticity: Decomposing Language Models With Dictionary Learning | 00:44:50
In this paper read, we discuss “Towards Monosemanticity: Decomposing Language Models Into Understandable Components,” a paper from Anthropic that addresses the challenge of understanding the inner workings of neural networks, drawing parallels with the complexity of human brain function. It explores the concept of “features” (patterns of neuron activations), providing a more interpretable way to dissect neural networks. By decomposing a layer of neurons into thousands of features, this approach uncovers hidden model properties that are not evident when examining individual neurons. These features are demonstrated to be more interpretable and consistent, offering the potential to steer model behavior and improve AI safety.
30 Nov 2023 | The Geometry of Truth: Emergent Linear Structure in LLM Representation of True/False Datasets | 00:41:02
For this paper read, we’re joined by Samuel Marks, Postdoctoral Research Associate at Northeastern University, to discuss his paper, “The Geometry of Truth: Emergent Linear Structure in LLM Representation of True/False Datasets.” Samuel and his team curated high-quality datasets of true/false statements and used them to study in detail the structure of LLM representations of truth. Overall, they present evidence that language models linearly represent the truth or falsehood of factual statements and also introduce a novel technique, mass-mean probing, which generalizes better and is more causally implicated in model outputs than other probing techniques.
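As a rough illustration of the mass-mean probing idea (our sketch, not the authors' code): the probe direction is simply the difference between the mean activation of true statements and the mean activation of false statements, and a new statement is classified by projecting its activation onto that direction. The arrays below are random stand-ins for real LLM hidden states.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512
acts_true = rng.normal(loc=0.3, size=(200, d))    # hidden states for true statements
acts_false = rng.normal(loc=-0.3, size=(200, d))  # hidden states for false statements

# Mass-mean direction: difference of the class means.
theta = acts_true.mean(axis=0) - acts_false.mean(axis=0)

def predict_true(h: np.ndarray, threshold: float = 0.0) -> bool:
    # Project a statement's activation onto the "truth" direction.
    return float(h @ theta) > threshold

print(predict_true(acts_true[0]), predict_true(acts_false[0]))
```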
18 Dec 2023 | How to Prompt LLMs for Text-to-SQL: A Study in Zero-shot, Single-domain, and Cross-domain Settings | 00:44:59
We’re thrilled to be joined by Shuaichen Chang, LLM researcher and author of this week’s paper, to discuss his findings. Shuaichen’s research investigates the impact of prompt construction on the performance of large language models (LLMs) in the text-to-SQL task, particularly focusing on zero-shot, single-domain, and cross-domain settings. Shuaichen and his team explore various strategies for prompt construction, evaluating the influence of database schema, content representation, and prompt length on LLMs’ effectiveness. The findings emphasize the importance of careful prompt construction, highlighting the crucial role of table relationships and content, the effectiveness of in-domain demonstration examples, and the significance of prompt length in cross-domain scenarios.
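In that spirit, here is a toy example of zero-shot prompt construction for text-to-SQL that includes the database schema plus a little table content before the question; the schema, sample rows, and question are invented for illustration, not taken from the paper.

```python
# Build a zero-shot text-to-SQL prompt: schema + a bit of content + the question.
schema = """CREATE TABLE singer (singer_id INT PRIMARY KEY, name TEXT, country TEXT, age INT);
CREATE TABLE concert (concert_id INT PRIMARY KEY, singer_id INT, year INT);"""

sample_rows = """-- singer: (1, 'Adele', 'UK', 35), (2, 'Bad Bunny', 'Puerto Rico', 30)
-- concert: (10, 1, 2023), (11, 2, 2024)"""

question = "How many concerts were held in 2023 for singers from the UK?"

prompt = (
    f"{schema}\n{sample_rows}\n\n"
    "-- Using valid SQLite, answer the following question for the tables above.\n"
    f"-- Question: {question}\nSELECT"
)
print(prompt)  # send to an LLM and let it complete the SQL starting at SELECT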
27 Dec 2023 | A Deep Dive Into Generative's Newest Models: Gemini vs Mistral (Mixtral-8x7B)–Part I | 00:47:50
For the last paper read of the year, Arize CPO & Co-Founder Aparna Dhinakaran is joined by Dat Ngo (ML Solutions Architect) and Aman Khan (Product Manager) for an exploration of the new kids on the block: Gemini and Mixtral-8x7B. Link to transcript and live recording: https://arize.com/blog/a-deep-dive-into-generatives-newest-models-mistral-mixtral-8x7b/
02 Feb 2024 | HyDE: Precise Zero-Shot Dense Retrieval without Relevance Labels | 00:36:22
We discuss HyDE (Hypothetical Document Embeddings), a zero-shot dense retrieval technique that combines GPT-3’s language understanding with contrastive text encoders.
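A minimal sketch of the HyDE flow, with hypothetical `generate` and `embed` stand-ins in place of a real LLM and contrastive encoder: the LLM first writes a hypothetical answer document, and it is that document's embedding (not the query's) that drives nearest-neighbor search over the corpus.

```python
import numpy as np

def generate(prompt: str) -> str:
    # Stand-in for an LLM call; returns a canned "hypothetical document."
    return "Aspirin inhibits COX enzymes, reducing prostaglandin synthesis."

def embed(text: str) -> np.ndarray:
    # Stand-in for a contrastive text encoder; deterministic random unit vector.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

corpus = ["Aspirin mechanism of action ...", "History of the Roman empire ..."]
corpus_vecs = np.stack([embed(doc) for doc in corpus])

query = "How does aspirin work?"
hypothetical_doc = generate(f"Write a short passage that answers: {query}")
q_vec = embed(hypothetical_doc)          # embed the fake document, not the raw query
scores = corpus_vecs @ q_vec             # cosine similarity (vectors are unit norm)
print(corpus[int(np.argmax(scores))])
```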
02 Feb 2024 | Phi-2 Model | 00:44:29
We dive into Phi-2 and some of the major differences and use cases for a small language model (SLM) versus an LLM. Find the transcript and live recording: https://arize.com/blog/phi-2-model
08 Feb 2024 | RAG vs Fine-Tuning | 00:39:49
This week, we’re discussing "RAG vs Fine-Tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture." This paper explores a pipeline for fine-tuning and RAG, and presents the tradeoffs of both for multiple popular LLMs, including Llama2-13B, GPT-3.5, and GPT-4.
01 Mar 2024 | Sora: OpenAI’s Text-to-Video Generation Model | 00:45:08
This week, we discuss the implications of text-to-video generation and speculate about the possibilities (and limitations) of this incredible technology with some hot takes. Dat Ngo, ML Solutions Engineer at Arize, is joined by community member and AI Engineer Vibhu Sapra to review OpenAI’s technical report on its text-to-video generation model, Sora. At the end of our discussion, we also explore EvalCrafter: Benchmarking and Evaluating Large Video Generation Models. This recent paper proposes a new framework and pipeline to exhaustively evaluate the performance of generated videos, which we look at in light of Sora.
15 Mar 2024 | Reinforcement Learning in the Era of LLMs | 00:44:49
We’re exploring Reinforcement Learning in the Era of LLMs this week with Claire Longo, Arize’s Head of Customer Success. Recent advancements in Large Language Models (LLMs) have garnered wide attention and led to successful products such as ChatGPT and GPT-4. Their proficiency in adhering to instructions and delivering harmless, helpful, and honest (3H) responses can largely be attributed to the technique of Reinforcement Learning from Human Feedback (RLHF). This week’s paper aims to link the research in conventional RL to the RL techniques used in LLM research and to demystify the technique by discussing why, when, and how RL excels.
25 Mar 2024 | Anthropic Claude 3 | 00:43:01
This week we dive into the latest buzz in the AI world – the arrival of Claude 3. Claude 3 is the newest family of models in the LLM space, and Claude 3 Opus (Anthropic's "most intelligent" Claude model) challenges the likes of GPT-4.
04 Apr 2024 | Demystifying Chronos: Learning the Language of Time Series | 00:44:40
This week, we’re covering Amazon’s time series model: Chronos. Developing accurate machine-learning-based forecasting models has traditionally required substantial dataset-specific tuning and model customization. Chronos, however, is built on a language model architecture and trained with billions of tokenized time series observations, enabling it to provide accurate zero-shot forecasts that match or exceed purpose-built models.
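To give a flavor of what "tokenized time series observations" means in practice, here is an illustrative sketch (not Amazon's code) of scaling a series and quantizing each value into a fixed vocabulary of bins so a standard language-model stack can consume it; the bin count and mean-scaling choice are assumptions for the example.

```python
import numpy as np

def tokenize_series(values: np.ndarray, n_bins: int = 1024, clip: float = 10.0):
    # Scale by the mean absolute value, then quantize into a fixed set of bins.
    scale = float(np.mean(np.abs(values))) or 1.0
    scaled = np.clip(values / scale, -clip, clip)
    bin_edges = np.linspace(-clip, clip, n_bins - 1)   # uniform bin edges
    tokens = np.digitize(scaled, bin_edges)            # token ids in [0, n_bins - 1]
    return tokens, scale

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 6 * np.pi, 100)) * 50 + rng.normal(0, 2, 100)
tokens, scale = tokenize_series(series)
print(tokens[:10], "scale:", round(scale, 2))
```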
26 Apr 2024 | Keys To Understanding ReAct: Synergizing Reasoning and Acting in Language Models | 00:45:07
This week we explore ReAct, an approach that enhances the reasoning and decision-making capabilities of LLMs by combining step-by-step reasoning with the ability to take actions and gather information from external sources in a unified framework.
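The gist of ReAct is an interleaved Thought, Action, Observation loop. Here is a compact, self-contained sketch of that loop using a canned `llm` response and a toy `search` tool; both are hypothetical stand-ins, not the paper's code.

```python
import re

def llm(transcript: str) -> str:
    # Hypothetical stand-in for a model call that continues the ReAct transcript.
    if "Observation:" in transcript:
        return "Thought: I now know the answer.\nFinal Answer: Paris"
    return 'Thought: I should look this up.\nAction: search["capital of France"]'

def search(query: str) -> str:
    return "Paris is the capital of France."  # toy tool

def react(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)                 # Thought plus an Action or Final Answer
        transcript += step + "\n"
        if "Final Answer:" in step:
            break
        match = re.search(r'Action:\s*(\w+)\["(.+?)"\]', step)
        if match:
            tool, arg = match.groups()
            observation = search(arg) if tool == "search" else "unknown tool"
            transcript += f"Observation: {observation}\n"   # feed the result back in
    return transcript

print(react("What is the capital of France?"))
```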
13 May 2024 | Breaking Down EvalGen: Who Validates the Validators? | 00:44:47
Due to the cumbersome nature of human evaluation and the limitations of code-based evaluation, Large Language Models (LLMs) are increasingly being used to assist humans in evaluating LLM outputs. Yet LLM-generated evaluators often inherit the problems of the LLMs they evaluate, requiring further human validation. This week’s paper explores EvalGen, a mixed-initiative approach to aligning LLM-generated evaluation functions with human preferences. EvalGen assists users in developing both criteria for acceptable LLM outputs and functions to check these standards, ensuring evaluations reflect the users’ own grading standards.
30 May 2024 | Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models' Alignment | 00:48:07
We break down the paper "Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models' Alignment." Ensuring alignment (i.e., making models behave in accordance with human intentions) has become a critical task before deploying LLMs in real-world applications. However, a major challenge faced by practitioners is the lack of clear guidance on evaluating whether LLM outputs align with social norms, values, and regulations. To address this issue, the paper presents a comprehensive survey of key dimensions that are crucial to consider when assessing LLM trustworthiness. The survey covers seven major categories: reliability, safety, fairness, resistance to misuse, explainability and reasoning, adherence to social norms, and robustness. The measurement results indicate that, in general, more aligned models tend to perform better in terms of overall trustworthiness, although the effectiveness of alignment varies across the different trustworthiness categories considered. By shedding light on these key dimensions, the paper aims to provide valuable insights and guidance to practitioners; understanding and addressing these concerns will be crucial to the reliable and ethically sound deployment of LLMs in real-world applications.
14 Jun 2024 | LLM Interpretability and Sparse Autoencoders: Research from OpenAI and Anthropic | 00:44:00
It’s been an exciting couple of weeks for GenAI! Join us as we discuss the latest research from OpenAI and Anthropic, a significant step forward in understanding how LLMs work and what it implies for a deeper understanding of the neural activity of language models. These two recent papers both focus on the sparse autoencoder, an unsupervised approach for extracting interpretable features from an LLM. In "Extracting Concepts from GPT-4," OpenAI researchers propose using k-sparse autoencoders to directly control sparsity, simplifying tuning and improving the reconstruction-sparsity frontier. In "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet," researchers at Anthropic show that scaling laws can be used to guide the training of sparse autoencoders, among other findings.
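For intuition, here is a rough PyTorch sketch of a k-sparse (TopK) autoencoder of the kind OpenAI's paper describes: only the k largest latent activations are kept per example, which controls sparsity directly instead of tuning an L1 penalty. The dimensions and random "activations" are illustrative, not taken from either paper.

```python
import torch
import torch.nn as nn

class TopKSparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, d_latent: int = 16384, k: int = 32):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, x):
        z = torch.relu(self.encoder(x))
        topk = torch.topk(z, self.k, dim=-1)
        # Keep only the k largest activations per example; zero the rest.
        sparse_z = torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)
        return self.decoder(sparse_z), sparse_z

sae = TopKSparseAutoencoder()
activations = torch.randn(8, 768)              # stand-in for residual-stream activations
recon, features = sae(activations)
loss = torch.mean((recon - activations) ** 2)  # reconstruction objective
print(loss.item(), (features != 0).float().sum(dim=-1))  # exactly k active features each
```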
28 Jun 2024 | RAFT: Adapting Language Model to Domain Specific RAG | 00:44:01
Adapting LLMs to specialized domains (e.g., recent news, enterprise private documents) is essential for many applications, so this week we discuss a paper that asks how to adapt pre-trained LLMs for RAG in such settings. SallyAnn DeLucia is joined by Sai Kolasani, researcher at UC Berkeley’s RISE Lab (and Arize AI Intern), to talk about his work on RAFT: Adapting Language Model to Domain Specific RAG. RAFT (Retrieval-Augmented FineTuning) is a training recipe that improves an LLM’s ability to answer questions in “open-book,” in-domain settings. Given a question and a set of retrieved documents, the model is trained to ignore documents that don’t help in answering the question (aka distractor documents). This, coupled with RAFT’s chain-of-thought-style responses, helps improve the model’s ability to reason. In domain-specific RAG, RAFT consistently improves performance across the PubMed, HotpotQA, and Gorilla datasets, presenting a post-training recipe for adapting pre-trained LLMs to in-domain RAG. Read it on the blog: https://arize.com/blog/raft-adapting-language-model-to-domain-specific-rag/
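To make the recipe concrete, here is a sketch (our illustration, not the authors' pipeline) of how a single RAFT-style training example might be assembled: the question is paired with its oracle document plus sampled distractors, the oracle is dropped for a fraction of examples, and the target is a chain-of-thought answer. The placeholder target text and the probabilities are assumptions.

```python
import json
import random

def build_raft_example(question, oracle_doc, corpus, num_distractors=3, p_drop_oracle=0.2):
    # Sample distractor documents and (sometimes) drop the oracle so the model
    # also learns to answer from its own knowledge.
    distractors = random.sample([d for d in corpus if d != oracle_doc], num_distractors)
    docs = distractors if random.random() < p_drop_oracle else distractors + [oracle_doc]
    random.shuffle(docs)
    context = "\n\n".join(f"[Doc {i + 1}] {d}" for i, d in enumerate(docs))
    target = (
        "Reasoning: the relevant evidence is in the document describing the answer; "
        "the other documents are distractors.\nAnswer: <gold answer here>"
    )
    return {"prompt": f"{context}\n\nQuestion: {question}", "target": target}

corpus = ["Doc about topic A.", "Doc about topic B.", "Doc about topic C.", "Doc answering Q."]
example = build_raft_example("What does the oracle doc say?", "Doc answering Q.", corpus)
print(json.dumps(example, indent=2))
```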
23 Jul 2024 | DSPy Assertions: Computational Constraints for Self-Refining Language Model Pipelines | 00:33:57
Chaining language model (LM) calls as composable modules is fueling a new way of programming, but ensuring LMs adhere to important constraints requires heuristic “prompt engineering.”
06 Aug 2024 | Breaking Down Meta's Llama 3 Herd of Models | 00:44:40
Meta just released Llama 3.1 405B. According to Meta, it’s “the first openly available model that rivals the top AI models when it comes to state-of-the-art capabilities in general knowledge, steerability, math, tool use, and multilingual translation.” Will the latest Llama herd ignite new applications and modeling paradigms like synthetic data generation? Will it enable the improvement and training of smaller models, as well as model distillation? Meta thinks so. We’ll take a look at what they did here, talk about open source, and decide if we want to believe the hype.
16 Aug 2024 | Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges | 00:39:05
This week’s paper presents a comprehensive study of the performance of various LLMs acting as judges. The researchers leverage TriviaQA as a benchmark for assessing the objective knowledge reasoning of LLMs and evaluate them alongside human annotations, which they find to have high inter-annotator agreement. The study includes nine judge models and nine exam-taker models – both base and instruction-tuned. They assess the judge models’ alignment across different model sizes, families, and judge prompts to answer questions about the strengths and weaknesses of this paradigm and the potential biases it may hold. Read it on the blog: https://arize.com/blog/judging-the-judges-llm-as-a-judge/
11 Sep 2024 | Composable Interventions for Language Models | 00:42:35
This week, we're excited to be joined by Kyle O'Brien, Applied Scientist at Microsoft, to discuss his most recent paper, Composable Interventions for Language Models. Kyle and his team present a new framework, composable interventions, that allows for the study of multiple interventions applied sequentially to the same language model. The discussion covers their key findings from extensive experiments, revealing how different interventions—such as knowledge editing, model compression, and machine unlearning—interact with each other.
19 Sep 2024 | Breaking Down Reflection Tuning: Enhancing LLM Performance with Self-Learning | 00:26:54
A recent announcement on X boasted a tuned model with outstanding performance and claimed these results were achieved through Reflection Tuning. However, people were unable to reproduce the results. We dive into this recent drama in the AI community as a jumping-off point for a discussion about Reflection 70B.
27 Sep 2024 | Exploring OpenAI's o1-preview and o1-mini | 00:42:02
OpenAI recently released o1-preview, which it claims outperforms GPT-4o on a number of benchmarks. These models are designed to think more before answering and to handle complex tasks better than OpenAI's other models, especially science and math questions.
15 Oct 2024 | Google's NotebookLM and the Future of AI-Generated Audio | 00:43:28
This week, Aman Khan and Harrison Chu explore NotebookLM’s unique features, including its ability to generate realistic-sounding podcast episodes from text (but this podcast is very real!). They dive into some of the technical underpinnings of the product, specifically the SoundStorm model used for generating high-quality audio, and how it leverages residual vector quantization (RVQ), a hierarchical quantization approach, to maintain consistency in speaker voice and tone throughout long audio durations.
16 Oct 2024 | The Shrek Sampler: How Entropy-Based Sampling is Revolutionizing LLMs | 00:03:31
In this byte-sized podcast, Harrison Chu, Director of Engineering at Arize, breaks down the Shrek Sampler, an entropy-based sampling technique for LLMs.
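As background on entropy-based sampling in general (a toy illustration, not the actual Shrek Sampler), the idea is to measure the entropy of the next-token distribution and adapt the decoding decision, for example acting greedily when the model is confident and sampling more broadly when it is uncertain. The thresholds below are arbitrary.

```python
import numpy as np

def entropy(probs: np.ndarray) -> float:
    probs = probs[probs > 0]
    return float(-(probs * np.log(probs)).sum())

def pick_next_token(logits: np.ndarray, rng, low=0.5, high=3.0) -> int:
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    h = entropy(probs)
    if h < low:                        # confident: take the argmax
        return int(np.argmax(probs))
    temp = 1.0 if h < high else 1.5    # uncertain: sample, hotter if very uncertain
    hot = np.exp(np.log(probs) / temp)
    hot /= hot.sum()
    return int(rng.choice(len(probs), p=hot))

rng = np.random.default_rng(0)
fake_logits = rng.normal(size=50)      # stand-in for model logits over a vocabulary
print(pick_next_token(fake_logits, rng))
```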
24 Oct 2024 | KV Cache Explained | 00:04:19
In this episode, we dive into the intriguing mechanics behind why chat experiences with models like GPT often start slow but then rapidly pick up speed. The key? The KV cache. This essential but under-discussed component enables the seamless and snappy interactions we expect from modern AI systems.
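A toy single-head sketch of the mechanics (ours, not tied to any particular implementation): keys and values for past tokens are computed once and cached, so each decode step only projects and attends for the newest token, which is why generation speeds up after the first, full-prompt pass.

```python
import numpy as np

d = 64
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
k_cache, v_cache = [], []

def decode_step(x_new: np.ndarray) -> np.ndarray:
    # x_new: embedding of the single newest token, shape (d,)
    k_cache.append(x_new @ W_k)          # append; never recompute old keys/values
    v_cache.append(x_new @ W_v)
    q = x_new @ W_q
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                   # attention output for the new token only

for _ in range(5):                       # each step costs O(seq_len), not O(seq_len^2)
    out = decode_step(rng.normal(size=d))
print(out.shape)
```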
29 Oct 2024 | Swarm: OpenAI's Experimental Approach to Multi-Agent Systems | 00:46:46
As multi-agent systems grow in importance for fields ranging from customer support to autonomous decision-making, OpenAI has introduced Swarm, an experimental framework that simplifies the process of building and managing these systems. Swarm, a lightweight Python library, is designed for educational purposes, stripping away complex abstractions to reveal the foundational concepts of multi-agent architectures. In this podcast, we explore Swarm’s design, its practical applications, and how it stacks up against other frameworks. Whether you’re new to multi-agent systems or looking to deepen your understanding, Swarm offers a straightforward, hands-on way to get started.
12 Nov 2024 | Introduction to OpenAI's Realtime API | 00:29:56
We break down OpenAI’s Realtime API. Learn how to seamlessly integrate powerful language models into your applications for instant, context-aware responses that drive user engagement. Whether you’re building chatbots or dynamic content tools, or enhancing real-time collaboration, we walk through the API’s capabilities, potential use cases, and best practices for implementation.
23 Nov 2024 | Agent-as-a-Judge: Evaluate Agents with Agents | 00:24:54
This week, we break down the “Agent-as-a-Judge” framework—a new agent evaluation paradigm that’s kind of like getting robots to grade each other’s homework. Where typical evaluation methods focus solely on outcomes or demand extensive manual work, this approach uses agent systems to evaluate agent systems, offering intermediate feedback throughout the task-solving process. With the power to unlock scalable self-improvement, Agent-as-a-Judge could redefine how we measure and enhance agent performance. Let's get into it!
10 Dec 2024 | Merge, Ensemble, and Cooperate! A Survey on Collaborative LLM Strategies | 00:28:47
LLMs have revolutionized natural language processing, showcasing remarkable versatility and capabilities. But individual LLMs often exhibit distinct strengths and weaknesses, influenced by differences in their training corpora. This diversity poses a challenge: how can we maximize the efficiency and utility of LLMs? A new paper, "Merge, Ensemble, and Cooperate: A Survey on Collaborative Strategies in the Era of Large Language Models," highlights collaborative strategies to address this challenge. In this week's episode, we summarize key insights from this paper and discuss practical implications of LLM collaboration strategies across three main approaches: merging, ensemble, and cooperation. We also review some new open source models we're excited about.
23 Dec 2024 | LLMs as Judges: A Comprehensive Survey on LLM-Based Evaluation Methods | 00:28:57
We discuss a major survey of work and research on LLM-as-Judge from the last few years. "LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods" systematically examines the LLMs-as-Judge framework across five dimensions: functionality, methodology, applications, meta-evaluation, and limitations. The survey gives us a bird's-eye view of the advantages and limitations of the framework, along with methods for evaluating its effectiveness. Read a breakdown on our blog: https://arize.com/blog/llm-as-judge-survey-paper/
14 Jan 2025 | Training Large Language Models to Reason in Continuous Latent Space | 00:24:58
LLMs have typically been restricted to reasoning in the "language space," where chain-of-thought (CoT) is used to solve complex reasoning problems. But a new paper argues that language space may not always be the best space for reasoning. In this paper read, we cover an exciting new technique from a team at Meta called Chain of Continuous Thought, also known as "Coconut." The paper, "Training Large Language Models to Reason in a Continuous Latent Space," explores the potential of allowing LLMs to reason in an unrestricted latent space instead of being constrained by natural language tokens.
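For intuition, here is a rough sketch of the continuous-thought idea (not Meta's implementation, and untrained, so it won't actually reason better): instead of decoding chain-of-thought tokens, the model's last hidden state is appended back onto the input embeddings for a few latent steps before reading out tokens in language space. GPT-2 is used only because its hidden size matches its embedding size; the step count is arbitrary.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

prompt = "Question: what is 12 + 7? Thought:"
embeds = model.get_input_embeddings()(tok(prompt, return_tensors="pt").input_ids)

with torch.no_grad():
    for _ in range(4):  # latent "thought" steps: no tokens are emitted here
        out = model(inputs_embeds=embeds)
        last_hidden = out.hidden_states[-1][:, -1:, :]    # (1, 1, hidden_size)
        embeds = torch.cat([embeds, last_hidden], dim=1)  # becomes the next input
    logits = model(inputs_embeds=embeds).logits[:, -1, :]

print("next token after latent reasoning:", tok.decode(logits.argmax(dim=-1)))
```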
04 Feb 2025 | Multiagent Finetuning: A Conversation with Researcher Yilun Du | 00:30:03
We talk to Google DeepMind Senior Research Scientist (and incoming Assistant Professor at Harvard), Yilun Du, about his latest paper "Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains." This paper introduces a multiagent finetuning framework that enhances the performance and diversity of language models by employing a society of agents with distinct roles, improving feedback mechanisms and overall output quality. The method enables autonomous self-improvement through iterative finetuning, achieving significant performance gains across various reasoning tasks. It's versatile, applicable to both open-source and proprietary LLMs, and can integrate with human-feedback-based methods like RLHF or DPO, paving the way for future advancements in language model development.
21 Feb 2025 | How DeepSeek is Pushing the Boundaries of AI Development | 00:29:54
This week, we dive into DeepSeek. SallyAnn DeLucia, Product Manager at Arize, and Nick Luzio, a Solutions Engineer, break down key insights on a model that has been dominating headlines for its significant breakthrough in inference speed over other models. What’s next for AI (and open source)? From training strategies to real-world performance, here’s what you need to know. Read a summary: https://arize.com/blog/how-deepseek-is-pushing-the-boundaries-of-ai-development/
01 Mar 2025 | AI Roundup: DeepSeek’s Big Moves, Claude 3.7, and the Latest Breakthroughs | 00:30:23
This week, we're mixing things up a little bit. Instead of diving deep into a single research paper, we cover the biggest AI developments from the past few weeks, breaking down key announcements including DeepSeek's big moves and the release of Claude 3.7. Stay ahead of the curve with this fast-paced recap of the most important AI updates. We'll be back next time with our regularly scheduled programming.
25 Mar 2025 | Model Context Protocol (MCP) | 00:15:03
We cover Anthropic’s groundbreaking Model Context Protocol (MCP). Though it was released in November 2024, we've been seeing a lot of hype around it lately, and thought it was well worth digging into. Learn how this open standard is revolutionizing AI by enabling seamless integration between LLMs and external data sources, fundamentally transforming them into capable, context-aware agents. We explore the key benefits of MCP, including enhanced context retention across interactions, improved interoperability for agentic workflows, and the development of more capable AI agents that can execute complex tasks in real-world environments.
04 Apr 2025 | AI Benchmark Deep Dive: Gemini 2.5 and Humanity's Last Exam | 00:26:11
This week we talk about modern AI benchmarks, taking a close look at Google's recent Gemini 2.5 release and its performance on key evaluations, notably Humanity's Last Exam (HLE). We cover Gemini 2.5's architecture, its advancements in reasoning and multimodality, and its impressive context window. We also talk about how benchmarks like HLE and ARC-AGI-2 help us understand the current state and future direction of AI.
18 Apr 2025 | LibreEval: The Largest Open Source Benchmark for RAG Hallucination Detection | 00:27:19
For this week's paper read, we dive into our own research: LibreEval, an open source benchmark for RAG hallucination detection.