
AI Explained Official Podcast (Philip - Host of AI Explained YT)
Pub. Date | Title | Duration | |
---|---|---|---|
28 Oct 2024 | The New Claude 3.5 Sonnet: Better, Yes, But Not Just in the Way You Might Think | 00:22:34 | |
A new state of the art LLM (at least for creative writing and basic reasoning) but what lies behind the numbers that were put out? Is it for real, and are AI agents about to grab your mouse and shake your cursor? | |||
01 Nov 2024 | ChatGPT with Search, Altman Answers Anything and Simple Bench Out | 00:15:20 | |
The Google destroyer, the Perplexity crusher? Or just hype? ChatGPT with Search is here, and simultaneously Altman and co did an AMA on Reddit, covering GPT-5, Sora, SearchGPT and a lot more. Plus, the biggest news of them all: Simple Bench is out. Altman AMA (ask me anything): https://www.reddit.com/r/ChatGPT/comments/1ggixzy/ama_with_openais_sam_altman_kevin_weil_srinivas/ https://x.com/sama/status/1852041075793522911 Perplexity Ads: https://www.cnbc.com/2024/08/22/perplexity-ai-plans-to-start-running-search-ads-in-fourth-quarter.html Perplexity: https://www.perplexity.ai/ | |||
10 Nov 2024 | Leak: ‘GPT-5 exhibits diminishing returns’, Sam Altman: ‘lol’ | 00:15:44 | |
The last few days have seen two narratives emerge. One, derived from yesterday’s OpenAI leak in TheInformation, that GPT-5/Orion is a disappointment, and less of a leap than GPT-3 to GPT-4. The second comes from a series of 4 clips (shown in this video) from Sam Altman, regarding the ‘clear path’ to AGI. Let’s go beyond the headlines (and through papers like Frontier Math) to get closer to the ground truth…
00:39 – Bear Case, TheInformation Leak 04:01 – Bull Case, Sam Altman 06:20 – FrontierMath 11:29 – o1 Paradigm 13:11 – Text to Video Greatness and Universal-2
TheInformation Leak: https://www.theinformation.com/articles/openai-shifts-strategy-as-rate-of-gpt-ai-improvements-slows?rc=sy0ihq Noam Brown Replies: https://x.com/polynoamial/status/1855453104394637444 Sam Altman Y-Combinator Interview: https://www.youtube.com/watch?v=xXCBz_8hM9w&t=1556s Altman Reply: https://x.com/sama/status/1855100359511097828 FrontierMath Paper: https://arxiv.org/pdf/2411.04872 Frontier Math Blog Post: https://epochai.org/frontiermath Tao: https://x.com/EpochAIResearch/status/1854996368814936250 MMLU Are We Done (cites me!): https://arxiv.org/pdf/2406.04127 Universal-2 https://www.assemblyai.com/research/universal-2 Noam Brown ‘We don’t know’: https://www.youtube.com/watch?v=Gr_eYXdHFis Anthropic Founder Response: https://x.com/jackclarkSF/status/1855485569998217231 Sora (Runway Comment): https://x.com/c_valenzuelab/status/1855026417354129455 Sora New Vid: https://www.youtube.com/watch?v=_iETa2KDRuw Darri3D Video: https://www.reddit.com/r/ChatGPT/comments/1gn0n3z/can_you/ | |||
15 Nov 2024 | New Google Model Ranked ‘No. 1 LLM’, But There’s a Problem | 00:15:19 | |
A new and mysterious Gemini model appears at the top of the leaderboard, but is that the full story? I dig behind the headline to show you some anti-climactic results, give some context with leaks in the last 48 hours of diminishing returns to scaling, and add the response of Altman, OpenAI and co. The future is about to look a lot stranger...
You can now gift memberships to AI Insiders (my Patreon w/ exclusive vids, network): https://www.patreon.com/AIExplained/gift https://x.com/vedantmisra/status/1857148554105544708 Gemini Ranking: https://lmarena.ai/?leaderboard API not yet up: https://x.com/OfficialLoganK/status/1857106844805681153 ‘Just Die Chat’: https://x.com/koltregaskes/status/1856754648146653428 Google CEO tweet: https://x.com/sundarpichai/status/1857114106928718329 Sutskever Quote: https://www.reuters.com/technology/artificial-intelligence/openai-rivals-seek-new-path-smarter-ai-current-methods-hit-limitations-2024-11-11/ Another OpenAI Staffer Leaves: https://x.com/RichardMCNgo/status/1856843040427839804 Bloomberg Report: https://www.bloomberg.com/news/articles/2024-11-13/openai-google-and-anthropic-are-struggling-to-build-more-advanced-ai?s=09 Noam Brown on what OpenAI Researchers Believe: https://x.com/polynoamial/status/1855037689533178289 Clive Chan: https://x.com/itsclivetime/status/1855704120495329667 Chollet Responds to Altman: https://x.com/fchollet/status/1857060079586975852 https://x.com/sama/status/1856940152460869718 Altman Emails: https://x.com/TechEmails/status/1857285960997712356 Change of Heart: https://sd11.senate.ca.gov/news/senator-wiener-responds-openai-opposition-sb-1047 Amodei on ‘Empirical Regularities’: https://lexfridman.com/dario-amodei-transcript/ Verge Report: https://www.theverge.com/2024/10/25/24279600/google-next-gemini-ai-model-openai-december OpenAI Agents in January: https://www.bloomberg.com/news/articles/2024-11-13/openai-nears-launch-of-ai-agents-to-automate-tasks-for-users?srnd=phx-ai | |||
05 Dec 2024 | AI Breaks Its Silence: OpenAI’s ‘Next 12 Days’, Genie 2, and a Word of Caution | 00:15:29 | |
Calmest before the storm? Whatever analogy you want to use, things had gotten quiet toward the end of 2024. But then tonight we got Genie 2, and a series of scheduled announcements from OpenAI. Sora is soon here, and o1, but I dive deeper into what it all means and whether reliability is on a path to being solved, ft: two recent papers. Assembly AI Speech to Text: https://www.assemblyai.com/?utm_source=youtube&utm_medium=influencer&utm_campaign=ai_explained Plus Kling Motion Brush, Simple Bench QwQ update and much more.
Jim Cramer: https://x.com/jimcramer/status/1864068878692675625 Give Us Full o1: https://x.com/tszzl/status/1863882905422106851 Verge Scoop: https://x.com/tomwarren/status/1864326361415925861 O1 Learning to Reason Benchmarks: https://openai.com/index/learning-to-reason-with-llms/ SIMA AI: https://arxiv.org/pdf/2404.10179 Genie Paper: https://arxiv.org/pdf/2402.15391 My Video on Genie: https://www.youtube.com/watch?v=gGKsfXkSXv8 Oasis Minecraft: https://x.com/risphereeditor/status/1852619965511204974 LLMs Procedural Knowledge Paper: https://arxiv.org/pdf/2411.12580 Bag of Heuristics Paper: https://arxiv.org/pdf/2410.21272 Jensen Huang Hallucinations: https://www.tomshardware.com/tech-industry/artificial-intelligence/jensen-says-we-are-several-years-away-from-solving-the-ai-hallucination-problem-in-the-meantime-we-have-to-keep-increasing-our-computation DeepSeek Interview: https://www.chinatalk.media/p/deepseek-ceo-interview-with-chinas Kling Motion Brush: https://klingai.com/image-to-video Tim Rocktaschel Book: https://geni.us/ArtificialIntelligence 00:43 - OpenAI 12 Days, Sora Turbo, o1 03:06 - Genie 2 08:26 - Jensen Huang and Altman Hallucination Predictions 09:45 - Bag of Heuristics Paper 11:40 - Procedural Knowledge Paper 13:45 - SimpleBench QwQ and Chinese Models 14:42 - Kling Motion Brush | |||
05 Dec 2024 | o1 Pro Mode – Full Analysis (plus o1 paper highlights) | 00:16:43 | |
Oh boy. o1 pro mode out on the same night as o1 full. I read the 49 page paper, ran my own tests, spent my fuel allowance on Pro Mode and will give you all the highlights. Suffice to say the story is not as simple as it first appears. Weights and Biases’ Weave: wandb.me/ai_explained Plus, GPT-4.5? MLE Bench, Simple Update, Image Analysis and much more
o1 System Card: https://cdn.openai.com/o1-system-card-20241205.pdf Apollo Research: https://www.apolloresearch.ai/research/scheming-reasoning-evaluations Altman Tweet: https://x.com/AnonCEOMakeItAi/status/1864763052622504344 ChatGPT Pro: https://openai.com/index/introducing-chatgpt-pro/ Tibor Blaho: https://x.com/btibor91/status/1864709670470066605 Simple-bench.com
00:00 - Introduction 00:27 - ChatGPT Pro is $200 01:25 - OpenAI Benchmarks 03:20 - o1 System Card, o1 and o1 Pro Mode vs o1-preview 06:18 - Simple Bench surprising results on sample 08:31 - Weight & Biases 09:05 - Image Analysis Compared 12:51 - More Benchmarks and Safety | |||
10 Dec 2024 | Sora is Out, But is it a Distraction? | 00:15:34 | |
After a 10 month wait, OpenAI have released Sora to paying users. With just a prompt it can generate videos of up to 20 seconds in lower resolutions, and 10 seconds at 1080p if you can fork out $200/month. I’ve tested it and read the system card. The user interface is quite beautiful, even if the videos themselves operate under entirely new rules of physics. But I can’t help wondering if OpenAI want us to focus on releases like this, rather than some quietly broken promises. 80,000 hours Website, Podcast + Channel: https://open.spotify.com/show/2WzJwXWBDnn4iZ7odKwDib https://www.youtube.com/@eightythousandhours/videos https://openai.com/sora/ Sora Countries: https://help.openai.com/en/articles/10250692-sora-supported-countries Sora Credits: https://help.openai.com/en/articles/10245774-sora-billing-credits-faq https://runwayml.com/ and https://pika.art/home DeepMind Veo: https://deepmind.google/technologies/veo/ Sam Altman Ads as Last Resort: https://www.windowscentral.com/software-apps/openai-could-chase-intrusive-ads-as-last-resort But OpenAI Considering Ads: https://www.inc.com/ben-sherry/is-openai-getting-into-the-advertising-business-the-company-is-sending-mixed-messages/91033533 OpenAI Backtracks on Microsoft AGI Clause: https://www.ft.com/content/2c14b89c-f363-4c2a-9dfc-13023b6bce65 As Microsoft Boast of Labor Savings: https://www.theinformation.com/articles/microsofts-new-sales-pitch-for-ai-spend-less-money-on-humans?rc=sy0ihq OpenAI Military Pivot: https://www.technologyreview.com/2024/12/04/1107897/openais-new-defense-contract-completes-its-military-pivot/ Employees Have Doubts: https://www.washingtonpost.com/technology/2024/12/06/openai-anduril-employee-military-ai/?nid=top_pb_signin&arcId=KZIV7PLRHBCVNPAIAAAVUNRHIM&account_location=ONSITE_HEADER_ARTICLE | |||
12 Dec 2024 | Never Browse Alone? - Gemini 2 Live and ChatGPT Vision | 00:13:40 | |
The ‘Gemini 2 Era’ begins … with screen-sharing? But really, it’s a great free tool, for curiosity satisfying rather than bleeding-edge intelligence. I give you the benchmarks, the highlights and of course, the latest from OpenAI Advanced Voice Mode with Vision. Plus Deep Research in Gemini Advanced, Simple Bench updates, Santa and what might be for some of you Google’s deflating admission. 00:00 - Introduction 00:38 - Live Interaction 03:43 - Gemini 2.0 Flash Benchmarks 05:10 - Audio and Image Output 06:38 - Project Mariner (+ WebVoyager Bench) 08:49 - But Progress Slowing Down? 10:43 - OpenAI Announcements + Games https://aistudio.google.com/live Gemini 2.0 Flash Benchmarks: https://deepmind.google/technologies/gemini/ Project mariner: https://deepmind.google/technologies/project-mariner/ WebVoyager: https://x.com/laurentsifre/status/1858918588683296875/photo/1 Gemini Game play: https://www.youtube.com/watch?v=IKuGNHJBGsc Advanced Voice Mode OpenAI: https://www.youtube.com/watch?v=NIQDnWlwYyQ Claude Computer Use: https://docs.anthropic.com/en/docs/build-with-claude/computer-use Oriol Vinyals Interview: https://www.youtube.com/watch?v=78mEYaztGaw&t=687s | |||
21 Dec 2024 | o3 - wow | 00:22:20 | |
o3 isn’t one of the biggest developments in AI for 2+ years because it beats a particular benchmark. It is so because it demonstrates a reusable technique through which almost any benchmark could fall, and at short notice. I’ll cover all the highlights, benchmarks broken, and what comes next. Plus, the costs OpenAI didn’t want us to know, Genesis, ARC-AGI 2, Gemini-Thinking, and much more. FrontierMath: https://epoch.ai/frontiermath https://arxiv.org/pdf/2411.04872 Chollet Statement:https://arcprize.org/blog/oai-o3-pub-breakthrough MLC Paper: AlphaCode 2: https://storage.googleapis.com/deepmind-media/AlphaCode2/AlphaCode2_Tech_Report.pdf Human Performance on ARC-AGI: https://arxiv.org/pdf/2409.01374v1 Wei Tweet ‘3 months’:https://x.com/_jasonwei/status/1870184982007644614 Deliberative Alignment Paper: https://openai.com/index/deliberative-alignment/ Brown Safety Tweet: https://x.com/polynoamial/status/1870196476908834893 Swe-Bench Verified: https://openai.com/index/introducing-swe-bench-verified/ Amodei Prediction: https://x.com/OfirPress/status/1858567863788769518 David Dohan: 16 hours https://x.com/dmdohan/status/1870171404093796638 OpenAI Personal Writing: https://openai.com/index/learning-to-reason-with-llms/ John Hallman Tweet: https://x.com/johnohallman/status/1870233375681945725 00:00 - Introduction 01:19 - What is o3? 03:18 - FrontierMath 05:15 - o4, o5 06:03 - GPQA 06:24 - Coding, Codeforces + SWE-verified, AlphaCode 2 08:13 - 1st Caveat 09:03 - Compositionality? 10:16 - SimpleBench? 13:11 - ARC-AGI, Chollet | |||
08 Jan 2025 | OpenAI Backtracks on Superintelligence + Altman Brings His Timeline Forward | 00:23:41 | |
Sam Altman unexpectedly brings his timelines to AGI forward, while OpenAI backtrack on superintelligence. None of these changes were heralded, but they are significant. Plus the new year brings new assessments of the true capability of models to automate 'large swathes of the economy'. I'll give my prediction on that front for 2025, announce a new Simple Bench competition, and showcase Kling 1.6 vs Veo 2 vs Sora, and much more. (Colab): https://colab.research.google.com/drive/1AVijcPnEkl8Gy_754XbRdG5m7Q5-9slg?usp=sharing TheAgentCompany Paper: https://arxiv.org/pdf/2412.14161v1 Sam Altman Major Interview: https://www.bloomberg.com/features/2025-sam-altman-interview/?srnd=phx-ai OpenAI Agent Coming Jan 2025: https://www.theinformation.com/articles/why-openai-is-taking-so-long-to-launch-agents?rc=sy0ihq Altman Singularity: https://x.com/sama/status/1875603249472139576 Altman Original Timeline: https://www.youtube.com/watch?v=7dCPytNTnjk&t=621s https://www.ft.com/content/34a7a082-e685-4e02-bca7-61ff89d99ed2 OpenAI Original Emails: https://www.lesswrong.com/posts/5jjk4CDnj9tA7ugxr/openai-email-archives-from-musk-v-altman-and-openai-blog DeepMind Sky News 2014 Article: https://news.sky.com/story/google-buys-uk-intelligence-firm-deepmind-10419783 Altman Blog Reflections: https://blog.samaltman.com/reflections OpenAI Changes Who Gets AGI: https://openai.com/index/why-our-structure-must-evolve-to-advance-our-mission/?s=09 OpenAI 5 Levels: https://www.bloomberg.com/news/articles/2024-07-11/openai-sets-levels-to-track-progress-toward-superintelligent-ai Altman 2015: https://blog.samaltman.com/machine-intelligence-part-1 OpenAI React to Anthropic: https://www.theinformation.com/articles/how-anthropic-got-inside-openais-head?rc=sy0ihq Microsoft $100B Definition: https://www.theinformation.com/articles/microsoft-and-openai-wrangle-over-terms-of-their-blockbuster-partnership?rc=sy0ihq GPQA Progress: https://epoch.ai/data/ai-benchmarking-dashboard Task Length Crucial for 
ARC-AGI: https://anokas.substack.com/p/llms-struggle-with-perception-not-reasoning-arcagi RL Environment Tweet: https://x.com/vedantmisra/status/1876327518157807990 Jason Wei Talk: https://www.youtube.com/watch?v=yhpjpNXJDco Miles Brunda | |||
20 Jan 2025 | Altman Expects a ‘Fast Take-off’, ‘Super-Agent’ Debuting Soon and DeepSeek R1 Out | 00:13:11 | |
OpenAI looks set to debut their Operator system, and some leaks are out. At the same time Deepseek R1 releases some numbers, and Sam Altman says he might have been wrong before, and now anticipates a 'fast take-off'. Plus two papers to give you an idea of what a super-agent might be decent at doing, some more exclusive article analysis and much more. Who said anything else is happening today... | |||
24 Jan 2025 | Nothing Much Happens in AI, Then Everything Does All At Once | 00:23:09 | |
When it rains, it pours. OpenAI Operator tested and reviewed, with full paper analysis. Perplexity Assistant is useful. Then Stargate, is it all smoke and mirrors? Strong rumours of an o3+ model from Anthropic. Then a full breakdown of Deepseek R1, and what its training method says about the state of AI. It’s not open source BTW. Plus Humanity’s Last Exam, and Hassabis Accelerates his AGI timeline. 00:00 - Introduction 00:54 - OpenAI Operator 04:53 - Perplexity Assistant 05:15 - StarGate 07:51 - Better than o3? 08:25 - DeepSeek R1 Analysis 12:12 - Training Secrets 15:19 - No More Process Rewarding ? 19:01 - Hassabis Timeline Accelerates 21:22 - Humanity’s Last Exam https://app.grayswan.ai/arena/chat/harmful-ai-assistant https://openai.com/index/computer-using-agent/ System Prompt: https://github.com/wunderwuzzi23/scratch/blob/master/system_prompts/operator_system_prompt-2025-01-23.txt OpenAI Operator: https://operator.chatgpt.com/ System Card: https://cdn.openai.com/operator_system_card.pdf There is No Plan: https://x.com/jeffclune/status/1882120726339318007 Perplexity Assistant: https://x.com/perplexity_ai/status/1882466239123255686 Stargate: https://openai.com/index/announcing-the-stargate-project/ Labour goes to 0: https://moores.samaltman.com/ Larry Ellison AI Surveillance: https://x.com/TheChiefNerd/status/1882042989184430332 Amodei 1984: https://www.bloomberg.com/news/articles/2025-01-22/anthropic-ceo-says-openai-s-stargate-venture-seems-chaotic Microsoft Hesitate: https://www.theinformation.com/articles/why-sam-altman-joined-forces-with-larry-ellison-and-took-a-step-back-from-microsoft?rc=sy0ihq Dylan Patel o3+ for Anthropic: https://www.youtube.com/watch?v=7EH0VjM3dTk Deepseek R1: https://arxiv.org/pdf/2501.12948 https://arxiv.org/pdf/2412.19437 Diagram: https://pbs.twimg.com/media/GhyQsM6WQAE7W52?format=jpg&name=large https://simple-bench.com/ Process: https://x.com/sama/status/1664018190840614912 https://x.com/karpathy/status/1835561952258723930 
https://openai.com/index/trading-inference-time-compute-for-adversarial-robustness/?s=09 Demis Interview: https://www.youtube.com/watch?v=yr0GiSgUvPU Humanity’s Last Exam: https://x.com/DanHendrycks/status/1882481730671857815 https://www.nytimes.com/2025/01/23/technology/ai-test-humanitys-last-exam.html?s=09 | |||
31 Jan 2025 | o3-mini and the “AI War” | 00:15:21 | |
o3-mini is here, and yes, I’ve read the paper in full - 2 hours after release, and even the post-launch Reddit AMA. Some epic details like a FrontierMath score that made me double-take, a likely new Cursor favorite, bio risk expertise and a cost-comparison with Deepseek R1. But does it perform on basic reasoning - let’s find out. Plus, arguably the bigger story - the increasingly frenetic rhetoric coming out of the West - and Dario Amodei and Alexandr Wang (CEOs of Anthropic and Scale AI respectively) in particular. The last thing we need is an “AI War”. (Colab): https://colab.research.google.com/drive/1AVijcPnEkl8Gy_754XbRdG5m7Q5-9slg?usp=sharing Chapters: 00:00 - Introduction 00:45 - o3 mini 05:11 - First impressions vs Deepseek R1 07:21 - 10x Scale, o3-mini System Card, Amodei Essay, bitcoin wallets… 12:40 - Simple Competition Finale 13:03 - Clips and Final Thoughts on the “AI War” O3-mini: https://openai.com/index/openai-o3-mini/ Paper: https://cdn.openai.com/o3-mini-system-card.pdf Amodei Essay: https://darioamodei.com/on-deepseek-and-export-controls?s=09 FrontierMath wild stat:https://arxiv.org/pdf/2411.04872 Sam Altman Channels Napoleon: https://x.com/sama/status/1883185690508488934 Altman ‘pulls up releases’: https://x.com/sama/status/1884066337103962416 “AI War” by Wang: https://scale.com/blog/win-the-ai-war Anthropic Original Views on Capabilities: https://www.anthropic.com/news/core-views-on-ai-safety AI Insider Cost Comparison:https://x.com/arankomatsuzaki/status/1884676245922934788 Deepseek R1 Paper: https://arxiv.org/pdf/2501.12948 R1, o3-mini Price Comparison: https://techcrunch.com/2025/01/31/openai-launches-o3-mini-its-latest-reasoning-model/ Semianalysis on $1,3M deepseek salaries, and them falling behind as ‘the time gap to match US capabilities increases’: https://semianalysis.com/2025/01/31/deepseek-debates/ OpenAI Valuation: 
https://www.bloomberg.com/news/articles/2025-01-30/openai-in-talks-to-raise-funding-at-340-billion-value-wsj-says?srnd=phx-ai Wang Clip: https://x.com/tsarnick/status/1867700453494206883 Amodei Clip: https://x.com/ai_ctrl/status/1884951111771001188 https://simple-bench.com/ | |||
03 Feb 2025 | Deep Research by OpenAI - The Ups and Downs vs DeepSeek R1 Search + Gemini Deep Research | 00:18:32 | |
12 hours ago Deep Research was unveiled, and I’ve tested it thoroughly, including vs Deepseek R1 with search, Gemini Deep Research and even R1 in Perplexity. It’s a notable step forward, with one big caveat. I’ll go through all the benchmark figures, my initial impression of the o3 model within, and much more. https://www.youtube.com/watch?v=YkCDVn3_wiw GAIA Bench: https://openreview.net/forum?id=fibxvahvs3 https://openreview.net/pdf?id=fibxvahvs3 CodeELO:https://arxiv.org/pdf/2501.01257 CamelCamel:https://uk.camelcamelcamel.com/ Deepseek R1 with search: https://chat.deepseek.com/ https://arxiv.org/pdf/2501.12948 HaluBench: https://arxiv.org/pdf/2407.08488 Chapters: 00:00 - Introduction 01:06 - Powered by o3, Humanity’s Last Exam, GAIA 03:55 - Simple Tests 06:00 - Good News vs Deepseek R1 and Gemini Deep Research 09:32 - Bad News on Hallucinations 14:14 - What Can’t it Browse? 14:42 - For Shopping? 16:40 - Final thoughts | |||
11 Feb 2025 | AGI: (gets close), Humans: ‘Who Gets the Money?’ | 00:22:17 | |
A 'frontier reasoning model' from just 1000 examples (s1). A $100B Musk bid for power. Gemini 2, Rand and a warning from Amodei. Here are 7-8 developments you may have missed but which I would argue help us understand how the next few years will play out. From labour vs capital to automating rival companies and countries, and from non-profit shenanigans to new mini-docs, there was just too much for me not to make a vid. | |||
25 Feb 2025 | Claude 3.7 is More Significant than its Name Implies (ft DeepSeek R2 + GPT 4.5 coming soon) | 00:27:39 | |
Claude 3.7 is here, hot on the heels of Grok 3 and a host of other developments, but how good is it really? And what does it say about the next few months in AI? I’ve read the papers, played with the model for hours, and benched it on Simple. Things aren’t slowing down. Plus the latest in humanoid robots, led by Helix and freaked out by Protoclone. And reports of GPT 4.5 and DeepSeek R2. GraySwan Competition! https://app.grayswan.ai/arena/challenge/agent-red-teaming https://x.com/GraySwanAI/status/1894084923260043282 Chapters: 00:00 - Introduction 01:25 - Claude 3.7 New Stats/Demos 05:22 - 128k Output 06:13 - Pokemon 06:58 - Just a tool? 09:54 - DeepSeek R2 10:20 - Claude 3.7 System Card/Paper Highlights 17:18 - Simple Record Score/Competition 20:37 - Grok 3 + Redteaming prizes 22:26 - Google Co-scientist 24:02 - Humanoid Robot Developments 3.7 Release Notes: https://www.anthropic.com/news/claude-3-7-sonnet vs o3 and Grok 3: https://x.com/12exyz/status/1891723056931827959 Extended Thinking: https://www.anthropic.com/research/visible-extended-thinking?s=09 System Prompt: https://docs.anthropic.com/en/release-notes/system-prompts#feb-24th-2025 System Card: https://assets.anthropic.com/m/785e231869ea8b3b/original/claude-3-7-sonnet-system-card.pdf Unfaithful CoT: https://arxiv.org/pdf/2305.04388 Original Constitution: https://www.anthropic.com/news/claudes-constitution Responsible Scaling Policy: https://assets.anthropic.com/m/24a47b00f10301cd/original/Anthropic-Responsible-Scaling-Policy-2024-10-15.pdf Amodei and Hassabis:https://www.youtube.com/watch?v=4poqjZlM8Lo https://simple-bench.com/ 400 Weekly Users: https://x.com/bradlightcap/status/1892579908179882057 Grok 3 Jailbroken: https://x.com/LinusEkenstam/status/1893832876581380280 Google Co-Scientist: https://research.google/blog/accelerating-scientific-breakthroughs-with-an-ai-co-scientist/ But Hassabis Says Years Away: https://www.youtube.com/watch?v=yr0GiSgUvPU&t=156s DeepSeek R2 Reuters: 
https://www.reuters.com/technology/artificial-intelligence/deepseek-rushes-launch-new-ai-model-china-goes-all-2025-02-25/ Protoclone: https://www.reddit.com/r/interestingasfuck/comments/1it9rpp/protoclone_the_worlds_first_bipedal/ Helix: https://www.figure.ai/news/helix TechTrance: https://www.youtube.com/@TheTechTrance/videos | |||
28 Feb 2025 | GPT 4.5 - not so much wow | 00:25:05 | |
GPT 4.5 is here, and do you remember when AI lab CEOs like Sam Altman and Dario Amodei were betting everything on scaling up base models like this one? Well let’s find out what would have happened if the future of AI rested on models like GPT 4.5. You’ll see all the benchmarks, highlights of the paper, emotional intelligence and humor tests, Simple Bench results (reddit was an unreliable source), and why it’s not all bad news for OpenAI. | |||
13 Mar 2025 | Manus AI - The Calm Before the Hypestorm … (vs Deep Research + Grok 3) | 00:12:58 | |
Is Manus AI the memecoin of the AI world, or legit? I’ll compare it to OpenAI’s Deep Research, Operator, Grok 3 DeepSearch and more to find out. I’ll also let you in on some of the secrets of what makes a good hype campaign, the estimated costs of Manus AI, and where it is strong. Other news (yes, Gemini image editing and research hacking, I mean you), will have to wait for a few more hours, as millions enquire about Manus AI. | |||
25 Mar 2025 | Did AI Just Get Commoditized? Gemini 2.5, New DeepSeek V3, & Microsoft vs OpenAI | 00:13:47 | |
Gemini 2.5 is out, on the same day as the new DeepSeek V3 (which should power Deepseek R2). Do both models prove AI is being commoditized? Let’s find out, on this blockbuster day of AI releases. Plus exclusives from the Information, Simple indications, Vista Bench, LM Arena and more… | |||
28 Mar 2025 | Gemini 2.5 Pro - It’s a Smart Chatbot … (New Simple High Score) | 00:21:21 | |
Gemini gets a new record on Simple Bench, and several other benchmarks. I’ll go deep to explore its nuances, including how it deceptively reverse engineers answers, does better on certain coding benchmarks than others, may have a universal ‘conceptual language’ … | |||
07 Apr 2025 | AI CEO: ‘Stock Crash Could Stop AI Progress’, Llama 4 Anti-climax +‘Superintelligence in 2027’... | 00:23:51 | |
The latest on Llama 4, and whether it signals a slowdown in AI, or solid progress. Plus, a deep dive on that viral prediction of superintelligence by 2027, and Amodei’s cautionary words on what could stop AI progress in its tracks. o3 news, and more, as well. | |||
16 Apr 2025 | ‘Speaking Dolphin’ to AI Data Dominance, 4.1 + Kling 2: 7 Developments Critically Analysed | 00:20:09 | |
This pod won’t just be about the release of GPT 4.1 in the last 48 hours, o3 build-up, Kling 2.0, a sneak peek at the next OpenAI model, or even the new Dolphin language tool. It will be about 7 such stories that contextualise where we are in AI and what is happening. Chapters: 00:00 - Introduction 00:30 - Kling 2.0 01:35 - GPT 4.1 05:25 - o3 Build-up 07:37 - ‘Product Company’ 09:31 - Safe Superintelligence 10:54 - DolphinGemma 13:16 - Data Dominance? Kling 2.0: https://app.klingai.com/global/release-notes Dolphin Gemma: https://blog.google/technology/ai/dolphingemma/?s=09 https://openai.com/index/gpt-4-1/ OpenAI o3 Build-up The Information: https://www.theinformation.com/articles/openais-latest-breakthrough-ai-comes-new-ideas?rc=sy0ihq Physical reasoning: https://x.com/a_karvonen/status/1911839968990814503 Fiction Live.bench: https://x.com/ficlive/status/1911853409847906626 Altman Ted: https://www.youtube.com/watch?v=5MWT_doo68k https://simple-bench.com/try-yourself https://aider.chat/docs/leaderboards/ 4.5: https://www.youtube.com/watch?v=6nJZopACRuQ Geospatial reasoning: https://research.google/blog/geospatial-reasoning-unlocking-insights-with-generative-ai-and-multiple-foundation-models/ Pioneers: https://x.com/OpenAIDevs/status/1910017976256119151 Evals: https://www.youtube.com/watch?v=scsW6_2SPC4 Anthropic Updates: https://www.bloomberg.com/news/articles/2025-04-15/anthropic-is-readying-a-voice-assistant-feature-to-rival-openai?srnd=phx-ai https://x.com/sethsaler/status/1912188383457059301 https://ai.meta.com/blog/llama-4-multimodal-intelligence/ https://deepmind.google/technologies/gemini/pro/ https://research.google/blog/accelerating-scientific-breakthroughs-with-an-ai-co-scientist/ https://blog.google/products/google-cloud/ironwood-tpu-age-of-inference/ OpenAI Documentary: https://www.patreon.com/posts/one-machine-to-121940490 | |||
16 Apr 2025 | o3 and o4-mini - they’re great, but easy to over-hype | 00:14:24 | |
Critical analysis of the two most powerful new models behind ChatGPT, o3 and o4-mini. Not just the system cards, benchmarks, and my own tests, but some you may not have seen before. Yes, they can whip up amazing front-end in a few seconds, but you always have to ask what is in their data. Either way, they prove the gains from RL are just beginning… |