Various authors - AI

My M5 Max, Gemma 4, MLX LOCAL Stack. (This KILLS MODEL PROVIDERS)

39:03 min youtube 2026 Week 17 🇪🇸 ES

TL;DR

Local Viability: Apple Silicon hardware, especially the M5 Max, shows ample capacity to run advanced AI models (Gemma 4 and Qwen 3.5) locally, removing the dependence on costly cloud APIs.
Critical Optimization: The MLX format is key to performance on Apple Silicon; MLX variants decode up to roughly twice as fast as their GGUF counterparts. The 15% to 50% average gain reported in the video refers to the M5 Max over the M4.
Main Limitation: Despite the hardware's power, context size (16K-32K tokens and beyond) is identified as the most severe bottleneck, hurting both speed and accuracy.


Summary

YouTube: https://www.youtube.com/watch?v=00Y-p62sk0s | Duration: 39 min

◆ M5 Max MacBook Pro Unboxing and Goals

The analysis opens by comparing the M5 and M4 MacBook Pro, with the goal of evaluating how well each runs AI models locally. The aim is to show that private, fast, and cheap AI can run directly on the device, avoiding dependence on external APIs such as Claude or OpenAI.

Apple's MLX framework, designed specifically for Apple silicon, is examined and contrasted with standard formats such as GGUF. The key models under evaluation are Google's Gemma 4 and the Qwen 3.5 series, with an eye toward a future of locally accessible AI.

▶ Gemma and Qwen Cold Prompt (Initial Load)

This chapter focuses on the "cold start" test, which measures how long the models take to load into memory. Qwen 3.5 and several Gemma 4 variants (standard and MLX) were evaluated.

During the initial load, the M4 finished warming up the Qwen 3.5 model before the M5 did. Both devices nonetheless loaded their models successfully so the comparative benchmarks could begin.

★ Gemma and Qwen Warm Prompt (Detailed Performance)

The Gemma and Qwen models were analyzed in detail on M5 and M4 hardware. The metrics compared are prefill speed, decode speed (tokens per second), and total or "wall" time, which is treated as the most relevant indicator.

Initial results indicate that the M5 is generally faster at both prefill and decode. For a more accurate evaluation, a live benchmarking tool was used with five prompts of increasing complexity. Besides total time, peak RAM usage must be monitored, because a model that does not fit in memory cannot run locally.
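How the three metrics relate to raw timings can be sketched in Python. This is not the presenter's benchmarking tool: the `RunTiming` fields and the sample numbers are hypothetical, and the sketch ignores the other "hidden costs" the video says also count toward wall time.

```python
from dataclasses import dataclass


@dataclass
class RunTiming:
    """Raw measurements for one prompt run (all field names are hypothetical)."""
    prompt_tokens: int      # tokens in the incoming prompt
    output_tokens: int      # tokens generated by the model
    load_seconds: float     # cold-start cost of loading weights (0.0 when warm)
    prefill_seconds: float  # time spent processing the prompt
    decode_seconds: float   # time spent generating output tokens


def prefill_tps(run: RunTiming) -> float:
    """Prompt tokens processed per second."""
    return run.prompt_tokens / run.prefill_seconds


def decode_tps(run: RunTiming) -> float:
    """Output tokens generated per second (the usual 'tokens per second')."""
    return run.output_tokens / run.decode_seconds


def wall_seconds(run: RunTiming) -> float:
    """End-to-end wait: load + prefill + decode (other overheads omitted)."""
    return run.load_seconds + run.prefill_seconds + run.decode_seconds


# Illustrative warm run; the raw values are made up for the example.
run = RunTiming(prompt_tokens=1100, output_tokens=590,
                load_seconds=0.0, prefill_seconds=2.0, decode_seconds=5.0)
print(prefill_tps(run), decode_tps(run), wall_seconds(run))
```

A model with a high decode rate can still have a long wall time when the prompt is large, because prefill time grows with prompt length.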

► Benchmark 1 - M5 Destroys M4 (MLX vs GGUF)

This benchmark compared Qwen and Gemma in GGUF and MLX formats. The MLX variants proved dramatically faster, reaching much higher decode speeds.

The M5 Max chip consistently outperforms the M4, with an average improvement of 15% to 50%. A crucial finding is that on Apple silicon, choosing MLX models yields up to nearly double the performance of their GGUF counterparts.
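Both figures are simple ratios, as the following Python sketch shows. The function names are hypothetical; the 60 and 118 tokens-per-second values are the Qwen 3.5 GGUF and MLX decode speeds quoted for the M5 in the transcript, while the wall times are made up for illustration.

```python
def throughput_ratio(new_tps: float, baseline_tps: float) -> float:
    """How many times faster the new decode rate is than the baseline."""
    return new_tps / baseline_tps


def speedup_percent(baseline_seconds: float, new_seconds: float) -> float:
    """Percentage speed gain implied by a shorter wall time."""
    return (baseline_seconds / new_seconds - 1.0) * 100.0


# Decode speeds quoted in the video for Qwen 3.5 on the M5: GGUF vs MLX.
print(round(throughput_ratio(118, 60), 2))

# Hypothetical wall times (seconds) for the same prompt on M4 and M5.
print(speedup_percent(baseline_seconds=12.0, new_seconds=8.0))
```

Note that "50% faster" here means 1.5 times the throughput, not half the wall time.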

⚠️ Critical Performance Alert (Local Model Bottleneck)

The Graph Walks benchmark confirmed the M5 Max's speed advantage, especially in the prefill phase. However, the real bottleneck for local models turned out to be not just the hardware but context size and latency.

As context grows to 16K or 32K tokens, total response time becomes unacceptably long.
Model accuracy drops sharply at very large contexts, with the models failing to find the correct answers. This severely limits the practical capabilities of local AI.
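The kind of task and scoring involved can be sketched as follows. This is only an illustration in Python: the exact prompt format and scoring rules of the Graph Walks benchmark are not given in the source, so the directed edge list, the "nodes at exact depth" question, and the set-based F1 below are assumptions.

```python
from collections import deque


def bfs_nodes_at_depth(edges, start, depth):
    """Return the set of nodes whose shortest distance from `start` is exactly `depth`."""
    adjacency = {}
    for src, dst in edges:  # edges are treated as directed (assumption)
        adjacency.setdefault(src, []).append(dst)
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == depth:
            continue  # no need to expand beyond the target depth
        for neighbor in adjacency.get(node, []):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return {node for node, d in distance.items() if d == depth}


def f1_score(predicted, expected):
    """Set-based F1 between the model's answer and the expected node set."""
    predicted, expected = set(predicted), set(expected)
    if not predicted and not expected:
        return 1.0
    true_positives = len(predicted & expected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
expected = bfs_nodes_at_depth(edges, "a", 1)
print(sorted(expected), f1_score({"b"}, expected))
```

The larger the edge list in the prompt, the more nodes the model has to track, which is why both latency and accuracy degrade as context grows.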

► Benchmark 3 - Pi Coding Agent and Viability

Local models were evaluated inside the Pi coding agent. The M5 significantly outperformed the M4 in total execution speed while keeping a high correctness rate, even on complex tasks such as generating complete packages.

The conclusion is that local models are viable for microtasks and engineering work without relying on external APIs. However, performance degrades as the context window or token usage grows.

✅ Key Recommendations

Use MLX variants on Mac to get the most out of the hardware.
Make full use of the M5 Max hardware for any local AI task.

★ Local Benchmark, MLX, Gemma, and the Future

The benchmark codebase is available to the community. Future plans include testing models with more parameters, building specialized variants of the Pi agent, and extending the tests to multimodal inputs (images and audio).

Although still maturing, local models offer a large advantage in cost and time savings. The presenter predicts that by the end of the year it will be possible to run advanced models on the level of Sonnet or Opus 4.0 on personal devices, given current hardware improvements. Users gain privacy and full control over performance.

◆ Finding the alpha

The central thesis is not about a specific financial asset but about a fundamental rotation of value in AI infrastructure. The real economic shift is the migration of spending from operating expense (OpEx) tied to expensive, centralized cloud APIs toward optimized local solutions (CapEx/hardware). This redefines the marginal cost of inference and directly threatens the premium-subscription business model of the large model providers.

Regime Catalyst: The superior performance of native, optimized implementations (MLX) over generic formats (GGUF) shows that silicon-specific algorithmic efficiency, not just raw power, is the new competitive differentiator.
Immediate Limitation: The current bottleneck is not the chip's compute power but practical, scalable context handling (16K-32K tokens). This sets the near-term horizon within which local models remain limited on complex tasks.
Implicit Capital Rotation: Value is moving from the large cloud infrastructure platforms toward edge computing optimization and specialized frameworks (such as MLX) that democratize access to advanced AI without relying on third parties.
Reentry/Scalability Condition: Local models will fully replace cloud services only once they can handle massive contexts with acceptable latency, enabling complex tasks such as advanced coding agents at industrial scale.

The twist: The presenter is pointing out that competitive advantage in AI no longer lies solely in having the largest model but in running it more efficiently and privately. This implies a disintermediation of the AI ecosystem, in which companies can bring advanced capabilities in-house without paying an external provider's recurring costs, reshaping the operating cost structure of any AI-based business.

► Chapter summary

M5 Max MacBook Pro Unboxing (0:00)

The video compares the M5 and M4 MacBook Pro to evaluate local AI models. The main goal is to show that private, fast, and cheap models can run directly on the device, avoiding dependence on cloud APIs such as Claude or OpenAI. It covers the benefits of Apple's MLX framework, specialized for Apple silicon, over formats such as GGUF, and compares advanced models such as Google's Gemma 4 and the Qwen 3.5 series. The aim is a future in which state-of-the-art AI performance is available locally, with privacy and efficiency.

Gemma and Qwen Cold Prompt (2:22)

The chapter compares local models on the M5 and M4 devices. It starts with a "cold start" test that measures how long the models take to load into memory. The models evaluated are Qwen 3.5 and several Gemma 4 variants, both standard and MLX. The first run always takes considerable time because of this initial load. The M4 finished warming up the Qwen 3.5 model before the M5. Both devices then completed loading their models so the comparative benchmarks could begin.

Gemma and Qwen Warm Prompt (3:17)

The video presents a detailed analysis of the Gemma and Qwen models on M5 and M4 hardware. Key metrics are compared: prefill speed, decode speed (tokens per second), and total or "wall" time, which is the most important indicator because it includes every cost of the run. Initial results show the M5 generally ahead in prefill and decode speed for all models tested. For a more accurate evaluation, a live benchmarking tool is used with five prompts of increasing complexity. Besides total time, peak RAM usage is monitored, which is vital for determining whether a model can run locally at all.
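A rough first check of whether a model can fit is to estimate the memory its weights need from parameter count and bits per weight. The Python sketch below is a lower bound only: it ignores the KV cache, activations, and quantization metadata, and the function names and the 8 GB headroom are hypothetical. The 35-billion-parameter, 4-bit, and 128 GB figures come from the transcript.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: parameters x bits per weight / 8 bits per byte."""
    return params_billion * bits_per_weight / 8


def fits_in_ram(required_gb: float, total_ram_gb: float, headroom_gb: float = 8.0) -> bool:
    """True if the requirement fits while leaving headroom for the OS and other apps."""
    return required_gb + headroom_gb <= total_ram_gb


# A 35B-parameter model at 4 bits per weight on a 128 GB machine.
qwen_gb = weights_gb(35, 4)
print(qwen_gb, fits_in_ram(qwen_gb, 128))
```

Measured peak usage during a benchmark will be higher than this estimate, which is why the video tracks peak RAM rather than relying on model size alone.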

Benchmark 1 - M5 destroys M4 (4:57)

The chapter benchmarks the Qwen and Gemma models on the M4 and M5 Max devices in GGUF and MLX formats. The MLX variants are dramatically faster than GGUF, reaching much higher decode speeds. The M5 Max chip consistently outperforms the M4, with an average improvement of 15% to 50%. A key finding is that on Apple silicon you should always choose MLX models, which deliver up to nearly double the performance of their GGUF counterparts. Although the initial prompts are simple, the tokens-per-second analysis confirms the advantage of MLX-optimized implementations. The video then transitions to a context-scaling test to evaluate how prompt length affects total wait time.

Benchmark 2 - Local Model Bottleneck (14:20)

The Graph Walks benchmark showed the M5 Max outperforming the M4 in speed, especially in the prefill phase. However, the real bottleneck for local models turned out to be context size and latency. As context grows to 16K or 32K tokens, total response time becomes unacceptably long. Model accuracy also drops sharply at very large contexts, with the models failing to find the correct answers. Although the hardware is powerful, context handling severely limits the practical capabilities of local models.

Benchmark 3 - Pi Coding Agent (27:07)

A benchmark compared local models running in the Pi coding agent on the M5 and M4 devices. The M5 significantly outperformed the M4 in total agent execution speed while keeping a high correctness rate. The tests scaled from simple tasks to complex ones such as generating complete packages, validating the agentic capability of these local models. The conclusion is that local models are viable for microtasks and useful engineering work without relying on external APIs. However, performance degraded as the context window or token usage grew. The presenter strongly recommends using MLX variants on Mac and making the most of the M5 hardware for any local AI task.

Local Benchmark, MLX, Gemma, and M5 Takeaways (34:55)

The local benchmark codebase is available for the community to use and customize. Future work will focus on testing models with more parameters and building specialized variants of the Pi agent to better control results. There are also plans to extend the tests to multimodal inputs, including images and audio. Although local models are still maturing, they offer a large long-term advantage in cost and time savings. The presenter predicts that by the end of the year it will be possible to run advanced models on the level of Sonnet or Opus 4.0 on personal devices as hardware improves. Local models give users privacy and control over performance, something model providers do not offer.

Generated with algorithm v1-chunked · model google/gemma-4-e4b · 2026-05-03T12:05:55Z

Transcript

[0:10] [music]
[0:33] >> What's up engineers? Today we're going
[0:34] to have some fun. On the left I have my
[0:36] new fully specced out M5 MacBook Pro. On
[0:40] the right I have the previous generation
[0:42] fully specced out M4 Max MacBook Pro.
[0:46] Today we're going to push the limits of
[0:47] these devices by using some of the best
[0:50] models and tooling from Apple, Google,
[0:52] Alibaba, and Nvidia. I'm filming at the
[0:55] perfect time here because once again the
[0:58] Claude APIs are down. You know what I
[1:00] wish I could do? I wish I could use
[1:03] private, cheap, fast, performant, local
[1:06] models right on my device. Here's
[1:09] everything we're going to cover in this
[1:11] video. If you want to see the insane
[1:12] gains you can get from using the
[1:14] dedicated MLX models specialized for
[1:17] Apple hardware, definitely stick around.
[1:19] Many engineers are wasting time not
[1:22] using the dedicated MLX models. We're
[1:24] going to look at the M5 versus the M4.
[1:26] We're going to look at GGUF versus MLX
[1:29] models. And if you want to see how much
[1:31] Google cooked on the Gemma 4 model,
[1:34] stick around as well. We're going to
[1:35] compare Gemma 4 versus the Qwen 3.5
[1:38] series. All of these innovations are
[1:40] coming together to create
[1:42] state-of-the-art model performance from
[1:45] Apple, Google's Gemma 4, Nvidia
[1:47] optimized models, of course the cracked
[1:50] Qwen 3.5, and then we're going to
[1:51] supercharge it with Apple's MLX machine
[1:55] learning framework designed for Apple
[1:57] silicon. This helps us push away from
[2:00] our dependency on Anthropic's APIs, on
[2:02] OpenAI's APIs, on any cloud provider's
[2:05] APIs. It's almost a guaranteed future we
[2:07] will be running powerful models on our
[2:10] devices in a private, cheap, fast, and
[2:12] performant way. It's only a matter of
[2:15] time. But the only way to know when the
[2:17] time has arrived is to prepare ahead of
[2:19] time. That's what we're doing here. So
[2:21] the first thing we're going to do here,
[2:22] when you're running local models, is you
[2:24] have to warm them up. We have our M5 on
[2:26] the right, we have the M4. I want to
[2:27] show you the models we're going to be
[2:28] using here. So on both sides we're going
[2:30] to run just bench ping, and we're going
[2:33] to kick off our models on both sides. So
[2:35] at the same time, we're going to run
[2:37] these and we're going to compare them in
[2:38] this simple first prompt test. 1 2 3.
[2:41] This is what's called a cold start. The
[2:43] models are not in memory yet, so both
[2:45] devices are loading the models into
[2:47] memory. And here we're going to show
[2:49] exactly the models we're going to be
[2:50] using throughout our benchmarks. But
[2:52] we're going to start small, we're going
[2:53] to start simple. So the first run always
[2:55] takes a bit of time. You can see here my
[2:56] M4 actually was able to warm up the Qwen
[2:59] 3.5 before the M5. There we go, there's
[3:02] the M5 device. M4 just got Qwen 3.5 with
[3:05] the NVFP4
[3:07] format from Nvidia. M5 just completed.
[3:09] Gemma 4s are complete, and then we have
[3:12] the Gemma 4 MLX variant also complete.
[3:15] I'm going to run this again and get
[3:16] another cleaner benchmark now that these
[3:18] models have warmed up. So we have that
[3:20] first result on the M5. There's the
[3:22] second result on the M5. There's the
[3:24] third result, and we should get the
[3:25] fourth result here in a second. Stats
[3:27] look great, and on the right you can see
[3:28] the M4 with the Qwen GGUF model 35
[3:31] billion parameter is coming in as well.
[3:33] So here are the key metrics we're going
[3:35] to cover here. Prefill speeds. This is
[3:37] the processing of the incoming prompt.
[3:39] Decode. This is what people refer to as
[3:41] the true tokens per second. And then the
[3:43] wall. This is the thing that really
[3:44] matters. Wall is the end-to-end time to
[3:46] actually execute with all costs,
[3:49] including the cold start, including
[3:50] prefill, decode, and a couple other kind
[3:52] of hidden costs of running models
[3:54] locally. So this is the true time. This
[3:56] is what we really, really care about.
[3:57] Already you can see here on the M5
[3:59] prefill and decode speeds across the
[4:02] board for every single model are
[4:04] generally higher. We're prefilling
[4:06] faster and we're decoding faster,
[4:07] resulting in faster times. We of course
[4:10] can't benchmark anything meaningful
[4:12] going one prompt at a time. Instead
[4:14] we're going to use a live benchmarking
[4:16] tool to track what's going on with each
[4:19] one of these models. So over on the M4
[4:21] here, you can see we're going to live
[4:23] track prefill, decode, total wall clock
[4:26] time, and then random access memory at
[4:29] peak usage. So this is really important
[4:31] because if you can't fit the model into
[4:32] RAM, it just can't run. There are new
[4:35] LLM innovations coming out all the time
[4:36] to help with the KV cache to get some of
[4:39] that hard work out of the RAM. But right
[4:41] now this is how it works. So now we're
[4:42] going to run five prompts increasing in
[4:45] complexity. We have all of our models
[4:47] aligned at the top. So let's go ahead
[4:48] and kick off our first benchmark to
[4:50] really compare these models side by side
[4:53] by side.
[4:54] >> [music]
[4:57] >> We'll run these on both devices. just bench
[4:59] qg. It's Qwen versus Gemma. And this is
[5:02] going to be the M4. We're going to copy
[5:03] this over, and with the screen
[5:05] connection feature here, jump right over
[5:08] to the M5. And then we're going to kick
[5:10] these both off. 1 2 3. So you can see
[5:13] the available baseline memory. And now
[5:15] we're going to start booting up these
[5:16] models and actually executing on each
[5:18] prompt. I'll share more about the
[5:19] architecture of this application. The
[5:21] big idea here is that you can run
[5:23] benchmarks from multiple devices, and
[5:25] the results are going to be streamed
[5:27] right to this live interface. Open up
[5:29] your new terminal on the M5, and if we
[5:31] type just ping, you'll notice on the live
[5:33] bench UI ping received. Our first
[5:36] benchmark just came in. On the right
[5:38] we're going to have our M5 Max device,
[5:41] and on the left we're going to have our
[5:42] M4 device benchmarks as well. And so how
[5:45] this is going to work is across these
[5:47] five prompts we're going to kick off
[5:49] Qwen 3.5 GGUF Nvidia optimized version
[5:53] with the MLX variant, and this is the
[5:55] model that Ollama announced. Ollama now
[5:58] supports models that can run directly on
[6:00] MLX and Apple silicon, and this is that
[6:02] exact model. Gemma 4 GGUF format. But
[6:04] then we also have a MLX community built
[6:06] version of the Gemma 4 model that's
[6:08] going to be optimized to run on Apple
[6:10] silicon, and you'll see that as the
[6:12] numbers come in. So right now things
[6:13] look pretty close. There isn't a huge
[6:15] difference between the M4 and the M5. If
[6:17] we scroll down to the wall time, our M4
[6:20] is actually moving a bit faster with
[6:22] this Qwen 3.5 GGUF model. And so we're
[6:24] kind of going back and forth a little
[6:26] bit, but as these models execute the
[6:28] clear advantage of the M5 Max chip is
[6:32] going to emerge. Our M5 device has just
[6:34] finished running all of the Qwen 3.5
[6:37] GGUF formats, and now it's going to kick
[6:39] off the MLX. Now this is where things
[6:41] get really interesting. So the MLX
[6:43] variants are going to run very, very
[6:45] fast. They're going to smoke the GGUF
[6:48] format.
[6:49] >> [laughter]
[6:49] >> And it's not really even going to be
[6:51] close. You can see we already got the
[6:52] second response from the M5 Max using
[6:55] the Qwen 3.5 MLX variant. There's a
[6:57] couple big optimizations in this thing.
[7:00] First off, MLX, it's the mixture of
[7:02] experts just as the GGUF format is. So
[7:05] that's great, but it's also the Nvidia
[7:07] optimized NVFP4.
[7:09] This thing is souped up. It's quite
[7:11] powerful and it's quite fast. You're
[7:13] going to see this on both devices, but
[7:14] you can see it especially on the M5
[7:17] here. So if you look at some of the
[7:18] stats, prefill speed is almost double
[7:20] using the MLX variant. You can see the
[7:22] M4 is also completing very, very
[7:24] quickly. The difference isn't as big on
[7:26] the M4 device, but it's still there. If
[7:29] we check out the decode speeds, the GGUF
[7:32] format on the M5 got 60 tokens per
[7:34] second. Pretty good, right? Very fast.
[7:36] But it's not close. The MLX Qwen 3.5
[7:39] version got 118 tokens per second. You
[7:41] can see that this is pretty consistent
[7:43] across each one of these prompts. After
[7:45] you run prefill, which is the more
[7:47] variable speed, this is that initial
[7:49] loading of the prompt, the decode is
[7:52] very, very consistent. 118 tokens per
[7:54] second for every single one of these
[7:56] prompts. And so speaking of the prompts,
[7:57] what are we actually running here? These
[7:59] are very, very simple tasks. And so if I
[8:00] just click into this, you know, the
[8:01] first prompt here is explain what a hash
[8:03] table is in two sentences. So just very,
[8:05] very simple surface level questions.
[8:07] We're not pushing the performance quite
[8:09] yet. We'll do that in our future
[8:10] benchmark coming up. But what we really
[8:12] want to see here is what are the model
[8:13] statistics on these different models on
[8:16] both devices. The Gemma 4 model is
[8:19] starting to run on the M5 device. This
[8:21] model is really, really incredible. The
[8:23] Gemma 4 GGUF model that we're running
[8:26] right now. It's got about 100 tokens per
[8:28] second, so it's not as fast as the MLX
[8:30] variant, but the prefill speed is
[8:33] faster. So in the prefill step we're
[8:34] getting about 550 tokens per second, and
[8:36] it's beating out in every prompt. And
[8:39] now if we look side by side, right? If
[8:40] we just scroll down to wall time, this
[8:42] is the total time it takes to run.
[8:44] Overall we're getting faster speeds out
[8:47] of the M5 with some variance, right?
[8:50] These are non-deterministic systems, so
[8:51] you can see here the first three Qwen
[8:53] 3.5 GGUF runs on the M5 were quite slow,
[8:56] but the MLX and the Gemma model are
[8:59] much, much faster. They're getting a lot
[9:00] more performance. There are a lot of
[9:01] ways to measure local model performance.
[9:03] The thing that matters the most is the
[9:04] wall clock time. How much time did you
[9:06] actually sit and wait end-to-end? You
[9:08] can see something really special
[9:09] happening now. The Gemma 4 MLX variant,
[9:12] our last model, is blitzing through this
[9:14] stuff. And that's going to lead us to
[9:15] our first big takeaway here. You know,
[9:17] there's a clear trend. I set these
[9:19] models up in this order on purpose. If
[9:21] we focus back over in the M5, the
[9:23] benchmark is complete. It's run all the
[9:25] models back-to-back. You'll notice here
[9:27] my M4 Max MacBook Pro, the fan is on and
[9:31] it's pretty heavy right now. But
[9:33] throughout that whole time, the M5 Max
[9:35] was relatively quiet. We're getting
[9:37] really, really great prefill speeds out
[9:39] of both of our Gemma models on both
[9:42] devices. The blue here is our Gemma
[9:44] models. If we scroll to the bottom here,
[9:45] we can look at our max random access
[9:47] memory, and you can see the smallest
[9:49] model here we have is that Gemma 4 MLX
[9:52] variant. And so this thing is compact.
[9:55] It's fitting in just 16 GB of RAM. And
[9:58] we can actually visualize this as well,
[10:00] right? If I open up a new terminal and I
[10:01] type MacMon, we're going to see a live
[10:04] visual here. This is what really
[10:05] matters. We have 128 GB of RAM, and
[10:08] we're using about 42 right now. Looks
[10:09] like things are completing, memory is
[10:11] getting swapped in and out here. So, it
[10:12] just completed. It just kind of dropped
[10:14] down that memory back to some base
[10:16] level. That looks good. Let's just
[10:17] understand and internalize this first
[10:20] benchmark run. Prefill speeds. What do
[10:22] we see here? We see that the Qwen 3.5 GGUF
[10:25] model is the slowest at prefill. From
[10:27] there, pretty much every model just
[10:29] increases up till the GGUF and MLX. So,
[10:33] the MLX is not prefilling as quickly as
[10:35] a GGUF file. Now, if we take a look at the
[10:38] actual prompt we're running here,
[10:39] prefill speed doesn't matter a ton. Why
[10:41] is that? It's because our prompts are
[10:43] relatively small. But for our next
[10:44] context benchmark, prefill speed becomes
[10:47] very, very important because this is the
[10:48] time you're waiting for the model to
[10:50] actually load the prompt into its memory
[10:53] and to start processing it. All of these
[10:54] prompts that we just ran, very, very
[10:56] simple, two, three, four sentences, and
[10:58] that's it. And then we can see the
[10:59] responses for each model here on the P5
[11:01] deep, the final prompt, design a rate
[11:03] limiter. And you can see how every model
[11:06] went through that example and kind of,
[11:08] you know, came up with its response. So,
[11:10] fantastic. So, that's prefill speeds,
[11:11] but let's look at tokens per second.
[11:13] What is the pattern here? The pattern is
[11:15] very, very clear. The MLX model variants
[11:17] of Qwen 3.5 and Gemma 4 are extremely
[11:21] fast. We're talking about 100 tokens per
[11:23] second. The slowest here is the Qwen 3.5
[11:25] GGUF coming in at just 50 tokens per
[11:27] second, which is fully usable, by the
[11:29] way. Anything over 30 tokens per second,
[11:31] I consider fully usable. Once you drop
[11:33] below 20, I consider that the dead zone.
[11:36] I just can't wait for that token speed.
[11:38] Comment down below, let me know your
[11:39] minimum viable speed for these models.
[11:42] All right? And so, again, in our next
[11:44] context benchmark, we're going to see
[11:46] how the total wait time is affected by
[11:48] the prompt length. This is something
[11:50] that's often left out of small local
[11:53] model benchmarks. The bigger that prompt
[11:55] gets, the bigger the total context
[11:58] that's going to run in your agent, which
[11:59] will be our final benchmark, the slower
[12:01] the total time is going to be because
[12:03] they have to process all that context.
[12:04] But you can see here, the theme is
[12:06] pretty consistent. The Gemma models are
[12:08] faster in general, both formats. But
[12:09] when we start using the Apple silicon,
[12:12] the performance goes up by quite a lot.
[12:14] I mean, almost double from the Qwen 3.5
[12:17] GGUF format. And I do think we need to
[12:19] give credit to Nvidia's floating point
[12:21] format here, NVFP4, but it's also due to
[12:24] the MLX. Right, it's very clear that MLX
[12:26] is a huge, huge, huge speed gain on the
[12:28] Mac devices. So, M5 is clearly faster on
[12:31] average. The slowest speed here we got
[12:33] was 60 tokens per second over that 45
[12:36] tokens per second wall clock time, which
[12:38] is the thing that really matters. The
[12:40] big takeaway here between the M4 and the
[12:42] M5 is that the M5 is about 15 to 50%
[12:47] faster than the M4, which is a pretty
[12:49] massive jump. And we're going to see the
[12:51] trend continue as we increase the
[12:53] context size of the prompts that we're
[12:55] executing. And then memory, pretty
[12:57] simple. Gemma 4 is an incredibly packed
[13:00] model, like they say it here themselves,
[13:02] right? They're maximizing intelligence
[13:03] per parameter, and they've definitely
[13:05] accomplished that. It's great to have a
[13:07] model coming out of the US that's truly
[13:09] open and actually competitive with the
[13:12] Qwen series and the other Chinese labs.
[13:13] So, that's great to see. I don't really
[13:15] discriminate at all when it comes to
[13:16] where the model's coming from, but it's
[13:17] great to have something from the US.
[13:19] Fantastic. Let's move on to our next
[13:20] benchmark. These are simple prompts.
[13:22] These are a fraction of the prompts that
[13:25] people are actually running now. We've
[13:27] run some pretty heavy, serious prompts
[13:29] with the state-of-the-art models. Let's
[13:30] start to scale this up. Let's throw
[13:32] harder problems at this. So, I'm going
[13:34] to hit back here, and we're going to
[13:35] move into our context scaling benchmark.
[13:37] So, here we're just going to look at
[13:39] these MLX models. The big takeaway, by
[13:41] the way, from these four models is if
[13:44] you're running on Apple silicon, always
[13:46] find an MLX model. There's just really
[13:48] no debate about this, and they're up to
[13:50] twice as good as their GGUF counterparts.
[13:53] So, this is big. This is a really
[13:55] important thing to understand. If you're
[13:57] on Apple silicon, use MLX. Now, whether
[13:59] to use the MLX Qwen model or the Gemma
[14:02] model, that's not so clear. Let's see if
[14:05] we can get some more clarity out of Qwen
[14:07] versus Gemma in this next benchmark. So,
[14:08] let's go to our context scaling, and
[14:10] let's go ahead and kick this off. So,
[14:11] we're going to run just bench context,
[14:14] and this is the M4. And then I'll copy
[14:15] that same thing over, cross over to the
[14:17] M5. So, let's go ahead and kick this off
[14:19] at the same time. Here we go. 1 2 3.
[14:23] >> [music]
[14:25] >> And now we're going to start streaming
[14:28] updates to our live bench tool from both
[14:30] our M4 and our M5. Let's see how they
[14:33] run. So, side by side, they're both
[14:35] kicking off right now, and this is
[14:36] running the graph walks benchmark. So,
[14:39] this is something we looked at in last
[14:40] week's video when we were talking about
[14:42] Claude Mythos. Graph walk is really
[14:45] interesting. The models have to perform
[14:47] breadth-first search across increasing
[14:49] contexts. 200 tokens here, 500 tokens
[14:52] here, 1,000 tokens here. And so, we're
[14:55] scaling up the prompt size. Let me be
[14:56] super clear, this is the size of the
[14:58] prompt. And so, here we're keeping track
[14:59] of prefill, decode. You can see results
[15:02] coming in already. The M5 is running
[15:04] pretty steadily here, 117 tokens per
[15:06] second. And then we have our total wall
[15:08] clock size. So, this is the most
[15:09] important thing, just how fast did that
[15:11] thing run? And then we have the
[15:12] accuracy. So, we're also keeping track of
[15:14] the actual graph walks F1 score, higher
[15:17] is better. And if we click into this, we
[15:19] have an expected answer. They're
[15:20] performing breadth-first search along a
[15:23] graph. I'm not going to go into the
[15:24] details of that too much. If you're
[15:25] interested, you can check out the graph
[15:27] walks code base. This is the benchmark
[15:30] we looked at last week. Mythos was
[15:31] performing a staggering 80% success rate
[15:35] on this benchmark all the way up to a
[15:37] million tokens. And so, you know, just
[15:39] for comparison here, you can see our
[15:41] model already made a mistake here at
[15:43] just 8K tokens of context length. If you
[15:46] want to get the behind-the-scenes scoop
[15:48] on my take on the Mythos model, check
[15:51] out last week's video where we dug into
[15:53] the Mythos model and really looked
[15:54] behind the curtain at what Project
[15:56] Glasswing really means. Anyway, back to
[15:57] local models here. So, once again, we
[15:59] are seeing as these models perform
[16:02] breadth-first search through
[16:03] increasingly complex and longer and
[16:05] longer graphs, the machines are starting
[16:08] to work. I'll be quiet for a second
[16:09] here, and you can just hear the M4.
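For reference, here is a minimal sketch of the kind of breadth-first search the models are asked to carry out in their heads. The exact prompt format of the graph walks benchmark is not shown in the video, so the edge-list input, the function name `bfs_nodes_at_depth`, and the "nodes exactly N hops away" question are assumptions for illustration only.

```python
from collections import deque


def bfs_nodes_at_depth(edges, start, depth):
    """Return the set of nodes first reached exactly `depth` hops from `start`."""
    # Build an adjacency list from directed (source, target) pairs.
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)

    visited = {start}
    frontier = deque([start])
    # Expand the frontier one level at a time, `depth` times.
    for _ in range(depth):
        next_frontier = deque()
        while frontier:
            node = frontier.popleft()
            for neighbor in graph.get(node, []):
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return set(frontier)


# Hypothetical toy graph; the real benchmark graphs are far larger.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
print(sorted(bfs_nodes_at_depth(edges, "a", 2)))  # ['d']
```

A program does this in microseconds. A language model has to track the visited set and the frontier purely through tokens in its context, which is why accuracy drops as the graph (and the prompt) grows.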
[16:13] Okay. So, the M4 is making some noise,
[16:16] and we're even going to get my M5 fan to
[16:18] really kick on here in a second. You
[16:20] know, for instance, let's look at some
[16:22] of the work that these have to do. You
[16:23] can see some of the chain of thought
[16:24] here. This is the Qwen 3.5 35 billion
[16:27] parameter MLX variant, and you can see
[16:28] it's thinking through, working through
[16:31] the
[16:32] uh traversing the nodes inside of a
[16:35] graph. And you can see, you know, the
[16:36] expected answer is becoming more and
[16:38] more complicated. And this is just at 4K
[16:39] context. It got close to the answer,
[16:41] actually. Looks like it made a mistake.
[16:43] So, we docked some points from that.
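The partial credit mentioned here can be illustrated with a set-based F1 score. How the benchmark actually computes its F1 is not shown in the video, so this sketch assumes the expected and predicted answers are compared as sets of node names; the function name `f1_score` and the sample node names are hypothetical.

```python
def f1_score(expected, predicted):
    """Set-based F1 between the expected nodes and the nodes the model returned."""
    expected, predicted = set(expected), set(predicted)
    if not expected and not predicted:
        return 1.0  # nothing expected, nothing returned: treat as a perfect match
    true_positives = len(expected & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)  # share of returned nodes that are right
    recall = true_positives / len(expected)      # share of expected nodes that were found
    return 2 * precision * recall / (precision + recall)


# Two of three nodes correct, one wrong node added: partial credit of about 0.67.
print(round(f1_score({"n1", "n2", "n3"}, {"n1", "n2", "n4"}), 2))
```

Under this assumption, a model that gets close to the answer still scores well above zero, and a model that returns nothing scores zero.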
[16:44] These devices are fully on. So, we're
[16:46] making these Mac devices work
[16:48] ultra hard to perform this work,
[16:50] right? The fans are on, GPUs are
[16:51] cooking, and we can go ahead and just
[16:53] see that as well, MacMon here. And let's
[16:55] go to the M5 MacMon as well. GPU
[16:59] utilization here is quite high, almost
[17:02] 100% here, 55 RAM. So, we're only
[17:04] running two models here, so and they're
[17:06] running one at a time. Both of these
[17:07] devices absolutely cooking right now.
[17:11] Uh working through breadth-first search
[17:14] as a language model, which is pretty
[17:16] crazy. That's very precise token
[17:18] reading, token traversing, and
[17:20] reasoning. This only works because these
[17:22] models can now think. And so, you can
[17:23] see here, on 32K, both systems are now
[17:26] processing that. The 32K is what I'm
[17:29] seeing as the proper context limit for
[17:33] these small language models. I'm talking
[17:35] 35 billion parameters and below. Even
[17:38] with the mixture of experts running,
[17:40] right? We have A4B here and A3B. These
[17:42] are both the MLX variants. We are
[17:44] getting that top-tier performance thanks
[17:47] to the Apple silicon we're running on
[17:48] here, but yeah, it's very clear that
[17:50] this is a hard problem for our models,
[17:53] both Qwen and, as you'll see here, Gemma
[17:55] as well. But the differences are pretty
[17:57] apparent here. The M5 is really chewing
[18:00] through these problems relatively
[18:01] quickly. And here we go. So, now we're
[18:03] getting the
[18:05] Gemma 4 MLX variants running here. It's
[18:08] actually getting fewer tokens per
[18:09] second, right? Tokens per second is not
[18:11] the full story. The full story for us is
[18:13] wall clock time. The Gemma model's
[18:14] starting to come through here on the M5
[18:16] device side. We're just focusing on the
[18:18] M5 Max. You know, you can see that
[18:20] accuracy is coming in nicely here as
[18:22] well. The correct graph nodes are being
[18:25] mentioned by both models. So, it looks
[18:27] like the performance is going to be
[18:28] pretty neck and neck here between Gemma
[18:31] and Qwen 3.5. So, that's great to see.
[18:33] You know, the devices are fully, fully
[18:35] working now. Maxed GPU, the efficiency
[18:38] cores, performance cores, and this new
[18:41] super core is quite activated, doing a
[18:43] lot of work. It's clear that the M5
[18:46] doesn't need the performance core for
[18:47] this work, and it's only using the super
[18:49] core for this work. But what is for sure
[18:51] being used is our RAM, our GPU, our
[18:54] power utilization is quite high as well.
[18:57] On the M5, we got 35 W, and on the M4,
[19:01] we're up to 40 W. So, the M4 is doing a
[19:03] lot more work here, and this is pretty
[19:05] consistent. So, prefill looking good.
[19:08] Prefill is the prompt processing. The M5
[19:11] is nearly double the M4. Just to say it
[19:14] super clearly, you know, out loud, if
[19:16] you want local model performance,
[19:17] upgrade from your M4, from your M3, from
[19:20] your M2, from your M1, whatever you're
[19:21] currently using. I have a fully maxed
[19:24] out M4, and the M5 is outperforming it
[19:27] by a wide margin. It especially
[19:29] outperforms in prefill speed. And this
[19:31] becomes very, very important as the
[19:33] prompt size increases. Look at the M4,
[19:36] it's still processing the 32K context
[19:39] prompt. So, this is a huge graph that is
[19:42] a set of nodes that our language model
[19:44] has to work through, and it has to give
[19:45] the correct answer to. But the Gemma 4
[19:48] MLX on the M5 is just burning through
[19:51] this. It just completed the 8K, and
[19:53] how quickly did it do that? In 13 seconds.
[19:56] All right, so it's very very fast, but
[19:57] you can see here something interesting.
[19:58] The wall clock time, so the total time
[20:00] of the Qwen models are actually
[20:01] performing a bit faster here. You know,
[20:03] Qwen's got a nice leg up here even
[20:05] though it's a 35 billion parameter
[20:07] model. Of course, this is mixture of
[20:09] experts, so they're both only using
[20:11] three or four billion parameters to
[20:12] answer each question. Gemma models are
[20:14] going to be a little bit slower here and
[20:16] actually quite a bit slower if you look
[20:17] at this on an average basis. We'll see
[20:19] how this last 32K
[20:22] context window prompt looks. Overall
[20:24] though, performance looking really good.
[20:26] They're both answering the question,
[20:27] which is really important. These models
[20:29] are incredible now. These local models
[20:31] are doing significant challenging work.
[20:33] Let's look at the 16K. So they have to
[20:35] work through a very very complex set of
[20:37] nodes. Looks like it's looking at 14
[20:39] depths of trees in a graph. And then
[20:42] here it's a little tricky, it has to
[20:44] find just one answer. Just one node. MLX
[20:47] got the correct answer. Gemma did not
[20:49] find the correct answer. It's really
[20:51] really important for these models to
[20:52] think and to process properly. Looks
[20:54] like our M4 just completed that huge
[20:58] run. Look at this wall clock time. So
[20:59] this is end-to-end time. So about 400
[21:01] seconds on the M4. And my M5 took 280
[21:05] seconds. Okay, so that's, you know, 40%
[21:08] improvement in speed for a large prompt.
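The "40%" figure depends on which ratio you take. A quick check using the two wall clock numbers quoted above (about 400 seconds on the M4, 280 seconds on the M5); the variable names are just for illustration.

```python
m4_seconds = 400.0  # approximate M4 wall clock time quoted in the video
m5_seconds = 280.0  # approximate M5 wall clock time quoted in the video

# Speedup: how many times faster the M5 finished the same run.
speedup = m4_seconds / m5_seconds

# Time saved: what fraction of the M4's run time the M5 avoided.
time_saved = (m4_seconds - m5_seconds) / m4_seconds

print(f"speedup: {speedup:.2f}x")       # speedup: 1.43x
print(f"time saved: {time_saved:.0%}")  # time saved: 30%
```

So the M5 is roughly 43% faster in throughput terms, which matches the rough "40%" stated, while the run itself takes about 30% less time.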
[21:11] And right now the M5 is working through
[21:13] that final prompt on the Gemma 4 MLX
[21:16] variant. This is taking quite a bit of
[21:17] time as well. I'm expecting it to come
[21:19] at you know, 308 seconds or something
[21:22] for the Gemma 4 26 billion parameter MLX
[21:25] model. But once again, in terms of like
[21:27] raw usage, the M4 is working its ass off
[21:30] to compete with the M5.
[21:32] And you can hear it. Pretty incredible
[21:34] stuff. Again, stats, if we look at the
[21:36] tokens per second, M4 versus the M5 is
[21:38] getting about what is this? Maybe 15%?
[21:42] No, a little bit more than 15. It looks
[21:43] like 20% tokens per second, but we do
[21:46] lose tokens per second here as the
[21:48] prompt size goes up. So this is an
[21:49] interesting observation. And it kind of
[21:51] leads us to another big takeaway from
[21:53] running these models side by side and
[21:55] really seeing what they can do. As
[21:57] prompt size increases, local model
[21:59] performance goes down very very quickly.
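One way to see why wall clock time grows with prompt size is a simple two-phase model: time to read the prompt (prefill) plus time to generate the answer (decode). This is a rough sketch, not the benchmark's own code. Only the 117 tokens per second decode rate comes from the video; the prefill rate and output length below are assumed placeholder values.

```python
def wall_clock_seconds(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Estimate end-to-end latency from prefill and decode throughput."""
    prefill_time = prompt_tokens / prefill_tps  # time to process the prompt
    decode_time = output_tokens / decode_tps    # time to generate the response
    return prefill_time + decode_time


PREFILL_TPS = 1000.0   # assumed placeholder, not a measured value
DECODE_TPS = 117.0     # decode rate quoted for the M5 in the video
OUTPUT_TOKENS = 2000   # assumed length of reasoning plus answer

for prompt_tokens in (200, 1000, 8000, 16000, 32000):
    total = wall_clock_seconds(prompt_tokens, OUTPUT_TOKENS, PREFILL_TPS, DECODE_TPS)
    print(f"{prompt_tokens:>6} prompt tokens -> {total:6.1f} s")
```

This model holds both rates constant, which is optimistic: as noted in the run above, tokens per second themselves drop as the prompt grows, so real latency climbs faster than this estimate.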
[22:02] This might sound obvious, but it's
[22:03] important to realize the impacts of this
[22:05] when you're expecting your local model
[22:07] to do agentic work. These one-off
[22:09] prompts aren't really that important
[22:11] anymore. Everyone is putting language
[22:13] models inside of an agent. And that
[22:15] context window stacks up very very
[22:17] quickly. I can almost guarantee you
[22:19] every time you put up Claude Code and
[22:21] you run two or three prompts, you're
[22:22] already at 32K tokens. What's limiting
[22:25] our models now, our local models, isn't
[22:28] so much the performance because we're
[22:31] doing breadth-first search on up to
[22:34] 32,000 tokens and getting correct
[22:36] answers out of these local models. Looks
[22:38] like we're breaking down here a little
[22:39] bit at the 8K and 16K mark. You know,
[22:41] the point here is uh performance
[22:42] matters. And in our last benchmark,
[22:44] we're going to look at how these models
[22:45] perform inside of the Pi coding agent to
[22:47] get a real like agentic view at these
[22:50] models. But the bottleneck here for
[22:52] local models is context window size. At
[22:56] that 16K mark, we have to wait 30
[22:58] seconds for a response. That's tough,
[23:01] right? [laughter] In fact, it's
[23:02] unusable, I would say, right? Waiting 30
[23:04] seconds for a response is frankly just
[23:06] unusable. Again, as a general rule,
[23:09] these local models say they can handle,
[23:11] you know, 250K context windows, 160K
[23:15] context windows. That's all great if
[23:16] you're running it on Nvidia cracked
[23:19] GPUs. We are not, right? We are running
[23:21] it on local hardware. And even with the
[23:23] best state-of-the-art M5 Max MacBook Pro
[23:26] device here, we are consuming a lot of
[23:29] time when the tokens hit about, you
[23:31] know, 16,000 token context window
[23:33] length. Accuracy is still there. What we
[23:35] do lose is these models completing in a
[23:37] reasonable amount of time. Just like
[23:39] large language models who say they have
[23:41] a 1 million token context window, it's
[23:43] really like the true context is 500K,
[23:46] maybe 700K, maybe 800K. The Claude 4.6
[23:49] series has been the best so far with
[23:51] their true 1 million context window. For
[23:54] local models, it's much much more
[23:56] limited than that. We're running a lot
[23:58] of like state-of-the-art game-changing
[24:00] hardware and now we're getting the wrong
[24:01] answer at 32K, right? This is just too
[24:03] complex. Check this out. Expected answer
[24:05] is all these nodes and Qwen didn't
[24:07] respond. It just crapped itself. And
[24:09] Gemma 4 gave an answer, but it's not
[24:12] correct. So we spent all that time to
[24:13] get an incorrect answer out of these
[24:15] models. 32K is pushing it. 16K, it's
[24:19] still viable. But again, at 16K context
[24:21] window, we waited 30 seconds for an
[24:23] answer. So, you know, this is going to
[24:25] kind of tell a story as to what you can
[24:27] actually do with these models locally.
[24:29] And so now we're just waiting for my M4
[24:31] to finish the final 32K long prompt
[24:33] running the Gemma 4 26 billion parameter
[24:36] MLX variant. And, you know, final notes
[24:38] here, you can see tokens per second
[24:40] looking pretty good, then going downhill
[24:42] at that 8, 16, 32K mark. That is a
[24:44] signal that the model is just not
[24:46] processing. It's having a hard time. And
[24:48] then prefill speeds kind of that same
[24:50] deal on the M5 where prefill speeds are
[24:53] improving as we increase the size of the
[24:55] prompt. And then around 8K, 16K, we
[24:58] start to hit a wall. At 32K, we really
[25:00] fall off the cliff. Before I filmed this
[25:01] video, I actually had an agent cut out
[25:03] the 64K just because it took too much
[25:06] time. If you scroll down here, you know,
[25:07] 280 seconds and we have 400 seconds here
[25:10] and we're probably going to get like
[25:12] maybe a 500 second run here out of the
[25:14] Gemma model on the M4. We'll see. The M4
[25:17] Max worked its butt off here and finally
[25:20] completed the final graph walk at 32K
[25:23] and it got the answer wrong. So again,
[25:25] just reaffirming this idea that larger
[25:28] context windows are really really hard
[25:30] for local models to process. And really
[25:33] the falloff starts at that 8K context
[25:36] window length level. So if you're doing
[25:38] work underneath that level, you're
[25:39] probably in good shape to use a powerful
[25:41] local model. Gemma 4, Qwen 3.5, they're
[25:44] both equally viable if you look at the
[25:46] benchmarks here in accuracy for long
[25:48] context. They're both about the same.
[25:50] They're going to answer right or not at
[25:52] all with a slight edge to Qwen 3.5. But
[25:54] you can see here wall clock time, this
[25:56] benchmark took a lot longer to run. Just
[25:58] really that key takeaway, context window
[26:00] length is the true limitation for local
[26:03] models now. It's not really performance,
[26:05] right? The intelligence of these models
[26:07] is great, but you can only use them up
[26:09] to a certain context window length.
[26:11] Otherwise, the speed just takes too
[26:13] long. The wall clock time, the
[26:15] end-to-end latency is just too high. The
[26:18] numbers are even worse on the M4, right?
[26:20] At 2K, we're already up at 30 seconds.
[26:23] At 1K, we're already at that 20 second
[26:25] mark and the 36 second mark. M5 running
[26:29] the Qwen 3.5 MLX model, very very fast.
[26:33] So we're using the best local models
[26:34] with the best tooling at a decent
[26:36] parameter size. So the speed is here.
[26:38] This is at that 35 and the 26 billion
[26:40] parameter mark. If you push the
[26:42] parameter size lower, you can expect to
[26:44] see your accuracy go down while you
[26:46] increase your speed a little bit. For
[26:48] simple problems, this is fine. But I'm
[26:50] really looking at this 26, 35 billion
[26:53] parameter level model to get top-tier
[26:56] performance in a decent amount of time.
[26:58] Let's move on to our final benchmark
[27:00] here. Let's look at the Pi coding agent.
[27:03] What happens when we put these models
[27:05] inside of the Pi coding agent and make
[27:06] them do agentic work for us. Let's find
[27:09] out. All right, we're going to run
[27:10] just bench Pi M4 and we're going to take
[27:13] that exact same prompt, go over to our
[27:16] M5. Let's go ahead and kick this off and
[27:18] let's see how our models perform inside
[27:20] of an agent. This is where most agentic
[27:22] work is happening. One-off prompt calls
[27:24] aren't really that useful, right? We
[27:26] want agents operating our software and
[27:29] our products. Here we go. One, two,
[27:31] three.
[27:35] Unloading the models, restarting them,
[27:38] and then they're going to do a quick
[27:39] warm-up. And after that, we're going to
[27:41] start doing agentic work. So here's what
[27:44] we're looking at here with our Pi coding
[27:46] agent. M5 just came in really really
[27:48] fast followed by the M4. You can see our
[27:50] model lineup here. We are leaving off
[27:52] the Gemma 4 MLX variant due to some
[27:54] configuration issues. As I mentioned,
[27:56] the Claude API just went down
[28:01] and I didn't want to finish the work
[28:03] with GPT 5.4. We didn't get the full MLX
[28:06] server up. That's some local model work
[28:07] that I have to continue doing in agentic
[28:10] coding tasks. What's going on here? So
[28:12] we do have a correctness check here. And
[28:14] we're also tracking total tokens and
[28:16] total tool calls. All right, so you can
[28:18] see two, two, three. Looking good so
[28:20] far. We want to see these numbers be the
[28:22] same on both sides. Of course, we know
[28:24] that our M5 device is going to do a
[28:26] little bit better here when it comes to
[28:28] the raw speed and performance of this.
[28:30] But you can see the correctness looking
[28:32] the same on both sides, which is
[28:33] fantastic. And you can see the total
[28:34] tokens that our models are using to
[28:36] complete the task. So input and output
[28:38] tokens. So this is going to matter here
[28:40] as you saw in the previous benchmark, as
[28:43] tokens increase, performance goes down.
[28:45] Already we're starting to see a break in
[28:47] performance. M5 completing all these
[28:49] tasks, 7 seconds, 10 seconds, 14
[28:51] seconds. And let's go ahead and look at
[28:52] some of the tasks, right? What are we
[28:53] actually running? So we're starting
[28:54] simple, create file hello, print hello
[28:57] world, nothing else. Super super simple.
[28:58] Testing file writing, testing execution.
[29:01] And then our correctness test is
[29:02] actually going to execute the code and
[29:04] make sure that it's right. We're moving
[29:05] to Fibonacci. And so just slowly scaling
[29:08] up the difficulty for these models
[29:10] inside of the Pi coding agent. Create a
[29:11] file called fib. Here we're memoizing
[29:14] with a dictionary. And then we're
[29:15] printing out Fibonacci 10, which should
[29:17] be 55. Easy to validate, very very easy
[29:20] to scale and add complexity. So right
[29:22] now we're running the Qwen 3.5. We're
[29:24] going to run Qwen 3.5 MLX and then
[29:26] finally Gemma 4 GGUF. So this is what
[29:28] really matters now. It's can your model
[29:31] run inside of an agent. Both these
[29:33] models are really really great at
[29:34] running inside of the agent up to a
[29:37] certain context window length, okay?
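The Fibonacci task and its correctness check, as described above, can be sketched like this. The video does not show the file extension, the language, or the checker code, so `fib.py`, the Python solution, and the `check_output` helper are all assumptions for illustration: the idea is simply to run the file the agent wrote and compare its output to the expected value.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# What a correct agent-written file might contain: Fibonacci memoized with a dictionary.
FIB_SOURCE = '''
memo = {}

def fib(n):
    if n < 2:
        return n
    if n not in memo:
        memo[n] = fib(n - 1) + fib(n - 2)
    return memo[n]

print(fib(10))
'''


def check_output(source, expected_stdout):
    """Write `source` to a temp file, execute it, and compare stdout to the expected text."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "fib.py"  # assumed file name
        script.write_text(source)
        result = subprocess.run(
            [sys.executable, str(script)],
            capture_output=True,
            text=True,
            timeout=10,
        )
    return result.returncode == 0 and result.stdout.strip() == expected_stdout


print(check_output(FIB_SOURCE, "55"))  # True
```

Executing the result instead of reading it is what makes the task easy to validate and easy to scale: a harder task only needs a new prompt and a new expected output.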
[29:39] Because of course, as mentioned, as that
[29:41] context window increases, performance
[29:43] goes down in these local models. You
[29:46] know, something I'm really happy about
[29:47] here with both is serious real work you
[29:50] can do on your local device. These
[29:52] models are great at coding, they're
[29:53] great at parsing, summarizing, doing
[29:55] what I like to call micro agent tasks.
[29:58] I'm starting to incorporate these local
[30:00] models into my Pi coding agent
[30:02] specifically and into skills and into
[30:04] kind of sub agent processes where I can
[30:06] run micro agents to do a little bit of
[30:08] work in a very consolidated simple way.
[30:10] Why am I doing that? I'm preparing for a
[30:12] future where these local models become
[30:14] more viable. If you don't understand
[30:16] what these models can do now, you won't
[30:18] understand the next step. And I, maybe
[30:20] like you, want to have the advantages of
[30:23] these local models, private, fast, zero
[30:25] dependency on an outside API. We want
[30:28] these properties inside of local
[30:30] intelligence that we can run on our
[30:31] device. I'm really, really excited for
[30:33] what Apple could do next here. They've
[30:35] really fallen into, well, "fallen into a
[30:38] position" is
[30:39] not giving them enough credit here, but
[30:41] they're taking advantage of their
[30:43] silicon, their chips, right? The M5
[30:45] Ultra or the M6 chip running in a Mac
[30:48] Mini is the next device that I'm really
[30:50] waiting on, that I'm really keeping my
[30:52] eye on because that is going to be the
[30:53] moment. If they give you 500 GB of RAM
[30:56] in one of those, we are going to have
[30:57] very, very powerful local models right
[31:00] on device. Truly, it's going to change
[31:01] the industry. So, I'm looking forward to
[31:03] that and all that to say, when that time
[31:05] comes, you want to know what you can do
[31:07] with these local models. You want to be
[31:09] able to spin them up in micro agents and
[31:12] sub agent processes. That's really the
[31:14] frontier I'm looking at for local
[31:16] models. What simple work can I hand off
[31:18] to my local model to run very quickly,
[31:21] very cheaply on my device? There's
[31:23] engineering work, there's your, you
[31:25] know, personal life work, and then
[31:27] there's product work, right? These are
[31:28] kind of three key buckets that you can
[31:30] think through when you are assigning
[31:33] tasks to these small models, cheap, fast
[31:35] models, to workhorse models, and then to
[31:38] state-of-the-art models. So, as you
[31:40] would expect, our large packaging
[31:42] prompt, so we're actually having the
[31:43] agent build a full package or some
[31:46] files, multiple writes. It has to be in the
[31:48] right format, it has to execute
[31:49] properly. Here we have a bunch of pure
[31:51] functions that we want them to implement
[31:53] and we're using clean function syntax
[31:55] here. This is going to take quite a bit
[31:56] of time. What we effectively have is a
[31:57] spec for our agent to work through and
[31:59] build out a simple calculator app. And
[32:01] this could be anything. The trick is
[32:03] that we're having our agent do more
[32:04] work. Look at the tool calls down here
[32:06] on the M5 and on the M4. Test T1 through
[32:08] T4, relatively simple, three tool calls.
[32:10] But, test five and test six is requiring
[32:13] our models to do 14 and 26 plus tool
[32:16] calls. Okay, nice. We got 26 out of both
[32:19] the Qwen 3.5 and the MLX. That's good.
[32:21] You want to see these be relatively
[32:23] similar. Total tokens, you can see that
[32:25] used quite
[32:27] a bit of tokens to complete. So, our
[32:29] models are really, really being pushed
[32:31] here. To be super clear here, these are
[32:33] running one-shot prompts. We don't
[32:35] actually have the full 200K inside of
[32:38] the context window of the agent as it
[32:40] continues to stack up its work.
[32:41] Correctness is looking really good. So,
[32:43] these models are actually able to
[32:45] perform and to get the job done. Once
[32:47] again, of course, you can see that the
[32:48] M5 is quite a bit ahead uh thanks to its
[32:52] lower total end-to-end agent execution
[32:54] times. You can see here the M4 across
[32:57] Qwen 3.5 GGUF and MLX and Gemma, you
[33:00] know, 9 seconds, 10 seconds, 20 seconds,
[33:03] 40 seconds, 60 seconds, 160 seconds.
[33:06] While the M5, 7 seconds, 10 seconds, 14
[33:08] seconds, 25, 50, 180, 100, so on and so
[33:11] forth. So, the M5 is the device you want
[33:14] if you're doing any local model work. I
[33:16] recommend you max this thing out if you
[33:19] have extra cash in the bank and you
[33:20] really want to understand what you can
[33:21] do with local models. I would just go
[33:23] all the way. I don't really think
[33:24] there's a purpose in buying the lower
[33:25] tier of these models unless you're
[33:27] getting the base model. Just go all the
[33:29] way, get all the RAM, get all the cores.
[33:31] If you boot up even a 35 or a 20 billion
[33:34] parameter model, you will be using a
[33:36] decent chunk of your RAM and all of your
[33:39] GPU when that thing executes. You know,
[33:41] side note, whenever you run these
[33:42] models, plug your device in. Running
[33:45] this, running the fans, running the GPU,
[33:47] running your cores like this saps your
[33:49] power very, very quickly. So, we're
[33:51] coming up to the last test here for M5.
[33:53] I don't think we need to let the M4
[33:54] complete. It's going to take some time
[33:55] for the M4 to work through all this.
[33:58] Local models can absolutely run inside
[34:01] of agents now. They can do agentic work
[34:03] for you. The complexity of that work is,
[34:05] of course, going to be on the simpler
[34:07] end for now, but you can absolutely do
[34:10] useful local model work up to that 8 to
[34:14] 16K token level. So, really, really
[34:17] impressive here. Okay, yeah, Gemma's
[34:18] doing a lot of work here. It didn't
[34:20] quite get the answer right. Looks like
[34:22] it got 0.7 on the package generation,
[34:24] but um it ran this with just five tool
[34:27] calls. Moreover, I'm really impressed
[34:29] with both of these models, both the
[34:32] Qwen 3.5 and the Gemma model. But, a
[34:34] big takeaway here is if you're on Mac
[34:37] devices, find the MLX model variants.
[34:39] Absolutely use the MLX model variants.
[34:42] Okay, so it got the answer wrong.
[34:47] The M4 just kind of gave up here, one
[34:49] tool call, and then it just like could
[34:51] not finish. But, um all benchmarks are
[34:53] now complete. Yeah, you can see that
[34:55] this is clearly not a legitimate result.
[34:57] It tried, uh that's fine. This entire
[34:59] code base is going to be available to
[35:01] you. I can go ahead and just pull this
[35:02] up quickly here. The structure is
[35:04] relatively simple. You have four
[35:05] benchmarks, including a simple mock test
[35:08] YAML file benchmark. This is what
[35:09] defines every single benchmark. You can
[35:11] see kind of the general structure there.
[35:13] Um and then inside of apps, we have the
[35:15] kind of core back end, front end, the
[35:17] CLI that the M4 and the M5 use to ping
[35:20] messages to the server. And then we have
[35:22] the actual benchmarks that execute all
[35:25] of the tests. All of this is going to be
[35:26] here for you uh from me as a thank you
[35:29] for making it to the end of the video.
[35:30] Link will be in the description for you.
[35:32] This benchmark isn't perfect. There's
[35:34] always something to miss when you're
[35:35] creating these benchmarks, these local
[35:37] benchmarks, and cloud benchmarks,
[35:38] frankly. But, I think it's useful to sit
[35:40] down every once in a while when there's
[35:41] a stack of innovations coming out of the
[35:43] local model space and just understand
[35:45] what you can really do. So, feel free to
[35:47] play with this, customize this, create
[35:48] your own benchmarks, and, you know,
[35:50] throw an agent at this, throw a
[35:51] state-of-the-art agent at these
[35:53] benchmarks, have them tweak it, tune it,
[35:54] make it your own. Couple additional
[35:55] directions I want to go here. We saw
[35:57] here that, you know, the decode speeds
[35:59] are really, really good on these models
[36:00] up to really, really large context
[36:02] windows. This means that we can push to
[36:04] even larger parameter models. I'd love
[36:06] to find a good 50, 60, 70 billion
[36:09] parameter model to run on my M5. So,
[36:12] we're going to be looking out for that.
[36:13] Um I'm also going to be running a ton
[36:15] more Pi coding agent related
[36:17] benchmarks. Uh the Pi agent is just the
[36:19] simplest way to customize and understand
[36:21] what your agent can do. You can imagine
[36:22] a fully customized uh Pi agent harness
[36:25] that can spec out a lot of the model
[36:27] performance right inside the UI since if
[36:30] you're using the Pi coding agent, as
[36:31] we've talked about in previous videos,
[36:33] you can customize the entire agent
[36:34] harness. This is a big idea in 2026,
[36:37] control your agent harness to control
[36:39] your results. So, I'm going to be
[36:40] digging into specialized Pi agent
[36:42] harness variants specifically built for
[36:44] SLMs. And then, you know, there's a ton
[36:46] of other directions here. The Gemma 4
[36:48] models can also take in images and
[36:51] audio. So, that's another benchmark.
[36:53] Qwen model can also take in text and
[36:55] images. So, there's a lot more work to
[36:57] be done here to understand what you can
[36:58] really do with local models. But, when
[37:00] it comes down to it, local models are
[37:02] still a work in progress. For 90% of
[37:05] tasks, using a model provider is the way
[37:08] to go. But, this is a space I'm keeping
[37:09] a sharp eye on because as soon as you
[37:12] can save all the costs, all the time by
[37:15] just running your model locally on your
[37:17] hardware, it's going to be a massive
[37:19] advantage that's going to compound very,
[37:21] very quickly. Those who prepare for
[37:22] local models and understand what's
[37:24] possible will be the ones saving
[37:25] resources. If you don't need a large
[37:27] model, don't use one. This especially
[37:29] matters for product engineering when you
[37:31] have hundreds, thousands, and hopefully
[37:33] hundreds of thousands of users hitting
[37:35] your service and your models over and
[37:37] over and over. The future is agentic, so
[37:39] make sure that you can control the
[37:41] performance, the cost, and the speed of
[37:43] your models over time. And of course, as
[37:45] you know, local models are going to play
[37:47] a huge role in that. Comment down below,
[37:49] let me know what other types of
[37:51] benchmarks, what other models, what
[37:52] other input types, output types you want
[37:54] to see in a local model benchmark. I can
[37:57] spin up my M5 and showcase what you can
[37:59] do with the best Mac hardware right now.
[38:02] So, feel free to comment down below.
[38:03] It's clear the AI improvements are not
[38:05] going to stop. In fact, they're speeding
[38:07] up. That means costs go down and
[38:09] performance goes up. It's only a matter
[38:11] of time, you know, and I'm predicting by
[38:13] the end of the year we should be able to
[38:15] run a Sonnet or Opus 4.0 level model on
[38:21] your device. Now, how useful it'll be
[38:23] and for what use cases will depend
[38:26] directly on the hardware you have
[38:27] available to you and the software
[38:30] improvements that come out over the
[38:31] years. So, make sure you're
[38:32] benchmarking, make sure you're vibe
[38:33] checking these local models. So, when
[38:36] they're ready, you know, and you can
[38:38] start getting the privacy, speed, and
[38:40] cost benefits the model providers don't
[38:43] want you to have. Let me be super clear,
[38:44] model providers want you and I super,
[38:47] super hooked on their Kool-Aid, on their
[38:50] models, on everything they're building
[38:52] in their agent stack. You know where to
[38:54] find me every single Monday where we
[38:56] focus on hands-on agentic engineering.
[38:58] Stay focused and keep building.
