Autores varios - AI

RTX 5090, Mac Studio, or DGX Spark? I tried all three.

🇬🇧 EN🇪🇸 ES
AI
32:35 min youtube 2026 Semana 18 🇪🇸 ES

TL;DR

  • Soberanía de Datos Local: La tendencia actual es revertir la dependencia de la nube. El poder de cómputo personal es crucial para mantener el control y la privacidad sobre los flujos de trabajo personales, notas y sistemas de archivos únicos.
  • Estrategia Híbrida Inteligente: No se trata de elegir entre local o nube, sino de construir una pila tecnológica (stack) que use modelos locales eficientes (como Qwen o Gemma) para tareas repetitivas y enviar trabajos complejos a la nube solo cuando sea necesario.
  • El Hardware es un Enrutador: La elección del equipo (Mac Studio, DGX Spark, RTX 5090s) debe basarse en el tipo de carga de trabajo diaria que se desea ejecutar (ej. memoria unificada para notas vs. CUDA para codificación), funcionando la PC como un sistema inteligente de enrutamiento.

Resumen

YouTube: https://www.youtube.com/watch?v=iUSdS-6uwr4  |  Duración: 32 min

◆ La Revalorización del Ordenador Personal y la IA

La inteligencia artificial está revalorizando la importancia del ordenador personal, revirtiendo la tendencia de los últimos 15 años hacia la nube. Los agentes de IA son capaces de interactuar profundamente con el trabajo local, accediendo a archivos, carpetas y procesos. Esto obliga a considerar que el poder de cómputo personal es crucial para tareas útiles. El debate no es si la nube es mala o el local es bueno, sino una cuestión de propiedad sobre los flujos de trabajo. Gran parte del trabajo valioso está ligado al contexto personal, como notas, borradores y sistemas de archivos únicos. Por lo tanto, resulta vital decidir qué partes de ese flujo deben ser alquiladas a modelos en la nube y cuáles deben ser propias.

â–¶ Manejo de Datos Diferenciado por Necesidad

El objetivo es construir una pila de IA personal completa que garantice la privacidad y el control local del usuario. La computación personal ha triunfado al reducir la distancia entre la persona y la máquina, un principio que ahora aplica la inteligencia artificial. Gran parte del trabajo personal cotidiano es repetitivo, privado y muy dependiente del contexto, no son tareas de prueba extremas. Mantener el modelo de IA cerca de los archivos locales es crucial porque separar todo en la nube dificulta que la IA acceda a toda la información necesaria. Las grandes empresas ya implementan soluciones híbridas para conectar modelos en la nube con sistemas de memoria local. El ecosistema de pesos abiertos está evolucionando rápidamente, incorporando arquitecturas avanzadas como Mixture of Experts.

★ Los Servidores MCP no son Magia

El enfoque actual no está en el tamaño del modelo sino en cuánta parte de este se activa por token, favoreciendo modelos abiertos como Qwen y Gemma. La clave duradera es la pila tecnológica o stack, permitiendo que nuevos modelos y componentes se integren sin reemplazar toda la base de conocimiento. El ordenador personal de IA debe ser un sistema evolutivo, no una caja sellada. No hay una respuesta única para el hardware; la elección depende del tipo de carga de trabajo local que se desee ejecutar. Para tareas básicas como búsqueda documental o asistencia de codificación, un Mac con memoria unificada es suficiente y eficiente. El Mac ofrece ventajas en memoria unificada, bajo ruido y eficiencia energética. Por otro lado, la ruta CUDA con Nvidia proporciona excelente rendimiento de procesamiento, aunque la memoria puede estar fragmentada entre varias tarjetas.

â–º Principio de Interfaz: Muchas Superficies, Una Pila

La decisión de hardware debe basarse en el uso diario, no en el modelo más grande; se necesita memoria para notas o CUDA para codificación. El DGX Spark ofrece un paquete listo para usar con la pila de Nvidia, mientras que AMD Strix Halo es una opción atractiva desde el punto de vista del valor. Es fundamental entender que el software de ejecución o runtime define si la IA local funciona como una herramienta útil o como una carga de tiempo. Este runtime gestiona las cargas, la cuantización y expone APIs para que el hardware se use eficientemente. Herramientas como llama.cpp han sido clave para estandarizar formatos locales como GGUF en múltiples plataformas. Para simplificar la experiencia local, Ollama es recomendado por ofrecer una interfaz de línea de comandos limpia y un servidor compatible con OpenAI.

★ La Voz Local: Un Aspecto Subestimado

La construcción de una pila de IA local requiere elegir el runtime adecuado según la complejidad; opciones incluyen Ollama para uso diario, LM Studio para evaluación o vLLM para servir cargas de trabajo en Nvidia. Es crucial que la capa de runtime sea estable para permitir un cambio sencillo entre diferentes modelos sin grandes esfuerzos migratorios. En lugar de centrarse en un solo modelo, se debe construir alrededor de clases de modelos específicas para distintas tareas como codificación, memoria o visión. Esto permite tener una estrategia anti-dependencia de la nube, manteniendo opciones locales robustas. El ecosistema abierto actual muestra tendencias importantes con modelos que utilizan Mixture of Experts y son multimodales.

► Memoria Personal RAG y Ciclos de Codificación Privados

No existe un modelo único que domine todos los casos de uso, por lo que la clave es crear una mezcla de modelos especializados. En lugar de elegir un solo chatbot, se debe construir una caja de herramientas con diferentes capacidades. Esto incluye modelos pequeños para ciclos rápidos y modelos más grandes para tareas complejas o razonamiento profundo. Es crucial integrar componentes locales como modelos de embedding para memoria semántica privada, Whisper para transcripción rápida y visión local para extracción de datos. La estrategia principal es poseer el entorno de ejecución local y utilizar servicios en la nube solo en casos excepcionales.

â—† Captura de Reuniones sin Salir del Dispositivo

La decisión arquitectónica más importante en la IA personal es gestionar una memoria duradera que pertenezca al usuario, no al proveedor del modelo. Los sistemas de IA personales requieren almacenamiento externo para notas, documentos y estados de proyectos a largo plazo. Esto invierte el modelo tradicional donde el servicio en la nube busca poseer tu memoria. El orador presenta Open Brain como un sistema de memoria de código abierto que utiliza un enfoque híbrido SQL y de embeddings. Esta herramienta permite categorizar hechos estructurados mientras maneja múltiples documentos interconectados. Alternativas incluyen Obsidian para grandes volúmenes de documentación o Postgres para datos más cuantitativos. La propiedad clave es que tu conocimiento debe seguir existiendo incluso si la aplicación de IA desaparece.

★ Tres Perfiles de Compradores y Bases de Datos

La elección de la base de datos es fundamental, siendo Postgres con pgvector el estándar avanzado y SQLite una opción ligera para uso personal. El éxito en la recuperación depende del pipeline, ya que diferentes tipos de datos requieren estrategias de manejo distintas como las transcripciones o el código. Es vital mantener los datos brutos y los embeddings separados en la base de datos para asegurar la capacidad de reconstrucción ante fallos. Los servidores MCP permiten que herramientas externas consulten la memoria, pero deben ser sistemas bien gestionados con permisos y límites definidos. La inteligencia local debe configurarse como un sistema intencional, no simplemente una colección de herramientas locales sin estructura. Además, es crucial que el modelo resida en una interfaz cómoda donde se realice el trabajo, ya que un buen motor sin superficie de usuario será abandonado rápidamente.

â–º Mac Studio vs DGX Spark vs Dual RTX 5090s

El trabajo con agentes de código requiere un ciclo de planificación que integre el modelo, herramientas, repositorio y contexto. Las interfaces auxiliares como lanzadores y comandos son cruciales para la operatividad del LLM en el día a día. El principio clave es tener muchas superficies (editor, notas, terminal) que apunten a una única pila subyacente de memoria local. Esto permite construir una memoria institucional personal indexando documentos privados mediante sistemas RAG. La voz local se vuelve viable gracias a herramientas como Whisper para la transcripción y modelos locales para el entendimiento. Los asistentes de código privados pueden realizar tareas complejas como refactorización y generación de pruebas, complementando las capacidades de los modelos frontera. El enfoque debe pasar de solo ejecutar un modelo localmente a controlar flujos de trabajo personales avanzados.

â—† La Computadora de IA Personal es un Sistema de Enrutamiento

La computadora de IA personal funciona como un sistema de enrutamiento para decidir dónde realizar tareas específicas. Los modelos locales son suficientemente buenos para muchos bucles agenticos, permitiendo funciones como la captura y resumen de reuniones sin que el audio salga de la máquina, garantizando privacidad y ahorrando costos. La inferencia local es más económica para agentes de larga duración, ya que las APIs en la nube son caras. Sin embargo, la investigación profunda y la síntesis compleja aún requerirán modelos frontera. Para un trabajador del conocimiento enfocado en lo local, se recomienda empezar con una Mac Mini M4 Pro o Mac Studio M4 Max. Este usuario puede mantener privacidad y velocidad utilizando herramientas locales como Ollama y LM Studio, mientras conserva una suscripción a un modelo de vanguardia para tareas difíciles.

★ Memoria Acumulativa pero Auditada

El uso de hardware potente como Mac Studio o DGX Spark permite ejecutar trabajo crítico con soberanía y cumplimiento total. Los desarrolladores se benefician localmente al usar GPUs duales para manejar tareas repetitivas y privadas que justifican la inversión en infraestructura propia. La computadora de IA personal funciona como un sistema de enrutamiento, manteniendo lo privado y repetitivo localmente mientras envía trabajos complejos a la nube. El objetivo principal no es solo el ahorro de costos, sino acumular conocimiento continuamente. Este sistema se convierte en una capa operativa sobre tu trabajo, donde los datos fuente son la verdad permanente que se expande con el tiempo.

► Por Qué la IA en la Nube Debe Ser un Visitante

⚠️ Alerta Crítica de Propiedad: El objetivo principal es construir un sistema de memoria institucional que evite el encierro en aplicaciones propietarias de IA. Para lograrlo, se necesita una capa de cómputo subyacente con interfaces abiertas, como endpoints locales compatibles con OpenAI y protocolos de contexto de modelo. Es crucial tratar las herramientas del sistema como permisos, controlando estrictamente el acceso de los agentes para limitar la superficie de ataque.

La memoria debe ser acumulativa pero también auditable, permitiendo inspeccionar, rastrear y eliminar información almacenada. Se recomienda adoptar una experiencia híbrida que utilice modelos en la nube cuando sea necesario. El propósito final no es rechazar toda IA en la nube, sino poseer el sustrato al que estos modelos puedan conectarse.

◆ A Través del Espejo: Las Preguntas Clave

La construcción de una pila personal permite que la inteligencia artificial en la nube actúe como un visitante y no como el sistema dominante. Al tener esta alternativa local, surgen preguntas críticas sobre por qué las aplicaciones necesitan subir datos o por qué los asistentes pierden memoria al cerrar pestañas. El argumento del AI local no se trata de vencer a la nube, ya que esta seguirá siendo relevante para modelos avanzados. En cambio, poseer el resto de la pila fortalece el caso del AI personal. Se debe utilizar el modelo fronterizo en la nube como un especialista contratado para tareas específicas. Esto permite evitar alquilar servicios esenciales y mantener el control sobre el flujo de trabajo diario. El ordenador de IA personal es una apuesta estratégica por una inteligencia más cercana al trabajo, no solo un pasatiempo nostálgico.

★ Tu Computadora, Tu IA

El mensaje central es que la computadora debe ser una herramienta personal y local para funcionar como tu propia inteligencia artificial. No necesita ser la máquina más potente del mundo, sino simplemente tu equipo de trabajo. El orador impulsa a los usuarios a adoptar un modelo de desarrollador local-first y prosumer. Para quienes deseen construir este ecosistema, ofrece recomendaciones detalladas en su Substack. También menciona Open Brains como una forma accesible de empezar a controlar la parte memoria de tu pila computacional. La conclusión es que el usuario debe sentirse dueño de su destino digital. Es crucial evitar que los agentes y LLMs basados en la nube controlen los parámetros de inteligencia a largo plazo en la vida personal.

💡 Recomendaciones Clave para Implementar tu Pila Local

  • Priorizar el Stack sobre el Modelo: Enfocarse en construir una pila tecnológica robusta y evolutiva (usando herramientas como Ollama o llama.cpp) que permita cambiar modelos sin reestructurar todo el sistema.
  • Definir la Carga de Trabajo: Elegir hardware basándose en la necesidad específica: si es para notas/contexto, priorizar memoria unificada (Mac); si es para cómputo pesado/codificación, considerar CUDA (Nvidia).
  • Asegurar la Propiedad de Datos: Implementar sistemas de memoria local y auditable (como Open Brains o Postgres con pgvector) para que tu conocimiento persista independientemente del proveedor de IA.

â—† Buscar el alpha

La tesis central que se desprende de este análisis técnico es una rotación estratégica del valor en la IA. El capital no debe fluir hacia los proveedores de modelos monolíticos en la nube, sino hacia la soberanía del flujo de trabajo personal. La verdadera ventaja competitiva y económica reside en poseer el sustrato local (la pila tecnológica) que gestiona el contexto privado, transformando el ordenador personal en un sistema de enrutamiento inteligente.

  • Catalizador/Cambio de Régimen: El auge de los agentes de IA obliga a revertir la tendencia de 15 años hacia la nube; el poder de cómputo local se vuelve crucial porque gran parte del trabajo valioso está ligado al contexto personal (notas, archivos únicos).
  • Rotación de Capital: La inversión debe priorizar la infraestructura de memoria y almacenamiento privado (Open Brains, Postgres con pgvector) sobre simplemente comprar la GPU más potente. El valor reside en poseer el conocimiento acumulado, no solo en ejecutar la inferencia.
  • Mejor Expresión del Tema: El foco estratégico se ha desplazado del tamaño del modelo al *stack* de ejecución (runtime). La utilidad de la IA local está definida por herramientas como Ollama y llama.cpp que estandarizan formatos, no por el hardware en sí mismo.
  • Posicionamiento Óptimo (Híbrido): El sistema debe operar como un "router". Las tareas repetitivas, privadas o de bajo costo deben residir localmente; los modelos frontera en la nube solo se utilizan como especialistas contratados para síntesis compleja y investigación profunda.
  • Condición de Invalidez: La dependencia total de servicios en la nube es una vulnerabilidad estratégica que lleva al encierro en aplicaciones propietarias, lo cual debe evitarse a toda costa mediante el control del sustrato local.
La vuelta de tuerca: El oyente superficial ve esta discusión como una carrera armamentística de hardware (RTX vs DGX). La visión profunda es que el verdadero "moat" no es la potencia bruta, sino la soberanía del dato. Al poseer la pila local, se transforma al usuario de un mero consumidor de servicios en un arquitecto de su propia inteligencia a largo plazo, controlando quién accede y cómo se usa su memoria digital.

► Resumen por capítulos

Plain markdown plus Git is the boring immortal version (0:00)

La inteligencia artificial está revalorizando la importancia del ordenador personal, revirtiendo la tendencia de los últimos 15 años hacia la nube. Los agentes de IA son capaces de interactuar profundamente con el trabajo local, accediendo a archivos, carpetas y procesos. Esto obliga a considerar que el poder de cómputo personal es crucial para tareas útiles. El debate no es si la nube es mala o el local es bueno, sino una cuestión de propiedad sobre los flujos de trabajo. Gran parte del trabajo valioso está ligado al contexto personal, como notas, borradores y sistemas de archivos únicos. Por lo tanto, resulta vital decidir qué partes de ese flujo deben ser alquiladas a modelos en la nube y cuáles deben ser propias.

Different data needs different memory handling (2:30)

El objetivo es construir una pila de IA personal completa que garantice la privacidad y el control local del usuario. La computación personal ha triunfado al reducir la distancia entre la persona y la máquina, un principio que ahora aplica la inteligencia artificial. Gran parte del trabajo personal cotidiano es repetitivo, privado y muy dependiente del contexto, no son tareas de prueba extremas. Mantener el modelo de IA cerca de los archivos locales es crucial porque separar todo en la nube dificulta que la IA acceda a toda la información necesaria. Las grandes empresas ya implementan soluciones híbridas para conectar modelos en la nube con sistemas de memoria local. El ecosistema de pesos abiertos está evolucionando rápidamente, incorporando arquitecturas avanzadas como Mixture of Experts.

MCP servers are not magic (5:00)

El enfoque actual no está en el tamaño del modelo sino en cuánta parte de este se activa por token, favoreciendo modelos abiertos como Qwen y Gemma. La clave duradera es la pila tecnológica o stack, permitiendo que nuevos modelos y componentes se integren sin reemplazar toda la base de conocimiento. El ordenador personal de IA debe ser un sistema evolutivo, no una caja sellada. No hay una respuesta única para el hardware; la elección depende del tipo de carga de trabajo local que se desee ejecutar. Para tareas básicas como búsqueda documental o asistencia de codificación, un Mac con memoria unificada es suficiente y eficiente. El Mac ofrece ventajas en memoria unificada, bajo ruido y eficiencia energética. Por otro lado, la ruta CUDA con Nvidia proporciona excelente rendimiento de procesamiento, aunque la memoria puede estar fragmentada entre varias tarjetas.

The interface principle: many surfaces, one stack (7:30)

La decisión de hardware debe basarse en el uso diario, no en el modelo más grande; se necesita memoria para notas o CUDA para codificación. El DGX Spark ofrece un paquete listo para usar con la pila de Nvidia, mientras que AMD Strix Halo es una opción atractiva desde el punto de vista del valor. Es fundamental entender que el software de ejecución o runtime define si la IA local funciona como una herramienta útil o como una carga de tiempo. Este runtime gestiona las cargas, la cuantización y expone APIs para que el hardware se use eficientemente. Herramientas como llama.cpp han sido clave para estandarizar formatos locales como GGUF en múltiples plataformas. Para simplificar la experiencia local, Ollama es recomendado por ofrecer una interfaz de línea de comandos limpia y un servidor compatible con OpenAI.

Voice is underrated now that local whisper works (10:00)

La construcción de una pila de IA local requiere elegir el runtime adecuado según la complejidad; opciones incluyen Ollama para uso diario, LM Studio para evaluación o vLLM para servir cargas de trabajo en Nvidia. Es crucial que la capa de runtime sea estable para permitir un cambio sencillo entre diferentes modelos sin grandes esfuerzos migratorios. En lugar de centrarse en un solo modelo, se debe construir alrededor de clases de modelos específicas para distintas tareas como codificación, memoria o visión. Esto permite tener una estrategia anti-dependencia de la nube, manteniendo opciones locales robustas. El ecosistema abierto actual muestra tendencias importantes con modelos que utilizan Mixture of Experts y son multimodales.

Personal RAG and private coding loops (12:30)

No existe un modelo único que domine todos los casos de uso, por lo que la clave es crear una mezcla de modelos especializados. En lugar de elegir un solo chatbot, se debe construir una caja de herramientas con diferentes capacidades. Esto incluye modelos pequeños para ciclos rápidos y modelos más grandes para tareas complejas o razonamiento profundo. Es crucial integrar componentes locales como modelos de embedding para memoria semántica privada, Whisper para transcripción rápida y visión local para extracción de datos. La estrategia principal es poseer el entorno de ejecución local y utilizar servicios en la nube solo en casos excepcionales.

Meeting capture without audio leaving the machine (15:00)

La decisión arquitectónica más importante en la IA personal es gestionar una memoria duradera que pertenezca al usuario, no al proveedor del modelo. Los sistemas de IA personales requieren almacenamiento externo para notas, documentos y estados de proyectos a largo plazo. Esto invierte el modelo tradicional donde el servicio en la nube busca poseer tu memoria. El orador presenta Open Brain como un sistema de memoria de código abierto que utiliza un enfoque híbrido SQL y de embeddings. Esta herramienta permite categorizar hechos estructurados mientras maneja múltiples documentos interconectados. Alternativas incluyen Obsidian para grandes volúmenes de documentación o Postgres para datos más cuantitativos. La propiedad clave es que tu conocimiento debe seguir existiendo incluso si la aplicación de IA desaparece.

Three buyer profiles: knowledge worker, maximalist, builder (17:30)

La elección de la base de datos es fundamental, siendo Postgres con pgvector el estándar avanzado y SQLite vec una opción ligera para uso personal. El éxito en la recuperación depende del pipeline, ya que diferentes tipos de datos requieren estrategias de manejo distintas como las transcripciones o el código. Es vital mantener los datos brutos y los embeddings separados en la base de datos para asegurar la capacidad de reconstrucción ante fallos. Los servidores MCP permiten que herramientas externas consulten la memoria, pero deben ser sistemas bien gestionados con permisos y límites definidos. La inteligencia local debe configurarse como un sistema intencional, no simplemente una colección de herramientas locales sin estructura. Además, es crucial que el modelo resida en una interfaz cómoda donde se realice el trabajo, ya que un buen motor sin superficie de usuario será abandonado rápidamente.

Mac Studio vs DGX Spark vs dual RTX 5090s (20:00)

El trabajo con agentes de código requiere un ciclo de planificación que integre el modelo, herramientas, repositorio y contexto. Las interfaces auxiliares como lanzadores y comandos son cruciales para la operatividad del LLM en el día a día. El principio clave es tener muchas superficies (editor, notas, terminal) que apunten a una única pila subyacente de memoria local. Esto permite construir una memoria institucional personal indexando documentos privados mediante sistemas RAG. La voz local se vuelve viable gracias a herramientas como Whisper para la transcripción y modelos locales para el entendimiento. Los asistentes de código privados pueden realizar tareas complejas como refactorización y generación de pruebas, complementando las capacidades de los modelos frontera. El enfoque debe pasar de solo ejecutar un modelo localmente a controlar flujos de trabajo personales avanzados.

The personal AI computer is a routing system (22:30)

La computadora de IA personal funciona como un sistema de enrutamiento para decidir dónde realizar tareas específicas. Los modelos locales son suficientemente buenos para muchos bucles agenticos, permitiendo funciones como la captura y resumen de reuniones sin que el audio salga de la máquina, garantizando privacidad y ahorrando costos. La inferencia local es más económica para agentes de larga duración, ya que las APIs en la nube son caras. Sin embargo, la investigación profunda y la síntesis compleja aún requerirán modelos frontera. Para un trabajador del conocimiento enfocado en lo local, se recomienda empezar con una Mac Mini M4 Pro o Mac Studio M4 Max. Este usuario puede mantener privacidad y velocidad utilizando herramientas locales como Ollama y LM Studio, mientras conserva una suscripción a un modelo de vanguardia para tareas difíciles.

Memory needs to be cumulative but auditable (25:00)

El uso de hardware potente como Mac Studio o DGX Spark permite ejecutar trabajo crítico con soberanía y cumplimiento total. Los desarrolladores se benefician localmente al usar GPUs duales para manejar tareas repetitivas y privadas que justifican la inversión en infraestructura propia. La computadora de IA personal funciona como un sistema de enrutamiento, manteniendo lo privado y repetitivo localmente mientras envía trabajos complejos a la nube. El objetivo principal no es solo el ahorro de costos, sino acumular conocimiento continuamente. Este sistema se convierte en una capa operativa sobre tu trabajo, donde los datos fuente son la verdad permanente que se expande con el tiempo.

Why cloud AI should be a visitor to your system (27:30)

El objetivo principal es construir un sistema de memoria institucional que evite el encierro en aplicaciones propietarias de IA. Para lograrlo, se necesita una capa de cómputo subyacente con interfaces abiertas, como endpoints locales compatibles con OpenAI y protocolos de contexto de modelo. Es crucial tratar las herramientas del sistema como permisos, controlando estrictamente el acceso de los agentes para limitar la superficie de ataque. La memoria debe ser acumulativa pero también auditable, permitiendo inspeccionar, rastrear y eliminar información almacenada. Se recomienda adoptar una experiencia híbrida que utilice modelos en la nube cuando sea necesario. El propósito final no es rechazar toda IA en la nube, sino poseer el sustrato al que estos modelos puedan conectarse.

Through the looking glass: the questions you start asking (30:00)

La construcción de una pila personal permite que la inteligencia artificial en la nube actúe como un visitante y no como el sistema dominante. Al tener esta alternativa local, surgen preguntas críticas sobre por qué las aplicaciones necesitan subir datos o por qué los asistentes pierden memoria al cerrar pestañas. El argumento del AI local no se trata de vencer a la nube, ya que esta seguirá siendo relevante para modelos avanzados. En cambio, poseer el resto de la pila fortalece el caso del AI personal. Se debe utilizar el modelo fronterizo en la nube como un especialista contratado para tareas específicas. Esto permite evitar alquilar servicios esenciales y mantener el control sobre el flujo de trabajo diario. El ordenador de IA personal es una apuesta estratégica por una inteligencia más cercana al trabajo, no solo un pasatiempo nostálgico.

Your computer, your AI (31:30)

El mensaje central es que la computadora debe ser una herramienta personal y local para funcionar como tu propia inteligencia artificial. No necesita ser la máquina más potente del mundo, sino simplemente tu equipo de trabajo. El orador impulsa a los usuarios a adoptar un modelo de desarrollador local-first y prosumer. Para quienes deseen construir este ecosistema, ofrece recomendaciones detalladas en su Substack. También menciona Open Brains como una forma accesible de empezar a controlar la parte de memoria de tu pila computacional. La conclusión es que el usuario debe sentirse dueño de su destino digital. Es crucial evitar que los agentes y LLMs basados en la nube controlen los parámetros de inteligencia a largo plazo en la vida personal.

Generado con algoritmo v1-chunked · modelo google/gemma-4-e4b · 2026-05-03T12:07:28Z

Transcripción

[0:00] The strangest thing about AI right now
[0:01] is that it's making the computer on your
[0:03] desk important again. For the last 15
[0:06] years, the story of personal computing
[0:08] was basically the story of the computer
[0:09] disappearing. Your files moved into
[0:11] someone else's cloud, your apps became
[0:13] browser tabs, your storage became a sink
[0:15] of some sort, your OS became a launcher
[0:18] for other people's infrastructure. And
[0:20] for a lot of software, that seemed fine.
[0:22] It was convenient. It was maybe the
[0:24] right trade at the time. But agents are
[0:26] changing the direction of travel for
[0:28] compute because a useful agent doesn't
[0:30] just answer a question. It wants to
[0:32] touch the work. It wants to read the
[0:33] file and inspect the folder and run the
[0:35] test and edit the spreadsheet and search
[0:37] your notes and open the browser and
[0:39] remember the decision you made and try
[0:40] again when the first attempt fails. So,
[0:42] the more useful the agent becomes, the
[0:44] more it starts reaching back toward the
[0:46] oldest primitives of computing, files
[0:48] and processes and permissions and memory
[0:51] and local state and execution. That's
[0:53] why the personal AI computer matters.
[0:56] Now, a quick caveat up front because I
[0:57] talk a lot about frontier models on this
[0:59] channel and I'm going to keep doing
[1:01] that. The best cloud models are
[1:02] incredibly useful and one of the most
[1:04] important trends is that they're moving
[1:06] closer to our personal computers, not
[1:07] farther away. So, Codex, Cloud Code, and
[1:10] the whole class of coding agents matter
[1:12] precisely because a cloud model can now
[1:14] interact with your repo, your terminal,
[1:16] your files, and your tools on the
[1:18] machine right in front of you. So, the
[1:19] argument here is not cloud is bad, local
[1:22] is good. The argument is that as AI
[1:24] reaches deeper into the personal
[1:25] computer, the ownership question for you
[1:28] gets sharper. If models are going to
[1:29] touch your files and remember your work
[1:31] and call your tools and sit inside your
[1:33] workflows, there is still room, maybe
[1:35] more room for a stack that is all yours.
[1:37] And that stack matters because some of
[1:39] the most valuable AI work is not the
[1:41] most difficult work in the abstract.
[1:44] It's not the work that takes a cloud
[1:45] model at the very edge of the frontier.
[1:48] It's the work that is closest to your
[1:49] own context, your notes, your meetings,
[1:52] your drafts, your unfinished projects,
[1:53] your weird folder system. And the
[1:55] question for you becomes which parts of
[1:58] that should you keep renting and which
[2:01] parts should you own, and how do you
[2:03] start to intentionally think about that
[2:05] as models keep getting better and that
[2:08] workflow divide starts to change?
[2:10] Because even a few months ago,
[2:12] open-source models could not do a lot of
[2:16] what I just described at all. And now,
[2:18] they still aren't as good as the
[2:19] closed-source frontier models and you
[2:21] still can't give them as much messy work
[2:23] as say ChatGPT 5.5 by any stretch, but
[2:26] they're getting a lot better and it's
[2:27] worth thinking about at least for some
[2:29] of your workflows, especially if you
[2:31] value privacy or have highly
[2:33] confidential information on your
[2:34] computer. So, by the end of this video,
[2:36] I want you to have a mental model for
[2:37] the whole personal AI stack, not just
[2:40] which GPU should I buy, not just which
[2:42] model is best this week, but the actual
[2:44] stack, the machine, the runtime, the
[2:46] models, the memory, the apps, and the
[2:48] workflows that make local AI worth
[2:50] owning in the first place. Because the
[2:52] biggest mistake you can make is buying a
[2:54] really fancy computer whose only job is
[2:56] to run benchmark prompts or to do your
[2:58] emails, which is what so many people do
[3:00] with their Mac minis and open claw. The
[3:02] best version of a personal computer is a
[3:04] lot more compelling than that. It's
[3:07] building a durable place where AI can
[3:09] attach to the rest of your computing
[3:10] life and you still have privacy. There's
[3:12] a historical echo here that I think is
[3:14] easy to miss. Before the personal
[3:16] computer, the dominant model was
[3:18] actually time-sharing. You rented
[3:19] compute on someone else's mainframe. You
[3:22] waited in queues, you worked inside
[3:24] rules set by an operator you would never
[3:26] meet. The first personal computers did
[3:28] not beat that mainframe on raw power.
[3:31] They won because they collapsed the
[3:33] distance between the person and the
[3:34] machine. AI is creating a similar
[3:37] opening. Frontier models are still
[3:38] better at the hardest tasks and they're
[3:40] going to stay better for a while, but
[3:42] most personal work is not a moonshot
[3:44] benchmark. Most personal work is messy
[3:47] and it's repeated. It's not too huge.
[3:49] It's private and it's context-heavy.
[3:51] It's like, what did we decide here in
[3:53] the meeting? Please find this draft.
[3:56] Look in this repo and explain why the
[3:57] test is failing or can you make a
[4:00] follow-up memo or help me do a
[4:02] journaling program? All of that work
[4:04] benefits from the model being in your
[4:06] files, your tools, your memory, and the
[4:08] places where you're already doing
[4:10] personal computing. When all of that
[4:11] gets separated out into the cloud, it
[4:13] gets harder for the AI to touch all of
[4:15] the files and folders you want that you
[4:17] need taken care of on a single computing
[4:21] space. And frankly, that's why a lot of
[4:24] enterprise workflows involve a lot of
[4:26] harnesses that tie a cloud model into a
[4:30] local memory file system attached to
[4:32] Azure or attached to AWS. They're
[4:34] essentially doing the grown-up
[4:36] enterprise version of exactly what I'm
[4:38] describing here for a company. It's the
[4:40] same principles. You want to get the
[4:42] model close to the work it needs to do.
[4:44] And if you want to go local, the open
[4:46] weight ecosystem is moving fast enough
[4:48] now that this is no longer a theoretical
[4:50] conversation. Meta's practical open
[4:52] weight line is no longer just about the
[4:54] old Llama 3 story. Llama 4 Scout and
[4:56] Llama 4 Maverick have moved that Llama
[4:59] lineage into mixture of experts models
[5:01] where the important question is no
[5:02] longer how big is the model, but how
[5:04] much of the model fires for each token.
[5:06] Open AI has GPT-OSS-20B
[5:09] and GPT-OSS-120B,
[5:11] which are open weight reasoning models
[5:13] under Apache 2.0. They're not ChatGPT.
[5:16] They're not models you call through the
[5:17] normal OpenAI API. They're weights you
[5:20] run on infrastructure you control. Qwen
[5:22] has become one of the most important
[5:23] local model families for agents, for
[5:25] coding, for multilingual work, and for
[5:27] tool use. Google's Gemma 4 pushed
[5:29] serious capability down into smaller
[5:31] local models under a more permissive
[5:33] license. It's designed for open claw.
[5:35] Mistral's newer open models fill in both
[5:37] large frontier cell deployments and
[5:39] efficient local ones. Now, in April 24,
[5:41] DeepSeek previewed V4 with Pro and Flash
[5:44] variants, which is a good reminder that
[5:45] any model list you make today, it starts
[5:47] aging right away, right? That's the
[5:49] point. The model list is not the durable
[5:51] thing. The durable thing is the stack.
[5:53] If you build this right, you're not
[5:54] buying a single model appliance, you're
[5:56] building a local substrate that you can
[5:58] evolve over time. New models can drop
[6:01] in. New runtimes can replace old ones.
[6:03] New memory stores can be added. New
[6:04] agents can call the same tools. New
[6:06] interfaces can show up without taking
[6:08] your knowledge base with them. The
[6:09] personal AI computer should not be a
[6:11] sealed box that does just one trick. It
[6:13] should be a place where the rest of AI
[6:15] can connect to the rest of computing.
[6:17] So, start with the least glamorous part,
[6:19] the hardware. This is where people get
[6:20] trapped because they want one universal
[6:22] answer. Mac or Nvidia, the CUDA tower or
[6:25] the DGX Spark, buy now or wait. There
[6:27] isn't only one answer because local AI
[6:29] is constrained by memory capacity,
[6:31] bandwidth, accelerator support, software
[6:34] maturity, cooling, power, noise, and
[6:36] that annoying one, what you do every
[6:37] day. So, the better question is not what
[6:39] is the best AI computer, period. The
[6:41] better question is what local workload
[6:43] are you trying to own? If you're
[6:44] learning the stack, if you're running
[6:46] private document search and doing local
[6:48] writing and local coding assistance and
[6:49] maybe transcribing audio, the boring
[6:51] answer is that a recent Mac with enough
[6:53] unified memory is enough. A Mac mini
[6:55] with M4 Pro and 64 gigs is a great entry
[6:58] point. A Mac Studio becomes interesting
[7:00] when you want 128 gigs or 256 or even
[7:03] more, 512 gigs of unified memory. The
[7:06] Mac advantage is not raw tensor
[7:08] throughput here. The advantage is
[7:10] unified memory and low noise and power
[7:12] efficiency and the fact that the machine
[7:13] feels like a computer instead of a
[7:15] project. Now, this is the CUDA path. An
[7:17] RTX 5090 gives you 32 gigs of GDDR7, say
[7:21] that five times fast, and excellent
[7:23] throughput. Two of them gives you 64
[7:25] gigs across cards, but that's not one
[7:28] clean 64 gig pool of memory, right? The
[7:30] payoff is speed and ecosystem support.
[7:32] And so, you're dealing with a cost of
[7:34] drivers, of heat, of power, sharding
[7:36] maybe, maintenance. So, you have to
[7:38] think that through, right? And then
[7:40] there's the Nvidia DGX Spark, which is
[7:41] the appliance version of the Nvidia
[7:43] path. You get a Grace Blackwell chip on
[7:45] the desk, you get 128 gigs of coherent
[7:48] unified memory, you get Nvidia's
[7:50] software stack and a product story
[7:51] around local inference and fine-tuning
[7:53] instead of just a parts list. That
[7:55] doesn't mean it beats every custom rig.
[7:57] It means it packages the Nvidia stack in
[7:59] a way that may be worth paying for if
[8:01] you want a CUDA-native local AI without
[8:03] building the tower yourself. AMD's Strix
[8:05] Halo systems are kind of the value
[8:07] wildcard here, right? The hardware story
[8:09] is attractive, the software story is
[8:10] still less mature than CUDA and less
[8:13] frictionless than Apple silicon. Which
[8:15] brings us back to the real buying rule.
[8:17] Don't buy for the biggest model you read
[8:19] about. Buy the thing you're going to run
[8:20] daily. If the work is private writing or
[8:23] notes or documents or meetings, you want
[8:25] to buy memory and simplicity. If the
[8:26] work is coding agents and throughput,
[8:28] buy CUDA and just accept the
[8:30] maintenance. If the work is long context
[8:32] personal memory, buy storage, buy
[8:34] unified memory, buy a real database,
[8:36] right? If you're just experimenting,
[8:38] start with what you own. The box needs a
[8:40] job before it arrives, so do that work.
[8:43] Once the machine exists, the next
[8:45] question is whether the software makes
[8:47] it feel like a tool or just a tax on
[8:49] your time. And this is where runtime
[8:51] really matters, the software that loads
[8:53] the weights, that serves the inference,
[8:54] that handles quantization, that exposes
[8:56] APIs, that manages batching, and that
[8:58] decides whether your expensive hardware
[9:00] is actually being used well. Most people
[9:03] underestimate this layer because it
[9:04] isn't as exciting as the model name. But
[9:06] runtime is the difference between local
[9:08] AI feeling like a normal part of your
[9:10] computer and local AI feeling like a
[9:12] weekend that you just have never had a
[9:14] chance to recover from. The foundation
[9:16] underneath a lot of this is a tool
[9:18] called llama.cpp. Even if you never call
[9:20] it directly, you benefit from it all the
[9:22] time if you run your own stack. It
[9:24] helped make GGUF, the common local model
[9:27] format. It runs across your CPU, across
[9:30] Apple Metal, across CUDA, across Vulkan,
[9:32] and more. And for most people, the
[9:34] runtime on top of that should still be
[9:35] Ollama. It's not always the fastest or
[9:37] the most configurable, but it gives you
[9:39] a clean command-line interface, a local
[9:41] server, a simple model registry, and an
[9:43] OpenAI-compatible surface that other
[9:45] tools can talk to. That makes local
[9:47] inference feel normal, especially if
[9:49] you're used to cloud models. And just a
[9:51] quick note on all of the technical terms
[9:53] I'm using, I know that I'm using a bunch
[9:55] of very specific terms in this video,
[9:58] don't be scared by them. If you want to
[10:00] build your own local stack, you really
[10:02] can start with a Mac Mini, and I'm going
[10:04] to give you a complete teardown across
[10:06] multiple degrees of complexity at the
[10:08] end of this video to help you understand
[10:11] which approach you want to take
[10:12] depending on the workloads you're going
[10:13] after. So, don't let the technical
[10:15] terminology scare you. And in fact, you
[10:17] can load the transcript from this video
[10:19] into your AI of choice and have it
[10:22] explained to you what each of the
[10:24] technical terms I'm mentioning mean. So,
[10:26] let's keep moving. If you want to go
[10:27] with a more sophisticated runtime, LM
[10:29] Studio is a polished workbench for
[10:31] testing models and quantization. If you
[10:33] want to go with something Apple native,
[10:35] MLX matters on Apple silicon because
[10:37] it's a more native performance path. And
[10:39] if you're serving real workloads on
[10:40] Nvidia hardware, vLLM is where the
[10:43] conversation starts to really up-level,
[10:45] right? It handles batching, OpenAI
[10:47] compatible serving, and enough
[10:48] throughput for a team or an internal
[10:50] product. Beyond that, you can tackle SG
[10:52] Lang or TensorRT-LLM or an even Nvidia
[10:55] and NeMo. Those are all for serious
[10:56] deployment tiers. That's where you get
[10:57] into latency, structured generation,
[10:59] agents, and serving economics that
[11:01] enable you to justify the complexity of
[11:04] your build because of how much you're
[11:06] getting done. But, the practical default
[11:07] is simple. Ollama for daily use, LM
[11:10] Studio for evaluation, maybe MLX if
[11:13] you're tackling the Mac side of things,
[11:15] vLLM when serving becomes
[11:16] infrastructure, and that deeper Nvidia
[11:19] stack when you've committed to CUDA.
[11:20] Notice what happened here. We haven't
[11:22] picked the model yet. That's
[11:23] intentional. If the runtime layer is
[11:25] healthy, models become very swappable.
[11:28] If the runtime layer is brittle, every
[11:30] new model becomes a migration effort.
[11:32] It's a lot of pain. Now, the model layer
[11:35] is where the yelling in the discourse
[11:37] gets loudest and also where it ages out
[11:38] the fastest. So, I would not build a
[11:41] personal AI computer around a single
[11:43] model name. I would build around model
[11:46] classes for particular workloads. So,
[11:48] for example, you probably want a fast
[11:50] local model for cheap calls, a stronger
[11:52] local generalist model, a coding model
[11:55] if that's what you're into, an embedding
[11:56] model for memory, a speech model, maybe
[11:58] a vision model, and of course, a
[12:00] frontier cloud fallback for the work
[12:02] that still deserves it if that's what
[12:03] you're willing to do. So, the personal
[12:05] AI computer that I'm describing here is
[12:07] not necessarily anti-cloud, it's just
[12:10] anti-dependence.
[12:11] You don't want to be dependent on the
[12:12] cloud models. And for general work, the
[12:14] local landscape now has real choices.
[12:17] Llama 4, Scout, and Maverick are
[12:19] important because they show where the
[12:20] open ecosystem is headed. They have
[12:22] mixture of experts models. It's a
[12:24] multimodal approach, longer context,
[12:26] more deployment nuance there. GPT-OSS
[12:29] matters because OpenAI put permissively
[12:31] licensed reasoning models out into the
[12:33] self-hosted world. Qwen matters because
[12:35] it's become a default family for lots of
[12:37] agents, for coding, for multilingual
[12:39] work, and for tool use. Gemma matters
[12:41] because Google is pushing very capable
[12:43] local models down to smaller sizes
[12:45] designed specifically for open claw type
[12:47] applications. Mistral matters because it
[12:49] keeps offering serious open weight
[12:51] alternatives with a strong enterprise
[12:53] and deployment story. But, the most
[12:54] important takeaway here is this. There
[12:56] is no one right model that wins at all
[12:59] the use cases. Part of what you're doing
[13:01] when you set up a strong personal
[13:03] computer for AI is you're asking
[13:05] yourself, "What is the mixture of models
[13:07] I need?" And that's what I'm looking to
[13:08] give you is the sense of choices and the
[13:10] rationale you'd use to make those
[13:12] choices. For example, for coding, you
[13:14] don't want one model doing everything.
[13:16] You want a small autocomplete model, a
[13:18] repo-aware editor model, and a deeper
[13:21] reasoning model for architectural
[13:22] changes, for debugging, for migrations.
[13:25] If you're doing docs, you probably want
[13:27] to think about an embedding model and
[13:28] how you handle embeddings so that you
[13:30] can retrieve semantic memory correctly.
[13:34] Qwen embedding models are good here.
[13:35] There's other options that as well.
[13:37] Whatever fits your stack. Embeddings are
[13:39] very cheap to run. They're easy to
[13:40] cache, and they're central to privacy if
[13:42] you value a private set of core
[13:45] documents that don't go to the cloud.
[13:47] You know, if your documents end up
[13:49] leaving your machine just to become
[13:50] vectors, you've missed one of the
[13:52] easiest wins in local AI. If we're
[13:54] talking about speech, Whisper is still a
[13:56] reference point. Local transcription is
[13:58] fast and private, and if you own the
[13:59] hardware, it's very economical. For
[14:01] vision, local models are finally good
[14:03] enough for document screenshots, for
[14:05] chart extraction, not for all visual
[14:07] reasoning, but for a lot of personal
[14:09] media search and work, and that belongs
[14:11] in your stack now. Ultimately, your
[14:12] model portfolio should feel less like
[14:14] picking your favorite chatbot and a lot
[14:17] more like building a tool cabinet. A
[14:18] small model for fast loops, bigger
[14:20] models for hard local work, a
[14:22] specialized models like I've been
[14:23] describing for various aspects of code
[14:26] editing, code production, media, and
[14:28] then a cloud model for the frontier
[14:29] cases. The principle should be you own
[14:32] the runtime, and you only rent the cloud
[14:34] model in exceptional cases. Now, if
[14:36] you're wondering, "Do I have to do all
[14:38] of this? This feels like a lot of work.
[14:40] Nate, can I just use a cloud model?" The
[14:42] answer is absolutely you can. And for
[14:44] many people, that's going to be the
[14:46] answer. But, I know a lot of folks in my
[14:47] audience who value the privacy that
[14:50] comes with their own local stack, and I
[14:51] want you to have the tools to be able to
[14:54] build that stack in a way that aligns
[14:56] with your workflows because a lot of the
[14:58] videos that I see are really useful for
[15:01] building your own personal computer, but
[15:02] they're not useful for helping you
[15:04] decide what stack you should be on,
[15:06] which is arguably the more important
[15:08] thing to do. Figure out the workflows
[15:09] you need to go after, and then build the
[15:12] stack that fits. And that's really my
[15:13] focus here, and I'm giving you
[15:14] essentially lots of choices that you can
[15:16] dig into. And if you want a full punch
[15:18] list, yes, it's absolutely going to be
[15:20] on the Substack. Getting back to our
[15:21] stack, the layer that actually turns
[15:23] this from a toy into infrastructure is
[15:25] memory. And that's the part that I think
[15:27] people tend to underbuild. The model is
[15:29] stateless, but your life isn't
[15:31] stateless. Your life remember You you
[15:32] remember things. You go through your
[15:34] life with durable memory. Every useful
[15:36] personal AI system also needs durable
[15:38] memory outside the model. It needs notes
[15:39] and documents and transcripts and email
[15:41] and tasks and calendar events and code
[15:43] decisions and research and preferences
[15:45] and a sense of long-running project
[15:47] state. And so, your most important
[15:48] architectural decision is that this
[15:50] memory should belong to you, not the
[15:52] model provider. And that's why I built
[15:54] Open Brain. Open Brain is an
[15:56] open-source, GitHub available memory
[15:59] system that allows you to build a
[16:01] SQL-driven database approach to memory
[16:04] with an easy MCP server attached, but
[16:06] that also recently we've added an
[16:08] embedding management system for. So, you
[16:11] can do almost an Andrej Karpathy-like
[16:13] hybrid memory system where you have the
[16:15] Karpathy approach to memory involving
[16:17] lots of different interlinked and
[16:19] interweaved embeddings that help you
[16:21] make sense of multiple documents at
[16:22] once, and also a SQL approach that lets
[16:25] you store and categorize facts in a neat
[16:28] way. And so, that's something to think
[16:29] about. You obviously don't need to use
[16:30] Open Brain to solve for this, but I
[16:32] built it because I think that memory is
[16:34] very high leverage, and it's important
[16:35] to manage your own memory in the age of
[16:37] AI so you're not beholden to a
[16:39] particular cloud provider. After all, in
[16:41] the cloud-first model, the AI service
[16:43] really wants to own your memory, and you
[16:45] visit your memory. Whereas in the
[16:46] personal compute model that I'm
[16:48] describing here, you own the memory, and
[16:50] the models come to you if you choose to
[16:52] rent them. And that inversion is the
[16:53] heart of the whole thing. The source
[16:55] material for your life, your memory,
[16:57] should live somewhere durable. If you
[16:58] don't want to go with Open Brain, you
[17:00] know what? You can go with Obsidian.
[17:01] It's a It's a default if you have a lot
[17:03] of docs. It won't work as well for lots
[17:05] of quantitative storage and facts. But,
[17:07] if most of your work is in docs, it will
[17:09] store it in markdown in folders you can
[17:11] control, and you can absolutely use
[17:13] Obsidian. I know a lot of people who do.
[17:14] Plain markdown plus Git is like the
[17:16] boring immortal version. For structured
[17:18] work, you might go with Postgres, which
[17:20] might be better than your notes. That's
[17:21] why I built Open Brain that way. But,
[17:23] the key property for memory overall is
[17:24] very simple. Your knowledge keeps
[17:26] existing even if the AI app disappears.
[17:29] Then, you need retrieval. For many
[17:30] serious systems, Postgres with pgvector
[17:33] is the grown-up default because it lets
[17:34] you keep relational data and metadata
[17:36] and permissions and vector search all in
[17:38] one place. SQLite with SQLite vec is the
[17:41] lightweight personal version. It's just
[17:43] a single file. It's easy to back up.
[17:44] It's easy to understand. Now, the part
[17:46] almost everybody gets wrong is on the
[17:48] pipeline side. Good retrieval is not
[17:50] throw every document into chunks and
[17:53] hope. By the way, if you're wondering,
[17:54] "Wow, this sounds complicated," Open
[17:56] Brain does take care of a lot of the
[17:57] chunking strategy, a lot of the
[17:59] retrieval strategy, a lot of the input
[18:01] and classification strategy for you. And
[18:03] so, that's an option for you if you if
[18:05] you'd like it. But, the point here is
[18:06] that different kinds of data need
[18:08] different memory handling, and you have
[18:10] to think about that in advance. Like,
[18:11] PDFs need different handling than
[18:13] markdown. Meeting transcripts need
[18:15] speakers. They need timestamps. Code
[18:17] needs symbol-aware indexing. Notes need
[18:19] links preserved. You need to know what
[18:20] changed, what was indexed, and what
[18:22] should be regenerated when a better
[18:24] embedding model comes along, which is
[18:25] why it's so important to have your raw
[18:27] data and your embeddings in your
[18:28] database separately. Because then you
[18:30] can always rebuild it if something goes
[18:32] wrong. Most of the time when something
[18:33] goes wrong with a memory system, it's
[18:35] not the model itself, it's the pipeline
[18:38] that's the issue. And you have to think
[18:39] about how the pipeline affected your
[18:41] chunking strategy, for example, or how
[18:43] it affected your ability to handle
[18:45] retrieval, etc. And then there's the
[18:47] access there where MCP becomes
[18:48] interesting. Open Brain has MCP. An MCP
[18:51] server in front of your database can let
[18:52] Claude or ChatGPT or any custom tool you
[18:54] want query that memory. That is the
[18:57] right direction. But, don't assume just
[18:59] because you have an MCP in front of
[19:01] something, you can treat it like magic.
[19:02] MCP servers are just executable tool
[19:04] surfaces. They still need permissions
[19:06] and logging and secrets management and
[19:08] and boundaries to work well. Your
[19:10] personal AI computer should not just be
[19:12] a pile of local tools that any model can
[19:14] call for anything. It should be an own
[19:16] system that you set up intentionally.
[19:18] That's the difference between useful
[19:20] local intelligence and just giving the
[19:21] model the keys to the vehicle and hoping
[19:23] it all goes well. The next failure mode
[19:25] is interface. A great runtime with no
[19:27] comfortable surface is just a setup that
[19:29] you're going to stop using after a week
[19:31] cuz you're not in it. And that's why
[19:32] local AI can't just live in the
[19:34] terminal. The model has to live where
[19:36] your work lives. You can use something
[19:38] like Open Web UI for chat. Anything LLM
[19:41] is worth considering when you really,
[19:42] really want to focus on retrieval
[19:44] heavily. LM Studio is good for direct
[19:46] model work. You just want to pick the
[19:49] tools for the interface that feel like
[19:51] they align with your current workflow.
[19:53] That's the principle. For for editors,
[19:55] continue is one of the obvious bridges
[19:56] because it can point at OpenAI
[19:58] compatible endpoints. Aider remains very
[20:00] good for terminal-based code editing,
[20:02] and there's a whole class of coding
[20:03] agents that are converging on a very
[20:05] similar pattern that you'll want to work
[20:06] with, right? Model plus tools plus repo
[20:09] plus context in a in a planning loop.
[20:10] And that's really how it works, whether
[20:12] you're using a cloud model or a local
[20:14] model, if you're into coding. Now, for
[20:15] launchers and command surfaces, the
[20:17] things that get the models going, the
[20:19] boring tools matter more than you might
[20:20] think. Stuff like Raycast and Alfred and
[20:23] shortcuts and shell commands, small menu
[20:25] bar apps, an LLM command line interface.
[20:28] A personal AI computer basically
[20:29] shouldn't require you to open a chatbot
[20:31] just to talk to the LLM. You should be
[20:33] able to call it from your editor, from
[20:34] your notes, from your browser, from your
[20:36] finder. You get it, right? Anywhere
[20:38] you're in the computer, you should just
[20:39] be able to speak or to type, and you
[20:41] should be able to get the LLM. Voice is
[20:42] underrated here because hosted voice
[20:44] assistants trained everyone to expect
[20:47] disappointment over the last few years.
[20:48] But, local voice can be different now.
[20:50] Whisper handles transcription, a local
[20:52] or hybrid model handles intent and clean
[20:54] up and summarization and routing. And
[20:55] the interface principle then is not just
[20:57] install a bunch of AI apps, it's just
[20:59] speak what you're looking for, and it
[21:01] sticks what you're asking into a single
[21:03] stack underneath. The principle is many
[21:05] surfaces, one stack underneath. So, your
[21:08] editor, your note app, your browser,
[21:09] your launcher, your terminal, and your
[21:11] voice recorder, they those don't have
[21:13] separate memory layers, right? They
[21:14] should call into the same local runtime
[21:17] in the same memory layer, and they'll
[21:19] actually work well. This is the part
[21:21] that a lot of products aren't going to
[21:22] give you because their business model
[21:23] depends on owning the memory underneath
[21:25] the input channel. And so, you
[21:27] accumulate that memory inside a
[21:29] particular cloud model for meeting
[21:31] transcripts, right? And then you can't
[21:32] get it out again. The last layer that
[21:34] you should think about is where you want
[21:37] to put your workflows. And this is where
[21:39] you stop asking, "Can I run the model
[21:41] locally?" and you start asking, "What is
[21:44] the workflow I now control beyond the
[21:46] model itself?" If you're thinking about
[21:48] managing workflows, personal RAG or a
[21:51] personal memory system like I described
[21:52] earlier with Open Brain, that is still a
[21:54] clean first win. You can index your
[21:56] notes and your drafts and your PDFs, you
[21:58] can create a database. The value there
[22:00] is not generic search, it's that you
[22:02] actually develop a long-term
[22:04] institutional memory of your work over
[22:06] time. A frontier model might have read
[22:08] the public internet. It's not read the
[22:10] past few years of your meeting notes,
[22:11] and it shouldn't need to. Private coding
[22:13] is another obvious loop, right? A local
[22:15] coding assistant with repo access can do
[22:17] a lot more than auto complete these
[22:19] days, right? do refactoring, it can do
[22:21] test generation, it can do drafting. It
[22:23] may not be up to what frontier models
[22:24] can do on the code side, but it can do a
[22:26] lot. And you can keep frontier models
[22:28] for the hardest tasks. Again, I keep
[22:30] emphasizing this is not about a hard
[22:31] rule, it's just about choosing where you
[22:33] want to fight your battles. And local
[22:34] models now are good enough for a lot of
[22:36] the agentic loop to work by default on
[22:38] many of the simpler software problems
[22:40] out there. Meeting capture is another
[22:42] one. You have local Whisper plus a local
[22:44] summarizer, it means you can record and
[22:46] transcribe and summarize and extract
[22:48] decisions and create tasks and store
[22:50] that result in your memory layer. No
[22:51] audio ever leaves the machine, no per
[22:53] hour transcription bill. You can run
[22:55] that on every call for a year, and
[22:57] you're going to start to see things over
[22:58] time, right? Your decisions become
[23:00] searchable, your commitment that you
[23:01] make becomes something you can retrieve
[23:03] and look at, your recurring
[23:04] conversations become part of effectively
[23:06] a private institutional memory that you
[23:08] own. Long-running agents also start to
[23:10] make more economic sense when inference
[23:12] is local because cloud APIs they they're
[23:14] expensive, right? You might
[23:15] psychologically not want to run as many
[23:18] tokens because you don't want to pay for
[23:19] it. But, if you're just limited by the
[23:21] cost of electricity, you're going to be
[23:22] more inclined to set up really
[23:24] long-running agentic loops, which is
[23:25] exactly what we see with the open claw
[23:27] phenomenon where people set up local
[23:29] computers and they just have their
[23:30] agents always on. Research and synthesis
[23:32] are probably going to stay at least
[23:34] partially hybrid for a long time because
[23:36] local models can retrieve and organize
[23:38] and summarize and prep context, but
[23:40] frontier models are needed for hard
[23:41] synthesis type problem types, hard
[23:43] research in the same way that they're
[23:45] needed for very difficult coding
[23:47] problems. So, at this point, I think the
[23:49] buying decision becomes a lot clearer if
[23:51] you're going back to the stack you need.
[23:53] Imagine three people. One is a
[23:55] local-first knowledge worker. They
[23:56] write, they research, they code a little
[23:58] bit, they handle sensitive documents,
[24:00] and and maybe you want private AI
[24:01] without turning the home office into a
[24:03] complicated server room. That person
[24:05] should probably start with a Mac mini M4
[24:07] Pro with 64 gigs or maybe a Mac Studio
[24:10] M4 Max with 128 gigs if the budget
[24:12] allows. They'll use Ollama, LM Studio,
[24:14] maybe MLX, probably local embeddings or
[24:17] local memory system of some sort,
[24:19] Whisper, Open Web UI, Continue, and a
[24:21] very simple retrieval stack that maybe
[24:23] has an SQLite and Obsidian mixed in or
[24:25] something that has the markdown and
[24:26] something that has the database on the
[24:28] Open Brain side. It's not too
[24:29] complicated. I know that sounds like a
[24:31] lot of names, but you really can load
[24:33] this into an LLM, and it will literally
[24:35] give you a punch list of what you need
[24:37] to get it in what order you need to set
[24:38] it up. And I have a whole write-up on
[24:40] Substack, too. And that person can still
[24:41] keep one frontier subscription or API
[24:43] account for the hard work. And it gives
[24:45] you a sane default if that's you, right?
[24:47] You get privacy, you get speed, you get
[24:49] ownership, you get enough capability for
[24:51] daily use without pretending the cloud
[24:52] is irrelevant. Another person, you maybe
[24:54] you're a all local maximalist, right?
[24:56] You're not hearing this desire for
[24:58] cloud, you're like, "No, no, no, I've
[24:59] got to have privacy." So, you want
[25:00] privacy, you want compliance, you want
[25:02] sovereignty, you want to run your core
[25:04] work without a dependency. At that
[25:05] point, you're looking at a high memory
[25:07] Mac Studio or a DGX Spark or a similar
[25:09] serious workstation. You have to have
[25:11] something that gives you full control,
[25:13] right? You might even look at a mini
[25:14] Nvidia stack. The memory layer would be
[25:16] something like Postgres with PG Vector.
[25:18] Tools would probably sit behind MCP with
[25:20] permissions and audit logs. And I've got
[25:22] to be honest, this is not the cheapest
[25:24] build, right? But, it's the cleanest
[25:25] expression of the local thesis, right?
[25:27] Local modes, local memory, local tools,
[25:29] local workflows. And then, you can just
[25:31] go to town, right? Last but not least,
[25:33] there's the local-first builder. A
[25:35] developer or a small team building
[25:37] software, running agents, testing
[25:39] products, or just trying to reduce cloud
[25:41] inference spend. That person probably
[25:43] cares more about CUDA throughput, about
[25:45] serving, about evals, and about
[25:46] repeatability. So, they might get dual
[25:48] RTX 5090s, workstation GPUs, DGX Spark,
[25:51] or maybe a mixed local-cloud GPU setup.
[25:54] The LLM for serving, Ollama for
[25:56] prototyping, TensorRT LLM or NeMo when
[25:58] deployment efficiency matters. The
[26:00] principles are simple here. Local models
[26:02] absorb development, they take care of
[26:04] private data, they provide opportunity
[26:06] to handle batch jobs and high volume
[26:07] inner loops, and those economics start
[26:09] to add up because you're handling it
[26:11] locally. Local inference does not have
[26:12] to replace every single hosted call to
[26:14] add value. It only needs to absorb
[26:16] enough of the repetitive, private, high
[26:18] volume work that you feel like you get
[26:20] your money's back on that purchase. And
[26:21] that's the key distinction. Ultimately,
[26:23] the personal AI computer is not a purity
[26:25] test, it's just a routing system. Some
[26:27] work stays local because it's private
[26:29] and it's cheap and it's repetitive or
[26:31] context-heavy. Some work is going to go
[26:32] to the cloud because it's rare and it's
[26:34] hard and it's high value or maybe it
[26:36] needs the frontier. The power comes from
[26:38] you deciding instead of just defaulting
[26:39] to what the cloud providers want. The
[26:41] long-term reason to build this stack is
[26:43] not cost savings, although the cost
[26:45] savings can be real. The deeper reason
[26:47] is compounding your knowledge over time,
[26:49] and that's why I talked about memory so
[26:50] much. Every project, note, meeting,
[26:53] decision, correction, preference, and
[26:54] workflow can become part of a memory
[26:56] system you own. Over time, the personal
[26:58] AI computer becomes less like a chatbot
[27:00] and more like an operating layer over
[27:02] your work. The model might change out
[27:03] every few months, the memory can get
[27:05] better every year, and that's why
[27:06] extensibility matters a lot, but
[27:08] fundamentally, the source data that
[27:10] you're storing on this system, the
[27:12] markdown notes, the PDFs, the
[27:13] transcript, the code repositories, the
[27:15] media files, they stay, they're a source
[27:17] of truth, and you can just continue to
[27:19] expand and improve your data set that
[27:21] you build off of that over time, whether
[27:22] you're building with embeddings or
[27:24] whether you're building a SQL database.
[27:25] However you decide to solve that
[27:27] problem, and I've got other videos on
[27:29] that, you can absolutely
[27:31] build a memory system that evolves over
[27:34] time and that gets better over time,
[27:36] that preserves your institutional
[27:38] memory, that preserves the workflows
[27:40] that you have. And the mission is simple
[27:42] here, right? Your goal would if you care
[27:44] about this is to not let a proprietary
[27:47] AI app capture you and become the only
[27:49] place your knowledge exists. I talk a
[27:51] lot about the idea that there are
[27:52] multiple good models out there. Well, we
[27:54] need an underlying compute layer that
[27:56] enables us to take advantage of that.
[27:58] So, build open interfaces, right? OpenAI
[28:00] compatible local endpoints let many apps
[28:02] talk to your models, you're not locked
[28:04] into local only, you can talk to cloud
[28:06] if you want. Model context protocol lets
[28:08] multiple clients talk to your tools and
[28:10] your memory. Postgres or SQLite keep
[28:12] retrieval from becoming trapped inside
[28:14] one product, it's a lot of the basis for
[28:15] Open Brain. Plain files and Git keep the
[28:17] whole thing very inspectable. Treat your
[28:19] tools as you use them on this system
[28:21] like permissions instead of just
[28:23] conveniences. That's a that's an
[28:24] important principle as you're thinking
[28:25] about the design. The more useful an
[28:27] agent becomes, the more you have to
[28:30] think about this because agents with
[28:31] access to shell permissions, access to
[28:34] payments, agents with access to serious
[28:36] parts of your computing stack are agents
[28:38] that need serious permissions to operate
[28:40] responsibly. So, you need to think ahead
[28:43] and ask yourself, "If I'm operating
[28:45] multiple agents on this machine, what is
[28:47] a responsible access pattern here?" A
[28:48] writing agent does not need shell
[28:50] access. A coding agent doesn't need my
[28:52] bank statements. A meeting summarizer
[28:54] doesn't need permission to delete files.
[28:56] Think about how you control the attack
[28:59] surface of your agents if you're going
[29:01] to do this, right? Otherwise, you'll
[29:02] have extensibility without boundaries,
[29:04] and you'll just be in trouble. You want
[29:06] to be in a position where you are
[29:07] managing the scope your agents have so
[29:10] that they are not irresponsibly
[29:12] permitted to do anything on the machine.
[29:14] Now, I've been emphasizing memory as the
[29:16] heart of this system, to give you a few
[29:17] tips there. Memory needs to be
[29:19] cumulative, but also auditable. The
[29:21] system should be able to learn from your
[29:22] work, and you should also be able to
[29:24] inspect what it stored, delete what's
[29:26] wrong, trace where a fact came from, and
[29:28] rebuild indexes when better embeddings
[29:30] arrive. Assume in general that you're
[29:32] going to persist a hybrid experience
[29:33] where you call the cloud and call these
[29:35] larger models sometimes. They'll
[29:36] continue to get better. In most cases,
[29:38] you're going to want that unless you are
[29:40] a very hardcore local compute-only
[29:42] person, in which case I've got a stack
[29:44] for you, and I I talked about it. But,
[29:46] for most of us, the point of a personal
[29:47] AI computer is not to reject every cloud
[29:50] model forever. The point is actually to
[29:52] positively own the substrate that cloud
[29:54] models or any other model can plug into
[29:56] it well. Cuz a frontier model can still
[29:58] be called for rare and hard and
[29:59] high-value work whenever you want. But
[30:01] this kind of setup allows cloud AI to be
[30:03] a visitor to the system, not dominant
[30:06] across the system as a whole. And by the
[30:07] way, if you're like, "No, no, no, I just
[30:09] want to use cloud models, Nate." That's
[30:10] fantastic. I talk about cloud models all
[30:12] the time. There's lots of future videos
[30:15] and past videos that are all about
[30:17] setting up cloud models and cloud agents
[30:19] on your system. And I'll keep making
[30:20] those cuz so many people need that
[30:22] fluency as well. Now, once you have your
[30:23] personal stack, the rest of the
[30:25] computing world starts to look a little
[30:26] different. You're through the mirror.
[30:28] You ask yourself, "Why does this app
[30:29] need to upload my draft to its server?
[30:31] Why does this agent want a token for my
[30:33] entire account? Why does this assistant
[30:35] lose its memory the moment I close the
[30:37] tab? Or why am I paying per interaction
[30:39] for a model that can handle this routine
[30:41] job on the box already sitting on my
[30:43] desk?" And those questions, they tend to
[30:46] only be visible once you actually go
[30:48] through that looking glass, you build a
[30:49] personal stack, and you have an
[30:51] alternative. That's what makes the
[30:52] questions that I described just now feel
[30:54] tangible and real. And this is where I
[30:56] think people get the local AI argument a
[30:58] little bit wrong. I hear a lot about
[31:00] beating the cloud. It's not about
[31:01] beating the cloud. The cloud frontier is
[31:03] going to keep mattering. It may matter
[31:05] more, not less, as the hardest models
[31:06] become more expensive to train and
[31:08] serve. But that actually strengthens the
[31:09] case for owning the rest of the stack.
[31:11] It lets you use the frontier model as
[31:13] the specialist. You don't make it your
[31:15] memory, your file system, your workflow
[31:17] engine, your operating layer. You hire
[31:18] it for the job it's best at, and you
[31:20] stop renting it the rest of your life.
[31:22] Your personal AI computer is then not
[31:24] really a nostalgia play. It's not a
[31:26] hobbyist retreat from the internet. It's
[31:27] a bet that intelligence becomes more
[31:29] useful when it's closer to work, when
[31:31] it's closer to the files, closer to the
[31:33] tools, closer to your memory, closer to
[31:35] the person, you, that's asking it to
[31:37] act. The machine on your desk has a job
[31:41] to do. That's the whole point of this
[31:42] video. It doesn't have to be the
[31:44] smartest computer in the world. It can
[31:45] just be your computer. It can just be
[31:48] your AI. And that's why I made this
[31:50] video. I want you to feel empowered to
[31:52] make an intelligent choice and say,
[31:54] "Actually, I do want that world of the
[31:56] prosumer. I do want an all-local world.
[31:58] I want a local-first developer model and
[32:00] developer machine stack." If that's you,
[32:02] you can head on over to the Substack.
[32:03] I've got a full punch list and build
[32:05] recommendations. I've also got a nice
[32:07] reminder and guide to Open Brains so you
[32:10] can dig into the memory side because
[32:11] there are lots of people that are just
[32:12] using Open Brain for the memory piece,
[32:14] and they're not going after the full
[32:15] hardware stack, and that's another way
[32:17] to put your toes in the water on owning
[32:18] part of your compute stack. Whatever
[32:20] your choice is, I just want you to feel
[32:22] comfortable and feel like you own your
[32:23] destiny, and like the AI agents and the
[32:26] LLMs out there that are cloud-provided
[32:27] don't get to run the long-term
[32:30] parameters of intelligence in your life.
[32:32] It's up to you, and it should be up to
[32:33] you.
[32:34] I'll see you next time.

← Volver al listado de vídeos

Scroll al inicio