Get the LinkedIn stats of Paul Iusztin and many other LinkedIn influencers by Taplio.
I am a senior machine learning engineer and contractor with 6+ years of experience. I design and implement modular, scalable, and production-ready ML systems for startups worldwide. My central mission is to build data-intensive AI/ML products that serve the world.

Since training my first neural network in 2017, I have two passions that fuel my mission:
- Designing and implementing production AI/ML systems using MLOps best practices.
- Teaching people about the process.

I currently develop production-ready Deep Learning products at Metaphysic, a leading GenAI platform. In the past, I built Computer Vision and MLOps solutions for CoreAI, Everseen, and Continental.

I am also the Founder of Decoding ML, a channel for battle-tested content on learning how to design, code, and deploy production-grade ML and MLOps systems. I write articles and posts each week on:
- LinkedIn: 29k+ followers
- Medium: 2.5k+ followers ~ https://medium.com/@pauliusztin
- Substack (newsletter): 6k+ followers ~ https://decodingml.substack.com/

If you want to learn how to build an end-to-end production-ready LLM & RAG system using MLOps best practices, you can take Decoding ML's self-guided free course:
- LLM Twin Course: Building Your Production-Ready AI Replica ~ https://github.com/decodingml/llm-twin-course

If you need machine learning solutions for your business, let's discuss! I am only open to full-remote positions as a contractor.

Contact:
- Phone: +40 732 509 516
- Email: p.b.iusztin@gmail.com
- Decoding ML: https://linktr.ee/decodingml
- Personal site & socials: https://www.pauliusztin.me/
Fine-tuning isn't hard. Here's where most pipelines fall apart: integrating it into a full LLM system.

So here's how we architected our training pipeline:

Inputs and outputs

The training pipeline has one job:
- Input: a dataset from the data registry and a base model from the model registry
- Output: a fine-tuned model registered in the model registry, ready for deployment

In our case:
- Base: Llama 3.1 8B Instruct
- Dataset: custom summarization data generated from web documents
- Output: a specialized model that summarizes web content

Pipeline steps
1. Load the base model → apply LoRA adapters
2. Load the dataset → format it using Alpaca-style instructions
3. Fine-tune with Unsloth AI on T4 GPUs (via Colab)
4. Track training and eval metrics with Comet
5. If performance is good → push to the Hugging Face model registry
6. If not → iterate with new data or hyperparameters

Most research happens in notebooks (and that's okay). So we kept our training pipeline in a Jupyter Notebook on Colab. Why?
- Let researchers feel at home
- No SSH friction
- Visualize results fast
- Enable rapid iteration
- Plug into the rest of the system via registries

Just because it's manual doesn't mean it's isolated. Here's how it connects:
- Data registry: feeds in the right fine-tuning set
- Model registry: stores the fine-tuned weights
- Inference service: serves the fine-tuned model solely through the model registry
- Eval tracker: logs metrics and compares runs in real time

The notebook is completely decoupled from the rest of the LLM system.

Can it be automated? Yes... and we're almost there. With ZenML already managing our offline pipelines, the training code can be converted into a deployable pipeline. The only barrier? Cost and compute. That's why continuous training (CT) in the LLM space is more of a dream than something you actually want to do in practice.

TL;DR: If you're thinking of training your own LLMs, don't just ask "how do I fine-tune this?" Ask:
- How does it integrate?
- What data version did I use?
- Where do I store the weights?
- How do I track experiments across runs?
- How can I decouple fine-tuning from deployment?

That's what separates model builders from AI engineers.

Full breakdown here: https://lnkd.in/de_ndNbQ
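To make the pipeline steps concrete, here is a minimal sketch of steps 1-5, assuming Unsloth, TRL, and Comet are installed, that the dataset already carries Alpaca-formatted `text` fields, and that the `your-org/...` identifiers are placeholders rather than the actual artifacts from the post (exact parameter names can drift across library versions).

```python
from unsloth import FastLanguageModel
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset

# 1. Load the base model and attach LoRA adapters.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,  # fits on a Colab T4
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)

# 2. Load the (already Alpaca-formatted) dataset from the data registry.
dataset = load_dataset("your-org/web-summarization", split="train")  # placeholder name

# 3-4. Fine-tune and report metrics to Comet (requires COMET_API_KEY in the environment).
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="outputs",
        dataset_text_field="text",
        per_device_train_batch_size=2,
        num_train_epochs=1,
        report_to="comet_ml",
    ),
)
trainer.train()

# 5. If the run looks good, push the adapters to the Hugging Face model registry.
model.push_to_hub("your-org/llama-3.1-8b-web-summarizer")      # placeholder repo
tokenizer.push_to_hub("your-org/llama-3.1-8b-web-summarizer")
```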
Fine-tuning should NEVER be the first step when building an AI system.

Here's the only time you should do it: when nothing else works.

But let's face it... most teams jump straight into fine-tuning. Why? Because it feels technical. Custom. Smart. In reality, it's often just unnecessary complexity.

Before you spend hours generating synthetic data and burning through GPUs, ask yourself three questions:
- Can I solve this with smart prompt engineering?
- Can I improve it further by adding RAG?
- Have I even built an evaluatable system yet?

If the answer to those isn't a solid "YES," you have no business fine-tuning anything.

I say this all the time: "You don't need your own model; you need better system design."
- Prompt engineering handles ~30-50% of cases
- RAG handles another ~30-40%
- Fine-tuning? Reserve it for the last ~10%, when the problem demands it

For example, in our work at Decoding ML, we only fine-tune when:
- The context window is too small for RAG to help
- The task requires domain-specific tone, behavior, or reasoning
- The system is mature enough to warrant the extra complexity

Anything sooner is overkill.

Thanks to Maxime Labonne for helping sharpen this thinking during our work on The LLM Engineer's Handbook (especially when mapping the tradeoffs between fine-tuning, prompting, and RAG).

Want to learn more? Check out Lesson 4 of the Second Brain AI Assistant course. Link in the comments.
90% of RAG systems struggle with the same bottleneck (and better LLMs are not the solution): retrieval.

And most teams don't realize it because they rush to build without proper evaluation.

Before I tell you how to fix this, let me make something clear: Naive RAG is easy. You chunk some docs, embed them, drop a top_k retriever on top, and call it a pipeline. Getting it production-ready? That's where most teams stall.
- They get hallucinations.
- They miss key info.
- Their outputs feel... off.

Why? Because the quality of generation is downstream of the quality of context, and naive RAG often pulls in irrelevant or partial chunks that confuse the LLM.

If you're serious about improving your system, here's the progression that actually works:

Step 1: Fix the Basics
These "table-stakes" upgrades outperform fancy models most of the time:
- Smarter chunking: dynamic over fixed-size; respect document structure.
- Chunk size tuning: too long = lost in the middle; too short = fragmented context.
- Metadata filtering: boosts precision by narrowing scope semantically and structurally.
- Hybrid search: combine vector and keyword retrieval.

Step 2: Layer on Advanced Retrieval
When the basic techniques aren't enough:
- Re-ranking (learned or rule-based)
- Small-to-big retrieval: retrieve sentences, synthesize larger windows.
- Recursive retrieval (e.g., LlamaIndex)
- Multi-hop and agentic retrieval: when you need reasoning across documents.

Step 3: Evaluate or Die Trying
There's no point iterating blindly. Do the following:
- End-to-end eval: is the output good? Use ground truths, synthetic evals, and user feedback.
- Component-level eval: does the retriever return the right chunks? Use ranking metrics like MRR, NDCG, and success@k.

Step 4: Fine-Tuning = Last Resort
Don't start here. Do this only when:
- Your domain is so specific that general embeddings fail.
- Your LLM is too weak to synthesize even when the context is correct.
- You've squeezed all the juice out of prompt and retrieval optimizations.

Fine-tuning adds cost, latency, and infra complexity. It's powerful, but only when everything else is dialed in.

Note: these notes are from a talk over a year old. And yet... most teams are still stuck at step 0. That tells you something: the surface area of RAG is small, but building good RAG is still an unsolved craft. Let's change that.

Want to learn to implement advanced RAG systems yourself? The link is in the comments.

Image credit: LlamaIndex and Jerry Liu
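As a concrete example of the "hybrid search" upgrade from Step 1, here is a minimal, framework-free sketch that merges a vector-search ranking and a keyword-search ranking with reciprocal rank fusion (RRF); the two input rankings and the k constant are illustrative assumptions, not something prescribed in the post.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one hybrid ranking.

    Each document scores sum(1 / (k + rank_i)) over the lists it appears in,
    so items ranked highly by either retriever float to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Illustrative rankings returned by two independent retrievers.
vector_hits = ["doc_7", "doc_2", "doc_9", "doc_4"]   # semantic similarity order
keyword_hits = ["doc_2", "doc_5", "doc_7", "doc_1"]  # BM25 / keyword order

print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
# doc_2 and doc_7 rank first because both retrievers agree on them.
```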
Everyone's building agents. But very few are building them for production...

And that's the gap we wanted to close with Lesson 2 of the PhiloAgents course.

Too often, agentic demos look impressive until you try scaling them. Then comes the hidden complexity:
- Orchestrating LLM calls
- Managing memory
- Debugging emergent behavior
- Building in retrieval without breaking the flow

That's why this lesson doesn't stop at toy demos. We show you how to build a real, production-ready RAG agent inside a gaming simulation: an agent that can impersonate philosophers, carry context-aware conversations, and dynamically adapt to user input. Not just an NPC, but a character.

Here's what you'll build:
- An agentic RAG system powered by LangGraph
- A memory architecture backed by MongoDB
- Persona-specific prompt templates streamed via Groq LLM APIs
- Observability and evaluation pipelines instrumented with Opik (by Comet)
- A system designed to scale, recover, and impersonate in real time

This is how you go from scripts to systems. From chatbots to characters.

Lesson 2 is live. (Link in the comments)

P.S. A massive shout-out and thanks to Miguel Otero Pedrido for the collab
Finding the right open-source LLMs to work with is a pain in the backside.

98% of LLM leaderboards are bloated. Too many closed models. Too many broken repos. Too little clarity on what actually works in production. It's frustrating.

Fortunately, I found something to help mitigate this issue...

If you're looking for open-source LLMs that just run, for fine-tuning, quantization, and deployment, Unsloth AI has done the hard work for you.

They've compiled a list of all the popular, supported, and production-viable models that:
- Fine-tune easily (with Unsloth + QLoRA)
- Quantize to GGUFs for local inference (Ollama, llama.cpp, OpenWebUI)
- Play well with Hugging Face and Python
- Come with working code and notebook examples
- Deploy easily to Hugging Face Inference Endpoints, AWS, GCP, Modal, and more

No more jumping between broken GitHub repos or guessing which models will survive a production pipeline. It's the fastest way to stay current without losing your mind.

If you're working with open-source LLMs, just bookmark this list. Link in the comments!
90% of AI engineers are dangerously abstracted from reality.

They work with:
- Prebuilt models
- High-level APIs
- Auto-magical cloud tools

But here's the thing: if you don't understand how these tools actually work, you'll always be guessing when something breaks.

That's why the best AI engineers I know go deeper. They understand how Git actually tracks changes, how Redis handles memory, and how Docker isolates environments.

If you're serious about engineering, go build the tools you use. That's why I recommend CodeCrafters.io (YC S22).

You won't just learn tools. You'll rebuild them (from scratch):
- Git, Redis, Docker, Kafka, SQLite, Shell...
- Step by step, test by test
- In your favorite language (Rust, Python, Go, etc.)

It's perfect for AI engineers who want to:
- Level up their backend and system design skills
- Reduce debugging time in production
- Build apps that actually scale under load

And most importantly...
- Stop being a model user
- Start being a systems thinker

If I had to level up my engineering foundations today, CodeCrafters is where I'd start. The link is in the comments.

P.S. We only promote tools we use or would personally take.
P.P.S. Subscribe with my affiliate link to get a 40% discount :)
The difference between RAG and Agentic RAG isn't technical. It's philosophical...

RAG assumes answers are linear. Agentic RAG assumes thinking is iterative. That single belief changes how you architect everything.

Let me explain. Most RAG pipelines follow this recipe:
- Embed a bunch of documents
- Retrieve the top-K chunks
- Slam them into a prompt
- Pray the model gets it right

It works until the query gets complex. Then the whole thing falls apart. Why? Because RAG is passive. It retrieves once and hopes for the best. But real questions aren't solved in one shot. They evolve. They require clarification, follow-ups, and refined context.

That's where Agentic RAG comes in... Agentic RAG doesn't just retrieve, it also reasons:
- Do I have enough context?
- Should I re-query with a better search?
- Should I ask the user for clarification?
- Which tool should I use next?

The result? A system that thinks before it speaks.

If you're building copilots, assistants, or long-form Q&A tools, this matters. Because reliability comes from better decisions. Agentic RAG introduces that decision loop. It turns workflows into systems. It trades static pipelines for dynamic reasoning. And that mindset shift is where real GenAI builders separate themselves from the hype.

Want to see what Agentic RAG looks like in action? We break it down with code, graphs, and production use cases in the Second Brain AI Assistant course.

Link: https://lnkd.in/dA465E_J
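To show the decision loop in plain code, here is a minimal, framework-free sketch; `llm()` and `search()` are hypothetical stand-ins for your model call and retriever, and the policy (retrieve, judge sufficiency, optionally rewrite the query) is a simplification of what the post describes.

```python
def agentic_rag(question: str, llm, search, max_steps: int = 3) -> str:
    """Retrieve-then-reason loop: keep searching until the agent judges the
    gathered context sufficient, then answer grounded in that context."""
    context: list[str] = []
    query = question
    for _ in range(max_steps):
        context += search(query)  # hypothetical retriever: returns a list of text chunks
        verdict = llm(
            f"Question: {question}\nContext: {context}\n"
            "Is this context enough to answer? Reply ENOUGH or propose a better search query."
        )
        if verdict.strip().upper().startswith("ENOUGH"):
            break
        query = verdict  # re-query with the refined search the LLM proposed
    return llm(
        f"Answer the question using only this context.\nContext: {context}\nQuestion: {question}"
    )
```

A classic RAG pipeline is this loop with max_steps fixed to 1 and the sufficiency check removed; the loop is the "decision" the post talks about.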
The #1 mistake in building LLM agents? Thinking the project ends at reasoning.

Here's when it actually ends: when your agent can talk to the world securely, reliably, and in real time.

And that's what Lesson 4 of the PhiloAgents course is all about.

Up to this point, we focused on making our agents think:
- Philosophical worldviews
- Context-aware reasoning
- Memory-backed conversations

But intelligence alone isn't enough. To be useful, agents need a voice. To be deployable, they need an interface. To be real, they need to exist as APIs.

This lesson is the bridge from the local prototype to the live system. Here's what you'll learn:
- How to deploy your agent as a REST API using FastAPI
- How to stream responses token by token with WebSockets
- How to wire up a clean backend-frontend architecture using FastAPI (web server) + Phaser (game interface)
- How to think about agent interfaces in real-world products (not just demos)

In short: this is how you ship an agent that reasons AND responds in production.

Shout-out to Anca-Ioana Martin for helping shape this lesson and write the deep-dive article. And of course... big thanks to my co-creator Miguel Otero Pedrido for the ongoing collab.

Link to Lesson 4 in the comments.
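Here is a minimal sketch of the WebSocket streaming pattern described above, assuming FastAPI and uvicorn are installed; `generate_tokens()` is a hypothetical stand-in for the agent's streaming output, not the course's actual backend.

```python
import asyncio
from fastapi import FastAPI, WebSocket

app = FastAPI()

async def generate_tokens(message: str):
    """Hypothetical agent call that yields the reply token by token."""
    for token in f"(echoing) {message}".split():
        await asyncio.sleep(0.05)  # simulate model latency
        yield token + " "

@app.websocket("/ws/chat")
async def chat(websocket: WebSocket):
    await websocket.accept()
    while True:
        user_message = await websocket.receive_text()
        async for token in generate_tokens(user_message):
            await websocket.send_text(token)   # stream tokens as they are produced
        await websocket.send_text("[END]")     # tell the frontend the turn is over

# Run with: uvicorn app:app --reload
# then connect from the game UI via ws://localhost:8000/ws/chat
```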
Everyone chunks documents for retrieval. But what if that's the wrong unit? Let me explain...

In standard RAG, we embed small text chunks and pass those into the LLM as context. It's simple, but flawed. Why? Because small chunks are great for retrieval precision, but terrible for generation context.

That's where Parent Retrieval comes in (aka small-to-big retrieval). Here's how it works:
- You split your documents into small chunks
- You embed and retrieve using those small chunks
- But you don't pass the chunk to the LLM...
- You pass the parent document that the chunk came from

The result?
- Precise semantic retrieval (thanks to small, clean embeddings that encode a single entity)
- Rich generation context (because the LLM sees the broader section)
- Fewer hallucinations
- Less tuning needed around chunk size and top-k

It's one of the few advanced RAG techniques that work in production. No fancy agents. No latency bombs. No retraining.

We break it all down (with diagrams and code examples) in Lesson 5 of the Second Brain AI Assistant course.

Link to the full lesson in the comments.
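Here is a minimal, framework-free sketch of the small-to-big idea, assuming an `embed()` function that returns unit-normalized vectors (from any sentence-embedding model you like); it indexes small chunks but hands the parent document to the LLM at query time.

```python
import numpy as np

def build_index(docs: dict[str, str], embed, chunk_size: int = 300):
    """Split each parent doc into small chunks and remember which parent each chunk came from."""
    chunks, parents = [], []
    for doc_id, text in docs.items():
        for i in range(0, len(text), chunk_size):
            chunks.append(text[i : i + chunk_size])
            parents.append(doc_id)
    vectors = np.array([embed(c) for c in chunks])  # one embedding per small chunk
    return vectors, parents

def parent_retrieve(query: str, docs, vectors, parents, embed, top_k: int = 3):
    """Score the small chunks, but return the full parent documents they belong to."""
    scores = vectors @ embed(query)                  # cosine similarity (vectors are normalized)
    best = np.argsort(scores)[::-1][:top_k]
    parent_ids = list(dict.fromkeys(parents[i] for i in best))  # dedupe, keep ranking order
    return [docs[pid] for pid in parent_ids]         # parent docs become the generation context
```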
Here's the problem with most AI books: they teach the model, not the system.

Which is fine... until you try to deploy that model in production. That's where everything breaks:
- Your RAG pipeline is duct-taped together
- Your eval framework is an afterthought
- Your prompts aren't versioned
- Your architecture can't scale

That's why Maxime and I wrote the LLM Engineer's Handbook. We wanted to create a practical guide for AI engineers who build real-world AI applications. This isn't just another guide... it's a practical roadmap for designing and deploying real-world LLM systems.

In the book, we cover:
- Efficient fine-tuning workflows
- RAG architectures
- Evaluation pipelines with LLM-as-a-judge
- Scaling strategies for serving and infra
- MLOps and LLMOps patterns baked in

Whether you're building your first assistant or scaling your 10th RAG app, this book gives you the mental models and engineering scaffolding to do it right.

Here's the link to get your copy: https://lnkd.in/dVgFJtzF
Back in 2023, I was struggling to keep track of my notes. So I did something the Black Mirror producers would be proud of... I built a second brain.

All I wanted was an AI-powered assistant connected to my knowledge base. Something I could use to recall notes, surface ideas, and help me think. But making it real wasn't as simple as connecting a chatbot to Notion.

To get it working, I had to build a full system:
- A modular RAG pipeline to retrieve from custom notes
- Ingestion that crawls and cleans all my noisy resources, regardless of their form
- Real-time APIs to stream responses as I typed
- A memory layer to track context across conversations
- Observability and evaluation to measure what worked

No hacks. No hardcoded prompts. Just an LLM agent that understood my notes and helped me reason through them.

After building it, I open-sourced the entire thing: code and lessons. And over the past year, thousands of engineers have cloned, forked, and built on top of it.

This week, the GitHub repo passed 1,000 stars. I just want to say a massive thank you to everyone who tried it, shared it, or built something new with it.

And to those who haven't seen it yet, the link's in the comments.

P.S. Let me know what you'd create with it.
98% of people consume AI content. But only 2% are actually building with it (and we wanted to change that)...

So we created 5 open-source, project-based AI courses that teach you how to go from zero to production. Each course is built with developers in mind, backed by best practices from MLOps, LLMOps, and modern software engineering. And 100% free.

Here's what's inside:

PhiloAgents (with The Neural Maze)
Build a character simulation engine that brings AI agents to life with memory, retrieval, and real-time dialogue, powered by Groq, LangGraph (by LangChain), and Opik (by Comet).
Learn agents, RAG, persona design, and modular LLM architecture.

Second Brain AI Assistant
Build an AI assistant that chats with your personal knowledge base.
Learn end-to-end agentic RAG pipelines, fine-tuning, modular design, and full-stack AI integration.

Amazon Tabular Semantic Search
Master vector search over structured data by building a natural-language search engine for e-commerce products.
Learn how to embed, index, and retrieve relevant product data using semantic search.

LLM Twin: Your Production-Ready AI Replica
Create your own LLM-powered twin from scratch, designed to reflect your knowledge and communication style.
Learn fine-tuning, embedding, vector databases, and serving production-grade AI.

H&M Real-Time Recommender System
Deploy a neural recommender system for fashion items with real-time serving using Hopsworks and KServe.
Learn feature engineering, MLOps, Kubernetes deployment, and retrieval-augmented recsys.

Just:
- Clone the repo
- Open the Substack lesson
- Follow the guide and run the code
- Remix it, fork it, and make it your own

If you're tired of learning in isolation and want to actually build production AI, these courses are for you.

Link to all 5 courses: https://lnkd.in/d8gP9cxC
Evaluation is the bottleneck of every serious GenAI system. And 90% of teams are still treating it as an afterthought...

If you're building LLM apps, especially with RAG or agentic systems, you've probably hit the same wall:
- Messy prompt changes with zero version control
- Vector search that "feels" right but fails silently
- Outputs that kinda work, but you have no way to quantify why
- No strategy to measure the impact of new features

So ahead of my upcoming Open Data Science Conference (ODSC) 2025 webinar, I'm releasing the full open-source evaluation playbook. If you want to explore the code before the talk drops, here's your chance... Note: you don't have to attend the webinar to use it; the README is detailed enough to guide you.

Here's what you'll get:

Module 1: Prompt Monitoring + Versioning
Track every LLM call and prompt change using Opik by Comet. Visualize agent traces, compare versions, and finally debug with confidence.

Module 2: Retrieval Evaluation for RAG
Use UMAP/t-SNE to visualize embeddings. Compute retrieval recall/precision with LLM-as-a-judge.

Module 3: Application-Level Metrics
Detect hallucinations, moderation issues, and quality drops with custom judges. Log everything into Opik to track iterations across builds.

Module 4: Collecting Real User Feedback
Capture structured feedback from users to fuel future eval splits or fine-tuning jobs (e.g., preference alignment).

Why am I doing this? Because evaluation is hard. And most teams don't have a mental model for how to think about these moving parts, let alone code for it. This project brings structure, tooling, and clarity to that chaos.

Link to the repo in the comments.

P.S. If you're joining my session at ODSC, keep it bookmarked; we'll walk through the full stack live.
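As a small taste of Module 2, here is a sketch of the embedding-visualization step, assuming `umap-learn` and `matplotlib` are installed and that the chunk and query embeddings are stand-ins you would replace with your own; it is illustrative, not the playbook's exact code.

```python
import numpy as np
import umap
import matplotlib.pyplot as plt

# Stand-ins for embeddings you already computed (e.g. 768-dim chunk and query vectors).
chunk_vectors = np.random.rand(500, 768)
query_vectors = np.random.rand(20, 768)

# Project both sets with the same reducer so they land in one shared 2D space.
reducer = umap.UMAP(n_components=2, random_state=42)
points = reducer.fit_transform(np.vstack([chunk_vectors, query_vectors]))

plt.scatter(points[:500, 0], points[:500, 1], s=5, label="chunks")
plt.scatter(points[500:, 0], points[500:, 1], s=30, marker="x", label="queries")
plt.legend()
plt.title("Do queries land near the chunks they should retrieve?")
plt.show()
```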
Here's why 98% of agent demos break after three turns (hint: it's not because the prompts are bad):

The agent doesn't remember what just happened.

Without memory, you don't get reasoning. Without reasoning, you don't get believable agents... you just get brittle demos that fall apart under pressure.

That's why we made Lesson 3 of the PhiloAgents course all about memory. In this lesson, we cover how memory enables:
- Conversational flow via short-term memory
- Grounded reasoning via long-term memory (with agentic RAG)
- Semantic vs. episodic vs. procedural long-term memory
- Scalable architecture across threads, users, and interactions
- Fast, focused context handling through smart summarization

We also break down the critical design choices:
- What kind of memory structures you actually need
- How to avoid bloated infra with a single vector DB
- Why long-term memory ≠ just sticking RAG on top

Interested? Lesson 3 is now live. You'll build all of this directly into a philosopher NPC simulation. (Link in the comments)

P.S. Huge thanks to Miguel Otero Pedrido for the collab on this one. This was one of the most fun pieces to build, and it's a piece most agentic builders overlook.
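Here is a minimal sketch of the short-term-memory piece (trim old turns, carry a running summary forward); `summarize()` is a hypothetical LLM call, and the keep-last-N policy is an assumption rather than the lesson's exact implementation.

```python
def compact_memory(messages: list[str], summary: str, summarize, keep_last: int = 6):
    """Keep the context window small: fold everything except the most recent
    turns into a running summary, and return (recent_messages, new_summary)."""
    if len(messages) <= keep_last:
        return messages, summary
    old, recent = messages[:-keep_last], messages[-keep_last:]
    summary = summarize(f"Existing summary: {summary}\nNew turns to fold in: {old}")
    return recent, summary

def build_prompt(summary: str, recent: list[str], user_message: str) -> str:
    """Short-term memory = running summary + last few verbatim turns."""
    return (
        f"Conversation so far (summarized): {summary}\n"
        + "\n".join(recent)
        + f"\nUser: {user_message}\nAssistant:"
    )
```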
Writing a book felt like a gamble. But looking back, it was one of the best decisions I've ever made.

As of today, the LLM Engineer's Handbook has:
- Sold 12,000+ copies
- Become an Amazon bestseller
- Given me the freedom to build without pressure

When I completely renounced my social life to focus on writing, I didn't know if anyone would read it. I didn't know if it would open any doors. I didn't know if it would be worth the effort.

Fortunately, it all paid off. The book gave me breathing room to focus, reinvest, and go all-in on what I love:
- Content
- AI & software
- Building Decoding ML

But the impact went far beyond the numbers...
- It gave me the confidence that my content is good
- It led to speaking invites at QCon, ODSC, and DataCamp
- It connected me to incredible collaborators like [@whats-ai], which sparked our next course on agents
- And it directly led to my current consulting role (plus many more I've had to turn down)

In short: it's been the catalyst for almost everything I'm building today.

I'm extremely grateful to Maxime Labonne for co-authoring this journey and to Gebin George for trusting me with the opportunity.

TL;DR: If you're thinking about writing a book, do it. You're not just publishing words... you're publishing proof of who you are and what you stand for.
Here's the best piece of advice you need to build real-world agents: "Stop thinking in prompts; start thinking in graphs."

Because under the hood, serious agentic systems aren't just string manipulation. They're structured, dynamic workflows. And that's exactly how the PhiloAgent works... It's not a prompt wrapped in a Python script. It's a full agentic RAG system. Let's break it down...

We use a stateful execution graph to drive our philosopher NPCs. Here's how:

1. Conversation Node
Handles the primary logic. It merges incoming messages, the current state, and the philosopher's identity (style, tone, perspective) to generate the next reply.

2. Retrieval Tool Node
If the agent needs more information, it calls a MongoDB-powered vector search to fetch relevant facts about the philosopher's life and work. This turns simple RAG into agentic RAG, since the LLM dynamically chooses tool calls.

3. Summarize Context Node
We summarize long retrieved passages before injecting them into the prompt. This keeps prompts clean and focused, avoiding dumping in whole Wikipedia pages.

4. Summarize Conversation Node
If the conversation gets long, we summarize and trim earlier messages and keep only recent context, while preserving meaning. The agent needs the summary to stay consistent and reference earlier topics from the conversation. This keeps the context window short and focused, lowering costs and latency and improving accuracy.

5. End Node
Wraps up the cycle. Memory is updated, context evolves, and the agent grows with every message.

Here are the implementation details:
- The short-term memory is kept as a Pydantic in-memory state: the PhilosopherState
- Tool orchestration with LangChain
- Low-latency LLMs, such as Llama 70B, served by Groq
- Smaller 8B models used for summarization tasks
- Prompt templates dynamically generated per philosopher
- Served as a real-time REST API through FastAPI & WebSockets to power the game UI
- Monitoring and evaluation wired through Opik by Comet

In short: agents come alive through structure, memory, tools, and flow control. You can adapt this exact system to build:
- Context-aware assistants
- Multi-turn RAG copilots
- NPCs, tutors, or internal tools that think and retrieve

We walk through every step (with code) in Lesson 2 of the PhiloAgents course.

Link in the comments
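Here is a condensed sketch of how such a graph can be wired with LangGraph; the state fields, node bodies, and routing rule are simplified assumptions for illustration, not the PhiloAgents implementation itself.

```python
from pydantic import BaseModel
from langgraph.graph import StateGraph, END

class PhilosopherState(BaseModel):        # short-term memory lives in this state object
    messages: list[str] = []
    retrieved: list[str] = []
    summary: str = ""

def conversation_node(state: PhilosopherState) -> dict:
    # Merge persona, summary, and retrieved facts into the next reply (LLM call omitted here).
    reply = f"[philosopher reply using {len(state.retrieved)} retrieved facts]"
    return {"messages": state.messages + [reply]}

def retrieval_node(state: PhilosopherState) -> dict:
    # Hypothetical vector search over the philosopher's corpus (e.g. MongoDB-backed).
    return {"retrieved": state.retrieved + ["fact about the philosopher"]}

def summarize_conversation_node(state: PhilosopherState) -> dict:
    # Compress older turns and keep only the recent tail of the conversation.
    return {"summary": f"summary of {len(state.messages)} messages", "messages": state.messages[-4:]}

def route_after_converse(state: PhilosopherState) -> str:
    # If no facts have been fetched yet, go retrieve; otherwise compact memory and finish.
    return "retrieve" if not state.retrieved else "summarize"

graph = StateGraph(PhilosopherState)
graph.add_node("converse", conversation_node)
graph.add_node("retrieve", retrieval_node)
graph.add_node("summarize", summarize_conversation_node)
graph.set_entry_point("converse")
graph.add_conditional_edges("converse", route_after_converse, {"retrieve": "retrieve", "summarize": "summarize"})
graph.add_edge("retrieve", "converse")   # retrieved facts flow back into the conversation node
graph.add_edge("summarize", END)

agent = graph.compile()
final_state = agent.invoke({"messages": ["User: What is virtue?"]})
```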
Here's the most annoying thing about MLOps pipelines (and it's contrary to popular belief): most break at the last mile.

Not during training. Not during evaluation. But at the moment of deployment, more specifically, when testing your local models in production.

It's the part where:
- DevOps gets looped in late
- ML engineers get blocked by infra
- Debugging takes forever

And worst of all? You might wait 20 minutes just to find out your endpoint doesn't work. The long cycles to test your ML deployments kill productivity and, most importantly, the inspiration and experimentation speed that are critical to building AI solutions.

But there's a better way to approach this... Instead of treating deployment as someone else's problem, ML teams can take control by testing their models locally before handing them off. Here's what that looks like in practice:

1. Train and log your model using MLflow
2. Wrap it with a custom class that defines your prediction logic (e.g., convert labels to readable outputs)
3. Download the model artifact using MLflow's CLI
4. Serve it locally using the MLflow inference server
5. Test the /invocations endpoint with real requests to ensure contract correctness
6. Validate edge cases (e.g., malformed input) to catch failures early

This flow ensures your deployment logic works before involving production infra, speeding up the development cycle by 10x. A simple shift in mindset. A massive win in practice.

Thanks to Maria Vechtomova and Başak Tuğçe Eskili for outlining this workflow so clearly in their latest article.

P.S. I highly recommend their course, End-to-end MLOps with Databricks.
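Here is a minimal sketch of that workflow using MLflow's pyfunc flavor; the wrapper class, label map, and example payload are assumptions for illustration, and the exact serving flags and request schema can vary by MLflow version.

```python
import mlflow
import mlflow.pyfunc

LABELS = {0: "cat", 1: "dog"}  # illustrative label map

class ReadableModel(mlflow.pyfunc.PythonModel):
    """Step 2: wrap prediction logic so the served output is human-readable."""
    def predict(self, context, model_input):
        # Stand-in for a real classifier: even row sums -> cat, odd -> dog.
        return [LABELS[int(sum(row)) % 2] for row in model_input.values.tolist()]

# Step 1: log the wrapped model so it lands in the tracking server / registry.
with mlflow.start_run() as run:
    mlflow.pyfunc.log_model(artifact_path="model", python_model=ReadableModel())
print(f"Model URI: runs:/{run.info.run_id}/model")

# Steps 3-5, from a terminal (local inference server + contract test):
#   mlflow models serve -m "runs:/<run_id>/model" -p 5001 --env-manager local
#   curl -X POST http://127.0.0.1:5001/invocations \
#        -H "Content-Type: application/json" \
#        -d '{"dataframe_split": {"columns": ["x1", "x2"], "data": [[1, 2], [3, 3]]}}'
# Step 6: repeat with malformed payloads to confirm the error handling you expect.
```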
If you're thinking about consulting in AI, think twice. Here's what it looks like:

In week 1 with one of my clients, they shared the product vision and roadmap, then dropped a task list that said:
- Deploy the product to AWS with CI/CD
- Support multiple deployment modes
- Add LLM observability
- Optimize and stabilize the core system

That's it. No onboarding, no long handovers. From there, it was all on me to:
- Reverse-engineer the code and architecture
- Identify missing pieces in the infra
- Learn whatever tool or system I needed on the fly
- Ship fast in an environment with low resources and zero handholding

This is the reality of working as a contractor in early-stage AI teams. You don't get to ask for the perfect setup. You make decisions in ambiguity. You learn fast, adapt faster, and ship before you feel ready.

Here's what I've learned:
- You'll never know everything going in
- Mastering fundamentals matters more than mastering tools
- You need to balance speed with systems thinking
- Your job isn't to follow a process; it's to create one that works under fire

If you're looking to freelance or consult in AI, prepare to be thrown into the fire. Your value is in how fast you find clarity, not how much you already know.

No better prep than building, breaking, and repeating.
The most underestimated part of building LLM applications? Evaluation.

Evaluation can take up to 80% of your development time (because it's HARD).

Most people obsess over prompts. They tweak models. Tune embeddings. But when it's time to test whether the whole system actually works? That's where it breaks. Especially in agentic RAG systems, where you're orchestrating retrieval, reasoning, memory, tools, and APIs into one seamless flow. Implementation might take a week. Evaluation takes longer. (And it's what makes or breaks the product.)

Let's clear up a common confusion: LLM evaluation ≠ RAG evaluation.

LLM eval tests reasoning in isolation: useful, but incomplete. In production, your model isn't reasoning in a vacuum. It's pulling context from a vector DB, reacting to user input, and shaped by memory and tools. That's why RAG evaluation takes a system-level view. It asks: did this app respond correctly, given the user input and the retrieved context?

Here's how to break it down:

Step 1: Evaluate retrieval.
- Are the retrieved docs relevant? Ranked correctly?
- Use LLM judges to compute context precision and recall
- If ranking matters, compute NDCG and MRR metrics
- Visualize embeddings (e.g., with UMAP)

Step 2: Evaluate generation.
- Did the LLM ground its answer in the right info?
- Use heuristics, LLM-as-a-judge, and contextual scoring.

In practice, treat your app as a black box and log:
- User query
- Retrieved context
- Model output
- (Optional) Expected output

This lets you debug the whole system, not just the model.

How many samples are enough? 5-10? Too few. 30-50? A good start. 400+? Now you're capturing real patterns and edge cases. Still, start with however many samples you have available, and keep expanding your evaluation split. It's better to have an imperfect evaluation layer than nothing.

Also track latency, cost, throughput, and business metrics (like conversion or retention).

Some battle-tested tools:
- RAGAS (retrieval-grounding alignment)
- ARES (factual grounding)
- Opik by Comet (end-to-end open-source eval + monitoring)
- LangSmith, Langfuse, Phoenix (observability + tracing)

TL;DR: Agentic systems are complex. Success = making evaluation part of your design from day 0.

We unpack this in full in Lesson 5 of the PhiloAgents course.

Check it out here: https://lnkd.in/dA465E_J
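For the retrieval step, here is a minimal sketch of two of the ranking metrics mentioned above (MRR and precision@k), computed over illustrative judge labels; the sample data is made up.

```python
def mrr(results: list[list[bool]]) -> float:
    """Mean reciprocal rank: average of 1/rank of the first relevant hit per query."""
    total = 0.0
    for hits in results:
        for rank, relevant in enumerate(hits, start=1):
            if relevant:
                total += 1.0 / rank
                break
    return total / len(results)

def precision_at_k(results: list[list[bool]], k: int) -> float:
    """Fraction of the top-k retrieved chunks judged relevant, averaged over queries."""
    return sum(sum(hits[:k]) / k for hits in results) / len(results)

# Relevance labels per query (e.g. produced by an LLM judge), in retrieval order.
judged = [
    [False, True, False, False],   # first relevant chunk at rank 2
    [True, False, True, False],    # first relevant chunk at rank 1
    [False, False, False, False],  # retrieval missed entirely
]
print(f"MRR: {mrr(judged):.2f}, precision@3: {precision_at_k(judged, 3):.2f}")
# MRR: 0.50, precision@3: 0.33
```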
95% of agents never leave the notebook. And it's not because the code is bad... it's because the system around them doesn't exist.

Here's my point: anyone can build an agent that works in isolation. The real challenge is shipping one that survives real-world conditions (e.g., live traffic, unpredictable users, scaling demands, and messy data).

That's exactly what we tackled in Lesson 1 of the PhiloAgents course.

We started by asking, "What does an agent need to survive in production?" and decided on four things: it needs an LLM to run in real time, a memory to understand what just happened, a brain that can reason and retrieve factual information, and a monitor to ensure it all works under load. So we designed a system around those needs.

The frontend is where the agent comes to life. We used Phaser to simulate a browser-based world. But more important than the tool is the fact that this layer is completely decoupled from the backend (so game logic and agent logic evolve independently).

The backend, built in FastAPI, is where the agent thinks. We stream responses token by token using WebSockets. All decisions, tool calls, and memory management happen server-side.

Inside that backend sits the agentic core: a dynamic state graph that lets the agent reason step by step. The agent is orchestrated by LangGraph and powered by Groq for real-time inference speeds. It can ask follow-up questions, query external knowledge, or summarize what's already been said (all in a loop).

When the agent needs facts, it queries long-term memory. We built a retrieval system that mixes semantic and keyword search, using cleaned, de-duplicated philosophical texts crawled from the open web. That memory lives in MongoDB and gets queried in real time. Meanwhile, short-term memory tracks the conversation thread across turns. Without it, every new message would be a reset. With it, the agent knows what's been said, what's been missed, and how to respond.

But here's the part most people skip: observability. If you want to improve your system, you need to see and measure what it's doing. Using Opik (by Comet), we track every prompt, log every decision, and evaluate multi-turn outputs using automatically generated test sets.

Put it all together and you get a complete framework that remembers, retrieves, reasons, and responds in a real-world environment.

Oh... and we made the whole thing open source.

Link: https://lnkd.in/d8-QbhCd

P.S. Special shout-out to my co-creator Miguel Otero Pedrido
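For the long-term-memory retrieval piece, here is a sketch of a semantic query against MongoDB Atlas Vector Search via pymongo; the connection string, database, collection, and index names are placeholders, `embed()` is a hypothetical embedding call, and it presumes a vector index already exists on the `embedding` field.

```python
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")  # placeholder URI
collection = client["philoagents"]["philosopher_chunks"]            # assumed db/collection names

def semantic_search(query_vector: list[float], top_k: int = 5):
    """Nearest-neighbor search over pre-embedded text chunks (Atlas $vectorSearch stage)."""
    pipeline = [
        {
            "$vectorSearch": {
                "index": "chunk_vector_index",   # assumed index name
                "path": "embedding",
                "queryVector": query_vector,
                "numCandidates": 100,
                "limit": top_k,
            }
        },
        {"$project": {"text": 1, "score": {"$meta": "vectorSearchScore"}, "_id": 0}},
    ]
    return list(collection.aggregate(pipeline))

# results = semantic_search(embed("What did Aristotle say about friendship?"))  # embed() is hypothetical
# Keyword results (e.g. a text index query) could be merged with these via rank fusion.
```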
- Sabeeka Ashraf (@sabeekaashraf): 20k followers
- Sahil Bloom (@sahilbloom): 1m followers
- Izzy Prior (@izzyprior): 82k followers
- Richard Moore (@richardjamesmoore): 105k followers
- Shlomo Genchin (@shlomogenchin): 49k followers
- Sam G. Winsbury (@sam-g-winsbury): 49k followers
- Matt Gray (@mattgray1): 1m followers
- Daniel Murray (@daniel-murray-marketing): 150k followers
- Ash Rathod (@ashrathod): 73k followers
- Amelia Sordell (@ameliasordell): 228k followers
- Vaibhav Sisinty (@vaibhavsisinty): 451k followers
- Wes Kao (@weskao): 107k followers
- Austin Belcak (@abelcak): 1m followers
- Justin Welsh (@justinwelsh): 1m followers
- Luke Matthews (@lukematthws): 188k followers
- Tibo Louis-Lucas (@thibaultll): 6k followers