Dashboard
My Videos
View and manage your transcribed video library
| Thumbnail | Channel | Playlist ID | Playlist Title | Title | YouTube ID | Duration | Status | Created | Action |
|---|---|---|---|---|---|---|---|---|---|
| ![]() | aiDotEngineer | UULKPca3kwwd-B59HNr-_lvA | No playlist title | AIE Miami Keynote & Talks ft. OpenCode, Google DeepMind, OpenAI, and more! | 6IxSbMhT7v4 | 475 min | PENDING | 4/20/2026 | |
| ![]() | aiDotEngineer | UULKPca3kwwd-B59HNr-_lvA | No playlist title | Running LLMs on your iPhone: 40 tok/s Gemma 4 with MLX — Adrien Grondin, Locally AI | a2muGkT4WD4 | 11 min | COMPLETED | 4/20/2026 | |

Transcript: Running Gemma 4 on iPhone with MLX
Adrien Grondin, Founder, Locally AI (AI Engineer Europe; presenting sponsor: Google DeepMind; platinum sponsors: Braintrust, WorkOS, OpenAI)

Locally AI is a chatbot that lets you run on-device models on your iPhone with MLX: models like Gemma 4, Gemma 2, and Gemma 1.1, all on device, with responses in roughly 600 to 800 milliseconds, since MLX is optimized for Apple Silicon.

MLX is a framework made by Apple that is optimized for Apple Silicon, mainly the chips in iPhones but also the chips in Macs. You can browse the models on Hugging Face under MLX Community; that is where all the models are uploaded, so you can just download a model and run it. The API is very straightforward, simple enough to implement in less than 10 minutes, and you can have an iOS app running on your iPhone. It is really fast, it also works on macOS with Apple Silicon, and everything is built to be as optimized as possible for these devices.

So if you want to run language models on iPhone, you can download MLX Swift LM and install it with your agent or anything; then you just need to grab the model ID and pass it to the framework. You can also run mlx-lm for Python; maybe you have seen Prince running LLMs with MLX. And there is also MLX video tooling, to run image generation or video generation models.
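The "grab the ID and pass it to the framework" step above can be sketched in Python. This is a hedged sketch: the helper and the model name are illustrative, and the commented-out calls assume the mlx-lm package on Apple Silicon.

```python
# Sketch of the flow from the talk: pick a model from the Hugging Face
# "mlx-community" organization, then hand its repo ID to an MLX runtime
# (MLX Swift LM in an iOS app, or the mlx-lm package in Python).
# The model name below is illustrative, not a recommendation.

def mlx_community_id(model_name: str) -> str:
    """Build the Hugging Face repo ID that MLX loaders expect."""
    return f"mlx-community/{model_name}"

repo_id = mlx_community_id("gemma-2-2b-it-4bit")
print(repo_id)  # mlx-community/gemma-2-2b-it-4bit

# With mlx-lm installed (Apple Silicon only), loading and generating
# then looks roughly like:
#   from mlx_lm import load, generate
#   model, tokenizer = load(repo_id)
#   text = generate(model, tokenizer, prompt="Hello!", max_tokens=64)
```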
The MLX ecosystem is getting bigger and is really great right now; you can do pretty much everything: omni models, text-to-speech, speech-to-speech, there is a lot you can do with the models. And say you want to integrate a model into your iPhone app: you do not just run any model on it, you need to get a model from somewhere, and a good place to get one is Hugging Face. So, again, you download MLX Swift LM, install it with your agent or anything, then grab the model ID and pass it to the framework.
4-bit / 6-bit / 8-bit. Usually when you are running a model on iPhone, what you want to do is select a quantized version of the model, because the full size will be way too large. What I recommend, depending on the size, is between 4-bit and 8-bit. Under about 4 bits, quantization starts to have a lot of impact on the output and the models are not that great; so 4-bit is the lowest I would go, and 8-bit is the highest, which I would use for really small models. In my app, for example, I have bigger models like Gemma 4, but I also have a Liquid model that is 350 million parameters. It can run in Shortcuts, so you can do a lot of automation, because models are really fast and really efficient at that size for text processing and things like this. And on the latest iPhone it is extremely fast; it can easily run at 40 tokens per second.

I will just make an update to the slides: I had removed this demo, but I have a little more time, so let me add it back. Just to show you in a demo what 40 tokens per second means: this is running live, offline, and as you can see it is really fast. 40 tokens per second is more than acceptable for a lot of use cases. This is streaming, of course; you can also skip streaming and build a UI that just waits for the full output. The output is quite long, so it is generating a lot of tokens. So on-device LLMs are working, they are here, and they are really not hard to integrate.
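The bit-width advice comes down to simple arithmetic: weight memory scales linearly with bits per parameter. A minimal sketch, with illustrative parameter counts (the exact sizes of the models mentioned are not given in the talk):

```python
# Approximate weight memory for a quantized model, ignoring the KV cache
# and activations. Parameter counts below are illustrative.

def quantized_weight_gb(n_params: float, bits: int) -> float:
    """Weight memory in decimal gigabytes: params * bits / 8 bytes each."""
    return n_params * bits / 8 / 1e9

for bits in (4, 6, 8, 16):
    print(f"4B params @ {bits}-bit = {quantized_weight_gb(4e9, bits):.1f} GB")

# A 350M-parameter model at 4-bit is tiny, which is why it can run in
# Shortcuts-style automations:
print(f"{quantized_weight_gb(350e6, 4):.3f} GB")  # 0.175 GB

# The speed claim in wall-clock terms: at 40 tokens/second, a 300-token
# reply streams out in 7.5 seconds.
print(f"{300 / 40:.1f} s")
```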
As I said, if you go to the repo, the package is a breeze to install. On top of that, again, the latest iPhone is really great, but it also works on all the older iPhones: you will not get 40 tokens per second, which is quite fast, but even if you get 20 tokens per second, that is already great and useful for a lot of applications and use cases you would want to build into your app. You can scan this QR code if you want to try it yourself; if you have an iPhone, the app is on the App Store and free to use. The only thing is that you will have to download the model, usually one to three gigabytes depending on the model. That is the biggest barrier right now, the size of the model, but this is also getting better: models are getting smaller and smarter, and the iPhone is getting better too, so with the next iPhone and the one after that, this is reaching really great usability from what I can see.

Also, on top of that, maybe you saw the news yesterday: Locally AI has been acquired by LM Studio. If you do not know LM Studio, it is basically an AI studio for all your local models. You can download any model with LM Studio directly from Hugging Face, run it, and open a server. You can run models with llama.cpp but also with MLX, so you can really compare how the different engines work. As I said, you can open a server locally and connect your application to this host server, with various response types, for example the OpenAI response type or the Anthropic response type, with streaming and everything, and you can get any model running really easily with that. So I want to thank you; that was a really short introduction to how you can run any model, like Gemma 4 if you want, on your iPhone. Any questions? Sorry?
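Connecting an app to such a local OpenAI-compatible server mostly means sending the standard chat-completions payload. A minimal sketch: the model name is a placeholder, and the localhost port is whatever your server is configured to use.

```python
import json

# Build an OpenAI-style chat-completions request body. Local runtimes that
# expose an OpenAI-compatible server accept this same shape; you would POST
# it to something like http://localhost:<port>/v1/chat/completions.

def build_chat_request(model: str, prompt: str, stream: bool = True) -> str:
    """Serialize a minimal OpenAI-style chat request as JSON."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    })

body = build_chat_request("my-local-gemma", "Hello from my phone!")
print(body)
```

Swapping engines (llama.cpp vs MLX) then requires no client changes, which is what makes comparing them easy.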
Q: Does it support tool calling and structured generation?

A: Yes, I forgot about that. Tool calling, yes; custom structured generation, not yet. There are some packages on top of Swift that are trying to make this work (Hugging Face is doing it), and you can easily find them online. But MLX Swift LM does support tool calling, so that is useful if you want to build agent-style systems, and the models are getting better at it too; they were not so great a year ago, but now it is much better.

Q: And the second thing: where do the model weights come from?

A: There is a GitHub repo for MLX Swift LM; that is the package you install in your app, and then you go to Hugging Face to get the weights of a model. A normal user can just download the app from the App Store. And yes, if you want to try my app right now, without having anything to install, you can, and you can choose any open-source model from the selection inside. Though I am not sure that all of the models run correctly on the iPhone, because not all of them work well. Thank you very much.
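Tool calling, mentioned in the answer above, means the model emits a structured function call instead of prose. A minimal sketch of the OpenAI-style tool schema that many runtimes accept; the function name and its fields are invented for illustration.

```python
import json

# An OpenAI-style tool definition: the model sees this schema and can
# respond with a structured call like {"name": "get_weather",
# "arguments": {"city": "..."}} instead of free text.
# The tool itself is hypothetical.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# A request that offers the tool would include: "tools": [weather_tool]
print(json.dumps(weather_tool, indent=2))
```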