I recently discovered just how capable the M2 chip in my Mac is when it comes to artificial intelligence, especially language models, or LLMs (Large Language Models). By installing an app called Ollama, you can chat with one in much the same way as with ChatGPT. Add a user interface called Open WebUI, and the experience becomes strikingly similar to ChatGPT. A big plus here is that nothing from your conversation ever leaves your device, and you might even reduce your environmental impact a bit. On the flip side, though, the models you can run locally on a MacBook aren’t nearly as advanced as GPT-4 or other larger models.
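If you'd rather skip the chat window, Ollama also listens on a local HTTP API, which shows nicely that everything stays on your own machine. Here is a minimal sketch in Python, assuming Ollama is running on its default port (11434) and that a model tagged `llama3.1` has already been downloaded. The helper names `build_request` and `ask` are my own, not part of Ollama.

```python
import json
import urllib.request

# Default address of a locally running Ollama server (assumption: default port).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "llama3.1") -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, model: str = "llama3.1") -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `print(ask("Why is the sky blue?"))` should print the model's answer. The request only ever goes to `localhost`, so this works with Wi-Fi switched off.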
The models available through apps like Ollama are open-source models, unlike proprietary ones like GPT-3 and GPT-4, which, as far as I know, you can’t download anywhere. In theory, there’s nothing stopping open models from being just as smart as GPT-4. It really depends on the training data and the resources put into training them. What makes the difference then is the hardware and computing power most people have available to actually run these models.
For example, there are no official numbers on the size of OpenAI’s best models, but the unofficial estimates I’ve come across put GPT-4 somewhere between 750GB and 1TB in size, likely using 32-bit precision. That kind of scale demands a ton of computing power, which is no problem for a company like OpenAI with massive data centers full of GPUs. For everyday users, though, this becomes a real limitation.
The largest open-source model I found for use with Ollama (without spending a ton of time searching) is Llama 3.1, weighing in at 812GB in 16-bit precision. It was developed and released by Meta, another major company with plenty of resources to create a high-quality language model. But for someone like me, trying to run this model on a MacBook means it needs to be significantly shrunk down. One way to do that is quantization, which doesn’t remove parameters but stores each one with fewer bits (say 4 or 2 instead of 16), trading some accuracy for a much smaller file. Quantization alone can’t explain the whole drop, though: going from 16 bits to 2 bits shrinks a model by a factor of eight at best. The smallest version of Llama 3.1 I found is just 3.2GB at 2-bit precision, and as far as I can tell that is the much smaller 8-billion-parameter variant of the model rather than the full-size one squeezed down. Either way, it is more than manageable even on a MacBook Pro M2.
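The arithmetic behind those file sizes is simple enough to sketch: number of parameters times bits per parameter, divided by eight to get bytes. The parameter counts below (405 billion and 8 billion) are Meta's published sizes for Llama 3.1, not numbers from this post, and the function name `model_size_gb` is just something I made up for illustration.

```python
def model_size_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring file overhead."""
    return n_params * bits_per_param / 8 / 1e9


# Largest Llama 3.1 variant (405 billion parameters) at 16 bits per parameter.
full_16bit = model_size_gb(405e9, 16)

# The same model if every parameter were stored in exactly 2 bits.
full_2bit = model_size_gb(405e9, 2)

# Smallest Llama 3.1 variant (8 billion parameters) at exactly 2 bits.
small_2bit = model_size_gb(8e9, 2)

print(f"405B @ 16-bit: {full_16bit:.0f} GB")
print(f"405B @  2-bit: {full_2bit:.0f} GB")
print(f"  8B @  2-bit: {small_2bit:.0f} GB")
```

This gives about 810GB for the full model at 16 bits, which lines up with the 812GB download, and about 101GB for the same model at 2 bits. The 8B variant comes out at a nominal 2GB. The actual file is larger than that, presumably because real-world "2-bit" formats keep some parts of the model at higher precision and add metadata, but I haven't verified the exact breakdown.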
But let’s be real: something must have been lost in translation between 812GB and 3.2GB, right? In my opinion, that’s probably the biggest difference between running LLMs locally and using services like ChatGPT. In my experience, the quantized version of Llama 3.1 isn’t great at factual accuracy or language fluency, and it sometimes even stumbles on basic English grammar. Writing in Norwegian? Forget it. But where it does shine is with logical tasks like math and programming. I’ve found it pretty reliable as a coding assistant. One downside, though, is that it seems to struggle to maintain context in longer conversations. Still, there’s that major benefit of everything staying local on your Mac. And let’s not forget the environmental angle!
AI, Energy Use, and Sustainability
The connection between AI, energy consumption, and sustainability has been a hot topic lately. I like to think that running a quantized language model locally reduces my environmental footprint, but training these models is another story. Training GPT-3, for instance, reportedly emitted about 552 tons of CO2, which GPT-3 itself tells me is roughly equivalent to the lifetime emissions of about 20 cars. Once a model is trained, though, that cost is paid, at least until the model needs retraining to stay relevant. If you spread that environmental cost across all the people using it and the number of times it’s used, it doesn’t seem so bad. But then there’s the ongoing energy demand for running a model that large. That’s where I think quantized models are a clever workaround. If we use them for all the tasks they can handle and save the larger models for when they’re really needed, it might save some energy. That said, I’m venturing into territory I don’t fully understand here, so take my musings with a grain of salt.
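To put a rough number on "spreading the cost", here is a tiny back-of-the-envelope sketch. The 552-ton training figure is the one quoted above (I'm treating it as metric tons), while the query counts are purely hypothetical placeholders I picked to show the scaling, not real usage data. It also ignores the energy used to answer each query.

```python
def grams_co2_per_query(training_tonnes: float, total_queries: float) -> float:
    """Training emissions divided evenly over every query, in grams of CO2."""
    grams_total = training_tonnes * 1_000_000  # 1 metric ton = 1,000,000 g
    return grams_total / total_queries


# Hypothetical lifetime query counts, chosen only for illustration.
for queries in (1e6, 1e9, 1e12):
    share = grams_co2_per_query(552, queries)
    print(f"{queries:.0e} queries -> {share:.6f} g CO2 per query")
```

The takeaway is just the shape of the curve: the training share per query drops a thousandfold for every thousandfold increase in usage, which is why the ongoing cost of running the model matters more the more it gets used.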
And What About iPads?
So far, I’ve been rambling about using language models on a Mac, but wait—didn’t the title of this post mention something about the iPad? Yep, even though I got sidetracked, I was planning to get there eventually. While chatting with Ollama on my Mac, I started wondering about my iPad, which has the same M2 chip. Well, sort of. My Mac has the M2 Pro chip with 16 GPU cores and 16GB of memory, while the iPad has 9 GPU cores (still impressive!) but only 8GB of memory. So, yeah, there are noticeable differences.
Another limitation of the iPad is iPadOS. You can’t just install whatever you want like you can on macOS. Luckily, others have already thought about this and developed plenty of apps for running language models on iPads. The one I ended up trying is called Mollama (that name seems oddly familiar 🤔). The app itself is tiny, but of course, you’ll need to download some language models to go with it. Mollama comes preloaded with one called Qwen 2.5, which is just 265.1MB. My expectations for that weren’t high, so I downloaded a slightly less quantized 4GB model instead. I noticed my iPad starts to struggle with models larger than that, but I’m genuinely impressed this works at all on an iPad. Now I can sit on a plane, iPad in flight mode, and chat with AI!
…But then there’s that environmental footprint again… ✈️