# First Impressions of Using gpt-oss Locally on Mac

OpenAI has recently announced two new [open-source reasoning models](https://openai.com/index/introducing-gpt-oss/), calling them `gpt-oss`. They claim the two models, 20b and 120b variants, "deliver strong real-world performance at low cost." In particular, the evaluation results shared by OpenAI show the 20b variant of `gpt-oss` performing better than or comparably to o3-mini and o4-mini on Codeforces Competition code, Humanity's Last Exam, HealthBench, and AIME, to name a few benchmarks.

Given that I purchased a MacBook Pro with an M4 Pro SoC and 48 GB of memory less than ten months ago, I was curious to see whether I could properly run this model locally, and whether its performance would be sufficient for me to use it instead of (or in addition to) [GPT-5](https://openai.com/index/introducing-gpt-5/).

## Setup process

I used `ollama` to run the 20b variant locally:

`ollama pull gpt-oss:20b`

Then I could interact with the model, either through the Ollama app on my Mac or in the Terminal by running `ollama run gpt-oss:20b`. It should be noted that Activity Monitor showed between 13 and 15 GB of memory in use by the `ollama` process after executing the `ollama run` command.

## How fast does the model perform?

To test how well my Mac runs the local model and whether I can use it in my day-to-day work, I ran some tests, each with over 100 prompts, and averaged the time per prompt (a sketch of the kind of loop you could use for this follows the task list). I used four main tasks:

- **Multiplication:** I randomly picked two numbers between 1 and 500 each time and asked the model to multiply them. This test mainly measured the generation of thinking/reasoning tokens, as the final answer used only a few tokens.
- **Paraphrasing:** I asked the model to paraphrase the sentence *"I went to the store to buy some milk. They were out of stock. So I went to the other store."* This test required more tokens in the final output than the previous one.
- **Repeating Text:** To test generating more output tokens without much reasoning, I asked the model to repeat 'hello world' 20 times.
- **Generating JSON:** To test a mix of more reasoning and more output tokens, I asked the model to:

```
Return a JSON array with exactly 10 objects.
Each object must have the keys "id", "name", and "value".
The values should be: id = 1‑10, name = "Item X" (X is the id), value = X * 1.5 (with 1 decimal place).
Do not include any extra keys or whitespace outside the JSON.
```

All the prompts ran while my other day-to-day applications (VS Code, Warp, Slack, Telegram, WhatsApp, Mail, Photos, Music, Safari, Arc, Weather, etc.) were also open (note that this was *not* a scientific experiment under controlled conditions by any means).
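If you want to run a similar measurement yourself, below is a minimal sketch of such a timing loop (not my exact script). It assumes the Ollama server is already running locally on its default port, 11434, and the prompt list is just a placeholder.

```python
# batch_timer.py -- rough timing loop against a local Ollama model.
# Assumes the Ollama server is running (started by the Ollama app or
# `ollama serve`) and listening on the default port 11434.
import json
import time
import urllib.request

MODEL = "gpt-oss:20b"
# Placeholder prompts; swap in your own list of 100+ prompts.
PROMPTS = [f"Multiply {a} by {b}." for a, b in [(17, 423), (250, 81), (499, 3)]]


def ask(prompt: str) -> str:
    """Send one non-streaming generate request and return the model's reply."""
    body = json.dumps({"model": MODEL, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


start = time.perf_counter()
for prompt in PROMPTS:
    ask(prompt)
elapsed = time.perf_counter() - start
print(f"average seconds per prompt: {elapsed / len(PROMPTS):.2f}")
```

With `stream` set to `False`, each request returns only after the full response has been generated, so timing the calls approximates the per-prompt numbers reported below.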
## Results

### Running time

The table below shows the average time per prompt for each task:

| Task            | Average Time per Prompt (seconds) |
| --------------- | --------------------------------- |
| Multiplication  | 4.11                              |
| Paraphrasing    | 6.13                              |
| Repeating Text  | 7.16                              |
| Generating JSON | 18.92                             |

While the results show the model does not provide *instant* responses, the time taken per prompt isn't very long either. For example, for a prompt with relatively light reasoning (like the *Generating JSON* prompt), one can plug their MacBook into power, let it run through prompts overnight, and obtain the outputs for a batch of more than 1300 prompts over 7 hours of sleep (7 hours ≈ 25,200 seconds; 25,200 / 18.92 ≈ 1,330 prompts), using a model with performance comparable to o3-mini and o4-mini, without paying for any API access or GPU servers. For a simpler prompt (e.g., paraphrasing short texts for data augmentation), the same amount of time would allow more than 4100 prompts to run (25,200 / 6.13 ≈ 4,100).

### Other observations

- **Battery consumption:** I observed a drain of around 1.3 percent of battery per minute. While this can make the model tricky to use for *batch* requests when away from a power supply, A) I doubt there would be any noticeable battery hit from occasional use of the model, as long as requests don't arrive in back-to-back batches, and B) the problem can effectively be solved by plugging in the device, which is a reasonable assumption anyway when running batch requests over a longer period of time.
- **Heat:** None of the experiments made the device very hot. While I could feel that the palm-rest area next to the trackpad was warmer than normal, it was not uncomfortable or anywhere close to burning. The heat was much more pronounced on the surface above the function keys, but that is not a part of the device many people touch in day-to-day use.
- **Fans:** The fans stayed practically inaudible for a good while. I started to hear a very faint spinning sound after roughly 6 minutes (most noticeably during the Repeating Text experiment), which gradually grew louder over time, though still not to a disturbing level in my opinion. Even the louder sound could not be heard from a couple of seats away from the device.
- **Device performance:** None of the experiments had any noticeable effect on the performance of the other apps running on the MacBook. ProMotion (the adaptive refresh rate of up to 120 Hz) kept working smoothly, and I noticed absolutely no difference in how any other app performed. On a related note, I suspect setting the device's energy mode to "High Power" in the Settings app could make the model run even faster, but I haven't tried it yet, so I can't report any results.

## Conclusion

All in all, the results show a lot of potential for using my MacBook as a capable local LLM inference machine. This is all the more appealing because the same laptop, with its powerful yet efficient processor, already serves me well throughout my workday while maintaining great battery life; the fact that it can casually double as an LLM inference machine is the cherry on top. Running these experiments actually made me wonder whether I made the right choice by not picking an M4 Max model, which comes with a higher GPU core count, over the M4 Pro.

I would recommend that readers with an M4 Pro or a similarly powerful system-on-a-chip try gpt-oss and share their opinions. I'm looking forward to seeing whether I can integrate this model with the Spotlight support for the Shortcuts app, coming soon in macOS Tahoe, for easy invocation of the model. Also, if you still haven't upgraded your Mac to an Apple Silicon-based model, this could be yet another motivation to finally upgrade :)
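If you'd like to try the overnight-batch idea from the Results section, here is a minimal sketch of how it could look (the file names and model tag are illustrative, and you could equally reuse the HTTP call from the earlier sketch). Launching it under macOS's `caffeinate` utility, e.g. `caffeinate -i python3 overnight_batch.py`, keeps the plugged-in Mac from going to idle sleep while it works through the prompts.

```python
# overnight_batch.py -- run a file of prompts through a local Ollama model
# and save the answers, e.g. launched as: caffeinate -i python3 overnight_batch.py
# File names and the model tag below are illustrative.
import json
import subprocess

MODEL = "gpt-oss:20b"

# One prompt per line in prompts.txt (hypothetical input file).
with open("prompts.txt") as f:
    prompts = [line.strip() for line in f if line.strip()]

with open("outputs.jsonl", "w") as out:
    for i, prompt in enumerate(prompts, start=1):
        # `ollama run MODEL "prompt"` answers a single prompt and exits.
        result = subprocess.run(
            ["ollama", "run", MODEL, prompt],
            capture_output=True,
            text=True,
        )
        record = {"id": i, "prompt": prompt, "response": result.stdout.strip()}
        out.write(json.dumps(record) + "\n")
        out.flush()  # keep partial results if the run is interrupted
```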