I want to start a discussion on the performance of the new Qualcomm Snapdragon X chips, similar to the Apple Silicon discussion in #4167.
This post has been completely updated, because setting the power mode to "best performance" IS needed. By default, Windows only uses 4 of the 10 cores fully, which prevents thermal throttling but gives much less performance.
I am agnostic about Apple/Intel/AMD/... and about any discussion of Windows/macOS/Linux merits - please spare us any OS "religiosity" here. For me it's important to have good tools, and I think running LLMs/SLMs locally via llama.cpp is important. We need good llama.cpp benchmarking to be able to decide. I am currently primarily a Mac user (MacBook Air M2, Mac Studio M2 Max), running macOS, Windows and Linux.
I just got a Surface Pro 11 with the X Plus, and these are my first benchmarks. The Surface Pros have always had thermal constraints, so I got a Plus and not an Elite - even with the Plus it throttles quickly when all 10 cores are fully used. Also, since llama.cpp has been optimized for Snapdragon builds in the meantime, I am NOT testing with build 8e672ef but with the current build. I am still trying to produce results comparable to the Apple Silicon numbers in #4167.
Here are my results for my Surface Pro 11 with the Snapdragon(R) X Plus 10-core X1P64100 @ 3.40 GHz and 16GB RAM, running Windows 11 Enterprise 22H2 26100.1000. With only 16GB, I could not test fp16 since it swaps.
llama-bench was run with -t 10, first for Q8_0 and then, after a bit of cool-down, for Q4_0 (the throttled numbers were 40% (!!!) lower). F16 swaps with the 16GB RAM, so it's not included.
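For reference, this is the command line used (as described in a reply further down; `-t 10` matches the Plus' 10 cores):

```sh
./llama-bench -m <model-name> -p 512 -n 128 -t 10
```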
| model | size | params | backend | threads | test | t/s |
| ------------- | -------: | -----: | ------- | ------: | ----- | ------------: |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | CPU | 10 | pp512 | 58.72 ± 2.50 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | CPU | 10 | tg128 | 13.54 ± 1.12 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 10 | pp512 | 58.59 ± 3.12 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 10 | tg128 | 18.23 ± 6.23 |
build: a27152b (3285)
Update: Results for a Snapdragon X Elite (Surface Laptop 7 15"):
| model | size | params | backend | threads | test | t/s |
| ------------- | -------: | -----: | ------- | ------: | ----- | ------------: |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | CPU | 12 | pp512 | 63.51 ± 4.94 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | CPU | 12 | tg128 | 12.65 ± 0.41 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 12 | pp512 | 66.63 ± 3.90 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 12 | tg128 | 20.72 ± 0.54 |
build: cddae48 (3646)
Update with the new Q4_0_4_x kernels (lately this repacking is done automatically when a Q4_0 model is loaded, no special model file needed anymore):
| model | size | params | backend | threads | test | t/s |
| ----------------- | -------: | -----: | ------- | ------: | ----- | ------------: |
| llama 7B Q4_0_4_8 | 3.56 GiB | 6.74 B | CPU | 10 | pp512 | 166.05 ± 1.83 |
| llama 7B Q4_0_4_8 | 3.56 GiB | 6.74 B | CPU | 10 | tg128 | 20.09 ± 3.86 |
build: 69b9945 (3425)
I think the new Qualcomm chips are interesting - the numbers are a bit faster than my M2 MacBook Air in CPU-only mode. Feedback welcome!
It's early in the life of this SoC as well as of Windows for arm64, and a lot of optimizations are still needed. There is no GPU/NPU support (yet), and Windows/gcc arm64 is still work in progress. DirectML, QNN and ONNX seem to be the main optimization focus for Microsoft/Qualcomm; I will look into this later (maybe the llama.cpp QNN backend of #7541 would also help / be a starting point). So this is work in progress.
I tested 2 llama.cpp build methods for Windows with MSVC, and the method in https://www.qualcomm.com/developer/blog/2024/04/big-performance-boost-for-llama-cpp-and-chatglm-cpp-with-windows got me slightly better results than the build method in #7191. I still need to test building with clang, but I expect little difference, since clang uses the MSVC backend on Windows.
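For anyone who wants to reproduce the native arm64 MSVC build, here is a minimal sketch with CMake (assuming Visual Studio 2022 with the ARM64 tools installed; the exact flags in the Qualcomm blog and in #7191 differ slightly, so treat this only as a starting point):

```sh
# configure a native ARM64 build with the MSVC generator, then build the Release binaries
cmake -B build -G "Visual Studio 17 2022" -A ARM64
cmake --build build --config Release
```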
Another update/extension - with WSL2/gcc using 10 CPUs / 8 GB RAM and Ubuntu 24.04, the numbers are very similar (all dependent on cooldowns/throttling):
| model | size | params | backend | threads | test | t/s |
| ------------- | -------: | -----: | ------- | ------: | ----- | ------------: |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | CPU | 10 | pp512 | 62.46 ± 2.69 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | CPU | 10 | tg128 | 9.58 ± 3.04 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 10 | pp512 | 61.93 ± 2.76 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 10 | tg128 | 13.74 ± 10.70 |
build: a27152b (3285)
On my 32GB Surface Pro 11, using LM Studio with 4 threads on a Llama 3 Instruct 8B q4_k_m gguf, I am seeing 12-20+ tok/s pretty consistently. Doh, will try bumping LM Studio to 10 threads. The Snapdragon arm64 release version of LM Studio is here: https://lmstudio.ai/snapdragon
I don't understand how llama.cpp projects are prioritized and queued, but LM Studio 0.3.0 (beta) supposedly already has some Snapdragon/NPU support somehow? (I'm on the waiting list to get the beta bits.)
Excitedly anticipating future NPU support!
My llama-bench command line is derived from the one ggerganov used for the initial Apple M-series benchmarking: `./llama-bench -m <model-name> -p 512 -n 128 -t 10` (10 is for the Plus' 10 cores; for the Elite use `-t 12`; if llama.cpp can use a GPU, add `-ngl 99`, or `-ngl 0` if you don't want it to use the GPU).
As far as I know, there are no downloads of Q4_0_4_8 models. You just download the model you want, ideally as a Q4_0 variant, and convert it via `./llama-quantize --allow-requantize <name of the downloaded model> <name of the new Q4_0_4_8 model> Q4_0_4_8`.
On AI performance, it's complicated. Any AI response has 2 parts.

1) The AI analyzes the prompt (which can be quite long) to see what the situation is and what is required as an answer. This "prompt processing" (PP) is done for the complete prompt string at once, and for this a lot of compute horsepower is needed. Here GPU acceleration, or the new ARM CPU optimizations with Q4_0_4_8, give a 2-3x acceleration.

2) Once the prompt is processed completely, the LLM generates the response token by token. For this "token generation" (TG), the LLM needs to calculate the next token from ALL the many billion parameters as well as the context (all the tokens of the prompt and of the previous response). So with TG the LLM shuffles GB of data - FOR EACH AND EVERY TOKEN to be generated - from its RAM into the CPU/GPU's chip-internal ultra-fast cache memory (which is much too small to hold everything and needs to be re-used/re-loaded all the time). So for TG the RAM bandwidth (and less so the compute horsepower) becomes the limiting factor - how fast the billions of parameters can be pumped into the calculation. Therefore we see little improvement for TG from more compute horsepower. Apple's Max and Ultra chips have 4x to 8x the memory bandwidth of the base M-chip, or even of the Snapdragon X (which has ~33% faster RAM than the base M2/M3), and this influences the TG numbers.

This is why llama-bench gives "pp" and "tg" numbers separately in its tables. pp512 means it got a 512-token prompt to analyze - a very long one; such long prompts are e.g. very important for retrieval-augmented generation (RAG), where the LLM gets a lot of context information in the prompt for a question.
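As a hedged back-of-the-envelope illustration of the TG bandwidth limit (the ~135 GB/s Snapdragon X figure is an assumption derived from "~33% faster RAM than the base M2" with its ~100 GB/s, not a measurement):

```sh
# every generated token streams essentially the whole 3.56 GiB Q4_0 7B weight file from RAM,
# so the bandwidth-bound ceiling is roughly memory bandwidth / model size:
awk 'BEGIN { printf "%.1f t/s upper bound\n", 135e9 / (3.56 * 1024^3) }'
# -> ~35 t/s; the measured tg128 numbers of ~18-23 t/s sit plausibly below that ceiling
```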
SK is Microsoft's open-source framework for building their Copilots (similar to langchain etc., but simpler/easier). Currently more of a programmer thing. It's weird how programming changes with AI. E.g. programming Spanish-translation functionality for your application becomes just a few SK calls and telling an AI to translate to Spanish - totally unlike the geeky/complex recipes of traditional programming. And with AI agents, the AI gets an input and then decides which tools it should use, and how, in order to accomplish the task - e.g. web search - it's still very early. The new Llama-3-Groq-8B-Tool-Use is the first LOCAL LLM which is very capable of generating a good plan for this tool use by AI agents; until now that was only possible with cloud AI.
Very long-winded answer, but I hope it helps.
You are a great explainer, thanks for your long-windedness!
To test your claims of 2 to 3x acceleration, I did the following:
Obtained Llama-3-Groq-8B-Tool-Use-Q4_K_M.gguf and quantized it using Q4_0_4_8.
Benched the original (llama 8B Q4_K - Medium) vs. the requantized version (llama 8B Q4_0_4_8), using 10 and 12 threads, and the meager GPU on the SP11.
llama-bench -m "Llama-3-Groq-8B-Tool-Use-Q4_K_M.gguf" -p 512 -n 128 -t 12
model
size
params
backend
threads
test
t/s
llama 8B Q4_K - Medium
4.58 GiB
8.03 B
CPU
12
pp512
32.31 ± 0.40
llama 8B Q4_K - Medium
4.58 GiB
8.03 B
CPU
12
tg128
11.15 ± 1.26
`llama-bench -m Groq_8B_Q4_0_4_8.gguf -p 512 -n 128 -t 10`

| model | size | params | backend | threads | test | t/s |
| ----------------- | -------: | -----: | ------- | ------: | ----- | -------------: |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 10 | pp512 | 161.48 ± 11.56 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 10 | tg128 | 16.17 ± 2.24 |
`llama-bench -m Groq_8B_Q4_0_4_8.gguf -p 512 -n 128 -t 12`

| model | size | params | backend | threads | test | t/s |
| ----------------- | -------: | -----: | ------- | ------: | ----- | -------------: |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 12 | pp512 | 173.87 ± 34.26 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 12 | tg128 | 15.90 ± 3.48 |
`llama-bench -m Groq_8B_Q4_0_4_8.gguf -p 512 -n 128 -t 12 -ngl 99`

| model | size | params | backend | threads | test | t/s |
| ----------------- | -------: | -----: | ------- | ------: | ----- | -------------: |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 12 | pp512 | 174.54 ± 21.36 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 12 | tg128 | 16.37 ± 3.25 |
The benchmarks easily confirm the acceleration. In terms of pure t/s it is more like 5x, actually.
But how does the Q4_0_4_8 actually perform in a chat?
My interface is LM Studio. I copied the Q4_0_4_8 gguf to the models directory and, after restarting LM Studio, it did indeed show up as available, but it wouldn't load. I tried several different "preset files" (all the different settings, mlock etc.) but nothing worked. The original (non-requantized) Groq gguf worked fine, however. I will get on Discord and see if I can learn what is wrong.
On a side note, I spent some time at the Qualcomm AI Hub, where they say to "Bring your own model". I think the idea is to use their hub to transform a model into something that uses the NPU? https://aihub.qualcomm.com/compute/models
Thanks again, for your good info.
OK, sorry, I never used LM Studio, so I cannot help you there. I have used Ollama, but not yet with the Q4_0_4_8 models.
On chat performance with Q4_0_4_8: it's probably not a big improvement, since prompt processing (pp) normally only plays a minor role in chats; tg is the major factor there, and for tg it's not as much of an improvement. If you are mainly into chats, probably the new Llama-3.1-8B-Instruct models from this week would be best. Llama-3-Groq-8B-Tool-Use is best for tool use and not for chats. To my knowledge, the llama.cpp team is hard at work trying to support Llama 3.1.
To my knowledge, the Qualcomm AI Hub (with its QNN technology) is all about small local models and power savings - much smaller and less capable models than Llama 3.1 8B. There is an effort underway to get llama.cpp to support QNN, but I think it's still a long way off.
From the list of models they host, I believe that's mostly true, but they also have deployable versions of Llama 2 7B and Llama 3 8B with support for Snapdragon 8 Gen 3 Mobile and Snapdragon X Elite.
I haven't done anything with qai_hub yet because it looks (comparatively) convoluted to get deployed for a pleb like myself.
8B Model (github)
They write that their Llama 3 8B model is quantized to w4a16 (4-bit weights and 16-bit activations).
FYI, I just tried WebLLM, which supports the Adreno GPU, and it is surprisingly fast for e.g. Phi-3.5-mini-instruct - about half the performance of my MacBook Air M2 10-GPU running a native llama.cpp Q4_0 model, and that while running only in Chrome!! I will try to dig deeper into this.
Thanks! Let's see what developments will happen around the NPU with QNN.
Update for the Surface Laptop 7 / Snapdragon X Elite - it seems that the Elite utilizes the memory bandwidth better than the asymmetrical Plus (for token generation):
| model | size | params | backend | threads | test | t/s |
| ----------------- | -------: | -----: | ------- | ------: | ----- | ------------: |
| llama 7B Q4_0_4_8 | 3.56 GiB | 6.74 B | CPU | 12 | pp512 | 169.12 ± 8.85 |
| llama 7B Q4_0_4_8 | 3.56 GiB | 6.74 B | CPU | 12 | tg128 | 23.41 ± 1.35 |
build: cddae48 (3646)
What about the GPU and NPU backend?
@neozhang307 - llama.cpp on the Snapdragon X's GPU should in theory work via Vulkan, but llama.cpp with Vulkan enabled currently hangs on load (both with the native Qualcomm driver and with Microsoft's DX12 driver via `SET GGML_VK_VISIBLE_DEVICES=1`). As for the NPU, there is currently some work being done to support it via QNN, and there is also some initial discussion about supporting DirectML - but neither is running yet.
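For reference, a minimal sketch of the Vulkan attempt described above (assuming a build with the Vulkan backend enabled, e.g. `-DGGML_VULKAN=ON`; in Windows cmd the `export` becomes `SET`):

```sh
# select Microsoft's DX12-based Vulkan driver instead of Qualcomm's native one
export GGML_VK_VISIBLE_DEVICES=1
# currently hangs on model load with either driver
./llama-bench -m <model-name> -p 512 -n 128 -t 12 -ngl 99
```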
Use the Q4_0_4_8 (or _4) quantization for your models on Snapdragon X CPUs with llama.cpp. It runs quite fast because it uses the CPU's matrix instructions. The Snapdragon X Elite's CPUs with Q4_0_4_8 are similar in performance to an Apple M3 running Q4_0 on its GPU.
`llama-bench -m <model> -p 512 -n 128 -t 12` with a Snapdragon X Elite Surface Laptop 7 on build fa42aa6 (3897) yields:
| model | size | params | backend | threads | test | t/s |
| ----------------- | -------: | -----: | ------- | ------: | ----- | -------------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 12 | pp512 | 63.05 ± 7.40 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 12 | tg128 | 19.83 ± 1.33 |
| llama 7B Q4_0_4_8 | 3.56 GiB | 6.74 B | CPU | 12 | pp512 | 178.88 ± 11.31 |
| llama 7B Q4_0_4_8 | 3.56 GiB | 6.74 B | CPU | 12 | tg128 | 23.24 ± 0.84 |
| llama 7B Q4_0_4_4 | 3.56 GiB | 6.74 B | CPU | 12 | pp512 | 144.52 ± 14.16 |
| llama 7B Q4_0_4_4 | 3.56 GiB | 6.74 B | CPU | 12 | tg128 | 22.81 ± 1.28 |
I'm not sure how the Surface Laptop handles thermals with this stress workload and how much it throttles (the CPUs are not always maxed out after the initial 100%).
There is working Snapdragon X GPU support via WebLLM in Chrome (e.g. via chat.webllm.ai). But llama.cpp with Q4_0_4_8 on the CPU seems faster and much more versatile.
There is also support for the Snapdragon X's NPU via ONNX's QNN drivers. I did not yet compare its performance (speed and power consumption) against the CPU.
Thanks a lot Andreas. Keep us updated - it seems no one else is doing much on this. Could you please be so kind and show us a complete example of a ./llama-quantize run you made?
@manuelpaulo - I just downloaded a llama-2 7B Q4_0 gguf model from huggingface and did `./llama-quantize --allow-requantize <name of the downloaded Q4_0 model>.gguf <name of the new Q4_0_4_8 model>.gguf Q4_0_4_8`. I used llama-2 7B because then you can compare the results to the Apple Silicon (GPU/Metal) llama.cpp performance numbers in discussion #4167.
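With hypothetical file names (the downloaded Q4_0 gguf and a target name of your choice), the complete call looks roughly like this:

```sh
# file names are placeholders - use whatever you downloaded / want to create
./llama-quantize --allow-requantize llama-2-7b.Q4_0.gguf llama-2-7b.Q4_0_4_8.gguf Q4_0_4_8
```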
With newer models like e.g. llama 3.2, there already are ready-made Q4_0_4_8 quantized gguf-file versions available for direct download from huggingface.
This processor has been available for about 5 months now, and I have the opportunity to get an ASUS Vivobook S 15 OLED with 32GB of RAM built around it for 850 USD (Black Friday deal).
So, do we have at last NPU support for the Qualcomm Snapdragon X Plus X1P processor?
Alternatively, is there any NPU support for the AMD Ryzen AI 9 HX 370?
The latter is supposed to offer 50 TOPS vs 45 TOPS for the Snapdragon chip.
And what is the situation for the latest LMStudio or Ollama builds?
Welcome to the ghost land.
Below is the status as far as I know it. FYI, I had a Surface Pro 11 (Snapdragon X Plus) at launch and later switched to a 15" Surface Laptop 7 (Snapdragon X Elite). I am mainly a Mac user, even though I worked for Microsoft 1994-2009. And I am "multilingual" re OS - macOS, Windows (native ARM and VM on Mac), Linux (containers, VMs).
> This processor has been available for about 5 months now, and I have the opportunity to get an ASUS Vivobook S 15 OLED with 32GB of RAM built around it for 850 USD (Black Friday deal).
> So, do we have at last NPU support for the Qualcomm Snapdragon X Plus X1P processor?
Qualcomm's NPU is a bit of an issue: a) its programming is strange and requires dedicated, special software (QNN); b) this software is mostly as-is and not very extensible (problematic for llama.cpp's always-evolving quantization of parameters, the K/V cache, ...). Microsoft and some marketing-oriented people (yes, I'm a marketer too) with strong opinions but somewhat lacking in knowledge are pushing NPUs, while CPU programming is already faster and rapidly evolving in speed - with dedicated SIMD instructions, new algorithms like mixed-precision math via lookup, ...
Current status re Snapdragon X GPU/NPU as of late Nov 2024: In my opinion GPU support for the Snapdragon X makes little sense; it's slower, buggy (via Vulkan - Qualcomm's driver seems to have some issues) and consumes more power than the CPU. The NPU would also be slower than the CPU and only able to run very limited models (very few quantization choices, and some size limits), even though it might save power. Qualcomm/Microsoft are not putting any visible effort into this. So there is only some llama.cpp-with-QNN work going on for mobile Snapdragons (see above).
Speed and recent llama.cpp innovations: with the Q4_0_4_4 CPU optimizations, the Snapdragon X's CPU got 3x faster. Recent llama.cpp changes re-pack Q4_0 models automatically into the accelerated Q4_0_4_4 layout when loading them on supporting ARM CPUs (PR #9921). So now running llama.cpp on the Snapdragon X CPU is faster than on the GPU or NPU. Also, the Microsoft open-source team around BitNet/T-MAC (they also seem to do the 1.58-bit quantization efforts) is working on very fast mixed-precision quantization math (PR #10181), which would accelerate this even further. My Snapdragon X Elite with llama.cpp Q4_0 models currently runs approximately as fast on the CPU as my M2 10-GPU Mac.
TL;DR: I don't think there will be, or needs to be, NPU/GPU support for the Snapdragon X in llama.cpp, because it makes little sense when things run faster on the CPU. Currently this is true at least for Q4_0 models, with more likely to come.
> And what is the situation for the latest LMStudio or Ollama builds?
Snapdragon X on Windows on ARM is now supported natively by ollama. LM Studio to my knowledge also works. As for their support for the other acceleration efforts, I need to check them when I get back from vacation.
> So now running llama.cpp on the Snapdragon X CPU is faster than on the GPU or NPU.

So one would question: what is the purpose of a dedicated built-in NPU if the CPU is faster?
I try to answer this to the best of my knowledge:
The NPU is not supposed to be faster; it's just fast enough and uses less power, and it can run in parallel to the CPU, but sadly it is also much less flexible and much more complicated to program. It's great for some specific small models (Microsoft calls them SLMs instead of LLMs). And it's the same with Apple's ANE and their AI software - even Apple's own excellent open-source MLX AI framework does not run on the ANE, only on their GPUs. NPU/ANE are used by Microsoft/Qualcomm and Apple for comparatively small local AI things in their operating systems (some image processing, live transcription & translation, ...). But I am disappointed by how much marketing hype and how little available function there currently is.
Commonly it's better to use a larger-parameter but heavily quantized model (e.g. Q4 or similar) over a smaller but full-precision model of similar memory size/speed. llama.cpp innovates rapidly with improvements in quantization and surrounding optimizations, but the inflexible NPU/ANE don't seem to support the necessary custom code (GPUs with their custom shaders do). Also, Vulkan has been establishing itself as vendor-neutral GPU programming (including custom shaders), enabling llama.cpp to support multiple GPUs besides mainstream NVIDIA (CUDA) / Apple (Metal) with one back-end - nothing similar has emerged yet for NPUs.
So if you currently want to use the Snapdragon X NPU, you have to use Qualcomm's QNN code and not llama.cpp. Qualcomm has improved QNN a bit since the summer, but in my earlier tests I could not get it to run even a good/modern 7B+ model (OK, I might just be plain stupid).
Thank you for the feedback.
Meanwhile, due to indecision, I just missed the Black Friday offer that motivated my question, but given your answers that is probably not a great loss. Besides, I just found the same 50% Black Friday deal from another ASUS reseller, but this time with the Qualcomm Snapdragon X Elite X1E-78-100 (5% speedier :-).
But wait - if the NPU is not such a big deal, the AMD Ryzen AI 9 HX 370 looks even more appealing. It is 50% speedier than the Snapdragon X Elite X1E-78-100 on the CPU side and offers 50 TOPS vs 45 TOPS for its embedded NPU.
And, yes, there are also Black Friday 50% deals for notebooks featuring that AMD part :-)
So, same question but now addressing the AMD Ryzen AI 9 HX 370 (pending the release of the AMD Ryzen AI 9 HX 395, hopefully featuring up to 96GB of usable VRAM).
Namely, are there specific llama.cpp builds for the AMD Ryzen AI 9 HX 370 or progress towards it?
See my answer to your cross-post in the other thread. This thread is only about the Snapdragon X.
Just note that the synthetic benchmarks vendors/reviewers promote (e.g. NPU TOPS, Geekbench) are completely useless with regard to llama.cpp/ollama/LM Studio performance. E.g. for the new M4 base Macs, Geekbench etc. show them to be faster than the Ultra variant of the M1, but if you look at the measurements in discussion #4167, you see a completely different picture. So look in the GitHub llama.cpp discussions for real performance comparisons (best compared using llama-bench with the old llama-2 model; Q4_0 and its derivatives are the most relevant numbers).
@AndreasKunar
Would you be interested in participating in a roundtable discussion with some qualcomm engineers? They want to discuss (with no commitments or promises ofc) what they can do to better support open source
Pinging @slaren and @JohannesGaessler too since you may be interested as well, not sure how much you deal with the low level or if you're at all interested in better support for qualcomm/snapdragon
I don't have the time personally to work on a Qualcomm NPU backend, so I will leave that to the people interested in working on it. That said, I think what we would need from Qualcomm is quite simple: make their libraries easily accessible with open licenses, and flexible enough that we can implement our own kernels.
To add to what Diego said, good documentation and developer tools would also be very much appreciated.
Okay great, I'll make sure these concerns are noted, thank you! If you have any other thoughts or suggestions feel free to either add them here or message me directly somewhere (twitter I guess?), will definitely be helpful during conversations
I would be very interested - beginning next week, currently on vacation, I'm in MET but participating mornings/evenings is not an issue. You can reach me via e-mail (should be public in my profile).
BUT, to clarify, I'm not a developer anymore. I was good with C, but am very rusty, and I never worked productively in C++, which the new llama.cpp/GGML backends are written in. I'm more of a marketer/communicator with a strong technical background. My interests are in distributing AI to edge devices, but mainly to modern laptops and desktops (small, low power-consumption), not to mobiles/tablets. I currently have a Surface Laptop 7 Elite/16GB, an M2 10-GPU/24GB MacBook Air (getting replaced next week by an M4 Pro MacBook Pro 48GB) and an M2 Max 96GB Mac Studio. I am also very interested in running llama.cpp etc. in containers/VMs, mainly for security reasons - and this currently means CPU-only, or podman/krunkit with Vulkan remoting.
About me: I am retired. I used to be a marketer/business leader at Microsoft in Europe (developer tools, internet technologies, servers, and technical-audience marketing, 1994-2009). Before that I was for a few years an "Advanced Technical Specialist" on behalf of Intel in Austria and Eastern Europe, doing pre-sales for developer tools and processor architectures (386 to P5). And before that I was a software developer for 10+ years - mainly system-level C and highly optimized assembler programming. My formal education is in computer science (MSc) and electronics (BSc).
@bartowski1182 In addition to these comments, the qualcomm engineers can consider to open a discussion in the llama.cpp repository where we can discuss any topic that they are interested in. Similar discussions have already been done with teams from Intel (#3965), Nvidia (#6763), Arm (#5780) and others and these have resulted in better support for the hardware. The llama.cpp project is generally open to add support for all kinds of hardware, as long as there are developers that can help with the implementation and the maintenance.
Apologies for the delayed response to this thread.
Qualcomm engineers have been participating in llama.cpp development for some time now ;-)
Keen observers would have noticed that our ( @max-krasnyansky and @fmz ) PRs use QuIC (Qualcomm Innovation Center) ids.
We've been focusing primarily on the CPU so far, things like Windows on ARM64 build, Threading Performance and advanced features to take advantage of Snapdragon X-Elite CPU clusters, etc. We had our own version of Q4_0_X_X which we were going to contribute but Arm folks beat us to it with Q4_0_4_X :) so we just switched to that.
Q4_0_4_8 is the best performing layout on the current generation Snapdragon CPUs (8 Gen3, X Elite, 8 Elite).
You can find detailed perf reports for Snapdragon Gen 3 and X-Elite in the threadpool related PRs #8672.
Just search for Snapdragon in the PRs.
Some of our GPU folks ( @lhez ) are about to join the party. We're getting ready to submit OpenCL-based Backend with Adreno support for the current gen Snapdragons. I finished rebasing it on top of dynamic backend load updates yesterday and we should be able to start an official PR after some more testing.
Here is the fork/branch we're using for staging: CodeLinaro/Adreno
The NPU support will take more effort, sorry no ETA at this point. I'm fully aware of what is needed for this.
@bartowski1182 Feel free to connect those Qualcomm engineers you were referring to with me.
User report and endorsement, fwiw:
I've been traveling for the last 3 months in remote parts of Italy, often offline, and am not permitted to use online AI tools. I use LM Studio on a Surface Pro 11, 32GB model with the 12-core Snapdragon X Elite. Performance and battery life using LM Studio have been excellent.
My layman's understanding is that the NPU is not just about efficiency / less battery but, because it is optimized for matrix operations and activation functions, increases performance for some AI/ML tasks.
Snapdragon CPU work so far: great! NPU work so far: still only promising, becoming irritating. (Same for Microsoft, although 2024 Ignite indicates they are making progress.)
I applaud @bartowski1182's roundtable recommendation, including @AndreasKunar.
You should try https://chat.webllm.ai - it's much faster, and it will work offline once the model is loaded into the cache.
Does it somehow run on the NPU? No indication of such on their GitHub..
Yes, the NPU should be quite a bit more efficient, and the GPU as well I would think. Though the improvements made through optimizing the memory loading are quite a great step in the right direction!
It seems like NPU support for LM Studio is coming soon, as seen here. It's possible they have built a new backend, similar to the MLX engine for Apple.