Using gemma-3n-E4B-it-int4-Web.litertlm in Chrome/Edge on Android may result in errors similar to "out of memory."

#11
by willopcbeta - opened

I am trying to develop an offline web-page translation app using Google AI Studio. I used @mediapipe/tasks-genai on my PC to download and initialize the model, and that worked fine.
However, when I try to do the same on my Android phone, it crashes partway through initialization.
Is this related to the size restriction of ArrayBuffer? Is there any way to split the model?
Edge Gallery works normally on the same device.
Device: Pixel 8 Pro.

Hi @willopcbeta,

It sounds like you're encountering memory or resource limitations on your Pixel 8 Pro. While your PC browser has ample resources to load large models into an ArrayBuffer, mobile browsers impose much stricter per-tab RAM limits, which causes the tab to crash during initialization. Standard models are often too heavy for these limits, and native demos (like Edge Gallery) work because they typically use quantized models. So try using smaller, quantized models.

Thank You

Google org

The gemma-3n-E4B-it-int4-Web.litertlm model has been quantized already to int4, and we don't offer any quantizations smaller than int4, but you could try the smaller "E2B" version of the model here: https://huggingface.co/google/gemma-3n-E2B-it-litert-lm/blob/main/gemma-3n-E2B-it-int4-Web.litertlm.

Alternatively, you can try using only text and vision (and not audio), since sometimes certain shader limits can be hit specifically with the combination of audio + mobile + web.

And finally, I'd recommend instantiating your models in the MediaPipe LLM Inference Web API using a ReadableStreamDefaultReader (i.e. baseOptions: {modelAssetBuffer: modelReadableStreamDefaultReader}), since that method of instantiation uses our "streaming loading system", which should avoid running into issues with ArrayBuffer limits (because it loads the model weight-by-weight).
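For reference, a minimal sketch of that streaming instantiation (the helper name, WASM path, and model URL are placeholders, not taken from this thread):

// Sketch: streaming instantiation with @mediapipe/tasks-genai.
// The WASM path and model URL below are placeholders; substitute your own.
import {FilesetResolver, LlmInference} from '@mediapipe/tasks-genai';

async function createLlmStreaming(modelUrl: string): Promise<LlmInference> {
    const genai = await FilesetResolver.forGenAiTasks(
        'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm');
    const response = await fetch(modelUrl);
    if (!response.body) {
        throw new Error('Response has no body to stream from.');
    }
    // Passing a ReadableStreamDefaultReader as modelAssetBuffer uses the
    // streaming loading path, so the whole model never sits in one ArrayBuffer.
    const reader = response.body.getReader();
    return LlmInference.createFromOptions(genai, {
        baseOptions: {modelAssetBuffer: reader},
    });
}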

Thanks @tylermullen for helping me with my issue

Using "ReadableStreamDefaultReaderbaseOptions:" has indeed resolved the Array Buffer limits issue when loading Gemma 3n model in web.

In Chrome > Developer Tools > indexeddb > gemma-3n-E2B-it-int4-Web (2.82G) is split into 520 parts.

It is important to note that the issue only occurs during the initial page load after completing the download. This part is also resolved by addressing the method of reading from indexeddb.

The multiple segments lead to an increased detection and initialization time. Below is the Gemini response to the solution:

Gemini solution:

To fix the issue, we load the data in chunks rather than loading the entire buffer at once. This reduces memory usage and prevents the browser from hitting ArrayBuffer limits.

We use the ReadableStream API to deliver the data in chunks: the model is already stored in IndexedDB as many small records, so we read those records sequentially with a cursor and enqueue each one, making sure a chunk is fully read before moving on to the next.

By loading the data in smaller chunks, we avoid hitting the browser's ArrayBuffer limits and prevent memory-related issues. This also makes handling of the IndexedDB data faster and more efficient.

Fast initialization: once the assembled model is cached as a single record, loading time drops from many seconds to potentially under a second, since it only involves a single IndexedDB read operation.

The while (cursor) loop over hundreds of chunk records is completely eliminated for all subsequent uses.

// Assembles the model from its IndexedDB chunk records into a single Blob,
// streaming chunk-by-chunk so the full model never occupies one ArrayBuffer.
// Assumes a promise-based IndexedDB wrapper (e.g. the `idb` library), where
// openCursor() and cursor.continue() return promises.
private async getModelFromChunksAsStream(modelName: string): Promise<Blob | null> {
    const db = await this.getDb();
    const stream = new ReadableStream<Uint8Array>({
        async start(controller) {
            try {
                const tx = db.transaction(CHUNKS_STORE, 'readonly');
                const store = tx.objectStore(CHUNKS_STORE);
                // Chunks are keyed as [modelName, index]; this range selects
                // every chunk belonging to the requested model, in order.
                const range = IDBKeyRange.bound([modelName, 0], [modelName, Number.MAX_SAFE_INTEGER]);

                let cursor = await store.openCursor(range);

                while (cursor) {
                    if (cursor.value.data) {
                        // Enqueue each chunk as soon as it is read.
                        controller.enqueue(new Uint8Array(cursor.value.data));
                    }
                    cursor = await cursor.continue();
                }
                controller.close();
            } catch (e) {
                controller.error(e);
            }
        }
    });

    // Response.blob() consumes the stream and yields the assembled model.
    return new Response(stream).blob();
}
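To get the single-read path described above, one option is to cache the assembled Blob back into IndexedDB as a single record after the first streaming assembly. A minimal sketch, assuming a hypothetical MODELS_STORE object store alongside CHUNKS_STORE and the same promise-based idb wrapper:

private async getModelCached(modelName: string): Promise<Blob | null> {
    const db = await this.getDb();
    // Fast path: a previously assembled model is stored as one record, so
    // loading it is a single IndexedDB read with no cursor loop.
    const cached = await db.get(MODELS_STORE, modelName);
    if (cached) {
        return cached as Blob;
    }
    // Slow path (first load after download): stream the chunks into a Blob,
    // then persist it so every later load takes the fast path.
    const blob = await this.getModelFromChunksAsStream(modelName);
    if (blob) {
        await db.put(MODELS_STORE, blob, modelName);
    }
    return blob;
}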

Google org

Glad that helped!

Also, for caching, consider using APIs that might handle single large files or raw HTTP responses a little more efficiently, like OPFS. For an example of this, using the Gemma 3n web models, you can check out the source code from my open-sourced Gemma 3n demo, staged here as a HuggingFace Space.
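As a rough illustration of that approach (the helper and file name below are placeholders, not taken from the linked demo), caching the downloaded model in OPFS could look like this:

// Sketch: cache the model bytes in OPFS so later page loads read one local file.
async function getModelViaOpfs(modelUrl: string, fileName: string): Promise<File> {
    const root = await navigator.storage.getDirectory();
    try {
        // Fast path: the model was already written on a previous visit.
        const existing = await root.getFileHandle(fileName);
        return existing.getFile();
    } catch {
        // Not cached yet: download once and persist it.
        const handle = await root.getFileHandle(fileName, {create: true});
        const writable = await handle.createWritable();
        const response = await fetch(modelUrl);
        await response.body!.pipeTo(writable); // pipeTo() also closes the writable.
        return (await root.getFileHandle(fileName)).getFile();
    }
}

The returned File can then be streamed into the LLM task via file.stream().getReader() as the modelAssetBuffer.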

Thank you for the reminder. Switching to OPFS has significantly improved read performance, and I have also moved the @mediapipe/tasks-genai related functions into a Web Worker; the overall smoothness and stability are far better than before.
This issue can be considered solved, so I will close it.
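For anyone following the same path, a rough sketch of that Web Worker split (message shapes, file names, and paths are illustrative, not from the actual project):

// llm.worker.ts (illustrative): run the MediaPipe LLM task off the main thread.
import {FilesetResolver, LlmInference} from '@mediapipe/tasks-genai';

let llm: LlmInference | undefined;

self.onmessage = async (event: MessageEvent) => {
    const msg = event.data;
    if (msg.type === 'init') {
        const genai = await FilesetResolver.forGenAiTasks(msg.wasmPath);
        const reader = (await fetch(msg.modelUrl)).body!.getReader();
        llm = await LlmInference.createFromOptions(genai, {
            baseOptions: {modelAssetBuffer: reader},
        });
        self.postMessage({type: 'ready'});
    } else if (msg.type === 'generate' && llm) {
        // Inference runs inside the worker, keeping the page responsive.
        const text = await llm.generateResponse(msg.prompt);
        self.postMessage({type: 'result', text});
    }
};

// Main-thread usage (illustrative):
// const worker = new Worker(new URL('./llm.worker.ts', import.meta.url), {type: 'module'});
// worker.postMessage({type: 'init', wasmPath: '...', modelUrl: '...'});
// worker.postMessage({type: 'generate', prompt: 'Translate this page to English.'});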

willopcbeta changed discussion status to closed
