
If you cannot download models directly from HuggingFace[1], you can use the https://github.com/AlphaHinex/hf-models repository to build a Docker image with GitHub Actions[2]. During the image build, huggingface_hub[3] downloads the required models, and the finished image is pushed to Docker Hub[4]. You can then obtain the model files by pulling that image.
1. Available Models (tags)
The currently available models are listed in the repository tags[5]; each repository tag corresponds to an image tag of the same name.
For example, the command to pull the image for the codet5-small[6] tag is:
docker pull alphahinex/hf-models:codet5-small
The model included in the image is Salesforce/codet5-small[7].
2. How to Use
Download the image:
docker pull alphahinex/hf-models:codet5-small
If you have difficulty pulling the image directly from Docker Hub, refer to the Summary of Solutions for Current Docker Hub Access Issues in China[8] to configure a registry mirror. The Shanghai Jiao Tong University mirror at https://docker.mirrors.sjtug.sjtu.edu.cn/ was very fast when tested.
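One common way to configure a mirror is the `registry-mirrors` key in the Docker daemon configuration file (typically `/etc/docker/daemon.json` on Linux), followed by a daemon restart. The sketch below only builds the merged configuration in memory. The helper name `add_registry_mirror` is hypothetical, and the empty starting configuration is an assumption, so adapt it to your own setup before writing any file.

```python
import json


def add_registry_mirror(config: dict, mirror_url: str) -> dict:
    """Return a copy of a Docker daemon config with mirror_url listed under registry-mirrors."""
    updated = dict(config)
    mirrors = list(updated.get("registry-mirrors", []))
    if mirror_url not in mirrors:
        mirrors.append(mirror_url)
    updated["registry-mirrors"] = mirrors
    return updated


if __name__ == "__main__":
    existing = {}  # assumption: stands in for the current contents of daemon.json
    merged = add_registry_mirror(existing, "https://docker.mirrors.sjtug.sjtu.edu.cn/")
    # Print the JSON that would go into the daemon configuration file.
    print(json.dumps(merged, indent=2))
```

The function leaves other keys untouched and does not add the same mirror twice, so it is safe to run against an existing configuration.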
Start the container:
docker run -d --name test --rm alphahinex/hf-models:codet5-small tail -f /dev/null
Check the model download path:
$ docker exec -ti test tree /root/.cache/huggingface/hub
/root/.cache/huggingface/hub
└── models--Salesforce--codet5-small
    ├── blobs
    │   ├── 056c085b0bf1966a4658710891af6de209b608be
    │   ├── 263a6f72aceb1716442638a3bcf20afe1eb0de9a
    │   ├── 319fd0bbb49414442ca8c66a675ebce7b3fec747
    │   ├── 38ed64670805e4a3ff4cfa6f764629324a4e3c1e
    │   ├── 51b0295e221a3e91142cfedb6f3d6f9b74291487
    │   ├── 6d34772f5ca361021038b404fb913ec8dc0b1a5a
    │   ├── 968fb0f45e1efc8cf3dd50012d1f82ad82098107cbadde2c0fdd8e61bac02908
    │   ├── 9e26dfeeb6e641a33dae4961196235bdb965b21b
    │   └── e830a2bc8cae841f929043d588e1edcffb28fe9a
    ├── refs
    │   └── main
    └── snapshots
        └── a642dc934e5475185369d09ac07091dfe72a31fc
            ├── README.md -> ../../blobs/51b0295e221a3e91142cfedb6f3d6f9b74291487
            ├── added_tokens.json -> ../../blobs/9e26dfeeb6e641a33dae4961196235bdb965b21b
            ├── config.json -> ../../blobs/056c085b0bf1966a4658710891af6de209b608be
            ├── merges.txt -> ../../blobs/319fd0bbb49414442ca8c66a675ebce7b3fec747
            ├── pytorch_model.bin -> ../../blobs/968fb0f45e1efc8cf3dd50012d1f82ad82098107cbadde2c0fdd8e61bac02908
            ├── special_tokens_map.json -> ../../blobs/e830a2bc8cae841f929043d588e1edcffb28fe9a
            ├── tokenizer_config.json -> ../../blobs/263a6f72aceb1716442638a3bcf20afe1eb0de9a
            └── vocab.json -> ../../blobs/38ed64670805e4a3ff4cfa6f764629324a4e3c1e

5 directories, 18 files
Copy the model files out of the container:
docker cp -L test:/root/.cache/huggingface/hub/models--Salesforce--codet5-small/snapshots/a642dc934e5475185369d09ac07091dfe72a31fc/README.md .
docker cp -L test:/root/.cache/huggingface/hub/models--Salesforce--codet5-small/snapshots/a642dc934e5475185369d09ac07091dfe72a31fc/added_tokens.json .
docker cp -L test:/root/.cache/huggingface/hub/models--Salesforce--codet5-small/snapshots/a642dc934e5475185369d09ac07091dfe72a31fc/config.json .
docker cp -L test:/root/.cache/huggingface/hub/models--Salesforce--codet5-small/snapshots/a642dc934e5475185369d09ac07091dfe72a31fc/merges.txt .
docker cp -L test:/root/.cache/huggingface/hub/models--Salesforce--codet5-small/snapshots/a642dc934e5475185369d09ac07091dfe72a31fc/pytorch_model.bin .
docker cp -L test:/root/.cache/huggingface/hub/models--Salesforce--codet5-small/snapshots/a642dc934e5475185369d09ac07091dfe72a31fc/special_tokens_map.json .
docker cp -L test:/root/.cache/huggingface/hub/models--Salesforce--codet5-small/snapshots/a642dc934e5475185369d09ac07091dfe72a31fc/tokenizer_config.json .
docker cp -L test:/root/.cache/huggingface/hub/models--Salesforce--codet5-small/snapshots/a642dc934e5475185369d09ac07091dfe72a31fc/vocab.json .
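Copying files one at a time gets tedious for models with many files. The `-L` flag matters because every entry in the snapshot directory is a symlink into `blobs`. As a minimal sketch of the same idea in Python, the hypothetical helper `export_snapshot` below copies every file in a snapshot directory to a destination as a regular file. It assumes it runs somewhere the cache directory is reachable on the filesystem (for example inside the container), and it only handles files at the top level of the snapshot directory. The demo builds a tiny stand-in cache in a temporary directory; its blob and revision names are made up.

```python
import shutil
import tempfile
from pathlib import Path


def export_snapshot(snapshot_dir: Path, dest_dir: Path) -> list[str]:
    """Copy each top-level file of snapshot_dir into dest_dir, following symlinks."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for entry in sorted(snapshot_dir.iterdir()):
        # is_file() follows symlinks, so links that point at blobs qualify.
        if entry.is_file():
            # copyfile follows symlinks by default and writes a regular file.
            shutil.copyfile(entry, dest_dir / entry.name)
            copied.append(entry.name)
    return copied


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        blobs = root / "blobs"
        snapshot = root / "snapshots" / "demo-revision"  # hypothetical revision name
        blobs.mkdir()
        snapshot.mkdir(parents=True)
        (blobs / "demo-blob").write_text('{"demo": true}')
        # Same relative link shape as in the cache listing: ../../blobs/<name>
        (snapshot / "config.json").symlink_to("../../blobs/demo-blob")
        print(export_snapshot(snapshot, root / "out"))
```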
Verify the SHA256 checksum of the model file (it matches the name of the blob file that the symlink points to):
$ shasum -a 256 pytorch_model.bin
968fb0f45e1efc8cf3dd50012d1f82ad82098107cbadde2c0fdd8e61bac02908 pytorch_model.bin
It also matches the SHA256 shown at https://huggingface.co/Salesforce/codet5-small/blob/main/pytorch_model.bin:
Git LFS Details
SHA256: 968fb0f45e1efc8cf3dd50012d1f82ad82098107cbadde2c0fdd8e61bac02908
Pointer size: 134 Bytes
Size of remote file: 242 MB
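The same check can be scripted so that it is easy to repeat for other files. The sketch below uses only the standard library. The helper names `sha256_of` and `matches` are hypothetical, and the expected digest is the one quoted above. In the cache listing, only `pytorch_model.bin` has a 64-character blob name, so this comparison applies to that file and not to the blobs with shorter names.

```python
import hashlib
from pathlib import Path

# Digest quoted from the Git LFS details of pytorch_model.bin.
EXPECTED_SHA256 = "968fb0f45e1efc8cf3dd50012d1f82ad82098107cbadde2c0fdd8e61bac02908"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """Return True if the file's SHA256 equals the expected hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    model_file = Path("pytorch_model.bin")  # assumes the file was copied to the current directory
    if model_file.exists():
        print("checksum ok" if matches(model_file, EXPECTED_SHA256) else "checksum MISMATCH")
    else:
        print(f"{model_file} not found")
```

Reading in chunks keeps memory use flat, which matters for multi-gigabyte weight files.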
Delete the container:
$ docker rm -f test
3. How to Create a New Model Image
- Modify download.py[9] to download a single file, an entire repository, or a set of files filtered by patterns. For detailed usage, see the huggingface_hub Download files[10] guide.
- Modify line 12[11] of docker-image.yml to change the image tag in the IMAGE_NAME variable.
download.py Example
- Download a single file
    from huggingface_hub import hf_hub_download
    hf_hub_download(repo_id="tiiuae/falcon-7b-instruct", filename="config.json")
- Download an entire repository
    from huggingface_hub import snapshot_download
    snapshot_download("Salesforce/codegen25-7b-mono")
- Include only certain files
    from huggingface_hub import snapshot_download
    snapshot_download("bigcode/starcoder", allow_patterns=["pytorch_model-00004-of-00007.bin", "pytorch_model-00005-of-00007.bin", "pytorch_model-00006-of-00007.bin"])
- Exclude certain files
    from huggingface_hub import snapshot_download
    snapshot_download("bigcode/starcoder", ignore_patterns=["pytorch_model-00004-of-00007.bin", "pytorch_model-00005-of-00007.bin", "pytorch_model-00006-of-00007.bin"])
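Before starting a long build, it can help to preview which files a pattern list would select. In `snapshot_download`, `allow_patterns` keeps only matching files and `ignore_patterns` drops matching files. The sketch below approximates that behavior offline with the standard library's `fnmatch`. It does not call huggingface_hub, so treat it as an approximation and not the library's exact logic. The helper name `preview_selection` is hypothetical, and the file list is an assumption built from the shard naming used above plus one made-up `config.json` entry.

```python
from fnmatch import fnmatch


def preview_selection(filenames, allow_patterns=None, ignore_patterns=None):
    """Approximate which files remain after allow/ignore pattern filtering."""
    selected = []
    for name in filenames:
        # With allow_patterns set, a file must match at least one pattern.
        if allow_patterns is not None and not any(fnmatch(name, p) for p in allow_patterns):
            continue
        # With ignore_patterns set, a file matching any pattern is skipped.
        if ignore_patterns is not None and any(fnmatch(name, p) for p in ignore_patterns):
            continue
        selected.append(name)
    return selected


if __name__ == "__main__":
    # Assumed file list for illustration only.
    files = ["config.json"] + [f"pytorch_model-0000{i}-of-00007.bin" for i in range(1, 8)]
    middle = [f"pytorch_model-0000{i}-of-00007.bin" for i in (4, 5, 6)]
    print("allow :", preview_selection(files, allow_patterns=middle))
    print("ignore:", preview_selection(files, ignore_patterns=middle))
```

Using the same pattern list once as `allow_patterns` and once as `ignore_patterns` yields two complementary sets, which is a convenient way to split one repository across two images.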
4. Constraints
Currently, the Runner[12] used by GitHub Actions runs on an Azure Standard_DS2_v2[13] virtual machine, with an 84GB data disk mounted at / and 14GB of temporary storage mounted at /mnt. The free storage space available to build tasks is around 25-29GB.
docker-image.yml uses the Maximize build disk space[14] action to expand the free space at the root path to about 45GB. If the total size of the model files to download exceeds this limit, the files can be split across multiple images. For example, the StarCoder 15.5B[15] model files total over 60GB, so they are built into two images, starcoder-01[16] and starcoder-02[17], which together contain all the files.
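Deciding how to split a large model across images is a small packing problem. The sketch below uses a first-fit-decreasing heuristic to group files so that each group stays under a size budget. The helper name `plan_images` is hypothetical, and the shard names and sizes are made-up placeholders, not the real StarCoder file sizes. The 40GB budget is also an assumption, chosen to leave some headroom below the roughly 45GB of free space.

```python
def plan_images(file_sizes_gb: dict[str, float], budget_gb: float) -> list[list[str]]:
    """Group files into as few images as the first-fit-decreasing heuristic finds."""
    groups: list[dict] = []
    # Place the largest files first; each goes into the first group with room.
    for name, size in sorted(file_sizes_gb.items(), key=lambda kv: kv[1], reverse=True):
        if size > budget_gb:
            raise ValueError(f"{name} ({size} GB) does not fit in a single image")
        for group in groups:
            if group["total"] + size <= budget_gb:
                group["files"].append(name)
                group["total"] += size
                break
        else:
            # No existing group has room, so start a new image.
            groups.append({"files": [name], "total": size})
    return [group["files"] for group in groups]


if __name__ == "__main__":
    # Placeholder sizes for illustration only.
    sizes = {f"shard-{i}": 9.0 for i in range(1, 8)}
    for index, files in enumerate(plan_images(sizes, budget_gb=40.0), start=1):
        print(f"image {index}: {files}")
```

Each resulting group could then be used as the pattern list for one image's download script.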
References
[1]HuggingFace: https://huggingface.co/
[2]GitHub Actions: https://github.com/features/actions
[3]huggingface_hub: https://github.com/huggingface/huggingface_hub
[4]Docker Hub: https://hub.docker.com/
[5]tags: https://github.com/AlphaHinex/hf-models/tags
[6]codet5-small: https://github.com/AlphaHinex/hf-models/releases/tag/codet5-small
[7]Salesforce/codet5-small: https://huggingface.co/Salesforce/codet5-small
[8]Summary of Solutions for Current Docker Hub Access Issues in China: https://zhuanlan.zhihu.com/p/642560164
[9]download.py: https://github.com/AlphaHinex/hf-models/blob/main/download.py
[10]Download files: https://huggingface.co/docs/huggingface_hub/en/guides/download
[11]line 12: https://github.com/AlphaHinex/hf-models/blob/main/.github/workflows/docker-image.yml#L12C35-L12C36
[12]Runner: https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners#cloud-hosts-used-by-github-hosted-runners
[13]Azure Standard_DS2_v2: https://docs.microsoft.com/azure/virtual-machines/dv2-dsv2-series#dsv2-series
[14]Maximize build disk space: https://github.com/marketplace/actions/maximize-build-disk-space
[15]StarCoder 15.5B: https://huggingface.co/bigcode/starcoder
[16]starcoder-01: https://github.com/AlphaHinex/hf-models/releases/tag/starcoder-01
[17]starcoder-02: https://github.com/AlphaHinex/hf-models/releases/tag/starcoder-02