A Method to Download Models from 🤗HuggingFace

If you cannot download models directly from HuggingFace[1], you can use the https://github.com/AlphaHinex/hf-models repository to build a Docker image with GitHub Actions[2]. During the build, huggingface_hub[3] downloads the required model files into the image, and the image is pushed to Docker Hub[4]. You can then obtain the model files by pulling the image.
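
To make this concrete, the download step executed during the image build is conceptually as simple as the sketch below (a simplified illustration, not necessarily identical to the repository's actual download.py): it pulls the model into huggingface_hub's default cache, which inside the image is /root/.cache/huggingface/hub, and that populated cache is what the image ships.

    from huggingface_hub import snapshot_download

    # Runs during `docker build`; the downloaded files land in the default
    # HuggingFace cache, i.e. /root/.cache/huggingface/hub inside the image.
    snapshot_download("Salesforce/codet5-small")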

1. Available Models (tags)

Currently available models are listed in the repository tags[5]; each repository tag corresponds to an image tag of the same name. For example:

The command to download the image corresponding to the codet5-small[6] tag is:

docker pull alphahinex/hf-models:codet5-small

The model included in the image is Salesforce/codet5-small[7].

2. How to Use

Download the image:

docker pull alphahinex/hf-models:codet5-small

If you have difficulty downloading the image directly from Docker Hub, you can refer to the Summary of Solutions for Current Docker Hub Access Issues in China[8] to configure a registry mirror. In testing, the Shanghai Jiao Tong University mirror https://docker.mirrors.sjtug.sjtu.edu.cn/ proved very fast.

Start the container:

docker run -d --name test --rm alphahinex/hf-models:codet5-small tail -f /dev/null

Check the model download path:

$ docker exec -ti test tree /root/.cache/huggingface/hub
/root/.cache/huggingface/hub
└── models--Salesforce--codet5-small
    ├── blobs
    │   ├── 056c085b0bf1966a4658710891af6de209b608be
    │   ├── 263a6f72aceb1716442638a3bcf20afe1eb0de9a
    │   ├── 319fd0bbb49414442ca8c66a675ebce7b3fec747
    │   ├── 38ed64670805e4a3ff4cfa6f764629324a4e3c1e
    │   ├── 51b0295e221a3e91142cfedb6f3d6f9b74291487
    │   ├── 6d34772f5ca361021038b404fb913ec8dc0b1a5a
    │   ├── 968fb0f45e1efc8cf3dd50012d1f82ad82098107cbadde2c0fdd8e61bac02908
    │   ├── 9e26dfeeb6e641a33dae4961196235bdb965b21b
    │   └── e830a2bc8cae841f929043d588e1edcffb28fe9a
    ├── refs
    │   └── main
    └── snapshots
        └── a642dc934e5475185369d09ac07091dfe72a31fc
            ├── README.md -> ../../blobs/51b0295e221a3e91142cfedb6f3d6f9b74291487
            ├── added_tokens.json -> ../../blobs/9e26dfeeb6e641a33dae4961196235bdb965b21b
            ├── config.json -> ../../blobs/056c085b0bf1966a4658710891af6de209b608be
            ├── merges.txt -> ../../blobs/319fd0bbb49414442ca8c66a675ebce7b3fec747
            ├── pytorch_model.bin -> ../../blobs/968fb0f45e1efc8cf3dd50012d1f82ad82098107cbadde2c0fdd8e61bac02908
            ├── special_tokens_map.json -> ../../blobs/e830a2bc8cae841f929043d588e1edcffb28fe9a
            ├── tokenizer_config.json -> ../../blobs/263a6f72aceb1716442638a3bcf20afe1eb0de9a
            └── vocab.json -> ../../blobs/38ed64670805e4a3ff4cfa6f764629324a4e3c1e

5 directories, 18 files
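
As an aside, the same cache layout (content-addressed blobs plus snapshot symlinks) can also be inspected programmatically. A minimal sketch, assuming huggingface_hub is available wherever the cache lives:

    from huggingface_hub import scan_cache_dir

    # List the cached repos and their size on disk
    cache_info = scan_cache_dir("/root/.cache/huggingface/hub")
    for repo in cache_info.repos:
        print(repo.repo_id, repo.size_on_disk, "bytes")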

Copy the model files out of the container:

docker cp -L test:/root/.cache/huggingface/hub/models--Salesforce--codet5-small/snapshots/a642dc934e5475185369d09ac07091dfe72a31fc/README.md .
docker cp -L test:/root/.cache/huggingface/hub/models--Salesforce--codet5-small/snapshots/a642dc934e5475185369d09ac07091dfe72a31fc/added_tokens.json .
docker cp -L test:/root/.cache/huggingface/hub/models--Salesforce--codet5-small/snapshots/a642dc934e5475185369d09ac07091dfe72a31fc/config.json .
docker cp -L test:/root/.cache/huggingface/hub/models--Salesforce--codet5-small/snapshots/a642dc934e5475185369d09ac07091dfe72a31fc/merges.txt .
docker cp -L test:/root/.cache/huggingface/hub/models--Salesforce--codet5-small/snapshots/a642dc934e5475185369d09ac07091dfe72a31fc/pytorch_model.bin .
docker cp -L test:/root/.cache/huggingface/hub/models--Salesforce--codet5-small/snapshots/a642dc934e5475185369d09ac07091dfe72a31fc/special_tokens_map.json .
docker cp -L test:/root/.cache/huggingface/hub/models--Salesforce--codet5-small/snapshots/a642dc934e5475185369d09ac07091dfe72a31fc/tokenizer_config.json .
docker cp -L test:/root/.cache/huggingface/hub/models--Salesforce--codet5-small/snapshots/a642dc934e5475185369d09ac07091dfe72a31fc/vocab.json .

Verify the SHA256 checksum of the model file (it matches the name of the blob file that the symlink points to):

$ shasum -a 256 pytorch_model.bin
968fb0f45e1efc8cf3dd50012d1f82ad82098107cbadde2c0fdd8e61bac02908  pytorch_model.bin

This matches the SHA256 checksum shown at https://huggingface.co/Salesforce/codet5-small/blob/main/pytorch_model.bin:

Git LFS Details
SHA256: 968fb0f45e1efc8cf3dd50012d1f82ad82098107cbadde2c0fdd8e61bac02908
Pointer size: 134 Bytes
Size of remote file: 242 MB

Delete the container:

$ docker rm -f test
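
With the files copied out, the model can be loaded directly from the local directory instead of from the Hub. A minimal sketch, assuming the transformers library is installed and the eight files above were copied into a local folder named ./codet5-small (the folder name is just an example):

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    # Load tokenizer and model from the local files copied out of the container
    tokenizer = AutoTokenizer.from_pretrained("./codet5-small")
    model = AutoModelForSeq2SeqLM.from_pretrained("./codet5-small")

    # Quick smoke test: let CodeT5 fill in the masked span
    text = "def greet(user): print(f'hello <extra_id_0>!')"
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=10)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))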

3. How to Create a New Model Image

  1. Modify download.py[9] to download a single file, an entire repository, or a subset of files selected by patterns; see the huggingface_hub Download files guide[10] and the examples below for detailed usage.
  2. Modify line 12 of docker-image.yml[11] to change the image tag portion of the IMAGE_NAME variable.

download.py Examples

  1. Download a single file

    from huggingface_hub import hf_hub_download
    hf_hub_download(repo_id="tiiuae/falcon-7b-instruct", filename="config.json")
    
  2. Download an entire path

    from huggingface_hub import snapshot_download
    snapshot_download("Salesforce/codegen25-7b-mono")
    
  3. Include certain files

    from huggingface_hub import snapshot_download
    snapshot_download("bigcode/starcoder", ignore_patterns=["pytorch_model-00004-of-00007.bin", "pytorch_model-00005-of-00007.bin", "pytorch_model-00006-of-00007.bin"])
    
  4. Exclude certain files

    from huggingface_hub import snapshot_download
    snapshot_download("bigcode/starcoder", allow_patterns=["pytorch_model-00004-of-00007.bin", "pytorch_model-00005-of-00007.bin", "pytorch_model-00006-of-00007.bin"])
    
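
Note that allow_patterns and ignore_patterns are huggingface_hub features and also accept glob-style patterns, so filtering by file type is straightforward. For example, to skip all weight files and keep only the small configuration and tokenizer files:

    from huggingface_hub import snapshot_download

    # Glob patterns: download everything except the weight files
    snapshot_download("bigcode/starcoder", ignore_patterns=["*.bin", "*.safetensors"])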

4. Constraints

Currently, the Runner[12] used by GitHub Actions runs on an Azure Standard_DS2_v2[13] virtual machine, with an 84GB OS disk mounted at / and 14GB of temporary storage mounted at /mnt. The free storage space actually available to a build task is only around 25-29GB.

In docker-image.yml, the Maximize build disk space[14] action is used to expand the free space on the root path to about 45GB. If the total size of the model files to be downloaded exceeds this limit, the files can be split across multiple images: for example, the StarCoder 15.5B[15] model files total over 60GB, so they are built into two images, starcoder-01[16] and starcoder-02[17], which together contain all of the files.
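
As an illustration of the splitting approach, the download scripts of the two images could partition the weight shards with complementary ignore_patterns/allow_patterns. The sketch below is only an assumption about how such a split might look (the exact shard file names are illustrative, not taken from the actual starcoder-01/starcoder-02 images):

    from huggingface_hub import snapshot_download

    # Hypothetical download.py for the first image: everything except the later shards
    snapshot_download(
        "bigcode/starcoder",
        ignore_patterns=[
            "pytorch_model-00005-of-00007.bin",
            "pytorch_model-00006-of-00007.bin",
            "pytorch_model-00007-of-00007.bin",
        ],
    )

    # Hypothetical download.py for the second image: only the remaining shards
    snapshot_download(
        "bigcode/starcoder",
        allow_patterns=[
            "pytorch_model-00005-of-00007.bin",
            "pytorch_model-00006-of-00007.bin",
            "pytorch_model-00007-of-00007.bin",
        ],
    )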

References

[1] HuggingFace: https://huggingface.co/
[2] GitHub Actions: https://github.com/features/actions
[3] huggingface_hub: https://github.com/huggingface/huggingface_hub
[4] Docker Hub: https://hub.docker.com/
[5] tags: https://github.com/AlphaHinex/hf-models/tags
[6] codet5-small: https://github.com/AlphaHinex/hf-models/releases/tag/codet5-small
[7] Salesforce/codet5-small: https://huggingface.co/Salesforce/codet5-small
[8] Summary of Solutions for Current Docker Hub Access Issues in China: https://zhuanlan.zhihu.com/p/642560164
[9] download.py: https://github.com/AlphaHinex/hf-models/blob/main/download.py
[10] Download files: https://huggingface.co/docs/huggingface_hub/en/guides/download
[11] line 12: https://github.com/AlphaHinex/hf-models/blob/main/.github/workflows/docker-image.yml#L12C35-L12C36
[12] Runner: https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners#cloud-hosts-used-by-github-hosted-runners
[13] Azure Standard_DS2_v2: https://docs.microsoft.com/azure/virtual-machines/dv2-dsv2-series#dsv2-series
[14] Maximize build disk space: https://github.com/marketplace/actions/maximize-build-disk-space
[15] StarCoder 15.5B: https://huggingface.co/bigcode/starcoder
[16] starcoder-01: https://github.com/AlphaHinex/hf-models/releases/tag/starcoder-01
[17] starcoder-02: https://github.com/AlphaHinex/hf-models/releases/tag/starcoder-02
