Streamlit Local Deployment Tutorial for DeepSeek-R1

Introduction

Happy Spring Festival, everyone! DeepSeek has recently become very popular, so today I will share a program that uses Streamlit to deploy the DeepSeek-R1-Distill-Qwen-7B model. By deploying it locally, you can easily use DeepSeek's conversational capabilities on your own machine.

Relationship with Qwen

DeepSeek-R1-Distill-Qwen-7B is an open-source reasoning model based on the Qwen-7B architecture, obtained by distilling knowledge from the DeepSeek-R1 model.

Here is a detailed introduction to the model and its relationship with Qwen.

Model Introduction

1. Architecture and Parameters: DeepSeek-R1-Distill-Qwen-7B is based on the Qwen-7B architecture and has 7 billion parameters. It inherits the architectural advantages of the Qwen series while gaining strong reasoning capabilities from the DeepSeek-R1 model through distillation.

2. Training and Distillation Process: The model fine-tunes Qwen-7B on roughly 800,000 reasoning samples generated by DeepSeek-R1. The distillation relies on Supervised Fine-Tuning (SFT) only, without an additional reinforcement learning phase, which keeps training efficient.

3. Performance: DeepSeek-R1-Distill-Qwen-7B performs very well on reasoning benchmarks, reaching a Pass@1 rate of 55.5% on AIME 2024 and surpassing other advanced open-source models such as QwQ-32B-Preview.

4. Application Scenarios: The model targets mathematics, code, and natural-language reasoning tasks, making it suitable for applications that need strong reasoning capabilities. Its small size allows deployment in resource-constrained environments, such as a personal computer with a moderate configuration.

5. Open Source and Deployment: DeepSeek-R1-Distill-Qwen-7B is open source and supports common quantization schemes (such as 4-bit and 8-bit) and inference engines (such as vLLM and Transformers). It can easily be deployed with tools like Ollama or vLLM (see the quantization sketch after this section).

Relationship with Qwen

1. Architecture: DeepSeek-R1-Distill-Qwen-7B is built on the Qwen-7B architecture and inherits the structure and part of the capabilities of the Qwen series models.

2. Performance Improvement: Through knowledge distillation, the model gains strong reasoning ability from DeepSeek-R1 and outperforms the original Qwen-7B on reasoning tasks.

3. Expanded Application Scenarios: While the Qwen series models are mainly used for general language tasks, DeepSeek-R1-Distill-Qwen-7B focuses on reasoning, mathematics, and code generation, broadening the application range of the Qwen models.

In summary, DeepSeek-R1-Distill-Qwen-7B is a reasoning model distilled from DeepSeek-R1 and built on the Qwen-7B architecture, combining strong reasoning capabilities with efficient deployment.
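
As mentioned in the deployment point above, the model supports 4-bit and 8-bit quantization. The following is a minimal sketch of loading the same checkpoint in 4-bit through the Transformers/bitsandbytes integration; it assumes the optional bitsandbytes package is installed (it is not in the dependency list at the end of this post) and reuses the local model path from the tutorial code.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_PATH = r"D:\AI\DeepSeek-R1-Distill-Qwen\DeepSeek-R1-Distill-Qwen-7B"

# 4-bit weight quantization with bf16 compute, to reduce VRAM usage
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    device_map="auto",
    quantization_config=quant_config,
    trust_remote_code=True,
)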

Implementation Code

import streamlit as st
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import time

# Configure model path
MODEL_PATH = r"D:\AI\DeepSeek-R1-Distill-Qwen\DeepSeek-R1-Distill-Qwen-7B"
MAX_INPUT_TOKENS = 2048

# Initialize model and tokenizer
@st.cache_resource
def load_model():
    tokenizer = AutoTokenizer.from_pretrained(
        MODEL_PATH,
        trust_remote_code=True
    )

    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        trust_remote_code=True
    )

    # Warm up the model
    with torch.no_grad():
        _ = model.generate(
            **tokenizer("Warm up loading", return_tensors="pt").to(model.device),
            max_new_tokens=1
        )

    return model, tokenizer

model, tokenizer = load_model()

# Independent conversation processing function
def generate_response(prompt):
    start_time = time.perf_counter()

    # Directly construct the current prompt
    system_prompt = "You are a professional assistant. Please answer questions in a concise and clear manner, avoiding repetitive content."
    full_prompt = f"<system>{system_prompt}</system>\n<user>{prompt}</user>\n<assistant>"

    # Input length check
    input_ids = tokenizer.encode(full_prompt)
    if len(input_ids) > MAX_INPUT_TOKENS:
        return "Input is too long, please shorten your question", 0

    inputs = tokenizer(full_prompt, return_tensors="pt").to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
        repetition_penalty=1.1  # Add repetition penalty
    )

    time_used = time.perf_counter() - start_time
    return tokenizer.decode(outputs[0], skip_special_tokens=True), time_used

# Interface layout
st.title("πŸ€– DeepSeek-R1 Conversation Assistant")
st.caption("Each conversation is processed independently | Real-time performance monitoring")

# Display conversation records (for display only, does not affect the model)
if "dialogs" not in st.session_state:
    st.session_state.dialogs = []

# Display the last 5 conversations
for dialog in st.session_state.dialogs[-5:]:
    with st.chat_message("user"):
        st.write(dialog["question"])
        st.caption(f"πŸ“… {dialog['time']}")

    with st.chat_message("assistant"):
        st.write(dialog["answer"])
        st.caption(f"⏱️ {dialog['time_used']:.2f}s | πŸ“ {dialog['tokens']} tokens")

# User input processing
if prompt := st.chat_input("Please enter your question"): 
    # Generate response
    with st.spinner("Generating response..."): 
        try:
            response, time_used = generate_response(prompt)
            clean_response = response.split("</user>")[-1]  # Keep only the text after the last user turn
            clean_response = clean_response.split("</assistant>")[0].strip()  # Cut at the assistant end tag, if present
            clean_response = clean_response.replace("<assistant>", "").strip()  # Remove the leading assistant tag
            token_count = len(tokenizer.encode(clean_response, add_special_tokens=False))
        except Exception as e:
            clean_response = f"Generation error: {str(e)}"
            time_used = 0
            token_count = 0

    # Record conversation (for display only)
    st.session_state.dialogs.append({
        "time": time.strftime("%Y-%m-%d %H:%M:%S"),
        "question": prompt,
        "answer": clean_response,
        "time_used": time_used,
        "tokens": token_count
    })

    # Refresh display immediately
    st.rerun()

# Sidebar control panel
with st.sidebar:
    st.header("Control Panel")

    # Performance statistics
    if st.session_state.dialogs:
        total_time = sum(d["time_used"] for d in st.session_state.dialogs)
        avg_time = total_time / len(st.session_state.dialogs)
        st.metric("Average Response Time", f"{avg_time:.2f} seconds")

        total_tokens = sum(d["tokens"] for d in st.session_state.dialogs)
        st.metric("Total Generated Tokens", total_tokens)

    # Clear records button
    if st.button("Clear Conversation Records"):
        st.session_state.dialogs = []
        st.rerun()

    # System information
    st.divider()
    st.markdown("""
    ### System Status
    - Conversation Mode: Independent Processing
    - Input Limit: 2048 tokens
    - Generation Length: 512 tokens
    """)
    with st.expander("See More"):
        st.json(st.session_state.dialogs)

# Bottom status bar
st.sidebar.markdown(f"""
<div style="position: fixed; bottom: 10px; color: #666;">
    <small>Model Version: DeepSeek-R1-Distill-Qwen-7B<br>
    Last Loaded: {time.strftime('%Y-%m-%d %H:%M')}</small>
</div>
""", unsafe_allow_html=True)

Implementation Principles

1. Model Loading and Optimization: The pre-trained DeepSeek-R1-Distill-Qwen-7B model is loaded with the Hugging Face Transformers library. device_map="auto" automatically allocates GPU/CPU resources, and the weights are loaded in bfloat16 mixed precision. The @st.cache_resource decorator caches the loaded model so Streamlit reruns do not reload it, and a one-token warm-up generation reduces the latency of the first real request.
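
If you want to confirm where device_map="auto" actually placed the weights and how much VRAM the bfloat16 model occupies, a small check like the following can be added after load_model(). This is a hypothetical debugging helper, not part of the app above.

import torch

def report_device_usage(model):
    # Print the device of the first parameter and current CUDA memory usage
    print("Model device:", next(model.parameters()).device)
    if torch.cuda.is_available():
        allocated_gb = torch.cuda.memory_allocated() / 1024**3
        total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
        print(f"VRAM allocated: {allocated_gb:.1f} GB of {total_gb:.1f} GB")

# Usage: report_device_usage(model)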

2. Input Processing and Generation Strategy:

The code constructs an independent conversation prompt of the form <system>{system_prompt}</system>\n<user>{prompt}</user>\n<assistant> and validates the input length (MAX_INPUT_TOKENS = 2048). Generation is configured with a maximum of 512 new tokens, temperature = 0.7 with top-p = 0.9 sampling, a repetition penalty of 1.1, and the EOS token used for padding.
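
As an alternative to the hand-written tags, the tokenizer can build the prompt itself if the local checkpoint ships a chat template in its tokenizer_config.json (the official DeepSeek-R1-Distill releases do). This is a sketch, not the tutorial's method; it reuses tokenizer, model, and prompt from the code above.

# Let the tokenizer format the conversation with the model's own chat template
messages = [
    {"role": "user", "content": prompt},  # instructions can be folded into the user turn
]
chat_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant-turn marker
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    chat_ids,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)
# Decode only the newly generated tokens, skipping the prompt
answer = tokenizer.decode(outputs[0][chat_ids.shape[-1]:], skip_special_tokens=True)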

3. Interface Interaction Design:

The web interface is built with Streamlit: chat bubbles show the last 5 conversations in real time, a spinner indicates that a response is being generated, and the sidebar hosts a performance-monitoring panel.
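
The app above renders each answer only after generation has finished. A common refinement is to stream tokens into the chat bubble as they are produced; the sketch below (not part of the tutorial code) uses Transformers' TextIteratorStreamer together with Streamlit's st.write_stream and reuses st, tokenizer, and model from above.

from threading import Thread
from transformers import TextIteratorStreamer

def stream_response(full_prompt):
    # Run generate() in a background thread and yield text pieces as they arrive
    inputs = tokenizer(full_prompt, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    Thread(
        target=model.generate,
        kwargs=dict(**inputs, max_new_tokens=512, do_sample=True,
                    temperature=0.7, top_p=0.9, streamer=streamer),
    ).start()
    yield from streamer

with st.chat_message("assistant"):
    answer = st.write_stream(stream_response(full_prompt))  # renders incrementally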

4. Session Management Mechanism:

Conversation records are kept in st.session_state for display only; each request is processed independently, with no context carried between turns. Generation errors are caught and shown as an error message, and the sidebar button clears the records.
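
If multi-turn context is wanted, one possible extension (a sketch, not the tutorial's design) is to fold the most recent exchanges from st.session_state.dialogs back into the prompt before generation. The helper name below is hypothetical, and the combined prompt still has to respect the 2048-token input limit.

def build_context_prompt(prompt, history, max_turns=3):
    # Include only the last few exchanges to stay under MAX_INPUT_TOKENS
    parts = []
    for d in history[-max_turns:]:
        parts.append(f"<user>{d['question']}</user>\n<assistant>{d['answer']}</assistant>")
    parts.append(f"<user>{prompt}</user>\n<assistant>")
    return "\n".join(parts)

# Usage: full_prompt = build_context_prompt(prompt, st.session_state.dialogs)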

5. Performance Monitoring System:

Per-response and average response times, generated-token counts, a live display of the system limits (2048 input / 512 output tokens), and the last load time are all shown in the interface.
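
A throughput figure can be derived from the two values already recorded per conversation. A small sketch that could be added to the sidebar:

# Tokens-per-second for the most recent answer, from the recorded fields
if st.session_state.dialogs:
    last = st.session_state.dialogs[-1]
    if last["time_used"] > 0:
        st.metric("Last Response Throughput",
                  f"{last['tokens'] / last['time_used']:.1f} tokens/s")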

6. Technical Highlights:

Inference speed is improved by the model warm-up and BF16 precision. The independent conversation mode keeps every prompt short, so memory usage stays predictable. Response time and token count are monitored together, and automatic device allocation lets the script run on both GPU and CPU machines.

Dependencies

accelerate==1.1.1
altair==5.5.0
annotated-types==0.7.0
anyio==4.6.2.post1
asgiref==3.8.1
attrs==24.2.0
bcrypt==4.2.0
blinker==1.9.0
cachetools==5.5.0
certifi==2024.8.30
cffi==1.17.1
charset-normalizer==3.4.0
click==8.1.7
colorama==0.4.6
contourpy==1.3.1
cryptography==43.0.1
cycler==0.12.1
diffusers==0.31.0
Django==5.1.2
djangorestframework==3.15.2
dnspython==2.7.0
ecdsa==0.19.0
email_validator==2.2.0
et-xmlfile==1.1.0
fastapi==0.115.2
fastapi-cli==0.0.5
filelock==3.16.1
fonttools==4.55.0
fsspec==2024.10.0
GDAL @ file:///D:/3D/GDAL-3.4.3-cp311-cp311-win_amd64.whl#sha256=f78861fb5115d5c2f8cf3c52a492ff548da9e1256dc84088947379f90e77e5b6
geopandas==1.0.1
gitdb==4.0.11
GitPython==3.1.43
greenlet==3.1.1
h11==0.14.0
httpcore==1.0.6
httptools==0.6.4
httpx==0.27.2
huggingface-hub==0.26.3
idna==3.10
importlib_metadata==8.5.0
itsdangerous==2.2.0
Jinja2==3.1.4
jsonschema==4.23.0
jsonschema-specifications==2024.10.1
kiwisolver==1.4.7
markdown-it-py==3.0.0
MarkupSafe==3.0.2
matplotlib==3.9.3
mdurl==0.1.2
MouseInfo==0.1.3
mpmath==1.3.0
narwhals==1.17.0
networkx==3.4.2
numpy==1.26.4
opencv-python==4.10.0.84
openpyxl==3.1.5
orjson==3.10.7
packaging==24.1
pandas==2.2.3
passlib==1.7.4
pillow==11.0.0
protobuf==5.29.1
psutil==6.1.0
psycopg2==2.9.10
pyarrow==18.1.0
pyasn1==0.6.1
PyAutoGUI==0.9.54
pycparser==2.22
pycryptodome==3.21.0
pydantic==2.9.2
pydantic-extra-types==2.9.0
pydantic-settings==2.6.0
pydantic_core==2.23.4
pydeck==0.9.1
PyGetWindow==0.0.9
Pygments==2.18.0
PyJWT==2.9.0
PyMsgBox==1.0.9
pyogrio==0.10.0
pyparsing==3.2.0
pyperclip==1.9.0
pyproj==3.7.0
PyRect==0.2.0
PyScreeze==1.0.1
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-jose==3.3.0
python-multipart==0.0.12
pytweening==1.2.0
pytz==2024.2
PyYAML==6.0.2
referencing==0.35.1
regex==2024.11.6
requests==2.32.3
rich==13.9.2
rpds-py==0.22.3
rsa==4.9
safetensors==0.4.5
sageattention==1.0.6
scipy==1.14.1
sentencepiece @ file:///D:/AI/Text2Video/sentencepiece-0.2.0-cp311-cp311-win_amd64.whl#sha256=0993dbc665f4113017892f1b87c3904a44d0640eda510abcacdfb07f74286d36
shapely==2.0.6
shellingham==1.5.4
simplekml==1.3.6
six==1.16.0
smmap==5.0.1
sniffio==1.3.1
SQLAlchemy==2.0.36
sqlparse==0.5.1
starlette==0.40.0
streamlit==1.41.0
streamlit-ace==0.1.1
sympy==1.13.3
tenacity==9.0.0
tokenizers==0.19.1
toml==0.10.2
torch @ file:///D:/AI/torch-3.11%E5%AE%89%E8%A3%85%E5%8C%85/torch-2.4.1%2Bcu121-cp311-cp311-win_amd64.whl#sha256=bc1e21d7412a2f06f552a9afb92c56c8b23d174884e9383259c3cf5db4687c98
torchao==0.1
torchvision @ file:///D:/AI/torch-3.11%E5%AE%89%E8%A3%85%E5%8C%85/torchvision-0.19.1%2Bcu121-cp311-cp311-win_amd64.whl#sha256=952dedd29ddd6010b7bda16a5e58e55eb051a46db941cc676bb880918c694ed8
tornado==6.4.2
tqdm==4.67.1
transformers==4.44.2
triton @ file:///D:/AI/Text2Video/%E8%85%BE%E8%AE%AFhunyuanvideo%E5%AE%89%E8%A3%85/hunyuanvideo/triton-3.1.0-cp311-cp311-win_amd64.whl#sha256=1628cd027ea4544e73e9827f294f79c6f7186af28debf8dc209f744a74fdbb10
typer==0.12.5
typing_extensions==4.12.2
tzdata==2024.2
ujson==5.10.0
urllib3==2.2.3
uvicorn==0.32.0
watchdog==6.0.0
watchfiles==0.24.0
websockets==13.1
zipp==3.21.0

Related Links:

1. DeepSeek GitHub Repository:

https://github.com/deepseek-ai/DeepSeek-R1

2. DeepSeek ModelScope Repository:

https://www.modelscope.cn/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
