Detailed Steps for Deploying and Running Models Locally with Ollama

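Ollama is an open-source tool for running large language models (LLMs) locally, with support for quick deployment and API integration. It works with many models, such as the Llama family and Gemma, and exposes both a CLI and a REST API, which makes it convenient for development and testing. This post walks through deployment and use: system requirements, installation, and pulling and running models. It closes with examples of talking to a model from Python code. The steps follow the official documentation as of 2025.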

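1. System Requirements

Hardware: at least 8GB of RAM to run 7B-parameter models, 16GB for 13B models, and 32GB for 33B models. NVIDIA and AMD GPUs are supported (NVIDIA with CUDA 11.8+ recommended); CPU-only operation also works, at lower performance.
Operating system: Windows, Linux (Ubuntu and others), or macOS (12 Monterey or later).
Storage: an SSD is recommended; model files range from a few GB to tens of GB.
Other: make sure the firewall allows the local port (11434 by default). No internet connection is needed once a model has been downloaded; it then runs offline.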
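
The RAM guidance above can be turned into a small lookup for quick sanity checks before pulling a model. This is a minimal sketch: the function name `recommended_ram_gb` is made up for illustration, and rounding a size that falls between two tiers up to the next tier is an assumption, not something the official guidance states.

```python
from typing import Optional

# (largest model size in billions of parameters, recommended RAM in GB),
# taken from the hardware guidance above.
RAM_TIERS = [(7, 8), (13, 16), (33, 32)]


def recommended_ram_gb(params_billion: float) -> Optional[int]:
    """Return the RAM tier for a model size, or None above the largest tier.

    Sizes between two tiers are rounded up to the next tier (an assumption).
    """
    for max_params, ram_gb in RAM_TIERS:
        if params_billion <= max_params:
            return ram_gb
    return None  # larger than 33B: the guidance above gives no figure
```

For example, `recommended_ram_gb(13)` returns 16, while a 70B model returns `None` because the guidance stops at 33B.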

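2. Installation

Once installed, Ollama starts its service automatically and listens on http://localhost:11434 by default. The steps for each operating system follow.

Windows:

Download the installer from https://ollama.com/download/OllamaSetup.exe.
Double-click the .exe file to run the setup wizard and follow the prompts (by default it installs under your user account and does not require Administrator rights).
After installation, Ollama starts automatically. Open Command Prompt (CMD) or PowerShell and run ollama --version to verify the installation.
To start it manually: launch Ollama from the Start menu, or run ollama serve in a terminal.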

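Linux (e.g. Ubuntu):

Open a terminal and run: curl -fsSL https://ollama.com/install.sh | sh (downloads and installs automatically).
For a manual install, see https://github.com/ollama/ollama/blob/main/docs/linux.md: download the binary and add it to your PATH.
Verify: run ollama --version to check the version.
System service: Ollama runs as a systemd service; check it with systemctl status ollama.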

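macOS:

Download the DMG from https://ollama.com/download/Ollama.dmg.
Double-click the DMG and drag Ollama.app into the Applications folder.
Open a terminal and run /Applications/Ollama.app/Contents/MacOS/Ollama to start it (or launch it from Launchpad).
Verify: run ollama --version.

The installers bundle the components Ollama needs, including its llama.cpp-based inference engine. If you run into problems, check your proxy settings or search the GitHub issues.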
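
Besides ollama --version, it can help to confirm that the service is actually listening. The following is a minimal sketch using only the standard library; the helper name `is_ollama_running` is hypothetical, and it only checks that something answers HTTP 200 at the base URL, not that the responder is really Ollama.

```python
import urllib.error
import urllib.request


def is_ollama_running(base_url: str = "http://localhost:11434",
                      timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers 200 at base_url, else False."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx status
        return False
```

If this returns False right after installation, starting the app (or the systemd service on Linux) and retrying is a reasonable first step.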

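3. Pulling and Running Models

Pull a model: download a model from the Ollama library or from Hugging Face. Command: ollama pull <model_name>.

Example: ollama pull llama3.2 (pulls Llama 3.2 with the default latest tag).
Tags: for example, ollama pull llama3.2:1b selects the 1B-parameter version (Llama 3.2 is published in 1B and 3B sizes).
Model list: run ollama list to see installed models, or browse https://ollama.com/library.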
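
The name-and-tag convention described above (no tag means latest) can be made explicit in code, which is handy when comparing a requested model against the output of ollama list. This is an illustrative sketch; `split_model_ref` is a made-up helper, not part of any Ollama library.

```python
from typing import Tuple


def split_model_ref(ref: str) -> Tuple[str, str]:
    """Split 'name[:tag]' into (name, tag); the tag defaults to 'latest'."""
    name, sep, tag = ref.rpartition(":")
    # No colon at all, or the last colon belongs to a host:port prefix
    if not sep or "/" in tag:
        return ref, "latest"
    return name, tag
```

So `split_model_ref("llama3.2")` yields `("llama3.2", "latest")`, and a reference with an explicit tag yields that tag instead.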

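Run a model: ollama run <model_name>.

Example: ollama run llama3.2 opens interactive mode (a chat interface in the terminal).
Type a prompt such as "Why is the sky blue?" and the model responds. Type /bye to exit.
Options: add flags such as ollama run llama3.2 --verbose (prints timing statistics for each response).

Manage models:

List: ollama list.
Remove: ollama rm <model_name>.
Update: pull the same model again.
Customize: create a Modelfile (defining a system prompt and so on), then run ollama create mymodel -f Modelfile.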
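
To make the Modelfile step more concrete, here is a minimal sketch that renders one from Python using the FROM, PARAMETER, and SYSTEM instructions. The helper `render_modelfile` and the sample prompt are assumptions for illustration; the official Modelfile reference lists the full set of instructions.

```python
from typing import Optional


def render_modelfile(base_model: str, system_prompt: str,
                     temperature: Optional[float] = None) -> str:
    """Build Modelfile text with a base model, optional temperature, and system prompt."""
    if '"""' in system_prompt:
        raise ValueError("system prompt must not contain triple quotes")
    lines = [f"FROM {base_model}"]
    if temperature is not None:
        lines.append(f"PARAMETER temperature {temperature}")
    # Triple quotes allow a multi-line system prompt
    lines.append(f'SYSTEM """{system_prompt}"""')
    return "\n".join(lines) + "\n"
```

Writing the returned text to a file named Modelfile and running ollama create mymodel -f Modelfile should then register the customized model.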

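4. Running Models Through the API

Ollama provides a REST API for programmatic integration. The default port is 11434.

Generate a response (/api/generate):
Used for single-prompt generation.
Example (curl): curl http://localhost:11434/api/generate -d '{ "model": "llama3.2", "prompt": "Why is the sky blue?", "stream": false }'
- Response: a JSON object containing a "response" field (the generated text) and statistics such as "total_duration".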
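
The same /api/generate call can be made from Python with only the standard library. This is a sketch under the assumptions of the curl example above (default port, non-streaming); the function names `build_generate_request` and `generate` are made up here.

```python
import json
import urllib.request


def build_generate_request(model: str, prompt: str,
                           base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build the POST request for /api/generate without sending it."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the "response" field of the JSON reply."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)["response"]
```

Calling `generate("llama3.2", "Why is the sky blue?")` requires a running Ollama service with that model pulled; building the request separately keeps the payload easy to inspect.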
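Chat mode (/api/chat):
Supports multi-turn conversation; the client keeps context by sending the message history with each request.
Example (curl): curl http://localhost:11434/api/chat -d '{ "model": "llama3.2", "messages": [ { "role": "user", "content": "Why is the sky blue?" } ], "stream": false }'
- Response: JSON with the reply in the "message" field.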
Example (Python):

```python
import json
import re

import httpx


async def chat(prompt: str, index: int) -> str | None:
    """Send one user message to /api/chat and return the reply text.

    `index` is a caller-side label for the request; it is not used in the body.
    """
    payload = {
        "model": "qwen3:8b",  # or "gpt-oss:20b"
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        # Asks thinking-capable models to skip the reasoning phase
        # (needs a recent Ollama version)
        "think": False,
    }

    # Generous timeouts: local models can be slow to load and respond
    timeout = httpx.Timeout(60.0, read=120.0)
    try:
        async with httpx.AsyncClient(timeout=timeout) as client:
            # Plain POST, since streaming is disabled
            url = "http://localhost:11434/api/chat"
            response = await client.post(url, json=payload)
            print(f"Status code: {response.status_code}")

            try:
                # Parse the complete JSON response
                data = response.json()
            except json.JSONDecodeError:
                print(f"Could not parse response: {response.text}")
                return None

            # Extract the assistant's message content
            if "message" in data and "content" in data["message"]:
                content = data["message"]["content"]
                # Drop any <think>...</think> block the model still emitted
                content = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL)
                return content.strip()
            return f"Unexpected response format: {data}"
    except httpx.ReadTimeout:
        print("Request timed out. The server took too long to respond.")
        print("Try increasing the timeout value or check if the Ollama service is running properly.")
    except httpx.ConnectError:
        print("Could not connect to the Ollama service.")
        print("Make sure Ollama is running on http://localhost:11434")
    except Exception as e:
        print(f"An error occurred: {e}")
    return None
```

Streaming: set "stream": true and the response arrives in chunks, which suits real-time display.
Other parameters: "options" (temperature and so on) and "format" (JSON output).
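
With streaming enabled, /api/chat returns one JSON object per line, each carrying a fragment of the reply, and the last one is marked "done": true. The sketch below shows how such lines could be reassembled; `join_stream` is a hypothetical helper, and the sample lines in the usage note are hand-written rather than captured from a real server.

```python
import json
from typing import Iterable


def join_stream(lines: Iterable[str]) -> str:
    """Concatenate message fragments from newline-delimited JSON chunks."""
    parts = []
    for line in lines:
        if not line.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # the final chunk carries statistics, not more text
    return "".join(parts)
```

Feeding it two chunks whose contents are "Hel" and "lo", followed by a final chunk with "done": true, gives back "Hello". With httpx, the lines would typically come from `response.iter_lines()` inside a `client.stream("POST", url, json=payload)` block.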

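5. Talking to Models from Python

Ollama provides an official Python library (ollama-python) with synchronous, asynchronous, and streaming interfaces. Examples follow.

Install the library:

  pip install ollama

Make sure the Ollama service is running.

Example 1: simple text generation: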

  import ollama

  response = ollama.generate(model='llama3.2', prompt='Why is the sky blue?')
  print(response['response'])  # print the generated text

Example 2: synchronous chat:

  import ollama

  response = ollama.chat(model='llama3.2', messages=[
      {'role': 'user', 'content': 'Why is the sky blue?'},
  ])
  print(response['message']['content'])  # print the reply

Multi-turn: append more entries to messages, for example a system prompt {'role': 'system', 'content': 'You are a helpful assistant.'}.
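
Because each chat call only sees the messages it is given, a multi-turn conversation means keeping the history on the client and resending it. Here is one possible sketch; the `Conversation` class is an assumption for illustration, and the chat function is passed in so that `ollama.chat` (or a stand-in for testing) can be used.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


class Conversation:
    """Keep the message history and resend it on every turn."""

    def __init__(self, chat_fn: Callable[..., dict], model: str, system_prompt: str = ""):
        self.chat_fn = chat_fn  # e.g. ollama.chat
        self.model = model
        self.messages: List[Message] = []
        if system_prompt:
            self.messages.append({"role": "system", "content": system_prompt})

    def ask(self, text: str) -> str:
        """Append the user turn, call the model, and record its reply."""
        self.messages.append({"role": "user", "content": text})
        response = self.chat_fn(model=self.model, messages=self.messages)
        reply = response["message"]["content"]
        self.messages.append({"role": "assistant", "content": reply})
        return reply
```

With the official library installed, `Conversation(ollama.chat, 'llama3.2', 'You are a helpful assistant.')` would talk to a real model, and each call to `ask` would include all earlier turns.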

Example 3: streaming chat:

  import ollama

  stream = ollama.chat(
      model='llama3.2',
      messages=[{'role': 'user', 'content': 'Tell me a joke.'}],
      stream=True,
  )

  for chunk in stream:
      print(chunk['message']['content'], end='', flush=True)  # print in real time

Example 4: asynchronous chat (suited to concurrency):

  import asyncio
  from ollama import AsyncClient

  async def main():
      client = AsyncClient()
      response = await client.chat(model='llama3.2', messages=[
          {'role': 'user', 'content': 'What is Ollama?'},
      ])
      print(response['message']['content'])

  asyncio.run(main())
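
To show where the asynchronous client pays off, the sketch below runs several prompts concurrently while capping how many are in flight at once. The helper `ask_all` and the `max_concurrent` default are assumptions; the `ask` coroutine is passed in, so it can wrap `AsyncClient().chat` or a test double.

```python
import asyncio
from typing import Awaitable, Callable, List


async def ask_all(ask: Callable[[str], Awaitable[str]], prompts: List[str],
                  max_concurrent: int = 2) -> List[str]:
    """Run ask(prompt) for every prompt, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def limited(prompt: str) -> str:
        async with semaphore:
            return await ask(prompt)

    # gather returns results in the same order as the prompts
    return list(await asyncio.gather(*(limited(p) for p in prompts)))
```

In practice `ask` could be a small coroutine that awaits `client.chat(...)` and returns `response['message']['content']`. How much real parallelism results depends on the server configuration and the available hardware.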

Sun Chengxin's personal blog
