HFL — Complete Architecture

Run HuggingFace Models Locally Like Ollama
v0.1.0 · Python >=3.10 · License: HRUL-1.0 · Author: Gabriel Galan Pelayo · Build: Hatchling · 1,900 tests | 90%+ coverage

1. Overview and Purpose

HFL (HuggingFace Local) is a CLI tool and API server for downloading, managing, and running AI models from HuggingFace Hub directly on the user's local machine. It is designed as a drop-in replacement for Ollama, but connected to the HuggingFace ecosystem and its more than 500,000 models.

🎯 Problem It Solves

Users need to run LLMs locally without depending on cloud APIs. Ollama solves this but with a limited catalog. HFL connects directly to HuggingFace Hub, offering access to the world's largest catalog of open-weight models, with download, automatic GGUF conversion, and local execution.

🏗️ Design Philosophy

Extreme modularity with lazy imports, dual API compatibility (OpenAI + Ollama), rigorous legal compliance (licenses, privacy, EU AI Act), and multi-backend support (llama.cpp, Transformers, vLLM) with automatic selection based on model format and available hardware.

2. Technology Stack

Core

Python >=3.10 — Base language
typer >=0.12 — CLI framework
rich >=13.0 — Styled output
pydantic >=2.10 — API data validation
pyyaml >=6.0 — Configuration parsing

API Server

fastapi >=0.115 — Async web framework
uvicorn >=0.32 — ASGI server
sse-starlette >=2.0 — Server-Sent Events
httpx >=0.28 — Async HTTP client

HuggingFace

huggingface-hub >=0.27 — HF Hub API

Optional Extras

llama-cpp-python — GGUF backend
transformers + torch — Native GPU backend
vllm — Production backend
gguf — Format conversion

3. Project File Structure

hfl/
├── pyproject.toml           — Package definition, deps, scripts, tools
├── hfl.spec                 — PyInstaller spec for standalone executable
├── LICENSE                  — HRUL v1.0 (custom license)
├── LICENSE-DEPENDENCIES.md  — Licenses of all dependencies
├── PRIVACY.md               — Privacy policy
├── NOTICE-EU-AI-ACT.md     — EU AI Act compliance
├── DISCLAIMER.md            — Liability disclaimer
├── README.md                — Main documentation
├── README.es.md             — Main documentation (Spanish)
├── LICENSE-FAQ.md            — License FAQ for HRUL
├── CONTRIBUTING.md           — Contributing guide
├── CODE_OF_CONDUCT.md        — Code of Conduct (Contributor Covenant 2.1)
├── CHANGELOG.md              — Changelog (Keep a Changelog format)
├── SECURITY.md               — Security policy
│
├── src/hfl/                 — MAIN SOURCE CODE
│   ├── __init__.py          — Package version (0.1.0)
│   ├── config.py            — HFLConfig dataclass — global configuration
│   ├── exceptions.py        — Complete exception hierarchy
│   ├── events.py            — EventBus for internal pub/sub
│   ├── metrics.py           — Performance metrics
│   ├── plugins.py           — Plugin system with entry_points
│   ├── security.py          — Security sanitization and validation
│   ├── validators.py        — Common data validators
│   ├── logging_config.py    — Centralized logging configuration
│   │
│   ├── core/                — SYSTEM CORE
│   │   ├── __init__.py
│   │   ├── container.py     — DI container with thread-safe Singleton
│   │   ├── observability_setup.py — Observability event listeners setup
│   │   └── tracing.py       — Request tracing with IDs
│   │
│   ├── cli/                 — COMMAND LINE INTERFACE
│   │   ├── __init__.py
│   │   ├── main.py          — 12 commands: pull, run, serve, list, search, rm, inspect, alias, login, logout, version, compliance-report
│   │   └── commands/        — Modularized commands
│   │       └── _utils.py    — Shared CLI utilities
│   │
│   ├── api/                 — REST API SERVER
│   │   ├── __init__.py
│   │   ├── server.py        — FastAPI app, CORS, disclaimer, lifespan
│   │   ├── state.py         — ServerState with async locks for LLM/TTS
│   │   ├── streaming.py     — SSE streaming helpers
│   │   ├── model_loader.py  — Dynamic model loading
│   │   ├── helpers.py       — ensure_llm_loaded, ensure_tts_loaded
│   │   ├── errors.py        — Centralized HTTP error handling
│   │   ├── middleware.py    — Privacy-safe logging
│   │   ├── rate_limit.py    — Rate limiting by IP/token
│   │   ├── routes_openai.py — /v1/chat/completions, /v1/completions, /v1/models
│   │   ├── routes_native.py — /api/generate, /api/chat, /api/tags (Ollama-compatible)
│   │   ├── routes_tts.py    — /v1/audio/speech, /api/tts (TTS endpoints)
│   │   ├── routes_health.py — /health, /ready, /live (healthchecks)
│   │   ├── routes_metrics.py — /metrics (Prometheus + JSON)
│   │   ├── exception_handlers.py — Global HFLError exception handling
│   │   └── timeout.py       — @with_timeout decorator + configurable timeout
│   │
│   ├── engine/              — INFERENCE ENGINES
│   │   ├── __init__.py
│   │   ├── base.py          — InferenceEngine + AudioEngine ABCs + dataclasses
│   │   ├── selector.py      — Automatic backend selection for LLM/TTS
│   │   ├── llama_cpp.py     — LlamaCppEngine (GGUF, CPU/GPU)
│   │   ├── transformers_engine.py — TransformersEngine (safetensors, GPU)
│   │   ├── vllm_engine.py   — VLLMEngine (production GPU)
│   │   ├── bark_engine.py   — BarkEngine (TTS via transformers)
│   │   ├── coqui_engine.py  — CoquiEngine (TTS XTTS-v2)
│   │   ├── async_wrapper.py — Sync→async wrapper for engines
│   │   ├── model_pool.py    — Model pool with LRU eviction
│   │   ├── dependency_check.py — Optional dependency verification
│   │   ├── failover.py      — FailoverEngine (multi-engine retry with sticky routing)
│   │   ├── memory.py        — Real-time RAM/GPU memory tracking
│   │   ├── observability.py — Engine performance metrics
│   │   └── prompt_builder.py — Prompt formats + delimiter escaping
│   │
│   ├── hub/                 — HUGGINGFACE INTEGRATION
│   │   ├── __init__.py
│   │   ├── auth.py          — Authentication and tokens
│   │   ├── client.py        — HTTP client for HF Hub
│   │   ├── downloader.py    — Download with resume and rate limiting
│   │   ├── license_checker.py — License classification (5 levels)
│   │   └── resolver.py      — Intelligent model resolution
│   │
│   ├── models/              — DATA MODELS
│   │   ├── __init__.py
│   │   ├── manifest.py      — ModelManifest — complete metadata
│   │   ├── provenance.py    — ConversionRecord + ProvenanceLog
│   │   ├── registry.py      — ModelRegistry — local JSON inventory
│   │   └── backends/        — Storage backends (File + SQLite)
│   │
│   ├── converter/           — FORMAT CONVERSION
│   │   ├── __init__.py
│   │   ├── formats.py       — ModelFormat + ModelType enums
│   │   └── gguf_converter.py — GGUFConverter — conversion + quantization
│   │
│   ├── utils/               — CROSS-CUTTING UTILITIES
│   │   ├── __init__.py
│   │   ├── circuit_breaker.py — Circuit breaker for resilience
│   │   └── retry.py         — Retry with exponential backoff
│   │
│   └── i18n/                — INTERNATIONALIZATION
│       ├── __init__.py      — t(), get_language(), set_language()
│       └── locales/         — Translation files
│           ├── en.json      — English translations
│           └── es.json      — Spanish translations
│
├── docs/                    — DOCUMENTATION
│   ├── adr/                 — Architecture Decision Records
│   │   ├── 0001-singleton-pattern.md
│   │   ├── 0002-async-api-sync-engines.md
│   │   ├── 0003-gguf-default-format.md
│   │   ├── 0004-ollama-compatibility.md
│   │   ├── 0005-license-classification.md
│   │   └── 0006-rate-limiting-strategy.md
│   └── *.html               — Architecture documentation
│
├── tests/                   — TEST SUITE (80+ files, 90%+ coverage)
│   ├── conftest.py          — Shared fixtures
│   ├── test_api*.py         — API tests (5 files)
│   ├── test_cli*.py         — CLI tests (4 files)
│   ├── test_engine*.py      — Engine tests (8 files)
│   ├── test_hub*.py         — HF Hub tests (5 files)
│   ├── test_i18n*.py        — i18n tests (4 files)
│   └── test_*.py            — Unit and integration tests
│
└── .github/                 — GITHUB INFRASTRUCTURE
    ├── workflows/
    │   ├── ci.yml           — CI: lint + test + type-check (Python 3.10/3.11/3.12)
    │   ├── pages.yml        — Deploy docs to GitHub Pages
    │   ├── build-executables.yml — Cross-platform executable builds
    │   ├── lint.yml         — Linting with ruff
    │   ├── test.yml         — Tests with pytest
    │   ├── security.yml     — Security audit
    │   └── license-check.yml — License verification
    ├── ISSUE_TEMPLATE/
    │   ├── bug_report.yml   — Bug report template
    │   └── feature_request.yml — Feature request template
    └── PULL_REQUEST_TEMPLATE.md — PR template with compliance checklist

4. General Architecture Diagram

High-Level Architecture — HFL v0.1.0
Consumers: terminal (interactive CLI) · OpenAI SDK / httpx / curl · Ollama apps (Open WebUI, etc.)

CLI (typer + rich) — cli/main.py, entry point hfl.cli.main:app. Commands: pull, run, serve, list, search, rm, inspect, alias, login, logout, version.

API Server (FastAPI + Uvicorn) — OpenAI-compatible and Ollama-compatible routes. Middleware: CORS + Disclaimer + Privacy Logger.

Hub (HuggingFace integration) — resolver.py, downloader.py, license_checker.py, auth.py. Connects through the huggingface_hub API.

Engine (inference engines) — LlamaCpp (GGUF, CPU/GPU), Transformers (safetensors, GPU), vLLM (production GPU). selector.py auto-selects by format + hardware; all implement the InferenceEngine ABC (base.py).

Models (data and registry) — manifest.py, registry.py, provenance.py. Persistence: ~/.hfl/models.json and ~/.hfl/provenance.json.

Converter (format conversion) — formats.py, gguf_converter.py: safetensors → FP16 GGUF → quantized GGUF.

Config + exceptions (core) — config.py (HFLConfig), exceptions.py (15 types). Home: ~/.hfl | Port: 11434 (Ollama compat).

HuggingFace Hub (external) — huggingface.co API, 500,000+ models, gated models + licenses.

Local file system (~/.hfl/) — models/ (model files: GGUF, safetensors), models.json (local model registry), provenance.json (conversion log), cache/ + tools/llama.cpp (HF cache + compiled tools).

5. Module config — Central Configuration

File: src/hfl/config.py

Contains the HFLConfig class (dataclass) that defines all global application configuration. It is instantiated once as a singleton (config = HFLConfig()) when importing the module, and ensure_dirs() is called to create the directory structure.

home_dir (Path, default ~/.hfl, env HFL_HOME) — Root directory
models_dir (Path property, ~/.hfl/models) — Storage for downloaded models
cache_dir (Path property, ~/.hfl/cache) — Temporary HuggingFace cache
registry_path (Path property, ~/.hfl/models.json) — Model registry (local inventory)
llama_cpp_dir (Path property, ~/.hfl/tools/llama.cpp) — Compiled conversion tools
host (str, default 127.0.0.1, env HFL_HOST) — API server address
port (int, default 11434, env HFL_PORT) — Port (same as Ollama for compatibility)
default_ctx_size (int, default 4096) — Default context tokens
default_n_gpu_layers (int, default -1) — GPU layers (-1 = all)
rate_limit_enabled (bool, default true, env HFL_RATE_LIMIT_ENABLED) — Enable/disable rate limiting
rate_limit_requests (int, default 60, env HFL_RATE_LIMIT_REQUESTS) — Max requests per window
rate_limit_window (int, default 60, env HFL_RATE_LIMIT_WINDOW) — Rate limit window (seconds)
hf_token (str|None, env HF_TOKEN) — Authentication token (memory only, never persisted)
SLOConfig: Service Level Objectives configuration for monitoring availability targets and latency percentiles (P50, P95, P99). Used by the health and metrics endpoints to report SLI compliance.
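
The shape described above can be sketched as a dataclass whose path attributes derive from home_dir and whose defaults can be overridden by environment variables. This is an illustrative sketch, not the actual source of config.py — field names come from the table, everything else is assumed:

```python
import os
from dataclasses import dataclass, field
from pathlib import Path


def _env(name: str, default: str) -> str:
    return os.environ.get(name, default)


@dataclass
class HFLConfig:
    """Illustrative sketch of the configuration described above."""
    home_dir: Path = field(default_factory=lambda: Path(_env("HFL_HOME", str(Path.home() / ".hfl"))))
    host: str = field(default_factory=lambda: _env("HFL_HOST", "127.0.0.1"))
    port: int = field(default_factory=lambda: int(_env("HFL_PORT", "11434")))
    default_ctx_size: int = 4096
    default_n_gpu_layers: int = -1  # -1 = offload all layers to GPU

    @property
    def models_dir(self) -> Path:
        return self.home_dir / "models"

    @property
    def registry_path(self) -> Path:
        return self.home_dir / "models.json"

    def ensure_dirs(self) -> None:
        # Create the directory structure on first import, as described above.
        for d in (self.models_dir, self.home_dir / "cache"):
            d.mkdir(parents=True, exist_ok=True)
```

Deriving paths as properties keeps HFL_HOME the single override point: moving the root moves every derived path with it.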

6. Module cli — Command Line Interface

File: src/hfl/cli/main.py (~870 lines)

Framework: Typer + Rich. Entry point registered in pyproject.toml as hfl = "hfl.cli.main:app"

hfl pull <model> — Download a model from HF Hub. Options: --quantize Q4_K_M, --format auto|gguf|safetensors, --alias, --skip-license
hfl run <model> — Interactive terminal chat. Options: --backend auto|llama-cpp|transformers|vllm, --ctx, --system, --verbose
hfl serve — REST API server. Options: --host, --port, --model (pre-load), --api-key (authentication)
hfl list — List local models in a Rich table: name, alias, format, quantization, license (risk-colored), size
hfl search <query> — Interactive paginated search on HF Hub. Options: --gguf, --max-params, --min-params, --sort, --page-size
hfl rm <model> — Remove a model with confirmation. Deletes files + registry entry
hfl inspect <model> — Full detail (Rich panel): metadata, license, restrictions, timestamps
hfl alias <model> <alias> — Assign a short alias, allowing models to be referred to by simple names
hfl login — Configure HF token. --token or interactive; verifies with whoami()
hfl logout — Remove the saved token. Uses huggingface_hub.logout()
hfl version — Show version + license
hfl compliance-report — Legal compliance report (JSON/Markdown). Output format selection

Signal Handling: The run command handles Ctrl+C during token streaming gracefully, preserving the partial response.

CLI Helper Functions

_format_size() converts bytes to readable format. _get_key() reads a key without Enter (raw terminal). _extract_params_from_name() extracts parameters from name (regex: "70b", "7b", "1.5b"). _estimate_model_size() estimates disk size based on parameters and quantization. _display_model_row() renders a search result row. _get_params_value() extracts numeric value for filtering.

7. Module hub — HuggingFace Integration

resolver.py — Intelligent Resolution

Class ResolvedModel (dataclass) with: repo_id, revision, filename, format, quantization.

The resolve() function supports three input formats:

1. org/model → direct HF repo

2. org/model:Q4_K_M → repo with Ollama-style quantization

3. model-name → name search (top 5 by downloads)

After resolution, resolve() detects whether the repo contains GGUF files (preferred, selected by _select_gguf() with priority Q4_K_M > Q5_K_M > Q4_K_S), safetensors, or pytorch weights.

downloader.py — Download with Progress

The main function is pull_model(resolved). For GGUF it downloads the individual file with hf_hub_download(); for safetensors it downloads a complete snapshot with snapshot_download(), filtering on *.safetensors, config.json, tokenizer*.json, and tokenizer.model.

It implements rate limiting (0.5 s between API calls) and sends an identifying User-Agent (hfl/0.1.0) to comply with the HuggingFace ToS.

auth.py — Authentication

get_hf_token() obtains token with priority: 1) HF_TOKEN env var, 2) token saved by huggingface_hub.

ensure_auth(repo_id) verifies repo access. If access fails and no token is present, it prompts for one interactively. It respects HF's gating system: it does NOT bypass license acceptance for gated models.

license_checker.py — License Classification

Enum LicenseRisk: PERMISSIVE, CONDITIONAL, NON_COMMERCIAL, RESTRICTED, UNKNOWN.

Dictionary LICENSE_CLASSIFICATION with ~20 known licenses. Dictionary LICENSE_RESTRICTIONS with specific restrictions per family (Llama: 700M MAU, attribution, etc.).

check_model_license() queries HF API, classifies risk, and returns LicenseInfo. require_user_acceptance() presents a Rich panel with the license and requires explicit confirmation for non-permissive licenses.

8. Module models — Data and Registry

ModelManifest (manifest.py)

Dataclass that stores complete metadata for each downloaded model. It is the fundamental unit of information in the system.

name, repo_id (str) — Identification (short name + HF repo)
alias (str|None) — User-defined custom name
local_path, format (str) — Location and type (gguf/safetensors/pytorch)
size_bytes (int), quantization (str) — Size on disk + Q level
architecture (str), parameters (str), context_length (int) — Model characteristics
license, license_name, license_url (str) — Legal information (R1)
license_restrictions (list), gated (bool), license_accepted_at (str) — Restrictions and acceptance
gpai_classification, training_flops (str) — EU AI Act (R4)
created_at, last_used (str) — Timestamps

ModelRegistry (registry.py)

Manages local inventory. Persists to ~/.hfl/models.json as JSON array. Operations: add() (avoids duplicates), get() (searches by name, alias, or repo_id), set_alias(), list_all() (sorted by date), remove().

ProvenanceLog (provenance.py)

Immutable conversion log in ~/.hfl/provenance.json. Each ConversionRecord documents: source (repo, format, revision), destination (format, path, quantization), tool used (llama.cpp + version), original license, and timestamps. Serves for legal traceability and compliance audit (R3).

9. Module engine — Inference Engines

Class Hierarchy — Engine
«ABC» InferenceEngine — load(path, **kwargs), unload(), generate() / generate_stream(), chat() / chat_stream(), model_name, is_loaded
├── LlamaCppEngine — GGUF | CPU + Metal + CUDA + Vulkan. Flash Attention, auto chat format.
├── TransformersEngine — safetensors | GPU CUDA. BitsAndBytes 4-bit/8-bit, TextIteratorStreamer.
└── VLLMEngine — production GPU | NVIDIA CUDA. PagedAttention, continuous batching.

selector.py: GGUF → LlamaCpp | CUDA + safetensors → Transformers | fallback → LlamaCpp
Dataclasses: ChatMessage(role, content), GenerationConfig(temp, top_p, ...), GenerationResult(text, tokens, ...)
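
The contract can be sketched as follows. Method and dataclass names come from the hierarchy above; bodies, defaults, and exact signatures are assumptions:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterator


@dataclass
class ChatMessage:
    role: str
    content: str


@dataclass
class GenerationConfig:
    temperature: float = 0.7   # illustrative defaults
    top_p: float = 0.95
    max_tokens: int = 512


class InferenceEngine(ABC):
    """Sketch of the base.py contract: every backend implements these."""

    @abstractmethod
    def load(self, path: str, **kwargs) -> None: ...

    @abstractmethod
    def unload(self) -> None: ...

    @abstractmethod
    def chat_stream(self, messages: list[ChatMessage],
                    config: GenerationConfig) -> Iterator[str]: ...

    @property
    @abstractmethod
    def is_loaded(self) -> bool: ...
```

Because each backend is interchangeable behind this ABC, the selector and the API server never need to know which engine is active.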

LlamaCppEngine

Main backend. Uses llama-cpp-python. Parameters: n_ctx, n_gpu_layers (-1=all), n_threads (0=auto), flash_attn, chat_format (auto-detect). Includes stderr suppression to silence Metal/CUDA logs when verbose=False. Generates results with metrics: tokens/s, prompt tokens, stop reason.

TransformersEngine

Uses models in native format with GPU. Dynamic quantization support: 4bit (NF4 double quant via BitsAndBytes) and 8bit. Streaming via TextIteratorStreamer in separate thread. Uses tokenizer's apply_chat_template() or Llama-style fallback.

VLLMEngine (experimental)

Production GPU backend. Wraps vllm.LLM with SamplingParams. Real streaming with AsyncLLMEngine, with a synchronous fallback for compatibility. Requires an NVIDIA GPU with CUDA.

FailoverEngine

Multi-backend engine with sticky routing — automatically retries with the next available engine if one fails. Provides high availability across multiple inference backends.

Model Pool & Memory

Model pool with non-recursive waiting (bounded polling), preventing stack overflow with concurrent loads. Real-time RAM and GPU memory tracking via psutil and GPUtil.

Automatic Selector (selector.py)

Decision logic in select_engine(model_path, backend):

1. GGUF model? → LlamaCppEngine
2. CUDA GPU available? → TransformersEngine (on ImportError, fall back to LlamaCppEngine)
3. No GPU? → LlamaCppEngine (will need prior conversion)

All imports are lazy (_get_llama_cpp_engine(), etc.) to avoid requiring all dependencies to be installed.
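
The decision logic above can be sketched as follows. This is an assumed simplification returning backend names rather than engine instances; the real select_engine() performs the lazy engine imports itself:

```python
from pathlib import Path


def select_engine_name(model_path: str, backend: str = "auto") -> str:
    """Sketch of selector.py's decision logic described above."""
    if backend != "auto":
        return backend  # explicit user choice wins
    if Path(model_path).suffix == ".gguf":
        return "llama-cpp"
    try:
        import torch  # lazy: only imported for the GPU check
        if torch.cuda.is_available():
            return "transformers"
    except ImportError:
        pass  # torch not installed → fall through
    return "llama-cpp"  # fallback (may require prior conversion)
```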

10. Module converter — Format Conversion

formats.py — Format Detection

Enum ModelFormat: GGUF, SAFETENSORS, PYTORCH, UNKNOWN. The detect_format(path) function inspects extensions (.gguf, .safetensors, .pt/.pth/.bin) both in individual files and directories (rglob). find_model_file() locates the main model file.
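
A sketch of this extension-based detection (the directory branch is a simplified version of the rglob scan described above):

```python
from enum import Enum
from pathlib import Path


class ModelFormat(Enum):
    GGUF = "gguf"
    SAFETENSORS = "safetensors"
    PYTORCH = "pytorch"
    UNKNOWN = "unknown"


# Extension table from the text above.
_EXT_MAP = {".gguf": ModelFormat.GGUF, ".safetensors": ModelFormat.SAFETENSORS,
            ".pt": ModelFormat.PYTORCH, ".pth": ModelFormat.PYTORCH,
            ".bin": ModelFormat.PYTORCH}


def detect_format(path: Path) -> ModelFormat:
    """Detect a model's format from file extensions, for files or directories."""
    if path.is_dir():
        for ext, fmt in _EXT_MAP.items():
            if next(path.rglob(f"*{ext}"), None) is not None:
                return fmt
        return ModelFormat.UNKNOWN
    return _EXT_MAP.get(path.suffix, ModelFormat.UNKNOWN)
```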

gguf_converter.py — GGUF Conversion

Class GGUFConverter with two-step pipeline:

safetensors/pytorch → Step 1: convert_hf_to_gguf.py (FP16 GGUF) → Step 2: llama-quantize (e.g. Q4_K_M) → final GGUF

ensure_tools() auto-installs llama.cpp if not present: git clone → cmake build → pip install requirements. check_model_convertibility() validates that the model is convertible (rejects LoRA adapters, image models, models without config.json).

Q2_K — ~2.5 bits/weight, ~80% quality — extreme compression
Q3_K_M — ~3.5 bits/weight, ~87% quality — low RAM
Q4_K_M — ~4.5 bits/weight, ~92% quality — DEFAULT, best balance
Q5_K_M — ~5.0 bits/weight, ~96% quality — high quality
Q6_K — ~6.5 bits/weight, ~97% quality — premium
Q8_0 — ~8.0 bits/weight, ~98%+ quality — maximum quantized quality
F16 — 16.0 bits/weight, 100% quality — no quantization
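
The bits/weight column makes disk-size estimates straightforward: size ≈ parameters × bits/weight ÷ 8. A hypothetical helper in the spirit of the _estimate_model_size() mentioned in section 6:

```python
# Bits per weight, from the quantization table above (approximate values).
BITS_PER_WEIGHT = {"Q2_K": 2.5, "Q3_K_M": 3.5, "Q4_K_M": 4.5,
                   "Q5_K_M": 5.0, "Q6_K": 6.5, "Q8_0": 8.0, "F16": 16.0}


def estimate_size_gb(params_billions: float, quant: str = "Q4_K_M") -> float:
    """Approximate file size in GiB: parameters x bits/weight / 8 bytes."""
    size_bytes = params_billions * 1e9 * BITS_PER_WEIGHT[quant] / 8
    return round(size_bytes / 1024**3, 1)
```

So a 7B model at Q4_K_M lands around 3.7 GiB, versus roughly 13 GiB unquantized at F16.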

11. Module api — REST Server

Endpoints and API Compatibility
FastAPI Server (server.py) — middleware chain: RequestLogger → APIKey → RateLimit → Disclaimer → CORS

OpenAI-compatible (routes_openai.py):
POST /v1/chat/completions — chat with SSE streaming (text/event-stream)
POST /v1/completions — text completion with streaming
GET /v1/models — list available models
Schemas: ChatCompletionRequest, CompletionRequest. Helper: _ensure_model_loaded() → auto-load.
Streaming: _stream_chat(), _stream_completion(). Format: SSE data: {...}\n\n terminated by data: [DONE]

Ollama-compatible (routes_native.py):
POST /api/generate — text generation with NDJSON streaming
POST /api/chat — multi-turn chat with NDJSON streaming
GET /api/tags, GET /api/version, HEAD /
Schemas: GenerateRequest, ChatRequest. Helper: _options_to_config() for parameter mapping.
Streaming: application/x-ndjson. Compatible with Open WebUI, Chatbox, etc.
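
The SSE framing described above can be sketched as an async generator. The chunk fields follow the OpenAI chat.completion.chunk shape; the wiring around the project's _stream_chat() is assumed:

```python
import json
from typing import AsyncIterator


async def stream_chat(token_iter: AsyncIterator[str], model: str) -> AsyncIterator[str]:
    """Wrap an engine's token stream in OpenAI-style SSE frames."""
    async for token in token_iter:
        chunk = {"object": "chat.completion.chunk", "model": model,
                 "choices": [{"delta": {"content": token}, "index": 0}]}
        yield f"data: {json.dumps(chunk)}\n\n"
    yield "data: [DONE]\n\n"  # sentinel expected by OpenAI SDK clients
```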

ServerState

Class with engine (active InferenceEngine), current_model (loaded ModelManifest) and api_key (str|None for authentication). Instantiated as global singleton. Server lifespan handles cleanup on close.

Middleware Stack

Execution order: RequestLogger → APIKey → RateLimit (conditional) → Disclaimer → CORS. APIKey runs BEFORE RateLimit, so unauthenticated requests are rejected without consuming rate limit tokens.

RequestLogger — Privacy-safe logging. NEVER logs: bodies (prompts/outputs), auth headers, User-Agent. Only: method, path, status, duration.

APIKeyMiddleware — Optional authentication via --api-key flag. Supports Authorization: Bearer <key> and X-API-Key: <key> headers. Public endpoints (/health, /) exempt.

RateLimitMiddleware — IP-based rate limiting, configurable via env vars. Supports per-model rate limiting.

DisclaimerMiddleware — Adds X-AI-Disclaimer header to AI generation endpoint responses (R9).

CORSMiddleware — Allows all origins, methods, and headers (local development).
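
The rate-limiting step above can be sketched as a sliding-window limiter keyed by client IP. This is a hypothetical stand-in for RateLimitMiddleware's internals; the real middleware reads its limits from the HFL_RATE_LIMIT_* env vars:

```python
import time
from collections import defaultdict, deque


class RateLimiter:
    """Sliding-window limiter: at most max_requests per window_s per client."""

    def __init__(self, max_requests: int = 60, window_s: float = 60.0):
        self.max_requests = max_requests
        self.window_s = window_s
        self._hits: dict[str, deque] = defaultdict(deque)

    def allow(self, client_ip: str) -> bool:
        now = time.monotonic()
        hits = self._hits[client_ip]
        while hits and now - hits[0] > self.window_s:  # drop expired entries
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # middleware would answer 429 here
        hits.append(now)
        return True
```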

Exception Handlers

Centralized exception handling via register_exception_handlers(app). Maps the entire HFLError hierarchy to HTTP responses with appropriate status codes (400 for validation, 429 for rate limit, 500 for internal errors).

OpenAPI Documentation

All endpoints include tags, summary, and responses in their decorators for auto-generated OpenAPI documentation. Tags: OpenAI, Ollama, TTS, Health, Metrics.

Structured Logging

All logging uses %-style format strings (logger.info('Model loaded: %s', name)) instead of f-strings, avoiding unnecessary evaluation when the log level is disabled.

Health & Metrics Endpoints

/health/deep?probe=true — Runs a minimal inference test to verify the model works; reports "degraded" status on failure
/health/sli — Service Level Indicators with availability and latency metrics
/metrics — Prometheus-format metrics
/metrics/json — JSON-format metrics

12. Module i18n — Internationalization

File: src/hfl/i18n/__init__.py

Complete internationalization system that allows the CLI to display all messages in multiple languages. Uses JSON translation files with nested keys and dot-notation access.

Main Functions

t(key, **kwargs) — Translates a key (e.g., t("commands.pull.downloading")). Supports interpolation with .format(**kwargs); cached with lru_cache for performance
get_language() — Returns the current language. Reads the HFL_LANG env var, defaults to "en"
set_language(lang) — Changes the language at runtime and clears the translation cache
_load_translations(lang) — Loads the language JSON file from locales/
_get_nested_value(data, key) — Navigates nested dictionaries with dot notation

Translation Files

Each language has a JSON file (~193 lines) in src/hfl/i18n/locales/:

en.json (English) and es.json (Spanish). Keys follow the structure module.action.message, for example: commands.pull.downloading, commands.search.no_results, errors.model_not_found.

Usage: Set language with export HFL_LANG=es. All CLI commands use t() for their messages, allowing language changes without modifying code.
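
The dot-notation lookup behind t() can be sketched as pure functions (the real t() reads module-level translations and is lru_cache'd; here the translations dict is passed explicitly for clarity):

```python
def get_nested_value(data: dict, key: str):
    """Resolve 'commands.pull.downloading' against nested dicts."""
    node = data
    for part in key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node if isinstance(node, str) else None


def t(translations: dict, key: str, **kwargs) -> str:
    """Translate a key; fall back to the key itself when no entry exists."""
    text = get_nested_value(translations, key) or key
    return text.format(**kwargs)
```

Falling back to the raw key means a missing translation degrades to a readable (if ugly) message instead of crashing the CLI.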

13. Exception Hierarchy

Custom Exception Tree
HFLError
├── ModelNotFoundError
├── ModelAlreadyExistsError
├── DownloadError
│   └── NetworkError
├── ConversionError
│   └── ToolNotFoundError
├── LicenseError
│   ├── LicenseNotAcceptedError
│   └── GatedModelError
├── EngineError
│   ├── ModelNotLoadedError
│   ├── MissingDependencyError
│   └── OutOfMemoryError
├── AuthenticationError
│   ├── InvalidTokenError
│   └── TokenRequiredError
└── ConfigurationError
    └── InvalidConfigError

All inherit from HFLError(message, details) with __str__ combining both. Each exception includes specific contextual data (model_name, repo_id, required_gb, etc.) for detailed diagnostics.

14. Complete Flow: hfl pull

Model Download and Registration Pipeline
1. Resolve — resolver.py (HfApi.model_info())
2. License — license_checker.py (check + accept)
3. Download — downloader.py (hf_hub_download / snapshot_download)
4. Detect format — formats.py
5. Convert? — gguf_converter.py (only if not GGUF: safetensors → FP16 → Qx)
6. Create manifest — manifest.py
7. Register — registry.add() → models.json

15. Complete Flow: hfl run

Interactive Chat Pipeline
1. Registry.get() — search by name/alias
2. select_engine() — auto: format + hardware
3. engine.load() — load into RAM/VRAM
4. Chat loop — input → messages.append() → chat_stream() → print tokens (loop until /exit)
5. /exit — engine.unload()

16. Complete Flow: hfl serve

API Server Pipeline
Client (HTTP/SSE)
→ Middleware: CORS, APIKey, Disclaimer (R9)
→ Router: routes_openai.py, routes_native.py, /health, /
→ _ensure_model_loaded: Registry → select_engine → engine.load() (if needed)
→ Engine: chat() / generate() / chat_stream()
→ Response: JSON / SSE

Auto-load: if a different model is requested, the current one is unloaded and the new one loaded.

18. Design Patterns

Strategy Pattern (Engine)

The InferenceEngine interface (ABC) defines the contract. Three concrete implementations (LlamaCpp, Transformers, vLLM) are interchangeable. selector.py acts as factory choosing the correct strategy based on context.

Lazy Loading / Lazy Imports

All heavy dependencies (torch, transformers, vllm, llama-cpp-python) are imported only when needed. Imports inside functions prevent minimal installations from failing due to absent optional dependencies.

Dependency Injection Container

core/container.py implements a DI container with thread-safe singletons (double-checked locking). Provides get_config(), get_registry(), get_state(), get_event_bus(), get_metrics(). Facilitates testing with reset_container().
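
The double-checked locking described above can be sketched generically (the real container exposes typed accessors like get_config(); this generic helper is an assumption):

```python
import threading

_lock = threading.Lock()
_instances: dict[str, object] = {}


def get_singleton(name: str, factory):
    """Return the named singleton, creating it at most once across threads."""
    inst = _instances.get(name)
    if inst is None:                     # first check: no lock on the fast path
        with _lock:
            inst = _instances.get(name)  # second check: under the lock
            if inst is None:
                inst = factory()
                _instances[name] = inst
    return inst


def reset_container() -> None:
    """Clear all singletons (used by tests, as the doc notes)."""
    with _lock:
        _instances.clear()
```

The first unlocked check keeps the common case (instance already exists) free of lock contention; the second check under the lock prevents two threads from both running the factory.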

Dataclass as Value Objects

ChatMessage, GenerationConfig, GenerationResult, ModelManifest, ResolvedModel, ConversionRecord, LicenseInfo — all are immutable or semi-immutable dataclasses encapsulating data.

Pipeline / Chain (CLI pull)

The pull command implements a sequential 7-step pipeline: resolve → verify license → download → detect format → convert (conditional) → create manifest → register. Each step is a decoupled component.

Adapter Pattern (Dual API)

The same inference engines are exposed through two different API interfaces (OpenAI and Ollama) via separate routers that adapt request/response formats to each ecosystem's expected format.

Circuit Breaker + Retry

utils/circuit_breaker.py implements circuit breaker for external calls. utils/retry.py provides retry with configurable exponential backoff. Both improve resilience against network failures or external service issues.

Event Bus (Pub/Sub)

events.py implements an internal EventBus for decoupled communication between components. Allows subscribing to events like model_loaded, inference_complete, download_progress without creating direct dependencies.
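
A minimal sketch of such a bus, assuming a subscribe/publish surface (the event names come from the doc; the exact API of events.py may differ):

```python
from collections import defaultdict
from typing import Callable


class EventBus:
    """In-process pub/sub: handlers register per event name, publishers fan out."""

    def __init__(self):
        self._subscribers: dict[str, list[Callable]] = defaultdict(list)

    def subscribe(self, event: str, handler: Callable) -> None:
        self._subscribers[event].append(handler)

    def publish(self, event: str, **payload) -> None:
        for handler in self._subscribers[event]:
            handler(**payload)
```

A metrics collector, for example, can subscribe to model_loaded without the engine ever importing the metrics module.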

Architecture Decision Records (ADRs)

Important architectural decisions are documented in docs/adr/ following the standard ADR format:

0001 — Singleton Pattern for Config and Registry (Accepted)
0002 — Async API with Sync Engines (sync_to_thread) (Accepted)
0003 — GGUF as Default Format (Accepted)
0004 — Ollama API Compatibility (Accepted)
0005 — License Classification (5 levels) (Accepted)
0006 — Rate Limiting Strategy (Accepted)

19. Dependencies and Extras

Dependency Map by Group
Core (always installed): typer, rich, huggingface-hub, fastapi, uvicorn, pydantic, httpx, sse-starlette, pyyaml
[llama]: llama-cpp-python >=0.3
[transformers]: transformers >=4.47, torch >=2.5, accelerate >=1.2, sentencepiece
[vllm]: vllm >=0.6
[convert]: gguf >=0.10
[dev]: pytest >=8.0, pytest-asyncio, pytest-cov, ruff >=0.8
[build]: pyinstaller >=6.0
[all] = llama + transformers + vllm + convert

Build system: hatchling | Target: Python >=3.10 | Linting: ruff (E, F, W, I) | Line length: 100

20. Testing

Coverage: 90%+ with 1,900 tests — Comprehensive suite with pytest + pytest-asyncio (asyncio_mode = "auto"). 80+ test files with shared fixtures in conftest.py.

Test Categories

Core & Config

test_config.py — HFLConfig, paths, env vars
test_container.py — DI container, Singleton pattern
test_exceptions.py — Exception hierarchy
test_events.py — EventBus, pub/sub
test_metrics.py — Performance metrics
test_validators.py — Data validation
test_security.py — Security & sanitization

API Server

test_api*.py (5) — Endpoints, auth, contracts
test_routes_*.py (4) — OpenAI, Native, TTS
test_state.py — ServerState, async locks
test_streaming.py — SSE streaming
test_model_loader.py — Dynamic model loading
test_helpers.py — ensure_llm/tts_loaded
test_middleware.py — Privacy logger, disclaimer

Engine & Inference

test_engine*.py (2) — Base, LlamaCpp
test_selector*.py (3) — Backend auto-selection
test_transformers*.py (2) — TransformersEngine
test_vllm_engine.py — vLLM with mocks
test_tts_*.py (2) — Bark, Coqui TTS
test_async_wrapper.py — Sync→async wrapper
test_model_pool.py — Model pool

Hub & Downloads

test_hub*.py (2) — HF Hub integration
test_downloader*.py (2) — Download with resume
test_resolver*.py (2) — Model resolution
test_license_checker.py — License classification
test_auth.py — HF authentication

CLI & Utils

test_cli*.py (4) — Typer commands
test_converter*.py (3) — GGUF conversion
test_i18n*.py (4) — Internationalization
test_circuit_breaker.py — Circuit breaker
test_retry.py — Retry with backoff

Integration & Edge Case Tests

test_integration.py — Full end-to-end flows (pull → run → serve)
test_concurrency.py — Concurrency, thread safety, race conditions
test_network_errors.py — Timeouts, disconnects, retry logic
test_edge_cases.py — Malformed inputs, boundary conditions
test_server_lifecycle.py — Startup, shutdown, cleanup
test_cli_signal.py — CLI signal handling (Ctrl+C graceful shutdown)
test_middleware_order.py — Middleware execution order verification
test_exception_handlers.py — HFLError → HTTP status code mapping
test_health_probes.py — Health probe endpoints (/health/deep, /health/sli)
test_config_env.py — Config env var override tests
test_model_pool_wait.py — Model pool non-recursive waiting
test_stress.py — Stress tests (concurrent streaming, model pool stress)
test_timeout.py — Timeout decorator tests
test_failover.py — Failover engine multi-backend retry
test_rate_limit_per_model.py — Per-model rate limiting
test_conversion_caching.py — Conversion caching tests

Main fixtures (conftest.py): tmp_hfl_home (temp directory), mock_hf_api (HfApi mock), sample_manifest (sample model), populated_registry (pre-populated registry), mock_engine (mocked inference engine), test_client (FastAPI TestClient).

# Run tests with coverage
pytest --cov=hfl --cov-report=html --cov-fail-under=80

# Tests by category
pytest tests/test_api*.py -v          # API only
pytest tests/test_engine*.py -v       # Engines only
pytest -k "not slow" -v               # Exclude slow tests

21. Build and Distribution

Build System: Hatchling

Configured in pyproject.toml. Source packages in src/hfl. Entry point: hfl = "hfl.cli.main:app". Installation with pip install . or pip install .[all] for all optional dependencies.

PyInstaller (hfl.spec)

Spec to generate a standalone executable, allowing HFL to be distributed as a single binary without requiring Python to be installed. Built with pyinstaller hfl.spec; the result is placed in dist/hfl.

Local Storage

~/.hfl/
├── models/               # Downloaded model files
│   └── org--model/       # Directory per model (org--model format)
├── cache/                # HuggingFace Hub cache
├── tools/
│   └── llama.cpp/        # Compiled conversion tools
│       └── build/bin/    # Binaries: llama-quantize, etc.
├── models.json           # Model registry (array of ModelManifest)
└── provenance.json       # Immutable conversion log

22. CI/CD & GitHub

GitHub Actions Workflows

ci.yml — Push/PR to main — Full pipeline: lint (ruff) + tests (pytest matrix Python 3.10/3.11/3.12) + type-check
pages.yml — Push to main — Auto-deploy HTML docs to GitHub Pages
build-executables.yml — Release/manual — Cross-platform executable builds with PyInstaller
lint.yml — Push/PR — Linting with ruff (E, F, W, I)
test.yml — Push/PR — Tests with pytest + coverage
security.yml — Scheduled/manual — Dependency security audit
license-check.yml — Push/PR — Dependency license verification

GitHub Templates

Issue Templates: bug_report.yml (structured form with system info, reproduction steps, logs) and feature_request.yml (problem statement, proposed solution, affected component).

PR Template: Checklist including verification of all 5 Compliance Modules (license, provenance, disclaimer, privacy, gating) per HRUL requirements.

Community files: CONTRIBUTING.md (dev setup, code style, PR process), CODE_OF_CONDUCT.md (Contributor Covenant v2.1), SECURITY.md (vulnerability reporting policy).

HFL v0.1.0 — Comprehensive Architecture Documentation

Updated on March 7, 2026 — All diagrams are interactive SVGs

License: HRUL v1.0 (source-available)