HFL — Complete Architecture

Run HuggingFace Models Locally Like Ollama
v0.1.0 · Python >=3.10 · License: HRUL-1.0 · Author: Gabriel Galan Pelayo · Build: Hatchling · 1,900 tests | 90%+ coverage

1. Overview and Purpose

HFL (HuggingFace Local) is a CLI tool and API server for downloading, managing, and running AI models from HuggingFace Hub directly on the user's local machine. It is designed as a drop-in replacement for Ollama, but connected to the HuggingFace ecosystem and its more than 500,000 models.

🎯 Problem It Solves

Users need to run LLMs locally without depending on cloud APIs. Ollama solves this but with a limited catalog. HFL connects directly to HuggingFace Hub, offering access to the world's largest catalog of open-weight models, with download, automatic GGUF conversion, and local execution.

🏗️ Design Philosophy

Extreme modularity with lazy imports, dual API compatibility (OpenAI + Ollama), rigorous legal compliance (licenses, privacy, EU AI Act), and multi-backend support (llama.cpp, Transformers, vLLM) with automatic selection based on model format and available hardware.

2. Technology Stack

Core

Python >=3.10 — Base language
typer >=0.12 — CLI framework
rich >=13.0 — Styled output
pydantic >=2.10 — API data validation
pyyaml >=6.0 — Configuration parsing

API Server

fastapi >=0.115 — Async web framework
uvicorn >=0.32 — ASGI server
sse-starlette >=2.0 — Server-Sent Events
httpx >=0.28 — Async HTTP client

HuggingFace

huggingface-hub >=0.27 — HF Hub API

Optional Extras

llama-cpp-python — GGUF backend
transformers + torch — Native GPU backend
vllm — Production backend
gguf — Format conversion

3. Project File Structure

hfl/
├── pyproject.toml           — Package definition, deps, scripts, tools
├── hfl.spec                 — PyInstaller spec for standalone executable
├── LICENSE                  — HRUL v1.0 (custom license)
├── LICENSE-DEPENDENCIES.md  — Licenses of all dependencies
├── PRIVACY.md               — Privacy policy
├── NOTICE-EU-AI-ACT.md     — EU AI Act compliance
├── DISCLAIMER.md            — Liability disclaimer
├── README.md                — Main documentation
├── README.es.md             — Main documentation (Spanish)
├── LICENSE-FAQ.md            — License FAQ for HRUL
├── CONTRIBUTING.md           — Contributing guide
├── CODE_OF_CONDUCT.md        — Code of Conduct (Contributor Covenant 2.1)
├── CHANGELOG.md              — Changelog (Keep a Changelog format)
├── SECURITY.md               — Security policy
│
├── src/hfl/                 — MAIN SOURCE CODE
│   ├── __init__.py          — Package version (0.1.0)
│   ├── config.py            — HFLConfig dataclass — global configuration
│   ├── exceptions.py        — Complete exception hierarchy
│   ├── events.py            — EventBus for internal pub/sub
│   ├── metrics.py           — Performance metrics
│   ├── plugins.py           — Plugin system with entry_points
│   ├── security.py          — Security sanitization and validation
│   ├── validators.py        — Common data validators
│   ├── logging_config.py    — Centralized logging configuration
│   │
│   ├── core/                — SYSTEM CORE
│   │   ├── __init__.py
│   │   ├── container.py     — DI container with thread-safe Singleton
│   │   ├── observability_setup.py — Observability event listeners setup
│   │   └── tracing.py       — Request tracing with IDs
│   │
│   ├── cli/                 — COMMAND LINE INTERFACE
│   │   ├── __init__.py
│   │   ├── main.py          — 12 commands: pull, run, serve, list, search, rm, inspect, alias, login, logout, version, compliance-report
│   │   └── commands/        — Modularized commands
│   │       └── _utils.py    — Shared CLI utilities
│   │
│   ├── api/                 — REST API SERVER
│   │   ├── __init__.py
│   │   ├── server.py        — FastAPI app, CORS, disclaimer, lifespan
│   │   ├── state.py         — ServerState with async locks for LLM/TTS
│   │   ├── streaming.py     — SSE streaming helpers
│   │   ├── model_loader.py  — Dynamic model loading
│   │   ├── helpers.py       — ensure_llm_loaded, ensure_tts_loaded
│   │   ├── errors.py        — Centralized HTTP error handling
│   │   ├── middleware.py    — Privacy-safe logging
│   │   ├── rate_limit.py    — Rate limiting by IP/token
│   │   ├── routes_openai.py — /v1/chat/completions, /v1/completions, /v1/models
│   │   ├── routes_native.py — /api/generate, /api/chat, /api/tags (Ollama-compatible)
│   │   ├── routes_tts.py    — /v1/audio/speech, /api/tts (TTS endpoints)
│   │   ├── routes_health.py — /health, /ready, /live (healthchecks)
│   │   ├── routes_metrics.py — /metrics (Prometheus + JSON)
│   │   ├── exception_handlers.py — Global HFLError exception handling
│   │   └── timeout.py       — @with_timeout decorator + configurable timeout
│   │
│   ├── engine/              — INFERENCE ENGINES
│   │   ├── __init__.py
│   │   ├── base.py          — InferenceEngine + AudioEngine ABCs + dataclasses
│   │   ├── selector.py      — Automatic backend selection for LLM/TTS
│   │   ├── llama_cpp.py     — LlamaCppEngine (GGUF, CPU/GPU)
│   │   ├── transformers_engine.py — TransformersEngine (safetensors, GPU)
│   │   ├── vllm_engine.py   — VLLMEngine (production GPU)
│   │   ├── bark_engine.py   — BarkEngine (TTS via transformers)
│   │   ├── coqui_engine.py  — CoquiEngine (TTS XTTS-v2)
│   │   ├── async_wrapper.py — Sync→async wrapper for engines
│   │   ├── model_pool.py    — Model pool with LRU eviction
│   │   ├── dependency_check.py — Optional dependency verification
│   │   ├── failover.py      — FailoverEngine (multi-engine retry with sticky routing)
│   │   ├── memory.py        — Real-time RAM/GPU memory tracking
│   │   ├── observability.py — Engine performance metrics
│   │   └── prompt_builder.py — Prompt formats + delimiter escaping
│   │
│   ├── hub/                 — HUGGINGFACE INTEGRATION
│   │   ├── __init__.py
│   │   ├── auth.py          — Authentication and tokens
│   │   ├── client.py        — HTTP client for HF Hub
│   │   ├── downloader.py    — Download with resume and rate limiting
│   │   ├── license_checker.py — License classification (5 levels)
│   │   └── resolver.py      — Intelligent model resolution
│   │
│   ├── models/              — DATA MODELS
│   │   ├── __init__.py
│   │   ├── manifest.py      — ModelManifest — complete metadata
│   │   ├── provenance.py    — ConversionRecord + ProvenanceLog
│   │   ├── registry.py      — ModelRegistry — local JSON inventory
│   │   └── backends/        — Storage backends (File + SQLite)
│   │
│   ├── converter/           — FORMAT CONVERSION
│   │   ├── __init__.py
│   │   ├── formats.py       — ModelFormat + ModelType enums
│   │   └── gguf_converter.py — GGUFConverter — conversion + quantization
│   │
│   ├── utils/               — CROSS-CUTTING UTILITIES
│   │   ├── __init__.py
│   │   ├── circuit_breaker.py — Circuit breaker for resilience
│   │   └── retry.py         — Retry with exponential backoff
│   │
│   └── i18n/                — INTERNATIONALIZATION
│       ├── __init__.py      — t(), get_language(), set_language()
│       └── locales/         — Translation files
│           ├── en.json      — English translations
│           └── es.json      — Spanish translations
│
├── docs/                    — DOCUMENTATION
│   ├── adr/                 — Architecture Decision Records
│   │   ├── 0001-singleton-pattern.md
│   │   ├── 0002-async-api-sync-engines.md
│   │   ├── 0003-gguf-default-format.md
│   │   ├── 0004-ollama-compatibility.md
│   │   ├── 0005-license-classification.md
│   │   └── 0006-rate-limiting-strategy.md
│   └── *.html               — Architecture documentation
│
├── tests/                   — TEST SUITE (80+ files, 90%+ coverage)
│   ├── conftest.py          — Shared fixtures
│   ├── test_api*.py         — API tests (5 files)
│   ├── test_cli*.py         — CLI tests (4 files)
│   ├── test_engine*.py      — Engine tests (8 files)
│   ├── test_hub*.py         — HF Hub tests (5 files)
│   ├── test_i18n*.py        — i18n tests (4 files)
│   └── test_*.py            — Unit and integration tests
│
└── .github/                 — GITHUB INFRASTRUCTURE
    ├── workflows/
    │   ├── ci.yml           — CI: lint + test + type-check (Python 3.10/3.11/3.12)
    │   ├── pages.yml        — Deploy docs to GitHub Pages
    │   ├── build-executables.yml — Cross-platform executable builds
    │   ├── lint.yml         — Linting with ruff
    │   ├── test.yml         — Tests with pytest
    │   ├── security.yml     — Security audit
    │   └── license-check.yml — License verification
    ├── ISSUE_TEMPLATE/
    │   ├── bug_report.yml   — Bug report template
    │   └── feature_request.yml — Feature request template
    └── PULL_REQUEST_TEMPLATE.md — PR template with compliance checklist

4. General Architecture Diagram

High-Level Architecture — HFL v0.1.0
Consumers: terminal (interactive CLI) · OpenAI SDK / httpx / curl · Ollama apps (Open WebUI, etc.)

CLI (typer + rich) — cli/main.py, entry point hfl.cli.main:app. Commands: pull, run, serve, list, search, rm, inspect, alias, login, logout, version.

API Server (FastAPI + Uvicorn) — OpenAI-compatible and Ollama-compatible routes. Middleware: CORS + Disclaimer + Privacy Logger.

Hub (HuggingFace integration) — resolver.py, downloader.py, license_checker.py, auth.py. Connects through the huggingface_hub API.

Engine (inference engines) — LlamaCpp (GGUF, CPU/GPU), Transformers (safetensors, GPU), vLLM (production GPU). selector.py auto-selects by format + hardware; all implement the InferenceEngine ABC (base.py).

Models (data and registry) — manifest.py, registry.py, provenance.py. Persistence: ~/.hfl/models.json and ~/.hfl/provenance.json.

Converter (format conversion) — formats.py, gguf_converter.py: safetensors → FP16 GGUF → quantized GGUF.

Config + exceptions (core) — config.py (HFLConfig), exceptions.py (15 types). Home: ~/.hfl | Port: 11434 (Ollama compat).

HuggingFace Hub (external) — huggingface.co API, 500,000+ models, gated models + licenses.

Local file system (~/.hfl/) — models/ (model files: GGUF, safetensors), models.json (local model registry), provenance.json (conversion log), cache/ + tools/llama.cpp (HF cache + compiled tools).

5. Module config — Central Configuration

File: src/hfl/config.py

Contains the HFLConfig class (dataclass) that defines all global application configuration. It is instantiated once as a singleton (config = HFLConfig()) when importing the module, and ensure_dirs() is called to create the directory structure.

home_dir (Path, default ~/.hfl, env HFL_HOME) — Root directory
models_dir (Path property, ~/.hfl/models) — Storage for downloaded models
cache_dir (Path property, ~/.hfl/cache) — Temporary HuggingFace cache
registry_path (Path property, ~/.hfl/models.json) — Model registry (local inventory)
llama_cpp_dir (Path property, ~/.hfl/tools/llama.cpp) — Compiled conversion tools
host (str, default 127.0.0.1, env HFL_HOST) — API server address
port (int, default 11434, env HFL_PORT) — Port (same as Ollama for compatibility)
default_ctx_size (int, default 4096) — Default context tokens
default_n_gpu_layers (int, default -1) — GPU layers (-1 = all)
rate_limit_enabled (bool, default true, env HFL_RATE_LIMIT_ENABLED) — Enable/disable rate limiting
rate_limit_requests (int, default 60, env HFL_RATE_LIMIT_REQUESTS) — Max requests per window
rate_limit_window (int, default 60, env HFL_RATE_LIMIT_WINDOW) — Rate limit window (seconds)
hf_token (str|None, env HF_TOKEN) — Authentication token (memory only, never persisted)
SLOConfig: Service Level Objectives configuration for monitoring availability targets and latency percentiles (P50, P95, P99). Used by the health and metrics endpoints to report SLI compliance.
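
The shape described above can be sketched as a dataclass whose path attributes derive from home_dir and whose defaults can be overridden by environment variables. This is an illustrative sketch, not the actual source of config.py — field names come from the table, everything else is assumed:

```python
import os
from dataclasses import dataclass, field
from pathlib import Path


def _env(name: str, default: str) -> str:
    return os.environ.get(name, default)


@dataclass
class HFLConfig:
    """Illustrative sketch of the configuration described above."""
    home_dir: Path = field(default_factory=lambda: Path(_env("HFL_HOME", str(Path.home() / ".hfl"))))
    host: str = field(default_factory=lambda: _env("HFL_HOST", "127.0.0.1"))
    port: int = field(default_factory=lambda: int(_env("HFL_PORT", "11434")))
    default_ctx_size: int = 4096
    default_n_gpu_layers: int = -1  # -1 = offload all layers to GPU

    @property
    def models_dir(self) -> Path:
        return self.home_dir / "models"

    @property
    def registry_path(self) -> Path:
        return self.home_dir / "models.json"

    def ensure_dirs(self) -> None:
        # Create the directory structure on first import, as described above.
        for d in (self.models_dir, self.home_dir / "cache"):
            d.mkdir(parents=True, exist_ok=True)
```

Deriving paths as properties keeps HFL_HOME the single override point: moving the root moves every derived path with it.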

6. Module cli — Command Line Interface

File: src/hfl/cli/main.py (~870 lines)

Framework: Typer + Rich. Entry point registered in pyproject.toml as hfl = "hfl.cli.main:app"

hfl pull <model> — Download a model from HF Hub. Options: --quantize Q4_K_M, --format auto|gguf|safetensors, --alias, --skip-license
hfl run <model> — Interactive terminal chat. Options: --backend auto|llama-cpp|transformers|vllm, --ctx, --system, --verbose
hfl serve — REST API server. Options: --host, --port, --model (pre-load), --api-key (authentication)
hfl list — List local models in a Rich table: name, alias, format, quantization, license (risk-colored), size
hfl search <query> — Interactive paginated search on HF Hub. Options: --gguf, --max-params, --min-params, --sort, --page-size
hfl rm <model> — Remove a model with confirmation. Deletes files + registry entry
hfl inspect <model> — Full detail (Rich panel): metadata, license, restrictions, timestamps
hfl alias <model> <alias> — Assign a short alias, allowing models to be referred to by simple names
hfl login — Configure HF token. --token or interactive; verifies with whoami()
hfl logout — Remove the saved token. Uses huggingface_hub.logout()
hfl version — Show version + license
hfl compliance-report — Legal compliance report (JSON/Markdown). Output format selection

Signal Handling: The run command handles Ctrl+C during token streaming gracefully, preserving the partial response.

CLI Helper Functions

_format_size() converts bytes to readable format. _get_key() reads a key without Enter (raw terminal). _extract_params_from_name() extracts parameters from name (regex: "70b", "7b", "1.5b"). _estimate_model_size() estimates disk size based on parameters and quantization. _display_model_row() renders a search result row. _get_params_value() extracts numeric value for filtering.

7. Module hub — HuggingFace Integration

resolver.py — Intelligent Resolution

Class ResolvedModel (dataclass) with: repo_id, revision, filename, format, quantization.

The resolve() function supports three input formats:

1. org/model → direct HF repo

2. org/model:Q4_K_M → repo with Ollama-style quantization

3. model-name → name search (top 5 by downloads)

After resolution, resolve() detects whether the repo contains GGUF files (preferred, selected by _select_gguf() with priority Q4_K_M > Q5_K_M > Q4_K_S), safetensors, or pytorch weights.

downloader.py — Download with Progress

The main function is pull_model(resolved). For GGUF it downloads the individual file with hf_hub_download(); for safetensors it downloads a complete snapshot with snapshot_download(), filtering on *.safetensors, config.json, tokenizer*.json, and tokenizer.model.

It implements rate limiting (0.5 s between API calls) and sends an identifying User-Agent (hfl/0.1.0) to comply with the HuggingFace ToS.

auth.py — Authentication

get_hf_token() obtains token with priority: 1) HF_TOKEN env var, 2) token saved by huggingface_hub.

ensure_auth(repo_id) verifies repo access. If access fails and no token is present, it prompts for one interactively. It respects HF's gating system: it does NOT bypass license acceptance for gated models.

license_checker.py — License Classification

Enum LicenseRisk: PERMISSIVE, CONDITIONAL, NON_COMMERCIAL, RESTRICTED, UNKNOWN.

Dictionary LICENSE_CLASSIFICATION with ~20 known licenses. Dictionary LICENSE_RESTRICTIONS with specific restrictions per family (Llama: 700M MAU, attribution, etc.).

check_model_license() queries HF API, classifies risk, and returns LicenseInfo. require_user_acceptance() presents a Rich panel with the license and requires explicit confirmation for non-permissive licenses.

8. Module models — Data and Registry

ModelManifest (manifest.py)

Dataclass that stores complete metadata for each downloaded model. It is the fundamental unit of information in the system.

name, repo_id (str) — Identification (short name + HF repo)
alias (str|None) — User-defined custom name
local_path, format (str) — Location and type (gguf/safetensors/pytorch)
size_bytes (int), quantization (str) — Size on disk + Q level
architecture (str), parameters (str), context_length (int) — Model characteristics
license, license_name, license_url (str) — Legal information (R1)
license_restrictions (list), gated (bool), license_accepted_at (str) — Restrictions and acceptance
gpai_classification, training_flops (str) — EU AI Act (R4)
created_at, last_used (str) — Timestamps

ModelRegistry (registry.py)

Manages local inventory. Persists to ~/.hfl/models.json as JSON array. Operations: add() (avoids duplicates), get() (searches by name, alias, or repo_id), set_alias(), list_all() (sorted by date), remove().

ProvenanceLog (provenance.py)

Immutable conversion log in ~/.hfl/provenance.json. Each ConversionRecord documents: source (repo, format, revision), destination (format, path, quantization), tool used (llama.cpp + version), original license, and timestamps. Serves for legal traceability and compliance audit (R3).

9. Module engine — Inference Engines

Class Hierarchy — Engine
«ABC» InferenceEngine — load(path, **kwargs), unload(), generate() / generate_stream(), chat() / chat_stream(), model_name, is_loaded
├── LlamaCppEngine — GGUF | CPU + Metal + CUDA + Vulkan. Flash Attention, auto chat format.
├── TransformersEngine — safetensors | GPU CUDA. BitsAndBytes 4-bit/8-bit, TextIteratorStreamer.
└── VLLMEngine — production GPU | NVIDIA CUDA. PagedAttention, continuous batching.

selector.py: GGUF → LlamaCpp | CUDA + safetensors → Transformers | fallback → LlamaCpp
Dataclasses: ChatMessage(role, content), GenerationConfig(temp, top_p, ...), GenerationResult(text, tokens, ...)
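
The contract can be sketched as follows. Method and dataclass names come from the hierarchy above; bodies, defaults, and exact signatures are assumptions:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterator


@dataclass
class ChatMessage:
    role: str
    content: str


@dataclass
class GenerationConfig:
    temperature: float = 0.7   # illustrative defaults
    top_p: float = 0.95
    max_tokens: int = 512


class InferenceEngine(ABC):
    """Sketch of the base.py contract: every backend implements these."""

    @abstractmethod
    def load(self, path: str, **kwargs) -> None: ...

    @abstractmethod
    def unload(self) -> None: ...

    @abstractmethod
    def chat_stream(self, messages: list[ChatMessage],
                    config: GenerationConfig) -> Iterator[str]: ...

    @property
    @abstractmethod
    def is_loaded(self) -> bool: ...
```

Because each backend is interchangeable behind this ABC, the selector and the API server never need to know which engine is active.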

LlamaCppEngine

Main backend. Uses llama-cpp-python. Parameters: n_ctx, n_gpu_layers (-1=all), n_threads (0=auto), flash_attn, chat_format (auto-detect). Includes stderr suppression to silence Metal/CUDA logs when verbose=False. Generates results with metrics: tokens/s, prompt tokens, stop reason.

TransformersEngine

Uses models in native format with GPU. Dynamic quantization support: 4bit (NF4 double quant via BitsAndBytes) and 8bit. Streaming via TextIteratorStreamer in separate thread. Uses tokenizer's apply_chat_template() or Llama-style fallback.

VLLMEngine (experimental)

Production GPU backend. Wraps vllm.LLM with SamplingParams. Real streaming with AsyncLLMEngine, with a synchronous fallback for compatibility. Requires an NVIDIA GPU with CUDA.

FailoverEngine

Multi-backend engine with sticky routing — automatically retries with the next available engine if one fails. Provides high availability across multiple inference backends.

Model Pool & Memory

Model pool with non-recursive waiting (bounded polling), preventing stack overflow with concurrent loads. Real-time RAM and GPU memory tracking via psutil and GPUtil.

Automatic Selector (selector.py)

Decision logic in select_engine(model_path, backend):

1. GGUF model? → LlamaCppEngine
2. CUDA GPU available? → TransformersEngine (on ImportError, fall back to LlamaCppEngine)
3. No GPU? → LlamaCppEngine (will need prior conversion)

All imports are lazy (_get_llama_cpp_engine(), etc.) to avoid requiring all dependencies to be installed.
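
The decision logic above can be sketched as follows. This is an assumed simplification returning backend names rather than engine instances; the real select_engine() performs the lazy engine imports itself:

```python
from pathlib import Path


def select_engine_name(model_path: str, backend: str = "auto") -> str:
    """Sketch of selector.py's decision logic described above."""
    if backend != "auto":
        return backend  # explicit user choice wins
    if Path(model_path).suffix == ".gguf":
        return "llama-cpp"
    try:
        import torch  # lazy: only imported for the GPU check
        if torch.cuda.is_available():
            return "transformers"
    except ImportError:
        pass  # torch not installed → fall through
    return "llama-cpp"  # fallback (may require prior conversion)
```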

10. Module converter — Format Conversion

formats.py — Format Detection

Enum ModelFormat: GGUF, SAFETENSORS, PYTORCH, UNKNOWN. The detect_format(path) function inspects extensions (.gguf, .safetensors, .pt/.pth/.bin) both in individual files and directories (rglob). find_model_file() locates the main model file.
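
A sketch of this extension-based detection (the directory branch is a simplified version of the rglob scan described above):

```python
from enum import Enum
from pathlib import Path


class ModelFormat(Enum):
    GGUF = "gguf"
    SAFETENSORS = "safetensors"
    PYTORCH = "pytorch"
    UNKNOWN = "unknown"


# Extension table from the text above.
_EXT_MAP = {".gguf": ModelFormat.GGUF, ".safetensors": ModelFormat.SAFETENSORS,
            ".pt": ModelFormat.PYTORCH, ".pth": ModelFormat.PYTORCH,
            ".bin": ModelFormat.PYTORCH}


def detect_format(path: Path) -> ModelFormat:
    """Detect a model's format from file extensions, for files or directories."""
    if path.is_dir():
        for ext, fmt in _EXT_MAP.items():
            if next(path.rglob(f"*{ext}"), None) is not None:
                return fmt
        return ModelFormat.UNKNOWN
    return _EXT_MAP.get(path.suffix, ModelFormat.UNKNOWN)
```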

gguf_converter.py — GGUF Conversion

Class GGUFConverter with two-step pipeline:

safetensors/pytorch → Step 1: convert_hf_to_gguf.py (FP16 GGUF) → Step 2: llama-quantize (e.g. Q4_K_M) → final GGUF

ensure_tools() auto-installs llama.cpp if not present: git clone → cmake build → pip install requirements. check_model_convertibility() validates that the model is convertible (rejects LoRA adapters, image models, models without config.json).

Q2_K — ~2.5 bits/weight, ~80% quality — extreme compression
Q3_K_M — ~3.5 bits/weight, ~87% quality — low RAM
Q4_K_M — ~4.5 bits/weight, ~92% quality — DEFAULT, best balance
Q5_K_M — ~5.0 bits/weight, ~96% quality — high quality
Q6_K — ~6.5 bits/weight, ~97% quality — premium
Q8_0 — ~8.0 bits/weight, ~98%+ quality — maximum quantized quality
F16 — 16.0 bits/weight, 100% quality — no quantization
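
The bits/weight column makes disk-size estimates straightforward: size ≈ parameters × bits/weight ÷ 8. A hypothetical helper in the spirit of the _estimate_model_size() mentioned in section 6:

```python
# Bits per weight, from the quantization table above (approximate values).
BITS_PER_WEIGHT = {"Q2_K": 2.5, "Q3_K_M": 3.5, "Q4_K_M": 4.5,
                   "Q5_K_M": 5.0, "Q6_K": 6.5, "Q8_0": 8.0, "F16": 16.0}


def estimate_size_gb(params_billions: float, quant: str = "Q4_K_M") -> float:
    """Approximate file size in GiB: parameters x bits/weight / 8 bytes."""
    size_bytes = params_billions * 1e9 * BITS_PER_WEIGHT[quant] / 8
    return round(size_bytes / 1024**3, 1)
```

So a 7B model at Q4_K_M lands around 3.7 GiB, versus roughly 13 GiB unquantized at F16.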

11. Module api — REST Server

Endpoints and API Compatibility
FastAPI Server (server.py) — middleware chain: RequestLogger → APIKey → RateLimit → Disclaimer → CORS

OpenAI-compatible (routes_openai.py):
POST /v1/chat/completions — chat with SSE streaming (text/event-stream)
POST /v1/completions — text completion with streaming
GET /v1/models — list available models
Schemas: ChatCompletionRequest, CompletionRequest. Helper: _ensure_model_loaded() → auto-load.
Streaming: _stream_chat(), _stream_completion(). Format: SSE data: {...}\n\n terminated by data: [DONE]

Ollama-compatible (routes_native.py):
POST /api/generate — text generation with NDJSON streaming
POST /api/chat — multi-turn chat with NDJSON streaming
GET /api/tags, GET /api/version, HEAD /
Schemas: GenerateRequest, ChatRequest. Helper: _options_to_config() for parameter mapping.
Streaming: application/x-ndjson. Compatible with Open WebUI, Chatbox, etc.
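
The SSE framing described above can be sketched as an async generator. The chunk fields follow the OpenAI chat.completion.chunk shape; the wiring around the project's _stream_chat() is assumed:

```python
import json
from typing import AsyncIterator


async def stream_chat(token_iter: AsyncIterator[str], model: str) -> AsyncIterator[str]:
    """Wrap an engine's token stream in OpenAI-style SSE frames."""
    async for token in token_iter:
        chunk = {"object": "chat.completion.chunk", "model": model,
                 "choices": [{"delta": {"content": token}, "index": 0}]}
        yield f"data: {json.dumps(chunk)}\n\n"
    yield "data: [DONE]\n\n"  # sentinel expected by OpenAI SDK clients
```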

ServerState

Class with engine (active InferenceEngine), current_model (loaded ModelManifest) and api_key (str|None for authentication). Instantiated as global singleton. Server lifespan handles cleanup on close.

Middleware Stack

Execution order: RequestLogger → APIKey → RateLimit (conditional) → Disclaimer → CORS. APIKey runs BEFORE RateLimit, so unauthenticated requests are rejected without consuming rate limit tokens.

RequestLogger — Privacy-safe logging. NEVER logs: bodies (prompts/outputs), auth headers, User-Agent. Only: method, path, status, duration.

APIKeyMiddleware — Optional authentication via --api-key flag. Supports Authorization: Bearer <key> and X-API-Key: <key> headers. Public endpoints (/health, /) exempt.

RateLimitMiddleware — IP-based rate limiting, configurable via env vars. Supports per-model rate limiting.

DisclaimerMiddleware — Adds X-AI-Disclaimer header to AI generation endpoint responses (R9).

CORSMiddleware — Allows all origins, methods, and headers (local development).
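
The rate-limiting step above can be sketched as a sliding-window limiter keyed by client IP. This is a hypothetical stand-in for RateLimitMiddleware's internals; the real middleware reads its limits from the HFL_RATE_LIMIT_* env vars:

```python
import time
from collections import defaultdict, deque


class RateLimiter:
    """Sliding-window limiter: at most max_requests per window_s per client."""

    def __init__(self, max_requests: int = 60, window_s: float = 60.0):
        self.max_requests = max_requests
        self.window_s = window_s
        self._hits: dict[str, deque] = defaultdict(deque)

    def allow(self, client_ip: str) -> bool:
        now = time.monotonic()
        hits = self._hits[client_ip]
        while hits and now - hits[0] > self.window_s:  # drop expired entries
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # middleware would answer 429 here
        hits.append(now)
        return True
```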

Exception Handlers

Centralized exception handling via register_exception_handlers(app). Maps the entire HFLError hierarchy to HTTP responses with appropriate status codes (400 for validation, 429 for rate limit, 500 for internal errors).

OpenAPI Documentation

All endpoints include tags, summary, and responses in their decorators for auto-generated OpenAPI documentation. Tags: OpenAI, Ollama, TTS, Health, Metrics.

Structured Logging

All logging uses %-style format strings (logger.info('Model loaded: %s', name)) instead of f-strings, avoiding unnecessary evaluation when the log level is disabled.

Health & Metrics Endpoints

/health/deep?probe=true — Runs a minimal inference test to verify the model works; reports "degraded" status on failure
/health/sli — Service Level Indicators with availability and latency metrics
/metrics — Prometheus-format metrics
/metrics/json — JSON-format metrics

12. Module i18n — Internationalization

File: src/hfl/i18n/__init__.py

Complete internationalization system that allows the CLI to display all messages in multiple languages. Uses JSON translation files with nested keys and dot-notation access.

Main Functions

t(key, **kwargs) — Translates a key (e.g., t("commands.pull.downloading")). Supports interpolation with .format(**kwargs); cached with lru_cache for performance
get_language() — Returns the current language. Reads the HFL_LANG env var, defaults to "en"
set_language(lang) — Changes the language at runtime and clears the translation cache
_load_translations(lang) — Loads the language JSON file from locales/
_get_nested_value(data, key) — Navigates nested dictionaries with dot notation

Translation Files

Each language has a JSON file (~193 lines) in src/hfl/i18n/locales/:

en.json (English) and es.json (Spanish). Keys follow the structure module.action.message, for example: commands.pull.downloading, commands.search.no_results, errors.model_not_found.

Usage: Set language with export HFL_LANG=es. All CLI commands use t() for their messages, allowing language changes without modifying code.
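
The dot-notation lookup behind t() can be sketched as pure functions (the real t() reads module-level translations and is lru_cache'd; here the translations dict is passed explicitly for clarity):

```python
def get_nested_value(data: dict, key: str):
    """Resolve 'commands.pull.downloading' against nested dicts."""
    node = data
    for part in key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node if isinstance(node, str) else None


def t(translations: dict, key: str, **kwargs) -> str:
    """Translate a key; fall back to the key itself when no entry exists."""
    text = get_nested_value(translations, key) or key
    return text.format(**kwargs)
```

Falling back to the raw key means a missing translation degrades to a readable (if ugly) message instead of crashing the CLI.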

13. Exception Hierarchy

Custom Exception Tree
HFLError
├── ModelNotFoundError
├── ModelAlreadyExistsError
├── DownloadError
│   └── NetworkError
├── ConversionError
│   └── ToolNotFoundError
├── LicenseError
│   ├── LicenseNotAcceptedError
│   └── GatedModelError
├── EngineError
│   ├── ModelNotLoadedError
│   ├── MissingDependencyError
│   └── OutOfMemoryError
├── AuthenticationError
│   ├── InvalidTokenError
│   └── TokenRequiredError
└── ConfigurationError
    └── InvalidConfigError

All inherit from HFLError(message, details) with __str__ combining both. Each exception includes specific contextual data (model_name, repo_id, required_gb, etc.) for detailed diagnostics.

14. Complete Flow: hfl pull

Model Download and Registration Pipeline
1. Resolve — resolver.py (HfApi.model_info())
2. License — license_checker.py (check + accept)
3. Download — downloader.py (hf_hub_download / snapshot_download)
4. Detect format — formats.py
5. Convert? — gguf_converter.py (only if not GGUF: safetensors → FP16 → Qx)
6. Create manifest — manifest.py
7. Register — registry.add() → models.json

15. Complete Flow: hfl run

Interactive Chat Pipeline
1. Registry.get() — search by name/alias
2. select_engine() — auto: format + hardware
3. engine.load() — load into RAM/VRAM
4. Chat loop — input → messages.append() → chat_stream() → print tokens (loop until /exit)
5. /exit — engine.unload()

16. Complete Flow: hfl serve

API Server Pipeline
Client (HTTP/SSE)
→ Middleware: CORS, APIKey, Disclaimer (R9)
→ Router: routes_openai.py, routes_native.py, /health, /
→ _ensure_model_loaded: Registry → select_engine → engine.load() (if needed)
→ Engine: chat() / generate() / chat_stream()
→ Response: JSON / SSE

Auto-load: if a different model is requested, the current one is unloaded and the new one loaded.

18. Design Patterns

Strategy Pattern (Engine)

The InferenceEngine interface (ABC) defines the contract. Three concrete implementations (LlamaCpp, Transformers, vLLM) are interchangeable. selector.py acts as factory choosing the correct strategy based on context.

Lazy Loading / Lazy Imports

All heavy dependencies (torch, transformers, vllm, llama-cpp-python) are imported only when needed. Imports inside functions prevent minimal installations from failing due to absent optional dependencies.

Dependency Injection Container

core/container.py implements a DI container with thread-safe singletons (double-checked locking). Provides get_config(), get_registry(), get_state(), get_event_bus(), get_metrics(). Facilitates testing with reset_container().
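
The double-checked locking described above can be sketched generically (the real container exposes typed accessors like get_config(); this generic helper is an assumption):

```python
import threading

_lock = threading.Lock()
_instances: dict[str, object] = {}


def get_singleton(name: str, factory):
    """Return the named singleton, creating it at most once across threads."""
    inst = _instances.get(name)
    if inst is None:                     # first check: no lock on the fast path
        with _lock:
            inst = _instances.get(name)  # second check: under the lock
            if inst is None:
                inst = factory()
                _instances[name] = inst
    return inst


def reset_container() -> None:
    """Clear all singletons (used by tests, as the doc notes)."""
    with _lock:
        _instances.clear()
```

The first unlocked check keeps the common case (instance already exists) free of lock contention; the second check under the lock prevents two threads from both running the factory.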

Dataclass as Value Objects

ChatMessage, GenerationConfig, GenerationResult, ModelManifest, ResolvedModel, ConversionRecord, LicenseInfo — all are immutable or semi-immutable dataclasses encapsulating data.

Pipeline / Chain (CLI pull)

The pull command implements a sequential 7-step pipeline: resolve → verify license → download → detect format → convert (conditional) → create manifest → register. Each step is a decoupled component.

Adapter Pattern (Dual API)

The same inference engines are exposed through two different API interfaces (OpenAI and Ollama) via separate routers that adapt request/response formats to each ecosystem's expected format.

Circuit Breaker + Retry

utils/circuit_breaker.py implements circuit breaker for external calls. utils/retry.py provides retry with configurable exponential backoff. Both improve resilience against network failures or external service issues.

Event Bus (Pub/Sub)

events.py implements an internal EventBus for decoupled communication between components. Allows subscribing to events like model_loaded, inference_complete, download_progress without creating direct dependencies.
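
A minimal sketch of such a bus, assuming a subscribe/publish surface (the event names come from the doc; the exact API of events.py may differ):

```python
from collections import defaultdict
from typing import Callable


class EventBus:
    """In-process pub/sub: handlers register per event name, publishers fan out."""

    def __init__(self):
        self._subscribers: dict[str, list[Callable]] = defaultdict(list)

    def subscribe(self, event: str, handler: Callable) -> None:
        self._subscribers[event].append(handler)

    def publish(self, event: str, **payload) -> None:
        for handler in self._subscribers[event]:
            handler(**payload)
```

A metrics collector, for example, can subscribe to model_loaded without the engine ever importing the metrics module.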

Architecture Decision Records (ADRs)

Important architectural decisions are documented in docs/adr/ following the standard ADR format:

0001 — Singleton Pattern for Config and Registry (Accepted)
0002 — Async API with Sync Engines (sync_to_thread) (Accepted)
0003 — GGUF as Default Format (Accepted)
0004 — Ollama API Compatibility (Accepted)
0005 — License Classification (5 levels) (Accepted)
0006 — Rate Limiting Strategy (Accepted)

19. Dependencies and Extras

Dependency Map by Group
Core (always installed): typer, rich, huggingface-hub, fastapi, uvicorn, pydantic, httpx, sse-starlette, pyyaml
[llama]: llama-cpp-python >=0.3
[transformers]: transformers >=4.47, torch >=2.5, accelerate >=1.2, sentencepiece
[vllm]: vllm >=0.6
[convert]: gguf >=0.10
[dev]: pytest >=8.0, pytest-asyncio, pytest-cov, ruff >=0.8
[build]: pyinstaller >=6.0
[all] = llama + transformers + vllm + convert

Build system: hatchling | Target: Python >=3.10 | Linting: ruff (E, F, W, I) | Line length: 100

20. Testing

Coverage: 90%+ with 1,900 tests — Comprehensive suite with pytest + pytest-asyncio (asyncio_mode = "auto"). 80+ test files with shared fixtures in conftest.py.

Test Categories

Core & Config

test_config.py — HFLConfig, paths, env vars
test_container.py — DI container, Singleton pattern
test_exceptions.py — Exception hierarchy
test_events.py — EventBus, pub/sub
test_metrics.py — Performance metrics
test_validators.py — Data validation
test_security.py — Security & sanitization

API Server

test_api*.py (5) — Endpoints, auth, contracts
test_routes_*.py (4) — OpenAI, Native, TTS
test_state.py — ServerState, async locks
test_streaming.py — SSE streaming
test_model_loader.py — Dynamic model loading
test_helpers.py — ensure_llm/tts_loaded
test_middleware.py — Privacy logger, disclaimer

Engine & Inference

test_engine*.py (2) — Base, LlamaCpp
test_selector*.py (3) — Backend auto-selection
test_transformers*.py (2) — TransformersEngine
test_vllm_engine.py — vLLM with mocks
test_tts_*.py (2) — Bark, Coqui TTS
test_async_wrapper.py — Sync→async wrapper
test_model_pool.py — Model pool

Hub & Downloads

test_hub*.py (2) — HF Hub integration
test_downloader*.py (2) — Download with resume
test_resolver*.py (2) — Model resolution
test_license_checker.py — License classification
test_auth.py — HF authentication

CLI & Utils

test_cli*.py (4) — Typer commands
test_converter*.py (3) — GGUF conversion
test_i18n*.py (4) — Internationalization
test_circuit_breaker.py — Circuit breaker
test_retry.py — Retry with backoff

Integration & Edge Case Tests

test_integration.py — Full end-to-end flows (pull → run → serve)
test_concurrency.py — Concurrency, thread safety, race conditions
test_network_errors.py — Timeouts, disconnects, retry logic
test_edge_cases.py — Malformed inputs, boundary conditions
test_server_lifecycle.py — Startup, shutdown, cleanup
test_cli_signal.py — CLI signal handling (Ctrl+C graceful shutdown)
test_middleware_order.py — Middleware execution order verification
test_exception_handlers.py — HFLError → HTTP status code mapping
test_health_probes.py — Health probe endpoints (/health/deep, /health/sli)
test_config_env.py — Config env var override tests
test_model_pool_wait.py — Model pool non-recursive waiting
test_stress.py — Stress tests (concurrent streaming, model pool stress)
test_timeout.py — Timeout decorator tests
test_failover.py — Failover engine multi-backend retry
test_rate_limit_per_model.py — Per-model rate limiting
test_conversion_caching.py — Conversion caching tests

Main fixtures (conftest.py): tmp_hfl_home (temp directory), mock_hf_api (HfApi mock), sample_manifest (sample model), populated_registry (pre-populated registry), mock_engine (mocked inference engine), test_client (FastAPI TestClient).

# Run tests with coverage
pytest --cov=hfl --cov-report=html --cov-fail-under=80

# Tests by category
pytest tests/test_api*.py -v          # API only
pytest tests/test_engine*.py -v       # Engines only
pytest -k "not slow" -v               # Exclude slow tests

21. Build and Distribution

Build System: Hatchling

Configured in pyproject.toml. Source packages in src/hfl. Entry point: hfl = "hfl.cli.main:app". Installation with pip install . or pip install .[all] for all optional dependencies.

PyInstaller (hfl.spec)

Spec to generate a standalone executable, allowing HFL to be distributed as a single binary without requiring Python to be installed. Built with pyinstaller hfl.spec; the result is placed in dist/hfl.

Local Storage

~/.hfl/
├── models/               # Downloaded model files
│   └── org--model/       # Directory per model (org--model format)
├── cache/                # HuggingFace Hub cache
├── tools/
│   └── llama.cpp/        # Compiled conversion tools
│       └── build/bin/    # Binaries: llama-quantize, etc.
├── models.json           # Model registry (array of ModelManifest)
└── provenance.json       # Immutable conversion log

22. CI/CD & GitHub

GitHub Actions Workflows

ci.yml — Push/PR to main — Full pipeline: lint (ruff) + tests (pytest matrix Python 3.10/3.11/3.12) + type-check
pages.yml — Push to main — Auto-deploy HTML docs to GitHub Pages
build-executables.yml — Release/manual — Cross-platform executable builds with PyInstaller
lint.yml — Push/PR — Linting with ruff (E, F, W, I)
test.yml — Push/PR — Tests with pytest + coverage
security.yml — Scheduled/manual — Dependency security audit
license-check.yml — Push/PR — Dependency license verification

GitHub Templates

Issue Templates: bug_report.yml (structured form with system info, reproduction steps, logs) and feature_request.yml (problem statement, proposed solution, affected component).

PR Template: Checklist including verification of all 5 Compliance Modules (license, provenance, disclaimer, privacy, gating) per HRUL requirements.

Community files: CONTRIBUTING.md (dev setup, code style, PR process), CODE_OF_CONDUCT.md (Contributor Covenant v2.1), SECURITY.md (vulnerability reporting policy).

HFL v0.1.0 — Comprehensive Architecture Documentation

Updated on March 7, 2026 — All diagrams are interactive SVGs

License: HRUL v1.0 (source-available)