HFL (HuggingFace Local) is a CLI tool and API server for downloading, managing, and running AI models from HuggingFace Hub directly on the user's local machine. It is designed as a drop-in replacement for Ollama, connected instead to the HuggingFace ecosystem and its more than 500,000 models.
Users need to run LLMs locally without depending on cloud APIs. Ollama solves this but with a limited catalog. HFL connects directly to HuggingFace Hub, offering access to the world's largest catalog of open-weight models, with download, automatic GGUF conversion, and local execution.
Key design traits: extreme modularity with lazy imports; dual API compatibility (OpenAI + Ollama); rigorous legal compliance (licenses, privacy, EU AI Act); and multi-backend support (llama.cpp, Transformers, vLLM) with automatic selection based on model format and available hardware.
| Dependency | Purpose |
|---|---|
| Python >=3.10 | Base language |
| typer >=0.12 | CLI framework |
| rich >=13.0 | Styled output |
| pydantic >=2.10 | API data validation |
| pyyaml >=6.0 | Configuration parsing |
| fastapi >=0.115 | Async web framework |
| uvicorn >=0.32 | ASGI server |
| sse-starlette >=2.0 | Server-Sent Events |
| httpx >=0.28 | Async HTTP client |
| huggingface-hub >=0.27 | HF Hub API |
| llama-cpp-python | GGUF backend |
| transformers + torch | Native GPU backend |
| vllm | Production backend |
| gguf | Format conversion |
```
hfl/
├── pyproject.toml — Package definition, deps, scripts, tools
├── hfl.spec — PyInstaller spec for standalone executable
├── LICENSE — HRUL v1.0 (custom license)
├── LICENSE-DEPENDENCIES.md — Licenses of all dependencies
├── PRIVACY.md — Privacy policy
├── NOTICE-EU-AI-ACT.md — EU AI Act compliance
├── DISCLAIMER.md — Liability disclaimer
├── README.md — Main documentation
├── README.es.md — Main documentation (Spanish)
├── LICENSE-FAQ.md — License FAQ for HRUL
├── CONTRIBUTING.md — Contributing guide
├── CODE_OF_CONDUCT.md — Code of Conduct (Contributor Covenant 2.1)
├── CHANGELOG.md — Changelog (Keep a Changelog format)
├── SECURITY.md — Security policy
│
├── src/hfl/ — MAIN SOURCE CODE
│   ├── __init__.py — Package version (0.1.0)
│   ├── config.py — HFLConfig dataclass — global configuration
│   ├── exceptions.py — Complete exception hierarchy
│   ├── events.py — EventBus for internal pub/sub
│   ├── metrics.py — Performance metrics
│   ├── plugins.py — Plugin system with entry_points
│   ├── security.py — Security sanitization and validation
│   ├── validators.py — Common data validators
│   ├── logging_config.py — Centralized logging configuration
│   │
│   ├── core/ — SYSTEM CORE
│   │   ├── __init__.py
│   │   ├── container.py — DI container with thread-safe Singleton
│   │   ├── observability_setup.py — Observability event listeners setup
│   │   └── tracing.py — Request tracing with IDs
│   │
│   ├── cli/ — COMMAND LINE INTERFACE
│   │   ├── __init__.py
│   │   ├── main.py — 12 commands: pull, run, serve, list, search, rm, inspect, alias, login, logout, version, compliance-report
│   │   └── commands/ — Modularized commands
│   │       └── _utils.py — Shared CLI utilities
│   │
│   ├── api/ — REST API SERVER
│   │   ├── __init__.py
│   │   ├── server.py — FastAPI app, CORS, disclaimer, lifespan
│   │   ├── state.py — ServerState with async locks for LLM/TTS
│   │   ├── streaming.py — SSE streaming helpers
│   │   ├── model_loader.py — Dynamic model loading
│   │   ├── helpers.py — ensure_llm_loaded, ensure_tts_loaded
│   │   ├── errors.py — Centralized HTTP error handling
│   │   ├── middleware.py — Privacy-safe logging
│   │   ├── rate_limit.py — Rate limiting by IP/token
│   │   ├── routes_openai.py — /v1/chat/completions, /v1/completions, /v1/models
│   │   ├── routes_native.py — /api/generate, /api/chat, /api/tags (Ollama-compatible)
│   │   ├── routes_tts.py — /v1/audio/speech, /api/tts (TTS endpoints)
│   │   ├── routes_health.py — /health, /ready, /live (healthchecks)
│   │   ├── routes_metrics.py — /metrics (Prometheus + JSON)
│   │   ├── exception_handlers.py — Global HFLError exception handling
│   │   └── timeout.py — @with_timeout decorator + configurable timeout
│   │
│   ├── engine/ — INFERENCE ENGINES
│   │   ├── __init__.py
│   │   ├── base.py — InferenceEngine + AudioEngine ABCs + dataclasses
│   │   ├── selector.py — Automatic backend selection for LLM/TTS
│   │   ├── llama_cpp.py — LlamaCppEngine (GGUF, CPU/GPU)
│   │   ├── transformers_engine.py — TransformersEngine (safetensors, GPU)
│   │   ├── vllm_engine.py — VLLMEngine (production GPU)
│   │   ├── bark_engine.py — BarkEngine (TTS via transformers)
│   │   ├── coqui_engine.py — CoquiEngine (TTS XTTS-v2)
│   │   ├── async_wrapper.py — Sync→async wrapper for engines
│   │   ├── model_pool.py — Model pool with LRU eviction
│   │   ├── dependency_check.py — Optional dependency verification
│   │   ├── failover.py — FailoverEngine (multi-engine retry with sticky routing)
│   │   ├── memory.py — Real-time RAM/GPU memory tracking
│   │   ├── observability.py — Engine performance metrics
│   │   └── prompt_builder.py — Prompt formats + delimiter escaping
│   │
│   ├── hub/ — HUGGINGFACE INTEGRATION
│   │   ├── __init__.py
│   │   ├── auth.py — Authentication and tokens
│   │   ├── client.py — HTTP client for HF Hub
│   │   ├── downloader.py — Download with resume and rate limiting
│   │   ├── license_checker.py — License classification (5 levels)
│   │   └── resolver.py — Intelligent model resolution
│   │
│   ├── models/ — DATA MODELS
│   │   ├── __init__.py
│   │   ├── manifest.py — ModelManifest — complete metadata
│   │   ├── provenance.py — ConversionRecord + ProvenanceLog
│   │   ├── registry.py — ModelRegistry — local JSON inventory
│   │   └── backends/ — Storage backends (File + SQLite)
│   │
│   ├── converter/ — FORMAT CONVERSION
│   │   ├── __init__.py
│   │   ├── formats.py — ModelFormat + ModelType enums
│   │   └── gguf_converter.py — GGUFConverter — conversion + quantization
│   │
│   ├── utils/ — CROSS-CUTTING UTILITIES
│   │   ├── __init__.py
│   │   ├── circuit_breaker.py — Circuit breaker for resilience
│   │   └── retry.py — Retry with exponential backoff
│   │
│   └── i18n/ — INTERNATIONALIZATION
│       ├── __init__.py — t(), get_language(), set_language()
│       └── locales/ — Translation files
│           ├── en.json — English translations
│           └── es.json — Spanish translations
│
├── docs/ — DOCUMENTATION
│   ├── adr/ — Architecture Decision Records
│   │   ├── 0001-singleton-pattern.md
│   │   ├── 0002-async-api-sync-engines.md
│   │   ├── 0003-gguf-default-format.md
│   │   ├── 0004-ollama-compatibility.md
│   │   ├── 0005-license-classification.md
│   │   └── 0006-rate-limiting-strategy.md
│   └── *.html — Architecture documentation
│
├── tests/ — TEST SUITE (80+ files, 90%+ coverage)
│   ├── conftest.py — Shared fixtures
│   ├── test_api*.py — API tests (5 files)
│   ├── test_cli*.py — CLI tests (4 files)
│   ├── test_engine*.py — Engine tests (8 files)
│   ├── test_hub*.py — HF Hub tests (5 files)
│   ├── test_i18n*.py — i18n tests (4 files)
│   └── test_*.py — Unit and integration tests
│
└── .github/ — GITHUB INFRASTRUCTURE
    ├── workflows/
    │   ├── ci.yml — CI: lint + test + type-check (Python 3.10/3.11/3.12)
    │   ├── pages.yml — Deploy docs to GitHub Pages
    │   ├── build-executables.yml — Cross-platform executable builds
    │   ├── lint.yml — Linting with ruff
    │   ├── test.yml — Tests with pytest
    │   ├── security.yml — Security audit
    │   └── license-check.yml — License verification
    ├── ISSUE_TEMPLATE/
    │   ├── bug_report.yml — Bug report template
    │   └── feature_request.yml — Feature request template
    └── PULL_REQUEST_TEMPLATE.md — PR template with compliance checklist
```
File: src/hfl/config.py
Contains the HFLConfig class (dataclass) that defines all global application configuration. It is instantiated once as a singleton (config = HFLConfig()) when importing the module, and ensure_dirs() is called to create the directory structure.
| Property | Type | Default | Env Var | Description |
|---|---|---|---|---|
| home_dir | Path | ~/.hfl | HFL_HOME | Root directory |
| models_dir | Path (prop) | ~/.hfl/models | — | Storage for downloaded models |
| cache_dir | Path (prop) | ~/.hfl/cache | — | Temporary HuggingFace cache |
| registry_path | Path (prop) | ~/.hfl/models.json | — | Model registry (local inventory) |
| llama_cpp_dir | Path (prop) | ~/.hfl/tools/llama.cpp | — | Compiled conversion tools |
| host | str | 127.0.0.1 | HFL_HOST | API server address |
| port | int | 11434 | HFL_PORT | Port (same as Ollama for compatibility) |
| default_ctx_size | int | 4096 | — | Default context tokens |
| default_n_gpu_layers | int | -1 | — | GPU layers (-1 = all) |
| rate_limit_enabled | bool | true | HFL_RATE_LIMIT_ENABLED | Enable/disable rate limiting |
| rate_limit_requests | int | 60 | HFL_RATE_LIMIT_REQUESTS | Max requests per window |
| rate_limit_window | int | 60 | HFL_RATE_LIMIT_WINDOW | Rate limit window (seconds) |
| hf_token | str \| None | — | HF_TOKEN | Authentication token (memory only, never persisted) |
hf_token is read ONLY from the environment variable. It is never persisted to disk, never saved to models.json or any configuration file. It exists only in memory during process execution.
File: src/hfl/cli/main.py (~870 lines)
Framework: Typer + Rich. Entry point registered in pyproject.toml as hfl = "hfl.cli.main:app"
| Command | Description | Key Options |
|---|---|---|
| `hfl pull <model>` | Download model from HF Hub | `--quantize Q4_K_M`, `--format auto\|gguf\|safetensors`, `--alias`, `--skip-license` |
| `hfl run <model>` | Interactive terminal chat | `--backend auto\|llama-cpp\|transformers\|vllm`, `--ctx`, `--system`, `--verbose` |
| `hfl serve` | REST API server | `--host`, `--port`, `--model` (pre-load), `--api-key` (authentication) |
| `hfl list` | List local models with Rich table | Shows name, alias, format, quantization, license (risk-colored), size |
| `hfl search <query>` | Interactive paginated search on HF Hub | `--gguf`, `--max-params`, `--min-params`, `--sort`, `--page-size` |
| `hfl rm <model>` | Remove model with confirmation | Deletes files + registry entry |
| `hfl inspect <model>` | Full detail (Rich panel) | Shows metadata, license, restrictions, timestamps |
| `hfl alias <model> <alias>` | Assign short alias | Allows referring to models by simple names |
| `hfl login` | Configure HF token | `--token` or interactive. Verifies with whoami() |
| `hfl logout` | Remove saved token | Uses huggingface_hub.logout() |
| `hfl version` | Show version + license | — |
| `hfl compliance-report` | Legal compliance report (JSON/Markdown) | Output format selection |
The run command handles Ctrl+C during token streaming gracefully, preserving the partial response.
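A minimal sketch of that interrupt handling, assuming the streamed tokens arrive as an iterator (chat_turn and stream_tokens are illustrative names, not the project's API):

```python
def chat_turn(stream_tokens):
    """Accumulate streamed tokens; keep the partial text if interrupted."""
    parts = []
    try:
        for token in stream_tokens:
            parts.append(token)
    except KeyboardInterrupt:
        # Ctrl+C during generation: stop streaming but keep what arrived.
        pass
    return "".join(parts)
```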
_format_size() converts bytes to readable format. _get_key() reads a key without Enter (raw terminal). _extract_params_from_name() extracts parameters from name (regex: "70b", "7b", "1.5b"). _estimate_model_size() estimates disk size based on parameters and quantization. _display_model_row() renders a search result row. _get_params_value() extracts numeric value for filtering.
Class ResolvedModel (dataclass) with: repo_id, revision, filename, format, quantization.
The resolve() function supports three input formats:
1. org/model → direct HF repo
2. org/model:Q4_K_M → repo with Ollama-style quantization
3. model-name → name search (top 5 by downloads)
After resolution, it detects whether the repo contains GGUF files (preferring _select_gguf() with priority Q4_K_M > Q5_K_M > Q4_K_S), safetensors, or PyTorch weights.
The main entry point is pull_model(resolved). For GGUF it downloads the individual file with hf_hub_download(); for safetensors it downloads a complete snapshot with snapshot_download(), filtering for *.safetensors, config.json, tokenizer*.json, and tokenizer.model.
It enforces rate limiting (0.5 s between API calls) and sends an identifying User-Agent (hfl/0.1.0) to comply with the HuggingFace ToS.
get_hf_token() obtains token with priority: 1) HF_TOKEN env var, 2) token saved by huggingface_hub.
ensure_auth(repo_id) verifies repo access. If access fails and no token is present, it prompts for one interactively. It respects HF's gating system: it does NOT bypass license acceptance for gated models.
Enum LicenseRisk: PERMISSIVE, CONDITIONAL, NON_COMMERCIAL, RESTRICTED, UNKNOWN.
Dictionary LICENSE_CLASSIFICATION with ~20 known licenses. Dictionary LICENSE_RESTRICTIONS with specific restrictions per family (Llama: 700M MAU, attribution, etc.).
check_model_license() queries HF API, classifies risk, and returns LicenseInfo. require_user_acceptance() presents a Rich panel with the license and requires explicit confirmation for non-permissive licenses.
ModelManifest is a dataclass that stores the complete metadata for each downloaded model. It is the fundamental unit of information in the system.
| Field | Type | Purpose |
|---|---|---|
| name, repo_id | str | Identification (short name + HF repo) |
| alias | str \| None | User-defined custom name |
| local_path, format | str | Location and type (gguf/safetensors/pytorch) |
| size_bytes, quantization | int, str | Size on disk + Q level |
| architecture, parameters, context_length | str, str, int | Model characteristics |
| license, license_name, license_url | str | Legal information (R1) |
| license_restrictions, gated, license_accepted_at | list, bool, str | Restrictions and acceptance |
| gpai_classification, training_flops | str | EU AI Act (R4) |
| created_at, last_used | str | Timestamps |
ModelRegistry manages the local inventory, persisted to ~/.hfl/models.json as a JSON array. Operations: add() (avoids duplicates), get() (searches by name, alias, or repo_id), set_alias(), list_all() (sorted by date), remove().
ProvenanceLog is an immutable conversion log in ~/.hfl/provenance.json. Each ConversionRecord documents: source (repo, format, revision), destination (format, path, quantization), tool used (llama.cpp + version), original license, and timestamps. It provides legal traceability and supports compliance audits (R3).
LlamaCppEngine is the main backend, built on llama-cpp-python. Parameters: n_ctx, n_gpu_layers (-1 = all), n_threads (0 = auto), flash_attn, chat_format (auto-detect). It suppresses stderr to silence Metal/CUDA logs when verbose=False, and generates results with metrics: tokens/s, prompt tokens, stop reason.
TransformersEngine runs models in native format on GPU. It supports dynamic quantization: 4-bit (NF4 double quant via BitsAndBytes) and 8-bit. Streaming uses TextIteratorStreamer in a separate thread; prompts use the tokenizer's apply_chat_template() with a Llama-style fallback.
EXPERIMENTAL Production GPU backend. Wraps vllm.LLM with SamplingParams. Real streaming with AsyncLLMEngine, with synchronous fallback for compatibility. Requires NVIDIA GPU with CUDA.
FailoverEngine is a multi-backend engine with sticky routing — it automatically retries with the next available engine if one fails, providing high availability across multiple inference backends.
Model pool with non-recursive waiting (bounded polling), preventing stack overflow with concurrent loads. Real-time RAM and GPU memory tracking via psutil and GPUtil.
The decision logic lives in select_engine(model_path, backend), which honors an explicit backend choice and otherwise selects based on model format and available hardware.
All imports are lazy (_get_llama_cpp_engine(), etc.) to avoid requiring all dependencies to be installed.
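A hedged sketch of that decision; the real rules (hardware probing, vLLM availability checks) are richer than this illustration:

```python
from pathlib import Path


def select_engine(model_path: str, backend: str = "auto") -> str:
    """Pick a backend name; a simplified stand-in for the real selector."""
    if backend != "auto":
        return backend  # an explicit --backend choice always wins
    path = Path(model_path)
    # GGUF files (or directories containing one) go to llama.cpp (CPU/GPU).
    is_gguf = path.suffix == ".gguf" or (path.is_dir() and any(path.glob("*.gguf")))
    # Native safetensors checkpoints go to Transformers (GPU preferred).
    return "llama-cpp" if is_gguf else "transformers"
```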
Enum ModelFormat: GGUF, SAFETENSORS, PYTORCH, UNKNOWN. The detect_format(path) function inspects extensions (.gguf, .safetensors, .pt/.pth/.bin) both in individual files and directories (rglob). find_model_file() locates the main model file.
Class GGUFConverter implements a two-step pipeline: conversion of the source checkpoint to GGUF, followed by quantization to the target level.
ensure_tools() auto-installs llama.cpp if not present: git clone → cmake build → pip install requirements. check_model_convertibility() validates that the model is convertible (rejects LoRA adapters, image models, models without config.json).
| Quantization | Bits/weight | Quality | Use Case |
|---|---|---|---|
| Q2_K | ~2.5 | ~80% | Extreme compression |
| Q3_K_M | ~3.5 | ~87% | Low RAM |
| Q4_K_M | ~4.5 | ~92% | DEFAULT — best balance |
| Q5_K_M | ~5.0 | ~96% | High quality |
| Q6_K | ~6.5 | ~97% | Premium |
| Q8_0 | ~8.0 | ~98%+ | Maximum quantized quality |
| F16 | 16.0 | 100% | No quantization |
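The bits-per-weight column implies a rough disk-size estimate: parameters × bits per weight ÷ 8 gives bytes. A hypothetical helper along those lines (the project's _estimate_model_size() may differ):

```python
# Approximate bits per weight, taken from the quantization table above.
BITS_PER_WEIGHT = {"Q2_K": 2.5, "Q4_K_M": 4.5, "Q5_K_M": 5.0, "Q8_0": 8.0, "F16": 16.0}


def estimate_size_gb(params_billions: float, quant: str) -> float:
    """Estimate on-disk size in GB for a given parameter count and quant level."""
    bits = BITS_PER_WEIGHT.get(quant, 16.0)  # unknown levels fall back to F16
    return round(params_billions * bits / 8, 1)
```

For example, a 7B model at Q4_K_M lands near 4 GB, which is why Q4_K_M is a practical default for consumer hardware.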
ServerState holds engine (the active InferenceEngine), current_model (the loaded ModelManifest), and api_key (str|None for authentication). It is instantiated as a global singleton; the server lifespan handles cleanup on shutdown.
Execution order: RequestLogger → APIKey → RateLimit (conditional) → Disclaimer → CORS. APIKey runs BEFORE RateLimit, so unauthenticated requests are rejected without consuming rate limit tokens.
RequestLogger — Privacy-safe logging. NEVER logs: bodies (prompts/outputs), auth headers, User-Agent. Only: method, path, status, duration.
APIKeyMiddleware — Optional authentication via --api-key flag. Supports Authorization: Bearer <key> and X-API-Key: <key> headers. Public endpoints (/health, /) exempt.
RateLimitMiddleware — IP-based rate limiting, configurable via env vars. Supports per-model rate limiting.
DisclaimerMiddleware — Adds X-AI-Disclaimer header to AI generation endpoint responses (R9).
CORSMiddleware — Allows all origins, methods, and headers (local development).
Centralized exception handling via register_exception_handlers(app). Maps the entire HFLError hierarchy to HTTP responses with appropriate status codes (400 for validation, 429 for rate limit, 500 for internal errors).
All endpoints include tags, summary, and responses in their decorators for auto-generated OpenAPI documentation. Tags: OpenAI, Ollama, TTS, Health, Metrics.
All logging uses %-style format strings (logger.info('Model loaded: %s', name)) instead of f-strings, avoiding unnecessary evaluation when the log level is disabled.
| Endpoint | Description |
|---|---|
| /health/deep?probe=true | Runs a minimal inference test to verify the model works. Reports "degraded" status on failure. |
| /health/sli | Service Level Indicators with availability and latency metrics. |
| /metrics | Prometheus-format metrics. |
| /metrics/json | JSON-format metrics. |
File: src/hfl/i18n/__init__.py
Complete internationalization system that allows the CLI to display all messages in multiple languages. Uses JSON translation files with nested keys and dot-notation access.
| Function | Description |
|---|---|
| t(key, **kwargs) | Translates a key (e.g., t("commands.pull.downloading")). Supports interpolation with .format(**kwargs). Cached with lru_cache for performance. |
| get_language() | Returns the current language. Reads from the HFL_LANG env var; defaults to "en" |
| set_language(lang) | Changes the language at runtime and clears the translation cache |
| _load_translations(lang) | Loads the language JSON file from locales/ |
| _get_nested_value(data, key) | Navigates nested dictionaries with dot notation |
Each language has a JSON file (~193 lines) in src/hfl/i18n/locales/:
en.json (English) and es.json (Spanish). Keys follow the structure module.action.message, for example: commands.pull.downloading, commands.search.no_results, errors.model_not_found.
export HFL_LANG=es. All CLI commands use t() for their messages, allowing language changes without modifying code.
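The dot-notation lookup plus interpolation can be sketched as follows, with TRANSLATIONS standing in for a parsed locale file:

```python
from functools import lru_cache

# Stand-in for the parsed contents of locales/en.json.
TRANSLATIONS = {
    "commands": {"pull": {"downloading": "Downloading {model}..."}},
    "errors": {"model_not_found": "Model not found"},
}


@lru_cache(maxsize=None)
def _lookup(key: str) -> str:
    # Walk nested dicts following dot notation: "commands.pull.downloading".
    value = TRANSLATIONS
    for part in key.split("."):
        value = value[part]
    return value


def t(key: str, **kwargs) -> str:
    # Interpolate only when kwargs are given, so plain messages stay cheap.
    template = _lookup(key)
    return template.format(**kwargs) if kwargs else template
```

Caching the raw template (not the interpolated result) keeps lru_cache effective across calls with different kwargs.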
All exceptions inherit from HFLError(message, details), whose __str__ combines both. Each exception carries specific contextual data (model_name, repo_id, required_gb, etc.) for detailed diagnostics.
HFL implements a comprehensive legal compliance system documented with audit references (R1-R9):
Mandatory verification before download. Classification into 5 risk levels. Presentation to user with visual panel. Explicit acceptance required for non-permissive licenses. License metadata persisted in ModelManifest.
Immutable ProvenanceLog records each conversion: source, destination, tool, version, license, timestamp. Legal warning displayed during conversion: "the original license remains in effect".
Fields in ModelManifest for GPAI classification: gpai_classification ("gpai", "gpai-systemic", "exempt") and training_flops. File NOTICE-EU-AI-ACT.md.
HF token only in memory. Privacy-safe logging: NEVER logs prompts, AI outputs, tokens, User-Agent. Only metadata (method, path, status, duration). File PRIVACY.md.
Rate limiting (0.5s between API calls). Identifying User-Agent (hfl/0.1.0). Respect for gating system: does NOT bypass license acceptance. User must accept on huggingface.co first.
DisclaimerMiddleware adds X-AI-Disclaimer header to all AI endpoint responses. Disclaimer in CLI chat: "AI models may generate incorrect information..."
| File | Purpose |
|---|---|
| LICENSE | HRUL v1.0 — Project's custom license |
| LICENSE-FAQ.md | Frequently asked questions about the HRUL v1.0 license |
| LICENSE-DEPENDENCIES.md | Licenses of all third-party dependencies |
| PRIVACY.md | Privacy policy (no data collection) |
| NOTICE-EU-AI-ACT.md | EU AI Act compliance |
| DISCLAIMER.md | General liability disclaimer |
The InferenceEngine interface (ABC) defines the contract. Three concrete implementations (LlamaCpp, Transformers, vLLM) are interchangeable. selector.py acts as factory choosing the correct strategy based on context.
All heavy dependencies (torch, transformers, vllm, llama-cpp-python) are imported only when needed. Imports inside functions prevent minimal installations from failing due to absent optional dependencies.
core/container.py implements a DI container with thread-safe singletons (double-checked locking). Provides get_config(), get_registry(), get_state(), get_event_bus(), get_metrics(). Facilitates testing with reset_container().
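The double-checked locking described above, reduced to a stand-in Container class (the real module exposes typed accessors like get_config()):

```python
import threading


class Container:
    """Thread-safe singleton via double-checked locking; a simplified stand-in."""

    _instance = None
    _lock = threading.Lock()

    @classmethod
    def get(cls):
        # First check without the lock avoids contention on the hot path.
        if cls._instance is None:
            with cls._lock:
                # Second check: another thread may have created it meanwhile.
                if cls._instance is None:
                    cls._instance = cls()
        return cls._instance

    @classmethod
    def reset(cls):
        # Mirrors reset_container() for test isolation.
        with cls._lock:
            cls._instance = None
```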
ChatMessage, GenerationConfig, GenerationResult, ModelManifest, ResolvedModel, ConversionRecord, LicenseInfo — all are immutable or semi-immutable dataclasses encapsulating data.
The pull command implements a sequential 7-step pipeline: resolve → verify license → download → detect format → convert (conditional) → create manifest → register. Each step is a decoupled component.
The same inference engines are exposed through two different API interfaces (OpenAI and Ollama) via separate routers that adapt request/response formats to each ecosystem's expected format.
utils/circuit_breaker.py implements circuit breaker for external calls. utils/retry.py provides retry with configurable exponential backoff. Both improve resilience against network failures or external service issues.
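A sketch of the retry decorator with exponential backoff; names and defaults are assumptions, and sleep is injectable so the example stays fast:

```python
import time
from functools import wraps


def retry(max_attempts: int = 3, base_delay: float = 0.5, sleep=time.sleep):
    """Retry a callable with exponential backoff between attempts."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts: surface the last error
                    # Exponential backoff: base, 2x base, 4x base, ...
                    sleep(base_delay * (2 ** attempt))
        return wrapper
    return decorator
```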
events.py implements an internal EventBus for decoupled communication between components. Allows subscribing to events like model_loaded, inference_complete, download_progress without creating direct dependencies.
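A minimal EventBus along those lines (method names are assumptions):

```python
from collections import defaultdict


class EventBus:
    """Simple in-process pub/sub; handlers receive the event payload as kwargs."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event: str, handler) -> None:
        self._subscribers[event].append(handler)

    def publish(self, event: str, **payload) -> None:
        # Events with no subscribers are silently dropped.
        for handler in self._subscribers[event]:
            handler(**payload)
```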
Important architectural decisions are documented in docs/adr/ following the standard ADR format:
| ADR | Title | Status |
|---|---|---|
| 0001 | Singleton Pattern for Config and Registry | Accepted |
| 0002 | Async API with Sync Engines (sync_to_thread) | Accepted |
| 0003 | GGUF as Default Format | Accepted |
| 0004 | Ollama API Compatibility | Accepted |
| 0005 | License Classification (5 levels) | Accepted |
| 0006 | Rate Limiting Strategy | Accepted |
pytest + pytest-asyncio (asyncio_mode = "auto"). 80+ test files with shared fixtures in conftest.py.
| File(s) | Coverage |
|---|---|
| test_config.py | HFLConfig, paths, env vars |
| test_container.py | DI container, Singleton pattern |
| test_exceptions.py | Exception hierarchy |
| test_events.py | EventBus, pub/sub |
| test_metrics.py | Performance metrics |
| test_validators.py | Data validation |
| test_security.py | Security & sanitization |
| test_api*.py (5) | Endpoints, auth, contracts |
| test_routes_*.py (4) | OpenAI, Native, TTS |
| test_state.py | ServerState, async locks |
| test_streaming.py | SSE streaming |
| test_model_loader.py | Dynamic model loading |
| test_helpers.py | ensure_llm/tts_loaded |
| test_middleware.py | Privacy logger, disclaimer |
| test_engine*.py (2) | Base, LlamaCpp |
| test_selector*.py (3) | Backend auto-selection |
| test_transformers*.py (2) | TransformersEngine |
| test_vllm_engine.py | vLLM with mocks |
| test_tts_*.py (2) | Bark, Coqui TTS |
| test_async_wrapper.py | Sync→Async wrapper |
| test_model_pool.py | Model pool |
| test_hub*.py (2) | HF Hub integration |
| test_downloader*.py (2) | Download with resume |
| test_resolver*.py (2) | Model resolution |
| test_license_checker.py | License classification |
| test_auth.py | HF authentication |
| test_cli*.py (4) | Typer commands |
| test_converter*.py (3) | GGUF conversion |
| test_i18n*.py (4) | Internationalization |
| test_circuit_breaker.py | Circuit breaker |
| test_retry.py | Retry with backoff |
| test_integration.py | Full end-to-end flows (pull→run→serve) |
| test_concurrency.py | Concurrency, thread safety, race conditions |
| test_network_errors.py | Timeouts, disconnects, retry logic |
| test_edge_cases.py | Malformed inputs, boundary conditions |
| test_server_lifecycle.py | Startup, shutdown, cleanup |
| test_cli_signal.py | CLI signal handling (Ctrl+C graceful shutdown) |
| test_middleware_order.py | Middleware execution order verification |
| test_exception_handlers.py | HFLError → HTTP status code mapping |
| test_health_probes.py | Health probe endpoints (/health/deep, /health/sli) |
| test_config_env.py | Config env var override tests |
| test_model_pool_wait.py | Model pool non-recursive waiting |
| test_stress.py | Stress tests (concurrent streaming, model pool stress) |
| test_timeout.py | Timeout decorator tests |
| test_failover.py | Failover engine multi-backend retry |
| test_rate_limit_per_model.py | Per-model rate limiting |
| test_conversion_caching.py | Conversion caching tests |
Main fixtures (conftest.py): tmp_hfl_home (temp directory), mock_hf_api (HfApi mock), sample_manifest (sample model), populated_registry (pre-populated registry), mock_engine (mocked inference engine), test_client (FastAPI TestClient).
```bash
# Run tests with coverage
pytest --cov=hfl --cov-report=html --cov-fail-under=80

# Tests by category
pytest tests/test_api*.py -v      # API only
pytest tests/test_engine*.py -v   # Engines only
pytest -k "not slow" -v           # Exclude slow tests
```
Packaging is configured in pyproject.toml, with source packages under src/hfl. Entry point: hfl = "hfl.cli.main:app". Install with pip install . or pip install .[all] for all optional dependencies.
The hfl.spec file generates a standalone executable, allowing HFL to be distributed as a single binary without requiring Python to be installed. Build with pyinstaller hfl.spec; the result lands in dist/hfl.
```
~/.hfl/
├── models/              # Downloaded model files
│   └── org--model/      # One directory per model (org--model format)
├── cache/               # HuggingFace Hub cache
├── tools/
│   └── llama.cpp/       # Compiled conversion tools
│       └── build/bin/   # Binaries: llama-quantize, etc.
├── models.json          # Model registry (array of ModelManifest)
└── provenance.json      # Immutable conversion log
```
| Workflow | Trigger | Description |
|---|---|---|
| ci.yml | Push/PR to main | Full pipeline: lint (ruff) + tests (pytest matrix Python 3.10/3.11/3.12) + type-check |
| pages.yml | Push to main | Auto-deploy HTML docs to GitHub Pages |
| build-executables.yml | Release/manual | Cross-platform executable builds with PyInstaller |
| lint.yml | Push/PR | Linting with ruff (E, F, W, I) |
| test.yml | Push/PR | Tests with pytest + coverage |
| security.yml | Scheduled/manual | Dependency security audit |
| license-check.yml | Push/PR | Dependency license verification |
Issue Templates: bug_report.yml (structured form with system info, reproduction steps, logs) and feature_request.yml (problem statement, proposed solution, affected component).
PR Template: Checklist including verification of all 5 Compliance Modules (license, provenance, disclaimer, privacy, gating) per HRUL requirements.
Community files: CONTRIBUTING.md (dev setup, code style, PR process), CODE_OF_CONDUCT.md (Contributor Covenant v2.1), SECURITY.md (vulnerability reporting policy).
HFL v0.1.0 — Comprehensive Architecture Documentation
Updated on March 7, 2026 — All diagrams are interactive SVGs
License: HRUL v1.0 (source-available)