# Ingestion & Storage

## Purpose

Astrape needs a reliable way to collect energy-related data, normalize it, store it, and give Gibil a clean view of the current system state. The first version should favor boring, inspectable data flows over cleverness.

Gibil should not need to know whether a value came from Modbus, Home Assistant, a weather API, a price API, or a manual override. It should receive timestamped observations and snapshots with enough metadata to decide whether the data is fresh and trustworthy.

## Initial Sources

### Sigen Inverter

- Protocol: Modbus TCP
- Polling target: every 5-10 seconds for fast-changing electrical state
- Initial metrics:
  - `solar_power_w`
  - `battery_soc_pct`
  - `battery_charge_w`
  - `battery_discharge_w`
  - `grid_import_w`
  - `grid_export_w`
  - `daily_yield_kwh`
- Risk: the register map must be confirmed before this can be real

### Home Assistant / Ganymede

- Preferred integration: MQTT
- Direction: HASS/Ganymede should publish selected state to Astrape where possible
- Initial metrics:
  - `home_power_w`
  - `indoor_temp_c`
  - selected device states
  - selected sensor values needed for water/heating logic
- Reasoning: MQTT keeps Astrape loosely coupled and avoids making HASS a synchronous dependency for every decision tick

### Weather

- Preferred first source: Open-Meteo
- Polling target: hourly forecast refresh
- Initial metrics:
  - `outdoor_temp_c`
  - `cloud_cover_pct`
  - `ghi_w_m2`
  - `wind_speed_m_s`
- Use: external forecast history for generation and heating models

### Grid Pricing

- First implementation: static time-of-use config
- Later implementation: spot pricing API if needed
- Initial metrics:
  - `grid_price_per_kwh`
  - `price_stage`
  - `cheap_window_active`
- Reasoning: static config lets Gibil produce useful behavior before the price API work is settled

### Manual Inputs

- Purpose: allow operator-supplied values when a real integration is not available yet
- Inputs may come from local config or a small authenticated admin path
- Manual data should be marked clearly with `source = manual`

## Observation Shape

Every collector should produce normalized observations:

```text
observed_at: timestamp when the measurement was true
received_at: timestamp when Astrape received it
source:      sigen | hass | weather | price | manual
metric:      stable metric name
value:       number, string, or boolean
unit:        W | kWh | pct | C | SEK/kWh | state | none
quality:     ok | stale | estimated | missing | error
metadata:    source-specific context
```

Guidelines:

- `observed_at` and `received_at` are both needed because pushed data may arrive late
- metric names should be stable and boring
- raw source names/registers/entities belong in metadata, not in the metric name
- Gibil should be able to ignore stale or low-quality observations

## Derived Snapshots

Gibil should reason from snapshots, not directly from loose individual observations. A snapshot is the best-known whole-system state at a decision tick. It can include:

- current solar generation
- current home consumption
- battery SoC
- battery charge/discharge power
- grid import/export
- current price stage
- active forecast window
- stale/missing input flags

Snapshots should be persisted because they explain what Gibil knew when it made a decision.
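To make these shapes concrete, here is a minimal Python sketch (standard library only) of the observation record and the fold from latest observations into a snapshot. The class and function names, the 30-second staleness cutoff, and the `input_quality` layout are illustrative assumptions, not a settled API:

```python
# Sketch of the observation record and snapshot builder described above.
# Names, defaults, and the staleness cutoff are illustrative assumptions.
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Any

@dataclass(frozen=True)
class Observation:
    observed_at: datetime      # when the measurement was true
    received_at: datetime      # when Astrape received it
    source: str                # sigen | hass | weather | price | manual
    metric: str                # stable, boring name, e.g. "solar_power_w"
    value: float | str | bool
    unit: str                  # W | kWh | pct | C | SEK/kWh | state | none
    quality: str = "ok"        # ok | stale | estimated | missing | error
    metadata: dict[str, Any] = field(default_factory=dict)  # registers, entity ids, ...

def build_snapshot(latest: dict[str, Observation], now: datetime,
                   max_age: timedelta = timedelta(seconds=30)) -> dict[str, Any]:
    """Fold the newest observation per metric into one whole-system snapshot.

    Stale or low-quality inputs are flagged rather than silently dropped,
    so Gibil can see what it does not know.
    """
    snapshot: dict[str, Any] = {}
    flagged: list[str] = []
    for metric, obs in latest.items():
        if obs.quality != "ok" or now - obs.observed_at > max_age:
            flagged.append(metric)
        else:
            snapshot[metric] = obs.value
    snapshot["input_quality"] = {"stale_or_missing": sorted(flagged)}
    return snapshot
```

Keeping the stale flags inside the snapshot (rather than filtering metrics out) is what lets a later decision record show exactly which inputs were degraded at that tick.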
## Storage Choice

Use TimescaleDB as the first primary store. Reasons:

- It is Postgres, so querying and joining data stays straightforward
- It handles time-series retention and aggregation well
- It works for raw observations, derived snapshots, decisions, forecasts, and events
- It leaves room for later model training without needing a second historical store immediately

InfluxDB remains a reasonable alternative, but TimescaleDB is the better default if we want relational joins, auditability, and forecast training queries.

The runtime expects `ASTRAPE_DATABASE_URL` to point at TimescaleDB. Weather ingest also expects `ASTRAPE_LATITUDE` and `ASTRAPE_LONGITUDE`.

## Initial Tables

### `observations`

Raw normalized metric samples from all collectors.

Core fields:

- `id`
- `observed_at`
- `received_at`
- `source`
- `metric`
- `value_num`
- `value_text`
- `value_bool`
- `unit`
- `quality`
- `metadata`

Notes:

- use one value column based on the metric type
- keep metadata as JSON for source-specific details
- make this a hypertable on `observed_at`
- illustrative DDL appears in the bootstrap sketch after `sigen_plant_snapshots` below

### `snapshots`

Periodic whole-system state used by Gibil.

Core fields:

- `id`
- `created_at`
- `snapshot`
- `input_quality`

Notes:

- store the snapshot as JSON initially
- this can be normalized later if query patterns demand it

### `decisions`

Gibil outputs and reasoning.

Core fields:

- `id`
- `created_at`
- `snapshot_id`
- `stage`
- `recommendations`
- `reasons`
- `confidence`

Notes:

- decisions should be explainable enough to debug after the fact
- this table becomes the audit trail for HASS-facing behavior

### `weather_forecast_points`

Clean external weather forecast points from weather sources.

Core fields:

- `id`
- `issued_at`
- `target_at`
- `horizon_hours`
- `source`
- `temperature_c`
- `shortwave_radiation_w_m2`
- `cloud_cover_pct`

Notes:

- this stores external forecasts, not internal predictions
- make this a hypertable on `target_at`

### `weather_resolved_truth`

Observed weather for target hours that have already happened.

Core fields:

- `id`
- `resolved_at`
- `source`
- `temperature_c`
- `shortwave_radiation_w_m2`

Notes:

- future prediction modules can join this to `weather_forecast_points`
- make this a hypertable on `resolved_at`

### `sigen_plant_snapshots`

High-resolution Sigenergy plant state from Modbus TCP.

Core fields:

- `observed_at`
- `received_at`
- `source`
- `solar_power_w`
- `battery_soc_pct`
- `battery_soh_pct`
- `battery_power_w`
- `grid_power_w`
- `grid_import_w`
- `grid_export_w`
- `load_power_w`
- `plant_active_power_w`
- `accumulated_pv_energy_kwh`
- `daily_consumed_energy_kwh`
- `accumulated_consumed_energy_kwh`
- status fields for EMS, running state, and grid sensor state
- `raw_values`

Notes:

- the raw polling target is `SIGEN_POLL_SECONDS=5`
- make this a hypertable on `observed_at`
- keep raw JSON during integration so unsupported or surprising registers can be debugged
- rollup views should preserve averages, min/max spikes, and sample counts so short-duration usage signatures are not erased completely

Initial rollups:

- `sigen_plant_snapshots_1m`
- `sigen_plant_snapshots_15m`
- `sigen_plant_snapshots_1h`
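A schema bootstrap sketch for the two hypertables and the first rollup, assuming TimescaleDB's `create_hypertable` and continuous aggregates and the `psycopg` 3 driver. The column lists are abbreviated; the real tables should carry all core fields listed above:

```python
# Illustrative DDL, not a migration: abbreviated columns, assumed driver.
import os
import psycopg  # psycopg 3; an assumption, any Postgres driver works

STATEMENTS = [
    # Raw normalized observations: one typed value column per metric type.
    """
    CREATE TABLE IF NOT EXISTS observations (
        id          bigserial,
        observed_at timestamptz NOT NULL,
        received_at timestamptz NOT NULL,
        source      text NOT NULL,
        metric      text NOT NULL,
        value_num   double precision,
        value_text  text,
        value_bool  boolean,
        unit        text,
        quality     text NOT NULL DEFAULT 'ok',
        metadata    jsonb
    )
    """,
    "SELECT create_hypertable('observations', 'observed_at', if_not_exists => TRUE)",
    # High-resolution Sigen plant state (column list heavily abbreviated).
    """
    CREATE TABLE IF NOT EXISTS sigen_plant_snapshots (
        observed_at     timestamptz NOT NULL,
        received_at     timestamptz NOT NULL,
        source          text NOT NULL,
        solar_power_w   double precision,
        battery_soc_pct double precision,
        load_power_w    double precision,
        raw_values      jsonb
    )
    """,
    "SELECT create_hypertable('sigen_plant_snapshots', 'observed_at', if_not_exists => TRUE)",
    # 1-minute rollup: averages alone would erase short spikes (kettle,
    # pump starts), so keep min/max and sample counts alongside the mean.
    """
    CREATE MATERIALIZED VIEW IF NOT EXISTS sigen_plant_snapshots_1m
    WITH (timescaledb.continuous) AS
    SELECT time_bucket('1 minute', observed_at) AS bucket,
           avg(load_power_w) AS load_power_w_avg,
           min(load_power_w) AS load_power_w_min,
           max(load_power_w) AS load_power_w_max,
           count(*)          AS sample_count
    FROM sigen_plant_snapshots
    GROUP BY bucket
    WITH NO DATA
    """,
]

# Continuous aggregate creation cannot run inside a transaction block,
# hence autocommit.
with psycopg.connect(os.environ["ASTRAPE_DATABASE_URL"], autocommit=True) as conn:
    for stmt in STATEMENTS:
        conn.execute(stmt)
```

The 15-minute and hourly rollups would follow the same pattern with wider `time_bucket` intervals.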
### `system_events`

Operational events from collectors, storage, Gibil, and publishers.

Core fields:

- `id`
- `created_at`
- `component`
- `severity`
- `event_type`
- `message`
- `metadata`

Notes:

- this should capture stale data, auth failures, bad Modbus reads, publish failures, and degraded-mode decisions

## Retention

Initial retention targets:

- raw 5-10 second observations: 7-30 days
- 1-minute aggregates: 6-12 months
- 15-minute/hourly aggregates: keep indefinitely unless storage becomes a problem
- decisions: keep indefinitely
- system events: keep indefinitely or archive after a year

Retention should be revisited after real sample rates and database size are known.

## First Slice

The first implementation slice should prove the shape before touching real hardware.

1. Define the observation and snapshot models.
2. Add a manual collector only if needed for operator-supplied values.
3. Store observations in TimescaleDB or a local development substitute.
4. Build one snapshot from the latest observations.
5. Let Gibil make a simple stage decision from that snapshot.
6. Persist the decision with reasons.

This gives us the whole loop:

```text
collector -> observations -> snapshot -> Gibil decision -> stored audit trail
```

MQTT publishing can come immediately after this loop exists. A toy end-to-end sketch of this loop closes the document, after the open questions below.

## Open Questions

- Should development use real TimescaleDB from day one, or SQLite/Postgres first?
- What is the exact MQTT topic namespace for HASS/Ganymede integration?
- Which HASS entities should be included in the first read-only state feed?
- How should the `gibil` IPA identity authenticate to MQTT and HASS?
- What high-resolution retention target is acceptable on the Astrape VM?
- Should snapshots be created on a fixed schedule, on new data, or both?
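To close, a self-contained toy pass over the first-slice loop. The stage names, thresholds, and confidence values here are invented for illustration; they are not Gibil's actual policy:

```python
# Hypothetical end-to-end pass over the first slice: manual snapshot ->
# trivial stage decision -> audit record. Everything below is a stand-in.
from datetime import datetime, timezone

def decide_stage(snapshot: dict) -> dict:
    """Toy stage rule standing in for Gibil."""
    flagged = snapshot["input_quality"]["stale_or_missing"]
    if flagged:
        return {"stage": "hold",
                "reasons": [f"degraded inputs: {flagged}"], "confidence": 0.2}
    if snapshot["solar_power_w"] >= snapshot["home_power_w"]:
        return {"stage": "charge_battery",
                "reasons": ["solar covers load"], "confidence": 0.8}
    return {"stage": "discharge_battery",
            "reasons": ["load exceeds solar"], "confidence": 0.6}

# A snapshot as a manual collector might produce it; in the real slice
# this comes from build_snapshot over stored observations.
snapshot = {
    "solar_power_w": 3200.0,
    "home_power_w": 1100.0,
    "input_quality": {"stale_or_missing": []},
}
decision = {
    "created_at": datetime.now(timezone.utc).isoformat(),
    **decide_stage(snapshot),
}
print(decision)  # in the real slice this row lands in the decisions table
```

Even this toy version exercises the property the section argues for: the persisted decision carries its reasons and the input quality it saw, so it can be audited after the fact.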