Leonard Kramer a287f9d3b2	2026-05-04 18:51:12 +01:00
gzip yesterday's NDJSON, add log file, README

- Compress prices-YYYY-MM-DD.ndjson for past UTC days after each successful scrape (atomic .gz.tmp rename, ~10x size reduction on real data).
- Optional logging.file: tee INFO+ records to a low-noise log file via a small slog multi-handler so stdout can stay at DEBUG independently.
- Bump default request_timeout 30s -> 60s after observing real API slowness.
- Add unit test for CompressOlder covering atomicity, today-file skipping, and existing .gz preservation.
- README with deploy, operations, and analysis snippets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cmd/fuel-history	gzip yesterday's NDJSON, add log file, README	2026-05-04 18:51:12 +01:00
pkg	gzip yesterday's NDJSON, add log file, README	2026-05-04 18:51:12 +01:00
.dockerignore	initial scraper for GOV.UK Fuel Finder PFS prices	2026-05-04 18:34:18 +01:00
.gitignore	initial scraper for GOV.UK Fuel Finder PFS prices	2026-05-04 18:34:18 +01:00
config.example.yaml	gzip yesterday's NDJSON, add log file, README	2026-05-04 18:51:12 +01:00
docker-compose.yml	initial scraper for GOV.UK Fuel Finder PFS prices	2026-05-04 18:34:18 +01:00
Dockerfile	initial scraper for GOV.UK Fuel Finder PFS prices	2026-05-04 18:34:18 +01:00
go.mod	initial scraper for GOV.UK Fuel Finder PFS prices	2026-05-04 18:34:18 +01:00
go.sum	initial scraper for GOV.UK Fuel Finder PFS prices	2026-05-04 18:34:18 +01:00
README.md	gzip yesterday's NDJSON, add log file, README	2026-05-04 18:51:12 +01:00

README.md

fuel_history

A small Go service that periodically scrapes the GOV.UK Fuel Finder PFS fuel-prices API and writes the results as flattened NDJSON to a daily-rotated file for later analysis.

- One row per (station × fuel type) per scrape, denormalised for time-series analysis.
- Daily UTC file rotation (prices-YYYY-MM-DD.ndjson); files for past UTC days are gzipped automatically (~10× smaller) after each successful scrape.
- Sequential paginated fetch with OAuth2 client-credentials, in-memory token cache + 401/403 retry, configurable scrape interval.
- Distroless Docker image (~9 MB) and a ready-to-go docker-compose.yml.

Prerequisites

- For deployment: Docker + Docker Compose v2.
- For local development: Go 1.24+.
- A client_id / client_secret pair from the Fuel Finder developer portal.

Configuration

Copy the example and fill in your credentials:

cp config.example.yaml config.yaml
$EDITOR config.yaml

config.yaml is gitignored. Field reference:

api:
  base_url: https://www.fuel-finder.service.gov.uk   # production server
  client_id: YOUR_CLIENT_ID
  client_secret: YOUR_CLIENT_SECRET

scrape:
  interval: 30m            # Go duration, e.g. 10m, 30m, 1h
  request_timeout: 60s     # per-HTTP-request timeout

storage:
  dir: ./data              # NDJSON output directory

logging:
  file: ./data/fuel-history.log   # optional; INFO+ tee'd here. Empty disables.
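
The defaults and validation step that pkg/config/ is described as performing could look roughly like the sketch below. The Config struct, its field names and the two method names are hypothetical, and the YAML decoding itself is left out so the example needs only the standard library. The 60s request timeout comes from the commit message; treating 30m and ./data as defaults is an assumption based on the example file.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Config mirrors the fields of config.yaml. The struct layout is hypothetical;
// the real pkg/config may be organised differently.
type Config struct {
	BaseURL        string
	ClientID       string
	ClientSecret   string
	Interval       time.Duration
	RequestTimeout time.Duration
	StorageDir     string
	LogFile        string // empty disables the log file
}

// applyDefaults fills unset optional fields. Only the 60s timeout is a
// documented default; the other two values are assumptions.
func (c *Config) applyDefaults() {
	if c.Interval == 0 {
		c.Interval = 30 * time.Minute
	}
	if c.RequestTimeout == 0 {
		c.RequestTimeout = 60 * time.Second
	}
	if c.StorageDir == "" {
		c.StorageDir = "./data"
	}
}

// validate rejects configurations that cannot produce a working scraper.
func (c *Config) validate() error {
	if c.BaseURL == "" {
		return errors.New("api.base_url is required")
	}
	if c.ClientID == "" || c.ClientSecret == "" {
		return errors.New("api.client_id and api.client_secret are required")
	}
	if c.Interval <= 0 {
		return fmt.Errorf("scrape.interval must be positive, got %s", c.Interval)
	}
	if c.RequestTimeout <= 0 {
		return fmt.Errorf("scrape.request_timeout must be positive, got %s", c.RequestTimeout)
	}
	return nil
}

func main() {
	cfg := Config{
		BaseURL:      "https://www.fuel-finder.service.gov.uk",
		ClientID:     "YOUR_CLIENT_ID",
		ClientSecret: "YOUR_CLIENT_SECRET",
	}
	cfg.applyDefaults()
	if err := cfg.validate(); err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Println(cfg.Interval, cfg.RequestTimeout, cfg.StorageDir)
}
```

Applying defaults before validating keeps the validator simple: it only ever sees fully populated values.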

Deploying with Docker Compose

cp config.example.yaml config.yaml      # then edit credentials
mkdir -p data
docker compose up -d --build

If id -u on your server does not print 1001, change the user: line in docker-compose.yml to your uid so that files in ./data are owned by your host user. The data/ directory itself must also be writable by that uid: sudo chown 1001:1001 data (substituting whichever uid you set).

Operations

docker compose logs -f                         # stream container logs
docker compose ps                              # service status
docker compose run --rm fuel-history --once    # one-shot scrape
docker compose down                            # stop and remove

Container stdout logs are capped at 5 × 10 MB by Docker's json-file driver (see logging: block in docker-compose.yml). The application's own fuel-history.log file is unbounded — it's low-noise (a few lines per scrape) so a year of operation is a few MB.

Running locally (without Docker)

go build -o fuel-history ./cmd/fuel-history
./fuel-history --config config.yaml --once     # one scrape, then exit
./fuel-history --config config.yaml --debug    # continuous, verbose stdout
./fuel-history --help

CLI flags:

- --config PATH: path to the YAML config (default config.yaml).
- --once: perform a single scrape and exit (useful for cron / smoke tests).
- --debug: DEBUG-level logging on stdout. The log file always uses INFO+.
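
The split between a DEBUG-capable stdout and an INFO+ log file can be sketched with a small fan-out slog handler. This is an illustration of the idea only: multiHandler, newLogger and demo are hypothetical names, not the contents of pkg/logging/, and the text handler is an arbitrary choice of output format.

```go
package main

import (
	"bytes"
	"context"
	"errors"
	"fmt"
	"io"
	"log/slog"
)

// multiHandler fans each record out to several slog handlers. Every child
// keeps its own minimum level, so one sink can be noisier than another.
type multiHandler struct {
	handlers []slog.Handler
}

// Enabled reports true if at least one child wants records at this level.
func (m multiHandler) Enabled(ctx context.Context, level slog.Level) bool {
	for _, h := range m.handlers {
		if h.Enabled(ctx, level) {
			return true
		}
	}
	return false
}

// Handle forwards the record only to children whose level admits it.
func (m multiHandler) Handle(ctx context.Context, r slog.Record) error {
	var errs []error
	for _, h := range m.handlers {
		if h.Enabled(ctx, r.Level) {
			errs = append(errs, h.Handle(ctx, r.Clone()))
		}
	}
	return errors.Join(errs...)
}

func (m multiHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	next := make([]slog.Handler, len(m.handlers))
	for i, h := range m.handlers {
		next[i] = h.WithAttrs(attrs)
	}
	return multiHandler{handlers: next}
}

func (m multiHandler) WithGroup(name string) slog.Handler {
	next := make([]slog.Handler, len(m.handlers))
	for i, h := range m.handlers {
		next[i] = h.WithGroup(name)
	}
	return multiHandler{handlers: next}
}

// newLogger wires stdout at INFO (or DEBUG when debug is set) and the file
// sink at INFO regardless of the flag.
func newLogger(stdout, file io.Writer, debug bool) *slog.Logger {
	stdoutLevel := slog.LevelInfo
	if debug {
		stdoutLevel = slog.LevelDebug
	}
	return slog.New(multiHandler{handlers: []slog.Handler{
		slog.NewTextHandler(stdout, &slog.HandlerOptions{Level: stdoutLevel}),
		slog.NewTextHandler(file, &slog.HandlerOptions{Level: slog.LevelInfo}),
	}})
}

// demo logs one DEBUG and one INFO record and counts lines per sink.
func demo() (stdoutLines, fileLines int) {
	var stdout, file bytes.Buffer
	logger := newLogger(&stdout, &file, true)
	logger.Debug("fetching batch", "batch", 1)
	logger.Info("scrape complete", "rows", 42)
	nl := []byte("\n")
	return bytes.Count(stdout.Bytes(), nl), bytes.Count(file.Bytes(), nl)
}

func main() {
	s, f := demo()
	fmt.Printf("stdout lines: %d, file lines: %d\n", s, f)
}
```

With debug enabled, the DEBUG record reaches only the stdout sink while the INFO record reaches both, which is the behaviour the --debug flag description implies.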

Output format

NDJSON, one JSON object per line:

{
  "scrape_time": "2026-05-04T17:20:24.230Z",
  "node_id": "0028acef…",
  "trading_name": "Alex Fuel Station",
  "public_phone_number": "+448003234040",
  "fuel_type": "E10",
  "price": 132.9,
  "price_last_updated": "2026-02-17T16:03:04.938Z",
  "price_change_effective_timestamp": "2026-02-17T16:00:00.000Z"
}

Files in data/:

- prices-YYYY-MM-DD.ndjson: current day, append-only.
- prices-YYYY-MM-DD.ndjson.gz: previous days, gzip-compressed in place.
- fuel-history.log: application log (if logging.file is set).
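
The gzip-on-rollover step can be sketched as follows: compress every past-day file to a .gz.tmp sibling, rename it into place, then delete the original. This is not the project's CompressOlder; the function names are hypothetical, and skipping a file whose .gz already exists is an assumption about what "existing .gz preservation" means.

```go
package main

import (
	"compress/gzip"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"time"
)

// compressOlder gzips every prices-YYYY-MM-DD.ndjson in dir except the one
// for the current UTC day, which is still being appended to.
func compressOlder(dir string, now time.Time) error {
	today := "prices-" + now.UTC().Format("2006-01-02") + ".ndjson"
	matches, err := filepath.Glob(filepath.Join(dir, "prices-*.ndjson"))
	if err != nil {
		return err
	}
	for _, src := range matches {
		if filepath.Base(src) == today {
			continue
		}
		if err := gzipFile(src); err != nil {
			return fmt.Errorf("compress %s: %w", src, err)
		}
	}
	return nil
}

// gzipFile writes src.gz via a temporary file and a rename, so a reader
// never sees a half-written archive. An existing src.gz is left untouched
// (assumed behaviour).
func gzipFile(src string) error {
	dst := src + ".gz"
	if _, err := os.Stat(dst); err == nil {
		return nil
	}
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()

	tmp := dst + ".tmp"
	out, err := os.Create(tmp)
	if err != nil {
		return err
	}
	zw := gzip.NewWriter(out)
	_, err = io.Copy(zw, in)
	if cerr := zw.Close(); err == nil {
		err = cerr
	}
	if cerr := out.Close(); err == nil {
		err = cerr
	}
	if err != nil {
		os.Remove(tmp)
		return err
	}
	if err := os.Rename(tmp, dst); err != nil {
		os.Remove(tmp)
		return err
	}
	return os.Remove(src)
}

// demo creates a "yesterday" and a "today" file in a temp dir, runs the
// compression, and returns the resulting file names in sorted order.
func demo() ([]string, error) {
	dir, err := os.MkdirTemp("", "fuel-history-demo")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	for _, name := range []string{"prices-2026-05-03.ndjson", "prices-2026-05-04.ndjson"} {
		data := []byte("{\"fuel_type\":\"E10\"}\n")
		if err := os.WriteFile(filepath.Join(dir, name), data, 0o644); err != nil {
			return nil, err
		}
	}
	now := time.Date(2026, 5, 4, 12, 0, 0, 0, time.UTC)
	if err := compressOlder(dir, now); err != nil {
		return nil, err
	}
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	var names []string
	for _, e := range entries {
		names = append(names, e.Name())
	}
	return names, nil
}

func main() {
	names, err := demo()
	fmt.Println(names, err)
}
```

The rename is atomic only when the temporary file is on the same filesystem as the target, which holds here because both live in the same directory. A production version might also fsync the temporary file before renaming.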

Analysing the data

# Pretty-print one record
head -1 data/prices-2026-05-04.ndjson | jq .

# Count distinct stations; list distinct fuel types
jq -r .node_id  data/prices-*.ndjson | sort -u | wc -l
jq -r .fuel_type data/prices-*.ndjson | sort -u

# Five cheapest E10 rows across all scrapes in the uncompressed files (the same station can repeat)
jq -c 'select(.fuel_type=="E10")' data/prices-*.ndjson \
  | jq -s 'sort_by(.price)[:5]'

# Read compressed and uncompressed together
zcat -f data/prices-*.ndjson*

# DuckDB SQL across all days at once (handles .gz automatically)
duckdb -c "SELECT fuel_type, COUNT(*), AVG(price)
           FROM read_json_auto('data/prices-*.ndjson*', format='newline_delimited')
           GROUP BY fuel_type"

Project layout

cmd/fuel-history/main.go    flags, signal handling, slog wiring
pkg/config/                 YAML loader + defaults + validation
pkg/fuelapi/                token cache + paginated fetch + 401-retry
pkg/store/                  daily NDJSON sink + gzip-on-rollover
pkg/logging/                slog multi-handler (per-sink levels)
pkg/scraper/                ticker loop + flatten + compress orchestration
config.example.yaml
docker-compose.yml
Dockerfile

Tests

go test ./...
go vet ./...

Notes / limitations

- API quirks worth knowing about:
  - Token endpoint takes a JSON body ({client_id, client_secret}), not the standard OAuth2 form-encoded body, despite what the spec implies.
  - The PFS endpoint returns a bare JSON array (the documented {data: [...]} wrapper is absent).
  - End of pagination is signalled by HTTP 404 with a "Requested batch X is not available" body, not by an empty array.
  - Periodic 504 Gateway Timeout responses with a "Maintenance" HTML page are common; the scraper logs the error and waits for the next tick.
- No retention policy yet: old .gz files accumulate forever. Add a find data -name 'prices-*.ndjson.gz' -mtime +90 -delete cron job if you want a sliding window.
- Rate limit: 429 responses say "try again in 5 minutes". A scrape that hits one is aborted; the next ticker fire retries the whole thing. No sophisticated back-off yet.
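
The pagination quirk (a bare JSON array per batch, with HTTP 404 marking the end) can be sketched as a simple loop. The URL path and the batch query parameter below are placeholders rather than the real API, authentication is omitted, and demo runs against a local httptest server. Any other status, such as 429 or 504, aborts the whole fetch, matching the "wait for the next tick" behaviour described above.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
)

// fetchAll requests batch 1, 2, 3, ... until the server answers 404.
// pageURL and the "batch" parameter name are hypothetical.
func fetchAll(ctx context.Context, client *http.Client, pageURL string) ([]json.RawMessage, error) {
	var all []json.RawMessage
	for batch := 1; ; batch++ {
		u := pageURL + "?batch=" + strconv.Itoa(batch)
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, u, nil)
		if err != nil {
			return nil, err
		}
		resp, err := client.Do(req)
		if err != nil {
			return nil, err
		}
		switch resp.StatusCode {
		case http.StatusOK:
			// The body is a bare array, not a {data: [...]} wrapper.
			var page []json.RawMessage
			err := json.NewDecoder(resp.Body).Decode(&page)
			resp.Body.Close()
			if err != nil {
				return nil, fmt.Errorf("batch %d: %w", batch, err)
			}
			all = append(all, page...)
		case http.StatusNotFound:
			// 404 is the normal end-of-pagination signal.
			resp.Body.Close()
			return all, nil
		default:
			// 429, 504 and anything else abort this scrape.
			resp.Body.Close()
			return nil, fmt.Errorf("batch %d: unexpected status %s", batch, resp.Status)
		}
	}
}

// demo serves two batches followed by a 404 and returns the row count.
func demo() (int, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		switch r.URL.Query().Get("batch") {
		case "1":
			fmt.Fprint(w, `[{"node_id":"a"},{"node_id":"b"}]`)
		case "2":
			fmt.Fprint(w, `[{"node_id":"c"}]`)
		default:
			http.Error(w, "Requested batch 3 is not available", http.StatusNotFound)
		}
	}))
	defer srv.Close()

	rows, err := fetchAll(context.Background(), srv.Client(), srv.URL+"/batches")
	return len(rows), err
}

func main() {
	n, err := demo()
	fmt.Println(n, err)
}
```

Treating 404 as success only at this one call site keeps a genuinely missing endpoint from being silently ignored elsewhere. A stricter version might accept the 404 only after at least one batch has been read.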
