Metadata-Version: 2.4
Name: pulka
Version: 0.1.0
Summary: A tiny, fast VisiData-like tabular viewer for Polars.
Author: Pulka developers
License-Expression: MIT
Project-URL: Homepage, https://github.com/pulka-dev/pulka
Project-URL: Documentation, https://github.com/pulka-dev/pulka#readme
Project-URL: Issues, https://github.com/pulka-dev/pulka/issues
Keywords: polars,dataframe,tui,visualization,analytics
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Database
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Utilities
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: polars>=1.16.0
Requires-Dist: prompt-toolkit>=3.0.48
Requires-Dist: rich>=13.7.1
Requires-Dist: pyarrow>=16.0.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: zstandard>=0.22.0
Requires-Dist: xlsxwriter>=3.2.9
Provides-Extra: test
Requires-Dist: pytest>=8.4; extra == "test"
Requires-Dist: pytest-randomly>=4.0.1; extra == "test"
Provides-Extra: dev
Requires-Dist: pytest>=8.4; extra == "dev"
Requires-Dist: pytest-randomly>=4.0.1; extra == "dev"
Requires-Dist: ruff>=0.6.0; extra == "dev"
Requires-Dist: prek>=0.1.0; extra == "dev"
Requires-Dist: import-linter>=2.0; extra == "dev"
Requires-Dist: mypy>=1.11; extra == "dev"
Requires-Dist: build>=1.2.2; extra == "dev"
Requires-Dist: twine>=5.1; extra == "dev"
Dynamic: license-file

# Pulka

A small, vibe-engineered VisiData-like tabular viewer I built for myself. It loads CSV/TSV, Parquet, and Arrow/Feather/IPC files and presents a keyboard-driven table with sorting and filtering. I’m sharing it in case it sparks ideas, but it’s still very much a personal playground rather than a polished product.

## Project status

Pulka is intentionally personal and early-stage. I experiment freely and change APIs whenever it feels right for my own workflows. Please treat the codebase as reference material rather than a supported tool. I’m flattered by the interest, but I’m keeping the scope personal for now, so I’m not accepting contributions, feature requests, or support questions.

## Installation

Pulka is published as a standard Python package mostly so I can install it on my own machines. If you’d still like to poke around, the quickest way to install the CLI is:

```bash
pip install pulka
```

The installation provides the `pulka` console script. Run `pulka --help` to see all available options. Optional extras are available:

- `pulka[test]` – installs the dependencies for the pytest-based test suite.
- `pulka[dev]` – installs the testing extras plus Ruff and the ancillary tooling used during development.

If you prefer [`uv`](https://github.com/astral-sh/uv) for local development, the project ships a `uv.lock` file. Run `uv sync --dev` to install Pulka along with its development dependencies inside an isolated virtual environment.

<img width="330" height="330" alt="IMG_7280" src="https://github.com/user-attachments/assets/a94d127f-b8f6-4e13-8444-c009802e554b" />


## Quick start

- Launch the viewer directly against any supported data source:

  ```bash
  pulka path/to/data.parquet
  ```

- Evaluate a Polars expression without writing an intermediate file (prints the
  result once; add `--tui` to browse interactively):

  ```bash
  pulka --expr "pl.scan_parquet('path/to/data.parquet').select(pl.all().head(5))"
  ```

  Headless output uses Polars' DataFrame formatting, so the preview matches what
  you'd see from `pl.DataFrame`. Add `--tui` to switch back to the interactive
  viewer. Within expressions you can call `df.glimpse()` for a per-column summary,
  reference columns via `c.<name>` (or `pl.col("name")`), and use Polars selectors
  through the `cs` alias.

- Inspect a dataset quickly without launching the TUI:

  ```bash
  pulka data/sample.parquet --schema   # name + dtype table
  pulka data/sample.parquet --glimpse  # column-wise preview via df.glimpse()
  ```

- Generate a comprehensive Parquet file that covers all core Polars dtypes:

  ```bash
  ./generate_all_polars_dtypes_parquet.py
  # writes data/all_polars_dtypes.parquet
  ```

- Run the viewer interactively with `uv` during development:

  ```bash
  uv run pulka data/all_polars_dtypes.parquet
  # or run the module entry point
  uv run python -m pulka data/all_polars_dtypes.parquet
  ```

## Controls

- q: back (or quit if at root)
- arrows/hjkl: move cursor/viewport
- PgUp/PgDn: page up/down
- gg / G: jump to top / bottom
- 0 / $: first / last visible column
- gh / gl: first / last column (horizontal gg/G)
- H / L: slide the current column left/right (reorder columns)
- gH / gL: slide the current column to the first/last position
- _: maximize current column width (toggle)
- g_: maximize all columns' widths (toggle)
- s: sort by current column (toggle asc/desc)
- yy: copy the active cell to the clipboard
- C: column summary sheet (per-column stats)
- i: toggle column insight sidecar (live stats + active cell preview)
- F: frequency table of the current column (value, count, percent)
- T: transpose view (columns as rows with sample data; respects `PULKA_TRANSPOSE_SAMPLE_ROWS`)
- /: search current column (substring, case-insensitive)
- c: search columns by name (tab-complete + history; `n`/`N` cycle matches)
- n / N: next / previous match (row search or column search, depending on context)
- r: reset filters
- Ctrl+R: reload the current dataset from disk
- e: open expression filter prompt (Polars expression using `c.<column>`)
- f: open SQL filter prompt (provide a WHERE clause without the `WHERE` keyword)
- `:`: open command prompt (`goto <column>`, `record on`, ...)
- @: toggle structured flight recorder (writes buffered session log)
- ?: show schema
- enter: in `F` mode, filter by selected value and return to DataFrame view

## Scripted/headless usage

Useful for debugging without a TTY, for tests and CI runs, and for capturing output sequences.

```bash
pulka data.parquet --cmd down --cmd right --cmd s --cmd quit
```

You can also skip the positional path entirely and provide a Polars
expression instead:

```bash
pulka --expr "pl.DataFrame({'a': [1, 2]}).lazy()"
# default: prints a single render to stdout
pulka --expr "df.describe()" data.parquet --tui  # reference the scanned dataset via `df`
```
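Evaluating an `--expr` string boils down to `eval` in a controlled namespace. A minimal stdlib sketch of that general pattern (not Pulka's actual implementation; in Pulka the namespace would hold `pl`, `df`, `c`, and `cs`):

```python
def eval_expression(expr: str, namespace: dict):
    # Builtins are withheld so the expression only sees the names we provide.
    # Note: this is a convenience for scoping, not a security sandbox.
    return eval(expr, {"__builtins__": {}}, namespace)

# Demonstrated with plain Python objects instead of Polars frames.
result = eval_expression(
    "sorted(values)[:3]",
    {"values": [5, 1, 4, 2], "sorted": sorted},
)
print(result)  # [1, 2, 4]
```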

- From a script file (one command per line):

  ```bash
  pulka data.parquet --script commands.txt
  ```

Supported commands:
- down [n], up [n], left [n], right [n]
- pagedown, pageup, top, bottom, first, last
- gh, gl: navigate to first/last column overall (adjusts viewport)
- H, L, gH, gL: slide the current column left/right or to the extremes
- _, maxcol: toggle maximize current column
- g_, maxall: toggle maximize all columns
- sort, filter <expr>, filter_eq <col> <value>, reset, goto <col>, render, quit
- sql_filter <where>: apply an SQL WHERE clause (omit the `WHERE` keyword)
- schema: show column schema information
- freq [col]: frequency table of current or specified column
- transpose [rows]: transpose view with optional row count
- insight [on|off]: toggle the column insight sidecar (TUI only)
- center: center current row in viewport
- search <term>: search current column for substring
- next_diff, prev_diff: navigate to next/previous search match
- hide, unhide: hide current column or unhide all columns
- pal [id]: switch highlight palette (`:pal` to list available presets)
- undo, redo: undo/redo the last transformation
- next_different, prev_different: navigate to next/previous different value
- yank_cell (yy): copy the active cell to the clipboard

Filter expressions use the helper namespace `c` to refer to columns (`c.tripduration > 1200`, `c.name.str.contains('NY', literal=True)`). Any Polars Expr helpers are available via `pl`/`lit`.
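The `c` helper can be pictured as a tiny attribute-to-column bridge. A hypothetical stdlib-only sketch of the idea (Pulka's real `c` resolves names to Polars expressions via something like `pl.col`):

```python
class ColumnNamespace:
    """Map attribute and item access to column lookups (illustrative only)."""

    def __init__(self, resolve):
        self._resolve = resolve  # callable: column name -> column object

    def __getattr__(self, name):
        return self._resolve(name)

    def __getitem__(self, name):  # supports c["name with spaces"]
        return self._resolve(name)

# With Polars the resolver would be pl.col; here we resolve to plain strings.
c = ColumnNamespace(lambda name: f"col({name})")
print(c.tripduration)         # col(tripduration)
print(c["name with spaces"])  # col(name with spaces)
```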

## Debugging workflows

- Force a tiny viewport to study repaints/highlights clearly:

  ```bash
  pulka data.parquet --viewport-rows 4 --viewport-cols 4
  ```

- Scripted navigation with explicit renders between steps:

  ```bash
  pulka data.parquet --cmd render --cmd down --cmd render --cmd quit
  ```

- Use the included generator to cover edge cases across types:

  ```bash
  ./generate_all_polars_dtypes_parquet.py --rows 128 --seed 123
  pulka data/all_polars_dtypes.parquet
  # or run the module entry point
  python -m pulka data/all_polars_dtypes.parquet --viewport-rows 6 --viewport-cols 6
  ```

- Recording is disabled by default. Enable it from the CLI with `--record` or toggle inside the
  TUI. Logs are streamed to `~/.pulka/sessions/` (JSONL, compressed with zstd when available).
  Dataset paths are redacted by default (replaced with basename + SHA1 digest) so logs are safe
  to share; the original paths are stored under `_raw_path` for internal use.

  - Enable recording with `pulka data.parquet --record` or press `@` during a session.
  - Change the destination with `--record-dir /path/to/sessions`.
  - While recording, Pulka emits `perf` events capturing render/status durations (TUI, headless, and API paths) so slow commands can be identified post-run.

  Headless runs respect the same options; add `--record` to persist logs for scripted sessions.
  
  **Cell redaction**: By default, cell values containing strings are hashed and replaced with `{hash, length}` dictionaries in the flight recorder logs to protect sensitive data. You can select other modes using the `--cell-redaction` flag or the `PULKA_RECORDER_CELL_REDACTION` environment variable:
  
  - `none`: No redaction applied to cell values (default when recording is disabled).
  - `hash_strings`: Hash string values and replace with `{hash, length}` (default when recording is enabled).
  - `mask_patterns`: Replace sensitive patterns (emails, IBANs, phones) with `***`.
  
  Example usage: `pulka data.parquet --cell-redaction mask_patterns` or `PULKA_RECORDER_CELL_REDACTION=hash_strings pulka data.parquet`.
  
  Note: `_raw_path` values remain for internal use and are not exported in shared logs.
  
  **Repro exports**: Export reproducible dataset slices for debugging with the `repro_export` command. The exported Parquet files contain the currently visible rows/columns plus a 10-row margin (configurable), and respect the active redaction policy. Files are saved in the session directory as `<session_id>-repro.parquet`. Trigger via:
  
  - Interactive mode: `:repro_export` or `:repro` command
  - Headless mode: `pulka data.parquet --repro-export` flag
  - Command: `pulka data.parquet --cmd repro_export --cmd quit`
  
  The export respects your current viewport and column visibility settings (use `all_columns=true` to export all columns).
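The redaction behaviour described above (basename plus SHA1 digest for paths, `{hash, length}` for string cells in `hash_strings` mode) can be sketched with the stdlib. The function names here are illustrative, not Pulka's API:

```python
import hashlib
from pathlib import Path

def redact_path(path: str) -> str:
    """Replace a dataset path with basename + a short SHA1 digest (sketch)."""
    digest = hashlib.sha1(path.encode("utf-8")).hexdigest()[:12]
    return f"{Path(path).name}#{digest}"

def redact_cell(value):
    """hash_strings mode: strings become {hash, length}; other values pass through."""
    if isinstance(value, str):
        return {
            "hash": hashlib.sha1(value.encode("utf-8")).hexdigest()[:12],
            "length": len(value),
        }
    return value

print(redact_path("/home/alice/secret/data.parquet"))  # data.parquet#<digest>
print(redact_cell("Jane Doe"))  # {'hash': ..., 'length': 8}
print(redact_cell(42))          # 42
```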

## Flight Recorder & Debugging

Pulka’s structured flight recorder captures rich runtime telemetry—key events, perf timings,
viewer snapshots, and rendered frames—to make tricky bugs reproducible.

- **Toggle in the TUI**: Press `@` to enable or disable the recorder for the current session. When
  stopping, Pulka saves the buffered log to `~/.pulka/sessions/` and copies the full path to your
  clipboard when available.
- **Headless & API support**: Pass `--record` on the CLI or attach a `Recorder` in code to capture
  the same telemetry outside the TUI.
- **Artifacts**: Recorder files are UTF-8 JSONL (`*.pulka.jsonl`), optionally compressed with zstd.
  They include structured events (`command`, `key`, `state`, `frame`, `perf`, …) and respect cell
  redaction policies.

You can post-process these logs with your own tooling or scripts (see `PROFILING.md` for examples)
to analyse performance and reproduce user journeys.
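Since recorder files are plain JSONL, post-processing needs nothing beyond the stdlib. A sketch that filters out `perf` events; the field names inside the sample events are assumptions based on the event types listed above:

```python
import io
import json

def iter_events(stream, event_type=None):
    """Yield parsed events from a JSONL recorder stream, optionally filtered by type."""
    for line in stream:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event_type is None or event.get("type") == event_type:
            yield event

# Example with an in-memory log; real logs live under ~/.pulka/sessions/
# (decompress with zstandard first if the file ends in .zst).
log = io.StringIO(
    '{"type": "key", "key": "j"}\n'
    '{"type": "perf", "op": "render", "ms": 3.2}\n'
)
perf = list(iter_events(log, event_type="perf"))
print(perf)  # [{'type': 'perf', 'op': 'render', 'ms': 3.2}]
```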


## Benchmarks

- Run the microbenchmarks against the default fixture:

  ```bash
  uv run python benchmarks/bench_pulka.py --mode micro --iterations 5
  ```

  The pre-commit hooks call `benchmarks/check_microbench.py` to ensure the
  navigation microbenchmarks stay within budget. Update the baseline when
  intentional performance work lands:

  ```bash
  uv run python benchmarks/check_microbench.py --update-baseline
  ```

- Point the benchmark to another dataset or change the sample count via `--path` and `--iterations`.

- Measure fast vertical scrolling with the synthetic mini-nav fixture:

  ```bash
  uv run python benchmarks/bench_pulka.py --mode vscroll --iterations 10
  ```

  Use `--path` to benchmark a specific dataset or adjust `--vscroll-steps`,
  `--vscroll-rows`, and `--vscroll-cols` to mimic different scroll workloads.

- Need a larger real-world dataset? Download one month of NYC Citi Bike trips (CSV) and convert to Parquet:

  ```bash
  mkdir -p data/fixtures/nyc_citibike
  curl -L 'https://s3.amazonaws.com/tripdata/202401-citibike-tripdata.zip' -o data/fixtures/nyc_citibike/202401-citibike-tripdata.zip
  unzip -d data/fixtures/nyc_citibike data/fixtures/nyc_citibike/202401-citibike-tripdata.zip
  uv run --with polars python - <<'PY'
  import polars as pl
  from pathlib import Path

  root = Path('data/fixtures/nyc_citibike')
  parts = sorted(root.glob('202401-citibike-tripdata_*.csv'))
  schema_overrides = {'start_station_id': pl.Utf8, 'end_station_id': pl.Utf8}
  lf = pl.concat([
      pl.scan_csv(p, infer_schema_length=10000, schema_overrides=schema_overrides) for p in parts
  ])
  lf.sink_parquet(root / '202401-citibike-tripdata.parquet')
  PY
  ```

  All generated files live under `data/fixtures/` (ignored by git) so you can keep large fixtures locally without polluting commits.

### Synthetic data presets

- Materialise any spec or capsule via the CLI:

  ```bash
  uv run pulka generate '549r/sol=sequence();value=normal(0,1)' --out data/mars.parquet
  ```

- Save frequently used specs under `~/.config/pulka/generate_presets.toml` (or override the path with `PULKA_GENERATE_PRESET_FILE`). Example:

  ```toml
  [presets]
  themartian = '549r/sol=sequence()!;earth_datetime=@(...);storm_alert=@(...)'
  mini_nav = '200r/id=sequence();value=normal(0,1)'
  ```

- Generate from a preset or inspect what is available:

  ```bash
  uv run pulka generate --preset themartian --out data/the-martian.parquet
  uv run pulka generate --preset hailmary --out data/hailmary.parquet
  uv run pulka generate --list-presets
  ```

  Pulka ships these presets out of the box. To customize or add new ones, edit
  `~/.config/pulka/generate_presets.toml` (create it if missing) or copy the sample from
  `docs/generate_presets.example.toml` as a starting point.

## Notes

- The viewer operates on the engine's physical plan (backed by Polars today), applying filter/sort lazily and fetching only the visible slice per render for performance.
- Filtering uses Polars expressions: refer to columns with `c.<name>` (or `c["name with spaces"]`) and combine with any `polars.Expr` helpers.
- Use the `PULKA_TRANSPOSE_SAMPLE_ROWS` environment variable (legacy `PD_TRANSPOSE_SAMPLE_ROWS` is still recognised) or the `transpose [rows]` command to control how many rows are sampled for transpose mode.
- Rendering uses a prompt_toolkit-native table control by default for smoother scrolling and fewer ANSI redraw artifacts. Set `PULKA_PTK_TABLE=0` (or `PD_PTK_TABLE=0`) to fall back to the Rich-based renderer that still powers headless exports. If you see flicker on very small terminals, try `--viewport-rows 4 --viewport-cols 4` to debug.
- You can run both scripts directly thanks to the `uv` shebangs; no manual environment setup required.
- Colours/styles are configurable via `pulka-theme.toml` (or `PULKA_THEME_PATH`; `PD_THEME_PATH` is kept for compatibility). Copy the default file and tweak the Rich style strings as needed.
- Background job concurrency defaults to `min(4, cpu_count)` threads. Set `PULKA_JOB_WORKERS=<n>` or add a `[jobs]` table with `max_workers = <n>` to your `pulka.toml` when you want shared runtimes to fan out over more worker threads.
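
The concurrency default in the last note is simple to reproduce. A sketch of resolving it with the environment override (the variable name comes from the note; the function itself is illustrative):

```python
import os

def resolve_job_workers(env=os.environ) -> int:
    """PULKA_JOB_WORKERS wins if set; otherwise min(4, cpu_count)."""
    override = env.get("PULKA_JOB_WORKERS")
    if override:
        return max(1, int(override))
    return min(4, os.cpu_count() or 1)

print(resolve_job_workers({"PULKA_JOB_WORKERS": "8"}))  # 8
print(resolve_job_workers({}))  # min(4, cpu_count), e.g. 4 on most machines
```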

## Development

- Install the project (and optional test dependencies) locally:

  ```bash
  uv pip install -e ".[test]"
  ```

- Run the full test suite with uv:

  ```bash
  uv run pytest
  ```

  Add extra pytest arguments after `pytest` as needed.

### Essential Tools & Commands

```bash
# Run the application
pulka data/file.parquet

# Run tests
uv run python -m pytest

# Run specific test
uv run python -m pytest tests/test_specific.py::TestSpecific::test_name

# Install in development mode
uv pip install -e .

# Clear Python cache
rm -rf src/pulka/__pycache__ src/__pycache__
```

### Debugging Tips

1. **Terminal width issues**: Use `COLUMNS=80` environment variable to simulate different terminal widths
2. **Status bar debugging**: The status bar has responsive layouts - check both wide and narrow terminals
3. **Data type simplification**: Complex types (List, Array, Struct, etc.) are simplified to single words
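
Point 3 above, reducing complex dtype names to single words, amounts to stripping parameters from the dtype's textual form. A hypothetical sketch (not the exact logic in `render_status_line()`):

```python
def simplify_dtype(dtype_str: str) -> str:
    """Reduce a parameterised dtype like 'List(Int64)' to its head word."""
    return dtype_str.split("(", 1)[0].split("[", 1)[0]

for raw in ["List(Int64)", "Struct({'a': Int64})", "Datetime(time_unit='us')", "Int64"]:
    print(simplify_dtype(raw))
# prints: List, Struct, Datetime, Int64 (one per line)
```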

### Writing Tests

1. **Test structure**: Follow existing patterns in `tests/` directory
2. **Status bar tests**: Use `capsys` fixture to capture output and verify status bar content
3. **Data type tests**: Test with `all_polars_dtypes.parquet` which contains all major data types

### Key Implementation Details

1. **Status bar format**: `filename • row n / col name[type] • status_message         total_rows • memory`
2. **Data type simplification**: Happens in `render_status_line()` function in `src/pulka/__init__.py`
3. **Responsive design**: Automatically switches between full and simplified layouts based on terminal width
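
The responsive behaviour in point 3 can be sketched as: build the full line, and fall back to a simplified layout when it exceeds the terminal width. The layout strings below are assumptions, not Pulka's exact format:

```python
def render_status(filename, row, col, total_rows, width):
    """Pick the full layout if it fits, otherwise a narrow fallback."""
    full = f"{filename} • row {row} / {col} • {total_rows} rows"
    narrow = f"{filename} • {row}/{total_rows}"
    line = full if len(full) <= width else narrow
    # Hard-truncate and pad so the bar always spans the full width.
    return line[:width].ljust(width)

print(repr(render_status("trips.parquet", 12, "duration[i64]", 5000, 60)))
print(repr(render_status("trips.parquet", 12, "duration[i64]", 5000, 24)))
```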

### Common Development Tasks

1. **Add new data type simplification**: Modify the dtype simplification logic in `render_status_line()`
2. **Modify status bar layout**: Adjust the string formatting in `render_status_line()`
3. **Add new status messages**: Set `viewer.status_message` in relevant functions

### Useful Test Files

- `data/all_polars_dtypes.parquet`: Contains all major Polars data types for testing
- `tests/test_dtypes.py`: Tests for data type handling
- `tests/test_viewer.py`: Tests for status bar and viewer functionality

## Architecture

Pulka follows a modular architecture with clear separation of concerns:

- **Data Layer** (`src/pulka/data/`): Handles dataset scanning, filter compilation, and query building
  - `scan.py`: File format detection and Polars LazyFrame creation
  - `filter_lang.py`: AST validation and Polars expression compilation
  - `query.py`: Query plan construction utilities

- **Core Layer** (`src/pulka/core/`): Centralized state management and interfaces
  - `sheet.py`: Sheet protocol defining the interface for tabular data views
  - `viewer.py`: Viewport and cursor state management
  - `formatting.py`: Data type-aware formatting helpers
  - `jobs.py`: Background job management (for summary statistics)

- **Sheet Layer** (`src/pulka/sheets/`): First-class sheet implementations
  - `data_sheet.py`: Primary data view with filters/sorting
  - `freq_sheet.py`: Frequency tables showing value counts
  - `summary_sheet.py`: Column statistics summary
  - `transpose_sheet.py`: Transposed view with columns as rows

- **Command Layer** (`src/pulka/command/`): Unified command system
  - `registry.py`: Command registration and execution
  - `builtins.py`: Standard command handlers

- **Render Layer** (`src/pulka/render/`): Pure rendering functions
  - `table.py`: Table rendering with highlighting
  - `status_bar.py`: Status bar layout and truncation logic

- **TUI Layer** (`src/pulka/tui/`): Terminal UI implementation
  - `app.py`: Main application integration
  - `screen.py`: Screen state and modal management
  - `keymap.py`: Key binding definitions
  - `modals.py`: Dialog and modal implementations

- **Debug Layer** (`src/pulka/debug/`): Debugging and replay tools
  - `replay.py`: TUI replay tool for reproducing recorded sessions
  - `replay_cli.py`: Command line interface for replay functionality

- **API Layer** (`src/pulka/api/`): Public embeddable interface
  - `session.py`: Main `Session` class for programmatic access
  - `__init__.py`: Re-exported public API
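
The command layer's register-then-dispatch shape is a common pattern. A stdlib-only sketch of the idea, not Pulka's actual `registry.py`:

```python
class CommandRegistry:
    """Minimal register/invoke registry, illustrating the pattern only."""

    def __init__(self):
        self._commands = {}

    def register(self, name):
        def decorator(fn):
            self._commands[name] = fn
            return fn
        return decorator

    def invoke(self, name, *args):
        try:
            handler = self._commands[name]
        except KeyError:
            raise ValueError(f"unknown command: {name}") from None
        return handler(*args)

registry = CommandRegistry()

@registry.register("down")
def move_down(rows=1):
    return f"moved down {rows}"

print(registry.invoke("down", 3))  # moved down 3
```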

## Embedding via pulka.api

Pulka provides a clean API for embedding in other applications, enabling automated analysis, tests, and integration into other tools without requiring TUI dependencies:

```python
from pulka.api import Runtime, Session, open

# Construct a runtime once per process to load config + plugins
runtime = Runtime()

# Open a dataset with a runtime-managed session
session = runtime.open("data.parquet")

# Or fall back to the legacy helpers when you don't need to reuse the runtime
session = open("data.parquet")
session = Session("data.parquet", viewport_rows=10, viewport_cols=5)

# Runtime metadata is available without opening a session
print(runtime.loaded_plugins)

# Access the shared JobRunner to schedule background work in custom integrations
runner = runtime.job_runner

# Run script commands programmatically
outputs = session.run_script(["down", "right", "sort", "render"])

# Or drive individual commands via the session runtime
runtime = session.command_runtime
result = runtime.invoke("down", source="docs")
if result.message:
    print(result.message)
if result.render.should_render:
    table_after_down = session.render()

# Render current view
table_output = session.render()

# Render without status bar
table_only = session.render(include_status=False)

# Open derived sheet views via the registry
freq_viewer = session.open_sheet_view(
    "freq",
    base_viewer=session.viewer,
    column_name="category",
    viewer_options={"source_path": None},
)
transpose_viewer = session.open_sheet_view(
    "transpose",
    base_viewer=freq_viewer,
)
```

The API exposes:
- `Runtime` for shared configuration, registries, and plugin metadata
- `Session` class for managing a data view session
- `Session.open_sheet_view()` for constructing derived sheet viewers (frequency, histogram, transpose, plugins)
- Derived sheet constructors must accept the runtime-managed `JobRunner` via the `runner` keyword
- `open()` convenience function
- `run_script()` for executing command sequences
- `command_runtime` for fine-grained command dispatch and recorder integration
- `render()` for getting current view as text
- Sheet properties via `session.sheet`
- Viewer state via `session.viewer`

## Development Tooling

### Quick Start

1. **Install dependencies:**
   ```bash
   uv sync --dev
   ```

2. **Run all quality checks:**
   ```bash
   uv run python -m pulka.dev check
   ```

3. **Auto-fix common issues:**
   ```bash
   uv run python -m pulka.dev fix
   ```

### Development Commands

- **`uv run python -m pulka.dev lint`** - Run Ruff linter
- **`uv run python -m pulka.dev format`** - Format code with Ruff
- **`uv run python -m pulka.dev lint-imports`** - Check static import layering contracts
- **`uv run python -m pulka.dev test`** - Run all tests
- **`uv run python -m pulka.dev check`** - Run all quality checks (lint + format + import contracts + tests)
- **`uv run python -m pulka.dev fix`** - Auto-fix issues and run tests

See [docs/architecture_guardrails/README.md](docs/architecture_guardrails/README.md) for more background on the
import contracts and how to interpret failures.

### Pre-commit Hooks

Pre-commit hooks using `prek` automatically run:
- `uv run ruff check .`
- `uv run python -m pulka_fixtures check`
- `uv run python -m pulka.testing.runners smoke`
- `uv run python benchmarks/check_microbench.py`
- `uv run pytest tests/test_determinism_canary.py -v`

### Development Workflow

```bash
# Make changes
vim src/pulka/...

# Check for issues
uv run python -m pulka.dev check

# Auto-fix what you can
uv run python -m pulka.dev fix

# Commit (hooks run automatically)
git commit -m "Your changes"
```

Code is formatted with Ruff (100 character line length) and follows modern Python 3.12+ conventions.

## License

Pulka is available under the [MIT License](LICENSE).
