
Data Setup

Data Source

Quant101 uses Polygon.io flat files, which are distributed through an S3-compatible endpoint. The files used here are daily OHLCV aggregates for all US equities.

Directory Structure

polygon_data/
├── lake/           # Parquet files (converted from csv.gz)
├── processed/      # Cached/resampled data
└── raw/            # Original csv.gz + metadata (splits, tickers, indices)
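
The layout above can be created or checked with a few lines of Python. This is a minimal sketch: the `ensure_layout` helper and the choice to pass the root directory as an argument are illustrative assumptions, not part of Quant101. Only the three subdirectory names come from the tree above.

```python
from pathlib import Path

# Subdirectory names taken from the tree above.
SUBDIRS = ("raw", "lake", "processed")


def ensure_layout(root: Path) -> dict[str, Path]:
    """Create the three data directories under `root` if they are missing.

    Hypothetical helper for illustration; returns a name -> path mapping.
    """
    paths = {name: root / name for name in SUBDIRS}
    for path in paths.values():
        path.mkdir(parents=True, exist_ok=True)
    return paths
```

Calling `ensure_layout(Path("polygon_data"))` is idempotent, so it is safe to run before every download.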

Data Acquisition

Download Raw Data

# Download recent Polygon.io flat files
python src/data/fetcher/polygon_downloader.py \
    --asset-class us_stocks_sip \
    --data-type day_aggs_v1 \
    --recent-days 7

# Convert csv.gz → Parquet
python src/data/fetcher/csvgz_to_parquet.py \
    --asset-class us_stocks_sip \
    --data-type day_aggs_v1 \
    --recent-days 7
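
To make the flags concrete, the sketch below lists the object keys that a 7-day window might map to. It assumes the key layout Polygon.io documents for flat files (`<asset-class>/<data-type>/<YYYY>/<MM>/<YYYY-MM-DD>.csv.gz`) and that `--recent-days` counts calendar days ending on a given date. `recent_keys` is a hypothetical helper; the real downloader may select dates differently, and market holidays are not handled here.

```python
from datetime import date, timedelta


def recent_keys(asset_class: str, data_type: str, end: date, recent_days: int) -> list[str]:
    """List candidate flat-file keys for the `recent_days` calendar days ending at `end`.

    Hypothetical helper: weekends are skipped, market holidays are not.
    """
    keys = []
    for offset in range(recent_days - 1, -1, -1):
        day = end - timedelta(days=offset)
        if day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
            continue
        keys.append(f"{asset_class}/{data_type}/{day:%Y}/{day:%m}/{day.isoformat()}.csv.gz")
    return keys


# Example window ending on Friday 2024-03-08 (date chosen for illustration).
for key in recent_keys("us_stocks_sip", "day_aggs_v1", date(2024, 3, 8), 7):
    print(key)
```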

Incremental Update

The update script handles the full pipeline (download, convert, splits, indices):

bash scripts/incremental_update/data_update.sh

Stock Universe

The pipeline uses a named universe registry rather than hardcoded ticker lists:

from data.universe import get_universe

tickers = get_universe("US_LARGE_CAP_50")  # 50 tickers, sector-organized

Available universes:

| Name | Size | Description |
|---|---|---|
| US_LARGE_CAP_50 | 50 | Sector-diversified US mega-caps |
| US_LARGE_CAP_52 | 52 | Extended version |

Register your own:

from data.universe import register_universe

register_universe("MY_TECH_10", ["AAPL", "MSFT", "GOOGL", ...])
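
For readers unfamiliar with the pattern, a named registry can be as small as a dictionary guarded by two functions. The sketch below is illustrative only: it reuses the names `get_universe` and `register_universe` from the snippets above, but the internals (the `_UNIVERSES` dict, the duplicate-name check, upper-casing, returning a copy) are assumptions and may not match `data.universe`.

```python
# Illustrative stand-in for a universe registry; not the project's implementation.
_UNIVERSES: dict[str, list[str]] = {}


def register_universe(name: str, tickers: list[str]) -> None:
    """Store a ticker list under `name`; refuse to overwrite silently."""
    if name in _UNIVERSES:
        raise ValueError(f"universe {name!r} is already registered")
    # Normalize case and drop duplicates while preserving order.
    _UNIVERSES[name] = list(dict.fromkeys(t.upper() for t in tickers))


def get_universe(name: str) -> list[str]:
    """Return a copy so callers cannot mutate the registry."""
    try:
        return list(_UNIVERSES[name])
    except KeyError:
        raise KeyError(f"unknown universe {name!r}; known: {sorted(_UNIVERSES)}") from None


register_universe("MY_TECH_3", ["AAPL", "MSFT", "googl", "AAPL"])
print(get_universe("MY_TECH_3"))
```

Returning a copy keeps one caller's edits from leaking into another's universe, which is the main practical advantage over sharing a module-level list.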

Data Loading

The core loader handles OHLCV data with split adjustment and caching:

from data.loader.data_loader import stock_load_process

ohlcv = stock_load_process(
    tickers=["AAPL", "MSFT"],
    start_date="2024-01-01",
    end_date="2025-01-01",
)
# Returns: pl.LazyFrame with (ticker, timestamps, open, high, low, close, volume)
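
The arithmetic behind split adjustment is simple enough to show in plain Python. The sketch below applies backward adjustment: prices before a split's execution date are divided by the split ratio and volumes are multiplied by it. How `stock_load_process` actually performs the adjustment is not shown in this page, so the function `split_adjust`, its input shapes, and the sample numbers are all assumptions for illustration.

```python
from datetime import date


def split_adjust(bars, splits):
    """Back-adjust bars for splits (hypothetical helper).

    bars:   list of dicts with keys date, open, high, low, close, volume
    splits: list of (execution_date, ratio) pairs, ratio = new shares per old share
    """
    adjusted = []
    for bar in bars:
        # Cumulative ratio of all splits that take effect after this bar.
        factor = 1.0
        for execution_date, ratio in splits:
            if bar["date"] < execution_date:
                factor *= ratio
        row = dict(bar)
        for col in ("open", "high", "low", "close"):
            row[col] = bar[col] / factor
        row["volume"] = bar["volume"] * factor
        adjusted.append(row)
    return adjusted


# Made-up bars around a made-up 10-for-1 split.
bars = [
    {"date": date(2024, 6, 7), "open": 1200.0, "high": 1210.0, "low": 1190.0, "close": 1200.0, "volume": 1_000},
    {"date": date(2024, 6, 10), "open": 120.0, "high": 122.0, "low": 119.0, "close": 121.0, "volume": 10_000},
]
splits = [(date(2024, 6, 10), 10.0)]

for row in split_adjust(bars, splits):
    print(row["date"], row["close"], row["volume"])
```

After adjustment the pre-split close (1200.0) becomes 120.0, so the series no longer shows an artificial 90% drop on the split date.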

Column Naming Convention

| Context | Date Column | Notes |
|---|---|---|
| Raw OHLCV | timestamps | As-is from Polygon |
| Factor signals | date | Renamed during pipeline |
| Returns | date | Standardized |
| Alpha/Risk/Execution | date | Via constants.DATE_COL |

Known Issue

The timestamps → date rename is handled by the pipeline but is not enforced globally. See src/constants.py for the canonical names.
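
One way to guard against this mismatch is to normalize the column name wherever data crosses a module boundary. The sketch below works on plain lists of column names so that it stays library-agnostic. `RAW_DATE_COL`, the value assigned to `DATE_COL`, and the `date_rename_map` helper are assumptions for illustration; the canonical names are whatever src/constants.py defines.

```python
RAW_DATE_COL = "timestamps"  # name used by the raw OHLCV loader
DATE_COL = "date"            # assumed value of constants.DATE_COL


def date_rename_map(columns):
    """Return the rename mapping needed to standardize the date column.

    Raises if both names are present (ambiguous) or neither is (no date column).
    """
    has_raw = RAW_DATE_COL in columns
    has_std = DATE_COL in columns
    if has_raw and has_std:
        raise ValueError(f"both {RAW_DATE_COL!r} and {DATE_COL!r} are present")
    if not has_raw and not has_std:
        raise ValueError("no date column found")
    return {RAW_DATE_COL: DATE_COL} if has_raw else {}


print(date_rename_map(["ticker", "timestamps", "close"]))
print(date_rename_map(["ticker", "date", "close"]))
```

The returned mapping can be handed to a frame's rename method (the Polars `rename` method accepts such a mapping), and an empty mapping means the frame is already standardized.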