Database Overview
Core Tables
| Table |
Records |
Purpose |
| in_network_services |
Loading... |
Medical services/procedures with quality scores |
| provider_groups |
Loading... |
Provider organizations (enriched with NPPES) |
| negotiated_rate_provider_groups |
Loading... |
Junction table: Service Provider links |
| nppes_providers |
Loading... |
National provider lookup (taxonomy enrichment) |
| negotiated_prices |
Loading... |
Actual dollar amounts for services |
Performance Note: With many provider groups, every query MUST use WHERE filters and LIMIT clauses to prevent memory exhaustion!
Complete Table List (19 Tables)
in_network_services
negotiated_rates
negotiated_prices
provider_groups
negotiated_rate_provider_groups
nppes_providers
reporting_entities
plans
allowed_amounts
allowed_amount_providers
provider_facilities
providers
services
tax_identifiers
service_facility_link
enrichment_log
file_processing_log
rate_pattern_flags
npi_registry_cache
Entity Relationship Diagram
Complete visual representation of all table relationships
┌─────────────────────────────────────────────────────────────────────────────────────┐
│ INSURANCE PAYER HIERARCHY │
└─────────────────────────────────────────────────────────────────────────────────────┘
┌──────────────────────────┐
│ reporting_entities │ Insurance Company (BCBS Michigan)
│ ────────────────────── │
│ • id (PK) │
│ • name │
│ • entity_type │
└────────────┬─────────────┘
│ 1:N
┌──────────────────────────┐
│ plans │ Insurance Plans (HMO, PPO, etc.)
│ ────────────────────── │
│ • id (PK) │
│ • reporting_entity_id │ (FK)
│ • plan_name │
└────────────┬─────────────┘
│ 1:N
┌─────────────────────────────────────────────────────────────────────────────────────┐
│ CORE SERVICE & PRICING DATA │
└─────────────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────┐
│ in_network_services │ Medical Services
│ ─────────────────────────────────────────── │
│ • id (PK) 418,652 │
│ • billing_code (CPT/HCPCS) │
│ • name │
│ • plausibility_score │
│ • plausibility_flags │
└───────────────────────┬─────────────────────────┘
│ 1:N
┌─────────────────────────────────────────────────┐
│ negotiated_rates 418,652 │
└───────────────┬──────────────────┬──────────────┘
│ 1:N │ 1:N
┌──────────────────────────┐ ┌────────────────────────────────────┐
│ negotiated_prices │ │ negotiated_rate_provider_groups │
│ • negotiated_rate_value │ │ 34,011,942│
└──────────────────────────┘ └───────────────┬────────────────────┘
│ N:1
┌─────────────────────────────────────────────────────────────────────────────────────┐
│ PROVIDER INFORMATION │
└─────────────────────────────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────┐
│ provider_groups 34,011,942│ Providers
│ • tin_value, npi_list │
│ • primary_taxonomy_code │
│ • primary_specialization │
│ • city, state, zip_code │
└────────────────────────────────────────────┘
│ Enrichment (JOIN on NPI)
│
┌────────────────────────────────────────────┐
│ nppes_providers 9,236,343 │
│ • npi (PK) │
│ • provider_taxonomy_code │
│ • provider_specialization │
└────────────────────────────────────────────┘
Key Insight: The junction table negotiated_rate_provider_groups with 34M records connects services to providers, creating a many-to-many relationship. This is why queries must be carefully optimized!
Index Strategy
in_network_services (418K records)
-
idx_in_network_services_billing_code (btree) Fast CPT/HCPCS lookup
-
idx_in_network_services_plausibility (btree) Quality filtering
-
idx_in_network_services_description_gin (GIN) Full-text search
provider_groups (34M records!) CRITICAL
Heavily indexed for performance on massive dataset:
-
idx_provider_groups_tin_value (btree) Provider lookup by Tax ID
-
idx_provider_groups_city (btree) Geographic queries
-
idx_provider_groups_state (btree) State filtering
-
idx_provider_groups_taxonomy_code (btree) Specialty filtering
-
idx_provider_groups_hospital_taxonomy (partial WHERE taxonomy ~~ '282%') Hospital-specific
negotiated_prices (418K records)
-
idx_negotiated_prices_rate_value (btree) Price filtering/sorting
-
idx_negotiated_prices_negotiated_type (btree) Rate type filtering
Critical Rule: With 34M records in provider_groups, EVERY query MUST use an indexed WHERE clause or the database will run out of memory!
Data Pipeline Flow
Phase 1: Data Ingestion
[CMS Transparency Files] src/parsers/cms_parser.py
Extract JSON Data
├─ reporting_entities (Insurance company)
├─ plans (Insurance plans)
├─ in_network_services (418K services)
├─ negotiated_rates (Rate containers)
├─ negotiated_prices (Dollar amounts)
└─ provider_groups (34M records created)
Phase 2: NPPES Enrichment
[NPPES Data File - 10.29GB CSV] src/scripts/load_nppes_data.py
Load 9,236,343 providers into nppes_providers table
src/scripts/enrich_provider_groups.py
Match provider_groups.npi_list nppes_providers.npi
├─ Copy primary_taxonomy_code
├─ Copy primary_taxonomy_name
├─ Copy primary_specialization
├─ Copy primary_classification
└─ Set taxonomy_enriched = true
Result: 34M provider groups enriched with specialty data
Phase 3: Quality Scoring
src/scripts/score_plausibility.py + taxonomy_validator.py
For each service (418K):
├─ Get provider taxonomies
├─ Get service category (CPT code range)
├─ Check specialty match
└─ Calculate plausibility_score (0-100)
Distribution: 37.54% Good | 22.64% Fair | 39.58% Questionable
Phase 4: Elasticsearch Indexing
src/elasticsearch/indexer.py
Aggregate by billing_code:
├─ MIN(rate), MAX(rate), AVG(rate)
├─ COUNT(providers)
└─ Nested provider/pricing arrays
Result: 3,869 unique billing codes indexed Kibana dashboards
Quality Scoring System
Plausibility Score (0-100)
Validates whether provider specialty matches service type
Score Ranges
90-100: Excellent
Perfect specialty match (e.g., Cardiologist doing heart surgery)
75-89: Good
Compatible provider-service match
50-74: Fair
Neutral/generic codes (DME, equipment)
25-49: Questionable
Specialty mismatch (e.g., Dentist doing orthopedic surgery)
Actual Distribution (418,652 services)
39.58%
Questionable (165,688)
Validation Logic
CPT Code Ranges:
• Surgery: 10000-69999
• Radiology: 70000-79999
• Lab: 80000-89999
• Medicine: 90000-99999
Scoring Algorithm:
1. Base score: 50 (neutral)
2. If provider classification matches service: +25
3. If specialty matches (e.g., Imaging for MRI): +25
4. Special rules (hospitals can do anything): 85
5. Max score: 100
Why This Matters: Quality scores help identify potentially questionable provider-service combinations, improving data reliability for price comparisons and network design.
Real-World Query Examples
1 Find Cheapest Provider for Service
SELECT
COALESCE(npp.provider_organization_name, npp.provider_name) as provider,
pg.city, pg.state,
pg.primary_specialization,
np.negotiated_rate_value as rate,
ins.plausibility_score
FROM in_network_services ins
JOIN negotiated_rates nr ON ins.id = nr.service_id
JOIN negotiated_prices np ON nr.id = np.negotiated_rate_id
JOIN negotiated_rate_provider_groups nrpg ON nr.id = nrpg.negotiated_rate_id
JOIN provider_groups pg ON nrpg.provider_group_id = pg.id
LEFT JOIN nppes_providers npp ON SPLIT_PART(pg.npi_list, ',', 1) = npp.npi
WHERE ins.billing_code = '70543' -- MRI Brain
AND pg.state = 'MI'
AND ins.plausibility_score >= 75 -- Only quality matches
ORDER BY np.negotiated_rate_value ASC
LIMIT 10;
Indexes Used: billing_code, state, plausibility_score
2 Provider Cost Efficiency Score
WITH market_avg AS (
SELECT AVG(np.negotiated_rate_value) as avg_rate
FROM in_network_services ins
JOIN negotiated_rates nr ON ins.id = nr.service_id
JOIN negotiated_prices np ON nr.id = np.negotiated_rate_id
WHERE ins.billing_code = '99212'
)
SELECT
pg.tin_value,
COALESCE(npp.provider_organization_name, 'Unknown') as provider_name,
pg.city,
AVG(np.negotiated_rate_value) as provider_avg,
ma.avg_rate as market_avg,
ROUND((AVG(np.negotiated_rate_value) - ma.avg_rate) / ma.avg_rate * 100, 1) as pct_vs_market
FROM provider_groups pg
LEFT JOIN nppes_providers npp ON SPLIT_PART(pg.npi_list, ',', 1) = npp.npi
JOIN negotiated_rate_provider_groups nrpg ON pg.id = nrpg.provider_group_id
JOIN negotiated_rates nr ON nrpg.negotiated_rate_id = nr.id
JOIN negotiated_prices np ON nr.id = np.negotiated_rate_id
JOIN in_network_services ins ON nr.service_id = ins.id
CROSS JOIN market_avg ma
WHERE ins.billing_code = '99212'
AND pg.state = 'MI'
GROUP BY pg.tin_value, npp.provider_organization_name, pg.city, ma.avg_rate
ORDER BY pct_vs_market ASC
LIMIT 20;
3 Geographic Price Variation
SELECT
pg.city,
pg.state,
COUNT(DISTINCT pg.id) as provider_count,
MIN(np.negotiated_rate_value) as min_rate,
AVG(np.negotiated_rate_value) as avg_rate,
MAX(np.negotiated_rate_value) as max_rate,
MAX(np.negotiated_rate_value) - MIN(np.negotiated_rate_value) as price_spread
FROM in_network_services ins
JOIN negotiated_rates nr ON ins.id = nr.service_id
JOIN negotiated_prices np ON nr.id = np.negotiated_rate_id
JOIN negotiated_rate_provider_groups nrpg ON nr.id = nrpg.negotiated_rate_id
JOIN provider_groups pg ON nrpg.provider_group_id = pg.id
WHERE ins.billing_code = '70543'
AND pg.state = 'MI'
AND pg.city IS NOT NULL
GROUP BY pg.city, pg.state
HAVING COUNT(DISTINCT pg.id) >= 3
ORDER BY price_spread DESC;
Pro Tip: These queries all use indexed columns (billing_code, state, city) and include LIMIT clauses to ensure good performance even with 34M records!
Key Design Decisions
Why Junction Table?
Without Junction Table (Bad):
negotiated_rates.provider_list = "[group1, group2, ...]"
- Can't query efficiently
- Can't enforce foreign keys
- Can't index properly
With Junction Table (Good):
negotiated_rate_provider_groups:
rate_id | provider_group_id
--------|------------------
1 | 12345
1 | 67890
- Fast lookups in both directions
- Referential integrity maintained
- Proper indexing on both FKs
Why Separate negotiated_prices?
Reason: One rate can have multiple price variations
Rate ID 1 (CPT 99212 - Office Visit):
├─ Price 1: $75 (professional, office)
├─ Price 2: $120 (facility, hospital)
└─ Price 3: $90 (professional, telehealth)
Supports: Different billing classes, settings, and modifiers for the same service
Why Denormalize Taxonomy?
Alternative (Normalized):
provider_groups.taxonomy_id taxonomy_reference_table.id
Current Design (Denormalized):
provider_groups:
• primary_taxonomy_code
• primary_taxonomy_name
• primary_specialization
• primary_classification
Advantages:
- Faster queries (no extra JOIN)
- Direct filtering on specialization
- Better index performance
- Simpler application logic
Trade-off:
- Data duplication (acceptable for analytics)
- Read performance >> Storage cost
Design Philosophy
All design decisions optimize for read performance and query simplicity in an analytics workload. With 34M+ provider records, every decision prioritizes fast filtering and aggregation over storage efficiency.
Data Volume Summary
Comprehensive overview of record counts across all core tables.
| Table Name |
Record Count |
Description |
Index Strategy |
| provider_groups |
Loading... |
Healthcare organizations (TIN-based) |
13+ indexes (tin, taxonomy, city, state, zip) |
| negotiated_rate_provider_groups |
Loading... |
Junction table (rates providers) |
2 FK indexes (rate_id, provider_group_id) |
| nppes_providers |
Loading... |
NPPES registry (NPI-based) |
NPI primary key + taxonomy indexes |
| in_network_services |
Loading... |
CPT/HCPCS services with quality scores |
billing_code, plausibility, description_gin |
| negotiated_rates |
Loading... |
Rate containers (1:1 with services) |
service_id FK index |
| negotiated_prices |
Loading... |
Actual dollar amounts |
negotiated_rate_id FK index |
| reporting_entities |
Loading... |
Insurance companies (BCBS Michigan) |
Primary key only |
| plans |
Loading... |
Insurance plans (PPO, HMO, etc.) |
reporting_entity_id FK index |
| TOTAL |
Loading... |
Total records across all core tables |
Performance Implications
- provider_groups: REQUIRES indexed WHERE clauses or database runs out of memory
- Junction table: Every provider price query touches this - indexes are critical
- NPPES providers: NPI lookups must use primary key index
- Total records: Full table scans are impossible - always filter first
Enrichment Status Tracking
The enrichment process populates provider taxonomy data from NPPES registry. This system tracks enrichment progress and status.
enrichment_log Table
Purpose: Tracks each enrichment batch run with timestamps and statistics
CREATE TABLE enrichment_log (
id SERIAL PRIMARY KEY,
batch_id VARCHAR(100) UNIQUE NOT NULL,
records_processed INTEGER,
records_enriched INTEGER,
records_failed INTEGER,
status VARCHAR(50),
started_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
completed_at TIMESTAMP,
error_message TEXT
);
Provider Enrichment Flags
provider_groups table includes:
taxonomy_enriched (BOOLEAN) - Indicates if NPPES data was successfully applied
enriched_at (TIMESTAMP) - When enrichment occurred
enrichment_source (VARCHAR) - Source system (usually 'nppes')
Query enrichment status:
SELECT
COUNT(*) AS total_providers,
COUNT(*) FILTER (WHERE taxonomy_enriched = true) AS enriched,
COUNT(*) FILTER (WHERE taxonomy_enriched = false) AS pending,
ROUND(100.0 * COUNT(*) FILTER (WHERE taxonomy_enriched = true) / COUNT(*), 2) AS pct_enriched
FROM provider_groups;
Enrichment Process Flow
1. LOAD NPPES DATA
├─ Download 10.29GB NPPES CSV
├─ Parse 9,236,343 provider records
└─ Load into nppes_providers table
2. MATCH NPIs
├─ Extract NPI list from provider_groups.npi_list (JSON array)
├─ JOIN nppes_providers ON npi
└─ Match primary NPI (first in array)
3. COPY TAXONOMY FIELDS
├─ primary_taxonomy_code (e.g., "207R00000X")
├─ primary_taxonomy_name (e.g., "Internal Medicine")
├─ primary_specialization (e.g., "Internal Medicine")
├─ primary_classification (e.g., "Allopathic & Osteopathic Physicians")
└─ Set taxonomy_enriched = true
4. LOG RESULTS
└─ Insert batch record into enrichment_log
Current Enrichment Status
34,011,942
Total Providers
Schema Evolution Timeline
The database schema evolved through 4 major phases from initial load to production-ready analytics platform.
Phase 1: Initial Data Load
- Parsed CMS transparency files
- Created base tables: reporting_entities, plans, in_network_services
- Loaded 418,652 services from BCBS Michigan
- Basic indexes on primary keys only
Result: Basic data structure, slow queries, no enrichment
Phase 2: NPPES Enrichment
- Downloaded 10.29GB NPPES CSV (9.2M providers)
- Created nppes_providers table with NPI primary key
- Added taxonomy fields to provider_groups
- Enriched 34M provider groups with taxonomy data
- Added enrichment_log tracking table
Result: Providers now have specialization data, enabling specialty-based queries
Phase 3: Quality Scoring
- Added plausibility_score (0-100) to in_network_services
- Added plausibility_flags (JSON) for validation warnings
- Implemented taxonomy_validator.py logic
- Scored all 418,652 services for provider-service appropriateness
- Created plausibility index for fast filtering
Result: 37.54% Good scores, 39.58% Questionable - quality insights enabled
Phase 4: Elasticsearch Integration
- Indexed 3,869 unique billing codes to Elasticsearch
- Enabled fuzzy search and full-text queries
- Added description_gin GIN index for PostgreSQL full-text
- Hybrid search: Elasticsearch for discovery, PostgreSQL for analytics
- Added 13+ performance indexes to provider_groups
Result: Production-ready platform with sub-second search and analytics on 77M+ records
Evolution Metrics
4 phases
Major Iterations
Conceptual Mind Map
High-level visualization of the complete database architecture showing all major components and relationships.
HEALTHCARE TRANSPARENCY DATABASE
│
┌───────────────────────┼───────────────────────┐
│ │ │
┌─── INSURANCE ────┐ ┌──── SERVICES ────┐ ┌──── PROVIDERS ────┐
│ │ │ │ │ │
│ reporting_ │ │ in_network_ │ │ provider_groups │
│ entities (1) │ │ services │ │ (34M records!) │
│ │ │ │ (418K) │ │ │ │
│ │ │ │ │ │ │ │
│ plans (~50) │ │ │ │ │ ┌─────┴─────┐ │
└──────────────────┘ │ │ │ │ │ │
│ negotiated_ │ │ │
│ rates │ │ Enriched npi │
│ (418K) │ │ Taxonomy _list │
│ │ │ │ Fields (JSON) │
│ ├─────┐ │ │ │ │ │
│ │ │ │
│ negotiated rate │ │ primary_ nppes_ │
│ _prices _pvdr │ │ taxonomy pvdrs │
│ (418K) _grps │ │ _code (9.2M) │
│ (34M!) │ │ │
└──────────────────┘ └──────────────────┘
│
┌───────────────────────┼───────────────────────┐
│ │ │
┌─── ENRICHMENT ───┐ ┌──── QUALITY ─────┐ ┌───── SEARCH ─────┐
│ │ │ │ │ │
│ enrichment_log │ │ plausibility_ │ │ Elasticsearch │
│ • batch_id │ │ score (0-100) │ │ • 3,869 codes │
│ • records_ │ │ • flags (JSON) │ │ • Fuzzy search │
│ processed │ │ • validated_at │ │ • Full-text │
│ • records_ │ │ │ │ │
│ enriched │ │ Distribution: │ │ Hybrid Query: │
│ • status │ │ • 37.54% Good │ │ 1. ES discovery │
│ │ │ • 22.64% Fair │ │ 2. PG analytics │
│ NPPES Pipeline: │ │ • 39.58% Quest. │ │ 3. Combine │
│ └─ Match NPI │ │ │ │ │
│ └─ Copy Tax. │ │ Taxonomy Logic: │ │ GIN Index: │
│ └─ Set Flag │ │ CPT Taxonomy │ │ description_gin │
└──────────────────┘ └──────────────────┘ └──────────────────┘
│
┌───────────────────────┼───────────────────────┐
│ │ │
┌──── INDEXES ─────┐ ┌─── GEOGRAPHY ────┐ ┌──── ANALYTICS ───┐
│ │ │ │ │ │
│ 50+ total │ │ city, state, zip │ │ Price Comp.: │
│ │ │ indexes │ │ • Geographic │
│ Critical: │ │ │ │ • Provider │
│ • tin_value │ │ Regional Search: │ │ • Specialty │
│ • taxonomy_code │ │ └─ Fast city │ │ │
│ • billing_code │ │ queries │ │ Quality Filter: │
│ • plausibility │ │ └─ State agg. │ │ • Score ranges │
│ │ │ └─ Zip lookup │ │ • Flag analysis │
│ Partial: │ │ │ │ │
│ • hospital_tax. │ │ Hospital-spec: │ │ Cost Analysis: │
│ (282% only) │ │ └─ 282% codes │ │ • Min/max/avg │
│ │ │ + city │ │ • Percentiles │
└──────────────────┘ └──────────────────┘ └──────────────────┘
KEY DESIGN PRINCIPLES:
├─ Scale First: Every decision optimized for 34M+ provider queries
├─ Denormalize: Taxonomy fields copied for performance (no JOIN needed)
├─ Index Everything: 50+ indexes enable sub-second queries on massive tables
├─ Junction Table: Many-to-many relationships properly normalized
├─ Quality Scores: Built-in validation for service-provider appropriateness
└─ Hybrid Search: Elasticsearch for discovery, PostgreSQL for analytics
Architecture Highlights
3 Main Entities
Insurance, Services, Providers - all interconnected
3 Enhancement Layers
Enrichment, Quality, Search - stacked capabilities
3 Performance Systems
Indexes, Geography, Analytics - speed optimized
ETL Pipeline Guide
Reference guide for running the ETL data loading pipeline. Commands are designed to run on the AWS Lightsail server via SSH.
Basic Commands
# Run full ETL pipeline (all files)
docker compose -f docker-compose.production.yml exec api python src/etl/load_data.py
# Run with record limit (for testing)
docker compose -f docker-compose.production.yml exec api python src/etl/load_data.py --limit-records 10000
# Run with file limit
docker compose -f docker-compose.production.yml exec api python src/etl/load_data.py --limit 5
State Filter
Filter data by provider state using NPPES enrichment. Only providers in the specified state(s) will be loaded.
# Filter to Michigan providers only
docker compose -f docker-compose.production.yml exec api python src/etl/load_data.py --state-filter MI
# Filter to multiple states
docker compose -f docker-compose.production.yml exec api python src/etl/load_data.py --state-filter MI,OH,IN
Billing Code Filter
Filter data by billing codes to load only specific procedures or services. Supports exact matching on billing_code and/or billing_code_type.
# Filter to a single billing code
docker compose -f docker-compose.production.yml exec api python src/etl/load_data.py --billing-codes 99213
# Filter to multiple billing codes
docker compose -f docker-compose.production.yml exec api python src/etl/load_data.py --billing-codes 99213,99214,99215
# Filter by billing code type (CPT, HCPCS, etc.)
docker compose -f docker-compose.production.yml exec api python src/etl/load_data.py --billing-code-types CPT
# Combine filters
docker compose -f docker-compose.production.yml exec api python src/etl/load_data.py --billing-codes 99213,99214 --billing-code-types CPT
Tip: Billing code filter dramatically reduces data size by only loading matching services and their associated provider groups.
Combined Filters
Filters can be combined for precise data loading. Example: load only specific CPT codes for Michigan providers.
# Office visits (99213-99215) in Michigan only
docker compose -f docker-compose.production.yml exec api python src/etl/load_data.py \
--state-filter MI \
--billing-codes 99213,99214,99215
# All CPT codes in Michigan and Ohio, limit to first 5 files
docker compose -f docker-compose.production.yml exec api python src/etl/load_data.py \
--state-filter MI,OH \
--billing-code-types CPT \
--limit 5
Performance Options
# Adjust batch size (default: 50000)
--batch-size 25000
# Provider group chunk size (default: 100000)
--provider-group-chunk-size 50000
# Enable parallel processing (multiple files)
--parallel --workers 4
# Sample percentage (for testing)
--sample-percent 10
Storage Optimization
Reduce disk usage when processing large files. The --stream-gz flag reads compressed files directly without decompressing to disk (saves 10x storage). The --delete-after-process flag removes files after loading.
# Read .gz files directly (no decompression to disk)
docker compose -f docker-compose.production.yml exec api python src/etl/load_data.py --stream-gz
# Delete files after processing
docker compose -f docker-compose.production.yml exec api python src/etl/load_data.py --delete-after-process
# Combine both for minimal storage usage
docker compose -f docker-compose.production.yml exec api python src/etl/load_data.py --stream-gz --delete-after-process
Tip: Using --stream-gz avoids 10x disk expansion (e.g., 5GB 50GB). Slightly slower but essential for limited storage.
Batch Download & Process Mode (Recommended)
The most efficient way to load large datasets. Downloads N files at a time, processes them with ETL, deletes them, then downloads the next batch. Minimizes disk usage by never having all files on disk at once.
# Batch mode: download 100 files, 5 at a time
docker compose -f docker-compose.production.yml exec api python src/scraper/download_and_process.py \
--batch-mode --limit 100 --batch-size 5 \
--file-types in_network --db-id bcbs_michigan
# With state and billing code filters
docker compose -f docker-compose.production.yml exec api python src/scraper/download_and_process.py \
--batch-mode --limit 50 --batch-size 3 \
--file-types in_network --db-id bcbs_michigan \
--state-filter MI --billing-codes 99213,99214,99215
# Large-scale load for new payer
docker compose -f docker-compose.production.yml exec api python src/scraper/download_and_process.py \
--batch-mode --limit 500 --batch-size 10 \
--file-types in_network --db-id uhc_michigan \
--stream-gz --etl-batch-size 10000
Old method: Download 100 files (2TB) Decompress Load
Peak disk: ~2TB
Batch mode: Download 5 files Load Delete Repeat
Peak disk: ~100GB
Post-ETL Steps
After loading data, run these commands to complete the pipeline:
# 1. Reindex Elasticsearch (with NPPES enrichment)
docker compose -f docker-compose.production.yml exec api python reindex_with_nppes.py
# 2. Check database status
docker compose -f docker-compose.production.yml exec postgres psql -U healthcare_admin -d healthcare_bcbs_michigan_production -c "SELECT COUNT(*) FROM in_network_services;"
# 3. View processing logs
docker compose -f docker-compose.production.yml logs -f --tail=100 api
CLI Arguments Quick Reference
| Argument |
Description |
Example |
--billing-codes |
Comma-separated list of billing codes to filter |
99213,99214,99215 |
--billing-code-types |
Comma-separated list of code types (CPT, HCPCS, etc.) |
CPT,HCPCS |
--state-filter |
Comma-separated list of state abbreviations |
MI,OH,IN |
--limit |
Max number of files to process |
5 |
--limit-records |
Max records per file |
10000 |
--batch-size |
Records per batch commit (default: 50000) |
25000 |
--parallel |
Enable parallel file processing |
--parallel --workers 4 |
--sample-percent |
Random sample percentage (for testing) |
10 |
--stream-gz |
Read .gz files directly without decompression |
(flag only) |
--delete-after-process |
Delete files after loading to save storage |
(flag only) |
Batch Download Mode Arguments
These arguments are for the download_and_process.py script with --batch-mode:
| Argument |
Description |
Example |
--batch-mode |
Enable batch download/process/delete loop |
(flag only) |
--batch-size |
Files per batch (default: 5) |
10 |
--limit |
Total files to process |
100 |
--file-types |
in_network, allowed_amount, or both |
in_network |
--db-id |
Database ID for file organization |
bcbs_michigan |
--stream-gz |
Read .gz directly (default: true) |
(flag only) |
--etl-batch-size |
Records per ETL batch |
10000 |
--state-filter |
Filter providers by state |
MI |
--billing-codes |
Filter by billing codes |
99213,99214 |