🏥 Insurance Premium API — Data Flow

A complete visual walkthrough: from user input → validation → ML model → Docker container

📍 Full Request Lifecycle
💡 Each step below represents one layer of the system: exactly what code runs and what the data looks like at that point.
Step 1 — 🖥 Streamlit Frontend (frontend.py)

User fills a form → raw values collected → HTTP POST sent to FastAPI

The user sees a Streamlit web form. When they hit Predict, these exact raw values are packaged into a JSON payload:

```python
# What the user enters in the form
payload = {
    "age": 35,                    # int, 0-120
    "weight": 75.0,               # float, kg
    "height": 1.75,               # float, meters
    "income_lpa": 12.0,           # float, lakhs/year
    "smoker": True,               # bool
    "city": "Mumbai",             # str
    "occupation": "private_job",  # Literal enum
}

# Sent over HTTP POST to: http://18.117.168.71:8000/predict
response = requests.post(f"{api_base}/predict", json=payload, timeout=20)
```
⚠️ The API URL is hardcoded to an AWS EC2 instance IP (18.117.168.71:8000). Users can override it in the sidebar.
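On the frontend side, the POST plus envelope-unwrapping can be wrapped in a small helper. This is a hypothetical sketch (`call_predict` does not exist in the repo); it assumes the `"response"` envelope described in step 6 and turns network failures into an error dict the UI can display:

```python
# Hypothetical frontend helper — illustrative only, not from the repo
import requests

def call_predict(api_base: str, payload: dict) -> dict:
    """POST to /predict and unwrap the 'response' envelope; errors become a dict."""
    try:
        r = requests.post(f"{api_base}/predict", json=payload, timeout=20)
        r.raise_for_status()  # surface 4xx/5xx (e.g. 422 validation errors)
        return r.json().get("response", {})
    except requests.RequestException as exc:
        return {"error": str(exc)}
```

The sidebar override mentioned above would simply feed a different `api_base` into this call.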
Step 2 — ⚡ FastAPI Receives the Request (app.py)

POST /predict — FastAPI hands the JSON body to the UserInput Pydantic model

FastAPI automatically parses the JSON body. The type annotation user_input: UserInput triggers Pydantic validation before any of your code runs.

```python
# app.py
@app.post("/predict", response_model=PredictionResponse)
def predict_premium(user_input: UserInput):
    # ↑ FastAPI calls UserInput(**json_body) automatically
    # If validation fails → 422 Unprocessable Entity (before this runs)
    # If validation passes → this function executes
    ...
```

Also available: GET / (welcome message) and GET /health (status + model version).

Step 3 — 🔍 Pydantic Validation + Feature Engineering (schema/user_input.py)

7 raw inputs → validated → 4 computed fields auto-generated → 11 total fields

The UserInput Pydantic model does heavy lifting: it validates ranges, normalises text, and computes derived features automatically using @computed_field and @field_validator.

```python
# schema/user_input.py — computed fields
@computed_field  # auto-calculated, never passed by the user
@property
def bmi(self) -> float:
    return self.weight / (self.height ** 2)  # 75 / (1.75²) = 24.49

@computed_field
@property
def lifestyle_risk(self) -> str:
    if self.smoker and self.bmi > 30:
        return "high"
    elif self.smoker and self.bmi > 27:
        return "medium"
    return "low"  # smoker=True, bmi=24.49 → "low"

@computed_field
@property
def age_group(self) -> str:
    if self.age < 25:
        return "young"
    elif self.age < 45:
        return "adult"        # ← age=35
    elif self.age < 60:
        return "middle_aged"
    return "senior"

@computed_field
@property
def city_tier(self) -> int:
    if self.city in tier_1_cities:
        return 1              # ← "Mumbai" → Tier 1
    elif self.city in tier_2_cities:
        return 2
    return 3

@field_validator("city")
@classmethod
def validate_city(cls, v: str) -> str:
    return v.strip().title()  # "mumbai" → "Mumbai"
```

Also imports tier_1_cities and tier_2_cities from config/city_tier.py for the lookup.
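The config module is pure data. An assumed sketch of its contents — the Tier-1 list is documented later in this guide, the Tier-2 list is truncated here, and `tier_for` is a helper added purely for illustration (the real lookup is inlined in `city_tier` above):

```python
# config/city_tier.py — assumed contents; Tier-2 list truncated for brevity
tier_1_cities = ["Mumbai", "Delhi", "Bangalore", "Chennai", "Kolkata", "Hyderabad", "Pune"]
tier_2_cities = ["Jaipur", "Lucknow", "Surat"]  # the real file lists 48 cities

def tier_for(city: str) -> int:
    """Mirrors the UserInput.city_tier logic: anything unlisted falls back to Tier 3."""
    if city in tier_1_cities:
        return 1
    if city in tier_2_cities:
        return 2
    return 3

print(tier_for("Mumbai"), tier_for("Jaipur"), tier_for("Shillong"))  # → 1 2 3
```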

Step 4 — ✂️ Feature Selection (app.py)

11 fields → slimmed to exactly 6 features the ML model expects

Back in app.py, only 6 of the 11 available fields are extracted and passed to the prediction function. Raw inputs like age, weight, height, smoker, city are dropped — they already contributed to computed features.

```python
# app.py — extract only what the model needs
user_input = {
    "bmi": user_input.bmi,                        # computed
    "age_group": user_input.age_group,            # computed
    "lifestyle_risk": user_input.lifestyle_risk,  # computed
    "city_tier": user_input.city_tier,            # computed
    "income_lpa": user_input.income_lpa,          # raw (kept)
    "occupation": user_input.occupation,          # raw (kept)
}
# age, weight, height, smoker, city → dropped here ✂️
```
Step 5 — 🤖 ML Model Prediction (model/predict.py)

6-feature dict → pandas DataFrame → model.predict() + predict_proba() → result dict

The model is a pre-trained scikit-learn classifier loaded from model/model.pkl at startup (not per request). It outputs 3 premium categories.

```python
# model/predict.py
import pickle
import pandas as pd

with open("model/model.pkl", "rb") as f:
    model = pickle.load(f)  # loaded ONCE at startup, not per request

class_labels = model.classes_.tolist()  # e.g. ["High", "Low", "Medium"]
MODEL_VERSION = "1.0.0"

def predict_output(user_input: dict):
    input_df = pd.DataFrame([user_input])             # 1-row DataFrame
    prediction = model.predict(input_df)[0]           # e.g. "High"
    probabilities = model.predict_proba(input_df)[0]  # e.g. [0.85, 0.10, 0.05]
    confidence = max(probabilities)                   # e.g. 0.85
    # columns of predict_proba follow model.classes_, so this zip is aligned
    prob_dict = dict(zip(class_labels, (round(p, 4) for p in probabilities)))
    # {"High": 0.85, "Low": 0.10, "Medium": 0.05}
    return {
        "prediction": prediction,
        "confidence": round(confidence, 4),
        "probabilities": prob_dict,
    }
```
Step 6 — 📤 JSON Response Sent Back (app.py → frontend.py)

FastAPI wraps prediction in JSONResponse → Streamlit displays category, confidence & probabilities

FastAPI returns a JSONResponse with status 200. Its structure does NOT exactly match the PredictionResponse schema (the keys differ) — a minor inconsistency in the code, which the frontend works around by checking both key names.

```python
# FastAPI sends (wrapped in a "response" key):
{
    "response": {
        "prediction": "High",
        "confidence": 0.8543,
        "probabilities": {"High": 0.8543, "Low": 0.0921, "Medium": 0.0536},
    }
}

# Streamlit reads it defensively, handling both key styles:
result = data.get("response", {})
category = result.get("predicted_category") or result.get("prediction")
confidence = result.get("confidence")
probabilities = result.get("class_probabilities") or result.get("probabilities", {})
```
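For contrast, here is the assumed shape of the documentation-only schema, based on the field names listed for schema/prediction_response.py elsewhere in this guide. Note the mismatch: the live endpoint returns `prediction`/`probabilities`, while this schema declares `predicted_category`/`class_probabilities`:

```python
# schema/prediction_response.py — assumed shape, used only for OpenAPI docs
from typing import Dict
from pydantic import BaseModel

class PredictionResponse(BaseModel):
    predicted_category: str
    confidence: float
    class_probabilities: Dict[str, float]

# What the schema promises (vs. the actual JSONResponse keys above):
resp = PredictionResponse(
    predicted_category="High",
    confidence=0.8543,
    class_probabilities={"High": 0.8543, "Low": 0.0921, "Medium": 0.0536},
)
```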
🔄 How Data Changes Shape at Every Stage
Legend: raw input (from user) · computed by Pydantic 🔧 · dropped / not passed forward · sent to ML model

① Form Submission

age: 35
weight: 75.0 kg
height: 1.75 m
income_lpa: 12.0
smoker: True
city: "Mumbai"
occupation: "private_job"

② After Pydantic (11 fields)

age: 35 ✓
weight: 75.0 ✓
height: 1.75 ✓
income_lpa: 12.0 ✓
smoker: True ✓
city: "Mumbai" ✓
occupation: "private_job" ✓
bmi: 24.49 🔧
lifestyle_risk: "low" 🔧
age_group: "adult" 🔧
city_tier: 1 🔧

③ Sent to Model (6 fields)

age: dropped
weight: dropped
height: dropped
smoker: dropped
city: dropped
bmi: 24.49
age_group: "adult"
lifestyle_risk: "low"
city_tier: 1
income_lpa: 12.0
occupation: "private_job"

④ Model Output

prediction: "High"
confidence: 0.8543
probabilities: High 0.8543 · Low 0.0921 · Medium 0.0536

⚡ Live Data Packet Simulation

🖥 Streamlit ──JSON──▶ ⚡ FastAPI ──validate──▶ 🔍 Pydantic ──6 features──▶ 🤖 ML Model ──prediction──▶ 🖥 Display
💡 Key insight: The model was trained on 6 features (bmi, age_group, lifestyle_risk, city_tier, income_lpa, occupation) — NOT on the raw inputs. The feature engineering in Pydantic mirrors what was done during model training. This is intentional design.
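The Pydantic stage can be exercised on its own, without FastAPI or the model. A self-contained sketch that re-implements step 3 (tier lists are inlined and the Tier-2 list is illustrative; the real code lives in schema/user_input.py):

```python
# Standalone re-implementation of the UserInput model — for experimentation only
from pydantic import BaseModel, computed_field, field_validator

tier_1_cities = ["Mumbai", "Delhi", "Bangalore", "Chennai", "Kolkata", "Hyderabad", "Pune"]
tier_2_cities = ["Jaipur", "Lucknow", "Surat"]  # illustrative subset of the 48

class UserInput(BaseModel):
    age: int
    weight: float
    height: float
    income_lpa: float
    smoker: bool
    city: str
    occupation: str

    @field_validator("city")
    @classmethod
    def validate_city(cls, v: str) -> str:
        return v.strip().title()  # "mumbai" → "Mumbai"

    @computed_field
    @property
    def bmi(self) -> float:
        return self.weight / (self.height ** 2)

    @computed_field
    @property
    def lifestyle_risk(self) -> str:
        if self.smoker and self.bmi > 30:
            return "high"
        if self.smoker and self.bmi > 27:
            return "medium"
        return "low"

    @computed_field
    @property
    def age_group(self) -> str:
        if self.age < 25:
            return "young"
        if self.age < 45:
            return "adult"
        if self.age < 60:
            return "middle_aged"
        return "senior"

    @computed_field
    @property
    def city_tier(self) -> int:
        if self.city in tier_1_cities:
            return 1
        if self.city in tier_2_cities:
            return 2
        return 3

u = UserInput(age=35, weight=75.0, height=1.75, income_lpa=12.0,
              smoker=True, city="mumbai", occupation="private_job")
print(round(u.bmi, 2), u.age_group, u.lifestyle_risk, u.city_tier, u.city)
# → 24.49 adult low 1 Mumbai
```

This reproduces the walkthrough's running example: 7 raw inputs in, 4 derived fields out, with `"mumbai"` normalised to `"Mumbai"` before the tier lookup.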
🗂 File Structure & Responsibilities
🖥 frontend.py

Streamlit web UI. Renders input form, collects user values, sends HTTP POST to the FastAPI backend, and displays the prediction result.

Streamlit • requests

⚡ app.py

FastAPI application entry point. Defines 3 endpoints: / (welcome), /health (status), /predict (POST). Orchestrates the full prediction pipeline.

FastAPI • uvicorn
🔍 schema/user_input.py

Pydantic model for request validation AND feature engineering. Computes bmi, lifestyle_risk, age_group, city_tier automatically from raw inputs.

Pydantic v2 • @computed_field
📋 schema/prediction_response.py

Pydantic model for API response documentation. Defines predicted_category, confidence, class_probabilities. Used as OpenAPI schema only (note: actual response uses JSONResponse directly).

Pydantic • OpenAPI docs
🤖 model/predict.py

Loads the pickled ML model at startup. Provides predict_output() which wraps model.predict() and model.predict_proba(), returning a dict with prediction, confidence, and per-class probabilities.

scikit-learn • pandas • pickle
📦 model/model.pkl

Pre-trained scikit-learn classifier (binary pickle). Has .classes_, .predict(), and .predict_proba() — typical of a RandomForest, GradientBoosting, or similar ensemble. Version: 1.0.0

Pre-trained • binary file
🏙 config/city_tier.py

Pure data config. Two lists: tier_1_cities (7 metro cities) and tier_2_cities (48 smaller cities). Everything else → Tier 3. Used by UserInput.city_tier computed field.

Config • lookup lists
🐳 Dockerfile

Containerises the FastAPI backend. Uses python:3.12-slim, installs requirements, copies all code, exposes port 8000, and runs uvicorn to serve the app.

Docker • uvicorn
📝 requirements.txt

Lists Python dependencies installed inside the Docker image. Likely includes fastapi, uvicorn, pydantic, scikit-learn, pandas, and streamlit.

pip • dependencies
🐳 Docker — How the Backend is Packaged & Run
The Dockerfile only containerises the FastAPI backend. Streamlit (frontend.py) runs separately and points to the container's exposed port 8000.
🐳 Docker Container (python:3.12-slim)

📋 Dockerfile Breakdown

  • FROM python:3.12-slim — lightweight Python 3.12 base image (~60MB)
  • WORKDIR /app — all subsequent commands run from /app
  • COPY requirements.txt . — copy deps first (Docker layer caching)
  • RUN pip install --no-cache-dir -r requirements.txt — install all Python packages
  • COPY . . — copy ALL project files: app.py, model/, schema/, config/
  • EXPOSE 8000 — document that the container listens on 8000
  • CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"] — start server
🔌 Port 8000 → FastAPI + uvicorn
Host: 0.0.0.0 (accessible from outside container)
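Assembled from the breakdown above, the backend Dockerfile plausibly reads as follows (reconstructed from the bullet points, not copied from the repo):

```dockerfile
FROM python:3.12-slim

WORKDIR /app

# Copy deps first so this layer is cached when only code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy ALL project files: app.py, model/, schema/, config/
COPY . .

EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```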

📁 What's Inside the Container After Build

```
/app/
├── app.py                     # FastAPI entry point
├── requirements.txt           # Python dependencies
├── schema/
│   ├── user_input.py          # Pydantic validation + feature eng.
│   └── prediction_response.py
├── model/
│   ├── predict.py             # prediction logic
│   └── model.pkl              # pre-trained ML model (binary)
├── config/
│   └── city_tier.py           # city → tier mapping
└── frontend.py                # also copied but NOT run by CMD
```

🚀 Build & Run Commands

```bash
# Build the image
docker build -t insurance-premium-api .

# Run the container
docker run -p 8000:8000 insurance-premium-api

# Streamlit (runs OUTSIDE the container, talks to it)
streamlit run frontend.py
# → open http://localhost:8501 in browser
# → frontend POSTs to http://18.117.168.71:8000/predict (or your container)
```
⚠️ Note: FastAPI_Key.pem is in the project folder — this is an AWS EC2 SSH key for the server at 18.117.168.71. It should NOT be committed to git or baked into the Docker image in production, yet the COPY . . instruction currently copies it into the container; excluding it via .dockerignore (and .gitignore) would prevent that.
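A minimal .dockerignore would keep the key out of the build context entirely. This file is assumed (the repo as described does not have one):

```
# .dockerignore — excluded from the Docker build context
FastAPI_Key.pem
*.pem
.git/
__pycache__/
```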
📊 Quick Reference Summary
7 — Raw inputs from user: age, weight, height, income_lpa, smoker, city, occupation (entered in the Streamlit form)
4 — Features auto-computed by Pydantic: bmi (weight/height²), lifestyle_risk, age_group, city_tier (never sent by the user; generated server-side)
6 — Features fed to the ML model: bmi, age_group, lifestyle_risk, city_tier, income_lpa, occupation (the exact feature set the model was trained on)
3 — Output premium categories: Low, Medium, High (the model outputs one predicted class plus probability scores for all three)
3 — API endpoints: GET / (welcome), GET /health (status check + model version), POST /predict (main prediction endpoint)
55 — Cities with explicit tier mapping: 7 Tier-1 metros + 48 Tier-2 cities defined in config/city_tier.py; all other cities → Tier 3

🗺 Full System in One Line

User fills form (Streamlit) → HTTP POST /predict → FastAPI receives JSON → Pydantic validates + computes 4 new fields → app.py selects 6 features → model.pkl predicts (Low/Medium/High) + probabilities → JSON response back to Streamlit → UI shows prediction, confidence, class probabilities

🏙 City Tier Logic

Tier 1: Mumbai, Delhi, Bangalore, Chennai, Kolkata, Hyderabad, Pune
Tier 2: 48 mid-sized cities (Jaipur, Lucknow, Surat…)
Tier 3: Everything else (default fallback)