Introduction

In finance, a decision isn’t useful unless you can show why it was made. “Reasoning” AI models help you do exactly that. Instead of giving a one-line yes/no, they walk through the factors—income, requested loan amount, credit history, number of dependents—and explain how each one affected the result.

In this guide, we set up a simple decision policy (think: a rulebook) inside the model’s system prompt. The model then follows those rules and thresholds every time, so you get a clear outcome plus a short audit trail you can share with risk, compliance, or regulators.

What is DuckDB? (and why you might care)

DuckDB is a small, fast analytics database that runs inside your app—no separate server to manage. Think of it as “SQLite for analytics.” It’s great for slicing and dicing data, running big joins, and computing metrics for dashboards or models.

Runs in-process: Drop it into a Python or R workflow; no ops overhead.
Columnar engine: Stores data by column, so queries that touch a few fields read less data and go faster.
Vectorized & parallel: Processes data in batches and uses all CPU cores to speed up heavy analytics.
File-friendly: Query CSV or Parquet files and even Pandas DataFrames directly—handy for quick analysis, feature engineering, and lightweight ETL without standing up a warehouse.

Bottom line: DuckDB gives your team warehouse-style analytics speed without warehouse-level complexity or cost—perfect for prototypes, notebooks, and embedded analytics.

Reasoning Models (why they’re different)

A reasoning model is an AI that “shows its work.” It lays out the steps it took to reach a conclusion, so you can trust the result and review it later. That makes it well-suited to finance, where decisions must be explainable and consistent.

For this project we use DeepSeek-R1-Distill-LLaMA-70B:

What it is: A large LLaMA-family model distilled (compressed) to run more efficiently while keeping strong analytical skills.
Why it’s a fit: It follows structured instructions and applies rules/thresholds reliably—ideal for loan approvals and credit risk.
Benefit to you: Lower compute cost than many “full-size” models, while still giving clear, step-by-step justifications you can share with stakeholders.

Code Walkthrough

Prerequisites

You need Python 3.10+ and two libraries:

duckdb – a fast, embedded analytics database for reading/querying data files
groq – the SDK to call the DeepSeek reasoning model via Groq

Install (in a notebook use the !, in a terminal omit it):

! pip install duckdb groq

Step 1: Load the Dataset with DuckDB (what’s happening)

We’re using a public loan prediction dataset (~614 rows). Columns include applicant income, co-applicant income, loan amount/term, credit history, demographics, and the final label (approved or not). It’s a simple, realistic sandbox to practice explainable lending decisions.

Why DuckDB here? It lets you query a CSV as if it were a table—fast and zero setup. The snippet below previews five rows so you can eyeball types and missing values.

import duckdb
duckdb.read_csv("train.csv")
con = duckdb.connect()
df = con.sql("SELECT * FROM 'train.csv' LIMIT 5").df()
print(df)

Notes

You can actually query train.csv directly (DuckDB treats the file like a table).
Expect a few NULL/None entries (e.g., LoanAmount). That’s normal in real data.
For production scoring, align units (monthly vs annual income, loan amount units).

Step 2: The System Prompt (your policy as a rulebook)

This prompt is the policy the model must follow. It spells out the factors and thresholds, so the model “shows its work” instead of giving a black-box yes/no.

Key ideas

Credit history carries the most weight (poor credit → usually decline).

Affordability via a simple loan-to-income check:
- If loan ÷ (applicant + co-applicant income) > 5 → DECLINE
- If between 2–5 → REVIEW
- If < 2 and credit is good → APPROVE
Socio-economic fields (education, marital status, dependents) are supporting context only—not decisive—to avoid unfair bias.

Always return a structured JSON with verdict, score, reasons, and recommendations.

SYSTEM = """You are a senior loan eligibility officer at Dream Housing Finance.
Your task: decide if a customer is eligible for a home loan based ONLY on the provided case JSON.

Consider the following factors logically:
- Applicant's income (higher income → higher eligibility)
- Co-applicant income (combined income matters)
- Loan amount (must be reasonable compared to income)
- Loan term (longer terms lower monthly burden)
- Credit history (1 = good, 0 = poor; poor credit usually leads to decline)
- Education, marital status, and dependents can influence stability, but DO NOT use them unfairly--only as supportive signals.

Policy guidelines:
- If credit history = 0, high risk → usually DECLINE unless strong income/low loan.
- If loan amount / total income ratio is very high (> 5), mark DECLINE.
- If ratio is moderate (2-5), mark REVIEW.
- Otherwise, APPROVE.
- Return JSON with: verdict (APPROVE/REVIEW/DECLINE), risk_score_0_100, reasons, and recommendations.

Be clear and logical in your reasoning, showing step-by-step analysis of the applicant's profile.
"""

‍

Reality check:

The demo uses a simple loan-amount ÷ income proxy for affordability. In common loan datasets, ApplicantIncome is monthly and LoanAmount is often in thousands, so raw division mismatches units and can understate the ratio. For realistic outcomes, either use your operational EMI/Income (FOIR/DTI) policy band (e.g., 30–45%), or normalize both values to the same time unit and scale before applying thresholds.

Step 3: Using a Reasoning Model (what the code does)

Pull a small sample from the CSV.
Send it + the policy prompt to the DeepSeek reasoning model via Groq.
Streaming prints tokens as they arrive (nice for demos).

query = "SELECT * FROM 'train.csv' LIMIT 8"
sample = con.execute(query).df().to_dict(orient="records")
from groq import Groq

client = Groq(api_key="YOUR_GROQ_API_KEY")

completion = client.chat.completions.create(
   model="deepseek-r1-distill-llama-70b",
   messages=[
       {
           "role": "system",
           "content": SYSTEM
       },
       {
           "role": "user",
           "content": f"Here is some loan application data:\n{sample}\n\nWho is most at risk of default and why?"
       }
   ],
   temperature=0.6,
   max_completion_tokens=4096,
   top_p=0.95,
   stream=True
)
for chunk in completion:
   print(chunk.choices[0].delta.content or "", end="")

‍

Tip: or executive demos, show only the final structured output (verdict, score, reasons, recommendations) rather than intermediate thought streams.

Expected Output

Present an audit-friendly result—either a concise narrative summary, a compact JSON, or both. Avoid publishing internal chain-of-thought.

Decision Summary (example)

The model flags LP001014 as the most at-risk applicant. While their combined income (5540) and loan amount (158) produce a low loan-to-income proxy (~0.0285), the credit history of 0 is the decisive factor. In lending, a poor credit track record is a strong predictor of default and outweighs positive signals like a manageable loan size or adequate income. Supportive context—such as 3+ dependents, which can add financial strain—reinforces the risk. Together, these lead to a verdict of DECLINE with a high risk score of 90/100. This step-by-step explanation shows how the model weighs multiple inputs but ultimately prioritizes credit history, providing a transparent, audit-friendly trail of the decision.

‍

<think>
Okay, so I need to figure out who is most at risk of default based on the provided loan applications. Let's start by understanding the key factors that determine loan eligibility and risk. The main factors mentioned are income (both applicant and co-applicant), loan amount, loan term, credit history, education, marital status, and dependents.

First, I'll go through each application one by one and analyze these factors.

Starting with LP001002:
- Loan_ID: LP001002
- Gender: Male
- Married: False
- Dependents: 0
- Education: Graduate
- Self_Employed: False
- ApplicantIncome: 5849
- CoapplicantIncome: 0.0
- LoanAmount: None
- Loan_Amount_Term: 360
- Credit_History: 1
- Property_Area: Urban
- Loan_Status: Y

Since LoanAmount is None, it's unclear, but the income is 5849, which is decent. Credit history is good (1). No co-applicant income. Since loan amount isn't provided, it's hard to assess the ratio, but with good income and credit, risk is lower.

Next, LP001003:
- Loan_ID: LP001003
- Gender: Male
- Married: True
- Dependents: 1
- Education: Graduate
- Self_Employed: False
- ApplicantIncome: 4583
- CoapplicantIncome: 1508.0
- LoanAmount: 128
- Loan_Amount_Term: 360
- Credit_History: 1
- Property_Area: Rural
- Loan_Status: N

Combined income is 4583 + 1508 = 6091. Loan amount is 128. The ratio is 128 / 6091 ≈ 0.021, which is very low. Credit history is good. So, low risk. But loan status is 'N', so maybe other factors, but based on given data, seems okay.

LP001005:
- Loan_ID: LP001005
- Gender: Male
- Married: True
- Dependents: 0
- Education: Graduate
- Self_Employed: True
- ApplicantIncome: 3000
- CoapplicantIncome: 0.0
- LoanAmount: 66
- Loan_Amount_Term: 360
- Credit_History: 1
- Property_Area: Urban
- Loan_Status: Y

Self-employed, which might be a risk, but income is 3000, loan amount 66. Ratio is 66/3000 ≈ 0.022. Good credit. So, low risk.

LP001006:
- Loan_ID: LP001006
- Gender: Male
- Married: True
- Dependents: 0
- Education: Not Graduate
- Self_Employed: False
- ApplicantIncome: 2583
- CoapplicantIncome: 2358.0
- LoanAmount: 120
- Loan_Amount_Term: 360
- Credit_History: 1
- Property_Area: Urban
- Loan_Status: Y

Combined income: 2583 + 2358 = 4941. Loan amount 120. Ratio ≈ 0.024. Good credit. Low risk.

LP001008:
- Loan_ID: LP001008
- Gender: Male
- Married: False
- Dependents: 0
- Education: Graduate
- Self_Employed: False
- ApplicantIncome: 6000
- CoapplicantIncome: 0.0
- LoanAmount: 141
- Loan_Amount_Term: 360
- Credit_History: 1
- Property_Area: Urban
- Loan_Status: Y

Income is 6000, loan 141. Ratio ≈ 0.0235. Good credit. Low risk.

LP001011:
- Loan_ID: LP001011
- Gender: Male
- Married: True
- Dependents: 2
- Education: Graduate
- Self_Employed: True
- ApplicantIncome: 5417
- CoapplicantIncome: 4196.0
- LoanAmount: 267
- Loan_Amount_Term: 360
- Credit_History: 1
- Property_Area: Urban
- Loan_Status: Y

Combined income: 5417 + 4196 = 9613. Loan amount 267. Ratio ≈ 0.0278. Good credit. Self-employed, but ratio is okay. Moderate risk.

LP001013:
- Loan_ID: LP001013
- Gender: Male
- Married: True
- Dependents: 0
- Education: Not Graduate
- Self_Employed: False
- ApplicantIncome: 2333
- CoapplicantIncome: 1516.0
- LoanAmount: 95
- Loan_Amount_Term: 360
- Credit_History: 1
- Property_Area: Urban
- Loan_Status: Y

Combined income: 2333 + 1516 = 3849. Loan 95. Ratio ≈ 0.0246. Good credit. Low risk.

LP001014:
- Loan_ID: LP001014
- Gender: Male
- Married: True
- Dependents: 3+
- Education: Graduate
- Self_Employed: False
- ApplicantIncome: 3036
- CoapplicantIncome: 2504.0
- LoanAmount: 158
- Loan_Amount_Term: 360
- Credit_History: 0
- Property_Area: Semiurban
- Loan_Status: N

Combined income: 3036 + 2504 = 5540. Loan amount 158. Ratio ≈ 0.0284. But credit history is 0, which is poor. This is a red flag. Even though the ratio is low, poor credit history usually leads to decline unless strong income and low loan, which is the case here, but credit history is a major factor.

So, LP001014 has the highest risk because of poor credit history despite a manageable loan amount. The credit history being 0 indicates a high risk of default. Even though the income is sufficient and the loan amount isn't too high, the bad credit history is a significant negative factor. Therefore, this applicant is the most at risk of defaulting on the loan.
</think>

The applicant most at risk of default is LP001014. Here's the breakdown:

**Loan_ID:** LP001014

**Reasons for High Risk:**
1. **Poor Credit History:** The credit history score is 0, indicating a high risk of default.
2. **Combined Income and Loan Amount:** Although the combined income (5540) and loan amount (158) ratio is manageable (≈0.0284), the poor credit history outweighs these positive factors.
3. **Dependents:** Having 3+ dependents may add financial strain, potentially affecting loan repayment capabilities.

**Conclusion:** Despite a manageable loan-to-income ratio, the poor credit history significantly increases the risk of default, making LP001014 the highest risk applicant.

**Risk Assessment:**
- **Verdict:** DECLINE
- **Risk Score:** 90/100 (High Risk)
- **Reasons:** Poor credit history (0), despite manageable loan amount and sufficient income.
- **Recommendations:** Improve credit score, consider a co-applicant with a better credit history, or apply for a smaller loan amount.

‍

Quick Guardrails

Fairness: Keep demographic fields supportive only; don’t let them drive the verdict.
Units: Align scales (monthly vs annual income; loan amount units and currency).
PII: Don’t send personally identifiable information to the model.
Production swap-ins: Replace the simple proxy with your FOIR/DTI and policy thresholds when you operationalize.

The complete source code for this project is available on GitHub.

Conclusion

Reasoning models replace black-box lending with explainable decisions. Instead of a simple approve/decline, the model evaluates income, loan size, term, credit history, and supportive signals against clear rules, then returns a verdict with reasons and recommendations. The outcome: decisions that are consistent and auditable for risk and compliance, easier for customers to understand, and fairer across the portfolio.

Want to explore this for your portfolio? Let’s talk.