
ChatGPT Prompts for Data Analysis

Generate SQL queries, Python pandas code, Excel formulas, and statistical analysis — with AI prompts built for analysts who need working code, not generic advice.

12 prompts | Updated March 2026

Data analysts spend significant time on repetitive tasks: writing boilerplate SQL queries, cleaning messy datasets, building the same aggregation patterns in pandas, or constructing complex Excel formulas. ChatGPT can accelerate all of these with the right prompts — but generic prompts produce generic code. These prompts are built for analysts who need working, production-ready code snippets that they can adapt to their actual schema and data. Each prompt includes schema inputs, data type specifications, and edge case handling to produce useful output rather than toy examples.

1. SQL Query for Business Analysis

Write a SQL query to answer the following business question: [describe the specific business question]

Database context:
- Database type: [PostgreSQL / MySQL / BigQuery / Snowflake / SQL Server / SQLite]
- Relevant tables and schema:
  - Table 1: [table_name] — columns: [col1 (type)], [col2 (type)], [col3 (type)]
  - Table 2: [table_name] — columns: [col1 (type)], [col2 (type)], [col3 (type)]
  - Table 3 (if needed): [table_name] — columns: [col1 (type)], [col2 (type)]
- Primary keys and foreign keys: [describe relationships between tables]
- Date column format: [YYYY-MM-DD / timestamp / Unix epoch]
- Data volume (approximate): [X rows in main table]

Query requirements:
- Time period: [specific date range or relative period like "last 30 days" or "current quarter"]
- Grouping: [what dimensions to group by — e.g., by user, by product category, by week]
- Aggregations needed: [sum, count distinct, average, percentile, etc.]
- Filters: [any WHERE conditions — status, user type, minimum threshold, etc.]
- Output columns: [exactly what columns should appear in the result]
- Sort order: [how to order results]

Write the query with:
1. Comments explaining each major section (CTE, JOIN, filter)
2. Handling for NULL values where relevant
3. An alternate version using window functions if applicable
4. A note on query performance: any indexes that would speed this up significantly

Paste your actual schema into the prompt — even if it's messy. The more accurately you describe your table structure, the fewer query revisions you'll need. Always run AI-generated SQL on a test environment or with LIMIT 100 before executing on production data.
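
A minimal sketch of that dry-run habit: load a few representative rows into an in-memory SQLite database and run the generated query there first. The `orders` table, its columns, and the values below are illustrative placeholders, not part of any real schema.

```python
import sqlite3

# Build a throwaway in-memory copy of the schema for testing
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, user_id INTEGER, "
    "order_date TEXT, revenue REAL, status TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
    [
        (1, 10, "2024-01-05", 120.0, "complete"),
        (2, 10, "2024-02-11", 80.0, "complete"),
        (3, 11, "2024-02-12", None, "cancelled"),  # NULL revenue edge case
    ],
)

# A generated query, capped with LIMIT while testing
query = """
SELECT user_id,
       COUNT(*)                  AS orders,
       SUM(COALESCE(revenue, 0)) AS total_revenue  -- NULL handling
FROM orders
WHERE status = 'complete'
GROUP BY user_id
ORDER BY total_revenue DESC
LIMIT 100
"""
rows = conn.execute(query).fetchall()
print(rows)  # [(10, 2, 200.0)]
```

SQLite won't catch every dialect-specific issue (BigQuery and Snowflake functions differ), but it cheaply verifies the join/filter/aggregation logic before you touch production.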

2. Pandas Data Cleaning Script

Write a Python pandas script to clean the following dataset. The data is loaded from [CSV / Excel / JSON / parquet / SQL database] into a DataFrame called df.

Dataset description:
- What the data represents: [e.g., user event logs, sales transactions, customer records]
- Number of rows (approximate): [X rows]
- Number of columns: [X columns]

Known data quality issues to fix:
1. [Issue 1: e.g., "The 'email' column has ~15% missing values — these should be dropped"]
2. [Issue 2: e.g., "The 'date' column is stored as string in mixed formats: 'MM/DD/YYYY' and 'YYYY-MM-DD'"]
3. [Issue 3: e.g., "The 'revenue' column has dollar signs and commas as strings: '$1,234.56'"]
4. [Issue 4: e.g., "User IDs are duplicated — keep the record with the most recent timestamp"]
5. [Issue 5: e.g., "The 'country' column has inconsistent capitalization and abbreviations — normalize to ISO 3166-1 alpha-2"]

Column schema:
[column_name]: [dtype], [description], [example values]
[column_name]: [dtype], [description], [example values]

Write the cleaning script with:
- Each cleaning step as a separate, commented block
- Assertions or shape checks after each major step (so we know if something unexpected happens)
- A final data quality report: print statements showing row count before/after, null counts per column, and duplicate counts
- A cleaned DataFrame saved to [output format and path]

Add a 'df_raw = df.copy()' line at the start to preserve the original DataFrame. Cleaning scripts that modify in place are harder to debug — always keep a reference to the pre-cleaned state.
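
A compact sketch of the pattern this prompt produces, using made-up column names (`user_id`, `revenue`, `ts`) and toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 2],
    "revenue": ["$1,234.56", "$99.00", "$10.00"],
    "ts": ["2024-01-02", "2024-03-01", "2024-01-05"],
})
df_raw = df.copy()  # keep a pre-cleaned reference for debugging

# Step 1: strip currency formatting and cast to float
df["revenue"] = df["revenue"].str.replace(r"[$,]", "", regex=True).astype(float)

# Step 2: parse date strings to datetime
df["ts"] = pd.to_datetime(df["ts"])

# Step 3: deduplicate user_id, keeping the most recent record
df = df.sort_values("ts").drop_duplicates("user_id", keep="last")
assert df["user_id"].is_unique  # shape check after the risky step

# Final data quality report
print(f"rows: {len(df_raw)} -> {len(df)}")
```

Each step is a separate commented block with an assertion after the step most likely to silently go wrong, which is exactly the structure to ask ChatGPT for.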

3. Python Data Visualization Code

Write Python code to create a visualization for the following analysis. Use [matplotlib / seaborn / plotly / altair] as the plotting library.

What I'm visualizing: [describe the insight or relationship you want to show]
Data structure: [describe the DataFrame — relevant columns, data types, and approximate value ranges]

Visualization requirements:
- Chart type: [bar chart / line chart / scatter plot / heatmap / histogram / box plot / funnel chart]
- X-axis: [column name and what it represents]
- Y-axis: [column name and what it represents]
- Color encoding (if any): [which column drives color, and what the values mean]
- Faceting (if any): [any small multiples or subplot structure]
- Annotations: [any specific data points or thresholds to call out with labels or lines]
- Style requirements: [clean / dark background / branded colors #HEX1 #HEX2 / publication-ready]
- Figure size: [width x height in inches or pixels]
- Export format: [PNG / SVG / interactive HTML]

Include:
1. The full plotting code with comments
2. A title and axis labels with units
3. Handling for cases where data might be empty or have outliers
4. How to save the figure to disk
5. Optional: an interactive version using plotly if the static version is matplotlib/seaborn

Always describe the insight you want the chart to communicate, not just the data. A chart that clearly shows 'Month 3 is the peak churn month' is more useful than a chart that just shows monthly churn rates.
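
To illustrate with matplotlib, here is a minimal sketch that annotates the insight directly on the chart; the churn numbers are invented for the example:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May"]
churn = [2.1, 2.4, 4.8, 2.2, 2.0]  # illustrative churn rates (%)

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(months, churn, marker="o")

# Annotate the insight, not just the data point
peak = churn.index(max(churn))
ax.annotate("Month 3 is the peak churn month",
            xy=(peak, churn[peak]),
            xytext=(peak + 0.3, churn[peak]))

ax.set_title("Monthly churn rate")
ax.set_ylabel("Churn (%)")
fig.savefig("churn.png", dpi=150)
```

The annotation line is the part most generic prompts omit; spelling out the insight in your prompt is what gets it into the generated code.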

4. Statistical Analysis in Python

Help me run the following statistical analysis in Python. I need working code, interpretation guidance, and output that I can include in a report.

Analysis goal: [describe what you're trying to determine — e.g., "Is there a statistically significant difference in conversion rates between our two landing page variants?" or "What factors most strongly predict customer churn?"]

Data available:
- Sample size: [N = X observations]
- Key variables: [variable name (type: continuous/categorical), description, value range]
- Data is in a pandas DataFrame called df with columns: [list relevant columns]

Statistical test(s) to run:
- Primary test: [t-test / chi-square / ANOVA / Mann-Whitney / correlation / linear regression / logistic regression / A/B test significance]
- Why this test: [reason, or ask ChatGPT to recommend the appropriate test based on the data]
- Significance level: [α = 0.05 unless otherwise specified]
- Null hypothesis: [state it explicitly]

Write code that:
1. Checks test assumptions (normality, variance equality, sample size requirements)
2. Runs the chosen test using scipy.stats or statsmodels
3. Prints results in plain English: p-value, effect size, confidence interval
4. States in one sentence whether to reject or fail to reject the null hypothesis
5. Includes a simple visualization of the result (distribution plot, confidence interval plot, or effect size chart)

Also provide: a plain-language interpretation of the results that I could include in a presentation to a non-technical stakeholder.

Always check your test assumptions before interpreting results. A t-test run on non-normally distributed data with small samples produces misleading p-values. The assumption-checking code in this prompt is as important as the test itself.
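
A short sketch of the check-then-test flow using scipy.stats on synthetic data (the group means and sizes here are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
a = rng.normal(loc=10.0, scale=2.0, size=200)  # e.g., control metric
b = rng.normal(loc=10.6, scale=2.0, size=200)  # e.g., variant metric

# 1. Check the normality assumption before trusting the t-test
_, p_norm_a = stats.shapiro(a)
_, p_norm_b = stats.shapiro(b)

# 2. Welch's t-test (does not assume equal variances)
t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)

# 3. Plain-English result
decision = "reject" if p_value < 0.05 else "fail to reject"
print(f"t = {t_stat:.2f}, p = {p_value:.4f} -> {decision} the null hypothesis")
```

If the Shapiro-Wilk p-values are very small on a small sample, a non-parametric alternative such as `stats.mannwhitneyu` is usually the safer choice.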

5. Excel Formula Builder

Write the Excel formula(s) to accomplish the following task. I need the formula to work in [Excel 365 / Excel 2019 / Google Sheets].

Task description: [describe exactly what you want the formula to calculate or look up]

Data structure:
- Column A: [column name and example values]
- Column B: [column name and example values]
- Column C: [column name and example values]
- The formula will go in: [cell reference, e.g., D2]
- Data starts at row: [2] and the header is in row [1]

Specific requirements:
- [Requirement 1: e.g., "Return 'High', 'Medium', or 'Low' based on value ranges in column B"]
- [Requirement 2: e.g., "Handle blank cells in column C by returning 0 instead of an error"]
- [Requirement 3: e.g., "The lookup should be case-insensitive"]
- [Requirement 4: e.g., "The formula should work when copied down 500 rows without breaking"]

Provide:
1. The complete formula with no placeholders — ready to paste into the cell
2. A step-by-step breakdown of how the formula works (for each nested function)
3. An alternative formula if there's a simpler approach using newer Excel functions (XLOOKUP, LET, LAMBDA, etc.)
4. Common error cases and how the formula handles them (#N/A, #DIV/0!, #VALUE!)
5. A note on performance if the formula will run on a large dataset (10,000+ rows)

Describe the task in plain English first, then add the technical details. 'Find the most recent order date for each customer ID, excluding cancelled orders' gives ChatGPT better context than just listing column names.

6. Cohort Retention Analysis

Write Python code to perform a cohort retention analysis on the following dataset. The analysis will produce a retention matrix showing what percentage of users from each acquisition cohort are still active in subsequent periods.

Data description:
- DataFrame name: df
- User ID column: [column name]
- Date column: [column name — format: YYYY-MM-DD]
- Event type: [what counts as an 'active' event — e.g., login, purchase, page view]
- Cohort definition: [what defines a cohort — first purchase date / signup date / first event date]
- Cohort period: [monthly / weekly]
- Analysis period: [how many periods to track — e.g., 12 months from acquisition]

Write code that:
1. Creates cohort groups based on [cohort definition]
2. Calculates the period number for each subsequent event relative to the user's cohort
3. Builds a cohort retention matrix: rows = cohort, columns = period number, values = % retained
4. Generates a heatmap visualization of the matrix using seaborn with:
   - Color scale from red (low retention) to green (high retention)
   - Percentage values annotated in each cell
   - Cohort sizes shown in the row labels (e.g., "Jan 2024 (n=1,243)")
5. Prints the average retention rate per period across all cohorts
6. Identifies the cohort with the highest and lowest 90-day retention

Also provide: A brief interpretation template for reporting these results (what to look for in the heatmap, what typical retention curves look like for this business model).

Cohort analysis is only as valuable as the action it drives. Before building the heatmap, decide: what would 'good' retention look like for your business model? If you don't have a benchmark, use the analysis to identify which cohort performed best and investigate what was different about those users.
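
The core matrix calculation this prompt asks for fits in a few pandas lines. A sketch on a toy event log, with hypothetical `user_id` and `event_date` columns:

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "event_date": pd.to_datetime(
        ["2024-01-05", "2024-02-10", "2024-01-20", "2024-03-02", "2024-02-01"]),
})

# 1. Cohort = month of each user's first event
df["cohort"] = df.groupby("user_id")["event_date"].transform("min").dt.to_period("M")

# 2. Period number of each event relative to the user's cohort month
df["period"] = (df["event_date"].dt.to_period("M") - df["cohort"]).apply(lambda d: d.n)

# 3. Retention matrix: rows = cohort, columns = period, values = % retained
counts = df.groupby(["cohort", "period"])["user_id"].nunique().unstack(fill_value=0)
retention = counts.div(counts[0], axis=0) * 100
print(retention.round(1))
```

Period 0 is always 100% by construction; the interesting signal is how fast each row decays, which is what the heatmap step of the prompt visualizes.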

7. Data Pipeline Script (ETL)

Write a Python ETL (Extract, Transform, Load) script to move and transform data from [source] to [destination].

Source details:
- Source type: [CSV file / REST API / PostgreSQL database / S3 bucket / Google Sheets]
- Source location: [file path / API endpoint / database connection details as placeholders]
- Data format: [describe the schema or paste a sample row]
- Authentication: [API key / OAuth / database credentials — use placeholders like os.environ['API_KEY']]
- Update frequency: [full refresh / incremental — if incremental, what's the incremental key: timestamp or ID?]

Transformation requirements:
1. [Transformation 1: e.g., "Parse dates from 'MM-DD-YYYY' to datetime objects"]
2. [Transformation 2: e.g., "Join with a lookup table to add product category names"]
3. [Transformation 3: e.g., "Deduplicate on user_id + event_date, keeping the latest record"]
4. [Transformation 4: e.g., "Filter out test accounts (email contains '@test.com' or '@example.com')"]
5. [Transformation 5: e.g., "Calculate a derived column: revenue_per_unit = total_revenue / units_sold"]

Destination details:
- Destination type: [PostgreSQL / BigQuery / Snowflake / CSV / Google Sheets / S3]
- Table/file name: [destination table or file path]
- Write mode: [append / overwrite / upsert on key column]

Include:
- Logging at each step with row counts
- Error handling for connection failures and data type mismatches
- A dry-run flag that shows what would be written without actually writing
- Requirements: list the pip packages needed

Never hardcode credentials in ETL scripts. Use os.environ for all secrets and document which environment variables the script requires at the top of the file. This prevents accidental credential leaks when scripts are shared or committed to version control.
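
A sketch of the credentials-from-environment and dry-run pattern. `DATABASE_URL` is a placeholder variable name, and the write step is stubbed out:

```python
import logging
import os

# Required environment variables (documented at the top of the file):
#   DATABASE_URL - connection string for the destination database
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")

def load_rows(rows, dry_run=True):
    """Write rows to the destination; in dry-run mode, only report."""
    dsn = os.environ.get("DATABASE_URL")  # secrets come from the environment
    if dsn is None:
        logging.warning("DATABASE_URL not set - forcing dry run")
        dry_run = True
    logging.info("extracted %d rows", len(rows))
    if dry_run:
        logging.info("dry run: would write %d rows", len(rows))
        return 0
    # the real write (INSERT / upsert) would go here
    return len(rows)

written = load_rows([{"id": 1}, {"id": 2}])
print("rows written:", written)
```

Forcing a dry run when credentials are missing is a cheap safety net: the script degrades to a report instead of crashing or, worse, writing somewhere unintended.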

8. A/B Test Results Analysis

Analyze the results of the following A/B test and tell me whether to ship the variant, and why.

Test details:
- What was tested: [describe the change — landing page copy, button color, pricing, feature, etc.]
- Primary metric: [the one metric this test was designed to move]
- Secondary metrics: [2-3 guardrail or secondary metrics tracked alongside the primary]
- Test duration: [X days]
- Traffic split: [Control: X% / Variant: X%]

Results:
- Control (A):
  - Users: [N_control]
  - Primary metric: [value — e.g., conversion rate: 3.2%]
  - Secondary metric 1: [value]
  - Secondary metric 2: [value]
- Variant (B):
  - Users: [N_variant]
  - Primary metric: [value — e.g., conversion rate: 3.8%]
  - Secondary metric 1: [value]
  - Secondary metric 2: [value]

Calculate:
1. Statistical significance (p-value) using a two-proportion z-test
2. Relative lift on the primary metric (%)
3. Absolute difference and 95% confidence interval
4. Statistical power (was the test adequately powered for this effect size?)
5. Minimum detectable effect for this sample size (retrospective power analysis)
6. Estimated annual revenue impact if we ship the variant

Write Python code for all calculations, then provide a recommendation in plain English: Ship / Do not ship / Run longer / Segment the results (with reasoning for each possible recommendation).

Check your secondary and guardrail metrics before making a ship decision. A variant with a positive primary metric but significant drop in a guardrail metric (e.g., improved signups but doubled support tickets) often needs further investigation before shipping.
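
The two-proportion z-test in step 1 needs only the standard library. A sketch using the 3.2% vs 3.8% figures from the template above:

```python
from math import erf, sqrt

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test; returns (z, p_value, relative lift %)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled proportion
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the normal CDF: Phi(x) = 0.5*(1 + erf(x/sqrt(2)))
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    lift = (p_b - p_a) / p_a * 100
    return z, p_value, lift

# Example: 3.2% vs 3.8% conversion on 10,000 users per arm
z, p, lift = two_proportion_ztest(320, 10_000, 380, 10_000)
print(f"z = {z:.2f}, p = {p:.4f}, relative lift = {lift:.1f}%")
```

This covers significance and lift; power and confidence intervals (steps 3 to 5) are where statsmodels is worth pulling in rather than hand-rolling.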

9. SQL Window Functions Explainer & Template

Write SQL window function examples for the following use cases. Use [PostgreSQL / BigQuery / Snowflake] syntax. Include working examples using a realistic sample schema.

Sample schema to use (or replace with mine below):
- Table: orders (order_id INT, user_id INT, order_date DATE, revenue DECIMAL, status VARCHAR)
- Table: users (user_id INT, signup_date DATE, country VARCHAR, plan VARCHAR)

[Paste your actual schema here if different]

Generate working SQL examples for each window function use case:

1. **Running total**: Cumulative revenue per user ordered by date
2. **Rank within group**: Rank each order by revenue within each user's history
3. **Lag/Lead**: Compare each order's revenue to the previous order for the same user
4. **Moving average**: 7-day rolling average of daily revenue
5. **Percentile rank**: What percentile does each user's total spend fall into?
6. **First/Last value**: Get each user's first order date and most recent order date in a single query
7. **N-th value**: Get the 3rd highest revenue order per user
8. **Row number with deduplication**: Keep only the most recent record per user_id

For each example:
- Write the complete SQL query (not a fragment)
- Add inline comments explaining the PARTITION BY and ORDER BY logic
- Note when NOT to use a window function (when a simpler aggregate would do)

Finally: write a combined query that uses 3+ window functions together to produce a user-level summary table.

Window functions that partition on high-cardinality columns (like user_id on billions of rows) can be very slow. Always check query execution plans and consider pre-aggregating data in a CTE before applying window functions to large datasets.
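
SQLite (3.25+) supports window functions, so Python's stdlib makes a handy scratchpad for testing generated examples. A sketch of use case 8, deduplication with ROW_NUMBER, on the sample `orders` schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INT, user_id INT, order_date TEXT, revenue REAL)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", [
    (1, 10, "2024-01-01", 50.0),
    (2, 10, "2024-02-01", 75.0),
    (3, 11, "2024-01-15", 40.0),
])

# Keep only the most recent record per user_id
rows = conn.execute("""
    SELECT order_id, user_id, order_date, revenue
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY user_id      -- restart numbering per user
                   ORDER BY order_date DESC  -- newest first within each user
               ) AS rn
        FROM orders
    )
    WHERE rn = 1
    ORDER BY user_id
""").fetchall()
print(rows)  # [(2, 10, '2024-02-01', 75.0), (3, 11, '2024-01-15', 40.0)]
```

The same subquery-then-filter shape works in PostgreSQL, BigQuery, and Snowflake; BigQuery additionally offers `QUALIFY` to filter on the window column without a subquery.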

10. Machine Learning Model Selection Guide

Help me choose the right machine learning model and write the starter code for the following prediction task.

Problem description: [describe what you want to predict — e.g., "Predict which users will churn in the next 30 days" or "Classify customer support tickets by department"]

Problem type: [binary classification / multi-class classification / regression / clustering / anomaly detection / time series forecasting]

Data available:
- Target variable (Y): [column name, type, and value distribution — e.g., "churned: 0/1, ~8% positive rate"]
- Features (X): [list your input columns with types: continuous, categorical, ordinal, text, datetime]
- Dataset size: [X rows × Y columns]
- Known data issues: [class imbalance, missing values, high cardinality categoricals, correlated features]
- Training/inference environment: [scikit-learn, XGBoost, LightGBM, PyTorch, cloud ML platform]

Requirements:
- Model interpretability needed: [yes — must explain predictions / no — black box is fine]
- Latency requirement: [real-time inference (<100ms) / batch overnight]
- Baseline to beat: [current accuracy / current business metric, e.g., "rule-based model catches 40% of churners"]

Provide:
1. A recommended model (with runner-up alternative) and 3 reasons why
2. Starter scikit-learn pipeline code: preprocessing, model fitting, cross-validation, and evaluation metrics appropriate for this problem type
3. The 3 most important hyperparameters to tune for the recommended model
4. A feature importance extraction snippet
5. Common pitfalls for this specific problem type

Always build a simple baseline model first (logistic regression, decision tree, or just 'always predict the majority class'). Your sophisticated model needs to beat this baseline by enough to justify the added complexity and maintenance cost.
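
The majority-class baseline takes three lines and makes the imbalance trap concrete. Labels below are synthetic, matching the ~8% positive rate from the churn example:

```python
from collections import Counter

# Hypothetical labels: ~8% churners, 92% non-churners
y_true = [1] * 8 + [0] * 92

# Baseline: always predict the majority class
majority = Counter(y_true).most_common(1)[0][0]
baseline_acc = sum(1 for y in y_true if y == majority) / len(y_true)
print(f"majority-class accuracy: {baseline_acc:.2f}")  # 0.92

# The catch: this "92% accurate" baseline catches zero churners,
# so a real model must beat it on recall/precision, not raw accuracy.
baseline_recall = 0.0
```

If your sophisticated model reports 93% accuracy on this data, it has barely improved on predicting "nobody churns", which is why the prompt asks for evaluation metrics appropriate to the problem type.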

11. Data Dictionary Generator

Generate a data dictionary for the following database table or dataset. This documentation will be shared with new analysts, engineers, and stakeholders who need to understand the data.

Table/dataset name: [name]
Database or data warehouse: [PostgreSQL / BigQuery / Snowflake / etc.]
Purpose of this table: [what business process does this data represent?]
Source system: [where does this data come from? ETL, application database, API, manual upload?]
Update frequency: [real-time / hourly / daily / weekly]
Grain: [one row = one what? e.g., "one row per user per day" or "one row per transaction"]

Columns to document:
[For each column, provide:]
- Column name: [exact_column_name]
- Sample values: [value1, value2, value3]
- Approximate null rate: [X%]
- Notes: [anything unusual — e.g., this field was added in 2024, legacy values may be NULL]

Generate a data dictionary with:
1. A table overview section (purpose, grain, update schedule, owner)
2. A column-by-column table with: name, data type, description, business definition, example values, valid range or allowed values, null policy, and relationships to other tables
3. Common pitfalls or known data quality issues
4. 5 example queries that answer the most common questions analysts ask about this table
5. A "how not to use this table" section — common misinterpretations or query mistakes

Data dictionaries only have value if they're kept up to date. Add a 'Last reviewed' field and assign an owner. A stale data dictionary is often worse than none — it creates false confidence in outdated information.
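
Rather than typing dtypes and null rates from memory, you can bootstrap the column inputs for this prompt from the data itself. A sketch with an invented three-column DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "email": ["a@x.com", None, "c@x.com"],
    "signup_date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
})

# Draft dictionary skeleton: real dtypes, null rates, and sample values
dictionary = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "null_rate_pct": (df.isna().mean() * 100).round(1),
    "example": df.apply(lambda col: col.dropna().iloc[0]),
    "description": "",  # to be filled in by the table owner
})
print(dictionary)
```

Paste the printed table into the prompt's column section; the human-written parts (business definitions, known pitfalls) are then the only blanks left to fill.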

12. Dashboard KPI Calculation Spec

Write a calculation specification document for the following KPIs, which will be built into a BI dashboard in [Tableau / Looker / Power BI / Metabase / Redash].

Dashboard name: [name]
Audience: [who will use this dashboard and for what decisions?]
Data source: [database name and main tables]

For each KPI below, write a complete specification:

KPIs to specify:
1. [KPI name: e.g., Monthly Active Users]
2. [KPI name: e.g., Customer Acquisition Cost]
3. [KPI name: e.g., Net Revenue Retention]
4. [KPI name: e.g., Average Order Value]
5. [KPI name: e.g., Support Ticket Resolution Time]

For each KPI, provide:
- **Business definition**: What does this measure in plain English?
- **Formula**: The exact mathematical formula (no SQL yet — just logic)
- **SQL snippet**: A ready-to-use calculation in [database dialect] SQL
- **Numerator logic**: Exactly what to count/sum in the numerator, including any filters
- **Denominator logic**: What to divide by, including edge cases (what if denominator is 0?)
- **Time logic**: Is this period-over-period? Rolling? Point-in-time?
- **Filters and exclusions**: What data is excluded and why (test accounts, refunded orders, etc.)
- **Known edge cases**: What would cause this metric to look wrong, and how to flag it
- **Related metrics**: What other KPIs should be shown alongside this one for context

End with: a data model diagram description (tables → transformations → final KPI layer) for the BI engineer building this.

The most common source of dashboard debates is ambiguous KPI definitions — two people calculating 'active users' differently and getting different numbers. This spec document prevents that by making every decision explicit before anyone writes a query.

How to Use These Prompts

Copy any prompt into ChatGPT or Claude and fill in every bracketed variable with your actual schema, data types, and business context. The more specific your inputs — actual column names, real data types, sample values — the more production-ready the output code will be. Always test AI-generated SQL and Python on a sample dataset before running on production data. Use Prompt Anything Pro to run these prompts directly in your browser while working in your BI tool, Jupyter notebook, or database console.

