ChatGPT Prompts for Data Analysis
Generate SQL queries, Python pandas code, Excel formulas, and statistical analysis — with AI prompts built for analysts who need working code, not generic advice.
Data analysts spend significant time on repetitive tasks: writing boilerplate SQL queries, cleaning messy datasets, building the same aggregation patterns in pandas, or constructing complex Excel formulas. ChatGPT can accelerate all of these with the right prompts — but generic prompts produce generic code. These prompts are built for analysts who need working, production-ready code snippets that they can adapt to their actual schema and data. Each prompt includes schema inputs, data type specifications, and edge case handling to produce useful output rather than toy examples.
SQL Query for Business Analysis
Write a SQL query to answer the following business question: [describe the specific business question]

Database context:
- Database type: [PostgreSQL / MySQL / BigQuery / Snowflake / SQL Server / SQLite]
- Relevant tables and schema:
  - Table 1: [table_name] — columns: [col1 (type)], [col2 (type)], [col3 (type)]
  - Table 2: [table_name] — columns: [col1 (type)], [col2 (type)], [col3 (type)]
  - Table 3 (if needed): [table_name] — columns: [col1 (type)], [col2 (type)]
- Primary keys and foreign keys: [describe relationships between tables]
- Date column format: [YYYY-MM-DD / timestamp / Unix epoch]
- Data volume (approximate): [X rows in main table]

Query requirements:
- Time period: [specific date range or relative period like "last 30 days" or "current quarter"]
- Grouping: [what dimensions to group by — e.g., by user, by product category, by week]
- Aggregations needed: [sum, count distinct, average, percentile, etc.]
- Filters: [any WHERE conditions — status, user type, minimum threshold, etc.]
- Output columns: [exactly what columns should appear in the result]
- Sort order: [how to order results]

Write the query with:
1. Comments explaining each major section (CTE, JOIN, filter)
2. Handling for NULL values where relevant
3. An alternate version using window functions if applicable
4. A note on query performance: any indexes that would speed this up significantly
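A minimal sketch of the kind of query this prompt should produce, run here against SQLite with a made-up `orders` table (the table, columns, and values are illustrative only):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INT, user_id INT, order_date TEXT, revenue REAL, status TEXT);
INSERT INTO orders VALUES
  (1, 10, '2024-01-05', 100.0, 'complete'),
  (2, 10, '2024-01-20', 50.0,  'complete'),
  (3, 11, '2024-01-07', 80.0,  'refunded'),
  (4, 11, '2024-02-02', 120.0, 'complete');
""")

# CTE filters first, then aggregates per user; COALESCE guards against NULL revenue.
query = """
WITH completed AS (
    SELECT user_id, COALESCE(revenue, 0) AS revenue   -- filter section
    FROM orders
    WHERE status = 'complete'
)
SELECT user_id,
       COUNT(*)     AS order_count,                   -- aggregation section
       SUM(revenue) AS total_revenue
FROM completed
GROUP BY user_id
ORDER BY total_revenue DESC;
"""
for row in conn.execute(query):
    print(row)
```

The same shape (commented CTE, NULL handling, explicit sort) is what to ask for against your real schema.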
Pandas Data Cleaning Script
Write a Python pandas script to clean the following dataset. The data is loaded from [CSV / Excel / JSON / parquet / SQL database] into a DataFrame called df.

Dataset description:
- What the data represents: [e.g., user event logs, sales transactions, customer records]
- Number of rows (approximate): [X rows]
- Number of columns: [X columns]

Known data quality issues to fix:
1. [Issue 1: e.g., "The 'email' column has ~15% missing values — these should be dropped"]
2. [Issue 2: e.g., "The 'date' column is stored as string in mixed formats: 'MM/DD/YYYY' and 'YYYY-MM-DD'"]
3. [Issue 3: e.g., "The 'revenue' column has dollar signs and commas as strings: '$1,234.56'"]
4. [Issue 4: e.g., "User IDs are duplicated — keep the record with the most recent timestamp"]
5. [Issue 5: e.g., "The 'country' column has inconsistent capitalization and abbreviations — normalize to ISO 3166-1 alpha-2"]

Column schema:
- [column_name]: [dtype], [description], [example values]
- [column_name]: [dtype], [description], [example values]

Write the cleaning script with:
- Each cleaning step as a separate, commented block
- Assertions or shape checks after each major step (so we know if something unexpected happens)
- A final data quality report: print statements showing row count before/after, null counts per column, and duplicate counts
- A cleaned DataFrame saved to [output format and path]
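A sketch of two of the cleaning steps the prompt describes (currency strings and duplicate IDs), on toy data with hypothetical column names `user_id`, `ts`, and `revenue`:

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 2],
    "ts": ["2024-01-01", "2024-02-01", "2024-01-15"],
    "revenue": ["$1,234.56", "$99.00", "$10.50"],
})
rows_before = len(df)

# Step 1: strip currency formatting and cast to float.
df["revenue"] = df["revenue"].str.replace(r"[$,]", "", regex=True).astype(float)

# Step 2: deduplicate on user_id, keeping the most recent record by timestamp.
df["ts"] = pd.to_datetime(df["ts"])
df = df.sort_values("ts").drop_duplicates("user_id", keep="last")

# Shape check so silent data loss is caught early.
assert len(df) <= rows_before
print(f"rows: {rows_before} -> {len(df)}")
```

Each real cleaning step gets the same treatment: one commented block, followed by an assertion or shape check.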
Python Data Visualization Code
Write Python code to create a visualization for the following analysis. Use [matplotlib / seaborn / plotly / altair] as the plotting library.

What I'm visualizing: [describe the insight or relationship you want to show]

Data structure: [describe the DataFrame — relevant columns, data types, and approximate value ranges]

Visualization requirements:
- Chart type: [bar chart / line chart / scatter plot / heatmap / histogram / box plot / funnel chart]
- X-axis: [column name and what it represents]
- Y-axis: [column name and what it represents]
- Color encoding (if any): [which column drives color, and what the values mean]
- Faceting (if any): [any small multiples or subplot structure]
- Annotations: [any specific data points or thresholds to call out with labels or lines]
- Style requirements: [clean / dark background / branded colors #HEX1 #HEX2 / publication-ready]
- Figure size: [width x height in inches or pixels]
- Export format: [PNG / SVG / interactive HTML]

Include:
1. The full plotting code with comments
2. A title and axis labels with units
3. Handling for cases where data might be empty or have outliers
4. How to save the figure to disk
5. Optional: an interactive version using plotly if the static version is matplotlib/seaborn
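A minimal matplotlib version of what the prompt should return, including the empty-data guard and labeled axes it asks for (the columns and values are placeholders):

```python
import io

import matplotlib
matplotlib.use("Agg")  # render off-screen so this runs headless
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"week": ["W1", "W2", "W3"], "signups": [120, 95, 140]})

if df.empty:
    raise ValueError("No data to plot")  # the empty-data guard the prompt asks for

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(df["week"], df["signups"], color="#4C72B0")
ax.set_title("Weekly signups")
ax.set_xlabel("Week")
ax.set_ylabel("Signups (count)")                          # axis label with units
ax.axhline(df["signups"].mean(), ls="--", color="gray")   # threshold annotation

buf = io.BytesIO()                       # swap in a real path to save to disk
fig.savefig(buf, format="png", dpi=150)
```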
Statistical Analysis in Python
Help me run the following statistical analysis in Python. I need working code, interpretation guidance, and output that I can include in a report.

Analysis goal: [describe what you're trying to determine — e.g., "Is there a statistically significant difference in conversion rates between our two landing page variants?" or "What factors most strongly predict customer churn?"]

Data available:
- Sample size: [N = X observations]
- Key variables: [variable name (type: continuous/categorical), description, value range]
- Data is in a pandas DataFrame called df with columns: [list relevant columns]

Statistical test(s) to run:
- Primary test: [t-test / chi-square / ANOVA / Mann-Whitney / correlation / linear regression / logistic regression / A/B test significance]
- Why this test: [reason, or ask ChatGPT to recommend the appropriate test based on the data]
- Significance level: [α = 0.05 unless otherwise specified]
- Null hypothesis: [state it explicitly]

Write code that:
1. Checks test assumptions (normality, variance equality, sample size requirements)
2. Runs the chosen test using scipy.stats or statsmodels
3. Prints results in plain English: p-value, effect size, confidence interval
4. States in one sentence whether to reject or fail to reject the null hypothesis
5. Includes a simple visualization of the result (distribution plot, confidence interval plot, or effect size chart)

Also provide: a plain-language interpretation of the results that I could include in a presentation to a non-technical stakeholder.
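scipy.stats covers the standard tests named above; as a dependency-free illustration of the underlying logic, here is a permutation test for a difference in means on made-up numbers, ending with the reject / fail-to-reject sentence the prompt asks for:

```python
import random
import statistics

a = [3.1, 2.9, 3.4, 3.8, 3.0, 3.3]   # e.g., control group values
b = [3.9, 4.1, 3.7, 4.4, 3.8, 4.0]   # e.g., variant group values
observed = statistics.mean(b) - statistics.mean(a)

# Null hypothesis: the two groups share one distribution, so group labels
# are exchangeable. Shuffle labels many times and see how often a difference
# at least as large as the observed one appears by chance.
random.seed(0)
pooled = a + b
n_iter = 10_000
n_extreme = 0
for _ in range(n_iter):
    random.shuffle(pooled)
    diff = statistics.mean(pooled[len(a):]) - statistics.mean(pooled[:len(a)])
    if abs(diff) >= abs(observed):
        n_extreme += 1

p_value = n_extreme / n_iter
print(f"observed diff = {observed:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null hypothesis of equal means at α = 0.05.")
else:
    print("Fail to reject the null hypothesis at α = 0.05.")
```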
Excel Formula Builder
Write the Excel formula(s) to accomplish the following task. I need the formula to work in [Excel 365 / Excel 2019 / Google Sheets].

Task description: [describe exactly what you want the formula to calculate or look up]

Data structure:
- Column A: [column name and example values]
- Column B: [column name and example values]
- Column C: [column name and example values]
- The formula will go in: [cell reference, e.g., D2]
- Data starts at row: [2] and the header is in row [1]

Specific requirements:
- [Requirement 1: e.g., "Return 'High', 'Medium', or 'Low' based on value ranges in column B"]
- [Requirement 2: e.g., "Handle blank cells in column C by returning 0 instead of an error"]
- [Requirement 3: e.g., "The lookup should be case-insensitive"]
- [Requirement 4: e.g., "The formula should work when copied down 500 rows without breaking"]

Provide:
1. The complete formula with no placeholders — ready to paste into the cell
2. A step-by-step breakdown of how the formula works (for each nested function)
3. An alternative formula if there's a simpler approach using newer Excel functions (XLOOKUP, LET, LAMBDA, etc.)
4. Common error cases and how the formula handles them (#N/A, #DIV/0!, #VALUE!)
5. A note on performance if the formula will run on a large dataset (10,000+ rows)
Cohort Retention Analysis
Write Python code to perform a cohort retention analysis on the following dataset. The analysis will produce a retention matrix showing what percentage of users from each acquisition cohort are still active in subsequent periods.

Data description:
- DataFrame name: df
- User ID column: [column name]
- Date column: [column name — format: YYYY-MM-DD]
- Event type: [what counts as an 'active' event — e.g., login, purchase, page view]
- Cohort definition: [what defines a cohort — first purchase date / signup date / first event date]
- Cohort period: [monthly / weekly]
- Analysis period: [how many periods to track — e.g., 12 months from acquisition]

Write code that:
1. Creates cohort groups based on [cohort definition]
2. Calculates the period number for each subsequent event relative to the user's cohort
3. Builds a cohort retention matrix: rows = cohort, columns = period number, values = % retained
4. Generates a heatmap visualization of the matrix using seaborn with:
   - Color scale from red (low retention) to green (high retention)
   - Percentage values annotated in each cell
   - Cohort sizes shown in the row labels (e.g., "Jan 2024 (n=1,243)")
5. Prints the average retention rate per period across all cohorts
6. Identifies the cohort with the highest and lowest 90-day retention

Also provide: a brief interpretation template for reporting these results (what to look for in the heatmap, what typical retention curves look like for this business model).
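The core of steps 1–3 fits in a few lines of pandas; a sketch on toy data, assuming monthly cohorts defined by first event date and columns named `user_id` and `event_date`:

```python
import pandas as pd

df = pd.DataFrame({
    "user_id":    [1, 1, 1, 2, 2, 3],
    "event_date": ["2024-01-03", "2024-02-10", "2024-03-05",
                   "2024-01-20", "2024-02-25", "2024-02-01"],
})
df["event_date"] = pd.to_datetime(df["event_date"])

# Cohort = month of each user's first event; period = months since that month.
first = df.groupby("user_id")["event_date"].transform("min")
df["cohort"] = first.dt.to_period("M")
df["period"] = (df["event_date"].dt.year - first.dt.year) * 12 \
             + (df["event_date"].dt.month - first.dt.month)

# Retention matrix: distinct active users per (cohort, period),
# divided by the cohort's period-0 size.
counts = df.groupby(["cohort", "period"])["user_id"].nunique().unstack(fill_value=0)
retention = counts.div(counts[0], axis=0).round(2)
print(retention)
```

The resulting matrix feeds directly into `seaborn.heatmap` with `annot=True` for step 4.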
Data Pipeline Script (ETL)
Write a Python ETL (Extract, Transform, Load) script to move and transform data from [source] to [destination].

Source details:
- Source type: [CSV file / REST API / PostgreSQL database / S3 bucket / Google Sheets]
- Source location: [file path / API endpoint / database connection details as placeholders]
- Data format: [describe the schema or paste a sample row]
- Authentication: [API key / OAuth / database credentials — use placeholders like os.environ['API_KEY']]
- Update frequency: [full refresh / incremental — if incremental, what's the incremental key: timestamp or ID?]

Transformation requirements:
1. [Transformation 1: e.g., "Parse dates from 'MM-DD-YYYY' to datetime objects"]
2. [Transformation 2: e.g., "Join with a lookup table to add product category names"]
3. [Transformation 3: e.g., "Deduplicate on user_id + event_date, keeping the latest record"]
4. [Transformation 4: e.g., "Filter out test accounts (email contains '@test.com' or '@example.com')"]
5. [Transformation 5: e.g., "Calculate a derived column: revenue_per_unit = total_revenue / units_sold"]

Destination details:
- Destination type: [PostgreSQL / BigQuery / Snowflake / CSV / Google Sheets / S3]
- Table/file name: [destination table or file path]
- Write mode: [append / overwrite / upsert on key column]

Include:
- Logging at each step with row counts
- Error handling for connection failures and data type mismatches
- A dry-run flag that shows what would be written without actually writing
- Requirements: list the pip packages needed
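A minimal skeleton of the shape this prompt should produce, with the step logging, row counts, and dry-run flag it asks for; the in-memory CSV source and the test-account filter are stand-ins for your real source and transformations:

```python
import csv
import io
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("etl")

SOURCE_CSV = io.StringIO("user_id,email,revenue\n1,a@x.com,10\n2,b@test.com,20\n")
DRY_RUN = True  # show what would be written without actually writing

def extract():
    rows = list(csv.DictReader(SOURCE_CSV))
    log.info("extracted %d rows", len(rows))
    return rows

def transform(rows):
    # Filter out test accounts and cast revenue to float.
    kept = [
        {**r, "revenue": float(r["revenue"])}
        for r in rows
        if "@test.com" not in r["email"]
    ]
    log.info("transform: kept %d of %d rows", len(kept), len(rows))
    return kept

def load(rows):
    if DRY_RUN:
        log.info("dry run: would write %d rows", len(rows))
        return
    # Real destination write (database insert, file upload, ...) goes here.

load(transform(extract()))
```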
A/B Test Results Analysis
Analyze the results of the following A/B test and tell me whether to ship the variant, and why.

Test details:
- What was tested: [describe the change — landing page copy, button color, pricing, feature, etc.]
- Primary metric: [the one metric this test was designed to move]
- Secondary metrics: [2-3 guardrail or secondary metrics tracked alongside the primary]
- Test duration: [X days]
- Traffic split: [Control: X% / Variant: X%]

Results:
- Control (A):
  - Users: [N_control]
  - Primary metric: [value — e.g., conversion rate: 3.2%]
  - Secondary metric 1: [value]
  - Secondary metric 2: [value]
- Variant (B):
  - Users: [N_variant]
  - Primary metric: [value — e.g., conversion rate: 3.8%]
  - Secondary metric 1: [value]
  - Secondary metric 2: [value]

Calculate:
1. Statistical significance (p-value) using a two-proportion z-test
2. Relative lift on the primary metric (%)
3. Absolute difference and 95% confidence interval
4. Statistical power (was the test adequately powered for this effect size?)
5. Minimum detectable effect for this sample size (retrospective power analysis)
6. Estimated annual revenue impact if we ship the variant

Write Python code for all calculations, then provide a recommendation in plain English: Ship / Do not ship / Run longer / Segment the results (with reasoning for each possible recommendation).
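The significance step is simple enough to check by hand. A stdlib sketch of the two-proportion z-test, using the illustrative 3.2% vs 3.8% numbers from the template with an assumed 10,000 users per arm:

```python
from math import erf, sqrt

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test; returns (z, p_value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under the null
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Normal CDF via erf; two-sided p-value.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

z, p = two_proportion_z(conv_a=320, n_a=10_000, conv_b=380, n_b=10_000)
lift = (380 / 10_000 - 320 / 10_000) / (320 / 10_000)
print(f"z = {z:.2f}, p = {p:.4f}, relative lift = {lift:.1%}")
```

With these numbers the lift is 18.75% relative and the result is significant at α = 0.05; the power and confidence-interval steps still matter before deciding to ship.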
SQL Window Functions Explainer & Template
Write SQL window function examples for the following use cases. Use [PostgreSQL / BigQuery / Snowflake] syntax. Include working examples using a realistic sample schema.

Sample schema to use (or replace with mine below):
- Table: orders (order_id INT, user_id INT, order_date DATE, revenue DECIMAL, status VARCHAR)
- Table: users (user_id INT, signup_date DATE, country VARCHAR, plan VARCHAR)
- [Paste your actual schema here if different]

Generate working SQL examples for each window function use case:
1. **Running total**: Cumulative revenue per user ordered by date
2. **Rank within group**: Rank each order by revenue within each user's history
3. **Lag/Lead**: Compare each order's revenue to the previous order for the same user
4. **Moving average**: 7-day rolling average of daily revenue
5. **Percentile rank**: What percentile does each user's total spend fall into?
6. **First/Last value**: Get each user's first order date and most recent order date in a single query
7. **N-th value**: Get the 3rd highest revenue order per user
8. **Row number with deduplication**: Keep only the most recent record per user_id

For each example:
- Write the complete SQL query (not a fragment)
- Add inline comments explaining the PARTITION BY and ORDER BY logic
- Note when NOT to use a window function (when a simpler aggregate would do)

Finally: write a combined query that uses 3+ window functions together to produce a user-level summary table.
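Two of these patterns (running total and Lag) run verbatim in SQLite, which also supports window functions, using the sample orders schema above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INT, user_id INT, order_date DATE, revenue DECIMAL, status VARCHAR);
INSERT INTO orders VALUES
  (1, 10, '2024-01-05', 100, 'complete'),
  (2, 10, '2024-01-20', 50,  'complete'),
  (3, 10, '2024-02-02', 70,  'complete');
""")

sql = """
SELECT order_id,
       -- Running total: PARTITION BY restarts the sum per user,
       -- ORDER BY makes it cumulative in date order.
       SUM(revenue) OVER (PARTITION BY user_id ORDER BY order_date) AS running_total,
       -- LAG: each order's revenue minus the same user's previous order.
       revenue - LAG(revenue) OVER (PARTITION BY user_id ORDER BY order_date) AS delta_vs_prev
FROM orders
ORDER BY user_id, order_date;
"""
for row in conn.execute(sql):
    print(row)
```

Note the first row's `delta_vs_prev` is NULL, since LAG has nothing to look back to; COALESCE it if a default is needed.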
Machine Learning Model Selection Guide
Help me choose the right machine learning model and write the starter code for the following prediction task.

Problem description: [describe what you want to predict — e.g., "Predict which users will churn in the next 30 days" or "Classify customer support tickets by department"]

Problem type: [binary classification / multi-class classification / regression / clustering / anomaly detection / time series forecasting]

Data available:
- Target variable (Y): [column name, type, and value distribution — e.g., "churned: 0/1, ~8% positive rate"]
- Features (X): [list your input columns with types: continuous, categorical, ordinal, text, datetime]
- Dataset size: [X rows × Y columns]
- Known data issues: [class imbalance, missing values, high cardinality categoricals, correlated features]
- Training/inference environment: [scikit-learn, XGBoost, LightGBM, PyTorch, cloud ML platform]

Requirements:
- Model interpretability needed: [yes — must explain predictions / no — black box is fine]
- Latency requirement: [real-time inference (<100ms) / batch overnight]
- Baseline to beat: [current accuracy / current business metric, e.g., "rule-based model catches 40% of churners"]

Provide:
1. A recommended model (with runner-up alternative) and 3 reasons why
2. Starter scikit-learn pipeline code: preprocessing, model fitting, cross-validation, and evaluation metrics appropriate for this problem type
3. The 3 most important hyperparameters to tune for the recommended model
4. A feature importance extraction snippet
5. Common pitfalls for this specific problem type
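One pitfall worth stating in the prompt's "known data issues" line is class imbalance: with the ~8% positive rate from the example, a model that always predicts "no churn" scores about 92% accuracy while catching zero churners. A dependency-free sketch of why accuracy is the wrong evaluation metric here:

```python
# Toy labels with an 8% positive rate, mirroring the churn example above.
y_true = [1] * 8 + [0] * 92
y_pred = [0] * 100   # majority-class "model": always predict no churn

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_pos / sum(y_true)   # fraction of actual churners caught

print(f"accuracy = {accuracy:.0%}, recall = {recall:.0%}")
```

This is why the prompt asks for "evaluation metrics appropriate for this problem type": for imbalanced classification, ask for precision, recall, and PR-AUC rather than accuracy.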
Data Dictionary Generator
Generate a data dictionary for the following database table or dataset. This documentation will be shared with new analysts, engineers, and stakeholders who need to understand the data.

Table/dataset name: [name]
Database or data warehouse: [PostgreSQL / BigQuery / Snowflake / etc.]
Purpose of this table: [what business process does this data represent?]
Source system: [where does this data come from? ETL, application database, API, manual upload?]
Update frequency: [real-time / hourly / daily / weekly]
Grain: [one row = one what? e.g., "one row per user per day" or "one row per transaction"]

Columns to document (for each column, provide):
- Column name: [exact_column_name]
- Sample values: [value1, value2, value3]
- Approximate null rate: [X%]
- Notes: [anything unusual — e.g., this field was added in 2024, legacy values may be NULL]

Generate a data dictionary with:
1. A table overview section (purpose, grain, update schedule, owner)
2. A column-by-column table with: name, data type, description, business definition, example values, valid range or allowed values, null policy, and relationships to other tables
3. Common pitfalls or known data quality issues
4. 5 example queries that answer the most common questions analysts ask about this table
5. A "how not to use this table" section — common misinterpretations or query mistakes
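The mechanical inputs (dtypes, null rates, sample values) can be pulled straight from the data before asking ChatGPT to fill in business definitions. A pandas sketch on a toy table:

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "country": ["US", "DE", None, "US"],
    "revenue": [10.0, 20.5, 0.0, 7.25],
})

# One row per column: dtype, null rate, and up to 3 sample values.
dictionary = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "null_rate": df.isna().mean().round(2),
    "sample_values": [df[c].dropna().unique()[:3].tolist() for c in df.columns],
})
print(dictionary)
```

Paste the printed table into the "Columns to document" section of the prompt so the generated dictionary starts from real values rather than guesses.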
Dashboard KPI Calculation Spec
Write a calculation specification document for the following KPIs that will be built into a BI dashboard [Tableau / Looker / Power BI / Metabase / Redash].

Dashboard name: [name]
Audience: [who will use this dashboard and for what decisions?]
Data source: [database name and main tables]

KPIs to specify:
1. [KPI name: e.g., Monthly Active Users]
2. [KPI name: e.g., Customer Acquisition Cost]
3. [KPI name: e.g., Net Revenue Retention]
4. [KPI name: e.g., Average Order Value]
5. [KPI name: e.g., Support Ticket Resolution Time]

For each KPI, provide:
- **Business definition**: What does this measure in plain English?
- **Formula**: The exact mathematical formula (no SQL yet — just logic)
- **SQL snippet**: A ready-to-use calculation in [database dialect] SQL
- **Numerator logic**: Exactly what to count/sum in the numerator, including any filters
- **Denominator logic**: What to divide by, including edge cases (what if denominator is 0?)
- **Time logic**: Is this period-over-period? Rolling? Point-in-time?
- **Filters and exclusions**: What data is excluded and why (test accounts, refunded orders, etc.)
- **Known edge cases**: What would cause this metric to look wrong, and how to flag it
- **Related metrics**: What other KPIs should be shown alongside this one for context

End with: a data model diagram description (tables → transformations → final KPI layer) for the BI engineer building this.
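The zero-denominator edge case is worth pinning down explicitly in the spec. As an illustration, Average Order Value with an empty period (names and numbers are made up):

```python
def average_order_value(total_revenue, order_count):
    """AOV = total revenue / order count; return None instead of dividing by zero."""
    if order_count == 0:
        return None  # surface "no data" rather than erroring or showing a misleading 0
    return total_revenue / order_count

print(average_order_value(12_500.0, 250))  # a normal period
print(average_order_value(0.0, 0))         # an empty period
```

In the BI layer, the same decision shows up as NULLIF(order_count, 0) in the denominator of the SQL snippet; the spec should say which behavior (NULL, 0, or "n/a") the dashboard displays.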
How to Use These Prompts
Copy any prompt into ChatGPT or Claude and fill in every bracketed variable with your actual schema, data types, and business context. The more specific your inputs — actual column names, real data types, sample values — the more production-ready the output code will be. Always test AI-generated SQL and Python on a sample dataset before running on production data. Use Prompt Anything Pro to run these prompts directly in your browser while working in your BI tool, Jupyter notebook, or database console.