Data Analysis
Use Claude Code for data analysis, processing, and visualization.
Basic Data Analysis
# Start Claude Code
claude
# Analyze a CSV file
You: Analyze data/sales.csv and give me a sales trend report
Claude: Let me read and analyze the data...
Claude wants to:
• Read file: data/sales.csv
Allow? [y/n/always/never]
You: y
Claude: Data overview:
- Total records: 1,245
- Date range: 2025-01-01 to 2025-12-31
- Fields: date, product, quantity, revenue, region
Key findings:
1. Total revenue: $2,456,789
2. Average order: $1,972
3. Best-selling product: Product A (345 units)
4. Highest revenue region: North ($892,345)
Trend analysis:
- Q1: $456,789 (18.6%)
- Q2: $678,234 (27.6%) ↑
- Q3: $589,123 (24.0%) ↓
- Q4: $732,643 (29.8%) ↑
Would you like me to generate a detailed report or visualization?Data Cleaning
You: Clean data/users.csv, remove duplicates, fill missing values
Claude: Let me process the data...
Claude wants to:
• Read file: data/users.csv
Allow? [y/n/always/never]
You: y
Claude: Data quality check:
Issues found:
1. Duplicate records: 23 (1.8%)
2. Missing emails: 45 (3.6%)
3. Invalid dates: 12 (0.9%)
4. Outlier ages: 8 (age > 120)
Cleaning plan:
```python
import pandas as pd
# Read data
df = pd.read_csv('data/users.csv')
# Remove duplicates
df = df.drop_duplicates(subset=['user_id'])
# Fill missing emails
df['email'] = df['email'].fillna('unknown@example.com')
# Fix invalid dates
df['created_at'] = pd.to_datetime(df['created_at'], errors='coerce')
# Filter outlier ages
df = df[(df['age'] >= 0) & (df['age'] <= 120)]
# Save cleaned data
df.to_csv('data/users_cleaned.csv', index=False)Execute cleaning?
You: y
Claude wants to: • Write file: scripts/clean_data.py • Run command: python scripts/clean_data.py Allow? [y/n/always/never]
You: y
Claude: Data cleaning complete!
Results:
- Original records: 1,267
- After cleaning: 1,224
- Duplicates removed: 23
- Missing values fixed: 45
- Outliers filtered: 20
Output file: data/users_cleaned.csv
## Statistical Analysis
```bash
You: Perform statistical analysis on data/experiments.csv, including mean, median, and standard deviation
Claude: Let me perform the statistical analysis...
Claude wants to:
• Read file: data/experiments.csv
Allow? [y/n/always/never]
You: y
Claude: Statistical analysis report:
Descriptive statistics:Variable: response_time (ms)
- Sample size: 500
- Mean: 245.6 ms
- Median: 238.0 ms
- Standard deviation: 45.3 ms
- Minimum: 156.2 ms
- Maximum: 412.8 ms
- 25th percentile: 215.4 ms
- 75th percentile: 268.9 ms
Variable: success_rate (%)
- Sample size: 500
- Mean: 94.2%
- Median: 95.0%
- Standard deviation: 3.8%
- Minimum: 78.5%
- Maximum: 100.0%
Distribution analysis:
- response_time follows a normal distribution
- success_rate is left-skewed
Outlier detection (IQR method):
- response_time: 8 outliers found (> 350ms)
- success_rate: 3 outliers found (< 85%)
Would you like me to generate visualization charts?Data Visualization
You: Create visualization charts for the sales data
Claude: I'll generate multiple charts...
Claude wants to:
• Write file: scripts/visualize.py
Allow? [y/n/always/never]
You: y
Claude: Created visualization script:
```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Read data
df = pd.read_csv('data/sales.csv')
df['date'] = pd.to_datetime(df['date'])
# Set style
sns.set_style('whitegrid')
plt.figure(figsize=(15, 10))
# 1. Time series chart
plt.subplot(2, 2, 1)
df.groupby('date')['revenue'].sum().plot()
plt.title('Revenue Over Time')
plt.xlabel('Date')
plt.ylabel('Revenue ($)')
# 2. Product sales bar chart
plt.subplot(2, 2, 2)
df.groupby('product')['quantity'].sum().plot(kind='bar')
plt.title('Sales by Product')
plt.xlabel('Product')
plt.ylabel('Quantity')
# 3. Revenue by region pie chart
plt.subplot(2, 2, 3)
df.groupby('region')['revenue'].sum().plot(kind='pie', autopct='%1.1f%%')
plt.title('Revenue by Region')
# 4. Correlation heatmap
plt.subplot(2, 2, 4)
corr = df[['quantity', 'revenue']].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.tight_layout()
plt.savefig('reports/sales_analysis.png', dpi=300)
print('Charts saved to reports/sales_analysis.png')Claude wants to: • Run command: python scripts/visualize.py Allow? [y/n/always/never]
You: y
Claude: Charts generated!
Output file: reports/sales_analysis.png
Contains 4 charts:
- Revenue time series
- Product sales comparison
- Revenue distribution by region
- Variable correlation
Would you like me to adjust chart styles or add other charts?
## JSON Data Analysis
```bash
You: Analyze logs/api_requests.json and summarize API usage
Claude: Let me analyze the API logs...
Claude wants to:
• Read file: logs/api_requests.json
Allow? [y/n/always/never]
You: y
Claude: API usage analysis:
Overall statistics:
- Total requests: 15,678
- Date range: 2026-04-01 to 2026-04-15
- Unique users: 1,234
- Average response time: 156ms
Endpoint usage ranking:
1. GET /api/users - 5,234 (33.4%)
2. POST /api/orders - 3,456 (22.0%)
3. GET /api/products - 2,890 (18.4%)
4. PUT /api/users/:id - 1,678 (10.7%)
5. DELETE /api/orders/:id - 890 (5.7%)
Status code distribution:
- 200 OK: 14,234 (90.8%)
- 400 Bad Request: 678 (4.3%)
- 401 Unauthorized: 456 (2.9%)
- 500 Server Error: 310 (2.0%)
Performance analysis:
- Fastest endpoint: GET /api/health (12ms)
- Slowest endpoint: POST /api/reports (1,234ms)
- P95 response time: 345ms
- P99 response time: 678ms
Error analysis:
- Primary error: Invalid token (234 occurrences)
- Peak hours: 14:00-16:00
- Problem endpoint: POST /api/orders
Would you like me to generate a detailed report?Database Query Analysis
You: Connect to the database and analyze user activity
Claude: I'll create an analysis script...
Claude wants to:
• Write file: scripts/analyze_users.py
Allow? [y/n/always/never]
You: y
Claude: Created database analysis script:
```python
import psycopg2
import pandas as pd
from datetime import datetime, timedelta
# Connect to database
conn = psycopg2.connect(
host=os.getenv('DB_HOST'),
database=os.getenv('DB_NAME'),
user=os.getenv('DB_USER'),
password=os.getenv('DB_PASSWORD')
)
# Query user activity
query = """
SELECT
DATE(last_login) as date,
COUNT(DISTINCT user_id) as active_users,
COUNT(*) as total_logins,
AVG(session_duration) as avg_session_duration
FROM user_sessions
WHERE last_login >= NOW() - INTERVAL '30 days'
GROUP BY DATE(last_login)
ORDER BY date DESC
"""
df = pd.read_sql(query, conn)
# Analyze results
print("30-Day User Activity Analysis")
print("=" * 50)
print(f"Daily active users: {df['active_users'].mean():.0f}")
print(f"Daily logins: {df['total_logins'].mean():.0f}")
print(f"Average session duration: {df['avg_session_duration'].mean():.1f} minutes")
# Identify trends
trend = df['active_users'].pct_change().mean()
if trend > 0:
print(f"Trend: Growing {trend*100:.1f}%")
else:
print(f"Trend: Declining {abs(trend)*100:.1f}%")
# Save results
df.to_csv('reports/user_activity.csv', index=False)
conn.close()Claude wants to: • Run command: python scripts/analyze_users.py Allow? [y/n/always/never]
You: y
Claude: User activity analysis complete!
30-Day User Activity Analysis
Daily active users: 1,234 Daily logins: 3,456 Average session duration: 23.5 minutes Trend: Growing 5.2%
Key insights:
- Weekend activity drops 30%
- Weekday peaks: 10:00-11:00, 14:00-15:00
- New user retention rate: 68%
- Returning user ratio: 45%
Report saved: reports/user_activity.csv
## Real-Time Data Monitoring
```bash
You: Monitor logs/app.log and track error rate in real time
Claude: I'll create a real-time monitoring script...
Claude wants to:
• Write file: scripts/monitor_logs.py
Allow? [y/n/always/never]
You: y
Claude: Created log monitoring script:
```python
import time
import re
from collections import defaultdict
from datetime import datetime
def monitor_logs(log_file, interval=60):
"""Monitor log file, output stats every minute"""
stats = defaultdict(int)
with open(log_file, 'r') as f:
# Move to end of file
f.seek(0, 2)
print("Starting log monitoring...")
print("=" * 60)
while True:
line = f.readline()
if not line:
time.sleep(1)
continue
# Track log levels
if 'ERROR' in line:
stats['errors'] += 1
elif 'WARN' in line:
stats['warnings'] += 1
elif 'INFO' in line:
stats['info'] += 1
stats['total'] += 1
# Output stats every 100 lines
if stats['total'] % 100 == 0:
error_rate = (stats['errors'] / stats['total']) * 100
print(f"\n[{datetime.now().strftime('%H:%M:%S')}] Stats:")
print(f" Total logs: {stats['total']}")
print(f" Errors: {stats['errors']} ({error_rate:.2f}%)")
print(f" Warnings: {stats['warnings']}")
print(f" Info: {stats['info']}")
# Alert on high error rate
if error_rate > 5:
print(f" ALERT: Error rate too high!")
if __name__ == '__main__':
monitor_logs('logs/app.log')This script will:
- Read the log file in real time
- Count logs by level
- Calculate error rate
- Alert when error rate exceeds 5%
How to run:
python scripts/monitor_logs.pyWould you like me to start monitoring?
## Using Auto-Accept Mode
```bash
# Press Shift+Tab to switch to Auto-Accept Mode
[Bottom shows: ⏵⏵ accept edits on]
You: Batch analyze all CSV files in the data directory
Claude: [Auto-executes]
Analysis complete:
1. sales.csv - 1,245 records
2. users.csv - 3,456 records
3. products.csv - 234 records
4. orders.csv - 5,678 records
Summary report generated: reports/batch_analysis.txtCommon Commands
/model opus # Use Opus for complex data analysis
/fast # Use fast mode for simple statistics
/cost # Check API usage costBest Practices
- Large file handling: Use chunked reading to avoid memory overflow
- Data validation: Check data quality before analysis
- Save results: Save analysis results to files
- Visualization: Use charts to make data more intuitive
- Automation: Create reusable analysis scripts