Data Science & Analytics

Analysis, notebooks, visualization, pandas, statistics

68 bites

How do you leverage and fine-tune BERT for niche classification?

Tests transfer learning with scarce labels. Outline: pick a domain-adjacent checkpoint, add a classification head, use learning rates near 2e-5 with early stopping, and stratify tiny validation splits.

What is a word embedding and how does it beat one-hot encoding?

Tests dense semantic vectors versus sparse one-hot representations. A good answer defines embeddings as learned real-valued vectors where similar words are close, contrasts them with orthogonal one-hot vectors lacking similarity, and names Word2Vec or GloVe.
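
The contrast can be sketched with cosine similarity. The three-dimensional embedding values below are hand-picked for illustration and are not trained Word2Vec or GloVe vectors; the names `cosine`, `one_hot`, and `embedding` are likewise assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocab = ["cat", "dog", "car"]

# One-hot: one dimension per word, so any two distinct words are orthogonal.
one_hot = {
    word: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, word in enumerate(vocab)
}

# Hypothetical dense embeddings, hand-picked for illustration (not trained).
embedding = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

one_hot_sim = cosine(one_hot["cat"], one_hot["dog"])        # no notion of similarity
dense_cat_dog = cosine(embedding["cat"], embedding["dog"])  # related words sit close
dense_cat_car = cosine(embedding["cat"], embedding["car"])  # unrelated words sit apart
```

With one-hot vectors every pair of distinct words scores zero, while the dense vectors can rank "dog" as closer to "cat" than "car" is.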

What is Simpson's Paradox and how can it bias A/B tests?

Tests whether you recognize that aggregate trends can reverse within subgroups. A strong answer defines the paradox, gives an A/B example where treatment wins overall but loses in every segment due to skewed allocation, and prescribes stratified analysis.
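
A small worked example, using made-up counts, shows how skewed allocation can produce the reversal. The segment names and numbers are hypothetical; treatment is deliberately over-allocated to the high-converting segment.

```python
# Hypothetical (users, conversions) per segment and arm.
data = {
    "high_intent": {"control": (100, 35), "treatment": (900, 270)},
    "low_intent": {"control": (900, 90), "treatment": (100, 8)},
}

def segment_rates(data):
    """Conversion rate per arm within each segment (the stratified view)."""
    return {
        segment: {arm: conv / users for arm, (users, conv) in arms.items()}
        for segment, arms in data.items()
    }

def pooled_rates(data):
    """Conversion rate per arm after summing over segments (the aggregate view)."""
    totals = {}
    for arms in data.values():
        for arm, (users, conv) in arms.items():
            total_users, total_conv = totals.get(arm, (0, 0))
            totals[arm] = (total_users + users, total_conv + conv)
    return {arm: conv / users for arm, (users, conv) in totals.items()}

by_segment = segment_rates(data)
overall = pooled_rates(data)
```

Here treatment trails control inside both segments yet leads in the pooled comparison, which is why the stratified view is the one to report.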

How do network effects violate A/B test assumptions, and how do you mitigate them?

Tests SUTVA violations and network experiment design. Answers note treated users alter control outcomes, then propose social-graph cluster randomization to isolate spillovers. Red flag: ignoring peer-to-peer spillover and using user-level randomization.
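
Cluster-level assignment can be sketched as below. It assumes a user-to-cluster mapping already exists (for example from a community-detection step, not shown); `assign_arm`, the salt value, and the sample mapping are all hypothetical.

```python
import hashlib

def assign_arm(cluster_id, salt="exp-001"):
    """Deterministically map a whole cluster to one arm by hashing its id."""
    digest = hashlib.sha256(f"{salt}:{cluster_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"

# Hypothetical user -> cluster mapping derived from the social graph.
user_cluster = {"u1": "c1", "u2": "c1", "u3": "c2", "u4": "c2", "u5": "c3"}

# Every user inherits the arm of their cluster, so connected users share an arm.
user_arm = {user: assign_arm(cluster) for user, cluster in user_cluster.items()}
```

Because the unit of randomization is the cluster, the analysis should also treat the cluster, not the user, as the unit when estimating variance.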

Why not stop an A/B test when it looks significant early?

Tests whether you understand that repeated looks inflate false positives. The term is peeking: testing at the 5% level on each of three equally spaced looks raises the overall Type I error rate to roughly 11%, and it keeps climbing with every extra look. Red flag: citing "low sample size" without stating that early stopping invalidates the p-value.
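
The inflation can be checked with a small A/A simulation. This is a sketch under simplifying assumptions: outcomes are standard normal in both arms, the variance is treated as known, looks are equally spaced, and the function name and defaults are hypothetical.

```python
import math
import random

def false_positive_rate(looks, sims=20000, batch=100, z_crit=1.96, seed=0):
    """Estimate how often |z| > z_crit at ANY of `looks` interim checks
    when there is no true difference between arms (an A/A test)."""
    rng = random.Random(seed)
    hits = 0
    for _sim in range(sims):
        diff_sum = 0.0  # running sum of (treatment - control) outcomes
        n = 0           # users per arm so far
        for _look in range(looks):
            # Outcomes are N(0, 1) in both arms, so the summed difference over
            # `batch` new users per arm is N(0, 2 * batch).
            diff_sum += rng.gauss(0.0, math.sqrt(2 * batch))
            n += batch
            z = diff_sum / math.sqrt(2 * n)
            if abs(z) > z_crit:
                hits += 1  # the experimenter would have stopped and declared a win
                break
    return hits / sims

single_look = false_positive_rate(looks=1)
three_looks = false_positive_rate(looks=3)
```

A single look stays near the nominal 5%, while stopping at the first significant look out of several pushes the rate well above it. Sequential methods such as alpha-spending boundaries exist to correct for this.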

P-value vs confidence interval in an A/B test

WHAT IT TESTS: Frequentist reasoning beyond binary significance. ANSWER OUTLINE: A p-value gauges evidence against the null; a 95% CI shows plausible effect sizes and precision. RED FLAG: Calling the CI a 95% probability the true difference is inside.
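
Both quantities can be computed side by side for a difference in conversion rates. The sketch below uses a normal approximation (a two-proportion z-test and a Wald interval); the function name and the counts are hypothetical.

```python
import math
from statistics import NormalDist

def two_proportion_summary(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Return (difference, two-sided p-value, confidence interval) for B minus A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a

    # p-value: pooled standard error, because the null assumes equal rates.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_null = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_null
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Interval: unpooled standard error around the observed difference.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return diff, p_value, (diff - z_crit * se, diff + z_crit * se)

# Hypothetical counts: 200/2000 conversions in control, 240/2000 in treatment.
diff, p_value, (ci_low, ci_high) = two_proportion_summary(200, 2000, 240, 2000)
```

In this example the p-value falls just under 0.05, but the interval runs from barely above zero to nearly four percentage points, which says far more about how uncertain the effect size is.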

How do you determine required sample size for an A/B test?

Tests statistical power and experimental design. Name four inputs: baseline conversion rate, minimum detectable effect, alpha (5%), and power (80%), then solve for N. Red flag: "test until significant" or fixed guesses like 1000 users without effect size.
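
One common way to solve for N is the normal-approximation formula for comparing two proportions, n = (z_{alpha/2} + z_beta)^2 * [p1(1 - p1) + p2(1 - p2)] / (p2 - p1)^2 per arm. The sketch below assumes a two-sided test, equal allocation, and an absolute minimum detectable effect; the function name is hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_per_arm(baseline, mde, alpha=0.05, power=0.80):
    """Users needed per arm to detect an absolute lift of `mde` over `baseline`."""
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)

# Hypothetical inputs: 10% baseline, detecting a 2-point vs a 4-point lift.
n_small_effect = sample_size_per_arm(baseline=0.10, mde=0.02)
n_large_effect = sample_size_per_arm(baseline=0.10, mde=0.04)
```

Halving the detectable effect roughly quadruples the required sample, which is why a fixed guess made without an effect size is meaningless.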

High ROC-AUC but low PR-AUC: what does this imply?

Tests whether you know that ROC-AUC can hide class imbalance while PR-AUC exposes it. Under severe imbalance, false positives are spread across many negatives, so FPR stays low and ROC-AUC looks strong even as precision collapses. This matters most for rare positives with costly false positives. Red flag: praising the model on ROC-AUC alone.
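
The arithmetic at a single threshold makes the gap concrete. The counts below are hypothetical: a dataset with 0.1% positives and a model that looks strong on the ROC axes.

```python
# Hypothetical confusion counts at one threshold on a 0.1%-positive dataset.
positives, negatives = 100, 99_900
true_positives = 90     # the model catches 90% of the positives
false_positives = 999   # and flags only 1% of the negatives

tpr = true_positives / positives    # recall, the ROC y-axis
fpr = false_positives / negatives   # the ROC x-axis
precision = true_positives / (true_positives + false_positives)  # the PR y-axis
```

A point at 90% TPR and 1% FPR looks excellent on a ROC curve, yet fewer than one in ten flagged cases is a true positive.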

What is cross-validation and why is it more robust than a holdout split?

WHAT IT TESTS: Understanding of generalization and evaluation variance. ANSWER OUTLINE: A single split is noisy and wastes data; k-fold uses each fold once as the test set, averages the k scores, and lets every observation serve in both training and testing.
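
The fold rotation can be sketched without any ML library. `k_fold_indices` is a hypothetical helper; in practice a library splitter would usually replace it.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each index lands in exactly one test fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal slices
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=10, k=5))
# A model would be fit on train_idx and scored on test_idx for each split,
# and the k scores averaged to get the cross-validated estimate.
```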

How would feature engineering for categoricals differ for logistic regression versus LightGBM?

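It tests model-specific encoding decisions. Logistic regression needs one-hot encoding to avoid imposing a false order on categories; tree models like LightGBM can work with integer-coded categories, or with LightGBM's native categorical handling, because splits partition values rather than rely on distances between them.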
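
The two encodings look like this on a toy column. This is a plain-Python sketch with hypothetical data; it shows only the encoded inputs, not model training.

```python
colors = ["red", "green", "blue", "green"]
categories = sorted(set(colors))  # fixed category order: blue, green, red

# One-hot for a linear model: one indicator column per category, no implied order.
one_hot = [[1 if color == cat else 0 for cat in categories] for color in colors]

# Integer codes for a tree model: compact, and the numeric order carries no meaning
# beyond giving the tree values to split on.
codes = [categories.index(color) for color in colors]
```

Feeding `codes` to logistic regression would tell the model that red is "two units more" than blue, which is the false ordinality the one-hot columns avoid.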

Explain bias-variance tradeoff and how regularization or tree depth manage it

Tests understanding of generalization error decomposition. Define bias as underfitting and variance as sensitivity to training noise; show regularization and shallow trees trade excess variance for slightly higher bias. Red flag: claiming both can hit zero.

How can EDA and visualization identify dataset bias before modeling?

Tests operationalizing bias detection before modeling. Strong answers compare sample distributions to population norms, audit feature coverage with heatmaps, and track missingness by subgroup. Red flag: citing only class imbalance or ignoring sampling frame.

Describe your systematic approach to interpreting an 8-feature pair plot

WHAT IT TESTS: multivariate EDA beyond linear correlation. ANSWER OUTLINE: check 8 diagonals for skew; scan 28 off-diagonals for nonlinear trends, variance patterns, and hue clusters; flag redundancy; spot outliers.

How would you visualize high-cardinality categorical relationships?

This tests dimensionality reduction when categories exceed roughly fifty levels. A strong answer proposes top-N aggregation with an "Other" bucket, plus density or hierarchy plots such as heatmaps or treemaps. A red flag is a scrolling bar chart or sampling rows.
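
Top-N aggregation can be sketched with the standard library before any plotting. `top_n_with_other` and the country counts are hypothetical.

```python
from collections import Counter

def top_n_with_other(values, n, other_label="Other"):
    """Keep the n most frequent categories and lump the rest into one bucket."""
    counts = Counter(values)
    top = dict(counts.most_common(n))
    remainder = sum(counts.values()) - sum(top.values())
    if remainder:
        top[other_label] = remainder
    return top

# Hypothetical categorical column with a long tail.
values = ["US"] * 50 + ["DE"] * 30 + ["FR"] * 10 + ["BR"] * 5 + ["JP"] * 3 + ["NZ"] * 2
summary = top_n_with_other(values, n=3)
```

The resulting handful of bars (or heatmap rows) stays readable, and the size of the "Other" bucket shows how much mass the long tail carries.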

How do you visualize clusters in 100-dimensional numerical data?

Tests PCA versus t-SNE tradeoffs for high-dimensional visualization. Strong answers use PCA first for global variance and outliers, then t-SNE with perplexities 5-50 run to stability. Red flag: interpreting t-SNE cluster sizes or distances as literal metrics.

Which plot visualizes a continuous versus categorical variable and why?

This tests plot selection for continuous versus categorical relationships. A strong answer names boxplots or violinplots for distribution shape and outliers, or swarmplots for point density, and cites what each reveals: median, spread, skew, and anomalies.

Design a clickstream pipeline from ingestion to data warehouse

Tests data pipeline design under load: buffering, idempotent transforms, and warehouse modeling. A strong answer sequences ingestion via Kinesis or MSK, sessionization with Spark on EMR, and star schemas in Redshift. Red flag: no buffer and direct warehouse writes.
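
The sessionization step can be illustrated in plain Python as a stand-in for the Spark job. The 30-minute inactivity gap is an assumed convention, and `sessionize` and the sample events are hypothetical.

```python
from datetime import datetime, timedelta

def sessionize(events, gap=timedelta(minutes=30)):
    """Group (user_id, timestamp) events into per-user sessions,
    starting a new session whenever the inactivity gap is exceeded."""
    sessions = {}
    for user, ts in sorted(events):  # sort by user, then time
        user_sessions = sessions.setdefault(user, [])
        if user_sessions and ts - user_sessions[-1][-1] <= gap:
            user_sessions[-1].append(ts)  # continues the current session
        else:
            user_sessions.append([ts])    # first event or gap exceeded
    return sessions

t0 = datetime(2024, 1, 1, 12, 0)
events = [
    ("u1", t0),
    ("u1", t0 + timedelta(minutes=10)),
    ("u1", t0 + timedelta(minutes=55)),  # 45 minutes after the previous event
    ("u2", t0 + timedelta(minutes=5)),
]
sessions = sessionize(events)
```

Because the function depends only on its input events, rerunning it over the same buffered batch gives the same sessions, which is the property the idempotent-transform requirement is after.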

How would you monitor data quality for a C-level dashboard pipeline?

WHAT IT TESTS: Designing production data observability for executive dashboards. A strong answer maps freshness SLAs, completeness checks, and distribution drift detection to business impact.
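
Two of these checks, freshness and completeness, can be sketched as simple functions. The thresholds (a 2-hour freshness SLA, a 1% null-rate limit), the function names, and the sample rows are all hypothetical.

```python
from datetime import datetime, timedelta

def check_freshness(last_loaded_at, now, max_age=timedelta(hours=2)):
    """True if the latest load is within the freshness SLA."""
    return now - last_loaded_at <= max_age

def check_completeness(rows, required_fields, max_null_rate=0.01):
    """Return the fields whose share of missing values exceeds the threshold."""
    if not rows:
        return {field: 1.0 for field in required_fields}  # no data counts as fully missing
    failures = {}
    for field in required_fields:
        nulls = sum(1 for row in rows if row.get(field) is None)
        null_rate = nulls / len(rows)
        if null_rate > max_null_rate:
            failures[field] = null_rate
    return failures

now = datetime(2024, 1, 1, 9, 0)
rows = [{"revenue": 10.0, "region": "EU"}, {"revenue": None, "region": "US"}]
fresh = check_freshness(datetime(2024, 1, 1, 8, 0), now)
incomplete = check_completeness(rows, ["revenue", "region"])
```

Each failing check would then be tied to the dashboard tile it feeds, so an alert states the business impact rather than just the table name.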

What is data pipeline idempotency and how do you design for it?

This tests resilient pipeline design under failure. A strong answer defines idempotency as identical output on repeated runs, highlights safe retries and partial failure recovery, and proposes idempotency keys with atomic writes for daily API loads.
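
A minimal sketch of the pattern, using a JSON file as a stand-in for the real target store: records are upserted by an idempotency key and the file is replaced atomically. The `load_batch` helper, the key format, and the sample records are hypothetical.

```python
import json
import os
import tempfile

def load_batch(path, records):
    """Upsert records keyed by 'id' into a JSON file, replacing it atomically."""
    existing = {}
    if os.path.exists(path):
        with open(path) as f:
            existing = json.load(f)
    for record in records:
        existing[record["id"]] = record  # same key overwrites, never duplicates
    # Write to a temp file in the same directory, then swap it in with one rename,
    # so a crash mid-write never leaves a half-written target.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(existing, f, sort_keys=True)
    os.replace(tmp_path, path)

workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "daily_load.json")
# Hypothetical idempotency keys built from load date and source record id.
batch = [{"id": "2024-01-01:42", "amount": 10}, {"id": "2024-01-01:43", "amount": 7}]

load_batch(target, batch)
with open(target) as f:
    first_run = f.read()

load_batch(target, batch)  # simulated retry of the same daily load
with open(target) as f:
    second_run = f.read()
```

The retry leaves the output byte-for-byte unchanged, which is what makes it safe to rerun the job after a partial failure.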

Star schema vs snowflake schema: differences and trade-offs

Tests dimensional modeling: star schemas denormalize dimensions so queries need fewer joins and stay simple, while snowflake schemas normalize them to cut redundancy at the cost of extra joins. Red flag: praising snowflake storage savings without admitting query overhead.
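
The join overhead is easy to see on a toy example. The sketch below uses an in-memory SQLite database with hypothetical table and column names; the same revenue-by-category question needs one join in the star layout and two in the snowflake layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star: category is stored directly on the product dimension (redundant text).
CREATE TABLE dim_product_star (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
-- Snowflake: category is normalized into its own table.
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_product_snow (product_id INTEGER PRIMARY KEY, name TEXT, category_id INTEGER);
CREATE TABLE fact_sales (product_id INTEGER, amount REAL);

INSERT INTO dim_product_star VALUES
    (1, 'Laptop', 'Electronics'), (2, 'Phone', 'Electronics'), (3, 'Desk', 'Furniture');
INSERT INTO dim_category VALUES (10, 'Electronics'), (20, 'Furniture');
INSERT INTO dim_product_snow VALUES (1, 'Laptop', 10), (2, 'Phone', 10), (3, 'Desk', 20);
INSERT INTO fact_sales VALUES (1, 1000.0), (2, 500.0), (3, 200.0);
""")

# Star schema: a single join from fact to dimension.
star_query = """
SELECT p.category, SUM(f.amount)
FROM fact_sales f
JOIN dim_product_star p ON p.product_id = f.product_id
GROUP BY p.category ORDER BY p.category
"""

# Snowflake schema: an extra join to reach the normalized category table.
snowflake_query = """
SELECT c.category, SUM(f.amount)
FROM fact_sales f
JOIN dim_product_snow p ON p.product_id = f.product_id
JOIN dim_category c ON c.category_id = p.category_id
GROUP BY c.category ORDER BY c.category
"""

star_result = conn.execute(star_query).fetchall()
snowflake_result = conn.execute(snowflake_query).fetchall()
```

Both queries return the same totals; the trade-off is the repeated category text in the star dimension versus the additional join in the snowflake query.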