Introduction
Data science projects require systematic approaches to transform raw data into
actionable insights. While various frameworks exist in the industry, most share
common fundamental stages that guide practitioners from initial problem
conception to deployment and maintenance. This report examines the typical
stages found in generic data science process models, drawing from established
methodologies including CRISP-DM (Cross-Industry Standard Process for Data
Mining), TDSP (Team Data Science Process), and KDD (Knowledge Discovery in
Databases).
The Six Core Stages of Data Science Projects
Stage 1: Business Understanding and Problem Definition
The foundation of any successful data science project begins with thoroughly
understanding the business context and clearly defining the problem to be
solved. This stage involves translating business objectives into data science
questions that can be answered analytically. Project stakeholders must
establish success criteria, identify constraints, and assess feasibility.
For example, a retail company seeking to reduce customer churn would first need to
define what constitutes churn in their context (no purchase in 90 days versus
account closure), determine acceptable prediction accuracy levels, and
establish how predictions will be integrated into business operations. This
stage often requires multiple iterations as stakeholders refine their
understanding of what's possible given available data and resources.
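To make that definition concrete, the minimal Python sketch below shows one way the 90-day rule could be turned into churn labels; the table, column names, and snapshot date are hypothetical stand-ins rather than any particular company's schema.
    import pandas as pd
    # Hypothetical purchase history; in practice this would come from a warehouse query.
    purchases = pd.DataFrame({
        "customer_id": [1, 1, 2, 3],
        "purchase_date": pd.to_datetime(
            ["2024-01-05", "2024-03-20", "2023-11-02", "2024-04-01"]
        ),
    })
    snapshot_date = pd.Timestamp("2024-04-30")  # date at which churn is assessed
    churn_window_days = 90                      # the agreed "no purchase in 90 days" rule
    # Days since each customer's most recent purchase as of the snapshot date.
    last_purchase = purchases.groupby("customer_id")["purchase_date"].max()
    days_inactive = (snapshot_date - last_purchase).dt.days
    # Label customers as churned if they exceed the agreed inactivity window.
    churn_labels = (days_inactive > churn_window_days).rename("churned")
    print(churn_labels)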
According to Provost and Fawcett (2013) in "Data Science for Business," this
initial stage determines project success more than any technical consideration,
as misaligned objectives lead to technically sound but practically useless
solutions.
Stage 2: Data Collection and Acquisition
Once objectives are clear, the focus shifts to identifying and gathering relevant
data sources. This stage encompasses both discovering what data exists within
the organization and determining what external data might enhance the analysis.
Data scientists must evaluate data accessibility, quality, volume, and
relevance to the defined problem.
Consider a healthcare organization developing a patient readmission prediction model. They
would need to collect electronic health records, demographic information,
treatment histories, and potentially external data such as socioeconomic
indicators from census databases. Each data source requires different
acquisition methods, from database queries to API calls to manual collection
processes.
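A simplified sketch of two such acquisition paths follows, assuming a hypothetical SQL database of admissions and a hypothetical census API; the connection string, URL, and field names are placeholders, so this is a template rather than code that runs against real systems.
    import pandas as pd
    import requests
    from sqlalchemy import create_engine
    # Hypothetical connection string; substitute the organization's real database.
    engine = create_engine("postgresql://user:password@ehr-db.example.org/clinical")
    # Structured EHR fields come from an ordinary SQL query.
    admissions = pd.read_sql(
        "SELECT patient_id, zip_code, admit_date, discharge_date, diagnosis_code "
        "FROM admissions",
        engine,
    )
    # External socioeconomic indicators come from a (hypothetical) REST API.
    response = requests.get(
        "https://api.example.org/census/indicators",
        params={"year": 2020},
        timeout=30,
    )
    response.raise_for_status()
    census = pd.DataFrame(response.json()["results"])  # assumed to include zip_code
    # The two sources are joined on geography before modeling begins.
    dataset = admissions.merge(census, on="zip_code", how="left")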
The challenge extends beyond mere collection to ensuring proper data governance,
privacy compliance (especially with regulations like GDPR or HIPAA), and
establishing sustainable data pipelines for ongoing projects.
Stage 3: Data Exploration and Preparation
Data exploration and preparation typically consumes 60-80% of a data scientist's
time, according to surveys by CrowdFlower (2016) and subsequent industry
studies. This stage involves understanding data characteristics through
statistical summaries and visualizations, identifying quality issues, and
transforming raw data into analysis-ready formats.
Exploratory Data Analysis (EDA) reveals patterns, anomalies, and relationships that inform
subsequent modeling decisions. For instance, discovering that customer purchase
patterns follow strong seasonal trends would influence both feature engineering
and model selection strategies. Data scientists create visualizations ranging
from simple histograms to complex correlation matrices to understand variable
distributions and relationships.
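The short Python sketch below illustrates this kind of first-pass EDA on synthetic purchase data with an injected seasonal effect; the columns and distributions are invented purely to show the mechanics.
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    # Hypothetical customer purchase data used purely for illustration.
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "month": np.tile(np.arange(1, 13), 100),
        "spend": rng.gamma(shape=2.0, scale=50.0, size=1200),
        "visits": rng.poisson(lam=4, size=1200),
    })
    df["spend"] *= 1 + 0.3 * np.sin(2 * np.pi * df["month"] / 12)  # inject seasonality
    # Statistical summaries highlight ranges, skew, and potential quality issues.
    print(df.describe())
    # A histogram shows the shape of the spend distribution (here, right-skewed).
    df["spend"].hist(bins=40)
    plt.xlabel("monthly spend")
    plt.ylabel("count")
    plt.show()
    # A correlation matrix surfaces relationships that inform feature engineering.
    print(df.corr(numeric_only=True))
    # Grouping by month exposes the seasonal pattern mentioned above.
    print(df.groupby("month")["spend"].mean())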
Data preparation encompasses cleaning (handling missing values, correcting errors),
transformation (normalization, encoding categorical variables), and feature
engineering (creating derived variables that better capture underlying
patterns). A financial fraud detection project might create features such as
"transaction velocity" (transactions per hour) or "deviation
from typical spending patterns" rather than using raw transaction amounts
alone.
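As a rough illustration, the sketch below derives both of those features from a toy transaction log using pandas; the column names and window sizes are assumptions, not a prescription.
    import pandas as pd
    # Hypothetical transaction log; fields and values are illustrative only.
    tx = pd.DataFrame({
        "card_id": [1, 1, 1, 2, 2],
        "timestamp": pd.to_datetime([
            "2024-05-01 10:00", "2024-05-01 10:20", "2024-05-01 10:35",
            "2024-05-01 09:00", "2024-05-02 18:00",
        ]),
        "amount": [25.0, 30.0, 900.0, 60.0, 55.0],
    }).sort_values(["card_id", "timestamp"])
    # Transaction velocity: transactions per card in a rolling one-hour window.
    tx["tx_per_hour"] = (
        tx.set_index("timestamp")
          .groupby("card_id")["amount"]
          .rolling("1h").count()
          .values
    )
    # Deviation from typical spending: amount relative to the card's prior running mean.
    running_mean = tx.groupby("card_id")["amount"].transform(
        lambda s: s.expanding().mean().shift(1)
    )
    tx["spend_deviation"] = (tx["amount"] - running_mean) / running_mean
    print(tx)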
Stage 4: Modeling and Algorithm Development
The modeling stage involves selecting appropriate algorithms, training models, and
optimizing their performance. This iterative process begins with establishing a
baseline using simple approaches before progressing to more sophisticated
techniques. Data scientists must balance model complexity with
interpretability, considering the trade-offs between accuracy and
explainability.
For a customer segmentation project, the progression might start with simple k-means
clustering, advance to hierarchical clustering for better interpretability, and
potentially incorporate deep learning approaches if the data complexity
warrants it. Each approach requires different preprocessing steps,
hyperparameter tuning strategies, and validation methodologies.
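A compact sketch of that progression on synthetic customer features, comparing a k-means baseline with hierarchical clustering via silhouette scores; the feature set and cluster count are illustrative assumptions.
    import numpy as np
    from sklearn.cluster import KMeans, AgglomerativeClustering
    from sklearn.metrics import silhouette_score
    from sklearn.preprocessing import StandardScaler
    # Hypothetical customer features (recency, frequency, monetary value).
    rng = np.random.default_rng(42)
    X = np.vstack([
        rng.normal(loc=[10, 2, 50], scale=[3, 1, 10], size=(200, 3)),
        rng.normal(loc=[60, 8, 400], scale=[10, 2, 80], size=(200, 3)),
    ])
    X_scaled = StandardScaler().fit_transform(X)
    # Baseline: simple k-means with a small number of clusters.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
    print("k-means silhouette:", silhouette_score(X_scaled, kmeans.labels_))
    # Next step: hierarchical clustering, whose structure is often easier to
    # explain to business stakeholders.
    agg = AgglomerativeClustering(n_clusters=2).fit(X_scaled)
    print("hierarchical silhouette:", silhouette_score(X_scaled, agg.labels_))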
Model selection depends on multiple factors including data characteristics (volume,
dimensionality, structure), problem type (classification, regression,
clustering), performance requirements (accuracy versus speed), and deployment
constraints (real-time versus batch processing). Cross-validation and careful
train-test splitting ensure models generalize well to unseen data rather than
memorizing training patterns.
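The following sketch shows the usual split-then-cross-validate pattern on synthetic data; the model, fold count, and metric are placeholders for whatever a given project actually requires.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score, train_test_split
    # Synthetic data standing in for a real project's feature matrix and labels.
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    # Hold out a test set that is never touched during model development.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0
    )
    # 5-fold cross-validation on the training set estimates generalization
    # performance before the final check on the untouched test set.
    model = LogisticRegression(max_iter=1000)
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
    print("CV ROC AUC: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))
    model.fit(X_train, y_train)
    print("Held-out test accuracy:", model.score(X_test, y_test))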
Stage 5: Evaluation and Validation
Rigorous evaluation extends beyond simple accuracy metrics to encompass business
relevance, statistical significance, and practical feasibility. Data scientists
must select appropriate metrics aligned with business objectives—for instance,
in medical diagnosis, minimizing false negatives might outweigh overall
accuracy considerations.
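One way to reflect such a priority is to evaluate recall explicitly and tune the decision threshold, as in the sketch below on synthetic imbalanced data; the thresholds shown are arbitrary illustrations, not recommendations.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report, recall_score
    from sklearn.model_selection import train_test_split
    # Synthetic, imbalanced "diagnosis" data: positives are the rare class we must not miss.
    X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=1)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    # The default 0.5 threshold favors overall accuracy but may miss positives.
    print(classification_report(y_test, (proba >= 0.5).astype(int), digits=3))
    # Lowering the decision threshold trades precision for recall, which may better
    # match a business objective of minimizing false negatives.
    for threshold in (0.5, 0.3, 0.1):
        preds = (proba >= threshold).astype(int)
        print(f"threshold={threshold}: recall={recall_score(y_test, preds):.3f}")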
Evaluation involves multiple perspectives: statistical performance (accuracy, precision,
recall, F1-score), business impact (revenue increase, cost reduction), and
operational feasibility (computational requirements, integration complexity). A
recommendation system for an e-commerce platform would be evaluated not just on
prediction accuracy but on metrics like click-through rates, conversion rates,
and customer satisfaction scores in A/B testing scenarios.
Validation strategies must also address potential biases, ensure fairness across different
population segments, and test model robustness under various conditions. This
includes stress testing with edge cases and evaluating performance degradation
over time as data distributions shift.
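A minimal sketch of a per-segment check follows, computing recall separately for a hypothetical segment attribute on synthetic data; in a real project the segments, metric, and acceptable gap would come from the fairness requirements themselves.
    import numpy as np
    import pandas as pd
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import recall_score
    from sklearn.model_selection import train_test_split
    # Synthetic data with a hypothetical "segment" attribute (e.g., region or age band).
    X, y = make_classification(n_samples=4000, random_state=2)
    segment = np.random.default_rng(2).choice(["A", "B"], size=len(y), p=[0.7, 0.3])
    X_train, X_test, y_train, y_test, seg_train, seg_test = train_test_split(
        X, y, segment, stratify=y, random_state=2
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    report = pd.DataFrame({"segment": seg_test, "y": y_test, "pred": model.predict(X_test)})
    # Compare a key metric segment by segment; large gaps warrant investigation
    # before the model is trusted in production.
    for seg, grp in report.groupby("segment"):
        print(seg, "recall:", round(recall_score(grp["y"], grp["pred"]), 3))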
Stage 6: Deployment and Monitoring
Successful deployment transforms analytical models into operational systems that deliver
ongoing business value. This stage requires collaboration between data
scientists, engineers, and IT operations to integrate models into existing
infrastructure while ensuring scalability, reliability, and maintainability.
Deployment strategies vary from simple batch scoring systems that run periodically to
complex real-time prediction services handling millions of requests. A credit
scoring model might be deployed as a REST API integrated into loan application
systems, requiring attention to response time, failover mechanisms, and
version control.
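The sketch below outlines what such a scoring endpoint might look like using Flask; the model path, feature names, and route are hypothetical, and a production deployment would add input validation, logging, authentication, and a proper application server.
    import joblib
    from flask import Flask, jsonify, request
    app = Flask(__name__)
    # Hypothetical path to a model trained and versioned elsewhere in the pipeline.
    model = joblib.load("models/credit_scoring_v3.joblib")
    @app.route("/score", methods=["POST"])
    def score():
        """Return a default probability for one applicant sent as a JSON feature vector."""
        payload = request.get_json()
        features = [[
            payload["income"],                 # hypothetical feature names
            payload["debt_to_income"],
            payload["credit_history_months"],
        ]]
        probability = float(model.predict_proba(features)[0, 1])
        return jsonify({"default_probability": probability, "model_version": "v3"})
    if __name__ == "__main__":
        # In production this would sit behind a WSGI server with failover and
        # monitoring, not Flask's built-in development server.
        app.run(host="0.0.0.0", port=8080)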
Post-deployment monitoring becomes crucial as model performance can degrade when real-world
data diverges from training distributions—a phenomenon known as model drift.
Automated monitoring systems track prediction accuracy, data quality, and
system performance, triggering alerts when metrics fall below thresholds.
Regular retraining schedules and feedback loops ensure models remain relevant
and accurate over time.
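One common drift signal is the Population Stability Index, sketched below for a single numeric feature on simulated data; the 0.2 alerting threshold mentioned in the comment is a widely used rule of thumb, not a formal guarantee.
    import numpy as np
    def population_stability_index(expected, actual, bins=10):
        """Compare a feature's training-time distribution with live data.
        Values above roughly 0.2 are often treated as significant drift."""
        cuts = np.quantile(expected, np.linspace(0, 1, bins + 1))
        expected_pct = np.histogram(expected, bins=cuts)[0] / len(expected)
        # Clip live values into the training range so nothing falls outside the bins.
        actual_clipped = np.clip(actual, cuts[0], cuts[-1])
        actual_pct = np.histogram(actual_clipped, bins=cuts)[0] / len(actual)
        expected_pct = np.clip(expected_pct, 1e-6, None)  # avoid log(0) and division by zero
        actual_pct = np.clip(actual_pct, 1e-6, None)
        return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))
    # Simulated drift: live incomes shift upward relative to the training sample.
    rng = np.random.default_rng(7)
    training_income = rng.normal(50_000, 10_000, 10_000)
    live_income = rng.normal(58_000, 12_000, 2_000)
    psi = population_stability_index(training_income, live_income)
    print(f"PSI = {psi:.3f}")  # an automated monitor could alert when this exceeds a threshold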
The Iterative Nature of Data Science
While presented sequentially, these stages form an iterative cycle rather than a
linear progression. Insights from modeling might necessitate collecting
additional data, evaluation results could reframe the original problem
definition, and deployment experiences often reveal new requirements. The
CRISP-DM model, widely adopted across industries, explicitly recognizes this
cyclical nature, with arrows connecting non-adjacent phases to represent common
iteration patterns.
Agile methodologies increasingly influence data science workflows, promoting rapid
prototyping, continuous stakeholder feedback, and incremental delivery. Rather
than attempting to perfect each stage before proceeding, teams develop minimum
viable models that evolve through successive iterations based on real-world
performance and changing business needs.
Conclusion
Understanding these fundamental stages provides a roadmap for managing data science projects
effectively, though specific implementations vary based on organizational
context, project complexity, and domain requirements. Success requires not just
technical proficiency in each stage but also strong project management, clear
communication with stakeholders, and flexibility to adapt as understanding
deepens throughout the project lifecycle.
References
Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., & Wirth, R. (2000). CRISP-DM 1.0: Step-by-step data mining guide. SPSS Inc.
Provost, F., & Fawcett, T. (2013). Data Science for Business: What You Need to Know About Data Mining and Data-Analytic Thinking. O'Reilly Media.
Microsoft Corporation. (2020). Team Data Science Process documentation. Azure Architecture Center.
Kelleher, J. D., & Tierney, B. (2018). Data Science. MIT Press.
CrowdFlower. (2016). Data Science Report: The View from Data Scientists. Figure Eight Inc.
Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17(3), 37-54.