Tuesday, October 7, 2025
Wednesday, October 1, 2025
C2PA should be mainstream
As AI-generated images become the norm in society, it is imperative that we develop a method of provenance for certain media.
C2PA (the Coalition for Content Provenance and Authenticity) maintains one such system. Its standard creates a secure, tamper-evident record of a piece of content’s origin and history. When a photo is taken by a camera, edited in software, or generated entirely by an AI model, C2PA can embed verified details like who or what created it, what tools were used, and whether it has been altered. This information travels with the file itself, protected by cryptography so it can’t be easily altered or forged without detection.
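To make "tamper-evident" a bit more concrete, here is a minimal sketch of the general idea: bind a record to the content with a hash, then sign the record. This is not the C2PA format or API. The key, field names, and functions are all made up for illustration, and an HMAC with a shared secret stands in for the certificate-based signatures a real implementation would use.

```python
import hashlib
import hmac
import json

# Hypothetical shared secret, for illustration only.
# Real provenance systems sign with certificate-backed private keys instead.
SIGNING_KEY = b"demo-key-not-for-production"


def make_manifest(content: bytes, creator: str, tool: str) -> dict:
    """Build a provenance record bound to the content by its SHA-256 hash."""
    claim = {
        "creator": creator,
        "tool": tool,
        "content_sha256": hashlib.sha256(content).hexdigest(),
    }
    payload = json.dumps(claim, sort_keys=True).encode("utf-8")
    signature = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return {"claim": claim, "signature": signature}


def verify(content: bytes, manifest: dict) -> bool:
    """Return True only if the claim is unmodified and matches the content."""
    payload = json.dumps(manifest["claim"], sort_keys=True).encode("utf-8")
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, manifest["signature"]):
        return False  # the claim itself was edited after signing
    actual_hash = hashlib.sha256(content).hexdigest()
    return hmac.compare_digest(actual_hash, manifest["claim"]["content_sha256"])


photo = b"raw image bytes"
manifest = make_manifest(photo, creator="Example Camera", tool="firmware 1.0")
print(verify(photo, manifest))            # the original file
print(verify(photo + b"edit", manifest))  # the same file after an edit
```

The first check passes and the second fails, because any change to the bytes changes the hash that the signed claim points to.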
One problem, though, is that C2PA cannot stop someone from screen-recording or screenshotting media, which produces a new copy without the original provenance record. This just means that provenance is only as strong as the ecosystem that supports it. C2PA works best when platforms, devices, and creators all participate in preserving and honoring the chain of trust. And while it can’t stop every form of deception, it can establish a baseline of accountability for content that moves through the compliant channels people rely on, like news outlets and social media platforms.
Read More: C2PA | Verifying Media Content Sources
Thursday, September 11, 2025
Stages in a Generic Process Model for Data Science Projects
Introduction
Data science projects require systematic approaches to transform raw data into actionable insights. While various frameworks exist in the industry, most share common fundamental stages that guide practitioners from initial problem conception to deployment and maintenance. This report examines the typical stages found in generic data science process models, drawing from established methodologies including CRISP-DM (Cross-Industry Standard Process for Data Mining), TDSP (Team Data Science Process), and KDD (Knowledge Discovery in Databases).
The Six Core Stages of Data Science Projects
Stage 1: Business Understanding and Problem Definition
The foundation of any successful data science project begins with thoroughly understanding the business context and clearly defining the problem to be solved. This stage involves translating business objectives into data science questions that can be answered analytically. Project stakeholders must establish success criteria, identify constraints, and assess feasibility.
For example, a retail company seeking to reduce customer churn would first need to define what constitutes churn in their context (no purchase in 90 days versus account closure), determine acceptable prediction accuracy levels, and establish how predictions will be integrated into business operations. This stage often requires multiple iterations as stakeholders refine their understanding of what's possible given available data and resources.
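The "no purchase in 90 days" definition of churn can be written down as a small labeling rule. The sketch below is illustrative only: the function name and the default window are assumptions, and a real project would agree on the exact rule with stakeholders.

```python
from datetime import date, timedelta


def is_churned(last_purchase: date, as_of: date, window_days: int = 90) -> bool:
    """Label a customer as churned if no purchase occurred within the window."""
    return (as_of - last_purchase) > timedelta(days=window_days)


# Example: evaluate two customers as of the same reference date.
reference = date(2025, 6, 1)
print(is_churned(date(2025, 1, 1), reference))  # last purchase 151 days earlier
print(is_churned(date(2025, 5, 1), reference))  # last purchase 31 days earlier
```

Making the window a parameter keeps the business definition explicit and easy to revisit when stakeholders refine it.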
According to Provost and Fawcett (2013) in "Data Science for Business," this initial stage determines project success more than any technical consideration, as misaligned objectives lead to technically sound but practically useless solutions.
Stage 2: Data Collection and Acquisition
Once objectives are clear, the focus shifts to identifying and gathering relevant data sources. This stage encompasses both discovering what data exists within the organization and determining what external data might enhance the analysis. Data scientists must evaluate data accessibility, quality, volume, and relevance to the defined problem.
Consider a healthcare organization developing a patient readmission prediction model. They would need to collect electronic health records, demographic information, treatment histories, and potentially external data such as socioeconomic indicators from census databases. Each data source requires different acquisition methods, from database queries to API calls to manual collection processes.
The challenge extends beyond mere collection to ensuring proper data governance, privacy compliance (especially with regulations like GDPR or HIPAA), and establishing sustainable data pipelines for ongoing projects.
Stage 3: Data Exploration and Preparation
Data exploration and preparation typically consume 60-80% of a data scientist's time, according to surveys by CrowdFlower (2016) and subsequent industry studies. This stage involves understanding data characteristics through statistical summaries and visualizations, identifying quality issues, and transforming raw data into analysis-ready formats.
Exploratory Data Analysis (EDA) reveals patterns, anomalies, and relationships that inform subsequent modeling decisions. For instance, discovering that customer purchase patterns follow strong seasonal trends would influence both feature engineering and model selection strategies. Data scientists create visualizations ranging from simple histograms to complex correlation matrices to understand variable distributions and relationships.
Data preparation encompasses cleaning (handling missing values, correcting errors), transformation (normalization, encoding categorical variables), and feature engineering (creating derived variables that better capture underlying patterns). A financial fraud detection project might create features such as "transaction velocity" (transactions per hour) or "deviation from typical spending patterns" rather than using raw transaction amounts alone.
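The two derived features mentioned above could be computed along the following lines. This is a minimal sketch using only the standard library; the function names, the trailing one-hour window, and the use of a z-score for "deviation" are assumptions made for illustration, not a prescribed fraud-detection method.

```python
from datetime import datetime, timedelta
from statistics import mean, pstdev


def transaction_velocity(timestamps: list[datetime], now: datetime,
                         window_hours: float = 1.0) -> float:
    """Transactions per hour over the trailing window ending at `now`."""
    start = now - timedelta(hours=window_hours)
    count = sum(1 for t in timestamps if start < t <= now)
    return count / window_hours


def spending_deviation(amount: float, history: list[float]) -> float:
    """Number of standard deviations `amount` lies from the historical mean."""
    if len(history) < 2:
        return 0.0  # not enough history to define a typical pattern
    spread = pstdev(history)
    if spread == 0:
        return 0.0  # all past amounts identical; deviation is undefined
    return (amount - mean(history)) / spread


now = datetime(2025, 1, 1, 12, 0)
recent = [now - timedelta(minutes=10), now - timedelta(minutes=30),
          now - timedelta(hours=2)]
print(transaction_velocity(recent, now))
print(spending_deviation(500.0, [20.0, 25.0, 30.0, 22.0]))
```

Both features describe behavior relative to a customer's own baseline, which is the point of preferring them over raw transaction amounts.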
Stage 4: Modeling and Algorithm Development
The modeling stage involves selecting appropriate algorithms, training models, and optimizing their performance. This iterative process begins with establishing a baseline using simple approaches before progressing to more sophisticated techniques. Data scientists must balance model complexity with interpretability, considering the trade-offs between accuracy and explainability.
For a customer segmentation project, the progression might start with simple k-means clustering, advance to hierarchical clustering for better interpretability, and potentially incorporate deep learning approaches if the data complexity warrants it. Each approach requires different preprocessing steps, hyperparameter tuning strategies, and validation methodologies.
Model selection depends on multiple factors including data characteristics (volume, dimensionality, structure), problem type (classification, regression, clustering), performance requirements (accuracy versus speed), and deployment constraints (real-time versus batch processing). Cross-validation and careful train-test splitting ensure models generalize well to unseen data rather than memorizing training patterns.
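The splitting logic behind k-fold cross-validation can be sketched in a few lines. The version below is an illustrative, standard-library-only implementation with a hypothetical function name; in practice a library routine would normally be used, along with stratification or time-aware splits where the data calls for them.

```python
import random


def k_fold_indices(n_samples: int, k: int, seed: int = 0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for train, test in k_fold_indices(n_samples=10, k=5):
    print(f"train on {len(train)} samples, test on {len(test)}")
```

Every sample appears in exactly one test fold, so each model is always scored on data it did not see during training.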
Stage 5: Evaluation and Validation
Rigorous evaluation extends beyond simple accuracy metrics to encompass business relevance, statistical significance, and practical feasibility. Data scientists must select appropriate metrics aligned with business objectives. For instance, in medical diagnosis, minimizing false negatives might outweigh overall accuracy considerations.
Evaluation involves multiple perspectives: statistical performance (accuracy, precision, recall, F1-score), business impact (revenue increase, cost reduction), and operational feasibility (computational requirements, integration complexity). A recommendation system for an e-commerce platform would be evaluated not just on prediction accuracy but on metrics like click-through rates, conversion rates, and customer satisfaction scores in A/B testing scenarios.
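The statistical metrics named above follow directly from the counts of true and false positives and negatives. The sketch below computes them for binary labels (1 = positive, 0 = negative); the function name and the toy data are illustrative assumptions.

```python
def classification_metrics(y_true: list[int], y_pred: list[int]) -> dict:
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy data: 3 positive cases out of 10, of which the model finds only 1.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(classification_metrics(y_true, y_pred))
```

On this toy data accuracy is 80% while recall is only one in three, which is why a setting that must minimize false negatives cannot rely on accuracy alone.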
Validation strategies must also address potential biases, ensure fairness across different population segments, and test model robustness under various conditions. This includes stress testing with edge cases and evaluating performance degradation over time as data distributions shift.
Stage 6: Deployment and Monitoring
Successful deployment transforms analytical models into operational systems that deliver ongoing business value. This stage requires collaboration between data scientists, engineers, and IT operations to integrate models into existing infrastructure while ensuring scalability, reliability, and maintainability.
Deployment strategies vary from simple batch scoring systems that run periodically to complex real-time prediction services handling millions of requests. A credit scoring model might be deployed as a REST API integrated into loan application systems, requiring considerations for response time, failover mechanisms, and version control.
Post-deployment monitoring becomes crucial as model performance can degrade when real-world data diverges from training distributions, a phenomenon known as model drift. Automated monitoring systems track prediction accuracy, data quality, and system performance, triggering alerts when metrics fall below thresholds. Regular retraining schedules and feedback loops ensure models remain relevant and accurate over time.
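A threshold-based alert of the kind described can be sketched as a sliding-window accuracy tracker. The class name, window size, and threshold below are hypothetical, and the sketch assumes ground-truth outcomes eventually become available for each prediction, which in practice often happens with a delay.

```python
from collections import deque
from typing import Optional


class AccuracyMonitor:
    """Track accuracy over a sliding window and flag drops below a threshold."""

    def __init__(self, window_size: int = 100, threshold: float = 0.8):
        self.outcomes = deque(maxlen=window_size)  # oldest results fall off
        self.threshold = threshold

    def record(self, prediction, actual) -> None:
        """Store whether a single prediction matched the observed outcome."""
        self.outcomes.append(prediction == actual)

    def accuracy(self) -> Optional[float]:
        """Windowed accuracy, or None if nothing has been recorded yet."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_alert(self) -> bool:
        """Alert only once the window is full, to avoid noisy early readings."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return self.accuracy() < self.threshold


monitor = AccuracyMonitor(window_size=4, threshold=0.75)
for prediction, actual in [(1, 1), (0, 0), (1, 0), (0, 1)]:
    monitor.record(prediction, actual)
print(monitor.accuracy(), monitor.needs_alert())
```

An alert like this would typically feed the retraining schedule rather than replace it, since a drop in accuracy says that something changed but not what.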
The Iterative Nature of Data Science
While presented sequentially, these stages form an iterative cycle rather than a linear progression. Insights from modeling might necessitate collecting additional data, evaluation results could reframe the original problem definition, and deployment experiences often reveal new requirements. The CRISP-DM model, widely adopted across industries, explicitly recognizes this cyclical nature, with arrows connecting non-adjacent phases to represent common iteration patterns.
Agile methodologies increasingly influence data science workflows, promoting rapid prototyping, continuous stakeholder feedback, and incremental delivery. Rather than attempting to perfect each stage before proceeding, teams develop minimum viable models that evolve through successive iterations based on real-world performance and changing business needs.
Conclusion
Understanding these fundamental stages provides a roadmap for managing data science projects effectively, though specific implementations vary based on organizational context, project complexity, and domain requirements. Success requires not just technical proficiency in each stage but also strong project management, clear communication with stakeholders, and flexibility to adapt as understanding deepens throughout the project lifecycle.
References
Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., & Wirth, R. (2000). CRISP-DM 1.0: Step-by-step data mining guide. SPSS Inc.
Provost, F., & Fawcett, T. (2013). Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media.
Microsoft Team Data Science Process Documentation. (2020). Azure Architecture Center. Microsoft Corporation.
Kelleher, J. D., & Tierney, B. (2018). Data Science. MIT Press.
CrowdFlower. (2016). Data Science Report: The View from Data Scientists. Figure Eight Inc.
Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17(3), 37-54.
Sunday, September 7, 2025
LLMs As An Emerging Technology
Saturday, August 2, 2025
The Good, The Bad, and Anthropic
This new article by Anthropic basically says that "good" and "bad" personality traits are just vectors, as everything truly is in AI today.
Larger Insects
I came across this tweet by Yishan Wong about his kid's ramblings.
At one point in our planet's history, this was actually possible. During the Carboniferous period (~300 million years ago), oxygen levels were much higher (~35% compared to 21% today), and giant dragonfly-like insects with wingspans of about 70 cm could survive.
Large insects the size of a car or even a dog are not feasible on Earth today due to a combination of factors. Let's review these one by one and work out what would need to change in each to make large insects a reality.
1. No Lungs = Impossible ...?
Insects do not have lungs. They rely on a tracheal system, which is basically a network of tubes that delivers oxygen directly to their tissues, largely by diffusion. This works efficiently only at small sizes because diffusion is slow and becomes less effective over longer distances.
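To get a feel for why distance matters so much, here is a rough back-of-the-envelope sketch using the standard one-dimensional diffusion estimate, t = x^2 / (2D). The diffusion coefficient for oxygen in air is an approximate value I am assuming here (about 0.2 cm^2/s), and real insects also pump air through their tracheae, so treat this as an illustration of the scaling and not a model of insect breathing.

```python
# Assumed, approximate diffusion coefficient of oxygen in air (cm^2/s).
O2_DIFFUSION_CM2_PER_S = 0.2


def diffusion_time_seconds(distance_cm: float,
                           d: float = O2_DIFFUSION_CM2_PER_S) -> float:
    """Characteristic 1-D diffusion time, t = x^2 / (2 * D)."""
    return distance_cm ** 2 / (2 * d)


for distance in (0.1, 1.0, 10.0):
    seconds = diffusion_time_seconds(distance)
    print(f"{distance:>5} cm -> about {seconds:.3f} s")
```

The key point is the square: making the path 10 times longer makes diffusion take roughly 100 times as long.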
However, giant insects existed during the Carboniferous period, most likely because of the high oxygen content in the atmosphere. Thus, if we want to bring back those times, I propose that we raise the atmosphere's oxygen content by converting carbon dioxide through an artificially managed, large-scale photosynthesis program.
2. Exoskeletal Issues
Insects have exoskeletons, not internal bones. As body size increases, (1) volume and weight grow faster than surface area (the square-cube law), (2) the exoskeleton would become too heavy and brittle to support a large body, and (3) it would collapse under its own weight or break when the insect tried to move, jump, fly, or perform other mechanical actions.
If we wanted to keep their shape and structure (or at least their insect-like qualities), the material that makes up their exoskeleton would need a much stronger microstructure.
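The square-cube law is easy to put into numbers. The sketch below assumes the insect is scaled up uniformly in every direction with the same material and density, which is a simplification; the function name is made up for this example.

```python
def scale_insect(length_factor: float) -> dict:
    """Scaling factors when every linear dimension grows by `length_factor`."""
    area_factor = length_factor ** 2   # surface and cross-section areas
    mass_factor = length_factor ** 3   # volume, and so weight at equal density
    return {
        "area_factor": area_factor,
        "mass_factor": mass_factor,
        # Weight carried per unit of supporting cross-section.
        "load_per_area_factor": mass_factor / area_factor,
    }


print(scale_insect(10.0))
```

So an insect 10 times longer has 100 times the supporting area but 1,000 times the weight, which means every bit of exoskeleton has to carry about 10 times the load.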
3. Heat Regulation
Much like the lobsters Yishan's son mentioned, larger bodies expend more energy and thus generate more heat, which they also lose more slowly. Insects don’t sweat or regulate their temperature internally the way mammals do. Giant insects today would likely overheat quickly in direct sunlight or during exertion, especially now that global warming is rapidly raising atmospheric temperatures.
If we could bring down the global temperature by some degree, larger insects would be one more step closer to being feasible.
These solutions are definitely difficult to achieve and require systematic cooperation among countries. However, if we truly want larger insects to roam our Earth again, we must do this together. I am no entomologist, but I empathize with those who wish for larger insects as domesticable creatures.
Sunday, July 27, 2025
GRPO and Fixed RL Algorithm on Sequence Models
Group Sequence Policy Optimization
Could not understand more than half of this, and I had to rely on what others are saying, but this is really good for LMs trained using RL. Try your hand at understanding their paper and let me know what you found!
Wednesday, July 23, 2025
Reviewing IT Proficiency Learning Materials
Happy 23rd of July! It is 6 PM, and I am writing to you all about my recent progress with reviewing for TOPCIT. Currently, I have only saved the PDF document and done some light reading on Book 3: Overview of System Architecture.
Each of the books is less than 300 pages, but there are 6 of them. Do you think I'll be able to finish all of them by the end of the week (July 26)? I don't think so, but I can certainly try!
Honestly, it's been a while since I've done some reading... About a month already. It might take a while to get up to speed, and I'm still on page 24... sighh...
Wish me luck.