Tuesday, October 7, 2025

Life would have no meaning if not for its impermanence for every eventual permanence, there are many impermanence that build us fundamentally and I think that is beautiful.

Wednesday, October 1, 2025

C2PA should be mainstream

As AI-generated images become the norm in society, it is imperative that we develop a method of provenance for certain media.

The C2PA (Coalition for Content Provenance and Authenticity) is one such system. It creates a secure, tamper-evident record of a piece of content’s origin and history. When a photo is taken by a camera, edited in software, or generated entirely by an AI model, C2PA can embed verified details like who or what created it, what tools were used, and whether it has been altered. This information travels with the file itself, protected by cryptography so it can’t be easily stripped or forged without detection.

One problem with this, though, is that it cannot prevent media that is screen-recorded. This just means that provenance is only as strong as the ecosystem that supports it. C2PA works best when platforms, devices, and creators ALL PARTICIPATE in preserving and honoring the chain of trust. And while it can’t stop every form of deception, it can establish a baseline of accountability for content that moves through compliant channels that people rely on, like the news and social media platforms.

Read More: C2PA | Verifying Media Content Sources

Thursday, September 11, 2025

Stages in a Generic Process Model for Data Science Projects

Introduction

Data science projects require systematic approaches to transform raw data into actionable insights. While various frameworks exist in the industry, most share common fundamental stages that guide practitioners from initial problem conception to deployment and maintenance. This report examines the typical stages found in generic data science process models, drawing from established methodologies including CRISP-DM (Cross-Industry Standard Process for Data Mining), TDSP (Team Data Science Process), and KDD (Knowledge Discovery in Databases).

The Six Core Stages of Data Science Projects

Stage 1: Business Understanding and Problem Definition

The foundation of any successful data science project begins with thoroughly understanding the business context and clearly defining the problem to be solved. This stage involves translating business objectives into data science questions that can be answered analytically. Project stakeholders must establish success criteria, identify constraints, and assess feasibility.

For example, a retail company seeking to reduce customer churn would first need to define what constitutes churn in their context (no purchase in 90 days versus account closure), determine acceptable prediction accuracy levels, and establish how predictions will be integrated into business operations. This stage often requires multiple iterations as stakeholders refine their understanding of what's possible given available data and resources.

According to Provost and Fawcett (2013) in "Data Science for Business," this initial stage determines project success more than any technical consideration, as misaligned objectives lead to technically sound but practically useless solutions.

Stage 2: Data Collection and Acquisition

Once objectives are clear, the focus shifts to identifying and gathering relevant data sources. This stage encompasses both discovering what data exists within the organization and determining what external data might enhance the analysis. Data scientists must evaluate data accessibility, quality, volume, and relevance to the defined problem.

Consider a healthcare organization developing a patient readmission prediction model. They would need to collect electronic health records, demographic information, treatment histories, and potentially external data such as socioeconomic indicators from census databases. Each data source requires different acquisition methods, from database queries to API calls to manual collection processes.

The challenge extends beyond mere collection to ensuring proper data governance, privacy compliance (especially with regulations like GDPR or HIPAA), and establishing sustainable data pipelines for ongoing projects.

Stage 3: Data Exploration and Preparation

Data exploration and preparation typically consumes 60-80% of a data scientist's time, according to surveys by CrowdFlower (2016) and subsequent industry studies. This stage involves understanding data characteristics through statistical summaries and visualizations, identifying quality issues, and transforming raw data into analysis-ready formats.

Exploratory Data Analysis (EDA) reveals patterns, anomalies, and relationships that inform subsequent modeling decisions. For instance, discovering that customer purchase patterns follow strong seasonal trends would influence both feature engineering and model selection strategies. Data scientists create visualizations ranging from simple histograms to complex correlation matrices to understand variable distributions and relationships.

Data preparation encompasses cleaning (handling missing values, correcting errors), transformation (normalization, encoding categorical variables), and feature engineering (creating derived variables that better capture underlying patterns). A financial fraud detection project might create features such as "transaction velocity" (transactions per hour) or "deviation from typical spending patterns" rather than using raw transaction amounts alone.

Stage 4: Modeling and Algorithm Development

The modeling stage involves selecting appropriate algorithms, training models, and optimizing their performance. This iterative process begins with establishing a baseline using simple approaches before progressing to more sophisticated techniques. Data scientists must balance model complexity with interpretability, considering the trade-offs between accuracy and explainability.

For a customer segmentation project, the progression might start with simple k-means clustering, advance to hierarchical clustering for better interpretability, and potentially incorporate deep learning approaches if the data complexity warrants it. Each approach requires different preprocessing steps, hyperparameter tuning strategies, and validation methodologies.

Model selection depends on multiple factors including data characteristics (volume, dimensionality, structure), problem type (classification, regression, clustering), performance requirements (accuracy versus speed), and deployment constraints (real-time versus batch processing). Cross-validation and careful train-test splitting ensure models generalize well to unseen data rather than memorizing training patterns.

Stage 5: Evaluation and Validation

Rigorous evaluation extends beyond simple accuracy metrics to encompass business relevance, statistical significance, and practical feasibility. Data scientists must select appropriate metrics aligned with business objectives—for instance, in medical diagnosis, minimizing false negatives might outweigh overall accuracy considerations.

Evaluation involves multiple perspectives: statistical performance (accuracy, precision, recall, F1-score), business impact (revenue increase, cost reduction), and operational feasibility (computational requirements, integration complexity). A recommendation system for an e-commerce platform would be evaluated not just on prediction accuracy but on metrics like click-through rates, conversion rates, and customer satisfaction scores in A/B testing scenarios.

Validation strategies must also address potential biases, ensure fairness across different population segments, and test model robustness under various conditions. This includes stress testing with edge cases and evaluating performance degradation over time as data distributions shift.

Stage 6: Deployment and Monitoring

Successful deployment transforms analytical models into operational systems that deliver ongoing business value. This stage requires collaboration between data scientists, engineers, and IT operations to integrate models into existing infrastructure while ensuring scalability, reliability, and maintainability.

Deployment strategies vary from simple batch scoring systems that run periodically to complex real-time prediction services handling millions of requests. A credit scoring model might be deployed as a REST API integrated into loan application systems, requiring considerations for response time, failover mechanisms, and version control.

Post-deployment monitoring becomes crucial as model performance can degrade when real-world data diverges from training distributions—a phenomenon known as model drift. Automated monitoring systems track prediction accuracy, data quality, and system performance, triggering alerts when metrics fall below thresholds. Regular retraining schedules and feedback loops ensure models remain relevant and accurate over time.

The Iterative Nature of Data Science

While presented sequentially, these stages form an iterative cycle rather than a linear progression. Insights from modeling might necessitate collecting additional data, evaluation results could reframe the original problem definition, and deployment experiences often reveal new requirements. The CRISP-DM model, widely adopted across industries, explicitly recognizes this cyclical nature, with arrows connecting non-adjacent phases to represent common iteration patterns.

Agile methodologies increasingly influence data science workflows, promoting rapid prototyping, continuous stakeholder feedback, and incremental delivery. Rather than attempting to perfect each stage before proceeding, teams develop minimum viable models that evolve through successive iterations based on real-world performance and changing business needs.

Conclusion

Understanding these fundamental stages provides a roadmap for managing data science projects effectively, though specific implementations vary based on organizational context, project complexity, and domain requirements. Success requires not just technical proficiency in each stage but also strong project management, clear communication with stakeholders, and flexibility to adapt as understanding deepens throughout the project lifecycle.

References

Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., & Wirth, R. (2000). CRISP-DM 1.0: Step-by-step data mining guide. SPSS Inc.

Provost, F., & Fawcett, T. (2013). Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media.

Microsoft Team Data Science Process Documentation. (2020). Azure Architecture Center. Microsoft Corporation.

Kelleher, J. D., & Tierney, B. (2018). Data Science. MIT Press.

CrowdFlower. (2016). Data Science Report: The View from Data Scientists. Figure Eight Inc.

Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17(3), 37-54.

Sunday, September 7, 2025

LLMs As An Emerging Technology

The problem with emerging technologies is that it takes time before they start being good. Take LLMs for example.

We can all agree that these chatbots are really good. But when integrated into software, they're practically dogshit. Same can be said to emerging technologies in the past like shareware games, the true web (Web 2.0), and Linux.

atp, people are still experimenting over what makes these models tick. You have people working on agents, multimodal AI, RAG systems, fine-tuning, prompt engineering, and novel architectures. At scale, these are unreliable as hell. The more control you give to AI, the more prone you are to errors. None of it is reliable at scale. Thus, the same pattern is repeated for these LLMs: early adopters wrestle with janky interfaces, inconsistent outputs, and fragile integrations. Then, slowly, tooling improves. Standards emerge. Best practices crystallize. The "dogshit" phase is not a bug, it’s a feature of innovation. It’s the messy sandbox where the real breakthroughs are prototyped.

Right now, LLMs in production feel like duct-taping a jet engine to a bicycle. Sometimes, it flies. Often, it explodes. But each explosion teaches us how to build a chassis that can handle the thrust.

The winners won’t be the ones who waited for perfection. They’ll be the ones who shipped the janky MVP, learned from the dumpster fires, and iterated like hell while everyone else was complaining about token limits or JSON formatting errors.

Let's keep innovating, people.

Saturday, August 2, 2025

The Good, The Bad, and Anthropic

This new article by Anthropic is basically saying "good" and "bad" personality traits are just vectors, as everything truly is for the state of AI today.

Larger Insects

I came across this tweet by Yishan Wong about his kid's ramblings.

Thus, I began to ponder that lobsters are technically insects, and if we could do this to lobsters, why not terrestrial insects?

After our planet's inception, this was actually possible. During the Carboniferous period (~300 million years ago), oxygen levels were much higher (~35% compared to 21% today), and giant insects like dragonflies with 70 cm wingspans could survive.

Currently, large insects the size of a car or even a dog are not feasible on Earth today due to a combination of factors. Let's review these one by one and determine what changes we would need in each respective element to make large insects a reality.

1. No Lungs = Impossible ...?

Insects do not have lungs. They rely on a tracheal system; it is basically a network of tubes that diffuses oxygen directly to their tissues. However, this works efficiently only at small sizes because diffusion is slow and becomes less effective over longer distances.

However, giant insects existed during the Carboniferous period, most likely because of the high oxygen content in the atmosphere. Thus, if we want to bring back those times, I propose that we should start converting the carbon dioxide in our atmosphere to increase the oxygen content through an artificially managed large-scale photosynthesis program.

2. Exoskeletal Issues

Insects have exoskeletons, not internal bones. As body size increases, (1) volume and weight grow faster than surface area (square-cube law), (2) the exoskeleton would become too heavy and brittle to support a large body, and (3) it would also crush under its own weight or break when trying to move, jump, fly, or perform other mechanical actions.

Suppose we tried to maintain their shape and structure (or at least maintain their insect-like qualities). In that case, the material that composes their exoskeleton must have a stronger microstructure.

3. Heat Regulation

Much like the lobsters Yishan's son mentioned, larger bodies expend more energy, and thus, generate more heat and lose it more slowly. Insects don’t sweat or have internal thermoregulation like mammals. Giant insects right now would likely overheat quickly in direct sunlight or during exertion, especially during these times that we are experiencing a rapid increase in atmospheric temperature brought by global warming.

If we could bring down the global temperature by some degree, larger insects would be one more step closer to being feasible.

These solutions are definitely difficult to achieve and require systematic cooperation among countries. However, if we so desire that larger insects roam our Earth again, we must do this together. I am no entomologist, but I empathize with some of them in their desire for larger insects as domesticable creatures.

Sunday, July 27, 2025

GRPO and Fixed RL Algorithm on Sequence Models

New paper released by the Alibaba AI team working on Qwen:
Group Sequence Policy Optimization

It appears that they have fixed GRPO and Reinforcement Learning for sequence models.
Could not understand more than half of this, and I had to rely on what others are saying, but this is really good for LMs trained using RL. Try your hand at understanding their paper and let me know what you found!

Wednesday, July 23, 2025

Reviewing IT Proficiency Learning Materials

Happy 23rd of July! It is 6 PM, and I am writing to you all about my recent progress with reviewing for TOPCIT. Currently, I have only saved the PDF document and done some light reading on Book 3: Overview of System Architecture.

Each of the books is less than 300 pages, but there are 6 of them. Do you think I'll be able to finish all of them by the end of the week (July 26)? I don't think so, but I can certainly try!

Honestly, it's been a while since I've done some reading... About a month ago already. Might take a while to speed things through and I'm still at page 24... sighh...

Wish me luck.

Tuesday, July 22, 2025

Midnight Vent

I have just vented with someone I met more than month ago. Right now, we have established that we are more than just friends. His name is Marco, and I could describe him as a walking historical literature book. He's adept with Pre-Colonial Philippine History, and that's where I found interest in him.

It took us a bit more than a fortnight of talking until I considered a relationship with him. He says he fell in love with me first. Little did he know when I first slid into his DMs, I was already interested in him. He talks a lot about his interests: Pre-Colonial Philippines, the novel he's writing, the comics he's read, the TTRPGs he's played. He is the yapper type. I don't mind that at all. In fact: I enjoyed it. I feel like I made a good friend and partner. Honestly, I think he's a great partner, but somehow there's always something lacking in our relationship.

Not meaning to compare with my previous partner, but I feel like he centers himself a bit much to me. I think my future self will learn to love this part of him, but as I'm typing this, that's what I think of him.

Anyway, that's a bit of a life update I guess, but that's not what I'm here to talk about.

Hey there, dear readers. It's been a while. There ain't probably many of you—just a handful of you. Anyways, I made a lot of promises to deliver content in here, but I never did. I might never will and shit happens. That's what I'll be talking about.

I finished my Second Year of studying Computer Science less than 3 months ago. Hooray! For the summer vacation, I thought I had everything planned out. The first month, I'd be working on a sari-sari POS app to help my mom's canteen. I had a rough idea as to the technologies I'd be using. It would be written in Kotlin and the UI will follow the Material Design guidelines with Jetpack Compose. the local-area-hosted database will be in NocoDB with GraphQL for the querying. Not really experienced with working with databases, so I'm not sure about the NocoDB + GraphQL + LAN-hosting part.

As to the progress of this app... I only got to working with the wireframe and UI conventions. Never got to actual coding. It's already our 3rd month of summer vacation, and I only have less than a month of free time left. It's sad that I never will get to finishing this since I've picked up on a couple of responsibilities the past few days. I have been elected as VP Internal for our student body organization in our college, and we've been working with our schools in Cebu to establish an interschool organization. We're still drafting the constitution for it though.

Anyways, the second month of summer vacation was truly my regretful month. I only played games around that time. Nothing productive. Valorant. Marvel Rivals. Star Wars Battlefront 2. Clair Obscur: Expedition 33. I justified myself by thinking this should be the time I'm reserving for enjoyment since next summer vacation would not be a vacation anymore. Unfortunately, our summer term for our Third Year is going to be our OJT. Not enthusiastic about that.

The third month of the summer vacation is coming to a close now, and upon reflection, I could say I'm very disappointed in myself. The thing is, I always will be at this time of year. I always end up promising myself to be better but never deliver. It's an endless loop. Just an hour ago, I vented to Marco about a mini panic attack I just had a few minutes prior to us conversing. Usually, this panic attack is what drives me to action. Been like that since 8th Grade after a friend of mine committed suicide due to familial pressure and pressure to oneself. I can't describe this panic attack to be uncomfortable, but I don't like it at all. I wish this was never the phenomenon that precedes to my locking in. The loop usually flows as follows:

1. I promise myself to do X.

2. I fail to do X.

3. My body proceeds to go to a slight panic attack.

4. Normally, I would have suicidal thoughts. Nowadays, because of my late friend, I just force myself to lock in.

At least that's what it generally feels like. It's like the storm before the calm? Opposite to the phrase, the calm before the storm.

Anyways, that's all I had to share to y'all on this day. I've omitted a lot of details because I'm getting lazy and the sleepiness is getting to me. I'm just making this blog a piece about myself without regards as to who reads these and the context they might lack as they read these.

If you made it this far, thank you. If you are able to read this on or before the end of the month of this year, send me a message with a screenshot of this section of the blog post, and I'll reply to you with a gift. I love you. Thank you.

Anything Lindrew

Pages