The R&D Data Journey: Understanding Your Scientific Data’s Lifecycle

Mar 28, 2025 | Scientific Data

Author: Dr. Nathalie Batoux

Data is dynamic, constantly moving and interacting with other information. Have you ever wondered about the journey your scientific data takes? Where it’s generated, how it’s used, and whether your current processes are efficient? 

Data Stewardship: Shepherding Your Data

I prefer the term “data shepherding” over “data stewardship” (as I explained in a recent LinkedIn post) because it better captures the active care and guidance needed for data management. This includes the data journey.

In life sciences R&D, we deal with massive amounts of data. Each piece of data has its own lifecycle: creation, manipulation, analysis, storage, access, update, searching, and eventually deletion. Understanding this journey is crucial for ensuring your data remains trustworthy for making informed decisions.
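To make that lifecycle concrete, here is a minimal sketch of how you might track a dataset’s current stage together with a provenance log of every transition. The stage names mirror the list above; the `DatasetRecord` structure itself is my own illustration, not a standard:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Stage(Enum):
    """Simplified lifecycle stages, mirroring the list above."""
    CREATED = "created"
    MANIPULATED = "manipulated"
    ANALYSED = "analysed"
    STORED = "stored"
    ACCESSED = "accessed"
    UPDATED = "updated"
    SEARCHED = "searched"
    DELETED = "deleted"


@dataclass
class DatasetRecord:
    """A dataset plus a provenance log of every stage transition."""
    name: str
    stage: Stage = Stage.CREATED
    history: list = field(default_factory=list)

    def move_to(self, stage: Stage, note: str = "") -> None:
        # Log when and why each transition happened, so the
        # journey stays auditable end to end.
        self.history.append((datetime.now(timezone.utc), stage, note))
        self.stage = stage


record = DatasetRecord("hplc_run_042")
record.move_to(Stage.ANALYSED, "peak integration")
record.move_to(Stage.STORED, "archived to the ELN")
```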

Why the Data Journey Matters

The history and quality of your data become especially critical when using it to make decisions and for AI and Machine Learning (ML) training. You’ve likely heard the phrase “garbage in, garbage out.” If data is corrupted or recorded without proper metadata and context, any downstream decisions based on that data are at serious risk.
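As a minimal sketch of guarding against “garbage in”, the check below quarantines any record that arrives without the metadata needed to trust it downstream. The required fields here are illustrative assumptions; your organisation’s list will differ:

```python
# Illustrative required fields; adapt to your own metadata standard.
REQUIRED_METADATA = {"instrument_id", "operator", "timestamp", "units", "protocol"}


def partition_by_quality(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split records into usable and quarantined, by metadata completeness."""
    usable, quarantined = [], []
    for rec in records:
        missing = REQUIRED_METADATA - rec.get("metadata", {}).keys()
        (quarantined if missing else usable).append(rec)
    return usable, quarantined


good, bad = partition_by_quality([
    {"value": 4.2, "metadata": {"instrument_id": "HPLC-01", "operator": "nb",
                                "timestamp": "2025-03-28T09:00Z", "units": "mAU",
                                "protocol": "SOP-12"}},
    {"value": 7.1, "metadata": {"operator": "nb"}},  # missing context: quarantined
])
```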

Mapping the Data Journey: A London Tube Analogy

I’ve created a visual representation of the data journey inspired by the London Tube map. Rather than showing physical locations, this map illustrates how data, like commuters, flows through different stages:

[Figure: the data journey drawn as a Tube-style map, with zones, lines, and interchange nodes]

The data journey is not linear, far from it. Like the TfL Tube map, it has many lines and intersections where sets of data “interact” with each other. Let’s review the highly simplified stages, or “zones”, I have come up with:

Zone 6: External Data Sources

  • Publicly published papers
  • Commercially purchased datasets
  • Collaborator data
  • Requires due diligence for quality assurance
  • Limited control, but proper referencing and storage are still needed…

Zone 5: Internal Data Generation

  • Manual lab journal entries
  • Raw instrument measurements
  • Email communications
  • Primary data capture points…

Zone 4: Initial Processing

  • Basic calculations
  • Automated instrument processing, e.g. curve integration on an HPLC system (see the sketch below)
  • Raw data transformation …
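To make the HPLC example concrete, here is a minimal sketch of automated peak integration using the trapezoidal rule. The chromatogram is synthetic and the integration window is hard-coded; a real chromatography data system would also handle baseline fitting and peak detection:

```python
import numpy as np

# Synthetic chromatogram: one Gaussian peak on a flat baseline.
time = np.linspace(0.0, 10.0, 1001)                      # minutes
signal = 5.0 * np.exp(-((time - 4.0) ** 2) / 0.1) + 0.2  # mAU

baseline = 0.2
window = (time > 3.0) & (time < 5.0)                     # integration window
# Trapezoidal rule over the baseline-corrected peak region.
# (np.trapz is renamed np.trapezoid in NumPy >= 2.0.)
area = np.trapz(signal[window] - baseline, time[window])
print(f"Peak area: {area:.2f} mAU·min")
```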

Zone 3: Data Integration

  • Combining multiple datasets
  • Experimental reporting
  • Results presentation…

Zone 2: Advanced Analysis and Data Aggregation

  • Cross-project data aggregation
  • Trend analysis
  • Project decision-making or future project initiation…

Zone 1: Extended Applications

  • Operational analytics
  • Equipment utilization metrics
  • AI/ML training data…

You will have noticed that I have described what happens to the data in these “stages” or “zones” in a very simplified way, but I have not mentioned where the data is stored, or how the interactions between the datasets (the white dots on the map) happen. This is because both depend very much on your organisation and your data maturity.

The Challenges of Data Storage

It is common for data to be generated and initially processed by specific systems and stored within those systems’ realm. Systems may be isolated, unable to communicate with other systems, or they may record data in a proprietary, unreadable format. The data then lives independently in data siloes, making it difficult to extract, link and combine with other data.

Data siloes are one of the major issues encountered when looking at efficient data journeys: the black line on the map simply doesn’t exist or is very broken.

Preferring connected systems, ideally ones with public APIs for programmatically accessing and extracting structured, readable data, is a step in the right direction.
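For example, here is a minimal sketch of pulling results programmatically rather than copy-pasting them from a user interface. The endpoint, URL and field names are hypothetical illustrations, not any real system’s API:

```python
import requests

# Hypothetical LIMS endpoint; the URL and response shape are assumptions.
BASE_URL = "https://lims.example.com/api/v1"


def fetch_results(experiment_id: str, token: str) -> list[dict]:
    """Pull structured results over a public API instead of copy-pasting."""
    resp = requests.get(
        f"{BASE_URL}/experiments/{experiment_id}/results",
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()  # fail loudly rather than silently ingesting errors
    return resp.json()["results"]
```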

Making Data Interactions Work

Data interactions are the other major point to consider. I represented them by the white nodes on the map and simplified the view to only two or three pieces of information interacting. There are many criteria to consider for data interaction:

  • How does the exchange happen? Is it automatic? Programmatic? Or do you have to manually find the data and copy and paste it?
  • Interoperability: can the data be understood by the other systems?
  • Data harmonisation and ontologies: is the data using a “common language”, so that everybody is calling a spade a spade and not some calling it a shovel? (See the sketch after this list.)
  • Contemporary data: are you interacting with the latest version of the data?
  • Accessibility: can you find the data, and do you have permission to access it?
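Here is a minimal sketch of the harmonisation point: local synonyms are mapped onto one agreed vocabulary before systems exchange data, and anything unmapped is flagged rather than passed through silently. The terms are illustrative, not a published ontology:

```python
# Illustrative controlled vocabulary; a real one would come from an
# agreed ontology, not a hand-written dictionary.
PREFERRED_TERMS = {
    "spade": "spade",
    "shovel": "spade",  # different label, same concept
    "hplc": "HPLC",
    "high performance liquid chromatography": "HPLC",
}


def harmonise(term: str) -> str:
    """Return the agreed term, or flag anything unmapped for review."""
    key = term.strip().lower()
    if key not in PREFERRED_TERMS:
        raise KeyError(f"Unmapped term: {term!r}; add it to the vocabulary")
    return PREFERRED_TERMS[key]


assert harmonise("Shovel") == "spade"
```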

Practical Improvements

While mapping your entire data journey may seem overwhelming, start by asking these key questions:

  • Can we automate manual data transfers?
  • How can we reduce copy-paste errors?
  • Should we implement a data catalog?
  • Can we streamline data sharing?

Reducing manual input and copy-paste, and enabling data catalogs with terms agreed across your organisation, or even better across the industry, can dramatically improve not only the data journey but also the data quality. This in turn will increase the value of the data and limit the effort needed to manage it.
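To illustrate the data catalog idea, here is a minimal sketch: one searchable index of datasets, described with terms agreed across the organisation. The field names and values are my own illustrative assumptions:

```python
# A toy catalog; field names and values are illustrative assumptions.
CATALOG = [
    {
        "name": "hplc_run_042",
        "technique": "HPLC",                       # agreed vocabulary, not free text
        "project": "PRJ-017",
        "owner": "analytical-chemistry",
        "location": "s3://rnd-data/hplc/run_042/",
    },
]


def find_datasets(**criteria) -> list[dict]:
    """Return catalog entries matching every given field exactly."""
    return [
        entry for entry in CATALOG
        if all(entry.get(k) == v for k, v in criteria.items())
    ]


print(find_datasets(technique="HPLC", project="PRJ-017"))
```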

Remember these best practices:

  • Follow FAIR data principles
  • Use ALCOA+ guidance
  • Use standard formats (like those developed by the Allotrope Foundation) to help data flow between systems and make it readable by more of them

Final Thoughts

Don’t try to solve everything at once. Focus on simple, impactful changes that can deliver quick wins. Most importantly, involve your teams throughout the process—they’re essential to the success of any data initiative.

About the Author

Dr. Nathalie Batoux

Dr. Nathalie Batoux is an independent data stewardship consultant, based in the UK, with 25+ years of experience bridging scientific research and digital innovation. Combining a PhD in organic chemistry with extensive software development expertise, she provides strategic data management solutions for R&D departments and life science companies. As a bilingual professional with global team experience, she excels at translating complex scientific challenges into actionable strategies across organizational levels. Her mission is to empower scientists through data management to accelerate research efficiency.

About Kalleid

Kalleid, Inc. is a boutique IT consulting firm that has served the scientific community since 2014. We work across the value chain in R&D, clinical, and quality areas to deliver support services for software implementations in highly complex, multi-site organizations. At Kalleid, we understand how effective project management plays a key role in ensuring the success of your IT projects. Kalleid project managers have the right mix of technical know-how, domain knowledge and soft skills to effectively manage your project over its full lifecycle. From project planning to go-live, our skilled PMs will identify and apply the most effective methodology (e.g., agile, waterfall, or hybrid) for successful delivery. If you are interested in exploring how Kalleid project managers can benefit your organization, please don’t hesitate to contact us today.