For the launch of version 3 of the ODM, we are planning on also launching an ultra-minimal structure and template. The goal for this is to cut away everything that’s not essential so that users can get started in under a minute and understand the structure of the model.
To this end, I have started the process and cut away the reference tables: while they are essential, I think they clutter the ERD in a way that is confusing to users freshly approaching the ODM. I have also eliminated all other tables except for measures, samples, and sites. I am of the opinion that while it is ideal to capture additional information on methods/protocols, instruments, organizations, etc., it is not strictly essential. I also trimmed the majority of the fields in each of these tables, but I left some that I think could still be removed. I’ve attached an ERD of this ultra-minimal structure here for your review:
I think the absolute minimum ERD could get rid of the samples table as well. The mandatory fields in the measures table would be measureRepID, siteID, aDateEnd, measure, value, unit, aggregation and in the sites table just the siteID and the site name (which is missing from your ERD). This is also supported by the way some of the EU countries are presenting data in their dashboards.
What do you think?
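To make the proposal above concrete, here is a rough sketch of a check for the mandatory fields listed for the measures and sites tables. The field names for measures are the ones given above; `name` is only a placeholder for the site-name field (its actual ODM identifier is not stated here), and the example values are made up.

```python
# Mandatory fields as proposed above. "name" is a placeholder for the
# site-name field; the real ODM identifier is an assumption here.
MEASURES_REQUIRED = (
    "measureRepID", "siteID", "aDateEnd",
    "measure", "value", "unit", "aggregation",
)
SITES_REQUIRED = ("siteID", "name")


def missing_fields(row, required):
    """Return the required fields that are absent or blank in one row."""
    missing = []
    for field in required:
        value = row.get(field)
        # Treat None and empty/whitespace strings as missing; 0 is a valid value.
        if value is None or str(value).strip() == "":
            missing.append(field)
    return missing


# Made-up example row, for illustration only.
example_measure = {
    "measureRepID": "m-0001",
    "siteID": "site-01",
    "aDateEnd": "2024-01-15",
    "measure": "n1",
    "value": 1200.0,
    "unit": "gc/L",
    "aggregation": "mean",
}
print(missing_fields(example_measure, MEASURES_REQUIRED))
print(missing_fields({"siteID": "site-01"}, SITES_REQUIRED))
```

A check like this would only confirm that the columns are filled in, not that the values are valid ODM codes.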
I agree that a lot of labs don’t report sample details. But I’m kinda of the opinion that there are two approaches here:
1- a true minimal template with basically just the measure, value, and the site name.
2- a more opinionated minimal template where we opine (somewhat) on what should be considered the minimal data. And for that I think we should include some sample details.
I think that we should maybe be trying to do the second option, but I’m open to moving on that depending on feedback from others.
The minimal version sounds like a great idea. It’ll definitely make the ODM less intimidating and easier to get started with.
The question seems, to me, to be who the data is “minimal” for.
If an organisation always uses the same sampling protocol, then the sample data is sort of implied and known to them. However, as a potential user of open ODM datasets, I could not use datasets that don’t at least report some sampling details (was it sludge or wastewater, and was the sampling flow-proportional or time-proportional?). The fields you kept in the table look like a very good minimal set for that purpose.
A great point! Thanks, JD. I think - of course - you’re right, that for internal data management and tracking, much of this additional metadata, specifically the sampling metadata, may be moot. Even so, I think we should be aiming for data sharing/interoperability standards, even if data is only internal.
I think we could potentially trim this further by removing the reportable fields. We could also potentially drop compartment. I think we could even trim off some of the date fields, for example, and keep only collDT in samples and only aDateEnd in measures. Do we think we should keep fraction still? And for sites, do we think that latitude, longitude, and EPSG are all mandatory? Should we drop EPSG from the minimal standard?
I think fraction is probably still required, as it’s a very basic piece of data that can be informative about the analysis method. compartment is required if the minimal template is supposed to be usable outside of just wastewater. By removing the different dt’s, you get fewer fields, but you might create more ambiguity for the users, which is also not ideal.
And if geoEPSG is removed, harmonising datasets in GIS will be impossible (unless you impose an EPSG in the minimal structure).
Yes - we specify that the minimal template is for water. We cut compartment and keep fraction. With the caveat that there is a wide and a long version of this template, and in the wide version compartment will be populated for users by embedding it in the column header name for the measurement value.
In keeping with that - we suggest the minimal template is for data collectors, true, but with an eye to data reusability. In line with our philosophy for open and FAIR data, we should be opinionated in seeking to establish that as the norm.
We keep geoEPSG - while it may not be reported, it is still our belief that it should be considered “minimal”.
This will exist as a wide and a long version, to cater to different user preferences. For the wide version we will need to specify the measures being collected - I will suggest the SARS-CoV-2 N1, and N2 gene regions, as well as PMMoV. Any thoughts there?
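As a rough illustration of how the wide and long versions could relate, the sketch below unpivots one wide row into long rows. The header convention `measure|compartment|unit` is purely hypothetical (no naming scheme has been agreed in this thread), as are the measure labels and values; only `siteID` and `collDT` come from the discussion above.

```python
# Columns carried over unchanged from the wide row to every long row.
ID_COLUMNS = ("siteID", "collDT")


def wide_to_long(wide_rows):
    """Unpivot wide rows into one long row per reported measurement.

    Assumes a hypothetical header convention "measure|compartment|unit"
    for every column that is not an ID column.
    """
    long_rows = []
    for row in wide_rows:
        for header, value in row.items():
            # Skip ID columns and measurements that were left blank.
            if header in ID_COLUMNS or value in (None, ""):
                continue
            measure, compartment, unit = header.split("|")
            long_rows.append({
                "siteID": row["siteID"],
                "collDT": row["collDT"],
                "measure": measure,
                "compartment": compartment,
                "unit": unit,
                "value": value,
            })
    return long_rows


# Made-up wide row with N1, N2 and PMMoV columns; N2 was not reported.
wide_example = [{
    "siteID": "site-01",
    "collDT": "2024-01-15",
    "n1|water|gc/L": 1200.0,
    "n2|water|gc/L": "",
    "pmmov|water|gc/L": 56000.0,
}]
for long_row in wide_to_long(wide_example):
    print(long_row)
```

Because the compartment travels in the header, the user never types it, but it is still recoverable when the data is converted to the long version.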
I think that no matter what we call the “ultra-minimal-bare-bones-just-the-basics” template, it is ultimately users who will use it and decide what they do or do not want to report. The exercise of us determining a minimal template is (in my opinion) to offer a kind of counsel to new users about what we as experts would recommend recording, even if it’s just bare bones. Whether they follow our advice or not is up to them, but I think that trying to set them up for success and for producing data that is re-usable and widely share-able (through sample metadata and GIS linkages) is part of the ethos of the ODM project and something we should try to adhere to.
Stepping back a bit — I think we may be drifting toward an opinionated minimal that goes further than where most users want. We need to meet them at “easy to use” with minimal interoperability for concepts, not the entire WW surveillance.
The LOINC analogy keeps coming back to me. LOINC has tens of thousands of codes, and almost no one uses it fully. In Ontario, the Ontario Lab Information System (OLIS) holds hundreds of millions — likely billions — of lab records, and despite LOINC being a comprehensive standard, day-to-day OLIS records use essentially three fields: the LOINC code, the value, and the unit. That tiny subset carries perfect interoperability - the key interoperability value of LOINC for labs.
In our new ArXiv paper (arXiv:2604.18762), I’ve been wondering about the same philosophy: a small recommended core (about 4–6 fields per table for sites, samples, measures), with the full dictionary available for users who need more. That’s different from what we’ve been calling “opinionated minimal,” which lists fields the team thinks every user should record. The OLIS/LOINC pattern is the inverse — a tiny lived minimum, with a vast model standing behind it.
A few practical implications for the template:
Fewer mandatory fields, not more. I’d lean toward dropping geoEPSG, fraction, and, probably, compartment from the strict minimum (we may want to revisit our names). Useful — but not necessary for a first dataset to be valid and shareable.
Wide-format as the starting point. Most users live in Excel; a wide-format minimal template lowers the entry barrier. The long-format relational view stays available for those who need it.
Wastewater-focused. Per my earlier note — make the minimal template domain-specific (SARS-CoV-2 N1/N2/PMMoV visible as example columns) rather than abstract. That’s how LOINC works in clinical labs too: a curated subset for the domain, not the full dictionary.
The “set users up for success” framing is right, but I’d reinterpret it: success at the entry point is getting started — recording valid, shareable data with as little structural friction as possible. The model’s depth is available for the second step, not the first.
@jeandavidt — your point about reusability for external users is well taken; I’d want to make sure the lightest minimum still produces data that downstream users can join to. But I’m worried we may be conflating “rich enough to be reusable” with “rich enough to support every imaginable downstream use case from day one.” If a user’s first dataset can be shared and joined on site, sample, measure, value, unit, and collection date, that’s already a real win. geoEPSG and fraction can be added when the use case calls for them.
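To illustrate what "can be joined" might mean at this lightest level, here is a small sketch that attaches site records to measurement records by `siteID` and reports any measurements whose site is unknown. The `name` field and all values are illustrative assumptions, not the agreed template.

```python
def join_on_site(measures, sites):
    """Attach site fields to each measurement, matching on siteID.

    Returns (joined, unmatched): joined rows carry both site and
    measurement fields; unmatched rows reference a siteID that is
    not present in the sites table.
    """
    site_by_id = {site["siteID"]: site for site in sites}
    joined, unmatched = [], []
    for measure in measures:
        site = site_by_id.get(measure["siteID"])
        if site is None:
            unmatched.append(measure)
        else:
            # Measurement fields win if a key appears in both records.
            joined.append({**site, **measure})
    return joined, unmatched


# Made-up records for illustration.
sites_example = [{"siteID": "site-01", "name": "Plant A"}]
measures_example = [
    {"siteID": "site-01", "measure": "n1", "value": 1200.0,
     "unit": "gc/L", "collDT": "2024-01-15"},
    {"siteID": "site-99", "measure": "n1", "value": 800.0,
     "unit": "gc/L", "collDT": "2024-01-15"},
]
joined_rows, unmatched_rows = join_on_site(measures_example, sites_example)
print(len(joined_rows), len(unmatched_rows))
```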
I think Doug is drawing the right line (getting started vs supporting any use case). But I’d say the minimum fields also have to do with minimizing the risk that data is wrongly interpreted, and interpretation can be much harder for different classes of users. For lab workers or program coordinators, things like fraction and compartment are probably standardized across campaigns/labs and taken as redundant information. For downstream dataset users that had no interaction with the data collection process, it wouldn’t be as obvious.
So, it boils down to what use cases should be considered minimal. I think supporting surveillance campaigns across a jurisdiction is a very reasonable minimal use case, in which case geoEPSG, fraction and compartment could all be dropped. But maybe we can toy around with the idea of metadata completeness “tiers” (think bronze, silver and gold) that certify the datasets for more advanced use cases (inter-jurisdiction sharing, open access publishing, etc)?
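To make the tier idea a bit more tangible, here is one possible sketch of how a dataset's columns could be mapped to a tier. The assignment of fields to tiers is entirely an assumption for discussion: the bronze set borrows the core fields mentioned earlier in the thread, and placing fraction and compartment in silver and geoEPSG in gold is just one arbitrary way to split them.

```python
# Hypothetical tier definitions, cumulative from bronze to gold.
# Which field belongs to which tier is an assumption, not a decision.
TIER_FIELDS = [
    ("bronze", {"siteID", "measure", "value", "unit", "collDT"}),
    ("silver", {"fraction", "compartment"}),
    ("gold", {"geoEPSG"}),
]


def completeness_tier(columns):
    """Return the highest tier whose fields (and all lower tiers') are present.

    Returns None when even the bronze fields are incomplete.
    """
    present = set(columns)
    achieved = None
    for tier, fields in TIER_FIELDS:
        if not fields <= present:
            break  # a lower tier is incomplete, so higher tiers cannot apply
        achieved = tier
    return achieved


core = ["siteID", "measure", "value", "unit", "collDT"]
print(completeness_tier(core))
print(completeness_tier(core + ["fraction", "compartment"]))
print(completeness_tier(core + ["fraction", "compartment", "geoEPSG"]))
```

Making the tiers cumulative keeps the rule simple: a dataset with geoEPSG but no fraction would still be rated bronze under this sketch.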
I also agree with the idea that the wide format should be the entry point for new users. I also really like the idea of domain-specific minimums since what counts as “obvious” or “important” context clearly varies by domain.
A lot of salient points being made here, and I think both a “tiered” set of minimal templates and domain-specific minimums are good ideas.
Before continuing on to flesh that out, I will say that sure, maybe compartment and fraction aren’t necessary. And in the wide template, they’re built into the measure field, so they could (in theory) be specified ahead of time and omitted from reporting, while still maintaining that information for future use and interpretation.
It sounds like we’re trying to boil things down to the real minimal essence, but beyond that, how are we conceiving of the tiers/domains? A minimum, modest, and moderate tier for each of the lab, public health, and engineering domains? I’m interested to hear more perspectives on this, and then I’ll mock up some new minimum templates along these lines.