For the launch of version 3 of the ODM, we are planning on also launching an ultra-minimal structure and template. The goal for this is to cut away everything that’s not essential so that users can get started in under a minute and understand the structure of the model.
To this end, I have started the process and cut away the reference tables: while they are essential, I think they clutter the ERD in a way that is confusing to users freshly approaching the ODM. I have also eliminated all other tables except for measures, samples, and sites. I am of the opinion that while it is ideal to capture additional information on methods/protocols, instruments, organizations, etc., it is not strictly essential. I also trimmed the majority of the fields in each of these tables, but I left some that I think could still be removed. I’ve attached an ERD of this ultra-minimal structure here for your review:
I think the absolute minimum ERD could get rid of the samples table as well. The mandatory fields in the measures table would be measureRepID, siteID, aDateEnd, measure, value, unit, aggregation and in the sites table just the siteID and the site name (which is missing from your ERD). This is also supported by the way some of the EU countries are presenting data in their dashboards.
What do you think?
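To make the proposal above concrete, here is a rough sketch of a check against the absolute-minimum field lists. The measures field names are taken from the post; the site name column is written as `name` purely as a placeholder, since its actual field name is not given here, and the example values (measure ID, unit, aggregation) are illustrative only.

```python
# Mandatory fields per table, as proposed in the post above.
# "name" is a placeholder for the site name field (an assumption).
MANDATORY = {
    "measures": ("measureRepID", "siteID", "aDateEnd", "measure",
                 "value", "unit", "aggregation"),
    "sites": ("siteID", "name"),
}


def missing_fields(table, record):
    """Return the mandatory fields that are absent or empty in a record."""
    return [field for field in MANDATORY[table]
            if record.get(field) in (None, "")]


# Illustrative record; the values are made up for the example.
EXAMPLE_MEASURE = {
    "measureRepID": "m-0001",
    "siteID": "site-1",
    "aDateEnd": "2023-01-06",
    "measure": "covN1",
    "value": 1200.0,
    "unit": "gc/L",
    "aggregation": "single",
}

print(missing_fields("measures", EXAMPLE_MEASURE))
print(missing_fields("sites", {"siteID": "site-1"}))
```

A record with no sample-level detail passes this check, which is exactly the trade-off under discussion.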
I agree that a lot of labs don’t report sample details. But I’m of the opinion that there are two approaches here:
1- a true minimal template with basically just the measure, value, and the site name.
2- a more opinionated minimal template where we opine (somewhat) on what should be considered the minimal data. And for that I think we should include some sample details.
I think that we should maybe be trying to do the second option, but I’m open to moving on that depending on feedback from others.
The minimal version sounds like a great idea. It’ll definitely make the ODM less intimidating and easier to get started with.
The question seems, to me, to be who the data is “minimal” for.
If an organisation always uses the same sampling protocol, then the sample data is sort of implied and known to them. However, as a potential user of open ODM datasets, I could not use datasets that don’t at least report some sampling details (was it sludge or wastewater, was it flow-proportional or time-proportional?). The fields you kept in the table look like a very good minimal set for that purpose.
A great point! Thanks, JD. I think - of course - you’re right, that for internal data management and tracking, much of this additional metadata may be moot. Specifically the sampling metadata. I think we should be aiming for data sharing/interoperability standards, even if data is only internal.
I think we could potentially trim this further by removing the reportable fields. We could also potentially drop compartment. I think we could even trim off some of the date fields, for example, and keep only collDT in samples and only aDateEnd in measures. Do we think we should keep fraction still? And for sites, do we think that latitude, longitude, and EPSG are all mandatory? Should we drop EPSG as a piece of the minimal standard?
I think fraction is probably still required, as it’s a very basic piece of data that can be informative about the analysis method. compartment is required if the minimal template is supposed to be usable outside of just wastewater. By removing the different dt’s, you get fewer fields, but you might create more ambiguity for the users, which is also not ideal.
And if geoEPSG is removed, harmonising datasets in GIS will be impossible (unless you impose an EPSG in the minimal structure).
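As a small illustration of the point above, the sketch below groups site rows by their declared EPSG code. Only `geoEPSG` and `siteID` come from this thread; `geoLat` and `geoLong` are assumed column names, and the coordinates are made up. Rows without an EPSG code cannot be reprojected with any confidence, so they end up in a group of their own.

```python
from collections import defaultdict

# Illustrative site rows. geoLat/geoLong are assumed column names.
SITES = [
    # Geographic coordinates in degrees (EPSG:4326, WGS 84).
    {"siteID": "site-1", "geoLat": 45.42, "geoLong": -75.70, "geoEPSG": 4326},
    # Projected coordinates in metres (EPSG:32618, UTM zone 18N).
    {"siteID": "site-2", "geoLat": 5030000.0, "geoLong": 445000.0, "geoEPSG": 32618},
    # No EPSG reported: the numbers alone do not say which system they use.
    {"siteID": "site-3", "geoLat": 46.81, "geoLong": -71.21, "geoEPSG": None},
]


def group_sites_by_epsg(sites):
    """Map each declared EPSG code (or None) to the site IDs that use it."""
    groups = defaultdict(list)
    for site in sites:
        groups[site.get("geoEPSG")].append(site["siteID"])
    return dict(groups)


GROUPS = group_sites_by_epsg(SITES)
# Sites that cannot be placed on a shared map without guessing.
UNPLACEABLE = GROUPS.get(None, [])
print(GROUPS)
print(UNPLACEABLE)
```

Imposing a single EPSG in the minimal structure would collapse all of these groups into one, which is the alternative mentioned in the parenthetical.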
Yes - we specify that the minimal template is for water. We cut compartment and keep fraction. With the caveat that there is a wide and a long version of this template, and in the wide version compartment will be populated for users by embedding it in the column header name for the measurement value.
In keeping with that - we suggest the minimal template is for data collectors, true, but with an eye to data reusability. In line with our philosophy for open and FAIR data, we should be opinionated in seeking to establish that as the norm.
We keep geoEPSG - while it may not be reported, it is still our belief that it should be considered “minimal”.
This will exist as a wide and a long version, to cater to different user preferences. For the wide version we will need to specify the measures being collected - I will suggest the SARS-CoV-2 N1, and N2 gene regions, as well as PMMoV. Any thoughts there?
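To show how the wide and long versions could relate, here is a minimal sketch that unpacks a wide sheet into long rows. The header pattern `<measure>_<compartment>_<unit>` is a hypothetical convention invented for this example, as are the measure IDs, the unit label, and the values; the thread has not settled on any of them.

```python
# Hypothetical header convention: "<measure>_<compartment>_<unit>".
# Measure IDs and units here are illustrative, not a decided standard.
WIDE_ROWS = [
    {"siteID": "site-1", "collDT": "2023-01-05",
     "covN1_water_gcL": "1200", "covN2_water_gcL": "950",
     "pmmov_water_gcL": "410000"},
    {"siteID": "site-1", "collDT": "2023-01-12",
     "covN1_water_gcL": "800", "covN2_water_gcL": "",
     "pmmov_water_gcL": "390000"},
]
ID_COLUMNS = ("siteID", "collDT")


def wide_to_long(rows, id_columns=ID_COLUMNS):
    """Turn each non-empty measurement cell into one long-format row."""
    long_rows = []
    for row in rows:
        for column, raw in row.items():
            if column in id_columns or raw == "":
                continue  # skip identifiers and empty cells
            # Assumes none of the three header parts contains an underscore.
            measure, compartment, unit = column.split("_")
            long_row = {key: row[key] for key in id_columns}
            long_row.update({
                "measure": measure,
                "compartment": compartment,
                "unit": unit,
                "value": float(raw),
            })
            long_rows.append(long_row)
    return long_rows


LONG_ROWS = wide_to_long(WIDE_ROWS)
for long_row in LONG_ROWS:
    print(long_row)
```

With this kind of convention the user never types a compartment value in the wide sheet, yet the long version still carries it.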
I think that no matter what we call the “ultra-minimal-bare-bones-just-the-basics” template, it is ultimately users who will use it and decide what they do or do not want to report. The exercise of us determining a minimal template is (in my opinion) to offer a kind of counsel to new users about what we as experts would recommend recording, even if it’s just bare bones. Whether they follow our advice or not is up to them, but I think that trying to set them up for success and for producing data that is re-usable and widely share-able (through sample metadata and GIS linkages) is part of the ethos of the ODM project and something we should try to adhere to.
Stepping back a bit: I think we may be drifting toward an opinionated minimal that goes further than most users want. We need to meet them at “easy to use” with minimal interoperability for concepts, not the whole of WW surveillance.
The LOINC analogy keeps coming back to me. LOINC has tens of thousands of codes, and almost no one uses it fully. In Ontario, the Ontario Lab Information System (OLIS) holds hundreds of millions — likely billions — of lab records, and despite LOINC being a comprehensive standard, day-to-day OLIS records use essentially three fields: the LOINC code, the value, and the unit. That tiny subset carries perfect interoperability - the key interoperability value of LOINC for labs.
In our new arXiv paper (arXiv:2604.18762), I’ve been wondering about the same philosophy: a small recommended core (about 4–6 fields per table for sites, samples, measures), with the full dictionary available for users who need more. That’s different from what we’ve been calling “opinionated minimal,” which lists fields the team thinks every user should record. The OLIS/LOINC pattern is the inverse: a tiny lived minimum, with a vast model standing behind it.
A few practical implications for the template:
Fewer mandatory fields, not more. I’d lean toward dropping geoEPSG, fraction, and, probably, compartment from the strict minimum (we may want to revisit our names). Useful — but not necessary for a first dataset to be valid and shareable.
Wide-format as the starting point. Most users live in Excel; a wide-format minimal template lowers the entry barrier. The long-format relational view stays available for those who need it.
Wastewater-focused. Per my earlier note — make the minimal template domain-specific (SARS-CoV-2 N1/N2/PMMoV visible as example columns) rather than abstract. That’s how LOINC works in clinical labs too: a curated subset for the domain, not the full dictionary.
The “set users up for success” framing is right, but I’d reinterpret it: success at the entry point is getting started — recording valid, shareable data with as little structural friction as possible. The model’s depth is available for the second step, not the first.
@jeandavidt — your point about reusability for external users is well taken; I’d want to make sure the lightest minimum still produces data that downstream users can join to. But I’m worried we may be conflating “rich enough to be reusable” with “rich enough to support every imaginable downstream use case from day one.” If a user’s first dataset can be shared and joined on site, sample, measure, value, unit, and collection date, that’s already a real win. geoEPSG and fraction can be added when the use case calls for them.
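As a rough sketch of what "shared and joined" could mean in practice, the snippet below checks records for the six core items listed above and pairs up rows from two datasets on site and collection date. `siteID`, `collDT`, `measure`, `value` and `unit` are names used in this thread; `sampleID` is an assumed name, and the two lab datasets are invented for the example.

```python
# Core fields from the list above; "sampleID" is an assumed field name.
CORE_FIELDS = ("siteID", "sampleID", "collDT", "measure", "value", "unit")


def is_shareable(record):
    """True when every core field is present and non-empty."""
    return all(record.get(field) not in (None, "") for field in CORE_FIELDS)


def join_on_site_and_date(left, right):
    """Pair records from two datasets sharing siteID and collDT."""
    index = {}
    for record in right:
        key = (record["siteID"], record["collDT"])
        index.setdefault(key, []).append(record)
    pairs = []
    for record in left:
        key = (record["siteID"], record["collDT"])
        for match in index.get(key, []):
            pairs.append((record, match))
    return pairs


# Two made-up datasets from different labs.
LAB_A = [{"siteID": "site-1", "sampleID": "a-1", "collDT": "2023-01-05",
          "measure": "covN1", "value": 1200.0, "unit": "gc/L"}]
LAB_B = [{"siteID": "site-1", "sampleID": "b-7", "collDT": "2023-01-05",
          "measure": "pmmov", "value": 410000.0, "unit": "gc/L"},
         {"siteID": "site-2", "sampleID": "b-8", "collDT": "2023-01-05",
          "measure": "pmmov", "value": 380000.0, "unit": "gc/L"}]

PAIRS = join_on_site_and_date(LAB_A, LAB_B)
print(all(is_shareable(r) for r in LAB_A + LAB_B))
print([(a["measure"], b["measure"]) for a, b in PAIRS])
```

Neither geoEPSG nor fraction is needed for this join to work, though a downstream user would still have to assume both labs measured the same fraction.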