Archiving Data/Breaking Data Changes

Another issue raised in the Thursday meeting concerns population reporting: how population values get updated, and how older data is conserved after an update.

Specifically, let’s say we have a polygon or site with a population (polyPop or popServ). We have been recording data for this site since 2016, using census data for the population field for 4 years. Then in 2020 we get new census data, so we update the population fields. The lastUpdated field is also changed. However, when we now look back at averages of other measures that were calculated from a given population - or if we go back and try to recalculate them - the population value in the data is no longer correct for any measures from 2016 to 2020. So how do we go about keeping track of and conserving this information?

Below is a list of suggestions to possibly address this:

  1. One option is to use parent sites (similar to what was suggested in this other post) with a sort of umbrella site for the site writ-large, and child sites for given years or periods of population information. This is, again, a bit of an abuse or misuse of the parent-child structure, and may confuse users. This also feels like a bit of a bandaid solution and not a sustainable one long-term.

  2. Versioning – we could add a “version” field to the sites and polygons tables in order to keep an archive of previous entries after updates. We could add a similar field to the measures table as well to say which version the measure was referencing at the time of recording. This seems like it could get messy quite quickly though, particularly with the likely need for multiple version citations in the measure table.

  3. Semantic ID naming could be another option. As a sort of combination of options 1 and 2, the siteID and polygonID would be named with a reference year. For example, site A for the 2016-2020 period is siteA2016, and in 2020 a new site is made called siteA2020. These are parsable based on the first part of the ID. This is a bit disorganized and unintuitive, however.

  4. The last option - and maybe the strongest, in my opinion - is to remove polyPop and popServ from the sites and polygons tables. Instead, these could be recorded as site measures (we could also add a polygon specimen for polygon measures), just like any other site measures. To complement this we could also add a “valid to” and “valid from” header to the measures table (or a “valid period”).
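
To make option 4 a bit more concrete, here is a minimal sketch of what population-as-a-measure with “valid from”/“valid to” could look like. Everything in it is an assumption for illustration: the `PopMeasure` class, the `valid_from`/`valid_to` field names, the helper functions, and the numbers are all made up and are not part of the model.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class PopMeasure:
    """Hypothetical population measure row (not an actual table definition)."""
    site_id: str
    value: int
    valid_from: date
    valid_to: Optional[date] = None  # None means "still current"


def population_for(site_id, day, measures):
    """Return the population whose validity period contains `day`.

    The period is treated as [valid_from, valid_to), i.e. the end is exclusive.
    """
    for m in measures:
        if (m.site_id == site_id
                and m.valid_from <= day
                and (m.valid_to is None or day < m.valid_to)):
            return m.value
    raise LookupError(f"no population recorded for {site_id} on {day}")


def per_capita(amount, site_id, day, measures):
    """Normalize a measured amount by the population valid on `day`."""
    return amount / population_for(site_id, day, measures)


# Illustrative values only: one census-based value per period.
measures = [
    PopMeasure("siteA", 100_000, date(2016, 1, 1), date(2020, 1, 1)),
    PopMeasure("siteA", 120_000, date(2020, 1, 1)),
]
```

With this layout, recalculating a 2018 value uses the 2016 population and a 2021 value uses the 2020 one, without anything being overwritten.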

If you have a preference or feedback on any of these numbered points, please respond below. Or if you have another idea you’d like to add to this list for consideration, please also explain in the replies.

Thank you!

@dmanuel @jeandavidt @Sorin

I think the 4th option is the best (cleanest, more consistent) one.

Originally, polyPop and popServed were put in those tables as attributes for convenience, as these quantities were considered “slow moving”. But indeed, as surveillance programs get older, we’ll need to start updating those fields. And since population is involved in calculations, it may not be the best move to overwrite it. Storing it as a measurement makes sense.

If we choose to do “validity period”, I suggest we break it up into a start date and an end date instead of lumping them together.

But it’s worth considering whether that’s actually needed – wouldn’t a new polyPop measurement at time x already imply that measures taken before x are no longer valid? I get that having it makes the model more comprehensive, but maybe a little harder to use as well? For example, what if you have a measurement that you know will eventually become invalid, but you don’t know when – what should you do then? Leave the end of the validity period blank? And then when you do find out that the value is no longer valid, do you need to go back to the original measure and update it?
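
As a rough sketch of the “implied end of validity” idea: if each population measure only carries the date it was recorded, the applicable value for any other measure is simply the latest population recorded on or before that measure’s date. The function name, the tuple layout, and the numbers below are assumptions for illustration only.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical population measures for one site: (date recorded, population).
# Values are made up for illustration only.
pop_measures = [
    (date(2016, 1, 1), 100_000),
    (date(2020, 1, 1), 120_000),
]


def population_on(day, measures):
    """Return the most recent population recorded on or before `day`."""
    ordered = sorted(measures)
    starts = [start for start, _ in ordered]
    # Index of the first record dated strictly after `day`.
    i = bisect_right(starts, day)
    if i == 0:
        raise ValueError(f"no population measure on or before {day}")
    return ordered[i - 1][1]
```

Here no end date is stored at all, so nothing has to be updated retroactively when a new census value arrives. The trade-off is that a value known to be out of date, with no replacement yet, cannot be flagged as such.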

Looking forward to hearing other people’s opinions!

I’m glad to hear support for my preferred option! I agree with your point too, @jeandavidt, that it is handy to have there, but not ideal as surveillance systems age. I wonder too if we could keep the polyPop and popServed fields, but instead of reporting a number they just reference the measureRepID for the current population measure? This may be pointless/add needless confusion, but sharing for brainstorming purposes.

I agree too that breaking up the period into two fields is probably smarter, and is definitely more in line with a best-practices approach for ontologies. I had the same thought too about an “implied” end of validity date when a new measure for population is added. I think it’s true that we could make do with that simpler approach and save the effort of going back to update the old field. Maybe we could say that “valid start date” is mandatory and “valid end date” is optional? Opens up the potential for some ambiguity around validity, potentially, but we can’t make a model that’s totally iron-clad against abuse, I suppose… Curious to hear others’ thoughts as well!

Late to post this update, but this was discussed in the last meeting of the regular working group, and the group landed on moving the “population” measures (like polyPop and popServ) to the measures table for version 2.3 (the next version).

To support this, polyPop and popServ will be removed from the Polygons and Sites tables and changed to being measurements. There will also be a new specimen added - polygon specimen poSpecimen - to differentiate between site- and polygon-level measures.

Lastly, we discussed a validity date (or valid-to and -from date), but there were concerns around the semantics around “validity” as a term. We pivoted instead to the term “relevance”. A relevant-from date and relevant-to date will be added to the measures table to respond to this need, and will be mandatory-if type fields, as they only apply to measures that apply over a period of time. These fields are irrelevant to point measures.
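
For anyone wondering what a “mandatory-if” check could look like in practice, here is a small sketch. The field names `relevantFrom` and `relevantTo`, the `PERIOD_MEASURES` set, and the `relevance_errors` function are all hypothetical, as the final names were not settled in this post. It follows a literal reading of the above, where both dates are required for period measures.

```python
from datetime import date

# Assumption: the measures that apply over a period of time.
PERIOD_MEASURES = {"polyPop", "popServ"}


def relevance_errors(row):
    """Return a list of problems with the relevance dates of one measure row.

    `row` is a dict with a "measure" key and optional "relevantFrom" /
    "relevantTo" dates (hypothetical field names).
    """
    errors = []
    rel_from = row.get("relevantFrom")
    rel_to = row.get("relevantTo")

    if row["measure"] in PERIOD_MEASURES:
        # Mandatory-if: both dates are required for period measures.
        if rel_from is None:
            errors.append("relevantFrom is required for period measures")
        if rel_to is None:
            errors.append("relevantTo is required for period measures")

    # Whenever both dates are present, they must be in order.
    if rel_from is not None and rel_to is not None and rel_from > rel_to:
        errors.append("relevantFrom must not be after relevantTo")

    return errors
```

Point measures pass with both fields left empty, while a population row without dates is flagged.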

These changes will be made to the new v2.3 draft of the ERD on Lucid Chart and the working draft of the parts list, and will become live in the version 2.3 release.
