Duplication in Mapping

@martinwellman has been developing the mapping tool that moves PHA4GE data into ODM format. The current alpha version is working well; I’ve reviewed the outputs, and they are quite clean.

We do run into an issue with certain primary keys: where there is a many-to-many relationship, Martin has built in functionality to clone the key but alter it slightly. This avoids both data loss and primary-key collisions. For example, if PHAC exists as an organization (organizationID), and in PHA4GE that organization has two projects (project 1 and project 2), then because of the structure of the ODM we end up with two rows in the organizations table and two organizationIDs → PHAC and PHAC001.
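To make the behaviour concrete, here is a minimal sketch of the kind of key cloning described above. It is not Martin's actual implementation: the function name `assign_unique_ids`, the row layout, and the three-digit suffix rule are assumptions inferred from the PHAC / PHAC001 example, and I'm assuming each PHA4GE project ends up as a distinct datasetID value.

```python
from collections import defaultdict


def assign_unique_ids(rows, key="organizationID"):
    """Return copies of rows in which a repeated primary key gets a numeric suffix.

    Hypothetical helper: the first occurrence keeps its key (PHAC), later
    occurrences become PHAC001, PHAC002, ... The suffix format is an assumption.
    This sketch does not guard against a suffixed key clashing with an ID that
    already exists in the data.
    """
    seen = defaultdict(int)  # how many times each base key has appeared so far
    out = []
    for row in rows:
        base = row[key]
        count = seen[base]
        seen[base] += 1
        new_row = dict(row)  # copy so the input rows are left untouched
        if count > 0:
            new_row[key] = f"{base}{count:03d}"
        out.append(new_row)
    return out


# Assumed shape of the organizations rows produced for one organization
# that is linked to two projects.
rows = [
    {"organizationID": "PHAC", "datasetID": "project 1"},
    {"organizationID": "PHAC", "datasetID": "project 2"},
]
result = assign_unique_ids(rows)
print([r["organizationID"] for r in result])
```

No row is lost and no primary key repeats, but the same organization now appears under two different IDs.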

This is, however, less than ideal because we then have two IDs (PHAC and PHAC001) that refer to the exact same entity. This is less a fault of the mapping tool, where the key cloning is a smart feature, and more a consequence of unnecessary cycling (or almost-cycling) in the ODM database structure; i.e., the organizations table and the instruments table do not need a datasetID foreign key linkage, and instruments also does not need a contactID linkage.

As such, datasetID is being removed from both tables, and contactID is being removed from the instruments table, in version 3. Hopefully this resolves much of the duplication issue with the mapping tool.
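As a rough illustration of why dropping the foreign key should help, the sketch below removes a column from a set of rows and then collapses rows that have become identical. The helper name `drop_column_and_dedupe` and the row layout are hypothetical, and I'm again assuming each project maps to its own datasetID value; the real version 3 tables have more columns than shown here.

```python
def drop_column_and_dedupe(rows, column):
    """Remove `column` from every row, then keep only the first copy of each row.

    Hypothetical helper for illustration; assumes all cell values are hashable
    and mutually sortable (plain strings here).
    """
    seen = set()
    out = []
    for row in rows:
        trimmed = {k: v for k, v in row.items() if k != column}
        fingerprint = tuple(sorted(trimmed.items()))  # order-independent row identity
        if fingerprint not in seen:
            seen.add(fingerprint)
            out.append(trimmed)
    return out


# Assumed pre-version-3 shape: the same organization appears once per dataset.
organizations = [
    {"organizationID": "PHAC", "datasetID": "project 1"},
    {"organizationID": "PHAC", "datasetID": "project 2"},
]
collapsed = drop_column_and_dedupe(organizations, "datasetID")
print(collapsed)
```

With the datasetID column gone, the two rows are identical, so a single PHAC row remains and there is nothing left for the mapping tool to clone.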