There are a large number of Internet‑published genealogy software applications. All of these, however, share at least some of the following undesirable characteristics:
A well-defined data model is a prerequisite for extracting the information from any application once it becomes either unattainable or sub-optimal.
What follows is a definition of a minimalist genealogy data model. It makes no assumption about the DB software tool it might be implemented with. If the genealogy can be divided into "biological" and the "historical" one, on the biological side it only captures the knowledge about individuals, their relationships and children that are the product of such relationships. On the historical side it is restricted to the place names and start and end dates of individuals and their relationships.
The central problem of any computer system that collects and organizes information about persons is the unambiguous identification of individuals. This model assumes this will be done externally to it, by a unique integer number assigned to each individual by the user or administrator of a population (family, clan, tribe, house...) that is the subject of an operational instance of the system based on this model. If the integration of several such instances ("segments") into a larger collection is contemplated, a lot of integration time and effort can be saved by a-priory dividing that function to a number of "segment‑administrators" who will use unique prefixes to the numbers assigned in their "segment‑collections".
Individuals are subject to categorical classification by gender (i.e., they are either female or male). Each individual can have multiple children, each child has one parent of each gender. Either of the two parents of an individual may or may not be known.
Only the relationships between individuals that are capable of producing children are included in the model. Like individuals, relationships are of two types, (legal and common) and, again like individuals, they are subject to categorical classification. Relationships between individuals may overlap in time, but can only exist between two individuals of different genders. (The terms "gender" and "sex" are assumed to be interchangeable).
Like any logical and semantical data model, this one is only an approximation of reality. System designers and users that desire or need to model the realty in more detail than this model provides have an option to store information about additional biological phenomena (e.g., polygyny and polyandry, in‑vitro insemination, gestational surrogacy and adoptions, race, gender fluctuations, same‑sex unions, cryo‑preservation, parthenogenesis...) or historical facts (citizenship, ethnicity and religious affiliation and conversions, participation in wars and rebellions, migrations...) as a narrative in the text and notes provided by a system built using this model, or if the relational algebra productions over the canonical representation of such phenomena is mandatory find a system built on a data model that more closely matches their reality.
A database built using this model might be updated and queried by the user either employing a formal relational table access software (such as SQL), or, preferably, the interaction with the database may be via a text or GUI genealogy‑domain‑specific application programs.
woman { id integer pk unique mother integer null > woman.id father integer null > man.id name string null surname string null maidenSurname string null pEst integer null birthYear integer null birthYearDay integer null birthPlace string null deathYear integer null deathYearDay integer null deathPlace string null occupation string null note text null infoUrl text null } man { id integer pk unique mother integer null > woman.id father integer null > man.id name string null surname string null pEst integer null birthYear integer null birthYearDay integer null bitrhPlace string null deathYear integer null deathYearDay integer null deathPlace string null occupation string null note text null infoUrl text null } marriageLegal { wife integer null > woman.id husband integer null > man.id pEst integer null startYear integer null startYearDay integer null startPlace string null endYear integer null endYearDay integer null note text null infoUrl text null } marriageCommon { wife integer null > woman.id husband integer null > man.id pEst integer null startYear integer null startYearDay integer null endYear integer null endYearDay integer null note text null infoUrl text null }
Tables/Rows
Columns
Notes:
Name: All individuals' names/surnames/maidenSurnames, place-names and occupational designations are variable length UTF-8 character strings of the maximum length of 32 characters. If table entry selections based on the equality of place-names or occupational designations are to be performed, an implementation might need to expand the model with tables of canonical versions of place‑names and occupations. The software based on the model can treat nonconforming strings either as input errors, or silently adjust/truncate the input string to conform to the definition.
Name (and surname) table entries are the name that an individual was commonly known by throughout most of her or his life.
Period Estimate: Birth and death years as well as the start and end years of a marriage can often be only estimated. This is indicated with [.pEst] (abbreviation for "period estimated") indicator as follows:
0: Both start and end year are accurate values
1: Start year of the period is an estimate
2: End year of the period is an estimate
3: Both start and end years are estimates
Years and year day numbers: The model provides for a rather simplistic "two-level" calendar arithmetic granularity: a year and a day. The [...Year] items are either Julian or Gregorian, as defined above. Year day numbers [...YearDay] is the "1-relative" (i.e., January 1 is year-day 1) ordinal day-of-year number. The leap year is taken into account assuming the year is leap according to "proleptic" Julian calendar rules if the year is 1582 and earlier, Gregorian otherwise. In cases where the day numbers are provided, they are assumed to be in an undefined time zone, and it is further assumed that calendar arithmetic maximum error introduced by the lack of time zone information of ±(< one day) is of no consequence in an applications based on this model.
URL's of external information repositories: Implementation specific references to local or network resident collection of documents and/or images providing additional (typically historical) information about the individual or relationship. Canonical content consists of comma separated, "point brackets" (<, >) enclosed URL's of document(s), organized collection of text and/or images with additional information about the individual or relationship. URL's can be either local of network resident, and the location specification may be either relative to the location of the table or absolute. Following each URL there may be a string providing the credentials required to access the resource (web-page, document collection or images). URL sections of text must be "ascii‑restricted UTF-8" (i.e., Unicode characters with code-points up to and including U+007f), while the text outside of brackets can include any UTF-8 character. The maximum string length is 256 characters.
Interoperability with "person‑table" models: Significant majority of genealogical data models in current use store both genders in a single table, commonly named "person" or "individual" and use "instance type indicator" to determine the gender of the person represented by a table row. If the exchange of data with such systems is anticipated, the application based on this model should impose a rule that makes woman.id and man.id integer sets non‑intersecting; the canonical method to achieve this is to restrict woman.id to even and man.id to odd numbers.
The content is published in order to invite comments and collaboration on reference implementation using SqLite in Python and/or C language. Please contact the author at:
The content is published under Creative Commons ByNcNd license.