Gedcom.pm - Manipulating Genealogical Information with Perl

Paul Johnson

paul@pjcj.net

http://www.pjcj.net

YAPC::Europe::2003


Introduction

Genealogy is the study or investigation of ancestry and family histories. In the last 150 years this has become a very popular pastime. With the advent of personal computers people started wanting to store and manipulate their genealogical information digitally. Members of the Church of Jesus Christ of Latter-day Saints (commonly abbreviated to the LDS or Mormon church) do a lot of genealogical research, and the LDS church created a standard interchange format known as GEDCOM (GEnealogical Data COMmunication). This became, and still is, the de facto standard for transmitting genealogical information, and any genealogical program worth its salt can read and write GEDCOM.


The GEDCOM format

You can get a good idea of what GEDCOM is all about by looking at a short fragment of a GEDCOM file:

    0 @I3@ INDI
    1   NAME George_V  /Windsor/
    1   TITL King of England
    1   SEX M
    1   BIRT
    2     DATE Saturday, 3rd June 1865
    2     PLAC Marlborough Hse,London,England
    1   CHR
    2     DATE Friday, 7th July 1865
    1   DEAT
    2     DATE Monday, 20th January 1936
    2     PLAC Sandringham,Norfolk,England
    1   BURI
    2     DATE Tuesday, 28th January 1936
    2     PLAC Windsor Castle,St. George Chap.,Berkshire,England
    1   FAMS @F3@
    1   FAMC @F2@
    0 @F2@ FAM
    1   HUSB @I1@
    1   WIFE @I2@
    1   CHIL @I3@
    1   MARR
    2     DATE Tuesday, 10th March 1863
    2     PLAC St. George Chap.,Windsor,,England

The important elements are individuals and families. Links are made from individuals to families in which the person is a spouse or a child. Links are made from families to individuals who are husbands, wives or children. Extra information can be provided in individual and family records using a set of three- or four-letter tags such as NAME, SEX and BIRT. Information within tags can be nested. The depth of nesting is given by the number at the start of each line. In these examples I have also indented the data appropriately, though this is not required.
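
The nesting rule can be made concrete with a few lines of Perl. What follows is only a minimal sketch and not the parser used in Gedcom.pm. It assumes that every line has the form "level [@xref@] tag [value]", and the function name parse_gedcom and the hash keys are invented for the illustration.

```perl
use strict;
use warnings;

# Build a tree from GEDCOM lines.  Each node is a hash with the keys
# level, xref, tag, value and items (the nested sub-records).  These key
# names are illustrative only.
sub parse_gedcom
{
  my @lines = @_;
  my $root  = { level => -1, items => [] };
  my @stack = ($root);                    # chain of currently open records
  for my $line (@lines)
  {
    my ($level, $xref, $tag, $value) =
      $line =~ /^\s*(\d+)\s+(?:\@([^\@]+)\@\s+)?(\w+)(?:\s+(.*?))?\s*$/
        or next;                          # skip anything that doesn't fit
    my $node = { level => $level, xref => $xref, tag => $tag,
                 value => $value, items => [] };
    # Close every open record which cannot be a parent of this line.
    pop @stack while $stack[-1]{level} >= $level;
    push @{$stack[-1]{items}}, $node;
    push @stack, $node;
  }
  @{$root->{items}}
}

my @records = parse_gedcom(split /\n/, <<'EOG');
0 @I3@ INDI
1   NAME George_V  /Windsor/
1   BIRT
2     DATE Saturday, 3rd June 1865
2     PLAC Marlborough Hse,London,England
1   FAMC @F2@
0 @F2@ FAM
1   CHIL @I3@
EOG

for my $rec (@records)
{
  print "$rec->{xref} is a $rec->{tag} with ", scalar @{$rec->{items}},
        " sub-records\n";
}
```

The stack always holds the chain of records which are still open, so a line at level n is attached to the most recent line at level n - 1, whatever indentation is used.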

The GEDCOM standard is currently at version 5.5. The standard is distributed with a grammar which describes which tags are legal and where they may appear within the GEDCOM file. As an example, here is the description of an individual record:

       INDIVIDUAL_RECORD: =
    n @<XREF:INDI>@   INDI {1:1}
        +1 RESN <RESTRICTION_NOTICE>  {0:1}
        +1 <<PERSONAL_NAME_STRUCTURE>>  {0:M}
        +1 SEX <SEX_VALUE>   {0:1}
        +1 <<INDIVIDUAL_EVENT_STRUCTURE>>  {0:M}
        +1 <<INDIVIDUAL_ATTRIBUTE_STRUCTURE>>  {0:M}
        +1 <<LDS_INDIVIDUAL_ORDINANCE>>  {0:M}
        +1 <<CHILD_TO_FAMILY_LINK>>  {0:M}
        +1 <<SPOUSE_TO_FAMILY_LINK>>  {0:M}
        +1 SUBM @<XREF:SUBM>@  {0:M}
        +1 <<ASSOCIATION_STRUCTURE>>  {0:M}
        +1 ALIA @<XREF:INDI>@  {0:M}
        +1 ANCI @<XREF:SUBM>@  {0:M}
        +1 DESI @<XREF:SUBM>@  {0:M}
        +1 <<SOURCE_CITATION>>  {0:M}
        +1 <<MULTIMEDIA_LINK>>  {0:M}
        +1 <<NOTE_STRUCTURE>>  {0:M}
        +1 RFN <PERMANENT_RECORD_FILE_NUMBER>  {0:1}
        +1 AFN <ANCESTRAL_FILE_NUMBER>  {0:1}
        +1 REFN <USER_REFERENCE_NUMBER>  {0:M}
          +2 TYPE <USER_REFERENCE_TYPE>  {0:1}
        +1 RIN <AUTOMATED_RECORD_ID>  {0:1}
        +1 <<CHANGE_DATE>>  {0:1}

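Items within double angle brackets, such as <<PERSONAL_NAME_STRUCTURE>>, are described elsewhere within the grammar as they may be used within other records. The numbers in braces give the minimum and maximum number of times an item may appear, with M standing for many.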
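
Because the grammar is so regular, its lines are easy to pick apart. The following sketch is not the grammar parser from Gedcom.pm. It assumes one item per line in the layout shown above, and the function name parse_grammar_line and the field names are made up for the example.

```perl
use strict;
use warnings;

# Parse one line of a grammar structure into a hash.  The field names
# (level, tag, structure, min, max) are invented for this example.
sub parse_grammar_line
{
  my ($line) = @_;
  my ($level, $rest, $min, $max) =
    $line =~ /^\s*(n|\+\d+)\s+(.*?)\s*\{(\d+):(\d+|M)\}\s*$/
      or return;
  $level = $level eq "n" ? 0 : $level + 0;     # "+1" becomes 1
  my %item = (level => $level, min => $min, max => $max);
  if ($rest =~ /^<<(\w+)>>$/)
  {
    $item{structure} = $1;                     # described elsewhere
  }
  else
  {
    # An optional cross reference, then the tag, then an optional value.
    ($item{tag}) = $rest =~ /^(?:\@<[^>]+>\@\s+)?(\w+)/;
  }
  \%item
}

my @items = map { parse_grammar_line($_) } split /\n/, <<'EOG';
n @<XREF:INDI>@   INDI {1:1}
    +1 RESN <RESTRICTION_NOTICE>  {0:1}
    +1 <<PERSONAL_NAME_STRUCTURE>>  {0:M}
    +1 SEX <SEX_VALUE>   {0:1}
EOG

for my $item (@items)
{
  my $what = $item->{tag} || "<<$item->{structure}>>";
  printf "level %d: %s {%s:%s}\n",
         $item->{level}, $what, $item->{min}, $item->{max};
}
```

A table built this way is enough to answer questions such as whether SEX may appear more than once in an individual record.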


Perl and GEDCOM

A number of years ago (OK, about nine) I wanted to merge the GEDCOM files from some members of my family who were working on different lines. Before I even started I realised that this was not going to be a simple one-off task, so I thought about automating it. Perl 5 was soon to be released and I wanted to try it out. My first child had just been born and since he has never been a good sleeper I had a few nights available to think about and program something.

You will notice that the GEDCOM format seems quite amenable to parsing with Perl, and that turns out to be the case. So I wrote a module to do just that. Four years ago I finally got around to releasing it. Let's take a look at what you can do with it.


Using Gedcom.pm

The first thing to do is to read in the GEDCOM file. At its simplest, this will involve a statement such as:

    my $ged = Gedcom->new(gedcom_file => $gedcom_file);

To select an individual use a statement such as:

    my $i = $ged->get_individual("Paul Johnson");

Then, to access attributes of the individual, use a statement such as:

    my $b = $i->birth;

Now, in GEDCOM, the birth record has sub-records containing the useful information such as the date and place of birth. These can be accessed through:

    my $d = $b->date;
    my $p = $b->place;

Or, if that seems like too much work:

    my $bd = $i->get_value("birth date");
    my $bp = $i->get_value("birth place");

You can move around relationships as you might expect to:

    my @sons = $i->sons;

Of course, there's a lot more to it than that, but you can read the documentation for the details.


Example programs

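A complete program which lists every individual in a GEDCOM file along with their date of birth takes only a few lines:

    #!/usr/local/bin/perl -w
    use strict;
    use Gedcom;
    # The name of the GEDCOM file is the first command line argument.
    my $ged = Gedcom->new(gedcom_file => shift);
    # Print each name, padded to 40 columns, followed by the date of birth.
    printf "%-40s %s\n", $_->name, $_->get_value("birth date") || "Unknown"
        for $ged->individuals;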


Design of Gedcom.pm

A problem with every commercial genealogical program of which I am aware is that an import and export of a GEDCOM file is lossy. Gedcom.pm never converts to an internal format, so no information is ever lost, even if that information is an extension to the standard. I find this to be vitally important. Genealogy is all about information and evidence. I don't think it is acceptable that information should be lost by design.

Grammar

I mentioned that the GEDCOM standard includes a grammar. It also includes a description of each tag. The grammar specifies which records are valid at which points in a GEDCOM file. Rather than encode those rules in Gedcom.pm I decided to read and parse the grammar file. This provides the advantage that should the standard ever change I need only use an updated grammar file. Since releasing Gedcom.pm the standard has not been updated, but some people have slightly altered the grammar file to cater for commercial programs which do not follow the standard.

The grammar file also defines which methods may be called on records. The names of the allowed methods may be either the names of the GEDCOM tags (BIRT, DEAT, PLAC etc) or their descriptions (birth, death, place etc) as defined in the standard, in upper, lower or mixed case. This is done by using an AUTOLOAD subroutine to catch all unknown method calls, create the method if it is valid and then call it. The method looks something like:

    sub AUTOLOAD
    {
      my ($self) = @_;                         # don't change @_ because of the goto
      my $func = $AUTOLOAD;
      $func =~ s/^.*:://;                      # strip the package name
      carp "Undefined subroutine $func called" unless $Gedcom::Funcs{lc $func};
      no strict "refs";
      *$func = sub                             # install the method for next time
      {
        my $self = shift;
        my ($count) = @_;
        my $v;
        # Return the value of each matching record, or the record itself
        # if it has no value.
        if (wantarray)
        {
          return map
            { $_ && do { $v = $_->full_value; defined $v && length $v ? $v : $_ } }
            $self->record([$func, $count]);
        }
        else
        {
          my $r = $self->record([$func, $count]);
          return $r && do { $v = $r->full_value; defined $v && length $v ? $v : $r };
        }
      };
      goto &$func;                             # call it with the original @_
    }
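
To see the technique in isolation, here is a self-contained sketch. The package name Record, the %tags table and the hash-based storage are all assumptions made for the example. Gedcom.pm takes the valid names from the grammar file rather than from a hard-coded table, and this sketch dies on an unknown name where the code above only warns.

```perl
use strict;
use warnings;

package Record;

use Carp;

our $AUTOLOAD;

# Map both tags and descriptions to the canonical tag.  A hard-coded table
# stands in for what would be read from the grammar file.
my %tags = (birt => "BIRT", birth => "BIRT", plac => "PLAC", place => "PLAC");

sub new { my ($class, %data) = @_; bless { %data }, $class }

sub AUTOLOAD
{
  my $func = $AUTOLOAD;
  $func =~ s/^.*:://;                     # strip the package name
  return if $func eq "DESTROY";           # don't generate a destructor
  my $tag = $tags{lc $func}
    or croak "Undefined subroutine $func called";
  no strict "refs";
  *$AUTOLOAD = sub { $_[0]{$tag} };       # install the accessor
  goto &$AUTOLOAD;                        # and call it with the original @_
}

package main;

my $r = Record->new(BIRT => "3 JUN 1865", PLAC => "London");
print $r->birth, "\n";                    # method is created on first use
print $r->BIRT,  "\n";                    # a separate method, same data
my $installed = Record->can("birth");
print "birth is now a real method\n" if $installed;
```

After the first call the method exists in the symbol table, so subsequent calls never reach AUTOLOAD and pay no extra cost.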

The GEDCOM standard also allows other tags, providing they are prefixed with an underscore. Almost all commercial genealogical programs extend the standard, but many don't follow the rules and so although Gedcom.pm will optionally warn about invalid constructs, they are all accepted.

Lazy parsing

Some people have done extensive genealogical research and may have GEDCOM files with thousands of individuals and families. People are keen to publish their research and of course, Perl is good for building web applications. But Perl is neither the fastest nor the most memory efficient of languages. So one trick I added is lazy parsing.

This is an optional mode in which, the first time a GEDCOM file is parsed, an index file is created giving the offsets into the GEDCOM file at which top level records can be found. Then, when the GEDCOM file is used subsequently, these records are only read and parsed if they are needed. This is inefficient if you are going to be looking at every record anyway, but for something like a CGI script which will only be looking at an individual, or all the individuals in a family, it makes using Gedcom.pm a viable option. Or more viable, anyway. More on this a little later.
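
The idea behind the index can be sketched as follows. This is not the code from Gedcom.pm: the index is kept in a hash rather than written to an index file, an in-memory filehandle stands in for the GEDCOM file, and the names build_index and read_record are hypothetical.

```perl
use strict;
use warnings;

my $gedcom = <<'EOG';
0 HEAD
1 SOUR example
0 @I1@ INDI
1 NAME Albert /Wettin/
0 @I2@ INDI
1 NAME Victoria /Hanover/
1 SEX F
0 TRLR
EOG

# First pass: remember where each top level record with an xref starts.
sub build_index
{
  my ($fh) = @_;
  my %offset;
  seek $fh, 0, 0;
  while (1)
  {
    my $pos  = tell $fh;                  # offset of the line about to be read
    my $line = <$fh>;
    last unless defined $line;
    $offset{$1} = $pos if $line =~ /^0\s+\@([^\@]+)\@/;
  }
  \%offset
}

# Later: read only the lines of the one record which is wanted.
sub read_record
{
  my ($fh, $index, $xref) = @_;
  return unless exists $index->{$xref};
  seek $fh, $index->{$xref}, 0;
  my @lines = scalar <$fh>;               # the level 0 line itself
  while (defined(my $line = <$fh>))
  {
    last if $line =~ /^0\s/;              # start of the next top level record
    push @lines, $line;
  }
  @lines
}

open my $fh, "<", \$gedcom or die "Cannot open in-memory file: $!";
my $index = build_index($fh);
print read_record($fh, $index, "I2");
```

Only the lines belonging to the requested record are read, and they would only then need to be parsed, which is where the saving comes from when a script looks at a handful of individuals.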

Lifelines programs

There is an old Unix genealogical program named Lifelines. A few years back development had ceased and the program looked dead. Lifelines has a report generation language and a number of people had written some nice programs to generate PostScript output, provide statistics and do other useful or exciting things. I decided that I wanted to use those scripts with Perl, but I didn't want to manually translate them to Perl. Someone named Damian had recently released a module by the name of Parse::RecDescent so I wrote a grammar for the Lifelines language, wrote a module containing Perl translations of all the builtin functions (well, most of them anyway), and then had all those useful programs in Perl.

Since then, Lifelines has received a new lease of life and is actively being worked on. But I still like having the scripts in Perl.


Searching and Sorting

Searching for people is a fundamental operation for a genealogical program. The get_individual method performs 13 matches searching for individuals and returns a list of all the matches, in decreasing order of exactitude. If you call the method in scalar context only the first match is returned.

The matches are:

     1 - Xref
     2 - Exact
     3 - On word boundaries
     4 - Anywhere
     5 - Exact, case insensitive
     6 - On word boundaries, case insensitive
     7 - Anywhere, case insensitive
     8 - Names in any order, on word boundaries
     9 - Names in any order, anywhere
    10 - Names in any order, on word boundaries, case insensitive
    11 - Names in any order, anywhere, case insensitive
    12 - Soundex code
    13 - Soundex of name

You'll notice a pattern in matches 2 to 11 inclusive. Using compiled regular expressions (qr//) as values the solution comes out quite nicely. Matches 2 to 7 inclusive come out as:

  for my $n ( map { qr/^$_$/, qr/\b$_\b/, $_ } map { $_, qr/$_/i } qr/\Q$name/ )
  {
    push @i, $ordered->($n, @ind);
    return $i[0] if !$all && @i;
  }

Matches 8 to 11 inclusive use a similar, but slightly more involved technique.

Another interesting part of the searching is the use of a technique similar to the Schwartzian Transform, but with a grep instead of a sort. Rather than get the name of the individual each time it is needed, it is stored in an anonymous array with the individual object. The grep works on these anonymous arrays, then returns a list of the objects at the end. @ind in the example above is an array of the anonymous arrays, and the $ordered function does the grep and subsequent map.
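
A self-contained sketch of this arrangement follows. Plain hashes stand in for individual objects, the patterns are built from a quoted string, and duplicates are removed with a %seen hash. All of these are choices made for the example rather than a description of what get_individual does internally, and the function name find is hypothetical.

```perl
use strict;
use warnings;

# Stand-ins for individual objects.
my @people = map { +{ name => $_ } }
             "Victoria Hanover", "Victoria Adelaide Mary", "victoria louise";

# Fetch each name once and keep it alongside its object.
my @ind = map { [ $_->{name}, $_ ] } @people;

# grep on the cached name, then map back to the objects.
my $ordered = sub
{
  my ($re, @pairs) = @_;
  map { $_->[1] } grep { $_->[0] =~ $re } @pairs
};

# Return the matches for $name, most exact first, without duplicates.
sub find
{
  my ($name) = @_;
  my $q = quotemeta $name;
  my (@found, %seen);
  for my $re (qr/^$q$/, qr/\b$q\b/, qr/$q/, qr/^$q$/i, qr/\b$q\b/i, qr/$q/i)
  {
    push @found, grep { !$seen{$_}++ } $ordered->($re, @ind);
  }
  @found
}

print "$_->{name}\n" for find("Victoria");
```

The saving is small here because the name is just a hash lookup, but it grows when fetching the name means walking a record on every one of the pattern matches.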


Basic Dynamic CGI

I mentioned earlier that I had written a basic CGI script. It gives output such as:

   Victoria HANOVER
   +---------------------------------------------------------------------+
   |  Event   |    Date     |                   Place                    |
   |----------+-------------+--------------------------------------------|
   |  Birth   | 24 MAY 1819 |      Kensington,Palace,London,England      |
   |----------+-------------+--------------------------------------------|
   |  Death   | 22 JAN 1901 |    Osborne House,Isle of Wight,England     |
   |----------+-------------+--------------------------------------------|
   |  Burial  |      -      | Royal Mausoleum,Frogmore,Berkshire,England |
   |----------+-------------+--------------------------------------------|
   | Marriage | 10 FEB 1840 |   Chapel Royal,St. James Palace,England    |
   +---------------------------------------------------------------------+
   +----------------------------------------------------------------+
   | Relation |          Name           |    Birth    |    Death    |
   |----------+-------------------------+-------------+-------------|
   | Husband  | Albert Augustus Charles | 26 AUG 1819 | 14 DEC 1861 |
   |----------+-------------------------+-------------+-------------|
   |  Father  | Edward Augustus HANOVER | 2 NOV 1767  | 23 JAN 1820 |
   |----------+-------------------------+-------------+-------------|
   |  Mother  |  Victoria Mary Louisa   | 17 AUG 1786 | 16 MAR 1861 |
   |----------+-------------------------+-------------+-------------|
   |  Child   | Victoria Adelaide Mary  | 21 NOV 1840 | 5 AUG 1901  |
   |----------+-------------------------+-------------+-------------|
   |  Child   |    Edward_VII WETTIN    | 9 NOV 1841  | 6 MAY 1910  |
   |----------+-------------------------+-------------+-------------|
   |  Child   |     Alice Maud Mary     | 25 APR 1843 | 14 DEC 1878 |
   |----------+-------------------------+-------------+-------------|
   |  Child   |  Alfred Ernest Albert   | 6 AUG 1844  | 30 JUL 1900 |
   |----------+-------------------------+-------------+-------------|
   |  Child   | Helena Augusta Victoria | 25 MAY 1846 | 9 JUN 1923  |
   |----------+-------------------------+-------------+-------------|
   |  Child   | Louise Caroline Alberta | 18 MAR 1848 | 3 DEC 1939  |
   |----------+-------------------------+-------------+-------------|
   |  Child   | Arthur William Patrick  | 1 MAY 1850  | 16 JAN 1942 |
   |----------+-------------------------+-------------+-------------|
   |  Child   |  Leopold George Duncan  | 7 APR 1853  | 28 MAR 1884 |
   |----------+-------------------------+-------------+-------------|
   |  Child   | Beatrice Mary Victoria  | 14 APR 1857 | 26 OCT 1944 |
   +----------------------------------------------------------------+

The names are all links. Of course, it's ugly, but it's just a demonstration. If anyone wants to do something nice, using Template Toolkit for example, I'll happily add it to the distribution. This output comes very simply with Gedcom.pm and CGI.pm.

    sub indi
    {
      my $gedcom = param("gedcom");
      my $indi   = param("indi");
      my $ged    = gedcom($gedcom);
      my $i      = $ged->get_individual($indi);
      my $name   = $i->cased_name;
      my $sex    = uc $i->sex;
      my $spouse = $sex eq "M" ? "wife" : $sex eq "F" ? "husband" : "spouse";
      print header,
            start_html(-title => $name),
            h1($name),
            table
            (
              { -border => undef },
              Tr
              (
                { align => "CENTER", valign => "TOP" },
                [
                  th([ "Event", "Date", "Place"]),
                  event_row("Birth",       $i->birth),
                  event_row("Christening", $i->christening),
                  event_row("Baptism",     $i->baptism),
                  event_row("Death",       $i->death),
                  event_row("Burial",      $i->burial),
                  event_row("Marriage",    $i->get_record(qw(fams marriage))),
                ]
              )
            ),
            p,
            table
            (
              { -border => undef },
              Tr
              (
                { align => "CENTER", valign => "TOP" },
                [
                  th([ "Relation", "Name", "Birth", "Death"]),
                  indi_row($gedcom, ucfirst $spouse ,$i->$spouse()),
                  indi_row($gedcom, "Father", $i->father),
                  indi_row($gedcom, "Mother", $i->mother),
                  indi_row($gedcom, "Child",  $i->children),
                ]
              )
            ),
            p(a({-href => "/cgi-bin/gedcom.cgi?op=main&gedcom=$gedcom"}, $gedcom)),
            end_html;
    }
    sub event_row
    {
      my ($n, @e) = @_;
      map { td
            ([
              $n,
              $_->get_value("date")  || "-",
              $_->get_value("place") || "-",
            ])
          } @e
    }
    sub indi_row
    {
      my ($g, $n, @i) = @_;
      map { td
            ([
              $n,
              a({-href => "/cgi-bin/gedcom.cgi?op=indi&gedcom=$g&indi=" . $_->xref},
                $_->cased_name),
              $_->get_value("birth date") || "-",
              $_->get_value("death date") || "-",
            ])
          } @i
    }

You can see this in action at http://pjcj.sytes.net/cgi-bin/gedcom.cgi?op=indi&gedcom=royal92&indi=I1 (provided my server is up).


The Future

GEDCOM Replacements

Gedcom.pm was designed to grow as the GEDCOM standard grew. However, since the last release, version 5.5, something has happened. Now, if your data format is not XML compliant no one will take you seriously.

Different people have attacked this problem in different ways. The simplest solution is to translate the current GEDCOM format directly into XML. This isn't too hard to do, and a few lines added to Gedcom.pm provide a method to output XML.
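
To give a feel for how little is involved, here is one way such a direct translation might look. It is a sketch under assumptions and not the method in Gedcom.pm: the element and attribute names (GED, ID, VALUE) and the function names are invented for the example, and each GEDCOM tag simply becomes an element of the same name.

```perl
use strict;
use warnings;

# Escape the characters which are special in XML attribute values.
sub escape_xml
{
  my ($s) = @_;
  $s =~ s/&/&amp;/g;
  $s =~ s/</&lt;/g;
  $s =~ s/>/&gt;/g;
  $s =~ s/"/&quot;/g;
  $s
}

sub gedcom_to_xml
{
  my @lines = @_;
  my $xml   = "<GED>\n";
  my @open;                               # tags of the elements still open
  for my $line (@lines)
  {
    my ($level, $xref, $tag, $value) =
      $line =~ /^\s*(\d+)\s+(?:\@([^\@]+)\@\s+)?(\w+)(?:\s+(.*?))?\s*$/
        or next;
    # Close the elements at this level or deeper.
    while (@open > $level)
    {
      my $t = pop @open;
      $xml .= "  " x (@open + 1) . "</$t>\n";
    }
    my $attr = defined $xref ? qq( ID=") . escape_xml($xref) . qq(") : "";
    $attr   .= qq( VALUE=") . escape_xml($value) . qq(") if defined $value;
    $xml    .= "  " x (@open + 1) . "<$tag$attr>\n";
    push @open, $tag;
  }
  # Close whatever is still open at the end of the input.
  while (@open)
  {
    my $t = pop @open;
    $xml .= "  " x (@open + 1) . "</$t>\n";
  }
  $xml . "</GED>\n"
}

my $xml = gedcom_to_xml(split /\n/, <<'EOG');
0 @F2@ FAM
1 HUSB @I1@
1 MARR
2 DATE Tuesday, 10th March 1863
EOG
print $xml;
```

Nothing about the meaning of the data changes in such a translation, only the punctuation around it.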

But many people have noticed fundamental problems in the GEDCOM standard, and want to use the opportunity to extend or replace the GEDCOM semantics. This has led to many discussions on many mailing lists about subjects such as how to accurately represent dates, especially those originating in foreign cultures (where foreign is relative), how to accurately represent places, especially when the name of the place has changed since the event, or even moved into a different country, and even about fundamental concepts such as what someone's surname is, for that concept is meaningless in some cultures.

What many people have failed to grasp is that the ontology, or the structuring of the genealogical data, is entirely separate from its representation, or the way it is stored. This is something that the GEDCOM standard actually makes quite clear. It tells us that GEDCOM is both a data representation language that may be used to represent any form of structured information, and also a lineage-linked grammar which describes individuals linked in family relationships across multiple generations.

To my mind, the data representation is pretty unimportant. I care little whether the format is GEDCOM, XML, a Perl data structure or anything else. Converting between them is basically trivial. It is the ontology which is important; the representation of the objects and concepts in the genealogical data, and the relationships between them.

A number of suggestions have been put forward for new ontologies. Some are well conceived, and some aren't. The GENTECH Genealogical Data Model falls into the former category, but I don't know of any software actually using it. GEDCOM version 6 is also out for review purposes, specifying an XML output format. If anything gets picked up on, it is likely to be this, but there still seems to be a lot of life in the 5.5 standard.

Merging Trees and Identifying Duplicates

This was the reason I first started with Gedcom.pm. And it's still top of the TODO list. In my mind at least, if not literally, although it is the oldest item there. I've actually started on it recently, but the problem is not trivial. I think the hardest problem is devising a useful API, but I think that about a lot of problems. The point is that once you identify some individuals who might be the same person then their parents, siblings and children are candidates for merging. My current thinking is that the API will include some measure of the confidence that two individuals represent the same person.

GUIs

I edit my GEDCOM files in vim, but most people don't. Most people who do genealogical research want to use a GUI. I wrote a GUI in Tk ages ago. It's still in the distribution, but it won't compile now. It also tried to do some things that really should have been in the main module. Well, now they are, so if anyone fancies writing a nice GUI, I'll include that too.


Conclusion

Perl is well suited to parsing and manipulating GEDCOM files. When I first wrote Gedcom.pm I had a 1990 DECStation 5000/200 and Gedcom.pm was slow. On modern hardware and with the optimisations available it becomes a useful tool for manipulating genealogical data.