Creating a dataset for statistical analysis

Individual data tables are of limited value for analysis, while conventional statistical software cannot trace individual histories through various tables. The following sequences of SQL commands construct a new table with one record per man-year, with variables giving the man's age and the town he was in, the number of days during the year for which he was a member, and the number of days he was on each of the different benefits. A wide range of conventional analyses can be performed on this new table.


(1) Create a temporary look-up table of members' ages

These SQL commands consolidate information about members' ages into a single look-up table; four different tables containing ages are queried, in descending order of reliability. The table created also records the source of each age:

create table temp_age (
member integer,
b_year integer,
source vchar(1));

insert into temp_age (member, b_year, source)
select member, b_year=year-age, source='D'
from d_cert
where age is not null and member is not null
and date is not null;

insert into temp_age (member, b_year, source)
select member, b_year=year-age, source='F'
from funerals
where age is not null and member is not null
and date is not null
and member not in (select member from temp_age);

insert into temp_age (member, b_year, source)
select member, b_year=year-age, source='R'
from reg_book
where age is not null and member is not null
and date is not null
and age < 100
and member not in (select member from temp_age);

insert into temp_age (member, b_year, source)
select member, b_year=year-age_y, source='A'
from ase_reg
where age_y is not null and member is not null
and date is not null
and member not in (select member from temp_age);


(2) Create the basic dataset for analysis.

These commands create a new table containing one row for each SEM member of known age who was continuously a member during a particular year or died during it, with many values initially set to zero or left empty, and then classify the data into age groups:

create table ts_data as
select fin_year=m.report,
town=b.town,
member=m.member,
record_id=m.record_id,
event=m.code,
age=r.start_yr-a.b_year,
age_grp='Unset',
age_source=a.source,
year_len = int4(interval('days', r.end_date - r.start_date)),
u_days=0,
s_days=0,
p_days=0
from members m, branches b, temp_age a, reports r
where m.lsb_id is not null and
m.nsb_id is not null and
m.branch=b.branch and
m.member=a.member and
a.b_year is not null and
r.report=m.report;

/* Adds men who died to the table: */

insert into ts_data (fin_year, town, member, record_id, event, age,
age_grp, age_source, year_len, u_days, s_days, p_days)
select fin_year=m.report,
town=b.town,
member=m.member,
record_id=m.record_id,
event=m.code,
age=r.start_yr-a.b_year,
age_grp='Unset',
age_source=a.source,
year_len = int4(interval('days', f.date - r.start_date)),
u_days=0,
s_days=0,
p_days=0
from members m, funerals f, branches b, temp_age2 a, reports r
where m.lsb_id is not null and
m.nsb_id is null and
m.branch=b.branch and
m.record_id=f.mem_rec and
f.f_status='M' and
f.date is not null and
m.member=a.member and
a.b_year is not null and
r.report=m.report;

/* Assign each record to the relevant age group: */
update ts_data set age_grp='<30' where age < 30;
update ts_data set age_grp='30-39' where age >= 30 and age < 40;
update ts_data set age_grp='40-49' where age >= 40 and age < 50;
update ts_data set age_grp='50-59' where age >= 50 and age < 60;
update ts_data set age_grp='60+' where age >= 60;


(3) Add a count of the number of days unemployed

These commands first create a temporary summary table from the 'payments' table, and then uses this to insert the numbers of days a man was unemployed into the analytical data set. Almost identical sets of commands similarly insert the number of days sick and on superannuation:

create table temp_u_summary as
select mem_rec, u_days=(sum(weeks)*6)+sum(days)
from payments
where type='U'
group by mem_rec;

create unique index usum_idx on temp_u_summary (mem_rec)
with structure = btree,
compression = (nokey),
key = (mem_rec),
nonleaffill = 80,
leaffill = 70,
fillfactor = 80,
location = (ii_database)

update ts_data d from temp_u_summary u
set u_days=u.u_days
where d.record_id=u.mem_rec and
u.u_days is not null;


(4) Specimen Output

The output from the above sequence of commands is very extensive; this is just a small sample of records. The individual men covered are identified by the value in the 'Member' column, so this includes five men from two towns, while the first column identifies the year involved. The 'Event' column shows information taken from the footnotes to the membership list; among these examples, one of the Blackburn members was listed as having gone abroad, but he must have been back by the following year in order to have been included. 'Age' and 'Age Grp' are self-explanatory, while 'Age Src.' indicates the basis for our knowledge of the member's age. 'Year len.' is the number of days 'at risk' during the financial year, usually simply the number of days recorded in the 'reports' table, while the final three columns give the number of days during the year for which the member received each of the three types of benefit:

Fin. Record Age Age Age Year No. of days on:
Year Town Member ID Event Group Src. Len. Unem Sick Super

1873 BLACKBURN 71869036 71873026 . 35 30-39 F 364 0 0 0
1874 BLACKBURN 71869036 71874024 A 36 30-39 F 214 0 0 0
1875 BLACKBURN 71869036 71875023 . 36 30-39 F 365 0 0 0
1870 BLACKBURN 71869112 71870092 . 41 40-49 F 364 0 0 0
1871 BLACKBURN 71869112 71871083 . 42 40-49 F 365 36 0 0
1872 BLACKBURN 71869112 71872081 . 43 40-49 F 364 0 0 0
1872 BLACKBURN 71871100 71872094 . 35 30-39 F 364 0 24 0
1836 BIRMINGHAM 81835001 81836001 . 50 50-59 D 365 0 0 0
1837 BIRMINGHAM 81835001 81837001 . 51 50-59 D 365 0 0 0
1838 BIRMINGHAM 81835001 81838001 . 52 50-59 D 365 0 6 0
1836 BIRMINGHAM 81835020 81836019 . 32 30-39 F 365 0 6 0
1837 BIRMINGHAM 81835020 81837015 . 33 30-39 F 365 0 0 0
1838 BIRMINGHAM 81835020 81838014 . 34 30-39 F 365 0 0 0
1839 BIRMINGHAM 81835020 81839006 . 35 30-39 F 366 0 0 0
1840 BIRMINGHAM 81835020 81840005 . 36 30-39 F 365 0 0 0
1840 BIRMINGHAM 81835020 81840005 . 30 30-39 F 365 0 0 0


(c) Humphrey Southall, 1997


Back to SEM Database Home Page