The HARRY_READ_ME.txt file

Part 35s

So.. back to the update process :-(

Well to take a slightly different tack, I thought I'd look at the gridding end of
things. Specifically, how to run IDL in batch mode. I think I've got it: you create
a batch file with the command(s) in, then setenv IDL_STARTUP [name of batch file].
When you type 'idl' it runs the batch file; unfortunately it doesn't quit afterwards,
though adding an 'exit' line to the batch file does the trick! Of course, there is no
easy way to check it's working properly, since the random element (used when relaxing
to the climatology) ensures that each run gives different results:

crua6[/cru/cruts/version_3_0/secondaries/cld] cmp testglo/cld.2004.11.glo testglo2/cld.2004.11.glo
testglo/cld.2004.11.glo testglo2/cld.2004.11.glo differ: char 9863, line 104
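For reference, the batch mechanism just described can be sketched in Python; `make_idl_batch` is a hypothetical helper, and the only behaviour assumed is what's stated above (IDL runs the file named by IDL_STARTUP at startup, and an appended 'exit' line makes it quit rather than sit at the prompt):

```python
import os
import tempfile

def make_idl_batch(commands, path):
    """Write IDL commands to `path` and return an environment with
    IDL_STARTUP pointing at it.  An 'exit' line is appended so the
    IDL session quits instead of dropping to the IDL prompt."""
    with open(path, "w") as f:
        for cmd in commands:
            f.write(cmd + "\n")
        f.write("exit\n")
    return dict(os.environ, IDL_STARTUP=path)

# Usage (not run here, since it needs IDL installed):
#   env = make_idl_batch(["print, 'gridding run'"], "batch.pro")
#   subprocess.run(["idl"], env=env)
```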

Still, the mechanism is so similar to that used to run other Fortran progs that we
can carry on, I guess. Naturally I would prefer to use the gridder I wrote, partly
because it does a much better, *documentable* job, but mainly because I don't want
all that effort wasted!

Also looked at NetCDF production, as it's still looming. ncgen looks quite good, it
can work from a 'CDL' file (format is the same as the output from ncdump). It can
even produce fortran code to reproduce the file!!

Ah well. Back to the 'incoming data' process. The fact that the mcdw2cruauto and
climat2cruauto programs worked fine for CLD is a big bonus: they read their runs
and date files, and they wrote their results. Though the results didn't include the
names of the output databases, I've had second thoughts about that. I want the
update program to be in charge, so it should know what files have been produced
(assuming the result is 'OK'). If the conversion program sends back a list, then
the update program will have to parse it to find out which parameter is which,
and that's silly when it should know anyway!! The situation is different for
merging. I don't have a full strategy for file naming yet. Let's look at a typical
process for an unnamed (not tmn or tmx) primary parameter, i.e. the simple case:

File(s)                     Process
mcdw update(s)
                            convert mcdw
mcdw db
current db
                            merge mcdw into current
current+mcdw db

climat update(s)
                            convert climat
climat db
                            merge climat into current+mcdw
current+mcdw+climat db
anomaly files
gridded anomalies
gridded actuals
                            reformat into .dat and .nc
final output files

So, naming. Well the governing principle of the update process is that all files
have the same 10-digit datestamp. So the run can be uniquely identified, as can
all its files (data, log, etc). I am NOT changing that! A main problem is that
we will have to depart from the rigid database naming schema ('tla.datestr.dtb')
because we will have lots of databases in a single run. In the above example,
four databases will all have the same datestamp. Here's a possible name system:

mcdw db                  mcdw.tla.datestr.dtb
current+mcdw db          int1.tla.datestr.dtb
climat db                clmt.tla.datestr.dtb
current+mcdw+climat db   int2.tla.datestr.dtb

The final db would then be copied or renamed to:


For secondary parameters it's even worse! I'm not super-keen on the use of 'int1'
('interim 1') and so on.. they give no useful information. But a more complicated
schema isn't going to be understood by anyone else anyway! And we should have the
Database Master List to refer to at all times.. okay. All interim databases will
be labeled 'int1', 'int2', and so forth. The update program will have to keep
track of numbering. And, of course - it will have to tell the merging program
what to call the output database! Bah.
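The schema itself is trivial to express; a one-line Python sketch (the update program itself is Fortran, so this is only for reference):

```python
def db_name(stage, par, datestr):
    """One database name under the schema above: a stage tag
    ('mcdw', 'clmt', 'int1', 'int2', ...), the three-letter
    parameter code, and the run's 10-digit datestamp."""
    return "%s.%s.%s.dtb" % (stage, par, datestr)
```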

It gets WORSE. The update program has to know which 'Master' database to pass to
the merge program. For MCDW, it's going to be the 'current' database for that
parameter. But for CLIMAT and BOM, it depends on whether MCDW or CLIMAT
(respectively) merges have gone before. And only for those parameters that are
precursored! More complexity. Well, I suppose I can take one of two approaches:

1. Test at each stage for each parameter (ie for BOM, test whether CLIMAT tmx/tmn
have just been done). This could be done by testing for the filenames or by
setting flags.
2. Maintain a list in memory of 'latest' databases for each parameter. A bit less
elegant, but easier to understand and use.

Well, as we already HAVE (2), we'll go with that one ;0).
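Option (2) can be sketched in Python (again, the real code is Fortran); the int1/int2/... numbering rule here is a guess from the filenames in this log, where one merge stage covering both tmn and tmx gets a single interim number:

```python
class LatestDbs:
    """In-memory record of the 'latest' database per parameter.
    Seeded with the current databases; each merge stage bumps the
    interim number and updates the 'latest' entries."""
    def __init__(self, current):
        self.latest = dict(current)   # param -> path of current db
        self.nstage = 0               # merge stages completed this run
    def merge_stage(self, pars, datestr):
        """Record one merge stage (e.g. CLIMAT into current+mcdw)
        covering one or more parameters; all outputs share the same
        interim number, as int2.tmn/int2.tmx do below."""
        self.nstage += 1
        for par in pars:
            self.latest[par] = "int%d.%s.%s.dtb" % (self.nstage, par, datestr)
        return [self.latest[p] for p in pars]
```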

Okay. Because it is so complicated (well, for my brain anyway), I'm going to write
out the filenames that update is using and expecting, so I can check that the
conversion and merging programs tie in.

dtstr = 0902161655
par = TMP
source = MCDW
prev db = db/tmp/tmp.0809111204.dtb

runs/runs.0902161655/conv.mcdw.0902161655.dat            Run information
updates/MCDW/db/db.0902161655                            Dir for output dbs
results/results.0902161655/conv.mcdw.0902161655.res      Expected results file
updates/MCDW/db/db.0902161655/mcdw.tmp.0902161655.dtb    Expected output db
logs/logs.0902161655/conv.mcdw.0902161655.log            Expected log file

db/tmp/tmp.0809111204.dtb                                Current/latest db
updates/MCDW/db/db.0902161655/mcdw.tmp.0902161655.dtb    New db to be merged in
updates/MCDW/db/db.0902161655/int1.tmp.0902161655.dtb    Interim output db
runfile.latest.dat                                       Contains name of current run file
runs/runs.0902161655/merg.mcdw.0902161655.dat            Run information (read from above)
results/results.0902161655/merg.mcdw.0902161655.res      Expected results file
updates/MCDW/db/db.0902161655/int1.tmp.0902161655.dtb    Expected output db
logs/logs.0902161655/merg.mcdw.0902161655.log            Expected log file

These all seem to match up with the respective programs! Not sure that all
the necessary directories are being created yet, though.. they are now. Some
modifications to the above have been made (and retrospectively updated).

So, with half of the update program written, I got it all compiled, reset all
the incoming data to 'unprocessed', and.. got it working!

Of course, I immediately realised that I'd missed out the DTR conversion at the end.
And that.. didn't go any better than the rest of it, despite a quick conversion of

Well, keen-eyed viewers will remember that all the tmin/tmax/dtr/back-to-tmin-and-tmax
stuff revolves around the tmin and tmax databases being kept in absolute step. That is,
same stations, same coordinates and names, same data spans. Otherwise the job of
synching, and of converting to DTR, becomes horrendous. But look at what happens to the
line counts of the databases as they're mangled through the system:

originals ** identical metadata **
606244 tmn/tmn.0708071548.dtb
606244 tmx/tmx.0708071548.dtb

climat conversions
27090 climat.tmn.0902192248.dtb
27080 climat.tmx.0902192248.dtb

climat merged interims
607692 int2.tmn.0902192248.dtb
604993 int2.tmx.0902192248.dtb

bom conversions ** identical metadata **
5388 bom.tmn.0902192248.dtb
5388 bom.tmx.0902192248.dtb

bom merged (into climat interims) interims
607692 int3.tmn.0902192248.dtb
604993 int3.tmx.0902192248.dtb

Sometimes life is just too hard. It's after midnight - again. And I'm doing all this
over VNC in 256 colours, which hurts. Anyway, the above line counts. I don't know
which is the more worrying - the fact that adding the CLIMAT updates lost us 1251
lines from tmax but gained us 1448 for tmin, or that the BOM additions added sod all.
And yes - I've checked, the int2 and int3 databases are IDENTICAL. Aaaarrgghhhhh.

I guess.. I am going to need one of those programs I wrote to sync the tmin and tmax
databases, aren't I?

Actually, it's worse than that. The CLIMAT merges for TMN and TMX look very similar:

New master database: updates/CLIMAT/db/db.0902192248/int2.tmn.0902192248.dtb

Update database stations: 2922
> Matched with Master stations: 2227
(automatically: 2227)
(by operator: 0)
> Added as new Master stations: 566
> Rejected: 129
Rejects file: updates/CLIMAT/db/db.0902192248/climat.tmn.0902192248.dtb.rejected

New master database: updates/CLIMAT/db/db.0902192248/int2.tmx.0902192248.dtb

Update database stations: 2921
> Matched with Master stations: 2226
(automatically: 2226)
(by operator: 0)
> Added as new Master stations: 566
> Rejected: 129
Rejects file: updates/CLIMAT/db/db.0902192248/climat.tmx.0902192248.dtb.rejected

I don't see how we end up with such drastic differences in line counts!!

Well, the first thing to do was to fix climat2cruauto so that it treated tmin and tmax as
inseparable. Thus the CLIMAT databases for these two should be identical (um, apart from
the data values).
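The fix amounts to intersecting the two station lists; a Python sketch, where the `{code: record}` dicts are a hypothetical stand-in for the .dtb structures:

```python
def pair_conversions(tmn, tmx):
    """Treat tmin and tmax as inseparable: keep only station codes
    present in BOTH conversions, so the two output databases carry
    identical metadata and differ only in their data values."""
    common = sorted(set(tmn) & set(tmx))
    return ({c: tmn[c] for c in common}, {c: tmx[c] for c in common})
```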

OK, this is getting SILLY. Now the BOM and CLIMAT conversions are in sync, and the original
databases are in sync, yet the processing creates massive divergence!!

606244 db/tmn/tmn.0708071548.dtb
606244 db/tmx/tmx.0708071548.dtb

climat conversions
27080 updates/CLIMAT/db/db.0902201023/climat.tmn.0902201023.dtb
27080 updates/CLIMAT/db/db.0902201023/climat.tmx.0902201023.dtb

climat merged interims
607687 updates/CLIMAT/db/db.0902201023/int2.tmn.0902201023.dtb
604987 updates/CLIMAT/db/db.0902201023/int2.tmx.0902201023.dtb

bom conversions ** identical metadata **
5388 updates/BOM/db/db.0902201023/bom.tmn.0902201023.dtb
5388 updates/BOM/db/db.0902201023/bom.tmx.0902201023.dtb

bom merged (into climat interims) interims
607687 updates/BOM/db/db.0902201023/int3.tmn.0902201023.dtb
604987 updates/BOM/db/db.0902201023/int3.tmx.0902201023.dtb

So the behaviour of newmergedbauto is, for want of a better word, unpredictable. Oh, joy.
And, as indicated, the BOM updates are totally rejected:

New master database: updates/BOM/db/db.0902201023/int3.tmn.0902201023.dtb

Update database stations: 898
> Matched with Master stations: 0
(automatically: 0)
(by operator: 0)
> Added as new Master stations: 0
> Rejected: 898
Rejects file: updates/BOM/db/db.0902201023/bom.tmn.0902201023.dtb.rejected

Update database stations: 898
> Matched with Master stations: 0
(automatically: 0)
(by operator: 0)
> Added as new Master stations: 0
> Rejected: 898
Rejects file: updates/BOM/db/db.0902201023/bom.tmx.0902201023.dtb.rejected

I really thought I was cracking this project. But every time, it ends up worse than before.

OK, let's try and work out the order of events. I'm using getheads to look at metadata only.

1. CLIMAT conversions. These seem to be working fine:

crua6[/cru/cruts/..CLIMAT/db/db.0902201023] cmp climat.tmn.0902201023.hds climat.tmx.0902201023.hds

2. Original databases. They look OK:

crua6[/cru/cruts/version_3_0/update_top/db] cmp tmn/tmn.0708071548.hds tmx/tmx.0708071548.hds

3. CLIMAT merging into original databases. Bad, bad, bad.

crua6[/cru/cruts/..CLIMAT/db/db.0902201023] diff int2.tmn.0902201023.hds int2.tmx.0902201023.hds |wc -l

Something is very poorly. It's my programming skills, isn't it.
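For reference, what getheads extracts can be roughly sketched in Python. The one assumption, taken from the database lines quoted in this log, is that station header lines begin with a 7-digit station code, whereas data lines begin with a 4-digit year:

```python
def get_heads(lines):
    """Rough sketch of a getheads-style extraction: pull the station
    header lines out of a .dtb so the metadata alone can be compared
    with cmp/diff.  Header lines are assumed to start with a 7-digit
    station code; data lines start with a 4-digit year."""
    heads = []
    for line in lines:
        fields = line.split()
        if fields and len(fields[0]) == 7 and fields[0].isdigit():
            heads.append(line)
    return heads
```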

Looking at the log files for the CLIMAT merging, they give identical stats! What differs is
the dates, i.e.:

crua6[/cru/cruts/version_3_0/update_top/logs/logs.0902201023] diff merg.climat.tmn.0902201023.log merg.climat.tmx.0902201023.log |more
< Master file: db/tmn/tmn.0708071548.dtb
< Update file: updates/CLIMAT/db/db.0902201023/climat.tmn.0902201023.dtb
> Master file: db/tmx/tmx.0708071548.dtb
> Update file: updates/CLIMAT/db/db.0902201023/climat.tmx.0902201023.dtb
< code match with: 1033800 5247 970 55 HANNOVER DL GM 1927 2006 -999 0
> code match with: 1033800 5247 970 55 HANNOVER DL GM 1930 2006 -999 0
< code match with: 1038400 5247 1340 49 BERLIN-TEMPELHOF GERMANY 1991 2006 -999 0
> code match with: 1038400 5247 1340 49 BERLIN-TEMPELHOF GERMANY 1929 2006 -999 0

..and so on. What's got me stumped is that the headers of both pairs of input databases
are IDENTICAL. These dates are spurious! Look:

crua6[/cru/cruts/version_3_0/update_top/db] grep '55 HANNOVER' tmn/tmn.0708071548.dtb
1033800 5247 970 55 HANNOVER DL GM 1927 2006 -999 0
crua6[/cru/cruts/version_3_0/update_top/db] grep '55 HANNOVER' tmx/tmx.0708071548.dtb
1033800 5247 970 55 HANNOVER DL GM 1927 2006 -999 0
crua6[/cru/cruts/version_3_0/update_top/db] grep '49 BERLIN-TEMPELHOF' tmn/tmn.0708071548.dtb
1038400 5247 1340 49 BERLIN-TEMPELHOF GERMANY 1929 2006 -999 0
crua6[/cru/cruts/version_3_0/update_top/db] grep '49 BERLIN-TEMPELHOF' tmx/tmx.0708071548.dtb
1038400 5247 1340 49 BERLIN-TEMPELHOF GERMANY 1929 2006 -999 0

You see? The HANNOVER 1930 date, and the BERLIN-TEMPELHOF 1991 date, are wrong!! Christ.
That's not even consistent: one is supposedly in the tmin file, the other in the tmax one.

So, an apparently-random pollution of the start dates. And.. FOUND IT! As usual, the program is
doing exactly what I asked it to do. When I wrote it I simply didn't consider the possibility
of tmin and tmax needing to sync. So one of the first things it does, when reading in the
existing database, is to truncate station data series where whole years are missing values. And
for HANNOVER, tmax has 1927-1929 missing, but tmin has (some) data in those years. A-ha!

What to do.. I guess the logical thing is not to truncate for tmin and tmax! So I added a
flag to newmergedbauto, passed down to the 'getmos' subroutine, that stops it from replacing
start and end years, and.. it worked!! Hurrah! Or, well.. it ran without giving any errors or
crashing horribly. Yes, that's it. And here are all the 142 files (and directories) it created:

crua6[/cru/cruts/version_3_0/update_top] find . -name '*0902201545*'
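The truncation behaviour, and the new flag that disables it, can be sketched like so. The `{year: [12 monthly values]}` map and the function name are hypothetical; the -999 missing-value code and the HANNOVER example (tmax missing 1927-1929) come from above:

```python
def year_span(years, data, truncate=True):
    """Sketch of the getmos start/end-year adjustment.  With
    truncate=True, leading and trailing years whose twelve monthly
    values are all missing (-999) are dropped -- which is what skewed
    the HANNOVER start year.  The new flag corresponds to
    truncate=False, leaving the span untouched so tmin and tmax
    stay in step."""
    first, last = min(years), max(years)
    if not truncate:
        return first, last
    def all_missing(y):
        return all(v == -999 for v in data.get(y, [-999] * 12))
    while first < last and all_missing(first):
        first += 1
    while last > first and all_missing(last):
        last -= 1
    return first, last
```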

So, this leaves the new databases in the db/xxx/ directories, and db/latest.versions.dat telling
us which ones they are. Which should be all the next suite of programs needs to create the final
output files. Eeeeeeeek.
