rainfallwrangler

Wrangles rainfall radar and water depth data into something sensible.

This Node.js-based tool wrangles rainfall, heightmap, and water depth data into a form that the image semantic segmentation model at the centre of this repository can understand.

The reason this tool exists is efficiency: for the model to train in a reasonable length of time, nothing less than a set of .tfrecord files that can be read in parallel is sufficient.

System requirements

  • Linux (Windows may work but is untested. You will probably have a bad day if you use Windows)
  • Node.js v16+
  • Python 3.8+ (for encoding .tfrecord files, as all existing npm packages for doing this suck)
  • Experience with the terminal
  • Lots of time and patience

Getting started

This tool, unlike nimrod-data-downloader and terrain50-cli, is not published to npm. This is because of the rather niche use-case this tool has.

To get started, first clone this git repository:

git clone git@github.com:sbrl/research-rainfallradar.git;
cd research-rainfallradar/rainfallwrangler;

Then, install dependencies:

npm install
pip3 install --user -r requirements.txt

The entrypoint for the tool is at src/index.mjs. Call it like so:

src/index.mjs --help

It has 4 subcommands:

  • recordify: Converts a .asc heightmap, a concatenated .asc water depths file (output from HAIL-CAESAR), and a nimrod-data-downloader rainfall radar directory into an intermediate .jsonl.gz dataset. Defaults to putting 4096 samples per file.
  • uniq: Deduplicates samples across an entire .jsonl.gz dataset. Basically hashes all samples with SHA256, marks duplicate hashes for deletion, and then goes through every file in the dataset to remove the samples slated for deletion.
  • recompress: Recompresses a .jsonl.gz dataset so that each file holds a fixed number of samples (4096 by default). Needed after uniq, since uniq can leave different numbers of records in each file.
  • jsonl2tfrecord: Converts the aforementioned .jsonl.gz dataset into a .tfrecord dataset that the DeepLabV3+ model can understand.
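
The two-pass idea behind uniq can be sketched in a few lines of Node.js. This is only an illustration under stated assumptions, not the tool's actual code: the dataset is a hypothetical in-memory array of files rather than real .jsonl.gz files, the sample fields (rainfall, waterdepth) are invented, and the function names hashSample, findDuplicates, and removeDuplicates are made up for this example.

```javascript
const { createHash } = require("node:crypto");

// Hypothetical in-memory stand-in for a dataset: one array of samples per file.
// The real tool works on .jsonl.gz files; the field names here are invented.
const files = [
    [{ rainfall: [0, 1], waterdepth: [0] }, { rainfall: [2, 3], waterdepth: [1] }],
    [{ rainfall: [0, 1], waterdepth: [0] }, { rainfall: [4, 5], waterdepth: [2] }],
];

// Hash the serialised form of a sample with SHA-256.
function hashSample(sample) {
    return createHash("sha256").update(JSON.stringify(sample)).digest("hex");
}

// Pass 1: hash every sample and mark repeats of an already-seen hash for deletion.
function findDuplicates(files) {
    const seen = new Set();
    const toDelete = new Set(); // entries look like "fileIndex:sampleIndex"
    files.forEach((samples, fileIndex) => {
        samples.forEach((sample, sampleIndex) => {
            const hash = hashSample(sample);
            if (seen.has(hash)) toDelete.add(`${fileIndex}:${sampleIndex}`);
            else seen.add(hash);
        });
    });
    return toDelete;
}

// Pass 2: go through every file again and drop the marked samples.
function removeDuplicates(files, toDelete) {
    return files.map((samples, fileIndex) =>
        samples.filter((_, sampleIndex) => !toDelete.has(`${fileIndex}:${sampleIndex}`))
    );
}

const deduped = removeDuplicates(files, findDuplicates(files));
// Print how many samples remain in each file.
console.log(deduped.map(samples => samples.length));
```

Note that the second file ends up with fewer samples than the first, which is exactly the unevenness that recompress is there to fix.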

All of these subcommands, where possible, operate in parallel. The general workflow is:

  1. recordify
  2. uniq
  3. recompress
  4. jsonl2tfrecord
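
To illustrate why recompress comes after uniq in this order, here is a rough Node.js sketch of regrouping unevenly sized files into fixed-size ones. It is an assumption-laden illustration rather than the tool's implementation: the rebatch function is hypothetical, samples are plain strings held in memory, and letting the final file hold the remainder is a choice made for this sketch only.

```javascript
// Regroup samples so that every file holds `perFile` samples.
// Assumption for this sketch: the final file simply holds whatever is left over.
function rebatch(files, perFile) {
    const all = files.flat(); // fine for a toy example; a real dataset would be streamed
    const result = [];
    for (let i = 0; i < all.length; i += perFile) {
        result.push(all.slice(i, i + perFile));
    }
    return result;
}

// Hypothetical file contents after deduplication: uneven record counts.
const afterUniq = [["a", "b", "c"], ["d"], ["e", "f", "g", "h"]];

// 3 samples per file here to keep the example small; the tool defaults to 4096.
const evened = rebatch(afterUniq, 3);
console.log(evened.map(samples => samples.length));
```

Sample order is preserved, so only the file boundaries change.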

Full help for each command is available if you call --help:

src/index.mjs --help # Show general help for everything
src/index.mjs recordify --help # Show specific help for the recordify subcommand

Contributing

Contributions are very welcome - both issues and pull requests! Please mention in any pull requests that you release your work under the AGPL-3 (see below).

Licence

Same as that of the main repository. All the code in this repository is released under the GNU Affero General Public License 3.0 unless otherwise specified. The full license text is included in the LICENSE.md file in this repository. GNU have a great summary of the licence which I strongly recommend reading before using this software.