Opinions
In software, calling a tool opinionated means it has decided how things should be done and made those choices the path of least resistance. The opposite — an unopinionated tool — gives you a blank canvas, every decision up to you, and the freedom to spend your first week debating folder names.
Opinionated tools trade flexibility for momentum. You give up some "any way you want" in exchange for "the way that usually works." If the defaults fit your work, you move fast. If they don't, you can usually override them — but the defaults are deliberate.
Cookiecutter-Spatial-Data-Science is opinionated. Its structure reflects choices about how to do collaborative spatial data science work. Those opinions started as the Cookiecutter-Data-Science opinions from DrivenData Labs, then bent and grew to fit the realities of working with ArcGIS Pro, file geodatabases, and Esri's Python ecosystem, realities I have learned over more than 20 years of working at Esri in various roles.
Take what helps. Skip what doesn't. Just do it consistently.
Spatial analysis is a directed acyclic graph
The most important properties of a quality spatial analysis are correctness and reproducibility: anyone should be able to re-run your work using only your code and your raw data and land on the same results, whether maps, tables, or metrics.
The best way to ensure correctness is to test your code. The best way to ensure reproducibility is to treat your analysis as a directed acyclic graph (DAG): each step is a node, arrows go one way, no loops. Run the graph forwards to recreate any output, or trace backwards from an output to see exactly what code and data produced it.
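To make the graph idea concrete, here is a minimal sketch using Python's standard-library graphlib module. The step names are hypothetical and not part of the template; the point is only that a dependency mapping can be run forwards (a valid execution order) and traced backwards (everything an output depends on).

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each key is a step, each value is the set of
# steps it depends on. None of these names come from the template.
PIPELINE = {
    "load_raw": set(),
    "reproject": {"load_raw"},
    "clean": {"reproject"},
    "summarize": {"clean"},
    "make_map": {"clean"},
}


def run_order(graph):
    """Return the steps in an order that respects every dependency."""
    # static_order() raises graphlib.CycleError if the graph has a loop,
    # which is exactly the situation a DAG rules out.
    return list(TopologicalSorter(graph).static_order())


def upstream(graph, target):
    """Trace backwards from one output to every step it depends on."""
    seen = set()
    stack = [target]
    while stack:
        for dependency in graph[stack.pop()]:
            if dependency not in seen:
                seen.add(dependency)
                stack.append(dependency)
    return seen
```

Calling upstream(PIPELINE, "make_map") answers the question "what produced this map?" without running anything.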
This is doubly important in spatial work, where it is far too easy to do the right analysis in the wrong projection, or to reproject something twice without realising it.
Raw data is immutable
Treating your analysis as a DAG means raw data is read-only. Read it, copy it,
transform it into new outputs — but never edit it in place. The default data/ layout
makes this easy:
| Folder | Purpose |
|---|---|
| data/raw/ | Your original, immutable inputs (often raw.gdb) |
| data/external/ | Reference data from third parties, also immutable |
| data/interim/ | Cached intermediate outputs (interim.gdb) |
| data/processed/ | Final analytical products (processed.gdb) |
Some dos and don'ts follow:
- ✅ Move the raw data through a documented pipeline, ideally invoked from scripts/make_data.py.
- ✅ Cache long-running intermediate steps to interim/ so you don't have to recompute the universe to debug the last 5%.
- ✅ Make it possible (and ideally automated in scripts/make_data.py) for anyone to reproduce your final products from data/raw/ plus the code in src/<package>/.
- ⛔ Don't edit raw data, especially not in an ArcGIS Pro edit session, especially not manually, and especially not in Excel.
- ⛔ Don't overwrite a feature class in raw.gdb with a "fixed" version. Fix it in the pipeline.
- ⛔ Don't keep multiple versions of the same raw dataset side-by-side hoping you'll remember which is which. You won't.
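The caching advice above can be sketched as a make-style staleness check: rerun a step only when its output is missing or older than its inputs. This is an illustration, not code from the template. The helper names are hypothetical, and comparing modification times suits file-based outputs such as GeoParquet or GeoTIFF; feature classes inside a file geodatabase would need a different existence check.

```python
from pathlib import Path


def is_stale(output, inputs):
    """Return True if output is missing or older than any of its inputs."""
    output = Path(output)
    if not output.exists():
        return True
    output_mtime = output.stat().st_mtime
    # A missing input raises FileNotFoundError here, which is the desired
    # behavior: the step cannot be reproduced without it.
    return any(Path(item).stat().st_mtime > output_mtime for item in inputs)


def run_cached(step, output, inputs):
    """Call step(output, inputs) only when the cached output is stale.

    Returns True if the step ran and False if the cache was reused.
    """
    if is_stale(output, inputs):
        step(output, inputs)
        return True
    return False
```

A script such as scripts/make_data.py could wrap each long-running step in run_cached so that debugging the last stage does not recompute the earlier ones.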
ArcGIS-specific traps
A file geodatabase is opaque from outside ArcGIS, which makes silent in-place edits
tempting. Resist. If a feature class needs cleanup, do it as a documented step in
the pipeline that reads from raw.gdb and writes to interim.gdb — not as a
one-off edit you'll never remember making.
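One way to make the read-from-raw, write-to-interim rule hard to break is to check every output path before writing. The sketch below is an assumption about how such a guard might look, not something the template ships; the function names are hypothetical, and it only inspects paths, so it works whether or not ArcPy is available.

```python
from pathlib import Path

# Folders the text treats as immutable, relative to the project root.
PROTECTED_DIRS = (Path("data/raw"), Path("data/external"))


def is_protected(output, project_root="."):
    """Return True if output resolves to a location inside an immutable folder."""
    root = Path(project_root).resolve()
    # Resolving first means "data/raw/../interim" is judged by where it
    # really points, not by how it is spelled.
    resolved = (root / output).resolve()
    return any(resolved.is_relative_to(root / folder) for folder in PROTECTED_DIRS)


def checked_output(output, project_root="."):
    """Return output unchanged, or raise before anything is written to raw data."""
    if is_protected(output, project_root):
        raise ValueError(f"refusing to write into immutable data: {output}")
    return output
```

A pipeline step would then pass checked_output("data/interim/interim.gdb/parcels_clean") as its target, and a typo pointing at raw.gdb fails loudly instead of silently overwriting the source.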
Data should (mostly) stay out of source control
Because raw data is immutable, it doesn't need source control the way code does. The
generated .gitignore excludes the entire data/ directory by default.
This matters more in spatial work than in plain-tabular data science, because file geodatabases, lidar tiles, mosaic datasets, and large GeoTIFFs blow past Git's practical limits very quickly. GitHub warns at 50 MB and rejects files over 100 MB, and a single statewide parcel layer can pass that without trying.
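A quick pre-commit check can catch oversized files before Git does. The sketch below is hypothetical and not part of the template. It assumes the thresholds are binary megabytes (50 and 100 MiB), and it skips the .git and env folders by default, which is also an assumption about the project layout.

```python
from pathlib import Path

WARN_BYTES = 50 * 1024 * 1024     # size above which GitHub warns (assumed MiB)
REJECT_BYTES = 100 * 1024 * 1024  # size above which GitHub rejects (assumed MiB)


def classify_size(size_bytes):
    """Return 'reject', 'warn', or 'ok' for a file of the given size."""
    if size_bytes > REJECT_BYTES:
        return "reject"
    if size_bytes > WARN_BYTES:
        return "warn"
    return "ok"


def find_large_files(root, skip_dirs=(".git", "env")):
    """Yield (path, size, status) for every file over the warning threshold."""
    root = Path(root)
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        # Ignore anything inside a skipped folder at any depth.
        if any(part in skip_dirs for part in path.relative_to(root).parts):
            continue
        size = path.stat().st_size
        status = classify_size(size)
        if status != "ok":
            yield path, size, status
```

Note that a file geodatabase is a folder of many files, so an individual file may pass this check even when the .gdb as a whole is far too large to commit sensibly.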
Options for sharing larger spatial data:
- ArcGIS Online / Enterprise: publish the data as a hosted feature layer and pull it in your scripts via arcgis.GIS and a profile from secrets.yml.
- Cloud object storage: Amazon S3, Azure Blob Storage, and Google Cloud Storage can all host GeoParquet, Cloud-Optimized GeoTIFF, and FlatGeobuf files for cloud-native spatial workflows. cloudpathlib and fsspec make these feel like local paths.
- Git LFS for moderately large files you genuinely need versioned alongside the code. (This is a bit of an anti-pattern for raw data, but it can be useful for reference datasets that are small enough to be versioned but large enough to break Git's limits.)
- Small reference datasets that rarely change (e.g. a study-area boundary in GeoJSON) are fine to commit directly.
Tools for orchestrating DAGs
For most projects, make (with the bundled Makefile / make.cmd) is enough — it's
simple, declarative, and already wired up for make env, make data, make docs,
and friends.
If your pipeline grows beyond what make handles gracefully, reach for a Python
DAG orchestrator: Snakemake, Prefect, Dagster (a department favorite), Airflow, or
Luigi. Use what fits. The template doesn't ship one because most spatial projects
don't need it, and the wrong choice early adds drag for years.
Notebooks are for exploration; source files are for repetition
Jupyter notebooks are unbeatable for exploratory spatial analysis — quick reprojections,
inline maps, iterating on a buffer distance until something looks right. They are also
poor vehicles for reproducible analysis, because diffs of .ipynb JSON are
near-unreadable, merging is brutal, and they encourage copy-paste.
The rule of thumb: once a notebook cell does something you'll do again, move it into
src/<package>/.
Classic signs you've crossed that threshold:
- You're duplicating cells from an old notebook to start a new one.
- You're copy-pasting the same function between notebooks.
- You're defining classes inside a notebook.
Generated projects install the package in editable mode, so you can refactor a function
into src/<package>/analysis.py and immediately import it from a notebook.
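As a sketch of what that refactor might look like, the function below is the kind of helper that tends to start life in a notebook cell. Its name and behavior are hypothetical. The trailing comments show how a notebook could import it; the autoreload magics are standard IPython and pick up edits to the source file without a kernel restart.

```python
# Hypothetical contents of src/<package>/analysis.py


def candidate_buffer_distances(start_m, stop_m, step_m):
    """Return the buffer distances (in meters) to try, from start_m up to stop_m."""
    if step_m <= 0:
        raise ValueError("step_m must be positive")
    # Number of steps that fit between start and stop, counting the start itself.
    count = int((stop_m - start_m) // step_m) + 1
    return [start_m + i * step_m for i in range(max(count, 0))]


# In a notebook cell, with the package installed in editable mode:
#
#   %load_ext autoreload
#   %autoreload 2
#   from <package>.analysis import candidate_buffer_distances
#   candidate_buffer_distances(100, 500, 100)
```

Once the function lives in the package it can also be covered by tests, which a notebook cell cannot easily be.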
A loose convention that helps when notebooks accumulate: use a numeric prefix, your
initials, and a short description — 0.3-jck-explore-buffer-distances.ipynb. It costs
nothing and makes intent obvious six months later.
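If the convention is worth enforcing, a small check can flag notebooks that drift from it. The pattern below is one reading of the example filename above, not a rule the template defines: lowercase initials of two to four letters and a lowercase hyphenated description are both assumptions.

```python
import re

# One interpretation of "<numeric prefix>-<initials>-<description>.ipynb".
NOTEBOOK_NAME = re.compile(
    r"^(?P<order>\d+(?:\.\d+)*)"                   # numeric prefix, e.g. 0.3
    r"-(?P<initials>[a-z]{2,4})"                   # author initials, e.g. jck
    r"-(?P<description>[a-z0-9]+(?:-[a-z0-9]+)*)"  # hyphenated description
    r"\.ipynb$"
)


def parse_notebook_name(filename):
    """Return the parts of a conforming notebook name, or None if it does not match."""
    match = NOTEBOOK_NAME.match(filename)
    return match.groupdict() if match else None
```

Sorting notebooks by the parsed order field, rather than by raw filename, keeps 0.10 after 0.9 if a project ever gets that far.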
For notebooks that become part of the documentation (worked examples, methodology
write-ups), move them to docsrc/mkdocs/notebooks/ so MkDocs picks them up.
Keep modeling organised
Modeling pipelines vary too much to over-prescribe, so the models/ directory is
deliberately light. What it does include is models/emd/, because in the ArcGIS
world deep-learning artifacts are bundled as
Esri Model Definition (.emd) files
that pair the model weights with the metadata ArcGIS Pro needs to apply them.
Whatever you train, document enough about each experiment to reproduce it later — at minimum:
- the data version that went in (path, date, or hash),
- the code version that produced it (Git commit),
- the parameters and hyperparameters,
- the metrics that came out.
For small projects, a JSON or YAML file per run is plenty. For anything bigger, or for team work, graduate to MLflow or whatever tooling your team has standardised on.
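A per-run JSON record covering the four items above might look like the following sketch. The function names and record fields are hypothetical, not part of the template. It hashes a single input file, so a file geodatabase (which is a folder) would need a path or date instead, and it records no commit when Git is unavailable.

```python
import hashlib
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large inputs do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def current_commit():
    """Return the current Git commit hash, or None outside a Git repository."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def build_run_record(data_path, params, metrics):
    """Collect data version, code version, parameters, and metrics for one run."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "data_path": str(data_path),
        "data_sha256": file_sha256(data_path),
        "git_commit": current_commit(),
        "params": params,
        "metrics": metrics,
    }


def write_run_record(record, out_file):
    """Write the record as indented JSON and return the path written."""
    out_file = Path(out_file)
    out_file.parent.mkdir(parents=True, exist_ok=True)
    out_file.write_text(json.dumps(record, indent=2))
    return out_file
```

Keeping one such file per run next to the model artifact is usually enough to answer "which data and which commit produced this?" months later.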
Build from the environment up
Reproducing an analysis starts with reproducing the environment that ran it — same tools, same libraries, same versions. In Python this means picking and configuring an environment manager.
This template uses Conda, for two non-negotiable reasons in the spatial world:
- ArcPy is shipped through Conda and cannot be installed via pip.
- Spatial dependencies (GDAL, GEOS, PROJ) are notoriously painful to compile from source. Conda packages them for you.
The recommended Conda installer is Miniforge when you're not using ArcGIS Pro, and the bundled ArcGIS Pro Python distribution when you are.
make env builds a project-local environment in ./env, isolated from your other
projects. If you need stronger reproducibility (pinning the full transitive dependency
tree), add conda-lock on top.
Keep secrets and configuration out of version control
You really don't want your ArcGIS Online token, your Postgres password, or your client's API key on GitHub. Treat anything committed as you would any leaked credential: assume someone will find it.
The template's pattern is two YAML files in config/:
| File | Purpose | Committed? |
|---|---|---|
| config.yml | Non-sensitive settings (paths, log levels, environment overrides) | Yes |
| secrets_template.yml | Template showing the shape of secrets, with placeholder values | Yes |
| secrets.yml | Your real credentials, copied from the template and filled in | No (gitignored) |
In code, both are loaded via singletons that support dot- and dict-style access:
from <package>.config import config
gis_url = config.esri.gis_url # secrets merged into config — no separate namespace
log_level = config.logging.level
The active environment is controlled by the PROJECT_ENV environment variable (dev,
test, prod) so you can switch without editing code. See the Configuration docs in
the generated project for the full contract.
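To illustrate the behavior described above, here is a minimal sketch of dot- and dict-style access combined with a PROJECT_ENV override. It is not the template's implementation. The "default" and "environments" keys are an assumed layout, and plain dictionaries stand in for the parsed YAML files.

```python
import os


class DotDict(dict):
    """Dictionary whose keys can also be read as attributes."""

    def __getattr__(self, name):
        try:
            value = self[name]
        except KeyError:
            raise AttributeError(name) from None
        # Wrap nested dictionaries so chained access such as a.b.c works.
        return DotDict(value) if isinstance(value, dict) else value


def deep_merge(base, override):
    """Return a new dict with override applied recursively on top of base."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve_config(settings, secrets, env=None):
    """Merge defaults, the active environment's overrides, and secrets.

    settings uses hypothetical "default" and "environments" keys; env falls
    back to the PROJECT_ENV environment variable, then to "dev".
    """
    env = env or os.environ.get("PROJECT_ENV", "dev")
    overrides = settings.get("environments", {}).get(env, {})
    merged = deep_merge(settings.get("default", {}), overrides)
    return DotDict(deep_merge(merged, secrets))
```

With this shape, config.logging.level and config["logging"]["level"] read the same value, and switching PROJECT_ENV changes which overrides apply without touching code.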
For ArcGIS Online / Enterprise specifically, prefer
arcgis.gis.GIS(profile=...)
with a profile name stored in secrets.yml over embedding usernames and passwords
anywhere in code.
Encourage adaptation from a consistent default
The template is meant to be a starting point, not a straitjacket. The right call is to be liberal in adapting the structure for your project, but conservative in changing the default for everyone.
Common adaptations:
- Simplify: small projects don't need every folder. If models/ and references/ stay empty for the life of a project, delete them.
- Expand: long-running projects often grow sub-packages under src/<package>/ for related groups of code (e.g. analysis/, viz/, io/), or sub-folders under notebooks/ separating exploratory work from polished reports.
- Re-organise: when notebooks/ gets crowded on a long project, a top-level research/ folder with sub-folders per experiment (each with its own notebooks and optionally its own Makefile) keeps things navigable.
If you find yourself making the same change on every project, that's a signal worth sharing — open an issue on the template repo.
In Short
The structure exists so you can stop thinking about structure and start thinking about the analysis. Treat the data as immutable, treat the analysis as a graph, and keep the parts that change (notebooks, exploration) separate from the parts that need to be trustworthy (source code, tests, documentation). Adapt freely. Just be consistent.