Analyzing the Belgian Crossroadsbank for Enterprises open data with dbt and DuckDB part 1 — loading the data

Posted on Sat 12 October 2024 in Data engineering

While looking for interesting datasets to play around with in DuckDB, I came across the ‘open’ data of the Belgian Crossroadsbank for Enterprises (abbreviated in Dutch to KBO). The dataset is free to download, but you need to register beforehand.

Every month you can download either the full dataset or an incremental dataset. The full dataset is a 240MB zip file containing 9 CSV files.

My goal is to use dbt to load this data into DuckDB (dbt is the E and L in ELT, right?), do some interesting analysis on top of it, and maybe expose the data as a Streamlit application.

All code in this article can be found on GitHub. The data can be downloaded from the official source.

Exploring the data

Let’s quickly explore the downloaded zip file. Using unzip, we can list the files inside:

> unzip -l KboOpenData_0128_2024_10_Full.zip  
Archive:  KboOpenData_0128_2024_10_Full.zip  
  Length      Date    Time    Name  
---------  ---------- -----   ----  
 88155847  2024-10-06 08:51   enterprise.csv  
 68820875  2024-10-06 08:51   establishment.csv  
150209614  2024-10-06 08:51   denomination.csv  
292002916  2024-10-06 08:52   address.csv  
 31857265  2024-10-06 08:52   contact.csv  
912198724  2024-10-06 08:52   activity.csv  
   308904  2024-10-06 08:52   branch.csv  
      149  2024-10-06 08:52   meta.csv  
  1356938  2024-10-06 08:52   code.csv  
---------                     -------  
1544911232                     9 files

These CSV files are quite well documented in the cookbook supplied by the FPS Economy (FOD Economie), although in a PDF and not in a structured format.

We can also check the number of records in each file:

> unzip KboOpenData_0128_2024_10_Full.zip  
> ls *.csv | xargs -I X wc -l X | sed 's|\s|\t\t|g'  
20488460        activity.csv  
2774975         address.csv  
7355            branch.csv  
14949           code.csv  
652301          contact.csv  
3236485         denomination.csv  
1904225         enterprise.csv  
1638592         establishment.csv  
6               meta.csv
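
Note that wc -l also counts the header row of each file. If you have the DuckDB CLI available, you could cross-check these numbers straight from the extracted CSVs; a minimal sketch (read_csv skips the header, so expect counts one lower than above):

-- run from the folder containing the extracted CSV files
select count(*) as record_count from read_csv('activity.csv');
select count(*) as record_count from read_csv('enterprise.csv');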

Getting started

Setting up the dbt project

First, we set up our working directory. We create a Python virtual environment in it, set up our requirements.txt file, and install our Python dependencies:

mkdir be_kbo_analysis  
cd be_kbo_analysis  
python3 -m venv .venv  
echo ".venv" >> .gitignore  
echo "dbt-core" >> requirements.txt  
echo "dbt-duckdb" >> requirements.txt  
# Activate the venv  
source .venv/bin/activate  
pip install -r requirements.txt

Use dbt init kbo to create a dbt project called kbo and select duckdb as the database. The default values in the newly created profiles.yml are fine (DuckDB will write to a database file named dev.duckdb).

Let’s start by editing our dbt_project.yml file.

cd kbo  
mkdir models/001_staging  
code dbt_project.yml

We add a variable kbo_files_root_path, pointing to our unzipped folder, in my case the folder be_kbo_analysis in my home directory. I also make a 001_staging folder inside models that will contain the definitions of the models directly imported from the CSV files. To make things more performant, I will materialize the models inside that folder as tables.

I prefer numbering my folders to keep some structure; the numbers don’t affect the dbt project at all. For big dbt projects, try grouping models based on common code owners, models that run at the same time, or models that belong in the same target schema. The organisation of big dbt projects deserves a separate post.

dbt_project.yml now looks like this:

name: 'kbo'  
version: '1.0.0'  

profile: 'kbo'  

model-paths: ["models"]
analysis-paths: ["analyses"]  
test-paths: ["tests"]  
seed-paths: ["seeds"]  
macro-paths: ["macros"]  
snapshot-paths: ["snapshots"]  

clean-targets:   
  - "target"  
  - "dbt_packages"  

vars:  
  kbo_files_root_path: "{{ env_var('HOME') }}/be_kbo_analysis"  

models:  
  kbo:  
    001_staging:  
      +materialized: table

Loading the csv files in DuckDB

As dbt ‘only’ sends rendered SQL to the database engine, and DuckDB is very capable at reading CSV files, we should be able to just create a model per CSV file and select * from it.

Let’s first scaffold the sql files:

ls ../*.csv | xargs -I X basename X .csv | xargs -I X touch models/001_staging/X.sql

For every CSV file, this creates an empty model file, which we then fill in:

-- models/001_staging/activity.sql  
select * from read_csv('{{ var("kbo_files_root_path") }}/activity.csv')  
-- models/001_staging/enterprise.sql  
select * from read_csv('{{ var("kbo_files_root_path") }}/enterprise.csv')  
-- etc etc
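
DuckDB’s CSV sniffer figures out the delimiter, header and column types on its own, which is why a bare read_csv call is enough here. If the sniffer ever guesses a type wrong, you could pass options explicitly; a sketch of what one model could then look like (the option values are illustrative, not something this dataset required):

-- models/001_staging/activity.sql, variant with explicit CSV options
select *
from read_csv(
    '{{ var("kbo_files_root_path") }}/activity.csv',
    header = true,
    delim = ',',
    quote = '"',
    sample_size = -1  -- sniff column types from the whole file instead of a sample
)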

Now let’s try running dbt! Make sure you are in the dbt project folder (in our case, kbo), and execute dbt run.

 dbt run  
13:36:46  Running with dbt=1.8.7  
13:36:47  Registered adapter: duckdb=1.9.0  
13:36:48  Found 9 models, 416 macros  
13:36:48    
13:36:48  Concurrency: 1 threads (target='dev')  
13:36:48    
13:36:48  1 of 9 START sql table model main.activity ..................................... [RUN]  
13:38:03  1 of 9 OK created sql table model main.activity ................................ [OK in 74.73s]  
13:38:03  2 of 9 START sql table model main.address ...................................... [RUN]  
13:38:31  2 of 9 OK created sql table model main.address ................................. [OK in 28.38s]  
13:38:32  3 of 9 START sql table model main.branch ....................................... [RUN]  
13:38:32  3 of 9 OK created sql table model main.branch .................................. [OK in 0.65s]  
13:38:32  4 of 9 START sql table model main.code ......................................... [RUN]  
13:38:33  4 of 9 OK created sql table model main.code .................................... [OK in 0.74s]  
13:38:33  5 of 9 START sql table model main.contact ...................................... [RUN]  
13:38:36  5 of 9 OK created sql table model main.contact ................................. [OK in 3.55s]  
13:38:37  6 of 9 START sql table model main.denomination ................................. [RUN]  
13:38:50  6 of 9 OK created sql table model main.denomination ............................ [OK in 13.88s]  
13:38:50  7 of 9 START sql table model main.enterprise ................................... [RUN]  
13:38:58  7 of 9 OK created sql table model main.enterprise .............................. [OK in 7.56s]  
13:38:58  8 of 9 START sql table model main.establishment ................................ [RUN]  
13:39:01  8 of 9 OK created sql table model main.establishment ........................... [OK in 2.79s]  
13:39:01  9 of 9 START sql table model main.meta ......................................... [RUN]  
13:39:02  9 of 9 OK created sql table model main.meta .................................... [OK in 0.70s]  
13:39:02    
13:39:02  Finished running 9 table models in 0 hours 2 minutes and 13.94 seconds (133.94s).  
13:39:02    
13:39:02  Completed successfully  
13:39:02    
13:39:02  Done. PASS=9 WARN=0 ERROR=0 SKIP=0 TOTAL=9

Hooray! We’ve just created 9 tables inside the dev.duckdb database. As you might notice, the PC I’m running this on is quite slow. However, once these tables are loaded, I’m hoping to use the power of DuckDB to make the analysis fast!
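
As a quick sanity check, you could already open the database with the DuckDB CLI (duckdb dev.duckdb) and poke at the freshly built tables, for example by splitting the activity rows per NACE version:

-- how are the activity rows spread over the two NACE versions?
select NaceVersion, count(*) as record_count
from activity
group by NaceVersion
order by NaceVersion;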

A fun tidbit is to check the file size of the DuckDB database, which is bigger than the original zip file (371MB vs 240MB):

 ls -lah dev.duckdb  
-rw-r--r-- 1 wout wout 371M Oct 12 15:39 dev.duckdb
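
You can also ask DuckDB itself where the space goes. Assuming a reasonably recent DuckDB version, something like this (run inside duckdb dev.duckdb) shows the overall database size and an estimated row count per table:

-- overall size of the database file
PRAGMA database_size;
-- estimated row count and column count per table
select table_name, estimated_size as estimated_row_count, column_count
from duckdb_tables()
order by estimated_size desc;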

Documenting our staging tables

Each of our 9 models should be nicely documented and tested. Because the data now exists in our DuckDB database, we can quickly scaffold the YAML files for our models using dbt-osmosis.

First, add a config to our dbt_project.yml:

models:  
  kbo:  
    +dbt-osmosis: "_{model}.yml"

Second, install the tool and let it do its magic:

pip install dbt-osmosis  
dbt-osmosis yaml refactor

This generates one yml file per model, which we can use to describe the model. We can, for example, add column descriptions (taken from the cookbook).

We can also add some tests!

According to the cookbook, the NaceVersion is always either 2003 or 2008. The same source tells us that the field NaceCode should be found in the code table. We can add these validations as tests.

An example in models/001_staging/_activity.yml:

version: 2  
models:  
  - name: activity  
    description: "Bevat 1 lijn per activiteit van een entiteit of vestigingseenheid. Een entiteit of vestigingseenheid kan meerdere activiteiten uitoefenen"  
    columns:  
      - name: EntityNumber  
        description: 'The establishment unit number or enterprise number.'  
        data_type: VARCHAR  
      - name: ActivityGroup  
        description: 'Type of activity. See the code table.'  
        data_type: VARCHAR  
      - name: NaceVersion  
        description: 'Indicates whether the activity is coded in Nace version 2003 or Nace version 2008.'  
        data_type: BIGINT  
        tests:  
          - accepted_values:  
              values: ['2003', '2008']  
      - name: NaceCode  
        description: 'The activity of the entity or establishment unit, coded as a Nace code (in the indicated version). See the code table (Nace2003, Nace2008).'  
        data_type: VARCHAR  
        tests:  
          - relationships:  
              to: ref('code')  
              field: Code  
      - name: Classification  
        description: 'Indicates whether this is a main, secondary or auxiliary activity. See the code table.'  
        data_type: VARCHAR

To run these 2 tests, we can execute dbt test:

 dbt test  
14:18:36  Running with dbt=1.8.7  
14:18:37  Registered adapter: duckdb=1.9.0  
14:18:37  Found 9 models, 2 data tests, 416 macros  
14:18:37    
14:18:38  Concurrency: 1 threads (target='dev')  
14:18:38    
14:18:38  1 of 2 START test accepted_values_activity_NaceVersion__2003__2008 ............. [RUN]  
14:18:38  1 of 2 PASS accepted_values_activity_NaceVersion__2003__2008 ................... [PASS in 0.36s]  
14:18:38  2 of 2 START test relationships_activity_NaceCode__Code__ref_code_ ............. [RUN]  
14:18:39  2 of 2 PASS relationships_activity_NaceCode__Code__ref_code_ ................... [PASS in 1.22s]  
14:18:40    
14:18:40  Finished running 2 data tests in 0 hours 0 minutes and 2.17 seconds (2.17s).  
14:18:40    
14:18:40  Completed successfully  
14:18:40    
14:18:40  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
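
Under the hood, every dbt test compiles to a query that selects failing rows (the test passes when zero rows come back); you can inspect the exact SQL under target/compiled. The relationships test above is roughly equivalent to an anti-join like this sketch:

-- rough equivalent of the relationships test on activity.NaceCode
select a.NaceCode
from activity a
left join code c on a.NaceCode = c.Code
where a.NaceCode is not null
  and c.Code is null;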

Looks great! Always run your tests before building on top of your data, so you know that the incoming data is clean and ready to use.

Creating dbt docs

Now that we have the beginning of a dbt project, we can take a look at the dbt documentation.

dbt docs generate  
dbt docs serve  

This opens a local web server at localhost:8080.

A screenshot of the generated dbt docs site, including table and column descriptions and tests.

Voilà, the data is now ready for analysis, either further via dbt or by connecting your favorite BI tool to the DuckDB database.
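
As a small teaser for the analysis part, here is the kind of query you could now run directly against dev.duckdb, using the columns documented above: the ten most common NACE 2008 activity codes.

-- top 10 most common NACE 2008 activity codes
select NaceCode, count(*) as record_count
from activity
where NaceVersion = 2008
group by NaceCode
order by record_count desc
limit 10;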