Ingestion Workflow

This document explains how the ingestion service turns Portuguese legal sources into normalized data for review in the operations console.

For reconstructed article text histories, see `ARTICLE_TIMELINES.md`.

The important rule is that source data and parser output are preserved before database writes. Raw DRE dump tables stay read-only. The ingestion service writes only to the ingestion schema.

System Map

[Diagram: system map]

Storage Boundaries

public.dreapp_document and public.dreapp_documenttext are raw staging tables from the imported DRE dump. Normal ingestion scripts do not mutate them; raw refreshes go through scripts/refresh_dre_public_dump.py.
Prisma owns only the ingestion schema.
Raw table rows are connected with soft references such as dre_document_id, dre_documenttext_id, and dre_content_id.
Network fetchers and parsers should write JSON or CSV artifacts before loading normalized rows.
Database loaders should be idempotent, resumable, and safe to rerun.
DGSI Acórdãos are court-decision sources. Resolved links from decisions to acts/codes/articles live in court_decision_target_links, not in legal_act_relations.
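
The artifact-first and idempotent-loader rules above can be illustrated with a minimal sketch. It uses an in-memory SQLite database and a hypothetical `demo_acts` table keyed on `dre_content_id`; the real loaders target Postgres through the ingestion schema, and the table, column values, and function names here are assumptions for illustration only.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def write_artifact(rows, path):
    """Persist parser output as JSON before any database write."""
    Path(path).write_text(json.dumps(rows, ensure_ascii=False), encoding="utf-8")


def load_artifact(conn, path):
    """Upsert rows from the artifact; rerunning leaves the table unchanged."""
    rows = json.loads(Path(path).read_text(encoding="utf-8"))
    conn.executemany(
        "INSERT INTO demo_acts (dre_content_id, title) "
        "VALUES (:dre_content_id, :title) "
        "ON CONFLICT(dre_content_id) DO UPDATE SET title = excluded.title",
        rows,
    )
    conn.commit()


# Hypothetical table standing in for a normalized ingestion table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_acts (dre_content_id INTEGER PRIMARY KEY, title TEXT)")

with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "acts.json"
    write_artifact([{"dre_content_id": 1, "title": "demo act"}], artifact)
    load_artifact(conn, artifact)
    load_artifact(conn, artifact)  # second run must not duplicate the row

row_count = conn.execute("SELECT COUNT(*) FROM demo_acts").fetchone()[0]
```

Because the load keys on a stable identifier and updates on conflict, an interrupted run can be restarted from the saved artifact without re-fetching the source.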

Main Data Layers

Raw DRE dump

The dump is the broad local source for published acts and original HTML text. It is imported into Postgres and then treated as raw source data. Normal ingestion reads it without mutation; the dedicated refresh script is the only writer.

Initial import into an empty database:

bzcat 2026-05-03-DRE_dump.sql.bz2 \
  | psql postgresql://invera:invera_dev_password@localhost:55432/invera_dre

For an existing database, do not pipe a newer full dump directly into psql, because that can duplicate rows. Use the refresh script:

python3 scripts/refresh_dre_public_dump.py \
  --dump-url https://uploads.tretas.org/YYYY-MM-DD-DRE_dump.sql.bz2 \
  --download

python3 scripts/refresh_dre_public_dump.py \
  --dump-file artifacts/dre/dumps/YYYY-MM-DD-DRE_dump.sql.bz2 \
  --stage \
  --report artifacts/dre/dumps/YYYY-MM-DD-refresh-stage.json

python3 scripts/refresh_dre_public_dump.py \
  --dump-file artifacts/dre/dumps/YYYY-MM-DD-DRE_dump.sql.bz2 \
  --apply \
  --confirm-public-raw-update \
  --report artifacts/dre/dumps/YYYY-MM-DD-refresh-apply.json

The script extracts only the two raw DRE COPY blocks into dre_dump_refresh.* staging tables, reports candidate row counts, and appends only rows with ids above the current local max ids. It does not delete rows or execute the full upstream dump SQL against public.*.
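
The append rule described above (only rows with ids above the current local max) can be sketched as a small filter. This is a simplified illustration over in-memory dictionaries, not the refresh script's code; `select_new_rows` and the row shape are assumptions.

```python
def select_new_rows(staged_rows, local_max_id):
    """Return staged rows whose id is strictly above the local max id."""
    return [row for row in staged_rows if row["id"] > local_max_id]


# Hypothetical staged rows and local ids for illustration.
staged = [{"id": 98}, {"id": 99}, {"id": 100}, {"id": 101}]
local_ids = [97, 98, 99]

new_rows = select_new_rows(staged, max(local_ids, default=0))
new_ids = [row["id"] for row in new_rows]
```

Rows at or below the local max are never touched, which is what keeps the operation append-only.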

Consolidated code ingestion

Consolidated code PDFs provide the current article text, article hierarchy, and the amendment notes embedded in each article.

The operations console entry point is Codes -> New code. The user enters the DRE consolidated page URL and, for current DRE pages, uploads the consolidated PDF fallback. The worker infers code metadata from the PDF plus URL and queues ingest_consolidated_code, which runs parse, validate, load, link, and effect-resolution steps in sequence.

Scripts:

python3 scripts/ingest_consolidated_code.py \
  --source-url <dre-consolidated-url> \
  --pdf-path <uploaded-or-local-pdf>

python3 scripts/parse_law_pdf.py <code.pdf> --sigla CC --out artifacts/codigos/cc_articles.json
python3 scripts/validate_artifact.py artifacts/codigos/cc_articles.json --kind parsed_law_pdf
python3 scripts/load_parsed_law_pdf.py artifacts/codigos/cc_articles.json ...

Writes:

legal_codes
legal_articles
legal_article_versions
legal_article_changes
parser/run/artifact provenance

Change linking

The consolidated PDF says that a given article was changed by a diploma, but it does not always have a direct local act id. The linker resolves those textual references against the raw DRE dump.
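
As a rough sketch of that resolution step, the snippet below extracts diploma references from an amendment note and looks them up in an index keyed by type and number. The regular expression, the note wording, the `acts_by_key` index, and the function names are all assumptions for illustration; the actual linker's matching rules are not described here.

```python
import re

# Matches references such as "Lei n.º 39/2025" or "Decreto-Lei n.º 496/77".
DIPLOMA_REF = re.compile(r"\b(Decreto-Lei|Lei)\s+n\.º\s*(\d+(?:-[A-Z])?/\d{2,4})")


def find_diploma_refs(note):
    """Return (act_type, number) pairs found in an amendment note."""
    return [(m.group(1), m.group(2)) for m in DIPLOMA_REF.finditer(note)]


def link_change(note, acts_by_key):
    """Return the local act id for the first reference present in the index."""
    for key in find_diploma_refs(note):
        if key in acts_by_key:
            return acts_by_key[key]
    return None


# Hypothetical index built from raw dump rows; the id is illustrative.
acts_by_key = {("Lei", "39/2025"): 16}

linked = link_change("Alterado pela Lei n.º 39/2025", acts_by_key)
unlinked = link_change("Alterado pelo Decreto-Lei n.º 496/77", acts_by_key)
```

A reference that is recognized but absent from the index stays unlinked, which is the case where a missing modifying act would need to be added first.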

Script:

python3 scripts/link_article_changes.py --code-key CC --relink

Writes:

legal_article_changes.changed_by_act_id
article-level rows in legal_act_relations
missing modifying acts in legal_acts

Broad act indexing

This indexes many laws or decree-laws from the raw DRE dump into normalized act rows. It does not deeply parse every act.

Script:

python3 scripts/ingest_dre_documents.py \
  --types Lei,Decreto-Lei \
  --from-date 2025-01-01 \
  --batch-size 1000

Writes:

legal_acts
source_documents
ingestion_runs
ingestion_tasks

Single-act backfill

This is used when a professional review needs one exact DRE act with original HTML and direct dump-discoverable rectification relations.

Script:

python3 scripts/backfill_dre_dump_act.py \
  --dre-content-id 913223399 \
  --include-retifications \
  --run-key backfill:lei-39-2025-retification

Writes:

legal_acts
legal_act_texts
act-level legal_act_relations
source_documents

Original act provision parsing

This parses the original DRE HTML for a law or decree-law into source provisions and legal effects. For example, Lei n.º 39/2025 article 2.º amends multiple Código Civil articles.

The parser handles both modern standalone headings and older DRE dump layouts where several Art. headings are grouped inside one HTML paragraph with line breaks. It splits those source headings before extracting target article blocks, so older acts such as Decreto-Lei n.º 496/77 can still produce source provisions/effects from the raw dump text. It also keeps quoted target-code article blocks inside the source amendment article, preserves suffixes such as 102.º-A, and scopes republication annex articles as annex_article provisions. Annex article keys include the annex number, so a republished Artigo 1.º cannot collide with the source act's own Artigo 1.º.
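
Two of the behaviors described above, splitting grouped Art. headings and scoping annex article keys, can be sketched as follows. This assumes the HTML line breaks have already been converted to newlines; the heading pattern, the key format, and the function names are hypothetical and do not reproduce the parser's real output.

```python
import re

# Heading at the start of a line, with an optional single-letter suffix.
HEADING = re.compile(r"^Art\.\s*(\d+)\.º(?:-([A-Z]))?")


def split_grouped_headings(paragraph):
    """Split one paragraph holding several Art. headings into article blocks."""
    blocks = []
    for raw_line in paragraph.splitlines():
        line = raw_line.strip()
        match = HEADING.match(line)
        if match:
            suffix = f"-{match.group(2)}" if match.group(2) else ""
            blocks.append({"article": f"{match.group(1)}.º{suffix}", "lines": [line]})
        elif blocks and line:
            # Continuation text belongs to the most recent heading.
            blocks[-1]["lines"].append(line)
    return blocks


def provision_key(article_label, annex_number=None):
    """Build a key that keeps annex articles apart from the act's own articles."""
    if annex_number is None:
        return f"article:{article_label}"
    return f"annex:{annex_number}:article:{article_label}"


paragraph = "Art. 1.º Texto um\ncontinuação\nArt. 102.º-A Texto dois"
blocks = split_grouped_headings(paragraph)
own_key = provision_key("1.º")
annex_key = provision_key("1.º", annex_number=1)
```

Including the annex number in the key is what prevents a republished Artigo 1.º from overwriting the source act's Artigo 1.º.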

Scripts:

python3 scripts/parse_dre_act_html.py \
  --legal-act-id 16 \
  --out artifacts/dre/acts/lei-39-2025.provisions.json

python3 scripts/load_dre_act_provisions.py \
  artifacts/dre/acts/lei-39-2025.provisions.json

Writes:

legal_act_provisions
legal_act_provision_effects
run/artifact provenance

Resolved effects point directly to loaded legal_articles. Unresolved effects still keep target_label and target_article_number, so they can be resolved after the target code is loaded.

After a new code is loaded, run the effect resolver to connect existing parsed source acts to the newly available articles:

python3 scripts/resolve_effect_targets.py --code-key CRC

This step is idempotent and does not reparse source acts.
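
A minimal sketch of why this step can be rerun safely: effects that already point at an article are skipped, and unresolved ones are matched by the stored target_label and target_article_number. The `target_article_id` field, the in-memory data shapes, and the function name are assumptions for illustration.

```python
def resolve_effect_targets(effects, code_label, articles_by_number):
    """Attach article ids to unresolved effects; return how many were resolved."""
    resolved_now = 0
    for effect in effects:
        if effect.get("target_article_id") is not None:
            continue  # already resolved on an earlier run
        if effect["target_label"] != code_label:
            continue  # targets a different code
        article_id = articles_by_number.get(effect["target_article_number"])
        if article_id is not None:
            effect["target_article_id"] = article_id
            resolved_now += 1
    return resolved_now


# Hypothetical parsed effects and a newly loaded code.
effects = [
    {"target_label": "Código Civil", "target_article_number": "125.º"},
    {"target_label": "Código Civil", "target_article_number": "9999.º"},
]
articles_by_number = {"125.º": 501}

first_run = resolve_effect_targets(effects, "Código Civil", articles_by_number)
second_run = resolve_effect_targets(effects, "Código Civil", articles_by_number)
```

The second run resolves nothing new and leaves the still-unmatched effect intact for a later attempt.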

DRE legal-analysis snapshot

This fetches the DRE analysis page for one act as a dated, sourced snapshot. It does not overwrite normalized legal text. Instead, it stores DRE-side metadata and validation rows so reviewers can see where DRE and local normalized data agree or differ.

The fetcher resolves the current diariodarepublica.pt/dr/detalhe/... ref before requesting analysis pages. The raw dump dre_content_id can be a file id such as https://dre.pt/application/file/..., so it is not safe to treat it as the public detail-page id. Resolution order:

Use stored metadata overrides such as dre_detail_ref or dre_analysis_ref.
Use existing DRE detail or analysis URLs stored on the act.
Build an ELI path from act_type, number, and published_at, then GET https://diariodarepublica.pt/redirect/LinkELI.aspx?search=<eli-path> and follow the redirect.
Try the raw dump dre_content_id only as a last candidate, and accept it only if the parsed DRE heading matches the local act type and number.
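
The resolution order above can be sketched as an ordered fallback chain. The network steps are passed in as callables so the example stays self-contained; `resolve_eli`, `fetch_heading`, the `dre_urls` field, the returned labels, and the simplified heading check are assumptions, not the fetcher's actual interface.

```python
def resolve_detail_ref(act, resolve_eli, fetch_heading):
    """Return (ref, how) using the documented candidate order."""
    metadata = act.get("metadata", {})

    # 1. Stored metadata overrides.
    for key in ("dre_detail_ref", "dre_analysis_ref"):
        if metadata.get(key):
            return metadata[key], "override"

    # 2. Existing DRE detail or analysis URLs stored on the act.
    for url in act.get("dre_urls", []):
        return url, "stored_url"

    # 3. ELI redirect built from act_type, number, and published_at.
    ref = resolve_eli(act["act_type"], act["number"], act["published_at"])
    if ref:
        return ref, "eli_redirect"

    # 4. Raw dump id, accepted only when the fetched heading matches.
    raw_id = act.get("dre_content_id")
    if raw_id is not None:
        heading = (fetch_heading(raw_id) or "").strip()
        if heading.startswith(f"{act['act_type']} n.º {act['number']}"):
            return str(raw_id), "raw_dump_id"

    return None, "unresolved"


# Dummy act values for illustration only.
demo_act = {
    "act_type": "Lei",
    "number": "1/2000",
    "published_at": "2000-01-01",
    "dre_content_id": 123,
    "metadata": {},
}

accepted = resolve_detail_ref(demo_act, lambda *a: None, lambda raw: "Lei n.º 1/2000")
rejected = resolve_detail_ref(demo_act, lambda *a: None, lambda raw: "Decreto-Lei n.º 1/2000")
```

Keeping the raw id last, and guarded by a heading check, reflects the point that a file id cannot be assumed to be a public detail-page id.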

When a current DRE ref is found, the loader stores it in legal_acts.metadata as dre_detail_ref, dre_analysis_ref, dre_detail_content_id, and dre_detail_url. It preserves the original raw dump id in legal_acts.dre_content_id and records it again as raw_dre_content_id, so the raw dump relation remains intact while future relation resolution can still match current DRE public URLs.

Scripts:

python3 scripts/fetch_dre_analysis.py \
  --legal-act-id 16 \
  --out artifacts/dre/analysis/act-16.analysis.json

python3 scripts/load_dre_analysis.py \
  artifacts/dre/analysis/act-16.analysis.json

Writes:

dre_analysis_snapshots
dre_analysis_descriptors
dre_analysis_modifications
dre_analysis_associations
dre_analysis_validations
source fetches and artifact provenance

Current DRE behavior: the main legal-analysis page exposes a prerendered HTML snapshot to crawler-style requests, while several tab-specific URLs can still return only the OutSystems JavaScript shell. The fetcher records those tab statuses explicitly so the console can distinguish “fetched but shell-only” from “not fetched”. When the dedicated Modificações tab is shell-only, the fetcher still records the visible modification statements from the DRE summary as dre_analysis_modifications with lower confidence and source_section = summary.
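
One way to model the tab statuses and the summary fallback is sketched below. The status names, the rule that a fetched page with no parsed rows counts as shell-only, the `modifications_tab` label, and the confidence values are simplifying assumptions; only `source_section = summary` comes from the description above.

```python
from enum import Enum


class TabStatus(Enum):
    NOT_FETCHED = "not_fetched"
    SHELL_ONLY = "shell_only"
    PARSED = "parsed"


def classify_tab(html, parsed_rows):
    """Separate 'never requested' from 'requested but nothing parseable'."""
    if html is None:
        return TabStatus.NOT_FETCHED
    if not parsed_rows:
        return TabStatus.SHELL_ONLY
    return TabStatus.PARSED


def modification_rows(status, tab_rows, summary_statements):
    """Prefer tab rows; otherwise fall back to summary statements."""
    if status is TabStatus.PARSED:
        return [
            {"text": text, "source_section": "modifications_tab", "confidence": "high"}
            for text in tab_rows
        ]
    return [
        {"text": text, "source_section": "summary", "confidence": "low"}
        for text in summary_statements
    ]


status = classify_tab("<html><script></script></html>", parsed_rows=[])
rows = modification_rows(status, [], ["Altera o Código Civil"])
```

Storing the status explicitly lets the console show that a fetch happened even when it produced no usable rows.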

PGDL secondary-source snapshot

PGDL is used as a secondary static source, not as the canonical source. The fetcher can search PGDL by diploma type, number, and year, select the exact lei_mostra_articulado.php?nid=... result, convert it into the PGDL print endpoint, parse article blocks, extract source effects with the same target-resolution rules used by the DRE act parser, and write a JSON artifact. PGDL pages do not reliably expose the current DRE detail URL, so PGDL is not the resolver for diariodarepublica.pt/dr/detalhe/...; it is used to cross-check content when DRE live analysis tabs are unavailable or shell-only. The parser preserves article suffixes written after the ordinal marker, such as 3.º-A or 447.º A, so old codes do not collapse suffixed articles into the base article number.
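
The suffix handling mentioned above can be illustrated with a small normalizer that accepts both the hyphenated and the space-separated forms. It works on an isolated article label and assumes single-letter suffixes; the pattern and function name are hypothetical, and parsing labels inside running text would need extra care.

```python
import re

# Number, ordinal marker, then an optional suffix after a hyphen or a space.
ARTICLE_LABEL = re.compile(r"(\d+)\.º(?:(?:\s*-\s*|\s+)([A-Z]))?")


def normalize_article_label(label):
    """Return a canonical label such as '447.º-A' without dropping the suffix."""
    match = ARTICLE_LABEL.fullmatch(label.strip())
    if match is None:
        raise ValueError(f"unrecognized article label: {label!r}")
    number, suffix = match.groups()
    return f"{number}.º-{suffix}" if suffix else f"{number}.º"


normalized = [normalize_article_label(x) for x in ("3.º-A", "447.º A", "125.º")]
```

Normalizing both spellings to one form keeps 447.º-A distinct from 447.º when articles are matched across sources.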

Scripts:

python3 scripts/fetch_pgdl_act.py \
  --legal-act-id 15 \
  --act-type "Decreto-Lei" \
  --act-number "496/77" \
  --act-year 1977 \
  --out artifacts/pgdl/acts/act-15.pgdl.json

python3 scripts/load_pgdl_act.py \
  artifacts/pgdl/acts/act-15.pgdl.json

Writes:

pgdl_act_snapshots
pgdl_act_articles
pgdl_act_effects
pgdl_act_validations
source fetches and artifact provenance

The loader cross-checks PGDL article count, effect count, and effect target distribution against the DRE-derived legal_act_provisions and legal_act_provision_effects rows. Validation issues are displayed in the act page instead of being merged silently.
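
A minimal sketch of such a cross-check, comparing the three quantities named above and returning issues rather than merging anything. The input shapes (plain lists of articles and of effect target labels), the issue strings, and the function name are assumptions for illustration.

```python
from collections import Counter


def cross_check(pgdl_articles, pgdl_effect_targets, dre_provisions, dre_effect_targets):
    """Return a list of human-readable differences between PGDL and DRE data."""
    issues = []
    if len(pgdl_articles) != len(dre_provisions):
        issues.append(
            f"article_count: pgdl={len(pgdl_articles)} dre={len(dre_provisions)}"
        )
    if len(pgdl_effect_targets) != len(dre_effect_targets):
        issues.append(
            f"effect_count: pgdl={len(pgdl_effect_targets)} dre={len(dre_effect_targets)}"
        )
    pgdl_dist = Counter(pgdl_effect_targets)
    dre_dist = Counter(dre_effect_targets)
    for target in sorted(set(pgdl_dist) | set(dre_dist)):
        if pgdl_dist[target] != dre_dist[target]:
            issues.append(
                f"target_distribution[{target}]: pgdl={pgdl_dist[target]} dre={dre_dist[target]}"
            )
    return issues


# Hypothetical inputs: same counts, but one effect points at a different target.
issues = cross_check(
    ["1.º", "2.º"], ["Código Civil", "Código Civil"],
    ["1.º", "2.º"], ["Código Civil", "Código Penal"],
)
```

Comparing the target distribution catches cases where the totals agree but the effects point at different codes.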

DGSI Acórdãos

DGSI decisions are ingested as a jurisprudence corpus whose primary purpose is to link explicit court-decision citations to loaded legal acts, codes, and articles.

The detailed operational reference is `ACORDAOS_INGESTION.md`.

Pipeline:

fetch_dgsi_acordao_index.py crawls DGSI court databases with ?OpenDatabase&Start=N, deduplicates repeated DGSI page-boundary rows by decision_key, and writes an index artifact.
load_dgsi_acordao_index.py upserts court_decisions and dgsi_court_sync_state.
fetch_dgsi_acordao_details.py fetches expanded detail pages and preserves raw HTML artifacts.
load_dgsi_acordao_details.py upserts DGSI source_documents, decision metadata, and court_decision_texts.
extract_dgsi_acordao_citation_candidates.py scans summary/full text and preserves exact spans, raw text, sentence, and context.
split_dgsi_acordao_citation_candidates.py can stream very large candidate artifacts into decision-safe shards plus a manifest for Codex. Each decision stays in a single shard because shard loading replaces citation rows for the affected decisions.
classify_dgsi_acordao_citations.py applies deterministic semantic phrase rules, or Codex uses .agents/skills/invera-acordaos to write the same classified JSON shape. Codex may add a short Portuguese usage_summary explaining how the cited provision is used when the saved context supports it. Unclear citations remain unknown.
load_dgsi_acordao_citations.py loads canonical citation rows, including unresolved candidates. load_dgsi_acordao_citation_parts does the same for every classified shard in a split manifest.
resolve_dgsi_acordao_citations.py resolves loaded codes/articles and soft-upserts matching laws/decrees from public.dreapp_document when possible. Only resolved targets are materialized. open_ended references such as e seguintes are bounded by the nearest loaded subsection, section, or chapter before materialized links are written.
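
The decision-safe sharding constraint in the pipeline above can be sketched as follows. This version groups candidates in memory rather than streaming, and the shard-size parameter, data shape, and function name are assumptions; it only illustrates keeping each decision_key within a single shard.

```python
def shard_candidates(candidates, max_per_shard):
    """Pack candidates into shards without splitting any decision across shards."""
    by_decision = {}
    for candidate in candidates:
        by_decision.setdefault(candidate["decision_key"], []).append(candidate)

    shards, current = [], []
    for group in by_decision.values():
        # Start a new shard when adding this decision would exceed the limit.
        if current and len(current) + len(group) > max_per_shard:
            shards.append(current)
            current = []
        current.extend(group)
    if current:
        shards.append(current)
    return shards


# Hypothetical candidates: decision "a" has 2, "b" has 3, "c" has 1.
candidates = [{"decision_key": key} for key in "aabbbc"]
shards = shard_candidates(candidates, max_per_shard=3)
shard_keys = [sorted({c["decision_key"] for c in shard}) for shard in shards]
```

A decision with more candidates than the limit still lands in one oversized shard, since splitting it would let a later shard load replace rows written by an earlier one.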

Worker actions exposed to the console:

sync_dgsi_acordao_index
sync_dgsi_acordao_details
extract_dgsi_acordao_citation_candidates
classify_dgsi_acordao_citations
split_dgsi_acordao_citation_candidates
extract_dgsi_acordao_citations
load_dgsi_acordao_citations
load_dgsi_acordao_citation_parts
resolve_dgsi_acordao_citations

The top navigation includes /acordaos, with queue controls, court coverage, Codex prompt generation, search/filtering, unresolved counts, and decision detail pages. Act, code, and article detail views show inbound Acórdãos panels from court_decision_target_links; when a citation has usage_summary, the panel shows it above the preserved source context.

Dashboard Queue Flow

The dashboard should not spawn shell commands inside a web request. Instead, it creates tasks in the database.

A user clicks an action such as Queue parse original HTML.
The Next.js API inserts a pending ingestion_tasks row.
A separate worker locks the next pending task.
The worker runs the mapped script commands.
The worker writes task/run output or failure details.
The dashboard refreshes from ingestion_runs and ingestion_tasks.
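
The claim step in the flow above can be sketched with an in-memory SQLite table standing in for ingestion_tasks. The `demo_tasks` table, its columns, and the function name are assumptions; the real worker runs against Postgres, where row locking is commonly done with SELECT ... FOR UPDATE SKIP LOCKED, though this document does not say which mechanism the worker uses.

```python
import sqlite3


def claim_next_task(conn):
    """Mark the oldest pending task as running and return it, or None."""
    with conn:  # commit on success, roll back on error
        row = conn.execute(
            "SELECT id, action FROM demo_tasks "
            "WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # The status guard keeps a second worker from claiming the same row.
        updated = conn.execute(
            "UPDATE demo_tasks SET status = 'running' "
            "WHERE id = ? AND status = 'pending'",
            (row[0],),
        )
        if updated.rowcount != 1:
            return None
        return {"id": row[0], "action": row[1]}


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demo_tasks (id INTEGER PRIMARY KEY, action TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO demo_tasks (action, status) VALUES (?, 'pending')",
    [("parse_dre_act_html",), ("fetch_dre_analysis",)],
)
conn.commit()

first = claim_next_task(conn)
second = claim_next_task(conn)
third = claim_next_task(conn)
```

Because the web request only inserts a row and the worker claims it separately, a slow or failing script never blocks the dashboard.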

Run one queued task:

python3 scripts/ingestion_worker.py

Run continuously:

python3 scripts/ingestion_worker.py --watch

Supported queued actions:

ingest_consolidated_code
link_article_changes
resolve_effect_targets
backfill_dre_dump_act
parse_dre_act_html
fetch_dre_analysis
fetch_pgdl_act
sync_dgsi_acordao_index
sync_dgsi_acordao_details
extract_dgsi_acordao_citation_candidates
classify_dgsi_acordao_citations
extract_dgsi_acordao_citations
load_dgsi_acordao_citations
resolve_dgsi_acordao_citations

Status Shown In The Console

Dashboard

The dashboard shows per-code ingestion status:

articles loaded from consolidated PDFs
article changes parsed
changes linked to DRE acts
source effects parsed from original acts
pending, running, and failed tasks
the next recommended ingestion step

Code Page

The code page shows whether the code has:

article text
consolidated PDF change history
linked modifying acts
original DRE source effects pointing into this code
an Acórdãos tab with inbound decisions that resolve to the code or its articles

It can queue link_article_changes and resolve_effect_targets.

Article Page

The article panel shows:

current normalized article text
DRE source effects parsed from original acts
consolidated PDF change history
whether amendment text has been parsed from source acts
inbound Acórdãos that resolve to this article, including Codex usage summaries when available

Act Page

The act page shows whether the act has:

a raw DRE dump row
original HTML text
parsed source provisions
parsed source effects
effects linked to loaded articles
DRE legal-analysis snapshots, descriptors, associations, and validation differences

PGDL secondary-source snapshots and cross-check issues
inbound Acórdãos that resolve to the act, its code, or its articles

It can queue backfill, original HTML provision parsing, DRE analysis snapshot refreshes, PGDL discovery/fetch jobs, and PGDL refreshes when a PGDL source URL is already known. DRE analysis and PGDL actions are separate from the original HTML parser because they store sourced comparison data.

Current Baseline

Código Civil is loaded from the consolidated PDF.

Lei n.º 39/2025 has:

original DRE dump row
original HTML text
rectification relation
8 source provisions
63 provision effects
36 effects resolved to Código Civil articles
DRE analysis snapshot with 10 descriptors, 3 DRE summary-derived modification statements, 1 rectification association, and field-level validation against local act metadata

Article 125 of Código Civil shows:

the source amendment from Lei n.º 39/2025 article 2.º
the target wording parsed from original DRE HTML
the consolidated PDF change history

Decreto-Lei n.º 496/77 has:

original DRE dump row
original HTML text
187 source provisions
168 provision effects resolved to Código Civil articles
DRE analysis fetch saved as shell-only/unparsed for the requested live page
PGDL secondary-source snapshot with 187 articles, 168 effects, and 0 comparison issues against the DRE-derived provision/effect rows
