Operational reference for scripts, queue processing, database writes, and review status across the ingestion service.

Ingestion Workflow

This document explains how the ingestion service turns Portuguese legal sources into normalized data for review in the operations console.

The important rule is that source data and parser output are preserved before database writes. Raw DRE dump tables stay read-only. The ingestion service writes only to the ingestion schema.

System Map

[Diagram: system map of the ingestion service]

Storage Boundaries

  • public.dreapp_document and public.dreapp_documenttext are raw staging tables from the imported DRE dump.
  • Prisma owns only the ingestion schema.
  • Raw table rows are connected with soft references such as dre_document_id, dre_documenttext_id, and dre_content_id.
  • Network fetchers and parsers should write JSON or CSV artifacts before loading normalized rows.
  • Database loaders should be idempotent, resumable, and safe to rerun.
  • DGSI Acórdãos are court-decision sources. Resolved links from decisions to acts/codes/articles live in court_decision_target_links, not in legal_act_relations.
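
The "idempotent, resumable, and safe to rerun" requirement for loaders can be illustrated with a keyed upsert. This is a minimal sketch, not the service's loader: it uses the standard-library sqlite3 module as a stand-in for Postgres, and the table legal_articles_demo and its columns are hypothetical names chosen for the example.

```python
import sqlite3

def load_articles(conn: sqlite3.Connection, rows: list[dict]) -> None:
    """Upsert parsed article rows keyed by (code_key, article_number).

    Loading the same artifact twice leaves the table unchanged, which is
    what makes the loader safe to rerun after a partial failure.
    """
    # Hypothetical demo table; the real schema is owned by Prisma.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS legal_articles_demo (
            code_key TEXT NOT NULL,
            article_number TEXT NOT NULL,
            body TEXT NOT NULL,
            PRIMARY KEY (code_key, article_number)
        )
        """
    )
    conn.executemany(
        """
        INSERT INTO legal_articles_demo (code_key, article_number, body)
        VALUES (:code_key, :article_number, :body)
        ON CONFLICT (code_key, article_number)
        DO UPDATE SET body = excluded.body
        """,
        rows,
    )
    conn.commit()
```

Because the conflict target is the natural key rather than a generated id, a rerun updates existing rows in place instead of appending duplicates.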

Main Data Layers

Raw DRE dump

The dump is the broad local source for published acts and original HTML text. It is imported once into Postgres and then treated as read-only.

Script:

bzcat 2026-05-03-DRE_dump.sql.bz2 \
  | psql postgresql://invera:invera_dev_password@localhost:55432/invera_dre

Consolidated code ingestion

Consolidated code PDFs provide the current article text, article hierarchy, and the amendment notes embedded in each article.

The operations console entry point is Codes -> New code. The user enters the DRE consolidated page URL and, for current DRE pages, uploads the consolidated PDF fallback. The worker infers code metadata from the PDF plus URL and queues ingest_consolidated_code, which runs parse, validate, load, link, and effect-resolution steps in sequence.

Scripts:

python3 scripts/ingest_consolidated_code.py \
  --source-url <dre-consolidated-url> \
  --pdf-path <uploaded-or-local-pdf>

python3 scripts/parse_law_pdf.py <code.pdf> --sigla CC --out artifacts/codigos/cc_articles.json
python3 scripts/validate_artifact.py artifacts/codigos/cc_articles.json --kind parsed_law_pdf
python3 scripts/load_parsed_law_pdf.py artifacts/codigos/cc_articles.json ...

Writes:

  • legal_codes
  • legal_articles
  • legal_article_versions
  • legal_article_changes
  • parser/run/artifact provenance

Change linking

The consolidated PDF says that a given article was changed by a diploma, but it does not always have a direct local act id. The linker resolves those textual references against the raw DRE dump.
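
To make the linking step concrete, the sketch below shows one way a textual reference could be turned into a structured lookup key before it is matched against the raw dump. The function name, the returned field names, and the rule that two-digit years belong to the 1900s are all assumptions for illustration; the actual linker may recognise more act types and formats.

```python
import re
from typing import Optional

# Assumed reference shape: "<type> n.º <number>/<year>", e.g. "Lei n.º 39/2025".
REFERENCE_RE = re.compile(
    r"(Decreto-Lei|Lei)\s+n\.º\s*(\d+(?:-[A-Z])?)/(\d{4}|\d{2})"
)

def parse_diploma_reference(text: str) -> Optional[dict]:
    """Extract act type, number, and year from a textual diploma reference."""
    match = REFERENCE_RE.search(text)
    if match is None:
        return None
    act_type, number, year = match.groups()
    # Assumption: two-digit years (as in "496/77") refer to the 1900s.
    full_year = int(year) if len(year) == 4 else 1900 + int(year)
    return {"act_type": act_type, "number": number, "year": full_year}
```

A reference that cannot be parsed returns None, so the caller can keep the change row unlinked rather than guess.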

Script:

python3 scripts/link_article_changes.py --code-key CC --relink

Writes:

  • legal_article_changes.changed_by_act_id
  • article-level rows in legal_act_relations
  • missing modifying acts in legal_acts

Broad act indexing

This step indexes laws and decree-laws in bulk from the raw DRE dump into normalized act rows. It does not parse each act's provisions in depth.

Script:

python3 scripts/ingest_dre_documents.py \
  --types Lei,Decreto-Lei \
  --from-date 2025-01-01 \
  --batch-size 1000

Writes:

  • legal_acts
  • source_documents
  • ingestion_runs
  • ingestion_tasks

Single-act backfill

Use this when a professional review needs one specific DRE act, together with its original HTML and any rectification relations that can be discovered directly from the dump.

Script:

python3 scripts/backfill_dre_dump_act.py \
  --dre-content-id 913223399 \
  --include-retifications \
  --run-key backfill:lei-39-2025-retification

Writes:

  • legal_acts
  • legal_act_texts
  • act-level legal_act_relations
  • source_documents

Original act provision parsing

This parses the original DRE HTML for a law or decree-law into source provisions and legal effects. For example, Lei n.º 39/2025 article 2.º amends multiple Código Civil articles.

The parser handles both modern standalone headings and older DRE dump layouts where several Art. headings are grouped inside one HTML paragraph with line breaks. It splits those source headings before extracting target article blocks, so older acts such as Decreto-Lei n.º 496/77 can still produce source provisions/effects from the raw dump text. It also keeps quoted target-code article blocks inside the source amendment article, preserves suffixes such as 102.º-A, and scopes republication annex articles as annex_article provisions. Annex article keys include the annex number, so a republished Artigo 1.º cannot collide with the source act's own Artigo 1.º.
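
The heading split and the annex-scoped keys described above can be sketched as follows. This is an illustrative approximation, not the parser's code: the heading pattern, the function names, and the key format are assumptions, and the sketch only recognises headings that start a line (quoted target-code blocks, which begin with a quotation mark, are left inside the body).

```python
import re
from typing import Optional

# Assumed heading shapes at the start of a line: "Art. 2.º", "Artigo 102.º-A".
HEADING_RE = re.compile(
    r"^(?:Art\.|Artigo)\s+(\d+)\.º(?:\s*-\s*([A-Z]))?", re.MULTILINE
)

def split_grouped_headings(paragraph_text: str) -> list[tuple[str, str]]:
    """Split one paragraph holding several headings into (number, body) pairs."""
    matches = list(HEADING_RE.finditer(paragraph_text))
    blocks = []
    for i, match in enumerate(matches):
        # Each body runs up to the next heading, or to the end of the paragraph.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(paragraph_text)
        number = match.group(1) + ".º"
        if match.group(2):
            number += "-" + match.group(2)  # keep suffixes such as 102.º-A
        blocks.append((number, paragraph_text[match.end():end].strip()))
    return blocks

def provision_key(article_number: str, annex_number: Optional[int] = None) -> str:
    """Build a key that includes the annex number for republished articles."""
    if annex_number is None:
        return f"article:{article_number}"
    return f"annex:{annex_number}:article:{article_number}"
```

With keys built this way, provision_key("1.º") and provision_key("1.º", annex_number=1) differ, so a republished Artigo 1.º stays separate from the source act's own Artigo 1.º.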

Scripts:

python3 scripts/parse_dre_act_html.py \
  --legal-act-id 16 \
  --out artifacts/dre/acts/lei-39-2025.provisions.json

python3 scripts/load_dre_act_provisions.py \
  artifacts/dre/acts/lei-39-2025.provisions.json

Writes:

  • legal_act_provisions
  • legal_act_provision_effects
  • run/artifact provenance

Resolved effects point directly to loaded legal_articles. Unresolved effects still keep target_label and target_article_number, so they can be resolved after the target code is loaded.

After a new code is loaded, run the effect resolver to connect existing parsed source acts to the newly available articles:

python3 scripts/resolve_effect_targets.py --code-key CRC

This step is idempotent and does not reparse source acts.
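
The idempotence of the resolver follows from only touching effects that are still unresolved. A minimal in-memory sketch, assuming a hypothetical target_article_id field and a lookup from article number to loaded article id (neither name is taken from the schema):

```python
def resolve_effect_targets(
    effects: list[dict],
    code_label: str,
    article_ids_by_number: dict[str, int],
) -> int:
    """Attach loaded article ids to unresolved effects for one code.

    Returns the number of effects resolved by this call. Effects that are
    already resolved are skipped, so a second run resolves nothing new.
    """
    resolved = 0
    for effect in effects:
        if effect.get("target_article_id") is not None:
            continue  # already resolved: leave untouched
        if effect["target_label"] != code_label:
            continue  # points at a different code
        article_id = article_ids_by_number.get(effect["target_article_number"])
        if article_id is not None:
            effect["target_article_id"] = article_id
            resolved += 1
    return resolved
```

The source-act artifacts are never read here; only the stored target_label and target_article_number are needed, which is why no reparse is required.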

DRE analysis snapshot

This step fetches the DRE analysis page for one act as a dated, sourced snapshot. It does not overwrite normalized legal text. Instead, it stores DRE-side metadata and validation rows so reviewers can see where DRE and local normalized data agree or differ.

The fetcher resolves the current diariodarepublica.pt/dr/detalhe/... ref before requesting analysis pages. The raw dump dre_content_id can be a file id such as https://dre.pt/application/file/..., so it is not safe to treat it as the public detail-page id. Resolution order:

  1. Use stored metadata overrides such as dre_detail_ref or dre_analysis_ref.
  2. Use existing DRE detail or analysis URLs stored on the act.
  3. Build an ELI path from act_type, number, and published_at, then GET https://diariodarepublica.pt/redirect/LinkELI.aspx?search=<eli-path> and follow the redirect.
  4. Try the raw dump dre_content_id only as a last candidate, and accept it only if the parsed DRE heading matches the local act type and number.
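
The resolution order above can be sketched as a chain of fallbacks. This is an illustration under assumptions, not the fetcher's code: the network steps are passed in as callables (eli_lookup stands for the LinkELI redirect request, heading_matches for the heading check), and the dre_urls field is a hypothetical name for URLs already stored on the act.

```python
from typing import Callable, Optional

def resolve_detail_ref(
    act: dict,
    eli_lookup: Callable[[dict], Optional[str]],
    heading_matches: Callable[[str, dict], bool],
) -> Optional[str]:
    """Return the first usable DRE detail ref, trying candidates in order."""
    metadata = act.get("metadata", {})
    # 1. Stored metadata overrides win.
    for key in ("dre_detail_ref", "dre_analysis_ref"):
        if metadata.get(key):
            return metadata[key]
    # 2. Existing DRE detail or analysis URLs stored on the act.
    for url in act.get("dre_urls", []):
        if url:
            return url
    # 3. ELI redirect lookup (network call injected by the caller).
    ref = eli_lookup(act)
    if ref:
        return ref
    # 4. Raw dump id, accepted only if the fetched heading matches the act.
    raw_id = act.get("dre_content_id")
    if raw_id is not None and heading_matches(str(raw_id), act):
        return str(raw_id)
    return None
```

Injecting the two network-dependent checks keeps the ordering logic testable offline and makes the "last candidate, verified" rule for the raw dump id explicit.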

When a current DRE ref is found, the loader stores it in legal_acts.metadata as dre_detail_ref, dre_analysis_ref, dre_detail_content_id, and dre_detail_url. It preserves the original raw dump id in legal_acts.dre_content_id and records it again as raw_dre_content_id, so the raw dump relation remains intact while future relation resolution can still match current DRE public URLs.

Scripts:

python3 scripts/fetch_dre_analysis.py \
  --legal-act-id 16 \
  --out artifacts/dre/analysis/act-16.analysis.json

python3 scripts/load_dre_analysis.py \
  artifacts/dre/analysis/act-16.analysis.json

Writes:

  • dre_analysis_snapshots
  • dre_analysis_descriptors
  • dre_analysis_modifications
  • dre_analysis_associations
  • dre_analysis_validations
  • source fetches and artifact provenance

Current DRE behavior: the main legal-analysis page exposes a prerendered HTML snapshot to crawler-style requests, while several tab-specific URLs can still return only the OutSystems JavaScript shell. The fetcher records those tab statuses explicitly so the console can distinguish “fetched but shell-only” from “not fetched”. When the dedicated Modificações tab is shell-only, the fetcher still records the visible modification statements from the DRE summary as dre_analysis_modifications with lower confidence and source_section = summary.

PGDL secondary-source snapshot

PGDL is used as a secondary static source, not as the canonical source. The fetcher can search PGDL by diploma type, number, and year, select the exact lei_mostra_articulado.php?nid=... result, convert it into the PGDL print endpoint, parse article blocks, extract source effects with the same target-resolution rules used by the DRE act parser, and write a JSON artifact. PGDL pages do not reliably expose the current DRE detail URL, so PGDL is not the resolver for diariodarepublica.pt/dr/detalhe/...; it is used to cross-check content when DRE live analysis tabs are unavailable or shell-only. The parser preserves article suffixes written after the ordinal marker, such as 3.º-A or 447.º A, so old codes do not collapse suffixed articles into the base article number.
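
The suffix rule at the end of the paragraph above can be illustrated with a small normalizer that accepts both the hyphenated and the space-separated form. The pattern and function name are assumptions for this sketch; it is meant for short heading labels, since in running text a standalone capital "A" after the ordinal could be an ordinary word rather than a suffix.

```python
import re
from typing import Optional

# Ordinal number, then an optional single-letter suffix written as "-A" or " A".
ARTICLE_NUMBER_RE = re.compile(r"(\d+)\.º(?:\s*-?\s*([A-Z])\b)?")

def normalize_article_number(heading_label: str) -> Optional[str]:
    """Normalize an article heading label to "N.º" or "N.º-X"."""
    match = ARTICLE_NUMBER_RE.search(heading_label)
    if match is None:
        return None
    base, suffix = match.group(1), match.group(2)
    return f"{base}.º-{suffix}" if suffix else f"{base}.º"
```

Normalizing both spellings to one form means 447.º A and 447.º-A map to the same article, while neither collapses into 447.º.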

Scripts:

python3 scripts/fetch_pgdl_act.py \
  --legal-act-id 15 \
  --act-type "Decreto-Lei" \
  --act-number "496/77" \
  --act-year 1977 \
  --out artifacts/pgdl/acts/act-15.pgdl.json

python3 scripts/load_pgdl_act.py \
  artifacts/pgdl/acts/act-15.pgdl.json

Writes:

  • pgdl_act_snapshots
  • pgdl_act_articles
  • pgdl_act_effects
  • pgdl_act_validations
  • source fetches and artifact provenance

The loader cross-checks PGDL article count, effect count, and effect target distribution against the DRE-derived legal_act_provisions and legal_act_provision_effects rows. Validation issues are displayed in the act page instead of being merged silently.
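
A cross-check of this kind might look like the sketch below, which compares the two counts and the per-target effect distribution and returns human-readable issues instead of merging anything. The input shape (dicts with articles, provisions, and effects lists) and the issue wording are assumptions made for the example.

```python
from collections import Counter

def cross_check(pgdl: dict, dre: dict) -> list[str]:
    """Compare a PGDL snapshot against DRE-derived rows; return issue strings."""
    issues = []
    if len(pgdl["articles"]) != len(dre["provisions"]):
        issues.append(
            f"article count: PGDL {len(pgdl['articles'])} vs DRE {len(dre['provisions'])}"
        )
    if len(pgdl["effects"]) != len(dre["effects"]):
        issues.append(
            f"effect count: PGDL {len(pgdl['effects'])} vs DRE {len(dre['effects'])}"
        )
    # Distribution check: how many effects point at each target label.
    pgdl_targets = Counter(e["target_label"] for e in pgdl["effects"])
    dre_targets = Counter(e["target_label"] for e in dre["effects"])
    for label in sorted(set(pgdl_targets) | set(dre_targets)):
        if pgdl_targets[label] != dre_targets[label]:
            issues.append(
                f"effects targeting {label}: PGDL {pgdl_targets[label]} vs DRE {dre_targets[label]}"
            )
    return issues
```

An empty list corresponds to a clean comparison, such as the "0 comparison issues" reported for Decreto-Lei n.º 496/77 in the baseline.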

DGSI Acórdãos

DGSI decisions are ingested as a jurisprudence corpus whose primary purpose is to link explicit court-decision citations to loaded legal acts, codes, and articles.

The detailed operational reference is `ACORDAOS_INGESTION.md`.

Pipeline:

  1. fetch_dgsi_acordao_index.py crawls DGSI court databases with ?OpenDatabase&Start=N, deduplicates repeated DGSI page-boundary rows by decision_key, and writes an index artifact.
  2. load_dgsi_acordao_index.py upserts court_decisions and dgsi_court_sync_state.
  3. fetch_dgsi_acordao_details.py fetches expanded detail pages and preserves raw HTML artifacts.
  4. load_dgsi_acordao_details.py upserts DGSI source_documents, decision metadata, and court_decision_texts.
  5. extract_dgsi_acordao_citation_candidates.py scans summary/full text and preserves exact spans, raw text, sentence, and context.
  6. split_dgsi_acordao_citation_candidates.py can stream very large candidate artifacts into decision-safe shards plus a manifest for Codex. Each decision stays in a single shard because shard loading replaces citation rows for the affected decisions.
  7. classify_dgsi_acordao_citations.py applies deterministic semantic phrase rules, or Codex uses .agents/skills/invera-acordaos to write the same classified JSON shape. Codex may add a short Portuguese usage_summary explaining how the cited provision is used when the saved context supports it. Unclear citations remain unknown.
  8. load_dgsi_acordao_citations.py loads canonical citation rows, including unresolved candidates. load_dgsi_acordao_citation_parts does the same for every classified shard in a split manifest.
  9. resolve_dgsi_acordao_citations.py resolves loaded codes/articles and soft-upserts matching laws/decrees from public.dreapp_document when possible. Only resolved targets are materialized. open_ended references such as e seguintes are bounded by the nearest loaded subsection, section, or chapter before materialized links are written.
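
The "decision-safe" property of the split step means that all candidates for one decision_key land in the same shard. The sketch below shows that invariant with an in-memory grouping; it is a simplification (the real script is described as streaming), and the function name and max_per_shard parameter are hypothetical.

```python
from itertools import groupby

def shard_candidates(candidates: list[dict], max_per_shard: int) -> list[list[dict]]:
    """Pack candidates into shards without splitting any decision across shards."""
    ordered = sorted(candidates, key=lambda c: c["decision_key"])
    shards: list[list[dict]] = [[]]
    for _, group_iter in groupby(ordered, key=lambda c: c["decision_key"]):
        group = list(group_iter)
        # Start a new shard if this whole decision would not fit in the current one.
        if shards[-1] and len(shards[-1]) + len(group) > max_per_shard:
            shards.append([])
        # A decision larger than max_per_shard still stays whole in one shard.
        shards[-1].extend(group)
    return [shard for shard in shards if shard]
```

Keeping each decision whole matters because loading a shard replaces the citation rows for the decisions it contains; a decision split across two shards would lose the rows from whichever shard loaded first.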

Worker actions exposed to the console:

  • sync_dgsi_acordao_index
  • sync_dgsi_acordao_details
  • extract_dgsi_acordao_citation_candidates
  • classify_dgsi_acordao_citations
  • split_dgsi_acordao_citation_candidates
  • extract_dgsi_acordao_citations
  • load_dgsi_acordao_citations
  • load_dgsi_acordao_citation_parts
  • resolve_dgsi_acordao_citations

The top navigation includes /acordaos, with queue controls, court coverage, Codex prompt generation, search/filtering, unresolved counts, and decision detail pages. Act, code, and article detail views show inbound Acórdãos panels from court_decision_target_links; when a citation has usage_summary, the panel shows it above the preserved source context.

Dashboard Queue Flow

The dashboard should not spawn shell commands inside a web request. Instead, it creates tasks in the database.

  1. A user clicks an action such as Queue parse original HTML.
  2. The Next.js API inserts a pending ingestion_tasks row.
  3. A separate worker locks the next pending task.
  4. The worker runs the mapped script commands.
  5. The worker writes task/run output or failure details.
  6. The dashboard refreshes from ingestion_runs and ingestion_tasks.
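
Steps 2 to 4 rely on the worker claiming exactly one pending row. The sketch below shows a guarded claim using the standard-library sqlite3 module as a stand-in for Postgres; the demo table, its columns, and the function name are hypothetical. On Postgres, a common way to achieve the same effect is SELECT ... FOR UPDATE SKIP LOCKED, though this document does not state which mechanism the worker uses.

```python
import sqlite3
from typing import Optional

# Hypothetical demo table standing in for ingestion_tasks.
DEMO_QUEUE_DDL = """
CREATE TABLE IF NOT EXISTS ingestion_tasks_demo (
    id INTEGER PRIMARY KEY,
    action TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending'
)
"""

def claim_next_task(conn: sqlite3.Connection) -> Optional[tuple]:
    """Move the oldest pending task to 'running' and return (id, action)."""
    with conn:  # commit on success, roll back on error
        row = conn.execute(
            "SELECT id, action FROM ingestion_tasks_demo "
            "WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None  # queue is empty
        # The status guard makes the claim safe if another worker got there first.
        updated = conn.execute(
            "UPDATE ingestion_tasks_demo SET status = 'running' "
            "WHERE id = ? AND status = 'pending'",
            (row[0],),
        )
        if updated.rowcount != 1:
            return None
        return row
```

Because the web request only inserts a pending row and the worker only claims rows, the dashboard never waits on a script and a crashed script cannot take the request down with it.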

Run one queued task:

python3 scripts/ingestion_worker.py

Run continuously:

python3 scripts/ingestion_worker.py --watch

Supported queued actions:

  • ingest_consolidated_code
  • link_article_changes
  • resolve_effect_targets
  • backfill_dre_dump_act
  • parse_dre_act_html
  • fetch_dre_analysis
  • fetch_pgdl_act
  • sync_dgsi_acordao_index
  • sync_dgsi_acordao_details
  • extract_dgsi_acordao_citation_candidates
  • classify_dgsi_acordao_citations
  • extract_dgsi_acordao_citations
  • load_dgsi_acordao_citations
  • resolve_dgsi_acordao_citations

Status Shown In The Console

Dashboard

The dashboard shows per-code ingestion status:

  • articles loaded from consolidated PDFs
  • article changes parsed
  • changes linked to DRE acts
  • source effects parsed from original acts
  • pending, running, and failed tasks
  • the next recommended ingestion step

Code Page

The code page shows whether the code has:

  • article text
  • consolidated PDF change history
  • linked modifying acts
  • original DRE source effects pointing into this code
  • an Acórdãos tab with inbound decisions that resolve to the code or its articles

It can queue link_article_changes and resolve_effect_targets.

Article Page

The article panel shows:

  • current normalized article text
  • DRE source effects parsed from original acts
  • consolidated PDF change history
  • whether amendment text has been parsed from source acts
  • inbound Acórdãos that resolve to this article, including Codex usage summaries when available

Act Page

The act page shows whether the act has:

  • a raw DRE dump row
  • original HTML text
  • parsed source provisions
  • parsed source effects
  • effects linked to loaded articles
  • DRE legal-analysis snapshots, descriptors, associations, and validation differences

  • PGDL secondary-source snapshots and cross-check issues
  • inbound Acórdãos that resolve to the act, its code, or its articles

It can queue backfill, original HTML provision parsing, DRE analysis snapshot refreshes, PGDL discovery/fetch jobs, and PGDL refreshes when a PGDL source URL is already known. DRE analysis and PGDL actions are separate from the original HTML parser because they store sourced comparison data.

Current Baseline

Código Civil is loaded from the consolidated PDF.

Lei n.º 39/2025 has:

  • original DRE dump row
  • original HTML text
  • rectification relation
  • 8 source provisions
  • 63 provision effects
  • 36 effects resolved to Código Civil articles
  • DRE analysis snapshot with 10 descriptors, 3 DRE summary-derived modification statements, 1 rectification association, and field-level validation against local act metadata

Article 125 of Código Civil shows:

  • the source amendment from Lei n.º 39/2025 article 2.º
  • the target wording parsed from original DRE HTML
  • the consolidated PDF change history

Decreto-Lei n.º 496/77 has:

  • original DRE dump row
  • original HTML text
  • 187 source provisions
  • 168 provision effects resolved to Código Civil articles
  • DRE analysis fetch saved as shell-only/unparsed for the requested live page
  • PGDL secondary-source snapshot with 187 articles, 168 effects, and 0 comparison issues against the DRE-derived provision/effect rows