Ingestion Workflow
This document explains how the ingestion service turns Portuguese legal sources into normalized data for review in the operations console.
The important rule is that source data and parser output are preserved before database writes. Raw DRE dump tables stay read-only. The ingestion service writes only to the ingestion schema.
System Map
Storage Boundaries
- public.dreapp_document and public.dreapp_documenttext are raw staging tables from the imported DRE dump.
- Prisma owns only the ingestion schema.
- Raw table rows are connected with soft references such as dre_document_id, dre_documenttext_id, and dre_content_id.
- Network fetchers and parsers should write JSON or CSV artifacts before loading
normalized rows.
- Database loaders should be idempotent, resumable, and safe to rerun.
- DGSI Acórdãos are court-decision sources. Resolved links from decisions to
acts/codes/articles live in court_decision_target_links, not in legal_act_relations.
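The artifact-first and idempotent-loader rules above can be sketched as follows. This is a minimal illustration, not the project's loader: it uses sqlite3 as a stand-in for Postgres, and the function names, the artifact shape (code_key, articles, number, body), and the table columns are all assumptions made for the example.

```python
import json
import os
import sqlite3
import tempfile
from pathlib import Path


def write_artifact(path: Path, payload: dict) -> None:
    """Write JSON atomically so a crashed run never leaves a half-written artifact."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(payload, handle, ensure_ascii=False, indent=2)
    os.replace(tmp_name, path)  # atomic rename on the same filesystem


def load_articles(conn: sqlite3.Connection, artifact_path: Path) -> int:
    """Upsert rows keyed by (code_key, article_number); reruns do not duplicate."""
    payload = json.loads(artifact_path.read_text(encoding="utf-8"))
    # Hypothetical, simplified table; the real ingestion schema is richer.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS legal_articles ("
        "code_key TEXT, article_number TEXT, body TEXT, "
        "PRIMARY KEY (code_key, article_number))"
    )
    with conn:  # one transaction: either every row lands or none does
        for article in payload["articles"]:
            conn.execute(
                "INSERT INTO legal_articles (code_key, article_number, body) "
                "VALUES (?, ?, ?) "
                "ON CONFLICT (code_key, article_number) "
                "DO UPDATE SET body = excluded.body",
                (payload["code_key"], article["number"], article["body"]),
            )
    return conn.execute("SELECT COUNT(*) FROM legal_articles").fetchone()[0]
```

Because the artifact is on disk before any row is written, a failed load can be retried from the same file without refetching or reparsing the source.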
Main Data Layers
Raw DRE dump
The dump is the broad local source for published acts and original HTML text. It is imported once into Postgres and then treated as read-only.
Script:
bzcat 2026-05-03-DRE_dump.sql.bz2 \
| psql postgresql://invera:invera_dev_password@localhost:55432/invera_dre
Consolidated code ingestion
Consolidated code PDFs provide the current article text, article hierarchy, and the amendment notes embedded in each article.
The operations console entry point is Codes -> New code. The user enters the DRE consolidated page URL and, for current DRE pages, uploads the consolidated PDF fallback. The worker infers code metadata from the PDF plus URL and queues ingest_consolidated_code, which runs parse, validate, load, link, and effect-resolution steps in sequence.
Scripts:
python3 scripts/ingest_consolidated_code.py \
--source-url <dre-consolidated-url> \
--pdf-path <uploaded-or-local-pdf>
python3 scripts/parse_law_pdf.py <code.pdf> --sigla CC --out artifacts/codigos/cc_articles.json
python3 scripts/validate_artifact.py artifacts/codigos/cc_articles.json --kind parsed_law_pdf
python3 scripts/load_parsed_law_pdf.py artifacts/codigos/cc_articles.json ...
Writes:
- legal_codes
- legal_articles
- legal_article_versions
- legal_article_changes
- parser/run/artifact provenance
Change linking
The consolidated PDF says that a given article was changed by a diploma, but it does not always have a direct local act id. The linker resolves those textual references against the raw DRE dump.
Script:
python3 scripts/link_article_changes.py --code-key CC --relink
Writes:
- legal_article_changes.changed_by_act_id
- article-level rows in legal_act_relations
- missing modifying acts in legal_acts
Broad act indexing
This indexes many laws or decree-laws from the raw DRE dump into normalized act rows. It does not deeply parse every act.
Script:
python3 scripts/ingest_dre_documents.py \
--types Lei,Decreto-Lei \
--from-date 2025-01-01 \
--batch-size 1000
Writes:
- legal_acts
- source_documents
- ingestion_runs
- ingestion_tasks
Single-act backfill
This is used when a professional review needs one exact DRE act with original HTML and direct dump-discoverable rectification relations.
Script:
python3 scripts/backfill_dre_dump_act.py \
--dre-content-id 913223399 \
--include-retifications \
--run-key backfill:lei-39-2025-retification
Writes:
- legal_acts
- legal_act_texts
- act-level legal_act_relations
- source_documents
Original act provision parsing
This parses the original DRE HTML for a law or decree-law into source provisions and legal effects. For example, Lei n.º 39/2025 article 2.º amends multiple Código Civil articles.
The parser handles both modern standalone headings and older DRE dump layouts where several Art. headings are grouped inside one HTML paragraph with line breaks. It splits those source headings before extracting target article blocks, so older acts such as Decreto-Lei n.º 496/77 can still produce source provisions/effects from the raw dump text. It also keeps quoted target-code article blocks inside the source amendment article, preserves suffixes such as 102.º-A, and scopes republication annex articles as annex_article provisions. Annex article keys include the annex number, so a republished Artigo 1.º cannot collide with the source act's own Artigo 1.º.
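The heading split and the annex-scoped keys described above can be sketched as follows. This is a simplified illustration under stated assumptions: the regular expression, the function names, and the key format (article:N, annex:K:article:N) are hypothetical, and the sketch does not handle quoted target-code blocks, which the real parser keeps inside the source amendment article.

```python
import re
from typing import List, Optional, Tuple

# Matches "Art. 2.º" or "Artigo 102.º-A" at the start of a line.
HEADING_RE = re.compile(r"^(?:Art\.|Artigo)\s+(\d+)\.º(?:-([A-Z]))?", re.MULTILINE)


def split_grouped_headings(paragraph: str) -> List[Tuple[str, str]]:
    """Split one paragraph holding several article headings into (number, body) pairs."""
    matches = list(HEADING_RE.finditer(paragraph))
    provisions = []
    for index, match in enumerate(matches):
        # Each body runs until the next heading, or to the end of the paragraph.
        end = matches[index + 1].start() if index + 1 < len(matches) else len(paragraph)
        number = match.group(1) + (f"-{match.group(2)}" if match.group(2) else "")
        provisions.append((number, paragraph[match.end():end].strip()))
    return provisions


def provision_key(number: str, annex: Optional[int] = None) -> str:
    """Include the annex number so a republished article cannot collide with the act's own."""
    if annex is None:
        return f"article:{number}"
    return f"annex:{annex}:article:{number}"
```

With this key shape, provision_key("1") and provision_key("1", annex=1) stay distinct, which is the collision the annex scoping is meant to prevent.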
Scripts:
python3 scripts/parse_dre_act_html.py \
--legal-act-id 16 \
--out artifacts/dre/acts/lei-39-2025.provisions.json
python3 scripts/load_dre_act_provisions.py \
artifacts/dre/acts/lei-39-2025.provisions.json
Writes:
- legal_act_provisions
- legal_act_provision_effects
- run/artifact provenance
Resolved effects point directly to loaded legal_articles. Unresolved effects still keep target_label and target_article_number, so they can be resolved after the target code is loaded.
After a new code is loaded, run the effect resolver to connect existing parsed source acts to the newly available articles:
python3 scripts/resolve_effect_targets.py --code-key CRC
This step is idempotent and does not reparse source acts.
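A minimal sketch of why late resolution can be idempotent, assuming each effect is a dict that keeps target_label and target_article_number and gains a target_article_id once resolved. The function name matches the script only for readability; the dict shape, the target_article_id field, and the lookup key are assumptions, not the script's actual data model.

```python
from typing import Dict, List, Tuple


def resolve_effect_targets(
    effects: List[dict],
    articles_by_key: Dict[Tuple[str, str], int],
) -> int:
    """Fill target_article_id for unresolved effects whose target article is now loaded."""
    newly_resolved = 0
    for effect in effects:
        if effect.get("target_article_id") is not None:
            continue  # already resolved; a rerun leaves it untouched
        key = (effect["target_label"], effect["target_article_number"])
        article_id = articles_by_key.get(key)
        if article_id is not None:
            effect["target_article_id"] = article_id
            newly_resolved += 1
    return newly_resolved
```

Running it twice against the same article index resolves nothing new the second time, and effects whose code is still missing keep their label and number for a later pass.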
DRE legal-analysis snapshot
This fetches the DRE analysis page for one act as a dated sourced snapshot. It does not overwrite normalized legal text. Instead, it stores DRE-side metadata and validation rows so reviewers can see where DRE and local normalized data agree or differ.
The fetcher resolves the current diariodarepublica.pt/dr/detalhe/... ref before requesting analysis pages. The raw dump dre_content_id can be a file id such as https://dre.pt/application/file/..., so it is not safe to treat it as the public detail-page id. Resolution order:
- Use stored metadata overrides such as dre_detail_ref or dre_analysis_ref.
- Use existing DRE detail or analysis URLs stored on the act.
- Build an ELI path from act_type, number, and published_at, then GET https://diariodarepublica.pt/redirect/LinkELI.aspx?search=<eli-path> and follow the redirect.
- Try the raw dump dre_content_id only as a last candidate, and accept it only if the parsed DRE heading matches the local act type and number.
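The resolution order above can be sketched as a lazy candidate chain. This is an illustration only: the act dict shape, the stored_urls field, and the two injected callables (resolve_eli, which would build the ELI path and follow the redirect, and heading_matches, which would fetch and compare the DRE heading) are hypothetical stand-ins so that the sketch makes no network calls.

```python
from typing import Callable, Iterator, Optional, Tuple


def candidate_refs(
    act: dict,
    resolve_eli: Callable[[dict], Optional[str]],
) -> Iterator[Tuple[str, str]]:
    """Yield (source, ref) candidates in priority order."""
    metadata = act.get("metadata", {})
    for key in ("dre_detail_ref", "dre_analysis_ref"):
        if metadata.get(key):
            yield ("metadata_override", metadata[key])
    for url in act.get("stored_urls", []):
        yield ("stored_url", url)
    # Only reached if the caller keeps iterating, so the redirect lookup stays lazy.
    eli_ref = resolve_eli(act)
    if eli_ref:
        yield ("eli_redirect", eli_ref)
    if act.get("dre_content_id"):
        yield ("raw_dump_id", str(act["dre_content_id"]))


def resolve_detail_ref(
    act: dict,
    resolve_eli: Callable[[dict], Optional[str]],
    heading_matches: Callable[[str, dict], bool],
) -> Optional[Tuple[str, str]]:
    """Return the first acceptable candidate, or None if nothing resolves."""
    for source, ref in candidate_refs(act, resolve_eli):
        # The raw dump id may be a file id, so it must pass the heading check.
        if source == "raw_dump_id" and not heading_matches(ref, act):
            continue
        return source, ref
    return None
```

Because the candidates come from a generator, a stored override short-circuits the chain before any ELI lookup is attempted.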
When a current DRE ref is found, the loader stores it in legal_acts.metadata as dre_detail_ref, dre_analysis_ref, dre_detail_content_id, and dre_detail_url. It preserves the original raw dump id in legal_acts.dre_content_id and records it again as raw_dre_content_id, so the raw dump relation remains intact while future relation resolution can still match current DRE public URLs.
Scripts:
python3 scripts/fetch_dre_analysis.py \
--legal-act-id 16 \
--out artifacts/dre/analysis/act-16.analysis.json
python3 scripts/load_dre_analysis.py \
artifacts/dre/analysis/act-16.analysis.json
Writes:
- dre_analysis_snapshots
- dre_analysis_descriptors
- dre_analysis_modifications
- dre_analysis_associations
- dre_analysis_validations
- source fetches and artifact provenance
Current DRE behavior: the main legal-analysis page exposes a prerendered HTML snapshot to crawler-style requests, while several tab-specific URLs can still return only the OutSystems JavaScript shell. The fetcher records those tab statuses explicitly so the console can distinguish “fetched but shell-only” from “not fetched”. When the dedicated Modificações tab is shell-only, the fetcher still records the visible modification statements from the DRE summary as dre_analysis_modifications with lower confidence and source_section = summary.
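One way to make the tab statuses explicit is sketched below. The enum values, the content_marker heuristic for detecting a shell-only response, and the "low" confidence label are assumptions for illustration; only source_section = summary comes from the behavior described above.

```python
from enum import Enum
from typing import List, Optional


class TabStatus(str, Enum):
    NOT_FETCHED = "not_fetched"
    SHELL_ONLY = "shell_only"
    PARSED = "parsed"


def classify_tab(html: Optional[str], content_marker: str) -> TabStatus:
    """Distinguish a tab that was never requested from one that returned only the JS shell."""
    if html is None:
        return TabStatus.NOT_FETCHED
    # Hypothetical heuristic: prerendered pages contain a known content marker.
    if content_marker not in html:
        return TabStatus.SHELL_ONLY
    return TabStatus.PARSED


def modification_rows_from_summary(
    statements: List[str],
    tab_status: TabStatus,
) -> List[dict]:
    """Fall back to summary statements only when the dedicated tab is shell-only."""
    if tab_status is not TabStatus.SHELL_ONLY:
        return []
    return [
        {"statement": text, "source_section": "summary", "confidence": "low"}
        for text in statements
    ]
```

Storing the status as a value rather than inferring it from missing rows is what lets the console tell the two cases apart.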
PGDL secondary-source snapshot
PGDL is used as a secondary static source, not as the canonical source. The fetcher can search PGDL by diploma type, number, and year, select the exact lei_mostra_articulado.php?nid=... result, convert it into the PGDL print endpoint, parse article blocks, extract source effects with the same target-resolution rules used by the DRE act parser, and write a JSON artifact. PGDL pages do not reliably expose the current DRE detail URL, so PGDL is not the resolver for diariodarepublica.pt/dr/detalhe/...; it is used to cross-check content when DRE live analysis tabs are unavailable or shell-only. The parser preserves article suffixes written after the ordinal marker, such as 3.º-A or 447.º A, so old codes do not collapse suffixed articles into the base article number.
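The suffix handling can be illustrated with a small normalizer. This is a sketch, not the PGDL parser's actual rule: the regular expression and the article_key name are assumptions, and it only accepts a heading that ends right after the optional suffix, so a sentence that merely starts with the word "A" is not mistaken for a suffixed article.

```python
import re
from typing import Optional

# "Artigo 3.º", "Artigo 3.º-A", and "Artigo 447.º A" as complete heading lines.
HEADING_RE = re.compile(r"^Artigo\s+(\d+)\.º(?:(?:\s*-\s*|\s+)([A-Z]))?\s*$")


def article_key(heading: str) -> Optional[str]:
    """Return '3', '3-A', or '447-A'; None when the line is not a bare article heading."""
    match = HEADING_RE.match(heading.strip())
    if match is None:
        return None
    base, suffix = match.groups()
    return f"{base}-{suffix}" if suffix else base
```

Both spellings of a suffixed heading normalize to the same key, so the suffixed article and its base article remain separate rows.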
Scripts:
python3 scripts/fetch_pgdl_act.py \
--legal-act-id 15 \
--act-type "Decreto-Lei" \
--act-number "496/77" \
--act-year 1977 \
--out artifacts/pgdl/acts/act-15.pgdl.json
python3 scripts/load_pgdl_act.py \
artifacts/pgdl/acts/act-15.pgdl.json
Writes:
- pgdl_act_snapshots
- pgdl_act_articles
- pgdl_act_effects
- pgdl_act_validations
- source fetches and artifact provenance
The loader cross-checks PGDL article count, effect count, and effect target distribution against the DRE-derived legal_act_provisions and legal_act_provision_effects rows. Validation issues are displayed in the act page instead of being merged silently.
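A minimal sketch of such a cross-check, assuming each side has already been summarized into a dict with article_count, effect_count, and a list of effect target labels. The dict shape, the function name, and the issue format are hypothetical.

```python
from collections import Counter
from typing import List


def cross_check(pgdl: dict, dre: dict) -> List[dict]:
    """Return one issue per mismatching count or per-target effect total."""
    issues = []
    for field in ("article_count", "effect_count"):
        if pgdl[field] != dre[field]:
            issues.append({"field": field, "pgdl": pgdl[field], "dre": dre[field]})
    pgdl_targets = Counter(pgdl["effect_targets"])
    dre_targets = Counter(dre["effect_targets"])
    # Compare the distribution target by target; a missing target counts as zero.
    for target in sorted(set(pgdl_targets) | set(dre_targets)):
        if pgdl_targets[target] != dre_targets[target]:
            issues.append(
                {
                    "field": f"effect_targets[{target}]",
                    "pgdl": pgdl_targets[target],
                    "dre": dre_targets[target],
                }
            )
    return issues
```

An empty list means the two sources agree on these measures; any entries would be stored as validation rows rather than used to overwrite either side.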
DGSI Acórdãos
DGSI decisions are ingested as a jurisprudence corpus whose primary purpose is to link explicit court-decision citations to loaded legal acts, codes, and articles.
The detailed operational reference is `ACORDAOS_INGESTION.md`.
Pipeline:
- fetch_dgsi_acordao_index.py crawls DGSI court databases with ?OpenDatabase&Start=N, deduplicates repeated DGSI page-boundary rows by decision_key, and writes an index artifact.
- load_dgsi_acordao_index.py upserts court_decisions and dgsi_court_sync_state.
- fetch_dgsi_acordao_details.py fetches expanded detail pages and preserves raw HTML artifacts.
- load_dgsi_acordao_details.py upserts DGSI source_documents, decision metadata, and court_decision_texts.
- extract_dgsi_acordao_citation_candidates.py scans summary/full text and preserves exact spans, raw text, sentence, and context.
- split_dgsi_acordao_citation_candidates.py can stream very large candidate artifacts into decision-safe shards plus a manifest for Codex. Each decision stays in a single shard because shard loading replaces citation rows for the affected decisions.
- classify_dgsi_acordao_citations.py applies deterministic semantic phrase rules, or Codex uses .agents/skills/invera-acordaos to write the same classified JSON shape. Codex may add a short Portuguese usage_summary explaining how the cited provision is used when the saved context supports it. Unclear citations remain unknown.
- load_dgsi_acordao_citations.py loads canonical citation rows, including unresolved candidates. load_dgsi_acordao_citation_parts does the same for every classified shard in a split manifest.
- resolve_dgsi_acordao_citations.py resolves loaded codes/articles and soft-upserts matching laws/decrees from public.dreapp_document when possible. Only resolved targets are materialized. open_ended references such as e seguintes are bounded by the nearest loaded subsection, section, or chapter before materialized links are written.
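The decision-safe sharding rule in the pipeline above can be sketched as follows. Unlike the real splitter, which streams very large artifacts, this illustration groups candidates in memory; the function name and the max_per_shard parameter are assumptions, and only decision_key comes from the pipeline description.

```python
from collections import defaultdict
from typing import Dict, List


def split_into_shards(candidates: List[dict], max_per_shard: int) -> List[List[dict]]:
    """Pack candidates into shards without ever splitting one decision across two shards."""
    by_decision: Dict[str, List[dict]] = defaultdict(list)
    for candidate in candidates:
        by_decision[candidate["decision_key"]].append(candidate)

    shards: List[List[dict]] = [[]]
    for group in by_decision.values():  # dicts preserve first-seen order
        # Start a new shard when this decision would overflow the current one.
        if shards[-1] and len(shards[-1]) + len(group) > max_per_shard:
            shards.append([])
        # A decision larger than max_per_shard still goes into a single shard.
        shards[-1].extend(group)
    return [shard for shard in shards if shard]
```

Keeping a decision whole matters because loading a shard replaces the citation rows for every decision it contains; a decision split across two shards would lose the first shard's rows when the second is loaded.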
Worker actions exposed to the console:
- sync_dgsi_acordao_index
- sync_dgsi_acordao_details
- extract_dgsi_acordao_citation_candidates
- classify_dgsi_acordao_citations
- split_dgsi_acordao_citation_candidates
- extract_dgsi_acordao_citations
- load_dgsi_acordao_citations
- load_dgsi_acordao_citation_parts
- resolve_dgsi_acordao_citations
The top navigation includes /acordaos, with queue controls, court coverage, Codex prompt generation, search/filtering, unresolved counts, and decision detail pages. Act, code, and article detail views show inbound Acórdãos panels from court_decision_target_links; when a citation has usage_summary, the panel shows it above the preserved source context.
Dashboard Queue Flow
The dashboard should not spawn shell commands inside a web request. Instead, it creates tasks in the database.
- A user clicks an action such as Queue parse original HTML.
- The Next.js API inserts a pending ingestion_tasks row.
- A separate worker locks the next pending task.
- The worker runs the mapped script commands.
- The worker writes task/run output or failure details.
- The dashboard refreshes from ingestion_runs and ingestion_tasks.
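The "lock the next pending task" step can be sketched as below. This is an assumption-heavy illustration: it uses sqlite3 with BEGIN IMMEDIATE as a stand-in, whereas a Postgres worker would more typically use SELECT ... FOR UPDATE SKIP LOCKED, and the simplified ingestion_tasks columns (id, action, status) are hypothetical.

```python
import sqlite3
from typing import Optional, Tuple


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical, simplified shape of the task table.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ingestion_tasks ("
        "id INTEGER PRIMARY KEY, action TEXT NOT NULL, "
        "status TEXT NOT NULL DEFAULT 'pending')"
    )


def claim_next_task(conn: sqlite3.Connection) -> Optional[Tuple[int, str]]:
    """Atomically mark the oldest pending task as running and return (id, action).

    Expects a connection opened with isolation_level=None so that the explicit
    BEGIN/COMMIT statements below control the transaction.
    """
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT id, action FROM ingestion_tasks "
            "WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE ingestion_tasks SET status = 'running' WHERE id = ?",
                (row[0],),
            )
        conn.execute("COMMIT")
        return row
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Claiming inside a single transaction is what keeps two workers from picking up the same task, and it keeps the web request limited to a single insert.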
Run one queued task:
python3 scripts/ingestion_worker.py
Run continuously:
python3 scripts/ingestion_worker.py --watch
Supported queued actions:
- ingest_consolidated_code
- link_article_changes
- resolve_effect_targets
- backfill_dre_dump_act
- parse_dre_act_html
- fetch_dre_analysis
- fetch_pgdl_act
- sync_dgsi_acordao_index
- sync_dgsi_acordao_details
- extract_dgsi_acordao_citation_candidates
- classify_dgsi_acordao_citations
- extract_dgsi_acordao_citations
- load_dgsi_acordao_citations
- resolve_dgsi_acordao_citations
Status Shown In The Console
Dashboard
The dashboard shows per-code ingestion status:
- articles loaded from consolidated PDFs
- article changes parsed
- changes linked to DRE acts
- source effects parsed from original acts
- pending, running, and failed tasks
- the next recommended ingestion step
Code Page
The code page shows whether the code has:
- article text
- consolidated PDF change history
- linked modifying acts
- original DRE source effects pointing into this code
- an Acórdãos tab with inbound decisions that resolve to the code or its articles
It can queue link_article_changes and resolve_effect_targets.
Article Page
The article panel shows:
- current normalized article text
- DRE source effects parsed from original acts
- consolidated PDF change history
- whether amendment text has been parsed from source acts
- inbound Acórdãos that resolve to this article, including Codex usage summaries
when available
Act Page
The act page shows whether the act has:
- a raw DRE dump row
- original HTML text
- parsed source provisions
- parsed source effects
- effects linked to loaded articles
- DRE legal-analysis snapshots, descriptors, associations, and validation
differences
- PGDL secondary-source snapshots and cross-check issues
- inbound Acórdãos that resolve to the act, its code, or its articles
It can queue backfill, original HTML provision parsing, DRE analysis snapshot refreshes, PGDL discovery/fetch jobs, and PGDL refreshes when a PGDL source URL is already known. DRE analysis and PGDL actions are separate from the original HTML parser because they store sourced comparison data.
Current Baseline
Código Civil is loaded from the consolidated PDF.
Lei n.º 39/2025 has:
- original DRE dump row
- original HTML text
- rectification relation
- 8 source provisions
- 63 provision effects
- 36 effects resolved to Código Civil articles
- DRE analysis snapshot with 10 descriptors, 3 DRE summary-derived modification
statements, 1 rectification association, and field-level validation against local act metadata
Article 125 of Código Civil shows:
- the source amendment from Lei n.º 39/2025 article 2.º
- the target wording parsed from original DRE HTML
- the consolidated PDF change history
Decreto-Lei n.º 496/77 has:
- original DRE dump row
- original HTML text
- 187 source provisions
- 168 provision effects resolved to Código Civil articles
- DRE analysis fetch saved as shell-only/unparsed for the requested live page
- PGDL secondary-source snapshot with 187 articles, 168 effects, and 0
comparison issues against the DRE-derived provision/effect rows