Unstructured Data Extraction

Spark NLP provides comprehensive capabilities for extracting and processing unstructured data from a wide range of document formats at enterprise scale. Its Reader2X components and related annotators handle common document AI tasks; this page walks through those capabilities and compares them with other frameworks.

Complete Text Coverage from Complex Documents

Problem

Enterprise pipelines require extracting every piece of visible text from documents, including navigation menus, footers, captions, tables, figure titles, and metadata fields. Capturing all visible text is essential for traceable, auditable corpora where any omission could lead to information loss or compliance gaps.

Spark NLP Solution

To clean text extracted from HTML with Spark NLP, combine the following annotators:

from sparknlp.reader.reader2doc import Reader2Doc
from sparknlp.annotator import DocumentNormalizer, SentenceDetectorDLModel
from pyspark.ml import Pipeline

# `directory` points to the folder of HTML files to ingest
reader2doc = Reader2Doc() \
    .setContentType('text/html') \
    .setContentPath(directory) \
    .setOutputCol('document')

normalizer = DocumentNormalizer() \
    .setInputCols(['document']) \
    .setOutputCol('normalized') \
    .setAutoMode("HTML_CLEAN") \
    .setPatterns([":"])

sentence_detector = SentenceDetectorDLModel.pretrained() \
    .setInputCols(['normalized']) \
    .setOutputCol('sentences') \
    .setExplodeSentences(True)

pipeline = Pipeline(stages=[reader2doc, normalizer, sentence_detector])

# Reader-based pipelines are fitted on an empty DataFrame; the readers pull
# their input from setContentPath rather than from the DataFrame itself
model = pipeline.fit(empty_df)
result_df = model.transform(empty_df)

Benefits

  • Complete coverage: Extracts the full visible text layer without filtering
  • Scalable processing: Built on Apache Spark for distributed processing
  • Unified pipeline: Flows directly into tokenizers, embeddings, and NLP models
  • Traceability: Maintains metadata (source path, page number, character offsets)

Use Cases: Enterprise-scale ingestion, full-text indexing, document alignment, compliance auditing

Maintaining Structural Context for Data-Rich Documents

Problem

In healthcare, finance, insurance, and legal domains, critical insights are embedded in structured elements like tables and figures. Without preservation of structural context (headers, captions, section hierarchy), downstream NLP systems struggle to interpret the extracted information.

Spark NLP Solution

from sparknlp.reader.reader2table import Reader2Table
from pyspark.ml import Pipeline

reader2table = Reader2Table() \
    .setContentType('text/html') \
    .setContentPath('html_docs/EHR-2025-12-000002.html') \
    .setOutputCol('table') \
    .setExplodeDocs(True)

pipeline = Pipeline(stages=[reader2table])

model = pipeline.fit(empty_df)
result_df = model.transform(empty_df)

JSON Output (structured data)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                                                                                                                                                                         |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{"caption":"","header":["Test","Result","Units","Reference Range","Status"],"rows":[["PSA","0.32","ng/mL","0-4.0","Excellent"],["Testosterone","125","ng/dL","300-1000","Recovering"],["Hemoglobin","14.3","g/dL","13.5-17.5","Normal"],["WBC","7.2","K/uL","4.5-11.0","Normal"],["Creatinine","0.9","mg/dL","0.7-1.3","Normal"],["ALT","22","U/L","7-56","Normal"]]}]                                                                        |
|[{"caption":"","header":["Test","Result","Units","Reference Range","Status"],"rows":[["Testosterone","105","ng/dL","300-1000","Recovering"],["Hemoglobin","12.3","g/dL","13.5-17.5","Normal"],["Creatinine","0.7","mg/dL","0.7-1.3","Normal"]]}]                                                                                                                                                                                               |
|[{"caption":"","header":["Medication","Dose","Frequency","Indication","Status"],"rows":[["Atorvastatin (Lipitor)","10 mg PO","Daily","Hyperlipidemia","Active"],["Aspirin","81 mg PO","Daily","Cardiovascular prophylaxis","Active"],["Vitamin D3","2000 IU PO","Daily","Bone health","Active"],["Calcium carbonate","500 mg PO","BID","Bone health (post-ADT)","Active"],["Multivitamin","1 tab PO","Daily","Nutritional support","Active"]]}]|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

HTML Output (for rendering)

reader2table = Reader2Table() \
    .setContentType('text/html') \
    .setContentPath('html_docs/EHR-2025-12-000002.html') \
    .setOutputCol('table') \
    .setOutputFormat('html-table') \
    .setExplodeDocs(True)

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[<table class="lab-table"><thead><tr><th>Test</th><th>Result</th><th>Units</th><th>Reference Range</th><th>Status</th></tr></thead><tbody><tr><td>PSA</td><td><strong>0.32</strong></td><td>ng/mL</td><td>0-4.0</td><td><span class="status-badge status-active">Excellent</span></td></tr><tr><td>Testosterone</td><td><strong>125</strong></td><td>ng/dL</td><td>300-1000</td><td><span class="status-badge status-completed">Recovering</span></td></tr><tr><td>Hemoglobin</td><td><strong>14.3</strong></td><td>g/dL</td><td>13.5-17.5</td><td><span class="status-badge status-active">Normal</span></td></tr><tr><td>WBC</td><td><strong>7.2</strong></td><td>K/uL</td><td>4.5-11.0</td><td><span class="status-badge status-active">Normal</span></td></tr><tr><td>Creatinine</td><td><strong>0.9</strong></td><td>mg/dL</td><td>0.7-1.3</td><td><span class="status-badge status-active">Normal</span></td></tr><tr><td>ALT</td><td><strong>22</strong></td><td>U/L</td><td>7-56</td><td><span class="status-badge status-active">Normal</span></td></tr></tbody></table>]|
|[<table class="lab-table"><thead><tr><th>Test</th><th>Result</th><th>Units</th><th>Reference Range</th><th>Status</th></tr></thead><tbody><tr><td>Testosterone</td><td><strong>105</strong></td><td>ng/dL</td><td>300-1000</td><td><span class="status-badge status-completed">Recovering</span></td></tr><tr><td>Hemoglobin</td><td><strong>12.3</strong></td><td>g/dL</td><td>13.5-17.5</td><td><span class="status-badge status-active">Normal</span></td></tr><tr><td>Creatinine</td><td><strong>0.7</strong></td><td>mg/dL</td><td>0.7-1.3</td><td><span class="status-badge status-active">Normal</span></td></tr></tbody></table>]                                                                                                                                                                                                                                                                                                                                                                                                                                         |
|[<table class="lab-table"><thead><tr><th>Medication</th><th>Dose</th><th>Frequency</th><th>Indication</th><th>Status</th></tr></thead><tbody><tr><td>Atorvastatin (Lipitor)</td><td>10 mg PO</td><td>Daily</td><td>Hyperlipidemia</td><td><span class="status-badge status-active">Active</span></td></tr><tr><td>Aspirin</td><td>81 mg PO</td><td>Daily</td><td>Cardiovascular prophylaxis</td><td><span class="status-badge status-active">Active</span></td></tr><tr><td>Vitamin D3</td><td>2000 IU PO</td><td>Daily</td><td>Bone health</td><td><span class="status-badge status-active">Active</span></td></tr><tr><td>Calcium carbonate</td><td>500 mg PO</td><td>BID</td><td>Bone health (post-ADT)</td><td><span class="status-badge status-active">Active</span></td></tr><tr><td>Multivitamin</td><td>1 tab PO</td><td>Daily</td><td>Nutritional support</td><td><span class="status-badge status-active">Active</span></td></tr></tbody></table>]                                                                                                                      |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Metadata Output

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|metadata                                                                                                                                                                                                 |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[orderTableIndex -> 1, nearestHeader -> 🔬 Most Recent Laboratory Results (10/22/2016), pageNumber -> 1, domPath -> /html[1]/body[1]/div[1]/div[3]/div[4]/table[1], elementType -> Table, sentence -> 8] |
|[orderTableIndex -> 2, nearestHeader -> History Laboratory Results (10/22/2016), pageNumber -> 1, domPath -> /html[1]/body[1]/div[1]/div[3]/div[4]/table[2], elementType -> Table, sentence -> 10]       |
|[orderTableIndex -> 1, nearestHeader -> 💊 Current Medications, pageNumber -> 1, domPath -> /html[1]/body[1]/div[1]/div[3]/div[5]/table[1], elementType -> Table, sentence -> 12]                        |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Benefits

This output captures rich structural information alongside extracted content:

  • DOM paths (e.g., /html[1]/body[1]/div[3]/div[5]/table[1]) that identify exactly where in the HTML document a table or image came from.
  • Nearest section header context so that a table is semantically linked to its surrounding narrative (“Laboratory Results”, “Current Medications”, etc.).
  • Order and hierarchy metadata such as orderTableIndex, allowing precise reconstruction of document structure.
  • A structured JSON representation of tables (with headers, rows, captions, and field metadata).
  • An HTML representation for visualization, rendering, or further processing.
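
The JSON table output shown above can be consumed downstream with ordinary tooling. A minimal sketch (plain Python, no Spark required) that flattens one such payload into one dictionary per row; the `table_to_records` helper is illustrative, not part of Spark NLP:

```python
import json

# A table payload in the same shape as the JSON output above.
table_json = '{"caption":"","header":["Test","Result","Units"],"rows":[["PSA","0.32","ng/mL"],["ALT","22","U/L"]]}'

def table_to_records(payload: str) -> list:
    """Flatten a Reader2Table-style JSON table into one dict per row."""
    table = json.loads(payload)
    return [dict(zip(table["header"], row)) for row in table["rows"]]

records = table_to_records(table_json)
# Each row is now keyed by its column header.
print(records[0])  # {'Test': 'PSA', 'Result': '0.32', 'Units': 'ng/mL'}
```

From here, the records can feed table-aware QA models, knowledge-graph loaders, or plain DataFrame analytics.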

Use Cases: These enriched representations help downstream NLP tasks such as:

  • Table-aware question answering: Models like TAPAS leverage structured table data to answer natural language questions over tables with high accuracy, something that plain text extraction alone cannot support.
  • Contextual table interpretation: Structural metadata enables models to understand why a table occurs where it does, improving joint inference between narrative text and tabular data, which is known to boost extraction quality when the context is considered.
  • Semantic integration with knowledge graphs and IE systems: By preserving layout and section cues, extracted table data can be merged into structured knowledge representations with clear provenance.

Processing Millions of Documents Efficiently

Problem

Modern organizations must process massive volumes of unstructured documents, including PDFs, HTML pages, contracts, medical records, and regulatory filings. These documents often number in the millions and arrive continuously through ingestion pipelines and compliance workflows.

Why traditional approaches fail

Single-machine or sequential processing does not scale with growing data volumes. Processing times increase as files are handled one by one, pipelines become fragile and fail mid-run, infrastructure costs rise due to inefficient resource usage, and NLP workflows become harder to scale as tokenization, NER, and classification are added.

For data engineering teams supporting real-time compliance, analytics, and document intelligence, this is no longer a minor inefficiency. It becomes a core scalability bottleneck that limits reliability and impact.

Spark NLP Solution

Spark NLP is built natively on Apache Spark, bringing distributed data processing to text analytics and NLP workloads. This enables text extraction, normalization, and NLP tasks to run in parallel across clusters, allowing millions of documents to be processed efficiently, reproducibly, and at scale.

from sparknlp.reader.reader_assembler import ReaderAssembler
from pyspark.ml import Pipeline

# Ingest all files under `directory` using ReaderAssembler
reader_assembler = ReaderAssembler() \
    .setContentPath(directory) \
    .setOutputCol("document")

pipeline = Pipeline(stages=[reader_assembler])
model = pipeline.fit(empty_df)

# Process and save the extracted text as Parquet under the `output` path
df = model.transform(empty_df)
df.select("document_text.result").write.mode("overwrite").parquet(output)

Benchmark Results

Processing 60 mixed-format documents on a single machine:

  • Spark NLP achieved ~2× faster throughput than sequential processing
  • Automatic parallelization across all available CPU cores
  • Same pipeline scales to Spark clusters with linear scalability

Benefits

  • Scalable architecture: Workloads partitioned across Spark executors
  • Fault tolerance: Automatic checkpointing and resilient distributed datasets (RDDs)
  • Unified pipeline integration: Ingestion, extraction, tokenization, and NLP in single Spark job
  • Operational efficiency: Designed for terabytes of daily data

Use Cases: Enterprise document ingestion, batch processing, compliance workflows, large-scale ETL

Document Format Support

Spark NLP provides comprehensive support for common document formats through its Reader2X components.

Supported Formats

| Format           | Spark NLP Components                   | Description                                           |
|------------------|----------------------------------------|-------------------------------------------------------|
| PDF              | Reader2Doc, Reader2Table, Reader2Image | Extracts text and images; handles complex layouts     |
| HTML             | Reader2Doc, Reader2Table, Reader2Image | Parses structure, extracts tables, preserves DOM context |
| DOCX             | Reader2Doc, Reader2Table, Reader2Image | Supports text, tables, and images from Word documents |
| PPTX             | Reader2Doc, Reader2Table, Reader2Image | Extracts slide content, notes, tables, and images     |
| XLSX             | Reader2Doc, Reader2Table, Reader2Image | Parses spreadsheets, extracts structured data         |
| CSV              | Reader2Doc, Reader2Table               | Reads tabular data with proper schema                 |
| Email (MSG, EML) | Reader2Doc, Reader2Table, Reader2Image | Parses headers, body, and attachments                 |
| XML              | Reader2Doc, Reader2Table, Reader2Image | Preserves structure, controls tag handling            |
| Markdown         | Reader2Doc, Reader2Table, Reader2Image | Parses text and embedded images                       |
| Plain Text       | Reader2Doc                             | Simple text ingestion                                 |

Data Preparation and Cleaning

Spark NLP’s DocumentNormalizer provides powerful text cleaning and normalization capabilities that scale to large datasets.

Encoding Conversion

normalizer = DocumentNormalizer() \
    .setInputCols(['document']) \
    .setOutputCol('normalized') \
    .setEncoding("UTF-8")
  • Converts byte strings into text strings
  • Fully Spark-native for distributed processing
  • Integrates into NLP pipelines via DocumentNormalizer
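
At its core, encoding conversion is byte decoding. A standalone illustration of what setting UTF-8 decoding accomplishes on raw bytes (plain Python, not the Spark NLP implementation):

```python
# UTF-8 byte sequences for accented and currency characters.
raw = b"caf\xc3\xa9 \xe2\x82\xac5"

# Decode bytes into a Python text string; errors="replace" would substitute
# a placeholder for any malformed byte sequences instead of raising.
text = raw.decode("utf-8")
print(text)  # café €5
```

Running the same decode with the wrong codec (e.g. latin-1) is what produces mojibake like "Ă—" for "×", which is why declaring the encoding up front matters.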

Remove Non-ASCII Characters

normalizer = DocumentNormalizer() \
    .setInputCols(['document']) \
    .setOutputCol('normalized') \
    .setPresetPattern('CLEAN_NON_ASCII')
    # OR
    # .setAutoMode('HTML_CLEAN')

Clean Bullets and Dashes

normalizer = DocumentNormalizer() \
    .setInputCols(['document']) \
    .setOutputCol('normalized') \
    .setPresetPattern('CLEAN_BULLETS')
    # OR
    # .setAutoMode('DOCUMENT_CLEAN')

Clean Ordered Bullets

normalizer = DocumentNormalizer() \
    .setInputCols(['document']) \
    .setOutputCol('normalized') \
    .setPresetPattern('CLEAN_ORDERED_BULLETS')
    # OR
    # .setAutoMode('DOCUMENT_CLEAN')
  • Explicit support for ordered list bullets (1., 2., a., b., etc.)
  • Implemented via preset patterns
  • Can be composed with other document cleaners
  • Clear semantics for removing enumerated list markers
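
The preset's exact regex is internal to Spark NLP; as an illustration only, a hypothetical stand-in for the kind of marker CLEAN_ORDERED_BULLETS removes can be expressed with Python's re module:

```python
import re

# Illustrative pattern: enumerated list markers such as "1.", "a.", "iv."
# at the start of a line, followed by whitespace.
ORDERED_BULLET = re.compile(r"^\s*(?:\d+|[a-z]|[ivx]+)\.\s+", re.MULTILINE)

text = "1. First finding\na. Sub-item\nPlain sentence."
cleaned = ORDERED_BULLET.sub("", text)
print(cleaned)  # First finding\nSub-item\nPlain sentence.
```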

Remove Punctuation

normalizer = DocumentNormalizer() \
    .setInputCols(['document']) \
    .setOutputCol('normalized') \
    .setPresetPattern('REMOVE_PUNCTUATION')
    # OR
    # .setAutoMode('SOCIAL_CLEAN')

Custom Pattern Cleaning

# Remove specific prefix patterns
normalizer = DocumentNormalizer() \
    .setPatterns(["(?i)^(SUMMARY|DESCRIPTION):"]) \
    .setAction("clean") \
    .setReplacement(" ") \
    .setPolicy("pretty_all")

# Remove postfix patterns
normalizer = DocumentNormalizer() \
    .setPatterns(["(?i)(END|STOP)$"]) \
    .setAction("clean") \
    .setReplacement(" ") \
    .setPolicy("pretty_all")
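
In plain-regex terms, the two patterns above behave like the following re.sub calls (a sketch of the matching logic only, not the annotator's internals):

```python
import re

# Case-insensitive prefix and postfix patterns, as in the snippets above.
prefix = re.compile(r"(?i)^(SUMMARY|DESCRIPTION):")
postfix = re.compile(r"(?i)(END|STOP)$")

text = "SUMMARY: patient stable END"
text = prefix.sub(" ", text)            # strip the leading label
text = postfix.sub(" ", text).strip()   # strip the trailing marker
print(text)  # patient stable
```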

Text Translation

from sparknlp.annotator import MarianTransformer

# German to English translation
translator = MarianTransformer.pretrained("opus_mt_de_en", "xx") \
    .setInputCols(["sentence"]) \
    .setOutputCol("translation")

# French to English translation
translator = MarianTransformer.pretrained("opus_mt_fr_en", "xx") \
    .setInputCols(["sentence"]) \
    .setOutputCol("translation")

# Spanish to English translation
translator = MarianTransformer.pretrained("opus_mt_es_en", "xx") \
    .setInputCols(["sentence"]) \
    .setOutputCol("translation")
  • Uses neural machine translation (MarianTransformer)
  • Supports many language pairs (200+ models available)
  • Production-grade and scalable across Spark clusters
  • Higher translation quality than rule-based approaches
  • GPU acceleration recommended for large-scale processing

Auto Modes

DocumentNormalizer provides preset cleaning modes for common scenarios:

| Auto Mode      | Purpose                   | Includes                                              |
|----------------|---------------------------|-------------------------------------------------------|
| HTML_CLEAN     | Clean HTML content        | Remove HTML tags, clean non-ASCII, normalize Unicode  |
| DOCUMENT_CLEAN | General document cleaning | Clean bullets, dashes, trailing punctuation           |
| SOCIAL_CLEAN   | Social media text         | Remove punctuation, normalize social media patterns   |
| LIGHT_CLEAN    | Minimal cleaning          | Clean trailing punctuation only                       |

Entity Extraction

Spark NLP provides token-aware entity extraction that scales to large datasets.

Date Extraction

from sparknlp.annotator import DateMatcher

date_matcher = DateMatcher() \
    .setInputCols(['document']) \
    .setOutputCol('date') \
    .setOutputFormat("yyyy-MM-dd HH:mm:ss")
  • Handles relaxed and relative dates
  • Normalized output format
  • Semantic date parsing
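
The format string follows Java SimpleDateFormat conventions; for orientation, the Python strftime equivalent of "yyyy-MM-dd HH:mm:ss" produces the same normalized shape:

```python
from datetime import datetime

# Java "yyyy-MM-dd HH:mm:ss" corresponds to this strftime pattern.
dt = datetime(2016, 10, 22, 9, 30, 0)
print(dt.strftime("%Y-%m-%d %H:%M:%S"))  # 2016-10-22 09:30:00
```

Normalizing every extracted date to one canonical string makes downstream sorting, joining, and deduplication straightforward.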

Email and Contact Extraction

from sparknlp.annotator import EntityRulerModel

# Extract email addresses
# Extract email addresses
entity_ruler = EntityRulerModel.pretrained() \
    .setAutoMode("EMAIL_ENTITIES")

# Extract phone numbers
entity_ruler = EntityRulerModel.pretrained() \
    .setAutoMode("CONTACT_ENTITIES")

# Extract IP addresses
entity_ruler = EntityRulerModel.pretrained() \
    .setAutoMode("NETWORK_ENTITIES")

# Extract hostnames and IP address labels
entity_ruler = EntityRulerModel.pretrained() \
    .setAutoMode("NETWORK_ENTITIES") \
    .setRegexEntities([
        "IP_ADDRESS_PATTERN",
        "HOSTNAME_PATTERN"
    ])
  • Token-based extraction with offsets
  • Can be combined with other communication entities
  • Production-ready for network entity extraction
  • Supports both IP addresses and associated hostnames/labels
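
The named patterns above (IP_ADDRESS_PATTERN, HOSTNAME_PATTERN) are bundled with Spark NLP and their exact definitions are internal; a deliberately simplified, hypothetical IPv4 matcher shows the kind of regex involved:

```python
import re

# Simplified illustration; the bundled IP_ADDRESS_PATTERN is more thorough
# (this one accepts octets above 255, for instance).
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

log = "Connection from 10.0.0.12 to host db01 at 192.168.1.255 refused."
print(IPV4.findall(log))  # ['10.0.0.12', '192.168.1.255']
```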

Custom Entity Patterns

entity_ruler = EntityRulerModel.pretrained() \
    .setRegexEntities([
        "EMAIL_ADDRESS_PATTERN",
        "US_PHONE_NUMBERS_PATTERN",
        "MAPI_ID_PATTERN"
    ])
  • Token-aware extraction with offsets and metadata
  • Integrates with other entity extraction pipelines
  • Scales efficiently to large corpora
  • Provides rich annotation metadata

Text Chunking

Spark NLP provides flexible chunking strategies for preparing text for downstream processing.

Character-Based Chunking

from sparknlp.annotator import DocumentCharacterTextSplitter

splitter = DocumentCharacterTextSplitter() \
    .setInputCols(["document"]) \
    .setOutputCol("chunks") \
    .setChunkSize(1000) \
    .setChunkOverlap(100) \
    .setExplodeSplits(True)
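
Conceptually, character splitting with overlap is a sliding window that steps forward by chunk size minus overlap. A sketch of that logic only (DocumentCharacterTextSplitter additionally honors split patterns and boundaries):

```python
def chunk_chars(text: str, size: int, overlap: int) -> list:
    """Slide a fixed-size window over text, stepping size - overlap chars."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

chunks = chunk_chars("abcdefghij", size=4, overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

The overlap ensures that a sentence cut at one chunk boundary is still fully visible at the start of the next chunk.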

Token-Based Chunking

from sparknlp.annotator import DocumentTokenSplitter

splitter = DocumentTokenSplitter() \
    .setInputCols(["document"]) \
    .setOutputCol("chunks") \
    .setNumTokens(512) \
    .setTokenOverlap(50) \
    .setExplodeSplits(True)
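
The token-based variant applies the same windowing over token counts rather than characters. A whitespace-token sketch of that idea (illustrative only; the annotator's tokenization is more sophisticated):

```python
def chunk_tokens(text: str, num_tokens: int, overlap: int) -> list:
    """Window over whitespace tokens, stepping num_tokens - overlap."""
    tokens = text.split()
    step = num_tokens - overlap
    return [" ".join(tokens[i:i + num_tokens])
            for i in range(0, len(tokens), step)]

print(chunk_tokens("a b c d e f", num_tokens=4, overlap=2))
# ['a b c d', 'c d e f', 'e f']
```

Counting tokens rather than characters keeps chunk sizes aligned with model context windows, which are token-denominated.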

Features

  • Configurable split patterns with regex support
  • Control over overlap between chunks
  • Can preserve or remove separators
  • Explode chunks to rows for parallelism
  • Deterministic behavior for reproducibility

Use Cases: LLM context preparation, semantic search indexing, document summarization

Comparison: Spark NLP vs Other Frameworks

Architecture Philosophy

Spark NLP:

  • Built natively on Apache Spark for distributed processing
  • Explicit reader separation (Doc / Table / Image)
  • Strong typing of outputs
  • Designed for large-scale production pipelines

Other Frameworks:

  • Typically single-node Python libraries
  • Unified API across file types with automatic inference
  • Focus on simplicity and ease of use
  • Better for small to medium datasets

Spark NLP vs Unstructured.io: Practical Trade-offs

| Aspect                | Spark NLP                  | Unstructured.io                  |
|-----------------------|----------------------------|----------------------------------|
| Processing Model      | Distributed (Spark)        | Single-node                      |
| Scalability           | Linear with cluster size   | Limited to single machine        |
| Text Coverage         | Complete extraction        | Semantic filtering applied       |
| Structural Context    | Full DOM paths and metadata | Limited context preservation    |
| Performance (60 docs) | ~2× faster                 | Baseline                         |
| API Complexity        | More configuration         | Simpler API                      |
| Pipeline Integration  | Native Spark integration   | Requires external orchestration  |
| Use Case              | Enterprise scale, compliance | Prototyping, small datasets    |

Check the full comparison in this blog post: Evaluating Document AI Frameworks: Spark NLP vs Unstructured for Large-Scale Text Processing

When to Use Spark NLP

Choose Spark NLP when you need:

  • Processing millions of documents
  • Complete text extraction without filtering
  • Rich structural and positional metadata
  • Integration with existing Spark/Hadoop infrastructure
  • Distributed processing and fault tolerance
  • Production-grade scalability and reliability
  • Traceable, auditable document processing

When to Consider Alternatives

Consider lighter frameworks when:

  • Processing small datasets (< 1000 documents)
  • Prototyping or exploratory analysis
  • No existing Spark infrastructure
  • Semantic content extraction preferred over completeness
  • Simple API more important than configurability