Free · Private · Browser-Based

Convert PDF to XML
Instantly & Free

Upload any PDF and get a clean, structured XML file in seconds. Your files never leave your device: 100% private, zero server uploads.

No signup needed · No file uploads · 100% free forever
500K+ Files Converted
100% Free Forever
0 KB Uploaded to Servers
50+ PDF Formats Supported

PDF to XML Converter Tool

Upload your PDF below. All processing happens locally in your browser, so your file stays completely private.

Drag & Drop your PDF here

or click to browse · Supports all PDF versions · Max 50 MB

🔒 100% Private: All PDF processing uses PDF.js in your browser. Your file is never sent to any server and no one can access its contents.
Learn

What Is PDF to XML Conversion?

A complete guide to understanding why and how to convert PDF documents to structured XML format.

PDF (Portable Document Format) is a fixed-layout file format designed for consistent document presentation across devices and operating systems. While it is excellent for sharing finalized documents, extracting and manipulating data from a PDF programmatically is notoriously difficult.

XML (eXtensible Markup Language) is a flexible, structured text format designed for data storage and exchange. Unlike PDF, XML is hierarchical, human-readable, and parseable by any programming language or database system.

PDF to XML conversion extracts the text, structure, and metadata from a PDF file and represents it as well-formed XML, making the data machine-readable, searchable, and importable into virtually any application or workflow.

Benefits of Converting PDF to XML

  • Data Interoperability: XML is a universal exchange format supported by every programming language and enterprise system.
  • Full-Text Search: XML content is easily indexed by search engines and internal search systems, unlike static PDFs.
  • Automated Processing: Extract specific data fields using XPath or XSLT for business workflow automation.
  • Database Import: Directly import structured XML data into SQL, NoSQL, or any database system.
  • API Integration: Use XML output as input for REST APIs, SOAP services, or any web service.
  • Compliance & Archiving: Many regulatory frameworks require structured, machine-readable formats.
  • Format Conversion: Use XML as an intermediate to convert further to JSON, CSV, Excel, or HTML.

Common Use Cases

🏦

Finance & Banking

Extract financial statements, invoices, and bank statements for automated accounting and reconciliation workflows.

⚖️

Legal & Compliance

Convert contracts and regulatory documents to XML for searchable archives and e-discovery systems.

🏥

Healthcare

Transform medical records and lab reports into structured XML for EHR integration (HL7, FHIR).

📚

Publishing & Education

Convert textbooks and research papers to XML for e-learning platforms and digital publishing.

🏭

Manufacturing

Extract product specs and technical manuals from PDF to XML for ERP and PLM system integration.

🏛️

Government

Digitize public records from PDF to structured XML for open data portals and citizen services.

Guide

How to Convert PDF to XML

Our free converter is designed to be simple. Here's exactly how it works in 5 steps:

Upload Your PDF File

Click "Browse File" or drag and drop your PDF into the upload area. We support all standard PDF versions. Your file is processed entirely on your device and never uploaded to any server.

Click "Convert to XML"

Press Convert. Our tool uses PDF.js, the engine behind Firefox's built-in PDF viewer, to parse your PDF, extract text, detect headings and paragraphs, and build a well-structured XML document with full metadata.

Preview the XML Output

The generated XML appears in a live preview on the page. Scroll through it to verify structure and content before downloading.

Download or Copy Your XML

Click "Download XML" to save as a .xml file, or "Copy XML" to copy the entire output to your clipboard for immediate use in your editor or application.

Optional: Convert XML to JSON

Need your data in JSON format instead? Scroll down to our XML to JSON Converter and paste your XML there for clean, formatted JSON output instantly.

Features

Why Choose Our Converter?

Built for developers, analysts, and business users who need reliable PDF to XML conversion without the complexity.

100% Private & Secure

All processing happens in your browser via JavaScript. Your files are never sent anywhere. Zero data retention, guaranteed.

Lightning Fast

Client-side processing means no upload wait times. Large multi-page PDFs convert in seconds without any network latency.

Structured XML Output

Smart paragraph and heading detection produces clean, hierarchical XML with metadata, not just a wall of raw text.

Multi-Page Support

Handles PDFs of any length. Each page is a separate XML element with page number, dimensions, and extracted content.

No Installation

Works directly in Chrome, Firefox, Safari, and Edge. No plugins, no extensions, no account, no installation required.

Completely Free

No hidden fees, no premium tiers, no watermarks on output. Convert unlimited PDFs for free, forever.

Bonus Tool

XML to JSON Converter

Need your XML in JSON format? Paste any valid XML and convert it to clean, formatted JSON instantly, entirely in your browser.


Ready to Convert Your PDF?

Join hundreds of thousands of users who trust our free PDF to XML converter. No signup, no payment, no limits.

FAQ

Frequently Asked Questions

Everything you need to know about our free PDF to XML converter and XML to JSON tool.

Is the PDF to XML converter really free?

Yes, completely free. No hidden fees, no subscription tiers, no usage limits. Convert as many PDF files to XML as you need, forever, at no cost. Our tool is sustained by responsible, non-intrusive advertising.

Is my data safe?

Yes. All PDF parsing and XML conversion happens locally in your browser using PDF.js. Your file is never sent to any server, so no one else can see or access the contents of your documents. This makes our tool well suited to sensitive or confidential files.

Which types of PDF are supported?

Our converter supports all standard text-based PDF files: reports, invoices, contracts, academic papers, and more. Note that scanned image-only PDFs may not yield full text results, as they require OCR processing. PDFs created by Word, Excel, PowerPoint, or any standard PDF generator work best.

What does the XML output look like?

The output is a structured document with a root <document> element containing <metadata> (filename, page count, conversion date) and <pages>. Each page is a <page> element with a number attribute, containing <heading> and <paragraph> elements extracted from the PDF content. You can preview the output before downloading.

Can I convert the XML output to JSON?

Yes! We include a free XML to JSON converter on this same page. Just paste your XML into the input field, click "Convert to JSON", and get clean formatted JSON output instantly. This is useful for converting the XML output of our PDF converter into JSON for use with REST APIs and modern JavaScript applications.

Does the converter work on mobile devices?

Yes. Our website and tools are fully responsive and work on smartphones and tablets as well as desktop computers. You can upload a PDF from your phone's storage and download the XML directly to your device.

Is there a file size or page limit?

Since processing is done in your browser, the practical limit depends on your device's memory. PDFs up to 50 MB and several hundred pages convert without issues on modern devices. Very large PDFs may take longer on devices with limited RAM.

Do I need to install anything?

No installation required. Our converter works directly in your web browser: Chrome, Firefox, Safari, or Edge. No plugins, no extensions, no account, no download needed. Simply open the page and start converting.

Our Story

We Build Tools That Respect Your Privacy

PDFtoXML.tools was founded on a simple belief: powerful data conversion tools should be free, private, and accessible to everyone, not hidden behind paywalls or data-harvesting sign-up forms.

We are a small team of developers and data engineers who were frustrated with existing PDF conversion tools that either cost money, uploaded your sensitive files to third-party servers, or produced messy, unusable output. So we built our own.

To date, PDFtoXML.tools has handled over 500,000 conversions, and every single one of them happened entirely in the user's browser. No server ever sees your documents.

"Data extraction shouldn't require giving up your own data."

- The PDFtoXML.tools Team


2022 Founded
3 Core Team Members
500K+ Conversions Served
0 KB Your Data on Our Servers
Values

What We Stand For

Four principles guide every decision we make about our tools and our product.

🔒

Privacy First

Our tools run entirely in the browser. Your files are never sent to our servers, because we don't have servers that receive your files. What you process stays on your device, period.

🆓

Free for Everyone

We believe quality tools shouldn't be gated behind a credit card. Our core tools are and will remain free. We sustain the project through responsible, non-intrusive advertising.

⚑

Performance Matters

We're obsessed with speed. Client-side processing means zero upload wait times. We use best-in-class libraries like PDF.js, the same engine that powers Firefox's built-in PDF viewer.

🌍

Open & Transparent

We're honest about what our tools can and cannot do. Our privacy policy is written in plain English. We don't track your conversions and we don't sell your data. Ever.

Team

The People Behind the Tool

A small, focused team of engineers and product thinkers who care deeply about doing things right.


Alex Rivera

Co-Founder & Lead Engineer

10+ years in data engineering. Previously built PDF parsing systems for Fortune 500 companies before going independent in 2020.


Sarah Kim

Co-Founder & Product

Former UX researcher at a major browser company. Passionate about making complex developer tools accessible and beautiful for everyone.


Marcus Chen

Frontend Developer

Specialist in browser-based processing and performance optimization. Active contributor to several open-source document parsing projects.


Ready to Convert Your First PDF?

Experience privacy-first PDF to XML conversion. No signup. No upload. Just results.

Get in Touch

Contact Us

Have a question, found a bug, or want to suggest a feature? We read every message and aim to respond within 1–2 business days.

Ways to Reach Us

Email Us

hello@pdftoxml.tools

Response Time

1–2 business days

Location

Remote, Worldwide

Common topics: Bug reports, feature requests, business inquiries, privacy concerns, general feedback.


Our Blog

Insights on PDF, XML & Data Formats

Guides, tutorials, and deep dives into PDF conversion, XML processing, data extraction, and developer tools.

Latest Articles

🔄
How-To · January 22, 2025 · 6 min read

XML to JSON: When to Use Each Format and How to Convert Between Them

Both XML and JSON are data exchange formats, but they shine in different contexts. Learn when to choose XML over JSON, when to do the opposite, and how to convert between them instantly.

🔒
Privacy · February 3, 2025 · 5 min read

Why Browser-Based File Processing Is the Future of Privacy-First Tools

When you upload a file to a web service, you're trusting them with your data. But what if the tool never received your file at all? We explore the architecture behind truly private, client-side processing.

⚡
Deep Dive · February 14, 2025 · 10 min read

How PDF.js Works: A Technical Deep Dive into Browser-Based PDF Parsing

PDF.js is the open-source library that powers Firefox's built-in PDF viewer and our converter. In this technical deep dive, we explore how it parses PDF files, extracts text, and handles complex document structures.

🏦
Use Cases · February 28, 2025 · 7 min read

5 Real-World Use Cases for PDF to XML Conversion in Business

From automating invoice processing to integrating with ERP systems, converting PDFs to structured XML unlocks powerful business automation. Here are five real-world scenarios where it makes a measurable difference.

📊
Tutorial · March 10, 2025 · 9 min read

Working with XML Data: XPath, XSLT, and How to Query Your Converted Files

Once you have your PDF converted to XML, how do you actually use it? This tutorial covers XPath expressions, XSLT transformations, and practical examples for querying and transforming your converted XML data.

🛠️
Tools · March 22, 2025 · 4 min read

PDF Conversion Tools Compared: Online vs Desktop vs Browser-Based

There are three main categories of PDF conversion tools: online services, desktop software, and browser-based tools. Each has trade-offs around privacy, cost, performance, and convenience. Here's how they stack up.

Newsletter

Stay Updated

Get new articles, tool updates, and developer tips delivered to your inbox. No spam, unsubscribe anytime.

The Complete Guide to PDF to XML Conversion: Everything You Need to Know

PDF files are everywhere, but they're notoriously hard to work with programmatically. This guide covers everything you need to know about converting PDF to XML.

📄

Why PDF Files Are Difficult to Work With

The PDF format was designed with one goal in mind: to make documents look identical on every device and operating system. It achieves this brilliantly. But this focus on visual fidelity comes at a significant cost for anyone who wants to extract, manipulate, or programmatically process the data inside a PDF.

Unlike a Word document or an HTML file, a PDF doesn't think in terms of "paragraphs" or "headings" or "sentences." It thinks in terms of individual text characters, positioned precisely at specific coordinates on a page. To a PDF, the heading "Annual Report 2024" is not a heading; it's a collection of characters at specific X,Y positions, rendered in a particular font and size.

A PDF knows where to paint each character. It has no idea what the characters mean together.

This is why extracting text from PDFs is notoriously tricky, and why tools that do it well (like PDF.js, which powers our converter) are genuinely impressive pieces of engineering.

What Is XML and Why Convert to It?

XML (eXtensible Markup Language) is a text-based format for storing and exchanging structured data. If you've ever worked with HTML, XML will look familiar β€” it uses the same angle-bracket tag syntax. The key difference is that XML doesn't have predefined tags. You define the structure yourself.

When we convert a PDF to XML, we're taking the visual, position-based data in the PDF and restructuring it into a logical, hierarchical form that any application can understand and work with. The result looks something like this:

<?xml version="1.0" encoding="UTF-8"?>
<document>
  <metadata>
    <filename>report.pdf</filename>
    <totalPages>12</totalPages>
  </metadata>
  <pages>
    <page number="1" width="612" height="792">
      <heading>Q4 Financial Summary</heading>
      <paragraph>Revenue for the quarter reached $4.2M...</paragraph>
    </page>
  </pages>
</document>
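
As a rough illustration of how this structure can be consumed downstream, the sketch below reads the sample above with Python's standard library and flattens it into CSV rows. The function names `xml_to_rows` and `rows_to_csv` are hypothetical helpers for this example, and the element names are assumed to match the sample exactly.

```python
import csv
import io
import xml.etree.ElementTree as ET

SAMPLE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<document>
  <metadata>
    <filename>report.pdf</filename>
    <totalPages>12</totalPages>
  </metadata>
  <pages>
    <page number="1" width="612" height="792">
      <heading>Q4 Financial Summary</heading>
      <paragraph>Revenue for the quarter reached $4.2M...</paragraph>
    </page>
  </pages>
</document>
"""


def xml_to_rows(xml_text):
    """Return one (page_number, element_type, text) tuple per heading or paragraph."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    rows = []
    for page in root.iter("page"):
        for child in page:
            if child.tag in ("heading", "paragraph"):
                text = (child.text or "").strip()
                rows.append((int(page.get("number")), child.tag, text))
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["page", "type", "text"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = xml_to_rows(SAMPLE_XML)
print(rows_to_csv(rows))
```

Keeping the page number on every row makes it easy to trace any extracted value back to its source page.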

How Our Converter Works Under the Hood

Our PDF to XML converter is built on PDF.js, an open-source library originally developed by Mozilla and now used by millions of developers worldwide. It's the same engine that powers Firefox's built-in PDF viewer.

Here's what happens when you click "Convert":

  1. File Reading: The PDF file is read into memory using the browser's File API as an ArrayBuffer.
  2. PDF Parsing: PDF.js parses the binary PDF structure, reading the PDF's internal object graph, page dictionary, and content streams.
  3. Text Extraction: For each page, PDF.js extracts text items, each with a string value, a transform matrix (which carries the X, Y position), a width and height, and a font name.
  4. Grouping & Structure Detection: Our code groups text items by their Y position (items on the same horizontal line), then uses proximity thresholds to group lines into paragraphs. Short lines without terminal punctuation are heuristically classified as headings.
  5. XML Serialization: The structured data is serialized into well-formed XML with proper entity escaping for special characters.
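
Steps 4 and 5 can be sketched in a few lines of Python. This is an illustration of the general idea, not the converter's actual code: the threshold values (`max_gap`, `max_len`), the punctuation set, and the function names are all assumptions made for the example, and the input is assumed to be lines that have already been grouped and sorted top to bottom.

```python
from xml.sax.saxutils import escape


def group_paragraphs(lines, max_gap=16.0):
    """Group (y, text) lines into blocks of text.

    Lines are assumed sorted top to bottom. PDF y coordinates grow upward,
    so y decreases down the page. A vertical gap larger than max_gap
    (an assumed threshold) starts a new block.
    """
    blocks = []
    prev_y = None
    for y, text in lines:
        if prev_y is None or prev_y - y > max_gap:
            blocks.append([text])
        else:
            blocks[-1].append(text)
        prev_y = y
    return [" ".join(block) for block in blocks]


def is_heading(text, max_len=60):
    """Heuristic: short text without terminal punctuation is treated as a heading."""
    return len(text) <= max_len and not text.rstrip().endswith((".", "!", "?", ":", ";", ","))


def page_to_xml(number, lines):
    """Serialize one page, escaping &, < and > in the text content."""
    parts = [f'<page number="{number}">']
    for block in group_paragraphs(lines):
        tag = "heading" if is_heading(block) else "paragraph"
        parts.append(f"  <{tag}>{escape(block)}</{tag}>")
    parts.append("</page>")
    return "\n".join(parts)


lines = [
    (720.0, "Q4 Financial Summary"),
    (690.0, "Revenue & costs for the quarter were in line"),
    (676.0, "with guidance."),
]
print(page_to_xml(1, lines))
```

Heuristics like these are cheap and work on many documents, but they misclassify short sentences and long titles, which is one reason to preview the output before relying on it.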

Best Practices for Working with Converted XML

  • Always preview before downloading: the live preview lets you verify the extraction quality before committing to the output.
  • Use text-based PDFs, not scanned images: our tool extracts embedded text. Scanned documents need OCR first.
  • Validate the XML: paste the output into an XML validator to confirm it's well-formed before feeding it into other systems.
  • Use XPath for targeted extraction: once you have the XML, use XPath expressions to extract specific data fields programmatically.
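
For scripted pipelines, the well-formedness check can also be done in code. The following is a minimal sketch using Python's standard library; `check_well_formed` is a hypothetical helper name, and it checks only that the XML parses, not that it follows any particular schema.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, None) if xml_text parses as XML, else (False, error message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return False, str(exc)
    return True, None


# A closed document parses; an unclosed <pages> element does not.
print(check_well_formed("<document><pages/></document>"))
print(check_well_formed("<document><pages></document>"))
```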

XML to JSON: When to Use Each Format and How to Convert Between Them

Both XML and JSON are ubiquitous data formats, but they serve different purposes. Here's a practical guide to choosing the right one, and converting between them when you need to.

🔄

A Tale of Two Formats

XML was born in 1998, emerging as a simplified descendant of SGML, the complex markup language that also gave us HTML. It was designed to be both human-readable and machine-readable, and quickly became the lingua franca of enterprise data exchange, SOAP web services, and configuration files.

JSON arrived in the early 2000s, popularized by Douglas Crockford. It was derived from JavaScript object literal syntax and designed to be even simpler than XML: less verbose, faster to parse, and a natural fit for web APIs and JavaScript applications.

Today, both formats are used extensively. Understanding their strengths helps you choose the right tool for each job.

XML: When to Use It

  • Document-centric data: XML handles mixed content (text interspersed with markup) elegantly, making it ideal for documents like books, articles, and reports.
  • Enterprise and legacy systems: Many ERP, CRM, and healthcare systems exchange data as XML (for example HL7 CDA documents, or the XML representation of HL7 FHIR resources).
  • SOAP APIs: Web services using the SOAP protocol exchange XML messages by definition.
  • When you need validation: XML Schema (XSD) provides powerful, declarative data validation with broad enterprise tooling support. JSON Schema covers similar ground for JSON but is a separate, younger specification.
  • Configuration files: Maven, Ant, Spring, and Android all use XML for configuration because of its expressiveness.

JSON: When to Use It

  • REST APIs: JSON is the de facto standard for modern REST APIs. It's compact, fast to parse, and native to JavaScript.
  • Frontend applications: Browser JavaScript parses JSON natively with JSON.parse, with no additional libraries.
  • NoSQL databases: Document databases such as MongoDB, Firestore, and DynamoDB store and query JSON-like documents natively.
  • Microservices communication: JSON's simplicity and small payload size make it well suited to high-volume inter-service communication.

Side-by-Side Comparison

// Same data in XML:
<users>
  <user id="1">
    <name>Jane Smith</name>
    <email>jane@example.com</email>
    <active>true</active>
  </user>
</users>

// Same data in JSON:
{
  "users": [
    {
      "id": "1",
      "name": "Jane Smith",
      "email": "jane@example.com",
      "active": true
    }
  ]
}

How to Convert XML to JSON Instantly

You don't need a server or a paid tool to convert XML to JSON. Our free XML to JSON converter on the home page handles this entirely in your browser. Simply paste your XML, click convert, and copy the JSON output.

Behind the scenes, it uses the browser's native DOMParser API to parse the XML into a DOM tree, then recursively walks the tree to build a corresponding JavaScript object, handling attributes, text nodes, and repeated elements (converting them to arrays automatically).
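
The same recursive walk can be sketched outside the browser. The Python version below uses the standard library's ElementTree in place of DOMParser. It illustrates the technique rather than reproducing the tool's code: the `@` prefix for attributes and the `#text` key are assumed conventions chosen for this example, and `element_to_obj` and `xml_to_json` are hypothetical names.

```python
import json
import xml.etree.ElementTree as ET


def element_to_obj(el):
    """Recursively convert an Element into JSON-serializable Python data."""
    # Attributes become keys prefixed with "@" (an assumed convention).
    obj = {f"@{name}": value for name, value in el.attrib.items()}
    for child in el:
        value = element_to_obj(child)
        if child.tag in obj:
            # Repeated tag: promote the existing value to a list, then append.
            if not isinstance(obj[child.tag], list):
                obj[child.tag] = [obj[child.tag]]
            obj[child.tag].append(value)
        else:
            obj[child.tag] = value
    text = (el.text or "").strip()
    if not obj:
        return text  # leaf element with no attributes: just its text
    if text:
        obj["#text"] = text  # text alongside attributes or children
    return obj


def xml_to_json(xml_text):
    """Parse XML text and return formatted JSON."""
    root = ET.fromstring(xml_text)
    return json.dumps({root.tag: element_to_obj(root)}, indent=2)


print(xml_to_json(
    '<users><user id="1"><name>Jane Smith</name></user>'
    '<user id="2"><name>Sam Lee</name></user></users>'
))
```

One trade-off of converting repeated elements to arrays automatically is that a single `<user>` yields an object while two or more yield an array, so consumers need to handle both shapes.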


Why Browser-Based File Processing Is the Future of Privacy-First Tools

When you upload a file to a web service, you're trusting them with your data. But what if the tool never received your file at all? We explore the architecture behind truly private, client-side processing.

🔒

The Problem with "Free" Online File Tools

Search for any file conversion tool online and you'll find dozens of "free" services. They all follow the same pattern: you upload your file, wait for it to be processed on their server, and download the result. Simple, right?

But think about what's actually happening. Your file (which might contain confidential financial data, medical records, proprietary business documents, or personal information) is traveling over the internet to a server you know nothing about, being processed by code you can't inspect, stored on hardware you don't control.

Privacy policies that say "we don't store your files" are great. But architecture that makes storing them impossible is better.

How Client-Side Processing Works

Modern browsers are remarkably capable computing environments. Thanks to APIs like the File API, Web Workers, WebAssembly, and libraries like PDF.js, a significant amount of file processing that used to require a server can now happen entirely on your device.

When you use our PDF to XML converter, here's what the network traffic looks like:

  • ✅ Your browser downloads the PDF.js library once (cached for future visits)
  • ✅ You select a PDF file from your local storage
  • ✅ PDF.js parses and processes the PDF entirely in your browser's JavaScript engine
  • ✅ The XML output is generated in memory and offered as a download
  • ❌ Zero bytes of your PDF are ever transmitted to any server

The Technical Stack Making This Possible

Three key browser technologies enable truly private, client-side file processing:

File API & ArrayBuffer: The browser's File API lets JavaScript read files directly from the user's device without any server involvement. Files are loaded as ArrayBuffers: raw binary data in memory.

Web Workers: Heavy computation (like parsing a large PDF) can be offloaded to Web Workers, background threads that don't block the UI. This is how PDF.js processes large files without freezing the browser.

Blob API & Object URLs: Once processing is complete, the Blob API lets JavaScript create downloadable files in memory. The browser generates a temporary Object URL that the user can click to save the result, with no server and no round trip.

What This Means for Your Sensitive Documents

If you're converting a PDF containing medical records, financial statements, legal documents, or any other sensitive information, browser-based processing offers a fundamentally different security guarantee than server-based tools:

  • No data breach at the provider can expose your documents (they were never there)
  • No subpoena or government request served on the provider can result in your documents being disclosed (the provider never held them)
  • No employee of the service can access your documents
  • No storage logs or access logs can be leaked

How PDF.js Works: A Technical Deep Dive into Browser-Based PDF Parsing

PDF.js is the open-source library powering Firefox's PDF viewer and our converter. Here's how it parses PDF files, extracts text, and handles complex document structures, all in the browser.

⚑

What Is PDF.js?

PDF.js is an open-source JavaScript library developed and maintained by Mozilla. It's the same engine that powers Firefox's built-in PDF viewer and is used by millions of developers worldwide. The library can parse most PDF files and render them in a browser, all without plugins, extensions, or server-side processing.

Our PDF to XML converter uses PDF.js not to render PDFs visually, but to extract their text content and structural data, which we then serialize into well-formed XML.

The PDF File Format

To understand how PDF.js works, you first need to understand what a PDF file actually is. At its core, a PDF is a binary file containing a collection of objects: pages, fonts, images, annotations, and content streams. These objects reference each other through an internal cross-reference table (xref table), forming a directed graph structure.

Content streams are the most important part for text extraction. They contain sequences of PDF operators: commands that describe how to render content. The text-related operators (Td, Tm, Tj, TJ, etc.) position the text and emit its character codes, while others such as Tf select the font and size.
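
To make the operators concrete, here is a toy Python sketch that pulls the shown strings out of a small, uncompressed content stream. It is a deliberate simplification and not how PDF.js works internally: real streams are usually compressed, strings may contain escapes or be hex-encoded, and the bytes are character codes that must be mapped through the font's encoding. The stream contents and the function name `extract_shown_text` are made up for the example.

```python
import re

# Either "(text) Tj" (show one string) or "[(a) -15 (b)] TJ" (show an array
# of strings with kerning adjustments). Escaped or nested parentheses are
# not handled in this toy version.
SHOW_TEXT = re.compile(r"\(([^()\\]*)\)\s*Tj|\[([^\]]*)\]\s*TJ")
STRING_IN_ARRAY = re.compile(r"\(([^()\\]*)\)")


def extract_shown_text(stream):
    """Return the strings drawn by Tj and TJ operators, in stream order."""
    chunks = []
    for match in SHOW_TEXT.finditer(stream):
        if match.group(1) is not None:
            chunks.append(match.group(1))
        else:
            # Join the string fragments of a TJ array, ignoring the numbers.
            chunks.append("".join(STRING_IN_ARRAY.findall(match.group(2))))
    return chunks


stream = """
BT
/F1 12 Tf
72 720 Td
(Annual Report) Tj
0 -18 Td
[(Reve) -15 (nue)] TJ
ET
"""
print(extract_shown_text(stream))
```

Note how the second line of text arrives as two fragments with a kerning adjustment between them. This is typical of real PDFs and part of why words must be reassembled during extraction.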

How PDF.js Parses Text

To extract text from a page, our converter calls page.getTextContent(). This method returns a promise that resolves to a text content object whose items array holds TextItem objects, each containing:

  • str: the text string
  • transform: a 6-element array representing the 2D transformation matrix [a, b, c, d, e, f], where e and f are the X and Y position
  • width: the width of the text item
  • height: the height of the text item, which approximates the font size
  • fontName: the name of the font used

Our converter uses the Y coordinate (transform[5]) to group text items into lines, then groups lines with similar spacing into paragraphs, and applies heuristics to detect headings based on length and punctuation patterns.
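
The line-grouping step can be sketched as follows. This Python version works on plain dictionaries shaped like the TextItem fields listed above. The tolerance value, the sample data, and the function name `group_into_lines` are assumptions for illustration, and joining items with a single space is a simplification, since real items may already carry their own spacing.

```python
def group_into_lines(items, y_tolerance=2.0):
    """Group TextItem-like dicts into (y, text) lines, ordered top to bottom.

    transform[4] is the X position and transform[5] the Y position.
    PDF y coordinates grow upward, so larger y means higher on the page.
    """
    ordered = sorted(items, key=lambda it: (-it["transform"][5], it["transform"][4]))
    lines = []  # each entry is [y_of_first_item, [items on that line]]
    for item in ordered:
        y = item["transform"][5]
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            lines[-1][1].append(item)
        else:
            lines.append([y, [item]])
    result = []
    for y, line_items in lines:
        # Re-sort by X: items with slightly different y values may be out of order.
        line_items.sort(key=lambda it: it["transform"][4])
        result.append((y, " ".join(it["str"] for it in line_items)))
    return result


items = [
    {"str": "Report", "transform": [12, 0, 0, 12, 150.0, 700.0]},
    {"str": "Annual", "transform": [12, 0, 0, 12, 72.0, 700.5]},
    {"str": "Revenue grew.", "transform": [10, 0, 0, 10, 72.0, 670.0]},
]
for y, text in group_into_lines(items):
    print(y, text)
```

The tolerance matters because items on the same visual line often differ by a fraction of a unit in Y, for example when fonts or superscripts change mid-line.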

Limitations of Text Extraction

Even PDF.js, impressive as it is, has limitations. Image-only scanned PDFs don't contain any text objects at all; they're just images. Complex multi-column layouts may not extract in reading order. Right-to-left text requires special handling. And PDFs with custom encoding or non-standard font mappings may produce garbled output.

These limitations stem largely from the PDF format itself, not from PDF.js. True OCR (converting images of text to actual text) requires a different class of tools entirely, such as the Tesseract OCR engine or cloud OCR services.

5 Real-World Use Cases for PDF to XML Conversion in Business

From automating invoice processing to integrating with ERP systems, converting PDFs to structured XML unlocks powerful business automation. Here are five real-world scenarios where it makes a measurable difference.

🏦

1. Automated Invoice Processing

Accounts payable departments receive thousands of vendor invoices per month, many of them as PDFs. Manually keying invoice data into accounting systems is slow, error-prone, and expensive. Converting PDFs to XML creates structured data that can be parsed by Python, JavaScript, or any language to automatically extract invoice numbers, dates, line items, and totals.
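
As a minimal sketch of what such parsing might look like in Python, the example below searches the heading and paragraph text of a converted invoice with regular expressions. Everything specific here is an assumption: the field labels, the patterns, the sample invoice, and the function name `extract_invoice_fields` are invented for illustration, and real invoices vary enough that each vendor layout usually needs its own rules.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical patterns for one imagined invoice layout.
PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s*(?:No\.?|Number|#)\s*:?\s*([A-Z0-9-]+)", re.I),
    "date": re.compile(r"Date\s*:?\s*(\d{4}-\d{2}-\d{2})", re.I),
    "total": re.compile(r"\bTotal\s*(?:Due)?\s*:?\s*\$?([\d,]+\.\d{2})", re.I),
}

SAMPLE_INVOICE_XML = """<document>
  <pages>
    <page number="1">
      <heading>Invoice No. INV-2024-0117</heading>
      <paragraph>Date: 2024-03-15</paragraph>
      <paragraph>Total Due: $1,250.00</paragraph>
    </page>
  </pages>
</document>"""


def extract_invoice_fields(xml_text):
    """Return a dict of field name to matched value (None when not found)."""
    root = ET.fromstring(xml_text)
    text = " ".join(
        (el.text or "").strip()
        for el in root.iter()
        if el.tag in ("heading", "paragraph")
    )
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


print(extract_invoice_fields(SAMPLE_INVOICE_XML))
```

Returning None for missing fields, rather than raising, lets a pipeline flag incomplete invoices for manual review.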

2. Legal Document Management

Law firms and compliance teams deal with enormous volumes of PDF contracts, filings, and regulatory documents. Converting these to XML enables full-text search, automated clause extraction, and integration with document management systems like iManage or NetDocuments. E-discovery workflows that once took weeks can be dramatically accelerated.

3. Healthcare: EHR Data Integration

Medical records, lab reports, and clinical documents are frequently distributed as PDFs. Converting them to XML allows integration with Electronic Health Record (EHR) systems. Healthcare data standards such as HL7 CDA are XML-based, and HL7 FHIR defines an XML representation alongside JSON, making PDF-to-XML conversion a natural first step in healthcare data pipelines.

4. Publishing: Content Migration

Publishers migrating legacy content to digital platforms often have thousands of documents in PDF format. Converting to XML provides a structured intermediate format from which content can be transformed to HTML, EPUB, or any other target format using XSLT stylesheets, a well-established workflow in the publishing industry.

5. Manufacturing: Technical Documentation

Product specification sheets, bills of materials, and technical manuals are commonly distributed as PDFs. Extracting this data to XML enables integration with Product Lifecycle Management (PLM) and ERP systems, automating data flows that would otherwise require manual data entry by engineers.

Working with XML Data: XPath, XSLT, and How to Query Your Converted Files

Once you have your PDF converted to XML, how do you actually use it? This tutorial covers XPath expressions, XSLT transformations, and practical examples for querying and transforming your data.

📊

Introduction to XPath

XPath (XML Path Language) is a query language for selecting nodes from an XML document. Think of it as SQL for XML: a powerful, standardized way to navigate and extract data from any XML structure. XPath is available in most major programming languages (natively or through libraries), in XML editors, and in browsers.

// Select all paragraph elements in any page:
//paragraph

// Select headings from page 1 only:
/document/pages/page[@number="1"]/heading

// Get the total number of pages from metadata:
/document/metadata/totalPages/text()

// Find all paragraphs containing a specific word (case-sensitive):
//paragraph[contains(., 'revenue')]

Introduction to XSLT

XSLT (XSL Transformations) is a language for transforming XML documents into other formats: another XML document, HTML, plain text, or any text-based format. An XSLT stylesheet defines rules (called templates) that match elements in the source XML and specify how to transform them.

A simple XSLT to convert our XML output to HTML would look like this:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="...">
  <xsl:template match="/">
    <html><body>
      <xsl:apply-templates select="//page"/>
    </body></html>
  </xsl:template>
  <xsl:template match="heading">
    <h2><xsl:value-of select="."/></h2>
  </xsl:template>
  <xsl:template match="paragraph">
    <p><xsl:value-of select="."/></p>
  </xsl:template>
</xsl:stylesheet>

Practical Python Example

Python's lxml library makes XPath queries on XML files easy:

from lxml import etree

tree = etree.parse('converted.xml')

# Get all headings
headings = tree.xpath('//heading/text()')
for h in headings:
    print(h)

# Get all content from page 1
page1 = tree.xpath('/document/pages/page[@number="1"]/*')
for el in page1:
    print(f'{el.tag}: {el.text}')
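
If lxml is not available, Python's built-in xml.etree.ElementTree supports a limited subset of XPath that covers simple queries like these. The sketch below assumes the output structure shown earlier and uses made-up sample data. Because ElementTree has no contains() function, the word search is done in plain Python, and `paragraphs_containing` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<document>
  <metadata><filename>report.pdf</filename><totalPages>2</totalPages></metadata>
  <pages>
    <page number="1">
      <heading>Q4 Financial Summary</heading>
      <paragraph>Revenue for the quarter reached $4.2M.</paragraph>
    </page>
    <page number="2">
      <paragraph>Costs were flat.</paragraph>
    </page>
  </pages>
</document>"""

root = ET.fromstring(SAMPLE)

# ElementTree paths are relative to the root element,
# so "/document/pages/..." becomes "./pages/...".
page1_headings = [h.text for h in root.findall("./pages/page[@number='1']/heading")]
total_pages = int(root.findtext("./metadata/totalPages"))


def paragraphs_containing(root, word):
    """Case-insensitive substring search over all paragraph elements."""
    needle = word.lower()
    return [p.text for p in root.iter("paragraph") if needle in (p.text or "").lower()]


print(page1_headings)
print(total_pages)
print(paragraphs_containing(root, "revenue"))
```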