Powershell: how to overcome MaxCharactersInDocument limit when reading XML

When working with XML files, which I usually process with Powershell scripts, sometimes I stumble on the following error message:

Select-Xml : The file 'C:\data\lalala.xml' cannot be read: The input document has exceeded a limit set by MaxCharactersInDocument.

The problem here is that Powershell has a hard-coded limit to stop you from reading huge files. This limit is 536870912 bytes (0.5 Gb).

So what can you do if you need to process such a large file? C# to the rescue! You can add inline C# code that uses XmlTextReader, which is very fast.

The specific problem I had to solve is to get all the IDs from several XML files and create SQL out of them. Then I used this SQL to create a temp table in my SQL server, like this:

SELECT t.* INTO #documentids FROM ( 
  SELECT 'e29e16f9-f77e-4dce-92e0-99fef71c868d' AS id UNION ALL
  SELECT '0cd0fef5-0037-492e-861f-593f7322c2a3' AS id UNION ALL
  SELECT 'e963a4da-2a60-43f1-b4d4-9b07b9d07d57' AS id UNION ALL
  -- <all the stuff created by powershell> 
  SELECT 'ca70d7b7-40c8-418d-8612-42fe2fd5bf28' AS id 
) t

But of course you can easily customize the code to meet your needs.

I tested with an 1.76 Gb XML on my dev laptop. It took 19.6 seconds. Not bad 🙂

Here’s the code I used:

#
# Source: DotJim blog (http://dandraka.com)
# Jim Andrakakis, January 2020
#
Clear-Host
$ErrorActionPreference='Stop'

$referencingassemblies = ("System.XML.dll")
Add-Type -ReferencedAssemblies $referencingassemblies -Language CSharp @"
using System;
using System.IO;
using System.Xml;
using System.Collections.Generic;
namespace DotJim.Powershell 
{
    public static class Xml
    {
        public static List<string> GetAttributes(string fileName, string localName, string template)
        {
            var list = new List<string>();
            XmlTextReader txtreaderObj = new XmlTextReader(fileName);
            txtreaderObj.WhitespaceHandling = WhitespaceHandling.None;
            while (txtreaderObj.Read())
            {
                if (txtreaderObj.LocalName.Equals(localName) &&
                txtreaderObj.IsStartElement())
                {
                     txtreaderObj.Read();
                     //Console.WriteLine(txtreaderObj.Value);
                     string s = txtreaderObj.Value;
                     string line = string.Format(template, s);
                     list.Add(line);
                }
            } 
            return list;
        }
    }
}
"@;

# Here I'm trying to create a big SELECT sql statement out of all the IDs in the xml
$template = "SELECT '{0}' AS id UNION ALL"
$fileList = @("C:\data\printdata1.xml", "C:\data\printdata1.xml")

$StartDate=(Get-Date)

foreach($fileName in $fileList)
{
    $list = [DotJim.Powershell.Xml]::GetAttributes($fileName, "DocumentId", $template)
    $list[$list.Count-1] = $list[$list.Count-1].Replace("UNION ALL", "")
    Out-File -FilePath $fileName.Replace(".xml", ".sql") -InputObject $list
}

$EndDate=(Get-Date)

$timeDiff = New-TimeSpan –Start $StartDate –End $EndDate
Write-Host "Finished processing $($fileList.Count) files in $($timeDiff.TotalSeconds) seconds"

Have fun coding!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s