When working with XML files, which I usually process with Powershell scripts, sometimes I stumble on the following error message:
Select-Xml : The file 'C:\data\lalala.xml' cannot be read: The input document has exceeded a limit set by MaxCharactersInDocument.
The problem here is that Powershell has a hard-coded limit to stop you from reading huge files. This limit is 536870912 bytes (0.5 Gb).
So what can you do if you need to process such a large file? C# to the rescue! You can add inline C# code that uses XmlTextReader, which is very fast.
The specific problem I had to solve is to get all the IDs from several XML files and create SQL out of them. Then I used this SQL to create a temp table in my SQL server, like this:
SELECT t.* INTO #documentids FROM ( SELECT 'e29e16f9-f77e-4dce-92e0-99fef71c868d' AS id UNION ALL SELECT '0cd0fef5-0037-492e-861f-593f7322c2a3' AS id UNION ALL SELECT 'e963a4da-2a60-43f1-b4d4-9b07b9d07d57' AS id UNION ALL -- <all the stuff created by powershell> SELECT 'ca70d7b7-40c8-418d-8612-42fe2fd5bf28' AS id ) t
But of course you can easily customize the code to meet your needs.
I tested with an 1.76 Gb XML on my dev laptop. It took 19.6 seconds. Not bad 🙂
Here’s the code I used:
# # Source: DotJim blog (http://dandraka.com) # Jim Andrakakis, January 2020 # Clear-Host $ErrorActionPreference='Stop' $referencingassemblies = ("System.XML.dll") Add-Type -ReferencedAssemblies $referencingassemblies -Language CSharp @" using System; using System.IO; using System.Xml; using System.Collections.Generic; namespace DotJim.Powershell { public static class Xml { public static List<string> GetAttributes(string fileName, string localName, string template) { var list = new List<string>(); XmlTextReader txtreaderObj = new XmlTextReader(fileName); txtreaderObj.WhitespaceHandling = WhitespaceHandling.None; while (txtreaderObj.Read()) { if (txtreaderObj.LocalName.Equals(localName) && txtreaderObj.IsStartElement()) { txtreaderObj.Read(); //Console.WriteLine(txtreaderObj.Value); string s = txtreaderObj.Value; string line = string.Format(template, s); list.Add(line); } } return list; } } } "@; # Here I'm trying to create a big SELECT sql statement out of all the IDs in the xml $template = "SELECT '{0}' AS id UNION ALL" $fileList = @("C:\data\printdata1.xml", "C:\data\printdata1.xml") $StartDate=(Get-Date) foreach($fileName in $fileList) { $list = [DotJim.Powershell.Xml]::GetAttributes($fileName, "DocumentId", $template) $list[$list.Count-1] = $list[$list.Count-1].Replace("UNION ALL", "") Out-File -FilePath $fileName.Replace(".xml", ".sql") -InputObject $list } $EndDate=(Get-Date) $timeDiff = New-TimeSpan –Start $StartDate –End $EndDate Write-Host "Finished processing $($fileList.Count) files in $($timeDiff.TotalSeconds) seconds"
Have fun coding!