I experienced an elevated level of frustration today at grep’s inability to find what I was looking for in a plain text file. It was particularly irritating because, as in all years prior, I have made some New Year’s resolutions, and one of them was to not become irate too quickly.
As things go, however, the end result taught me some valuable new things!
I receive a bunch of fairly large XML files that I need to search for particular lines of text. ‘No problem,’ I say! The Linux command line was built for this, and grep is the perfect tool for the job in this case.
The XML files in question are Microsoft SCOM Management Packs. They look something like this:
<?xml version="1.0" encoding="utf-16"?>
<ManagementPack xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" ContentReadable="true" SchemaVersion="2.0" OriginalSchemaVersion="1.0" RevisionId="xxxxxxxxxxxxx">
  <Manifest>
    <Identity>
      <ID>Microsoft.Exchange.Server.2003.Discovery</ID>
      <Version>6.0.6702.0</Version>
      <PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
    </Identity>
    <Name>Microsoft Exchange 2003 Discovery Management Pack</Name>
    <References>
      <Reference Alias="Exchange">
      ....snipped...
I was particularly interested in the line with PublicKeyToken.
So it should be simple then:
hendri@techedemic:~/Documents/Work/XMLFiles/ $ grep -inr "Token" *.xml
hendri@techedemic:~/Documents/Work/XMLFiles/ $ ## NO OUTPUT ????
I tried everything (that I knew at the time): dos2unix, tofrodos. I even opened the files in Vim, made changes, and saved them again. Nada! Nothing I could think of worked. My last resort was to cat the file(s) and grep the output. This returned a message:
hendri@techedemic:~/Documents/Work/XMLFiles/ $ cat MP01.xml | grep -iv 'Token'
Binary file (standard input) matches
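Finally, something I could Google! As it turns out, the file encoding was the problem. The files were encoded as UTF-16LE and not as UTF-8 (or plain ASCII), which is what grep expects on a typical Linux setup. In UTF-16LE every ASCII character is stored as two bytes, the character itself followed by a NUL byte, so the byte sequence for 'Token' never appears in the file. Those NUL bytes are also why grep reported a binary file.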
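If you want to see this for yourself, here is a minimal sketch. It builds a one-line UTF-16LE sample in a temporary directory (the file name sample.xml is made up for illustration), dumps its first bytes, and then compares grep on the raw file against grep on a decoded copy. It assumes GNU grep, iconv and od are installed.

```shell
# Build a one-line UTF-16LE sample in a temporary directory.
# (sample.xml is a made-up name for illustration.)
tmpdir=$(mktemp -d)
printf '<PublicKeyToken>31bf3856ad364e35</PublicKeyToken>\n' \
    | iconv -f UTF-8 -t UTF-16LE > "$tmpdir/sample.xml"

# Dump the first 8 bytes: each ASCII character is followed by a 00 byte.
first_bytes=$(head -c 8 "$tmpdir/sample.xml" | od -An -tx1)
echo "first bytes:$first_bytes"

# Count matching lines in the raw file. grep exits non-zero when nothing
# matches, hence the '|| true'.
raw_matches=$(grep -c 'Token' "$tmpdir/sample.xml" || true)

# Decode to UTF-8 on the fly and count again.
decoded_matches=$(iconv -f UTF-16LE -t UTF-8 "$tmpdir/sample.xml" | grep -c 'Token')

echo "raw: $raw_matches, decoded: $decoded_matches"
rm -rf "$tmpdir"
```

With GNU tools the byte dump should read 3c 00 50 00 75 00 62 00 (the characters <, P, u and b, each trailed by a NUL), and the last line should report zero raw matches and one decoded match.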
You can see the file encoding by running:
hendri@techedemic:~/Documents/Work/XMLFiles/ $ file -i *.xml
## which outputs something as follows:
MP01.xml: application/xml; charset=utf-16le
MP02.xml: application/xml; charset=utf-16le
MP03.xml: application/xml; charset=utf-8
MP04.xml: application/xml; charset=utf-8
I was able to fix it by re-encoding the files:
## First, create a directory where I can put the output files. I wanted to keep the originals as-is for use on the applicable servers whence they came
hendri@techedemic:~/Documents/Work/XMLFiles/ $ mkdir output
## Next, run a loop with 'iconv' (a file encoding conversion tool). You can run it without the -o option if you just want to output to STDOUT
## Note: the loop treats every .xml file as UTF-16LE, so files that are already UTF-8 should be copied across instead of converted
hendri@techedemic:~/Documents/Work/XMLFiles/ $ for i in *.xml; do iconv -f UTF-16LE -t UTF-8 "$i" -o "output/$i"; done
## Now you can just 'cd' into the 'output' directory and work with grep as you would expect.
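One way to handle the mix of encodings in the file -i listing is to check each file before converting it. The sketch below is only an illustration: normalize_to_utf8 is a hypothetical helper name of my own, and it assumes that your version of file supports -b --mime-encoding and reports names such as utf-16le that iconv accepts (true for the GNU/Linux tools I know of, but worth checking on your system).

```shell
# Write a UTF-8 version of $1 to $2, converting only when needed.
# (normalize_to_utf8 is a hypothetical helper, not a standard tool.)
normalize_to_utf8() {
    src=$1
    dest=$2
    # -b drops the file name, --mime-encoding prints only the charset.
    enc=$(file -b --mime-encoding "$src")
    case "$enc" in
        utf-16le|utf-16be)
            iconv -f "$enc" -t UTF-8 "$src" > "$dest" ;;
        utf-8|us-ascii)
            # Already searchable with grep: copy unchanged.
            cp -- "$src" "$dest" ;;
        *)
            echo "skipping $src: unexpected encoding '$enc'" >&2
            return 1 ;;
    esac
}

mkdir -p output
for f in *.xml; do
    [ -e "$f" ] || continue   # no .xml files here: the glob stayed literal
    normalize_to_utf8 "$f" "output/$f"
done
```

Files reported as anything other than UTF-16, UTF-8 or ASCII are skipped with a warning rather than guessed at, so nothing in the output directory is silently garbled.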
Have a good year, everyone!