.NET Framework Helps IT Find Data More Easily
With just a few lines of code, you can extract data from text files, including log files, using regular-expression capture groups. If you’ve used regular expressions to search for matching text, extracting text using the .NET Framework will be very intuitive. If you haven’t worked with regular expressions before, or (like me) you need a reference to remember all the symbols, check out the Microsoft Developer Network’s reference for help.
Finding Matching Lines
Imagine that you need to parse a log file (we’ll use C:\Windows\WgaNotify.log as an example, because it’s present on most computers) and list every file that was successfully copied. The WgaNotify.log file resembles the following:
[WgaNotify.log]
0.109: ========================================================
0.109: 2006/04/27 06:54:09.218 (local)
0.109: Failed To Enable SE_SHUTDOWN_PRIVILEGE
1.359: Starting AnalyzeComponents
1.359: AnalyzePhaseZero used 0 ticks
1.359: No c:\windows\INF\updtblk.inf file.
23.328: Copied file: C:\WINDOWS\system32\LegitCheckControl.dll
23.578: Copied file (delayed): C:\WINDOWS\system32\SETE.tmp
25.156: Return Code = 0
25.156: Starting process: C:\WINDOWS\system32\wgatray.exe /b
As you can see, two of the lines (the two “Copied file” entries) contain useful information, and the rest can be ignored. You could use the following console application, which requires the System.IO and System.Text.RegularExpressions namespaces, to display just the lines that contain the phrase “Copied file”:
' Visual Basic
Dim inFile As StreamReader = File.OpenText("C:\Windows\wganotify.log")
' Read each line of the log file
Dim inLine As String = inFile.ReadLine()
While inLine IsNot Nothing
    Dim r As New Regex("Copied file")
    ' Display the line only if it matches the regular expression
    If r.IsMatch(inLine) Then
        Console.WriteLine(inLine)
    End If
    ' Read the next line
    inLine = inFile.ReadLine()
End While
inFile.Close()
// C#
StreamReader inFile = File.OpenText(@"C:\Windows\wganotify.log");
string inLine;
// Read each line of the log file
while ((inLine = inFile.ReadLine()) != null)
{
    Regex r = new Regex(@"Copied file");
    // Display the line only if it matches the regular expression
    if (r.IsMatch(inLine))
        Console.WriteLine(inLine);
}
inFile.Close();
Running this console application would match the lines that contain information about the files copied and display the following:
23.328: Copied file: C:\WINDOWS\system32\LegitCheckControl.dll
23.578: Copied file (delayed): C:\WINDOWS\system32\SETE.tmp
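If all you need to do is display matching lines in a text file, use FindStr. The following command displays the same output as the previous code sample. The /C switch tells FindStr to treat the quoted phrase, including the space, as a single literal search string rather than as two separate search strings:
FindStr /C:"Copied file" C:\Windows\WgaNotify.log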
Capturing Specific Data
To extract portions of matching lines, specify capture groups by surrounding a portion of your regular expression with parentheses. For example, the regular expression "Copied file:\s*(.*$)" matches the phrase “Copied file:”, skips any white space that follows (the “\s*” symbol), and places the rest of the line into a group. Remember, “.*” matches any sequence of characters, and “$” matches the end of the line.
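As a quick check of that pattern, the following minimal sketch applies it to two of the sample log lines. The class and method names (CaptureSketch, ExtractPath) are made up for illustration. Under these assumptions, the first line yields its path, while the “(delayed)” line does not match because the literal text “Copied file:” never appears in it. That is presumably why the samples later in this article use the slightly looser pattern "Copied file.*:\s+(.*$)".

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the .NET Framework.
public static class CaptureSketch
{
    // Returns the text captured by group 1, or null if the line does not match.
    public static string ExtractPath(string line)
    {
        Match m = Regex.Match(line, @"Copied file:\s*(.*$)");
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        string[] lines =
        {
            @"23.328: Copied file: C:\WINDOWS\system32\LegitCheckControl.dll",
            @"23.578: Copied file (delayed): C:\WINDOWS\system32\SETE.tmp"
        };

        foreach (string line in lines)
        {
            // Print the captured path, or a marker when the pattern fails
            string path = ExtractPath(line);
            Console.WriteLine(path ?? "(no match)");
        }
    }
}
```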
To match a pattern and capture a portion of the match, follow these steps:
- Create a regular expression, and enclose in parentheses the portion of the pattern that you want to capture. This creates a group.
- Create an instance of the System.Text.RegularExpressions.Match class by calling the Regex.Match method.
- Retrieve the matched data by accessing the elements of the Match.Groups array. Element 0 holds the entire match; the first group is element 1, the second group is element 2, and so on.
The following example expands on the previous code sample to extract and display the filenames from the WgaNotify.log file:
' Visual Basic
Dim inFile As StreamReader = File.OpenText("C:\Windows\wganotify.log")
' Read each line of the log file
Dim inLine As String = inFile.ReadLine()
While inLine IsNot Nothing
    ' Create a regular expression
    Dim r As New Regex("Copied file.*:\s+(.*$)")
    ' Display the group only if it matches the regular expression
    If r.IsMatch(inLine) Then
        Dim m As Match = r.Match(inLine)
        Console.WriteLine(m.Groups(1))
    End If
    ' Read the next line
    inLine = inFile.ReadLine()
End While
inFile.Close()
// C#
StreamReader inFile = File.OpenText(@"C:\Windows\wganotify.log");
string inLine;
// Read each line of the log file
while ((inLine = inFile.ReadLine()) != null)
{
    // Create a regular expression
    Regex r = new Regex(@"Copied file.*:\s+(.*$)");
    // Display the group only if it matches the regular expression
    if (r.IsMatch(inLine))
    {
        Match m = r.Match(inLine);
        Console.WriteLine(m.Groups[1]);
    }
}
inFile.Close();
This code does a bit better, displaying just the filenames of the copied files:
C:\WINDOWS\system32\LegitCheckControl.dll
C:\WINDOWS\system32\SETE.tmp
Capturing Multiple Groups
You can also separate the folder and filename by matching multiple groups in a single line. The following slightly updated sample creates separate capture groups for the folder name and the filename, and then displays both values. Notice that the regular expression now contains two groups (indicated by two sets of parentheses), and the call to Console.WriteLine now references elements 1 and 2 of the Match.Groups array (element 0 holds the entire match).
' Visual Basic
Dim inFile As StreamReader = File.OpenText("C:\Windows\wganotify.log")
' Read each line of the log file
Dim inLine As String = inFile.ReadLine()
While inLine IsNot Nothing
    ' Create a regular expression
    Dim r As New Regex("Copied file.*:\s+(.*\\)(.*$)")
    ' Display the line only if it matches the regular expression
    If r.IsMatch(inLine) Then
        Dim m As Match = r.Match(inLine)
        Console.WriteLine("Folder: " & m.Groups(1).Value & ", File: " & m.Groups(2).Value)
    End If
    ' Read the next line
    inLine = inFile.ReadLine()
End While
inFile.Close()
// C#
StreamReader inFile = File.OpenText(@"C:\Windows\wganotify.log");
string inLine;
// Read each line of the log file
while ((inLine = inFile.ReadLine()) != null)
{
    // Create a regular expression
    Regex r = new Regex(@"Copied file.*:\s+(.*\\)(.*$)");
    // Display the line only if it matches the regular expression
    if (r.IsMatch(inLine))
    {
        Match m = r.Match(inLine);
        Console.WriteLine("Folder: " + m.Groups[1] + ", File: " + m.Groups[2]);
    }
}
inFile.Close();
The end result is that the console application captures the folder and filename separately, and outputs the following formatted data:
Folder: C:\WINDOWS\system32\, File: LegitCheckControl.dll
Folder: C:\WINDOWS\system32\, File: SETE.tmp
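The split works because “.*” is greedy: the first group, "(.*\\)", consumes as much text as it can while still ending in a backslash, so it stops at the last backslash in the path and leaves only the filename for the second group. Here is a minimal sketch of that behavior on a single line. The SplitSketch class and its SplitPath helper are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; the names are not from the samples above.
public static class SplitSketch
{
    private static readonly Regex CopiedFile =
        new Regex(@"Copied file.*:\s+(.*\\)(.*$)");

    // Returns { folder, file } for a matching line, or null otherwise.
    public static string[] SplitPath(string line)
    {
        Match m = CopiedFile.Match(line);
        if (!m.Success)
            return null;

        // Groups[0] is the whole match; the capture groups start at index 1.
        return new[] { m.Groups[1].Value, m.Groups[2].Value };
    }

    public static void Main()
    {
        string[] parts = SplitPath(
            @"23.578: Copied file (delayed): C:\WINDOWS\system32\SETE.tmp");

        Console.WriteLine("Folder: " + parts[0]);
        Console.WriteLine("File: " + parts[1]);
    }
}
```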
Using Named Capture Groups
You can make your regular expressions easier to read by naming the capture groups. To name a group, add “?<name>” after the open parenthesis. You can then access the named groups using Match.Groups["name"]. The following example demonstrates using named groups with the Match.Result method, which allows you to format the results of a regular expression match. It produces exactly the same output as the previous code sample, but the code is easier to read.
' Visual Basic
Dim inFile As StreamReader = File.OpenText("C:\Windows\wganotify.log")
' Read each line of the log file
Dim inLine As String = inFile.ReadLine()
While inLine IsNot Nothing
    ' Create a regular expression
    Dim r As New Regex("Copied file.*:\s+(?<folder>.*\\)(?<file>.*$)")
    ' Display the line only if it matches the regular expression
    If r.IsMatch(inLine) Then
        Dim m As Match = r.Match(inLine)
        Console.WriteLine(m.Result("Folder: ${folder}, File: ${file}"))
    End If
    ' Read the next line
    inLine = inFile.ReadLine()
End While
inFile.Close()
// C#
StreamReader inFile = File.OpenText(@"C:\Windows\wganotify.log");
string inLine;
// Read each line of the log file
while ((inLine = inFile.ReadLine()) != null)
{
    // Create a regular expression
    Regex r = new Regex(@"Copied file.*:\s+(?<folder>.*\\)(?<file>.*$)");
    // Display the line only if it matches the regular expression
    if (r.IsMatch(inLine))
    {
        Match m = r.Match(inLine);
        Console.WriteLine(m.Result("Folder: ${folder}, File: ${file}"));
    }
}
inFile.Close();
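If you need the captured values themselves rather than a formatted string, you can read the named groups directly through the Match.Groups indexer. The following is a minimal sketch of that approach; the NamedGroupSketch class and its Describe helper are hypothetical names used only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the samples above.
public static class NamedGroupSketch
{
    private static readonly Regex CopiedFile =
        new Regex(@"Copied file.*:\s+(?<folder>.*\\)(?<file>.*$)");

    // Returns a formatted description of a matching line, or null otherwise.
    public static string Describe(string line)
    {
        Match m = CopiedFile.Match(line);
        if (!m.Success)
            return null;

        // Look up each capture by the name given in the pattern
        string folder = m.Groups["folder"].Value;
        string file = m.Groups["file"].Value;
        return "Folder: " + folder + ", File: " + file;
    }

    public static void Main()
    {
        Console.WriteLine(Describe(
            @"23.328: Copied file: C:\WINDOWS\system32\LegitCheckControl.dll"));
    }
}
```

Reading the groups by name keeps the code correct even if you later add another set of parentheses to the pattern, which would shift the numeric indexes.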
The .NET Framework supports using capture groups with regular expressions to extract specific data from log files. Using capture groups, you can parse complex text files and isolate just the information you need. First, create a Regex object (part of the System.Text.RegularExpressions namespace) using a regular expression that includes one or more capture groups in parentheses. Then, call the Regex.Match method to compare the regular expression to the input string. Access your capture groups using the Match.Groups array, or format and output the capture groups by calling Match.Result.
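The steps above can be condensed into one sketch. To stay self-contained, it reads from a StringReader holding a few of the sample log lines instead of opening C:\Windows\WgaNotify.log, and the LogSketch and Summarize names are hypothetical. It also makes two small design choices that differ from the earlier samples: the Regex is created once outside the loop, and the code checks Match.Success instead of calling IsMatch and Match separately, so each line is scanned only once.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the samples above.
public static class LogSketch
{
    // Created once and reused for every line
    private static readonly Regex CopiedFile =
        new Regex(@"Copied file.*:\s+(?<folder>.*\\)(?<file>.*$)");

    // Returns one formatted entry for each "Copied file" line in the input.
    public static List<string> Summarize(TextReader reader)
    {
        List<string> results = new List<string>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // A single Match call both tests the line and captures the groups
            Match m = CopiedFile.Match(line);
            if (m.Success)
                results.Add(m.Result("Folder: ${folder}, File: ${file}"));
        }
        return results;
    }

    public static void Main()
    {
        string sample =
            "1.359: Starting AnalyzeComponents\n" +
            "23.328: Copied file: C:\\WINDOWS\\system32\\LegitCheckControl.dll\n" +
            "23.578: Copied file (delayed): C:\\WINDOWS\\system32\\SETE.tmp\n" +
            "25.156: Return Code = 0\n";

        using (StringReader reader = new StringReader(sample))
        {
            foreach (string entry in Summarize(reader))
                Console.WriteLine(entry);
        }
    }
}
```

Because Summarize accepts any TextReader, you could pass it the StreamReader returned by File.OpenText to process the real log file.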
PowerShell offers very similar functionality. For more information, read “Regular Expressions in Monad” at http://www.leeholmes.com/blog/RegularExpressionsInMonad.aspx.
Tony Northrup is a developer, security consultant and author with more than 10 years of professional experience developing applications for Microsoft Windows.