KQL Extract

The extract function in Kusto Query Language (KQL) is used to retrieve specific parts of a string based on a pattern. It allows you to pull out specific information from a text or string column by using regular expressions. This function is very helpful when you want to isolate parts of a string, such as extracting email addresses, IP addresses, or other patterns from a column that contains mixed or formatted data.


Concept of extract

  • The extract function lets you pull specific substrings from a larger string by specifying a regular expression pattern.
  • This is useful for parsing strings in columns that contain complex or structured text data, such as URLs, logs, or JSON-like formats.
  • Regular expressions (regex) allow you to define patterns to match specific text, making extract a powerful tool for text manipulation in KQL.

Syntax

The syntax for the extract function is as follows:

extract("<regex_pattern>", <capture_group>, <column>)
  • <regex_pattern>: The regular expression pattern that specifies what to extract.
  • <capture_group>: The number of the capture group in the regex pattern. Capture groups are portions of the pattern enclosed in parentheses () that specify what part of the pattern to extract.
  • The capture group is numbered starting at 1.
  • <column>: The name of the column containing the text you want to parse.

Example Syntax

extract(@"\b(\d{3})\b", 1, SomeColumn)

In this example:

  • \b(\d{3})\b is a regex pattern that matches any sequence of three digits surrounded by word boundaries.
  • 1 specifies that we want the first capture group, which is the sequence of digits.
  • SomeColumn is the column in which we are searching.

Usage and Examples

Let’s go over some examples to see how the extract function works in practical scenarios.

Example 1: Extracting Area Codes from Phone Numbers

Suppose you have a table Contacts with a column PhoneNumber that stores phone numbers in the format (123) 456-7890. If you want to extract the area code (the first three digits in parentheses), you can use extract as follows:

Contacts
| extend AreaCode = extract(@"\((\d{3})\)", 1, PhoneNumber)
  • Explanation: The pattern \((\d{3})\) matches the area code. Here:
  • \( and \) match the literal parentheses around the area code.
  • (\d{3}) matches a sequence of three digits inside the parentheses.
  • Result: The AreaCode column will contain the extracted area codes, such as 123.

Example 2: Extracting Email Domains

If you have a UserLogs table with an Email column and want to extract the domain from each email (e.g., example.com from user@example.com), you can use the following query:

UserLogs
| extend Domain = extract(@"@([a-zA-Z0-9.-]+)", 1, Email)
  • Explanation: The pattern @([a-zA-Z0-9.-]+) captures everything after the @ symbol.
  • @ matches the @ symbol.
  • ([a-zA-Z0-9.-]+) captures the domain, including letters, numbers, dots, and hyphens.
  • Result: The Domain column will contain the domain part of each email, such as example.com.

Example 3: Extracting IP Address Components

Suppose you have a table NetworkLogs with an IPAddress column, and each IP address is in the format 192.168.1.1. To extract just the first two parts of the IP (e.g., 192.168), you could do:

NetworkLogs
| extend IPPrefix = extract(@"^(\d+\.\d+)", 1, IPAddress)
  • Explanation: The pattern ^(\d+\.\d+) captures the first two parts of the IP.
  • ^ matches the start of the string.
  • (\d+\.\d+) captures the first two sets of digits separated by a period.
  • Result: The IPPrefix column will contain just the first two parts of the IP, such as 192.168.

Example 4: Extracting Timestamps from Log Messages

Imagine you have a Logs table with a Message column that includes timestamps embedded in text, like Started at 2023-10-05 14:45:30. You can extract just the date and time using:

Logs
| extend Timestamp = extract(@"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})", 1, Message)
  • Explanation: The pattern (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) matches the date and time format YYYY-MM-DD HH:MM:SS.
  • \d{4} matches the year (four digits).
  • \d{2}-\d{2} matches the month and day, each with two digits.
  • \d{2}:\d{2}:\d{2} matches the hour, minute, and second.
  • Result: The Timestamp column will contain just the extracted timestamp, such as 2023-10-05 14:45:30.

Example 5: Extracting Status Codes from URLs

If you have a table called WebLogs with a column URL containing URLs like /status/404/error, and you want to extract just the status code (404), you can use the following:

WebLogs
| extend StatusCode = extract(@"/status/(\d+)", 1, URL)
  • Explanation: The pattern /status/(\d+) looks for /status/ followed by one or more digits.
  • /status/ matches the literal string /status/.
  • (\d+) captures the status code.
  • Result: The StatusCode column will contain the status code, such as 404.

  • The extract function in KQL helps you extract specific parts of a string based on a pattern.
  • It’s especially useful for handling data with structured text where you only need specific components, such as dates, IP addresses, email domains, or other patterns.
  • Regular expressions allow you to define flexible patterns to match the exact part of the string you want to extract.

The extract function is a powerful tool for string manipulation, enabling you to parse and retrieve only the information you need from complex text data.

Author: tonyhughes