May 2005 - Posts

Regular Expressions in SQL Server

Can you use regular expressions in a WHERE clause?

Ken Henderson shows us how, in this recent blog post on the topic.

Given that this is a question that comes up very commonly in forums, I thought I should put a pointer here for anyone who missed the post and requires this functionality.

Along with this pointer, I'd like to gently remind readers of what Ken left out of the article: For larger tables, this technique will perform very poorly! There are two reasons for this. One, an index on the column being searched cannot be used to help satisfy a query if that column is being searched using a function. This is due to the fact that the column itself is indexed, not the output of the function as it applies to the column. SQL Server has no way to know how the function will return before actually calling it. The second problem is that there is overhead associated with invoking a COM object. Although this overhead is not huge when dealing with only a few rows, it will add up when querying a large table -- especially because every row in the table may need to be queried due to the fact that an index can't be used!

So while you should certianly add this to your emergency box of tricks, make sure to use it with caution and test carefully.

... And coming soon to a blog reader near you, a SQLCLR version of this technique -- I think this will be a good opportunity to performance test COM automation vs. CLR functions. Stay tuned...

Texas Hold 'em Hint #1

The other day I annouced the Texas Hold 'em SQL Challenge. I haven't gotten any feedback on it yet, so I have no idea if anyone is working on it, but I thought I'd get the ball rolling and come up with my own solution...

The first step I've decided to take in solving this problem is to figure out how, given two or more 5-card hands, to figure out a winner. What's necessary is a way of weighting each hand so that better hands will have a higher value. Or at least, that's the tactic I took.

I first checked out this list of poker hands from strongest to weakest. After staring at it for a while, I realized that some patterns can be derived from the list:

  • A two-pair is two pairs (wow, getting deep here, aren't we!)
  • A full house is a pair plus a three of a kind
  • A straight flush is a straight plus a flush
  • A royal flush is a type of straight flush with the highest cards in the deck

Obvious, yes; but the final point is important -- is there a way of weighting the cards such that we can determine, using only a single number, the weight of a given hand, independent of whether or not it falls into one of these categories? That is, how can we know that a straight flush with a high king is worth less than a straight flush with a high ace? And how can we know that a hand with a pair of 10s and a high jack beats a hand with a pair of 10s and a high 9?

The answer is to use our old friend, binary:

CardValue
22
34
48
516
632
764
8128
9256
10512
Jack1024
Queen2048
King4096
Ace8192 (1)

Note that the ace will be weighted as 1, should it happen to be part of a straight with a high 5.

So what does this binary weighting give us? The property that's especially useful in this case is that for every power of 2, it's impossible to reach that number by adding up any unique combination of powers of 2 less than that power. Restated in terms of this project, if a hand has only a high card, say a king, then that hand will be weighted higher than any other high card hand that doesn't have either a king or an ace.

But weighting alone isn't enough. This can be shown using the following two hands:

2 3 Q K A
2 3 Q K K

The weighted values of these hands are 14342 and 10246, respectively (2 + 4 + 2048 + 4096 + 8192 and 2 + 4 + 2048 + 4096 + 4096). However, the second is a better hand. How to solve this issue? Bonuses.

After conducting a completely non-scientific trial-and-error analysis, I arrived at the following bonus structure, which seems to work:

To weight a hand, take the base weight (i.e., the sum of the weights of each card in the hand) and add the following bonuses per condition:

  • Pair: 2000000000 + (8192 * pair card weight)
  • Three of a kind: 6000000000 + (131072 * three of a kind card weight)
  • Straight: 5500000000
  • Flush: 5800000000
  • Four of a kind: 9500000000 + (131072 * four of a kind card weight)

Note that "pair card weight," for example, refers to the weight of the card type that makes up the pair. So for a pair of twos, the calculation would be (8192 * 2).

This formula can be expanded for each type of hand. Two pair? Add 2000000000 + (8192 * the weight of the card for the first pair) + (8192 * the weight of the card for the second pair). Full house? Combine the pair with the three of a kind. Same with straight flush. And since the base weights are being added, the hands will naturally be weighted heavier for higher cards.

There may have been a more elegant solution, but this one does the trick for now.

But enough math, on to the SQL! What I wanted was a table containing every possible hand and its weights. But first I needed a table with all of the possible cards:

CREATE TABLE dbo.Cards
(
	CardID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
	Suit CHAR(1) NOT NULL,
	Card VARCHAR(2) NOT NULL,
	Weight INT NOT NULL
)

INSERT dbo.Cards (Suit, Card, Weight)
SELECT S.Suit, C.Card, C.Weight
FROM
(
	SELECT 'H'
	UNION ALL
	SELECT 'D'
	UNION ALL
	SELECT 'S'
	UNION ALL
	SELECT 'C'
) S (Suit)
CROSS JOIN
(
	SELECT CONVERT(VARCHAR(2), number), POWER(2, number-1)
	FROM master..spt_values
	WHERE type='p'
		AND number BETWEEN 2 AND 10
	UNION ALL
	SELECT 'J', 1024
	UNION ALL
	SELECT 'Q', 2048
	UNION ALL
	SELECT 'K', 4096
	UNION ALL
	SELECT 'A', 8192
) C (Card, Weight)
ORDER BY 
	Suit, 
	Weight
GO

Yes, I know that (Suit, Card) could have formed a natural primary key, but the table of hands is going ta have 2598960 rows -- which you can prove using the formulas from this website:

52 cards, 5 cards per hand:

            n!
n_C_k = ----------
        k!(n - k)!

      52!
= -----------
  5!(52 - 5)!

      52!
= -----------
    5!(47)!

  52 * 51 * 50 * 49 * 48
= ----------------------
     5 * 4 * 3 * 2 * 1

   311875200
= -----------
      120

= 2598960

...And to believe that my high school algebra teacher had the gall to flunk me!

Anyway, with 2.59 million rows, I'd rather just use a single-column surrogate. It makes like a lot easier. So how do we generate all of these rows?

First, we need every possible hand:

SELECT
	C1.CardId AS Card1,
	C2.CardId AS Card2,
	C3.CardId AS Card3,
	C4.CardId AS Card4,
	C5.CardId AS Card5
FROM dbo.Cards C1
JOIN dbo.Cards C2 ON 
	C2.CardId > C1.CardId
JOIN dbo.Cards C3 ON 
	C3.CardId > C2.CardId
JOIN dbo.Cards C4 ON
	C4.CardId > C3.CardID
JOIN dbo.Cards C5 ON
	C5.CardId > C4.CardId

Easy enough. And notice how nice and simple that surrogate makes life... Laziness is a virtue!

Next step, apply the weighting logic. This is quite a bit more complex, and I'm not going to spend a lot of time going into detail since it's currently quite late here and I need to work in the morning. But remember the following rules: If you have two or more of the same index in a hand (pair, three of a kind, four of a kind), that hand can't be a flush. And if you have only two of a given card, that hand can't have a three of a kind of that card. And if you have only three of a card, that hand can't have a four of a kind for that card!

Given these rules, I realized that you can test for all of these conditions and add up the results, without doing any kind of backtracking. This made the job quite a bit simpler. I did the following: First, calculate the base weight and figure out whether we have a straight, for any given row. Add up the results. Then "unpivot" (using a cross-join to five rows of numbers) and use aggregates to figure out what pairs, threes of a kind, fours of a kind, and flushes exist in each hand... Re-pivot, and stuff everything into a table. Oh, and I implemented some rudimentary batching so that 2.5 million rows wouldn't be inserted in a single transaction.

The complete SQL is here:

CREATE TABLE dbo.PokerHands
(
	Card1 INT NOT NULL,
	Card2 INT NOT NULL,
	Card3 INT NOT NULL,
	Card4 INT NOT NULL,
	Card5 INT NOT NULL,
	Weight BIGINT NOT NULL,
	CONSTRAINT PK_PokerHands PRIMARY KEY CLUSTERED
		(Card1, Card2, Card3, Card4, Card5)
)
GO

DECLARE @CardId INT
SET @CardId = 1

WHILE @CardId <= 52
BEGIN
	INSERT dbo.PokerHands
		(Card1, Card2, Card3, Card4, Card5, Weight)
	SELECT 
		y.Card1,
		y.Card2,
		y.Card3,
		y.Card4,
		y.Card5,
		-- 2
		CASE 
			SUM(
				CASE y.CardWeight
					WHEN 2 THEN 1
					ELSE 0
				END
			)
			WHEN 4 THEN 9500000000 + (131072 * 2)
			WHEN 3 THEN 6000000000 + (131072 * 2)
			WHEN 2 THEN 2000000000 + (8192 * 2)
			ELSE 0
		END +
		-- 3
		CASE 
			SUM(
				CASE y.CardWeight
					WHEN 4 THEN 1
					ELSE 0
				END
			)
			WHEN 4 THEN 9500000000 + (131072 * 4)
			WHEN 3 THEN 6000000000 + (131072 * 4)
			WHEN 2 THEN 2000000000 + (8192 * 4)
			ELSE 0
		END +
		-- 4
		CASE 
			SUM(
				CASE y.CardWeight
					WHEN 8 THEN 1
					ELSE 0
				END
			)
			WHEN 4 THEN 9500000000 + (131072 * 8)
			WHEN 3 THEN 6000000000 + (131072 * 8)
			WHEN 2 THEN 2000000000 + (8192 * 8)
			ELSE 0
		END +
		-- 5
		CASE 
			SUM(
				CASE y.CardWeight
					WHEN 16 THEN 1
					ELSE 0
				END
			)
			WHEN 4 THEN 9500000000 + (131072 * 16)
			WHEN 3 THEN 6000000000 + (131072 * 16)
			WHEN 2 THEN 2000000000 + (8192 * 16)
			ELSE 0
		END +
		-- 6
		CASE 
			SUM(
				CASE y.CardWeight
					WHEN 32 THEN 1
					ELSE 0
				END
			)
			WHEN 4 THEN 9500000000 + (131072 * 32)
			WHEN 3 THEN 6000000000 + (131072 * 32)
			WHEN 2 THEN 2000000000 + (8192 * 32)
			ELSE 0
		END +
		-- 7
		CASE 
			SUM(
				CASE y.CardWeight
					WHEN 64 THEN 1
					ELSE 0
				END
			)
			WHEN 4 THEN 9500000000 + (131072 * 64)
			WHEN 3 THEN 6000000000 + (131072 * 64)
			WHEN 2 THEN 2000000000 + (8192 * 64)
			ELSE 0
		END +
		-- 8
		CASE 
			SUM(
				CASE y.CardWeight
					WHEN 2 THEN 1
					ELSE 0
				END
			)
			WHEN 4 THEN 9500000000 + (131072 * 128)
			WHEN 3 THEN 6000000000 + (131072 * 128)
			WHEN 2 THEN 2000000000 + (8192 * 128)
			ELSE 0
		END +
		-- 9
		CASE 
			SUM(
				CASE y.CardWeight
					WHEN 256 THEN 1
					ELSE 0
				END
			)
			WHEN 4 THEN 9500000000 + (131072 * 256)
			WHEN 3 THEN 6000000000 + (131072 * 256)
			WHEN 2 THEN 2000000000 + (8192 * 256)
			ELSE 0
		END +
		-- 10
		CASE 
			SUM(
				CASE y.CardWeight
					WHEN 512 THEN 1
					ELSE 0
				END
			)
			WHEN 4 THEN 9500000000 + (131072 * 512)
			WHEN 3 THEN 6000000000 + (131072 * 512)
			WHEN 2 THEN 2000000000 + (8192 * 512)
			ELSE 0
		END +
		-- J
		CASE 
			SUM(
				CASE y.CardWeight
					WHEN 1024 THEN 1
					ELSE 0
				END
			)
			WHEN 4 THEN 9500000000 + (131072 * 1024)
			WHEN 3 THEN 6000000000 + (131072 * 1024)
			WHEN 2 THEN 2000000000 + (8192 * 1024)
			ELSE 0
		END +
		-- Q
		CASE 
			SUM(
				CASE y.CardWeight
					WHEN 2048 THEN 1
					ELSE 0
				END
			)
			WHEN 4 THEN 9500000000 + (131072 * 2048)
			WHEN 3 THEN 6000000000 + (131072 * 2048)
			WHEN 2 THEN 2000000000 + (8192 * 2048)
			ELSE 0
		END +
		-- K
		CASE 
			SUM(
				CASE y.CardWeight
					WHEN 4096 THEN 1
					ELSE 0
				END
			)
			WHEN 4 THEN 9500000000 + (131072 * 4096)
			WHEN 3 THEN 6000000000 + (131072 * 4096)
			WHEN 2 THEN 2000000000 + (8192 * 4096)
			ELSE 0
		END +
		-- A
		CASE 
			SUM(
				CASE y.CardWeight
					WHEN 8192 THEN 1
					ELSE 0
				END
			)
			WHEN 4 THEN 9500000000 + (131072 * 8192)
			WHEN 3 THEN 6000000000 + (131072 * 8192)
			WHEN 2 THEN 2000000000 + (8192 * 8192)
			ELSE 0
		END +
		--Flush
		CASE COUNT(DISTINCT y.CardSuit)
			WHEN 1 THEN 5800000000
			ELSE 0
		END +
		--Base Weight + Straight
		MIN(y.TotalWeight) AS Weight
	FROM
	(
		SELECT TOP 100 PERCENT
			C1.CardId AS Card1,
			C2.CardId AS Card2,
			C3.CardId AS Card3,
			C4.CardId AS Card4,
			C5.CardId AS Card5,
			--Straight
			CASE WHEN 
				(C2.Weight = 2 * C1.Weight
				AND C3.Weight = 2 * C2.Weight
				AND C4.Weight = 2 * C3.Weight
				AND (
					(C5.Weight = 2 * C4.Weight)
					OR 
					(C2.Card = '2' AND C5.Card = 'A')))
				THEN 
					CASE 
						WHEN (C2.Card = '2' AND C5.Card = 'A')
							THEN 5500000000 - 8191 --Make A = weight 1
						ELSE 5500000000
					END
				ELSE 0
			END +
			--Base weight
			C1.Weight + C2.Weight + C3.Weight + C4.Weight + C5.Weight
				AS TotalWeight,
			CASE x.number
				WHEN 1 THEN C1.Weight
				WHEN 2 THEN C2.Weight
				WHEN 3 THEN C3.Weight
				WHEN 4 THEN C4.Weight
				WHEN 5 THEN C5.Weight
			END AS CardWeight,
			CASE x.number
				WHEN 1 THEN C1.Suit
				WHEN 2 THEN C2.Suit
				WHEN 3 THEN C3.Suit
				WHEN 4 THEN C4.Suit
				WHEN 5 THEN C5.Suit
			END AS CardSuit
		FROM dbo.Cards C1
		JOIN dbo.Cards C2 ON 
			C2.CardId > C1.CardId
		JOIN dbo.Cards C3 ON 
			C3.CardId > C2.CardId
		JOIN dbo.Cards C4 ON
			C4.CardId > C3.CardID
		JOIN dbo.Cards C5 ON
			C5.CardId > C4.CardId
		CROSS JOIN
		(
			SELECT number
			FROM master..spt_values
			WHERE type='p'
				AND number BETWEEN 1 and 5
		) x
		WHERE C1.CardId = @CardId
		ORDER BY
			C1.CardId,
			C2.CardId,
			C3.CardId,
			C4.CardId,
			C5.CardId
	) y
	GROUP BY
		y.Card1,
		y.Card2,
		y.Card3,
		y.Card4,
		y.Card5

	SET @CardId = @CardId + 1
END

... And that's about it! Assuming I made no logic mistakes -- which of course is a very safe assumption because it's me and I never do that -- this is a relatively good start to the Challenge! So take it from here... Or at least until I post the next piece of the puzzle.

In retrospect, I think this would be a good use for a CLR UDT. If I have some time, I'll post that version in the coming days. Enjoy!

Comments off, again!

ATTENTION CMP: Thank you for acquiring SQL Junkies and syndicating our posts all over your sites. I really like the extra exposure.

But what I really don't like is that we're now even bigger targets for comment SPAM. I recieved over 30 last night. It was not fun cleaning them up.

Please fix this!!! There are plenty of solutions available all over the net, including very simple plug-ins for .Text if you don't want to upgrade to Community Server. But these SPAMs cannot continue unabated. They drastically lower the quality of any content they touch.

New stored proc posted: sp_foreachroutine

Like sp_MSforeachtable and sp_MSforeachdb? Then you'll love sp_foreachroutine!

Okay, that was cheesy. But it got the point across. Same as those undocumented MS-shipped procs, but it operates on routines (procedures/functions/views/triggers) instead.

Get it here!

The Texas Hold 'em SQL Challenge

I receive about 25 comment SPAMs a week advertising Texas Hold 'em.

So I figured, why not join the fun?

Here's the challenge: Come up with a schema that represents players, hands, and bets in a series of Texas Hold 'em games. Come up with a method of randomly populating the tables (i.e. dealing the cards). Bonus points for figuring out how to make the players bet intelligently -- i.e. when they have good hands! Finally, come up with a way to figure out how much cash each player has -- a score at any given point in the games.

This should be done entirely using SQL Server 2000 or SQL Server 2005. XML and/or CLR are fair game, of course.

Send the solutions to me via the Contact link. I'll evaluate the responses and post any complete solutions I get. And I'll find some good award(s) for the best solution(s).

PASS Summit sessions announced

Looks like the schedule for the PASS Summit in September has been announced.

Some good stuff in there... Including a session from the Original Player Hater himself, Joe Celko!

...And I'll be giving my talk on Thursday afternoon... Make sure not to miss that one!

SQL Server 2005 T-SQL: Aggregates and the OVER clause

A new feature added to SQL Server 2005 for the sake of the windowing functions is the OVER clause. Using this clause, you can specify ordering or partitioning for the windowing functions. For instance, to enumerate the names of all of the products in the AdventureWorks database that have a list price, along with their list prices and the rank of those prices compared to all of the other prices, the following query can now be used:

SELECT
	P.Name,
	P.ListPrice,
	DENSE_RANK() OVER (ORDER BY P.ListPrice DESC) AS PriceRank
FROM Production.Product P
WHERE 
	ListPrice > 0
ORDER BY 
	P.Name ASC



Name			List Price	PriceRank
-------------------------------------------------
All-Purpose Bike Stand	159.0000	44
AWC Logo Cap		8.9900		98
Bike Wash - Dissolver	7.9500		99
Cable Lock		25.0000		88
Chain			20.2400		93
Classic Vest, L		63.5000		66
Classic Vest, M		63.5000		66
Classic Vest, S		63.5000		66
Fender Set - Mountain	21.9800		91
Front Brakes		106.5000	55
Front Derailleur	91.4900		59
...

So what does this tell us? All-Purpose Bike Stand is the 44th most expensive item sold by AdventureWorks. AWC Logo Cap is the 98th most expensive item. And the Vests are tied for 66th most expensive. Which is why DENSE_RANK was used for this example! But really, this example is only here to demonstrate one use of the OVER clause. And this post isn't about windowing functions or rankings at all. That's another post for another day.

What this post is about is normal, non-windowing aggregate functions. Like SUM(). It turns out that the OVER clause can be used for them, too!

Pretend that you're an employee of AdventureWorks and your manager comes to you with a request: Write a query to return all of the products, their prices, their subcategories, and the average price for all products in the subcategory that any given product belongs to... Why? Perhaps the manager wants to re-categorize products based on whether they fall, percentage-wise, close to the same average price. Or maybe it just makes a good contrived example for showing this feature! Regardless...

Here's how you can solve this in SQL Server 2000:

SELECT
	P.Name AS ProductName,
	P.ListPrice,
	PS.Name AS ProductSubCategoryName,
	x.AveragePrice
FROM Production.Product P
JOIN Production.ProductSubCategory PS ON P.ProductSubCategoryID = PS.ProductSubCategoryID
JOIN
(
	SELECT 
		P2.ProductSubCategoryID,
		AVG(P2.ListPrice) AS AveragePrice
	FROM Production.Product P2
	WHERE 
		P2.ProductSubCategoryID IS NOT NULL
	GROUP BY 
		P2.ProductSubCategoryID
) x ON x.ProductSubCategoryID = P.ProductSubCategoryId
ORDER BY 
	P.Name

I don't know about you (since I have no clue who you are), but I personally have a difficult time reading this. If I came back to this query in six months, it would take me a few minutes to figue out what was going on. And doesn't it feel like there should be a more efficient way of expressing it?

...Well, now there is...

SELECT
	P.Name AS ProductName,
	P.ListPrice,
	PS.Name AS ProductSubCategoryName,
	AVG(P.ListPrice) OVER (PARTITION BY P.ProductSubCategoryID)
FROM Production.Product P
JOIN Production.ProductSubCategory PS ON P.ProductSubCategoryID = PS.ProductSubCategoryID
ORDER BY 
	P.Name

So what's going on here? Under the covers, SQL Server builds a subquery for the average, based on the partitioning column of the OVER clause -- which in this case is ProductSubCategoryID. It's a little bit less efficient in this case than the derived table approach, but a lot cleaner from a readability standpoint. Personally, I think it's a really cool feature, although I don't honestly see myself using it too often.

More ways to express yourself using SQL Server 2005. Madonna would be proud.