Alexander The Great

February 10, 2009

Human Language vs SQL Computation

Filed under: Science — alexanderthegreatest @ 11:26 pm

SQL Server offers a number of ways to get at your data. You don’t explicitly choose – the query optimizer picks the most applicable data access method based on the query and the structure of the data. Our state-of-the-art cognitive science doesn’t understand the specifics of how the mind retrieves information, and I was thinking about how this works as an analogy.

  1. Table scan – An entire table is read from disc, row by row (slow by slow), and evaluated against the query request. Ignoring the downsides, this is the most broadly applicable method in the toolkit. Any query could be satisfied with table scans.
  2. Index scan – Generally a range scan. In a customers table, this would apply to a query for all customers whose last name begins with a C. An index is a B-tree, so the server traverses it, reading only the necessary parts, gets the locations of the physical data rows, and reads them, and only them, from disc. Obviously this is preferable, but not always possible.
  3. Index seek – Usually a single, particular row is demanded, and it’s the only row pulled off the disc. When this happens against a clustered index, you get the best performance possible.

It’ll use a mix; if you ask for two predicates (state == ‘ca’ && last_name == ‘kaku’) joined with an AND, it will evaluate the most limiting one first, then verify only that set against the more permissive one. That’s the most economical plan available.
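To make the three methods concrete, here is a minimal T-SQL sketch. The Customers table, its indexes, and the predicates are hypothetical; the comments describe the access method the optimizer would typically pick.

  -- assume: clustered index on CustomerID, nonclustered indexes on LastName and State
  SELECT * FROM Customers WHERE Phone = '555-0100';     -- no index on Phone: table scan
  SELECT * FROM Customers WHERE LastName LIKE 'C%';     -- range predicate: index (range) scan
  SELECT * FROM Customers WHERE CustomerID = 42;        -- clustered index seek: one row
  -- two ANDed predicates: seek the more selective index first,
  -- then check the remaining predicate against only those rows
  SELECT * FROM Customers WHERE State = 'CA' AND LastName = 'Kaku';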

The English language doesn’t have a word for schadenfreude. When somebody says language X doesn’t have a word for Y, you won’t normally lose money betting against them. The implication is that people who use that language don’t understand the concept behind that word, as if Americans couldn’t comprehend the sister of your girlfriend. But what if I said English doesn’t have a word for the sister of your father? You would say “That sounds like your aunt.” And you’d have the answer pretty quickly – obviously you’re “hitting an index” and only even considering a fraction of the words you know that might mean that.

Could you do something like a table scan of the mind, if you wanted to? Say you want to list all the words you know, or even the names of all the people you know? I don’t think that’s possible. It’s not because we forget – this stuff comes back instantly, in context. It just doesn’t seem to be organized to be available out of context.

March 20, 2008

The Prevayler, Klaus Wuestefeld, and Cooked Numbers

Filed under: Programming — alexanderthegreatest @ 12:16 pm

Actually, two-phase commit is unnecessary. —KlausWuestefeld

A two-phase commit is a protocol for distributed transactions. In a local transaction, work is done in a tentative way and then committed, or made permanent. This is a single-phase commit. When a transaction is run against more than one store or data source, this is insufficient. Consider this scenario:

  1. Data must be changed in 3 related databases (each hosted on its own server or “box”) to process a logical unit of work – say, creating a new user account.
  2. A winter storm knocks down a power line, causing one server to lose power and go down.
  3. The 2 other servers are unable to get confirmation that the 3rd has successfully completed its update, so they roll back their part of the transaction, preserving a valid state across the system. (This must be possible even after a local commit.)

Ideally, the distributed transaction would not fail, and no rollback would be necessary. However, in the real world, this inevitability must be accounted for. A non-computer analogy would be a space shuttle (or even a seafaring vessel) preparing to launch – if any system isn’t ready, the entire launch is aborted. This is very much necessary.
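For a concrete feel, here is a minimal T-SQL sketch of a distributed transaction as seen from one participating server. The linked server name (RemoteServer) and the table names are hypothetical; MS DTC plays the coordinator role behind the scenes.

  BEGIN DISTRIBUTED TRANSACTION;  -- enlists MS DTC as the transaction coordinator
      INSERT INTO dbo.Accounts (UserName) VALUES ('new_user');
      INSERT INTO RemoteServer.Billing.dbo.Accounts (UserName) VALUES ('new_user');
  COMMIT TRANSACTION;  -- phase 1: every participant votes "prepared"; phase 2: all commit
  -- if any participant cannot prepare (say, it lost power), DTC rolls everything back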

Consider this quote from the very same article:

Actually the TwoPhaseCommit is the algorithm for resolving commit commands for distributed transactions. What prevayler does is a basic recovery mechanism, and it sweeps concurrency control under the carpet by serializing all concurrent accesses to the global object model. Well, it works in situations, but you have to be very careful in analyzing whether it works for your particular problem. If you have a distributed transaction situation, the lack of support for TwoPhaseCommit can be a very serious deficiency, but again, depending on the particular situation it may not be too big a problem. Because what TwoPhaseCommit does at the infrastructure layer, transparent to the developer, could be done at the application layer with application specific business logic. I guess that’s what Klaus’ article is trying to say: that you can deal with the issues at the business logic level. However, depending on the situation, that extra business logic you have to add can be extremely simple or it can be extremely complex, and I’m afraid the examples he presented are not very representative of how and why one would be in a distributed transaction situation and how to handle it.

The wiki article introducing The Prevayler gives these statistics:

- 3,251 times faster than MySQL.
- 9,983 times faster than ORACLE.

This is probably true for some operations.  For example, someone else in the article goes on to explain that Oracle is tuned for a mixture of read and write operations, while MySQL has less optimization for real world work loads.  While Oracle in general delivers much more performance than MySQL, very specific cases can be set up to show the reverse.

The Prevayler stores its data entirely in memory and serializes all “transactions” (runs them step by step, one after another), so scale is severely limited, and real world data sets simply cannot be processed with this Java toy.  Much like Safari, the world’s fastest browser, and one which ignores instructions (like, in some cases, anything to do with backgrounds) in order to achieve that speed – hardly useful!

Still, toy is perhaps too harsh a word, for there are many web applications with trivial needs that can be satisfied by such a system.  It’s annoying to read Klaus defend his work by saying, over and over again, that any feature not implemented by “The Prevayler” is unnecessary in any situation.  As shown with two-phase commit, this type of arrogance is off-putting, and makes a person want to stick with standardized, tested, non-buggy RDBMS code.

March 19, 2008

C# and VB.NET Stored Procedures Suck

Filed under: Programming — alexanderthegreatest @ 1:59 pm

Somebody found the blog you’re reading by searching Google for evidence that “c# stored procedures suck“. (Probably this post, possibly this one.) This search phrase isn’t exactly true – C# procedures by themselves can be a good thing, under the rarest of circumstances. The pain comes when people use them in all the billions of ways they aren’t suited to. On the other hand, VB sprocs always suck. 🙂

First, you don’t work with them in the same way as normal procedures. At best you can think of them as encrypted procs, and that’s if you don’t have to deploy them. Deployment means compiling through the command line (CSC.EXE /t:library your_csharp_code_file.cs), then writing SQL that points to a location in the file system! First you run a Create Assembly [SysName] From [File Path To DLL] query, and then you create your procedure pointing to the fully qualified name of the method in the code. Don’t forget to give the method public accessibility and decorate it with the SqlProcedure attribute. And don’t even think about lying about whether a function is deterministic – SQL will believe you!!!
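To make that concrete, here is roughly what the deployment looks like in T-SQL (a sketch; the assembly name, file path, class, and method names are placeholders):

  -- after compiling:  CSC.EXE /t:library SqlRoutines.cs
  CREATE ASSEMBLY SqlRoutines
  FROM 'C:\deploy\SqlRoutines.dll'
  WITH PERMISSION_SET = SAFE;
  GO
  -- point the proc at Namespace.Class.Method inside that assembly
  CREATE PROCEDURE dbo.ParseList @input NVARCHAR(MAX)
  AS EXTERNAL NAME SqlRoutines.[MyCompany.SqlRoutines].ParseList;
  GO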

If you could type your C# or VB code into SQL Server Management Studio, hit F5, have it compiled and deployed to the current database, and then sp_helptext a .NET proc, it wouldn’t hurt the maintainability of code. Think of the poor DBA trying to track down the cause of bad performance when code they have no knowledge of, or view into, is involved.

SQL Server has its own memory management, its own threading and scheduling, its own locking, and it talks to the operating system directly. It’s a system. You can tap into it with the Common Language Runtime, but the CLR runs in a very different way, and the interop sometimes causes great overhead. The managed heap is there in all cases, but God help you if your code uses threads or calls Win32 directly.

When .NET Procs Don’t Suck

It seems like the answer would be “when you don’t know T-SQL very well”, but that’s how production systems are brought to a limp. The CLR is absolutely not a replacement for T-SQL. It’s an add-on for some very specific scenarios.

String Manipulation

As a programmer at a consulting firm I can’t even count how many times I’ve seen Transact-SQL being used to parse out a comma-delimited list. This is atrocious. The code will make you vomit, and the server isn’t much happier; it’s like asking Babe Ruth to throw your dog a ball.
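For the uninitiated, this is roughly the kind of loop that provokes the vomiting (a sketch of the anti-pattern, not a recommendation; the list contents are made up):

  -- loop-based CSV splitter in T-SQL: the anti-pattern in question
  DECLARE @list VARCHAR(8000), @pos INT, @item VARCHAR(100);
  DECLARE @items TABLE (item VARCHAR(100));
  SET @list = 'red,green,blue';
  WHILE LEN(@list) > 0
  BEGIN
      SET @pos = CHARINDEX(',', @list);
      IF @pos = 0
          SET @pos = LEN(@list) + 1;
      SET @item = LEFT(@list, @pos - 1);
      INSERT INTO @items (item) VALUES (@item);
      SET @list = SUBSTRING(@list, @pos + 1, LEN(@list));
  END
  SELECT item FROM @items;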

.NET code is a lot more elegant at this kind of task, and it’s optimized to do this sort of thing. Not strings in particular, actually, because they’re immutable to make threading easier. But logic and object manipulation are the domain of coding systems other than structured query language and its extensions.

XML

The CSV string should really be XML, whether you use regular SQL or decide to write in Java Sharp #. This way you don’t have to worry about character limits (especially in Unicode!). You don’t have to write a parser.
SQL Server 2000 and 2005 have support for XML, using Microsoft’s XML DOM (MSXML). Purely from a SQL code standpoint, there’s an easy way and a cumbersome way to do this. You can grab the values directly with XPath, or you can use OpenXML and maintain a handle to the parsed document. Unless you use the second method, you’re forcing SQL to parse the XML over and over again. But that’s only realistic from within a single batch. The same goes for FOR XML AUTO in SELECT queries – this uses the DOM to create the XML result.
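A minimal sketch of the two approaches (the document shape and column names are hypothetical):

  DECLARE @xmlText NVARCHAR(4000), @doc INT;
  SET @xmlText = N'<customers><customer id="1" name="Ada"/><customer id="2" name="Klaus"/></customers>';

  -- the cumbersome-but-cached way: parse once, keep a handle, reuse it within the batch
  EXEC sp_xml_preparedocument @doc OUTPUT, @xmlText;
  SELECT id, name
  FROM OPENXML(@doc, '/customers/customer', 1)
  WITH (id INT '@id', name NVARCHAR(50) '@name');
  EXEC sp_xml_removedocument @doc;

  -- the direct way (SQL Server 2005): shred with XPath via the xml type's nodes()/value()
  DECLARE @x XML;
  SET @x = @xmlText;
  SELECT c.value('@id', 'INT') AS id, c.value('@name', 'NVARCHAR(50)') AS name
  FROM @x.nodes('/customers/customer') AS t(c);
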
Of course .NET has a different and more granular caching mechanism in general. A developer can parse XML once, store some things for later use, and get to everything in its own time. It can make use of the object data before generating XML from it. Any of these things, done correctly, can improve scalability by reducing SQL Server overhead, especially when it comes to XML reparsing. Let the database server do what it’s good at, but take non-RDBMS work somewhere else.

Custom Data Types

I don’t have much to say on this one myself, but Microsoft has an in-depth code sample showing how to work with spatial data using SQL CLR sprocs.

In Conclusion

Use SQL CLR procs, be they in C# or VB, very sparingly. But don’t hesitate to use them when they’re the best tool in the kit. Microsoft put more arrows in our quiver, but we should still use them carefully, and have the right one on hand when we need to slay those dragons.

March 6, 2008

An Idiot Misunderstands SQL

Filed under: Modern Life,Programming,Science,Software — alexanderthegreatest @ 3:44 pm

Since my Digital Point – Misunderstood post has proven such a hit, it occurs to me to share a hysterical post I found on SQL.NET. This idiot is writing “Death to TSQL” because of horrible SQL coding skills and because the man doesn’t seem to understand that SQL is not ASP. (Yes, you read that correctly – SQL and ASP Ancient.)

The article concludes that TSQL should never be used for stored procedures, that procs should never (“.EVER.”) use temp tables, and that the CLR (Common Language Runtime – aka .NET) should be used for 1980s-style programming. Quoth the raven: “In this case, it’s so overwhelmingly better than TSQL that I cannot recommend to anyone coding for SQL 2005 to use TSQL for stored procedures, even CRUD procs.”

CRUD = Create, Read, Update, Delete. This went out of style by 1990. A client application should not be allowed to perform primitive operations like these, for any number of reasons. Performance – we want chunky, not chatty, interprocess protocols. Simplicity and maintainability – the client application should not have to understand the full implementation in the database; this should be hidden from them, and for them. (This is half the reason we use sprocs in the first place!!) Data quality and integrity – if the network connection is severed midway through a run, while half the creates have run and some of the updates, the data is left in an indeterminate state.

To stack the deck, Payton Byrd makes some shocking and unforgivable errors:

  1. The T-SQL code is designed to be as resource-intensive and slow as possible.  Note the cursor use.  The test code comes with this warning: “This may not be the most efficient way to do this, but I don’t specialize in TSQL stored procs”, which should disqualify the results for any thinking person.
  2. One reason the CLR code is assumed to perform faster than the TSQL code is that the .NET procedure uses a StringBuilder.  In a related article, our anti-hero tells us that if the test data set were larger, the difference would be more pronounced, because of this.  A StringBuilder is a hack, a workaround for .NET because the CLR handles strings in a funny way.  When your code creates and then alters a string, both copies live on in memory, because strings are immutable (which, among other things, makes threading easier).  Constantly creating and repointing references is so tiresome for the host computer that a class was written to give programmers an entirely different way of concatenating strings.  SQL does not suffer this limitation.  One of the key assumptions holding up the conclusion (aka guess) giving this article reason to be is fundamentally flawed.
  3. Wild conclusions: “The performance results are absolutely stunning, the CLR stored procedure is 14 times faster than the TSQL stored procedure.”
  4. More profound confusion: “I’m willing to do more testing, but I think this pretty well proves the point that you can get much better performance from the CLR than from TSQL when you step beyond the simplest of stored procedures.”   Apparently the meaning of complex was lost on this writer.  Anybody who’s ever done any database work knows that complex != writing loop code in a set-based language to put commas between values.

I’m not writing this to make fun of an incompetent developer.  I’m writing to point out flaws of logic that seem all too common these days.  The internet seems to have created a world where everyone is expected (even if only by themselves) to be an expert in some area.  When a supposed expert doesn’t understand something, the rule this day and age seems to be that one should make assumptions and run with them.  As Tripy would say, do your programming first; then, when there’s no time left, do your analysis.

An even better way to cause yourself grief is to test an edge case and then assume the general world is described by it.  If I wrote that there’s no reason ever to drive anywhere because I took a train yesterday, arrived at the station a moment before the train departed, and found I saved 5 minutes by not parking, people would rightly assume I had lost my wits.

The Texas Sharpshooter Fallacy describes exactly this.  Payton’s test case addresses a tremendously rare (and poorly coded) situation, one that might never be encountered by a senior database programmer.  From this highly specific test he draws sweeping general conclusions.  A test was engineered to provide support for a conclusion that had already been drawn.  The test tells us nothing, except that we should avoid Payton’s blog if we aim to learn.

March 3, 2008

SQL Server Deployment Checklist

Filed under: Science,Software — alexanderthegreatest @ 4:37 pm

John Hicks (of MSDN Blogs) has posted a very interesting article on SQL Server deployments that raised my eyebrows. It begins with the text “Every once in a while, a process gets too complex for humans to manage consistently.”

The article goes into great detail about the steps that should be taken to performance-tune a new node in a SQL Server cluster, or to prepare a new server that’s being migrated to. This is extremely high scale stuff, probably boring to mom-n-pop shops that need to store and instantly access fewer than perhaps 20 million records. (PHP users will find this article endlessly tiresome, because PHP users turn their nose up at any system meant to serve 25 or more concurrent users. Such is the demand of “enterprise level” evil corporate programming, a need of monstrous slave masters to rob the noble developers of the world of their identities and very souls. I’m not sure where this prejudice comes from, but it’s inescapable at any number of web forums.) The paper describes physical tuning only, completely ignoring the logical domain.

Some highlights

  1. Set your disc allocation unit (cluster) size to 64 KB to match the size of an extent (eight 8 KB pages).
  2. Set the network packet size to 8 KB to take advantage of fewer, larger packets (paired with jumbo frames at the network layer).
  3. Create a tempdb data file for each CPU core on the server (see the sketch after this list).
  4. Set MAXDOP to correspond to your disc subsystem hardware. (I don’t understand why – maybe someone can explain this to me in the comment stream. Shouldn’t these settings be independent from one another? Disc-bound transactions should allow the threads to block and be preempted by CPU-bound transactions, no?)
  5. Use trace flags to control lock escalation – DANGER, WILL ROBINSON!
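A minimal T-SQL sketch of a few of these settings; the MAXDOP value, file name, path, and size below are placeholders, not recommendations:

  EXEC sp_configure 'show advanced options', 1;
  RECONFIGURE;
  EXEC sp_configure 'max degree of parallelism', 4;    -- item 4: tune MAXDOP
  EXEC sp_configure 'network packet size (B)', 8192;   -- item 2: 8 KB network packets
  RECONFIGURE;
  -- item 3: one tempdb data file per CPU core
  ALTER DATABASE tempdb
  ADD FILE (NAME = tempdev2, FILENAME = 'T:\tempdb2.ndf', SIZE = 1GB);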

February 2, 2008

Even ASP Ancient Can Do XML With Access

Filed under: Software — alexanderthegreatest @ 9:42 pm

Access is a baby RDBMS, or Relational Database Management System. It runs SQL, stores its data in tables, provides queries (SELECT with projection, DML, even DDL), and has code integration in the sense that VBA (Visual Basic for Applications – embedded, crappy VB scripting) functions you write can be used in a SQL query.

In all honesty, SQL Server has allowed this forever with UDFs, but only recently with the CLR. This is one way that Access has led SQL Server, if you simply must write your functions in Visual Basic.

But I digress. The point is that even in absolutely ancient ASP, and with Access the baby brother to SQL Server, you can generate XML. Here’s how.

This is a legacy tip – SQL Server has better support for XML by far, without even needing an external language like ASP. And SQL Server is free (in its Express edition) – Access is not. This defies the conventional wisdom that SQL Server is more expensive than Access (true in web hosting, and true if you buy enterprise editions of SQL Server).
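As a taste of that built-in support, a minimal T-SQL sketch (the table and column names are hypothetical):

  -- returns the result set as XML, no external language required
  SELECT CustomerID, LastName, State
  FROM Customers
  FOR XML AUTO;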
