Alexander The Great

December 8, 2008

Html Agility Pack

Filed under: Programming — alexanderthegreatest @ 10:42 pm

Has anyone else used this? I’m teetering between its given name and Html Agitation Pack. Written by an MSDN guru and moved to CodePlex, this is the XmlDocument of the web.

An aside: people who use open source languages like Perl and Ruby will be shocked to learn how difficult it is for Microsoft developers to work with HTML programmatically. We’re able to consume XML very quickly, so long as it’s well formed, but any error in the markup renders the whole document unreadable. Microsoft’s design goal was to never guess at the developer’s intent, so anything the least bit ambiguous is an exception. Agility Pack is an open source library for parsing HTML and making the guesses MS was unwilling to make outside IE.
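Here’s roughly what using it looks like, as a tiny sketch with deliberately broken markup (the snippet itself is made up). XmlDocument would reject this outright; the Agility Pack shrugs it off:

```csharp
using System;
using HtmlAgilityPack;

class Demo
{
    static void Main()
    {
        // Unclosed tags on purpose: XmlDocument.LoadXml would throw on this.
        string html = "<p>Hello <b>world";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);   // no exception; the parser guesses at the author's intent

        // Query the repaired DOM with XPath, much like XmlDocument.
        foreach (HtmlNode bold in doc.DocumentNode.SelectNodes("//b"))
            Console.WriteLine(bold.InnerText);
    }
}
```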

I’m finding it slow. The software has trouble with certain encodings and, worse, it throws stack overflow exceptions! That means it makes far too heavy use of recursion. Generally a loop (sometimes with a stack or a queue) will fix the problem, but it’s very hard to track down in such a large code base.
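For what it’s worth, the usual cure is exactly that loop-plus-stack rewrite. Here’s a minimal sketch of my own (not code from the library) of walking an Agility Pack DOM iteratively, so a deeply nested document can’t blow the call stack:

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

static class DomWalker
{
    // Depth-first walk without recursion: an explicit Stack<T> replaces the call stack.
    public static IEnumerable<HtmlNode> Walk(HtmlNode root)
    {
        var pending = new Stack<HtmlNode>();
        pending.Push(root);

        while (pending.Count > 0)
        {
            HtmlNode current = pending.Pop();
            yield return current;

            // Push children in reverse so they pop back out in document order.
            for (int i = current.ChildNodes.Count - 1; i >= 0; i--)
                pending.Push(current.ChildNodes[i]);
        }
    }
}
```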

Still, this hasn’t stopped others from finding creative uses for the library. The page localizer is a fascinating example. And here’s a converter, allowing LINQ over web pages!

March 27, 2008

HierarchyID in SQL Server 2008

Filed under: Programming — alexanderthegreatest @ 9:53 am

http://blogs.msdn.com/manisblog/archive/2007/08/17/sql-server-2008-hierarchyid.aspx

Int has always been my hierarchy ID, but hey, new is good.  I guess it remains to be seen how this will perform in the real world.  It seems like there are a lot of new features in SQL 08, from this new data type (and why the hell would they put ID at the end of a data type? They don’t even do that for UniqueIdentifier, which really is an ID wrapped up in a data type!) to geospatial data and more.

We’ll probably wind up upgrading right around the time SQL Server 2011 is released, but I’m still interested in this new version and the direction SQL and Microsoft are going in general.

March 20, 2008

The Prevayler, Klaus Wuestefeld, and Cooked Numbers

Filed under: Programming — alexanderthegreatest @ 12:16 pm

Actually, two-phase commit is unnecessary. —KlausWuestefeld

A two-phase commit is a protocol for distributed transactions. In a local transaction, work is done in a tentative way and then committed, or made permanent; this is a single-phase commit. When a transaction runs against more than one store or data source, that is insufficient. Consider this scenario:

  1. Data must be changed in 3 related databases (each hosted on its own server or “box”) to process a logical unit of work – say creating a new user account.
  2. A winter storm knocks down a power line, causing one server to lose power and go down.
  3. The 2 other servers are unable to get confirmation that the 3rd has successfully completed its update, so they roll back their part of the transaction, preserving a valid state across the system. (This must be possible even after a local commit.)

Ideally, the distributed transaction would not fail and no rollback would be necessary. In the real world, however, this inevitability must be accounted for. A non-computer analogy would be a space shuttle (or even a seafaring vessel) preparing to launch – if any system isn’t ready, the entire launch is aborted. This is very much necessary.
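To make the two phases concrete, here is a bare-bones coordinator sketch. The IParticipant interface and its Prepare/Commit/Rollback methods are my own invention for illustration, not any real transaction manager’s API, and a production coordinator would also durably log its decision so it can recover from its own crash:

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical participant: each database does its work tentatively in Prepare,
// then makes it permanent in Commit or undoes it in Rollback.
interface IParticipant
{
    bool Prepare();
    void Commit();
    void Rollback();
}

static class TwoPhaseCommit
{
    public static bool Run(IEnumerable<IParticipant> participants)
    {
        var all = participants.ToList();

        // Phase one: every participant must vote yes
        // (a downed server never answers, which counts as a no).
        bool allPrepared = all.All(p => p.Prepare());

        // Phase two: commit everywhere, or roll back everywhere,
        // so the system as a whole stays in a valid state.
        foreach (var p in all)
        {
            if (allPrepared) p.Commit();
            else p.Rollback();
        }

        return allPrepared;
    }
}
```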

Consider this quote from the very same article

Actually the TwoPhaseCommit is the algorithm for resolving commit commands for distributed transactions. What prevayler does is a basic recovery mechanism, and it sweeps concurrency control under the carpet by serializing all concurrent accesses to the global object model. Well, it works in situations, but you have to be very careful in analyzing whether it works for your particular problem. If you have a distributed transaction situation, the lack of support for TwoPhaseCommit can be a very serious deficiency, but again, depending on the particular situation it may not be too big a problem. Because what TwoPhaseCommit does at the infrastructure layer, transparent to the developer, could be done at the application layer with application specific business logic. I guess that’s what Klaus’ article is trying to say: that you can deal with the issues at the business logic level. However, depending on the situation that extra business logic you have to add can be extremely simple or it can be extremely complex, and I’m afraid the examples he presented are not very representative of how and why one would be in a distributed transaction situation and how to handle it.

The wiki article introducing The Prevayler gives these statistics

  • 3,251 times faster than MySQL.
  • 9,983 times faster than ORACLE.

This is probably true for some operations. For example, someone else in the article goes on to explain that Oracle is tuned for a mixture of read and write operations, while MySQL has less optimization for real world workloads. While Oracle generally delivers much more performance than MySQL, very specific cases can be set up to show the reverse.

The Prevayler stores its data entirely in memory and serializes all “transactions” (runs them step by step, one after another), so scale is severely limited, and real world data sets simply cannot be processed with this Java toy. Much like Safari, the world’s fastest browser, which ignores some instructions (in some cases anything to do with backgrounds) in order to achieve that speed – hardly useful!

Still, toy is perhaps too harsh a word, for there are many web applications with trivial needs that can be satisfied by such a system. It’s annoying to read Klaus defend his work by saying, over and over again, that any feature not implemented by “The Prevayler” is unnecessary in any situation. As shown with two-phase commit, this type of arrogance is off-putting, and makes a person want to stick with standardized, tested, non-buggy RDBMS code.

March 19, 2008

C# and VB.NET Stored Procedures Suck

Filed under: Programming — alexanderthegreatest @ 1:59 pm

Somebody found the blog you’re reading by searching Google for evidence that “c# stored procedures suck“. (Probably this post, possibly this one.) This search phrase isn’t exactly true – C# procedures by themselves can be a good thing, under the rarest of circumstances. The pain comes when people use them in all the billions of ways they aren’t suited to. On the other hand, VB sprocs always suck. 🙂

First, you don’t work with them in the same way as normal procedures. At best you can think of them as encrypted procs, and that’s if you don’t have to deploy them. The way to do that is compiling through the command line (CSC.EXE /t:library your_csharp_code_file.cs), and then you have to write SQL pointing to a location in the file system! First you run a Create Assembly [SysName] From [File Path To DLL] query, and then you create your procedure pointing to the fully qualified name of the method in the code. Don’t forget to make it public and decorate it with the SqlProcedure attribute. Don’t even think about lying about whether a function is deterministic – SQL will believe you!!!
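For concreteness, here is a minimal sketch of what the C# side has to look like (the method and names are invented for illustration). The method must be public and static and carry the SqlProcedure attribute; after compiling with csc /t:library, you point CREATE ASSEMBLY at the DLL and CREATE PROCEDURE … EXTERNAL NAME at the fully qualified method name.

```csharp
using Microsoft.SqlServer.Server;
using System.Data.SqlTypes;

public class StoredProcedures
{
    // Must be public and static, or CREATE PROCEDURE ... EXTERNAL NAME
    // will refuse to bind to it.
    [SqlProcedure]
    public static void HelloFromClr(SqlString name)
    {
        // SqlContext.Pipe sends messages/results back to the calling session.
        SqlContext.Pipe.Send("Hello from the CLR, " + name.Value);
    }
}
```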

If you could type your C# or VB code into SQL Server Management Studio, hit F5, have it compiled and deployed to the current database, and then sp_helptext a .NET proc, it wouldn’t hurt the maintainability of code. Think of the poor DBA trying to track down the cause of bad performance when code they have no knowledge of, or view into, is involved.

SQL Server has its own memory management, its own threading and scheduling, its own locking, and talks to the operating system. It’s a system. You can tap into it with the Common Language Runtime, but the CLR runs in a very different way, and the interop sometimes causes great overhead. The managed heap is there in all cases, but God help you if your code uses threads or calls Win32 directly.

When .NET Procs Don’t Suck

It seems like the answer would be “when you don’t know T-SQL very well,” but that’s how production systems end up limping along. The CLR is absolutely not a replacement for T-SQL. It’s an add-on for some very specific scenarios.

String Manipulation

As a programmer at a consulting firm I can’t even count how many times I’ve seen Transact-SQL being used to parse out a comma delimited list. This is atrocious. The code will make you vomit, and the server isn’t much happier – it’s like asking Babe Ruth to throw your dog a ball.

.NET code is a lot more elegant at this kind of task, and it’s optimized to do this sort of thing. Not strings in particular, actually, because they’re immutable to make threading easier. But logic and object manipulation are the domain of coding systems other than structured query language and its extensions.
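As a toy illustration (my own example, not code from any of the gigs above), the comma-list problem that takes a page of WHILE and CHARINDEX in T-SQL collapses to a few lines here:

```csharp
using System;

class CsvDemo
{
    static void Main()
    {
        string list = "17,42,99, 3 ,256";

        // Split, trim, and parse: exactly the token shuffling .NET is comfortable with.
        foreach (string token in list.Split(','))
        {
            int value = int.Parse(token.Trim());
            Console.WriteLine(value);
        }
    }
}
```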

XML

The CSV string should really be XML, whether you use regular SQL or decide to write in Java Sharp #. This way you don’t have to worry about character limits (especially in Unicode!). You don’t have to write a parser.

SQL Server 2000 and 2005 have support for XML, using Microsoft’s XML DOM, or MSXML. Purely from a SQL code standpoint, there’s an easy way and a cumbersome way to do this. You can grab the values directly with XPath, or you can use OpenXML and maintain a handle to the parsed document. Unless you use the second method, you’re forcing SQL to parse the XML over and over again. But that’s only realistic within a single batch. The same goes for FOR XML AUTO in SELECT queries – this uses the DOM to create the XML result.

Of course .NET has a different and more granular caching mechanism in general. A developer can parse XML once, store some things for later use, and get to everything in its own time. It can make use of the object data before generating XML from it. Any of these things, done correctly, can improve scalability by reducing SQL Server overhead, especially when it comes to XML reparsing. Let the database server do what it’s good at, but take non-RDBMS work somewhere else.
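A minimal sketch of that parse-once idea, with a made-up document: load the XML into an XmlDocument a single time, then run as many queries against the cached DOM as you like, with no reparsing.

```csharp
using System;
using System.Xml;

class XmlCacheDemo
{
    static void Main()
    {
        string payload = "<order id='7'><item sku='A1' qty='2'/><item sku='B9' qty='5'/></order>";

        // Parse once...
        var doc = new XmlDocument();
        doc.LoadXml(payload);

        // ...then hit the cached DOM as many times as needed.
        Console.WriteLine(doc.DocumentElement.GetAttribute("id"));
        foreach (XmlNode item in doc.SelectNodes("//item"))
            Console.WriteLine(item.Attributes["sku"].Value + " x " + item.Attributes["qty"].Value);
    }
}
```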

Custom Data Types

I don’t have much to say on this one myself, but Microsoft has an in-depth code sample showing how to work with spatial data using SQL CLR sprocs.

In Conclusion

Use SQL CLR procs, be they in C# or VB, very sparingly. But don’t hesitate to use them when they’re the best tool in the kit. Microsoft put more arrows in our quiver, but we should still use them carefully, and have the right one on hand when we need to slay those dragons.

March 13, 2008

SQL Danger

Filed under: Evolution,Programming,Software — alexanderthegreatest @ 3:53 pm

SQL Server supports a cooperative, non-preemptive threading model in which the threads voluntarily yield execution periodically, or when they are waiting on locks or I/O. The CLR supports a preemptive threading model. If user code running inside SQL Server can directly call the operating system threading primitives, then it does not integrate well into the SQL Server task scheduler and can degrade the scalability of the system. The CLR does not distinguish between virtual and physical memory, but SQL Server directly manages physical memory and is required to use physical memory within a configurable limit.

ms-help://MS.SQLCC.v9/MS.SQLSVR.v9.en/denet9/html/d280d359-08f0-47b5-a07e-67dd2a58ad73.htm

If you’re scratching your head, thank your lucky stars you don’t need to understand this gibberish. I’ve been having to focus on SQL Server and .NET Integration, also known as SQL CLR integration. A lot of people have made fairly bad choices in the very, very recent past. Just because you can write your stored procedures now in Visual Basic doesn’t mean you should. Obviously it cuts both ways, and that doesn’t mean you shouldn’t – the trouble seems to come when people can’t decide whether they should or not.

Having to help guide that decision, I found the first paragraph above, which has some frightening guidance on the matter. At first glance it seems to suggest never using .NET, but that isn’t really the case. What it really says is that long-running code that might interfere with SQL’s thread scheduling can be bad. Code that spawns new threads really shouldn’t be hosted in SQL. P/Invoke calls can hurt.

But XML processing is an example of something that does none of those things. And while SQL has good XML support, for some operations .NET is better. Also, the set-based nature of SQL compared to the procedural and object-oriented nature of C# means you have better control over caching in .NET, so by porting this type of sproc you can parse an XML document in memory once, instead of once for every time you need to access it.

After all, it’s not Sybase in 1997!

March 7, 2008

Hungarian Notation

Filed under: Evolution,Programming,Software — alexanderthegreatest @ 3:44 pm

Hungarian Notation is a naming convention whose main rule is that the prefix in a variable name should be longer than the name of the variable itself. For example, should you find yourself needing a string called Foo, you might call the variable that refers to it as gpnzstrFoo. You would know this because

  • g = global variable
  • p = pointer, because in C++ everything is a pointer
  • nz = null terminated (sometimes sz)
  • str = string value
  • Foo = the actual name of the variable

This is a mouthful and a lot to memorize. Further, it means that when you refactor your code and change the type (you might encapsulate string handling into a SuperString class, as a schoolboy example) you have to either rename the variable, which takes O(n) time with n being the number of uses, or leave behind a name that now lies about the type. Modern development environments offer refactoring services to safely accomplish the rename, but they also provide other services that make HN redundant and unnecessary.

Hungarian gives an illuminating view of the history of software development, and of Windows in particular. In the old days one would pull down the Windows “header file” (#include “windows.h”) and delve into its contents, using Microsoft as an example of successful large scale development efforts done right. In fact Microsoft is often credited with birthing Hungarian, in order to make scrolling work in Word and Excel. Today, like Windows 3.1, HN is relegated to the scrap heap of history.

Here is the part that might have fellow web site producers call me a spammer – I’m going to copy and paste from a treatise on Hungarian Notation some text which explains the failings of this convention much better than I ever could. (Borrowing liberally from other sources, sometimes called “scraping”, indicates a lazy webmaster and is typically associated with made-for-AdSense spam blogs. In this case, however, I’m simply trying to point my dear readers to a helpful resource, for those of you who are interested in this topic.)

My problem with Hungarian Notation is more fundamental and stylistic – I think it encourages sloppy, sprawling, poorly decomposed code and careless, ill-coordinated maintenance. Simply put, if your namespace is so polluted that you need a cheap trick like HN to keep track of your variables, you’ve made a terrible mistake somewhere. You should never have so many variables and constants visible at one time; if you do, then you need to review either your data structures, or your functions.

This is especially true for code written under either the OO or FP methodologies, as a primary goal in each is to isolate (in OO) or eliminate (in FP) those variables that are not immediately needed.

HN notation also presents a problem in the case of OO, in that it interacts very poorly with polymorphism. It is actually undesirable to know the class of an object in many instances, but at the same time marking it with the parent class tag is misleading and contradicts the goal of HN. As for marking globals as separate from locals, why on earth do you have any globals in the first place? 😉 — JayOsako

I agree that HungarianNotation is bad because it’s a crutch for overly large namespaces. That’s why I use single character variable names to force myself to write clear code. — EricUlevik

My Response: 26 variables? At once? How can you manage that?

Seriously, HungarianNotation is only a symptom of a larger problem. Too many programmers see the solution for excessive state to be more state; they end up with variables to track variables to track variables, all bound up with arcane rules for naming that are supposed to be indicative of their meaning (which it rarely is). The point isn’t so much that you have to limit the number of variables, but rather that large proliferations of variables in the local namespace is a sign of a design mistake. Why would they all be visible at once? Aren’t there some abstractions they can bound into? Why do you need this particular bit of state visible here, anyway?

The rule should be simple: if you don’t need to see it, you shouldn’t be able to. Period.

Like VisualTools, HungarianNotation is not SoftwareEngineering, its TheIllusionOfSoftwareEngineering. It’s an easy and comfortable fix that looks good because you can DoItNow? and you don’t have to do all the hard thinking that comes from trying to do the RightThing. The time you ‘save’ using it will be more than spent later trying to maintain the resulting morass. — JayOsako

This is very well put, and pretty much sums up my feelings on the matter. — KevlinHenney

Any interested programmers are encouraged to read the original page. This bit cracked me up, since we’ve mentioned the historical implications of the subject

What I think you really want is for your interactive development environment to give you a rollover (ToolTip?, for you Windows junkies, or minibuffer message for you Emacs junkies) that shows you the declaration of this variable at every use point when your cursor goes over it. — RusHeywood

Visual Slick Edit (Windows) does this.

VC6 has something approaching this. They copied it from VB – when you’re typing code to call a method, it pops up with a prompt window showing the names of the arguments. It’s not perfect, but it’s a step in the right direction. — RogerLipscombe

March 6, 2008

An Idiot Misunderstands SQL

Filed under: Modern Life,Programming,Science,Software — alexanderthegreatest @ 3:44 pm

Since my Digital Point – Misunderstood post has proven such a hit, it occurs to me to share a hysterical post I found on SQL.NET. This idiot is writing “Death to TSQL” because of horrible SQL coding skills and because the man doesn’t seem to understand that SQL is not ASP. (Yes, you read that correctly – SQL and ASP Ancient.)

The article concludes that TSQL should never be used for stored procedures, that procs should never (“.EVER.”) use temp tables, and that the CLR (Common Language Runtime – aka .NET) should be used for 1980s style programming. Quoth the raven: “In this case, it’s so overwhelmingly better than TSQL that I cannot recommend to anyone coding for SQL 2005 to use TSQL for stored procedures, even CRUD procs.”

CRUD = Create, Read, Update, Delete. This went out of style by 1990. A client application should not be allowed to perform primitive operations like these, for any number of reasons. Performance – we want chunky, not chatty, interprocess protocols. Simplicity and maintainability – the client application should not have to understand the full implementation in the database; this should be hidden from and for it. (This is half the reason we use sprocs in the first place!!) Data quality and integrity – if the network connection is severed midway through a run, while half the creates have run and some of the updates, the data is left in an indeterminate state.
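In other words, the client should make one chunky call that lets a stored procedure own the whole unit of work and its transaction, instead of a string of chatty CRUD round trips. A rough sketch of the calling side, with the proc name and parameters invented for illustration:

```csharp
using System.Data;
using System.Data.SqlClient;

class Signup
{
    // One round trip: the proc owns the transaction and the table layout,
    // so the client never sees the individual inserts and updates.
    static void CreateAccount(string connectionString, string userName, string email)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("dbo.CreateUserAccount", conn))
        {
            cmd.CommandType = CommandType.StoredProcedure;
            cmd.Parameters.AddWithValue("@UserName", userName);
            cmd.Parameters.AddWithValue("@Email", email);

            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}
```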

To stack the deck, Payton Byrd makes some shocking and unforgivable errors

  1. The T-SQL code is designed to be as resource intensive and slow as possible.  Note the cursor use.  The test code comes with this warning: “This may not be the most efficient way to do this, but I don’t specialize in TSQL stored procs”, which should disqualify the results for any thinking person.
  2. One reason the CLR code is assumed to perform faster than the TSQL code is that the .NET procedure uses a StringBuilder.  In a related article, our anti-hero tells us that if the test data set were larger, the difference would be more pronounced, because of this.  A StringBuilder is a hack, a workaround for .NET, because the CLR handles strings in a funny way.  When your code creates and then alters a string, both copies live on in memory, because (due to threading) strings are immutable.  Always creating and repointing references is so tiresome for the host computer that a class was written to give programmers an entirely different way of concatenating strings (a quick illustration follows after this list).  SQL does not suffer this limitation.  One of the key assumptions holding up the conclusion (aka guess) that gives this article its reason to be is fundamentally flawed.
  3. Wild conclusions: “The performance results are absolutely stunning, the CLR stored procedure is 14 times faster than the TSQL stored procedure.”
  4. More profound confusion: “I’m willing to do more testing, but I think this pretty well proves the point that you can get much better performance from the CLR than from TSQL when you step beyond the simplest of stored procedures.”  Apparently the meaning of complex was lost on this writer.  Anybody who’s ever done any database work knows that complex != writing loop code for a set-based language to put commas between values.
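For reference, here is the contrast item 2 describes, in plain C#: naive concatenation allocates a brand-new string on every pass, while StringBuilder appends into one buffer.

```csharp
using System;
using System.Text;

class ConcatDemo
{
    static string Naive(string[] values)
    {
        string result = "";
        foreach (string v in values)
            result += v + ",";   // every += builds a brand-new string; the old one becomes garbage
        return result;
    }

    static string Buffered(string[] values)
    {
        var sb = new StringBuilder();
        foreach (string v in values)
            sb.Append(v).Append(',');   // appends into one growing buffer, no throwaway copies
        return sb.ToString();
    }

    static void Main()
    {
        string[] names = { "Ruth", "Gehrig", "DiMaggio" };
        Console.WriteLine(Naive(names));
        Console.WriteLine(Buffered(names));
    }
}
```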

I’m not writing this to make fun of an incompetent developer. I’m writing to point out flaws of logic that seem all too common these days. The internet seems to have created a world where everyone is expected (even if only by themselves) to be an expert in some area. When a supposed expert doesn’t understand something, the rule in this day and age seems to be that one should make assumptions and run with them. As Tripy would say, do your programming, then, when there’s no time left, do your analysis.

An even better way to cause yourself grief is to test an edge case and then assume the general world is described by some crazy scheme. If I wrote that there’s no reason ever to drive anywhere because I took a train yesterday, arrived at the station a moment before the train departed, and then found I saved 5 minutes by not parking, people would rightly assume I had lost my wits.

The Texas Sharpshooter Fallacy describes exactly this. Payton’s test case addresses a tremendously rare (and poorly coded) situation, one that might never be encountered by a senior database programmer. From this highly specific test he draws sweeping general conclusions. A test was engineered to provide support for a conclusion that had already been drawn. The test tells us nothing, except that we should avoid Payton’s blog if we aim to learn.
