One of the most prevalent misconceptions about SharePoint is that it can be freely used as a database. While it certainly lacks most of the features a “proper” database would have, it still offers the possibility to organize data in lists (acting as data tables of sorts) and to establish relations between lists via lookup fields. It even has its own query language (CAML). To top it off, SharePoint offers a multitude of enterprise-y features that make it all the more appealing to business-oriented decision makers, promising fairly advanced data-driven applications without the need to involve a developer with his or her preconceptions about technical limitations.
To some extent this way of thinking is justified; after all, SharePoint is designed to be just that kind of business-oriented platform, where non-developers can create tools for working with data, complete with an interface, business logic and even workflows. But as an organization starts to integrate SharePoint as a core part of its business processes, the amount of data can easily skyrocket. And what had been working just fine before suddenly proves to be a liability weighing down the whole system.
The following article is intended to be the first on the subject of SharePoint performance. It focuses on issues related to handling large amounts of data and seeks to provide guidelines on how you might go about looking for a viable solution.
The 5000 Problem
Before we continue, we should make one thing clear: SharePoint is designed to handle large amounts of data. Backed by adequate hardware, it can support tens of thousands of users and millions of documents. The amount of data alone isn’t the real issue here; it’s the approach to handling the data that proves to be the culprit.
It should come as no surprise, really. Handling large amounts of data is always challenging, but in the case of SharePoint it’s even more so. Partly that’s because of the overhead caused by SharePoint features like permission handling (which, by the way, can also greatly contribute to decreased performance). But partly it’s because, physically, all SharePoint content is essentially stored in a single database table.
SharePoint 2010 introduced a limiting feature called List View Threshold. It’s set per Web application and can be accessed through Central Administration. By default it is set to the value of 5000, but it can be changed by a farm administrator. Initially the setting was misunderstood by the SharePoint community, and it was common practice to simply increase it as needed – which was often. The threshold limits the number of items that can be processed by a single read or write operation, including the items rendered by the out-of-the-box List View Web Part. Normally the Web part uses pagination, so no more than a couple of dozen items are displayed (i.e. retrieved from the database) in one page request, even if there are a hundred thousand items on the list. But if a user needs to find specific items on such a list and sets a filter on one of the columns, they’ll receive an error message. In a business environment a situation like that is quickly escalated, and pressure is put on the IT department to solve the issue ASAP. And the quickest solution is to log into Central Administration, increase the threshold’s value, and voilà – problem solved. At least until it arises once again when the new threshold is exceeded.
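For reference, the setting lives on the SPWebApplication object and can be inspected (and, as the “quick fix” above, changed) from the SharePoint Management Shell. A minimal sketch; the Web application URL and the new value are placeholders:

```powershell
# Assumes the SharePoint snap-in is available (SharePoint Management Shell).
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

$wa = Get-SPWebApplication "http://intranet"   # placeholder URL

# The List View Threshold (5000 by default) and the separate
# threshold for auditors and administrators (20000 by default).
$wa.MaxItemsPerThrottledOperation
$wa.MaxItemsPerThrottledOperationOverride

# The quick fix discussed above -- raising the threshold for the
# whole Web application. Use sparingly, for the reasons that follow.
$wa.MaxItemsPerThrottledOperation = 10000
$wa.Update()
```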
What might not be apparent is what exactly the List View Threshold does. It all comes down to how SQL Server works. A list item is essentially a row in the UserData table. (Well, not exactly, but let’s keep it simple for clarity’s sake.) When an item is retrieved or updated, SQL Server locks the relevant record in the table so that there’s no risk of the record being changed mid-query by another request made at the same time. This means that until the query finishes executing, no other query is allowed to do anything with the locked row. Now, if many users work simultaneously on the same set of items, even doing read operations, it may still have a negative impact on the performance of their part of the system. Still, other users working elsewhere won’t really be impacted in any way.
The problem starts when the number of items (rows) processed by one query exceeds said 5000. What SQL Server does in that situation is escalate the lock to the entire table instead of the specific rows. Now, remember what we’ve said about the whole content being stored in a single table? This means that when the entire table is locked by a query, it suddenly impacts the whole system. So even if you have only one large list, used by a single department in the organization, with the List View Threshold set too high it will become a burden for every other user, even those who don’t have access to said list, and applications that were working just fine yesterday suddenly start having problems.
In other words, simply increasing List View Threshold is hardly a good solution. Granted, it may be the only solution, at least temporarily, but implementing it doesn’t absolve SharePoint personnel from seeking better ones.
How to Deal With It?
So what would those better solutions be? Frankly speaking, there might not be any perfect way to both incorporate all user requirements and allow SharePoint to run efficiently across the board. When working on large data stores, it’s often about tradeoffs and compromises, which is hardly welcome on either side of the development process.
1. Data Views and Queries
If you expect to have large amounts of items on a specific list, you have to make sure you understand how users will work with the data. The idea is to avoid universal views containing all list items that users can then filter and sort by whatever column they need at the moment. While that approach is perfectly viable with smaller lists (i.e. containing fewer than 5000 items), with large lists you’ll be facing a conundrum: increase the List View Threshold, or allow users to receive error messages? With neither option too appealing, from a technical standpoint the latter would be much preferred. Since the system should serve business needs, however, it’s usually not possible.
Often, though, users do not really need the ability to sort or filter the whole list by any given column. Usually they will only use several specific columns in their daily work, with others being simply a useful addition from time to time.
One way to deal with the limit is to formulate your CAML queries the “proper” way: just make sure any given query touches fewer than 5000 items. Alas, it’s not always simple, or even achievable. Say users need to filter their SharePoint list-stored invoices by a Due Date. If they are only interested in a specific date, there should be no problem, as it’s very doubtful that more than 5000 invoices would be due on a single day. But there’s no guarantee your view will work if users want all invoices where Due Date is equal to or earlier than the specified date. Potentially, users can query for invoices from the last 5 years, which would obviously be blocked by the threshold.
Here we go back to understanding the core of users’ needs. Why would they want to have such an extensive filter? One thing that comes to mind is a report. Reports, however, should not be made using the same features employed for day-to-day data management. A dedicated feature should be implemented to carry the burden of preparing reports. So if users request filters that would potentially return thousands upon thousands of items, most likely what they really need is a report. In this case, increasing List View Threshold should not be considered to accommodate the requirement.
There is, however, another case that, while perfectly reasonable, might indeed become a problem. Namely: returning items where the value of a specific field lies in a given range. To stay with the Due Date example, users might require the ability to seek invoices of a given type (Invoice Type column) with Due Date between specified dates. Even if the query is expected to return only 10 or 20 items, it may still be blocked by the threshold.
This case requires a bit of an explanation. First of all, all columns you want to allow users to filter the list by should be indexed. This is done via list settings, and it’s best done before the list becomes crowded with data (building an index for a huge list can take a while). Indexing allows SQL Server to search items via an index instead of performing a full scan of the UserData table (which essentially entails looking at every single record therein to determine whether it fulfills the search criteria or not). Speed is one benefit of the approach, but the 5000 limit is also avoided, since it’s the far smaller index that’s being queried instead of the much larger table. Obviously, if a list column simply has a great number of values, the index can itself become locked by SQL Server, with the same results we’d have if we were using no index at all. Remember, throttling occurs not because of how many items a query would return, but because of how many records SQL Server needs to look at in order to determine which ones should be returned.
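Indexing a column can also be done programmatically through the server object model (the same thing the Indexed Columns page in list settings does). A sketch, assuming a list named “Invoices” and a “DueDate” column, both placeholder names:

```csharp
// Sketch: mark a list column as indexed via the server object model.
// The URL and list/field names are placeholders.
using (SPSite site = new SPSite("http://intranet"))
using (SPWeb web = site.OpenWeb())
{
    SPList list = web.Lists["Invoices"];
    SPField field = list.Fields["DueDate"];
    if (!field.Indexed)
    {
        field.Indexed = true;
        field.Update();   // builds the index; can take a while on a large list
    }
}
```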
But the index is only one part of the solution. The other is the way your CAML query is built. Unfortunately, there is no way to set up a query over a range of dates (or numbers) other than to use the And operator. Let’s look at how our query might look:
<Where>
  <And>
    <Eq>
      <FieldRef Name='InvoiceType' />
      <Value Type='Text'>Final Invoice</Value>
    </Eq>
    <And>
      <Geq>
        <FieldRef Name='DueDate' />
        <Value Type='DateTime' IncludeTimeValue='FALSE'>2015-06-01T12:00:00Z</Value>
      </Geq>
      <Leq>
        <FieldRef Name='DueDate' />
        <Value Type='DateTime' IncludeTimeValue='FALSE'>2015-06-15T12:00:00Z</Value>
      </Leq>
    </And>
  </And>
</Where>
Now, we can have the following three situations:
- none of the fields we query by are indexed,
- only one field is indexed,
- both fields are indexed.
Chances are that in all three cases the query would be throttled. While that would certainly be understandable in the first two situations, the last one, where both of the fields used in the query are indexed, can easily be misunderstood.
By using the And operator we are not really creating a single condition. What happens under the hood is that SQL Server looks for rows fulfilling one of the criteria (InvoiceType=”Final Invoice” in our example), then the result set is searched for rows fulfilling the next criterion (DueDate >= 2015-06-01), then the next one (DueDate <= 2015-06-15), and so on. So any number of distinct indices might not prove useful if the number of rows returned by the first condition is too large. What should help, however, is creating compound indices.
Sadly, there’s a catch. SharePoint doesn’t allow creating compound indices with just any type of fields, and neither Text nor DateTime is among those allowed. So even though we search by those two fields, we can’t create an index that would indeed help with throttling. And even if we use fields that allow us to create a compound index (a User field, for instance), we can only select one other field to be included in the index. So if we need to query by three or more fields, we can’t be 100% sure the query will work.
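Where the field types do allow it, a compound index is created through the SPList.FieldIndexes collection. A sketch, using a hypothetical User field “AssignedTo” as the primary field (since, as noted above, Text and DateTime don’t qualify):

```csharp
// Sketch: create a compound index on a list. "Invoices", "AssignedTo"
// and "DueDate" are placeholder names; "web" is an already-opened SPWeb.
SPList list = web.Lists["Invoices"];
SPField primary = list.Fields["AssignedTo"];   // a User field; eligible as primary
SPField secondary = list.Fields["DueDate"];    // the second (and only other) field
list.FieldIndexes.Add(primary, secondary);
list.Update();
```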
What to do in such a situation? Frankly, there is no good solution. You can try changing the order of the conditions within the query to increase the chance of the first condition returning a manageable number of results. For instance, in our example above we first look for invoices of the “Final Invoice” type, which is most likely a bad idea; presumably there would be tons of invoices of that type. Instead, we could try to move the Due Date condition up and make it first. A good idea would be to talk to end users about how they intend to do their searching. Chances are they’ll mostly be interested in invoices with Due Date closer to the present date. In that case it would make sense to make the Geq clause the first one, as there is less chance of a large number of invoices with Due Date in the future than in the past. So our final query could look like this:
<Where>
  <And>
    <And>
      <Geq>
        <FieldRef Name='DueDate' />
        <Value Type='DateTime' IncludeTimeValue='FALSE'>2015-06-01T12:00:00Z</Value>
      </Geq>
      <Leq>
        <FieldRef Name='DueDate' />
        <Value Type='DateTime' IncludeTimeValue='FALSE'>2015-06-15T12:00:00Z</Value>
      </Leq>
    </And>
    <Eq>
      <FieldRef Name='InvoiceType' />
      <Value Type='Text'>Final Invoice</Value>
    </Eq>
  </And>
</Where>
Although there is no guarantee that the query will work with just any search conditions (truthfully, the guarantee is pretty much that it won’t), we can minimize the risk and try to make it run well in most day-to-day situations.
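From code, such a query is run through the SPQuery class. A sketch, using the names from the example above; the RowLimit value is an arbitrary page size added on top of the filter:

```csharp
// Sketch: execute the reordered CAML query against an already-obtained
// SPList ("list"). Names and values are the ones used in the article.
SPQuery query = new SPQuery();
query.Query =
    "<Where><And><And>" +
    "<Geq><FieldRef Name='DueDate' /><Value Type='DateTime' " +
    "IncludeTimeValue='FALSE'>2015-06-01T12:00:00Z</Value></Geq>" +
    "<Leq><FieldRef Name='DueDate' /><Value Type='DateTime' " +
    "IncludeTimeValue='FALSE'>2015-06-15T12:00:00Z</Value></Leq>" +
    "</And><Eq><FieldRef Name='InvoiceType' />" +
    "<Value Type='Text'>Final Invoice</Value></Eq></And></Where>";
query.RowLimit = 100;   // server-side paging keeps each request small
SPListItemCollection items = list.GetItems(query);
```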
If we really need to run a more extensive search that we can’t fit into a working CAML query, the alternative solution would be to use search.
2. Search with Search
SharePoint is designed and built around a search engine. One can argue it’s one of its core features, which, curiously enough, is often overlooked and underutilized. One reason may be that it takes some time for new data to appear in search results; another may be the way the results are presented. Experience shows business users often prefer a tabular, datasheet-like view (provided by a standard list) to a list of boxes with snippets that don’t provide much of the needed information.
Yet search is the right tool to use when extracting specific data from the entire store. It builds and maintains its own index of data stored within SharePoint, which both allows for more efficient querying and doesn’t require much involvement of the content database. Even better, it can be set up on a dedicated server. This both mitigates problems with query throttling and makes the system more responsive. The downside, though (apart from the presentation of results, which admittedly can be quite extensively customized), is that maintaining the search index does require the search service to crawl the content database, which in and of itself is a rather resource-heavy operation. SharePoint 2013 introduced the mechanism of continuous crawl, designed to shorten the time from when data is entered to when it’s available in search results, but it still is not immediate. While the crawl interval can be decreased to shorten the time before new data is included in search results, setting it too low imposes additional load on the overall SharePoint farm, which can be rather counterproductive if we’re talking about performance.
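For completeness, here is what querying the search index instead of the list might look like with the SharePoint 2013 KeywordQuery API. This is a sketch only: the managed property name “DueDateOWSDATE” is an assumption, standing in for whatever managed property your DueDate column has been mapped to in the Search Service Application.

```csharp
// Sketch: query the search index rather than the list itself.
// Requires references to Microsoft.Office.Server.Search;
// the URL and managed property name are placeholders.
using (SPSite site = new SPSite("http://intranet"))
{
    var query = new KeywordQuery(site)
    {
        // KQL property restriction; results come from the search
        // index, so the List View Threshold does not apply.
        QueryText = "ContentType:Invoice DueDateOWSDATE>=2015-06-01"
    };
    var executor = new SearchExecutor();
    ResultTableCollection results = executor.ExecuteQuery(query);
}
```

Keep in mind the earlier caveat: results are only as fresh as the last crawl.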
3. Use Client Applications
Since SharePoint is part of the Microsoft Office suite, it integrates neatly with various other Office applications, of which Access and Excel especially are worth consideration when it comes to working with large lists. If you rely on out-of-the-box field types and don’t have any important custom actions or similar features deployed on your list, it may indeed be a good idea to consider employing external applications as the user interface. The major benefit of this approach is that all the sorting, grouping and filtering operations are performed client-side, without utilizing server resources, and more importantly without involving SQL Server in the process. While the initial data retrieval can be quite costly, it’s still only a single operation, with further data updates retrieving only changed items instead of the whole list. An additional benefit is that these applications offer much greater capabilities than barebones SharePoint does, with more advanced filtering, custom queries, visualizations, etc. Since virtually no operations are performed on the server (they occur basically only when adding, updating or removing items), the server is relieved of much of the work, and no List View Threshold applies.
4. Rewrite Your Code
If you use CAML queries from within your code-behind, there is a way to override the threshold, which is useful if you simply must have a query that does not conform to the limit. You will need custom code for that, though, so you won’t be able to make, say, a list view work with the override.
The SPQuery class has a QueryThrottleMode property which you can set to the value of SPQueryThrottleOption.Override. It will cause the query to conform to the List View Threshold for Auditors and Administrators instead of the standard List View Threshold. One other thing you’ll need to do is to enclose the SPList.GetItems method call in SPSecurity.RunWithElevatedPrivileges. Note that this will not cause your query to execute without any limitations. It will simply cause it to conform to a different setting, which still might be too low for your needs (by default the List View Threshold for Auditors and Administrators is set to 20000).
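Putting the pieces described above together, the override might be sketched as follows; the URL, list name and query are placeholders:

```csharp
// Sketch: run a query under the List View Threshold for Auditors and
// Administrators (20000 by default) rather than the standard threshold.
SPSecurity.RunWithElevatedPrivileges(() =>
{
    // Objects must be re-created inside the elevated delegate.
    using (SPSite site = new SPSite("http://intranet"))
    using (SPWeb web = site.OpenWeb())
    {
        SPList list = web.Lists["Invoices"];
        SPQuery query = new SPQuery
        {
            Query = "<Where><Eq><FieldRef Name='InvoiceType' />" +
                    "<Value Type='Text'>Final Invoice</Value></Eq></Where>",
            // Conform to the auditors/administrators threshold instead.
            QueryThrottleMode = SPQueryThrottleOption.Override
        };
        SPListItemCollection items = list.GetItems(query);
    }
});
```

Note that this is still a throttled query; it merely answers to the higher of the two limits.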
5. Move the Cumbersome Data Elsewhere
If you cannot implement any of the solutions discussed above, you can try ejecting the offending lists to a different place, where they’ll still work slowly, but at least other users won’t be bothered by them.
The most obvious location would be a different Web application, since this is the scope of the List View Threshold setting. You can designate a Web application to host your large data stores, increase the threshold value and take the load off your main Web application.
Having whole Web application(s) dedicated to handling large data can create a whole host of new problems, however. Think of permissions, aggregating data across different Web applications, search indexing and results, and dozens of other potential issues. But there is a somewhat more lightweight alternative.
SharePoint allows you to store the data of each site collection in a separate content database. It’s not done by default, as content databases are normally created only for Web applications, but you can do it manually. The idea would be to move all your large lists to a dedicated site collection and provide that site collection with its own content database. You’d still need to keep the high threshold for the entire Web application, which is undeniably a downside, but on the other hand the offending lists would be stored in a separate database (so a separate UserData table would get locked instead), and you’d still be within the scope of a single Web application. This, in turn, would make your life easier if you’d like to set up the search service, for instance.
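The database-per-site-collection setup is done from the Management Shell. A sketch, with placeholder names and URLs:

```powershell
# Sketch: create a dedicated content database in the Web application,
# then move the site collection holding the large lists into it.
# Database name and URLs are placeholders.
New-SPContentDatabase -Name "WSS_Content_LargeLists" `
    -WebApplication "http://intranet"

Move-SPSite "http://intranet/sites/largelists" `
    -DestinationDatabase "WSS_Content_LargeLists"

# Move-SPSite requires an IIS reset afterwards for the move to take effect.
iisreset
```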
These are definitely suboptimal solutions; still, a business environment often creates the need for immediate rather than optimal solutions, and moving the data elsewhere can at least buy you some time to come up with something more proper.
SharePoint performance is an extensive subject, in theory rather well understood, but in practice often not considered well enough. While SharePoint has an enterprise-y feeling to it, that doesn’t mean one can do anything in any way and everything will simply work. It requires a conscious approach to implementing users’ needs. Working with large data repositories is one of the common scenarios in an organization, and a rather problematic one from the perspective of a SharePoint developer. Even using only standard, out-of-the-box SharePoint functionality doesn’t guarantee success. Still, there are some things that can be done even when your farm is already experiencing difficulties, and what I hoped to achieve with this article is to provide an overview of those things. Increasing the List View Threshold is not a good idea. It can be useful as short-term relief, but it really shouldn’t be treated as a remedy, especially since you’ll be condemned to increase it time after time as the amount of data stored in SharePoint grows and old threshold values are surpassed. That’s a vicious circle that impacts SharePoint more and more with each and every turn.
Obviously, there’s way more to performance than a single configuration setting. We’ll be discussing more aspects of the subject in the future. Indeed, there are many different ways one can ask for trouble while only seeking to fulfill various user needs. Still, SharePoint offers so much that those difficulties shouldn’t overshadow what can be gained by mastering its capabilities.