Because these solutions rely on SharePoint's Alerts mechanism, it is vitally important that the Alert infrastructure functions properly. Unfortunately, this is rarely the case in our Farm.
Our topology is a Medium Server Farm with 2 WFEs, 1 App Server, and a SQL cluster for the databases.
The issue that I have been dealing with is that, for an as-yet-undetermined reason, the server that holds the TimerLock for a particular Site Collection's content database stops being able to send emails.
I am able to identify the server which has the TimerLock by running the following SQL command against the content database for the site collection containing the list on which alerts are set:
USE content_database
SELECT * FROM timerlock WITH (nolock)
I am working with Immediate Alerts, so to see whether they get queued up, I query the eventcache table, again in the content database for the site collection containing the list on which alerts are set:
USE content_database
SELECT * FROM eventcache WITH (nolock) WHERE EventData is not null
My observations show that when the Timer Job runs (owstimer.exe), these events are processed and subsequently removed from the eventcache table, leading me to believe that everything is working fine.
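To watch the queue fill and drain around a Timer Job run, the number of pending events can be polled from a command prompt instead of re-running the query by hand. This is only a minimal sketch: it assumes the sqlcmd utility (SQL Server 2005 or later client tools) is installed and that Windows authentication works against the database server. SQLSERVER and content_database are placeholders, and the PendingEvents column alias is just for illustration.

```batch
@echo off
rem SQLSERVER and content_database are placeholders; substitute your own.
rem -E uses the current Windows account; -Q runs the query and then exits.
sqlcmd -S SQLSERVER -d content_database -E -Q "SELECT COUNT(*) AS PendingEvents FROM eventcache WITH (nolock) WHERE EventData IS NOT NULL"
```

Running it once just after triggering an alert and again after the Timer Job has fired should show the count rise and then fall back.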
I've had my network security guy take a look at the firewall traffic, and he can see that traffic from the SharePoint server with the TimerLock to the SMTP server makes it through the firewall without issue. However, no email is ever received for the alerts. It should be noted that while this server is failing to send email, the other WFE may hold the TimerLock for a different Site Collection's database, and those Alerts send email just fine!
Although everything appears to be working as it should, I am intermittently left without alert emails.
The one thing that seems to work is to reset the local cache on the server that has the TimerLock. This is accomplished by performing the following actions:
On the server with the TimerLock:
- Stop the Windows SharePoint Services Timer service
- Navigate to "C:\Documents and Settings\All Users\Application Data\Microsoft\SharePoint\Config\" and delete all the .xml files. DO NOT DELETE THE cache.ini FILE!
- Open the cache.ini file in Notepad, change the number value to 1, then save the file.
- Start the Windows SharePoint Services Timer service.
UPDATE:
I am happy to report that since I added a scheduled job to run the following batch script every morning at 4:45 am, Alerts have been running without fail. In order for this to work, you need to make a copy of the cache.ini file with the number value set to 1 and place it in the C:\Documents and Settings\All Users\Application Data\Microsoft\SharePoint\Config\ directory. The GUID-named folder in the script is specific to my Farm, so substitute your own.
net stop "Windows SharePoint Services Timer"
del /F /Q "C:\Documents and Settings\All Users\Application Data\Microsoft\SharePoint\Config\a58ec05c-344f-487c-a8e6-cf0365b86458\*.*"
xcopy "C:\Documents and Settings\All Users\Application Data\Microsoft\SharePoint\Config\cache.ini" "C:\Documents and Settings\All Users\Application Data\Microsoft\SharePoint\Config\a58ec05c-344f-487c-a8e6-cf0365b86458\*.ini" /Y
net start "Windows SharePoint Services Timer"
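For anyone who wants to set up a similar schedule, here is a minimal sketch of creating a daily 4:45 am job from the command line with schtasks. The task name and the script path C:\Scripts\reset-sp-cache.bat are hypothetical placeholders (it assumes the script above has been saved there), and running the task as SYSTEM is an assumption rather than a requirement.

```batch
rem Hypothetical task name and script path; adjust both to your environment.
rem /st takes HH:MM:SS on Windows Server 2003; newer Windows versions expect HH:mm (04:45).
schtasks /create /tn "Reset SharePoint Config Cache" /tr "C:\Scripts\reset-sp-cache.bat" /sc daily /st 04:45:00 /ru SYSTEM
```

Once the task exists, schtasks /run /tn "Reset SharePoint Config Cache" starts it immediately, which is a convenient way to confirm the script works before relying on the schedule.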
Reference
More information about clearing the file cache is available from http://support.microsoft.com/kb/939308