Saturday, January 30, 2010

DSCC Stability and Performance issues with DSEE 6.x and 7

We have too many LDAP instances. That's a different story I won't get into here. However, we use the DSCC as one method to manage our 172 6.3.x and 7 LDAP instances. We run our two DSCC servers in Solaris 10 zones. Our first DSCC is currently on v6.3 and configured with the SMCWebServer. The second is a recently implemented DSCC v7 running on the Java Web Server v7. The registry is replicated between the two DSCC servers.

Once we got to around 100 LDAP instances we started having slow logins, bad performance, and stability issues on the DSCC. Performance was so bad on the v6 DSCC that we would have to restart the smcwebserver before each login; otherwise the logins would fail. Our implementation of the second DSCC, v7 on the Java Web Server, was based on a hope that there would be performance and stability improvements. A second major reason for moving to the Java Web Server platform and v7 of the DSCC was to move to a more familiar platform and away from requiring sudo access to run the registry or to restart the smcwebserver. There is nothing our Unix admins hate more than having to give us sudo access.

We encountered the same performance and stability problems on the v7 DSCC. However, we had a more familiar platform (which we had control over vs the SMCWebServer) to troubleshoot our problems on and indeed found an easy fix.

It seems that when you log in to the DSCC for the first time after a startup, it tries to connect to every LDAP instance to update its cache. For us, the login normally takes up to 20 seconds. LDAP servers that are behind a firewall seem to take longer to respond and initially appear in the console as "stopped". After a few refreshes their status changes to "Started". Once we were in the console and working on tasks, another symptom was that all Server Groups would disappear.

Typical info and warning messages we see during the login time are:

 warning (18830): NmcAndAmcComputer[ou=Netegrity,dc=company,dc=com,ldapserver2:1389] skipped ldap://ldapserver2:1390 because it's not replicated !

 warning (18830): NmcAndAmcComputer[ou=Netegrity,dc=company,dc=com,ldapserver3:1389] failed to find ldap://ldapserver3:1390 in the topology cache !

 info (18830): for host trying to POST /dscc7/dcc7Module/Login, service-j2ee reports: SuffixReplStateInfoComputer[dc=company,dc=com] ignored invalid RUV found in dc=company,dc=com@ldapserver:1390

The problem, which we did not notice on the SMCWebServer but discovered via the Java Web Server logs, was that the java process was running out of threads as it tried to connect to all 172 LDAP instances. In short, our LWP setting was too low.

Example error messages:

warning ( 2196): CORE3283: stderr: Exception in thread "DSLoader[ldapserver1:1389]" Exception in thread "DSLoader[ldapserver1:1389]" java.lang.OutOfMemoryError: unable to create new native thread

[ERROR] Uncaught application exception com.iplanet.jato.command.CommandException: Handler method "handleRefreshHrefRequest" threw an exception Root cause = [java.lang.OutOfMemoryError: unable to create new native thread]
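A quick way to confirm this symptom is to count how often the thread-exhaustion error shows up in the web server error log. The following is only a minimal sketch: the function name is made up for illustration, and the log path is whatever your own installation uses.

```shell
# Hypothetical helper: count log lines that report thread exhaustion.
# $1 = path to the web server error log (location depends on your install)
count_thread_errors() {
  # grep -c prints the number of matching lines (0 if there are none)
  grep -c 'unable to create new native thread' "$1"
}
```

A non-zero count after a DSCC login is a strong hint that the process hit a thread or LWP ceiling rather than a heap limit.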

Furthermore, by watching the webservd process via prstat, you would be able to see the point where the threads would run out and the java errors would occur.

CPU Process/NLWP
0.0% webservd/1337

Total: 36 processes, 1723 lwps
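If you want to keep an eye on this without staring at prstat, the summary line can be parsed and compared against the zone limit. This is a rough sketch, not something we run in production: both function names and the 80 percent threshold are assumptions chosen for illustration.

```shell
# Hypothetical helper: read prstat output on stdin and print the LWP total
# from a summary line such as "Total: 36 processes, 1723 lwps".
lwp_total() {
  awk '/^Total:/ { for (i = 2; i <= NF; i++) if ($i ~ /^lwps/) print $(i - 1) }'
}

# Hypothetical helper: report whether LWP usage has crossed a threshold.
# $1 = current LWP count, $2 = LWP limit, $3 = warning threshold in percent
lwp_check() {
  used=$1; limit=$2; pct=$3
  if [ $((used * 100)) -ge $((limit * pct)) ]; then
    echo "WARN: $used of $limit LWPs in use"
  else
    echo "OK: $used of $limit LWPs in use"
  fi
}
```

For example, feeding it a single prstat report with something like `lwp_check "$(prstat 1 1 | lwp_total)" 2000 80` would warn once the zone is within 20 percent of a 2000 LWP limit.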

The original LWP setting was 800. It was increased to 2000.

# prctl -P -t privileged -n zone.max-lwps -i zone ldapserver1
zone: 4: ldapserver1
zone.max-lwps privileged 2000 -
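To use that value in a script, the privileged limit can be pulled out of the prctl output shown above. This is a small sketch that assumes the output layout matches what is pasted here; the function name is hypothetical.

```shell
# Hypothetical helper: read "prctl -P" style output on stdin and print
# the privileged value of zone.max-lwps.
max_lwps_privileged() {
  awk '$1 == "zone.max-lwps" && $2 == "privileged" { print $3 }'
}
```

Piping the prctl command above into `max_lwps_privileged` should print just the number, which is handy for comparing against the LWP total reported by prstat.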

This setting required a reboot, but our performance problems have been resolved. Both versions of the DSCC still take about 20 seconds for a login, but are very stable and no longer require constant restarts of either webserver.

An obvious fix to someone more familiar with the SMCWebServer? One of my biggest complaints with the DSEE 6 architecture was its requirement for root access for certain components, such as the DSCC registry and the integration with the SMCWebServer. In my experience, large enterprises tend to separate middleware such as Directory Services, Web Access Management, and Identity Management from Unix Administrators, and requiring root access usually results in compliance hurdles and auditing requirements that shouldn't be necessary. I am very pleased that with DSEE v7, the DSCC can run without issues and without any root privileges.