High Throughput Computing with HTCondor

The Condor system is being developed at the University of Wisconsin for the management of distributed resources, computational and otherwise. We briefly review the SAMGrid architecture and its interaction with Condor, which was presented earlier. We then present our experiences using the system in production, which have two distinct aspects. At the global level, we deployed Condor-G, the Grid-extended Condor, for the resource brokering and global scheduling of our jobs. At the heart of the system is Condor’s Matchmaking Service. More recently, at the computing element level, we have been benefiting from the large computing cluster on the University of Wisconsin campus. The architecture of the computing facility and the philosophy of Condor’s resource management have prompted us to improve the application infrastructure for D0 and CDF, for example by removing the dependence on a shared file system and on dedicated resources. As a result, we have increased productivity and made our applications more portable and Grid-ready. We include some statistics gathered from our experience.

The Factory configuration file contains both global arguments and configuration specific to each entry point. At least one entry point must be specified in the configuration file. The tags of the XML configuration file are described below. Each is given one of the following designations:

You must change or examine this in order for the Factory to function correctly.

The installer provides a good default, but you should examine this attribute to make sure it is correct for your installation.

The installer-provided default is likely correct for your installation. Change this only if your particular configuration requires special treatment or fine-tuning.

Global arguments

Global arguments are common to all entry points but can be overridden by individual entry point configuration. The main tag of the Factory configuration.

To restrict the display to jobs of interest, a list of zero or more restriction options may be supplied. Each restriction may be one of: If no owner restrictions are present in the list, the job matches the restriction list if it matches at least one restriction in the list. If owner restrictions are present, the job matches the list if it matches one of the owner restrictions and at least one non-owner restriction. The attributes of the job ClassAd may be displayed by means of the -format option, which displays attributes with a printf(3) format.

Multiple -format options may be specified in the option list to display several attributes of the job.
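The restriction rule and the printf-style display described above can be sketched in a few lines of Python. This is only an illustrative model, not the tool's implementation: the function names `job_matches` and `format_job`, the dictionary representation of a job ClassAd, and the use of predicates for non-owner restrictions are all hypothetical. Two behaviours are assumptions rather than facts taken from the text: an empty group of restrictions filters nothing, and an attribute missing from the job prints nothing.

```python
def job_matches(job, owner_restrictions, other_restrictions):
    """Toy model of the restriction-list rule (hypothetical helper).

    job: dict of ClassAd-like attributes.
    owner_restrictions: list of owner names.
    other_restrictions: list of predicates that take the job dict.
    """
    # Assumption: an empty group of restrictions does not filter anything.
    owner_ok = (not owner_restrictions
                or job.get("Owner") in owner_restrictions)
    # The job must satisfy at least one of the non-owner restrictions.
    other_ok = (not other_restrictions
                or any(pred(job) for pred in other_restrictions))
    return owner_ok and other_ok


def format_job(job, format_options):
    """Apply several (printf-style format, attribute name) pairs in order."""
    pieces = []
    for fmt, attr in format_options:
        # Assumption: an attribute that the job lacks produces no output.
        if attr in job:
            pieces.append(fmt % job[attr])
    return "".join(pieces)
```

For a job with `ClusterId` 12 and `ProcId` 0, calling `format_job(job, [("%d.", "ClusterId"), ("%d\n", "ProcId")])` joins the two formatted attributes into a single line, which mirrors the way several -format options are combined.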

Understanding the HTCondor cluster architecture

HTCondor is deployed as a cluster of nodes, with one central manager node that acts as a resource matchmaker, one or more submit hosts where users can submit jobs through a scheduler, and an arbitrary number of compute nodes that retrieve and execute work from the job queues.

[Figure: HTCondor cluster architecture]

The central manager node maintains a database of compute nodes in the cluster along with the system characteristics of those nodes.

When new compute nodes are provisioned, they register with the central manager node and await further instructions. Each submit host runs a task scheduler and provides tools for users to submit jobs to the scheduler for execution. A job might run a program once with a particular set of input values, or run a program a hundred times with differing input values.

Jobs are composed of tasks, and each task is a unit of computation. Users can submit multiple jobs, and a submit host can support multiple users. The jobs are queued by the scheduler software that runs on the submit host. The scheduler can queue jobs based on workload, user, or priority. When jobs are queued, the scheduler requests resources from the central manager and submits resource requirements based on the jobs in its queues. The central manager matches the scheduler’s request to the resources the central manager has available, and then returns the set of compute nodes that best satisfy the request.
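The matching step described above can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not HTCondor's actual matchmaking algorithm: the `Node` and `Request` classes, the `match` function, and the ranking rule (prefer the node with the least leftover capacity) are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A compute node as recorded by the central manager (hypothetical)."""
    name: str
    cpus: int
    memory_mb: int
    busy: bool = False


@dataclass
class Request:
    """Resource requirements sent by a scheduler (hypothetical)."""
    cpus: int
    memory_mb: int
    count: int  # number of compute nodes wanted


def match(nodes, request):
    """Return up to request.count idle nodes that satisfy the request."""
    candidates = [
        n for n in nodes
        if not n.busy
        and n.cpus >= request.cpus
        and n.memory_mb >= request.memory_mb
    ]
    # Assumed ranking: the tightest fit first, so larger nodes stay
    # available for larger requests.
    candidates.sort(
        key=lambda n: (n.cpus - request.cpus, n.memory_mb - request.memory_mb)
    )
    return candidates[:request.count]
```

A request that no idle node can satisfy returns an empty list, which in this sketch corresponds to the jobs simply remaining in the scheduler's queue.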

There are two different references we use for the Condor router. They are as follows: I am not sure which reference you meant, but these are them! Circuits and Systems, The routing algorithm is based on congestion costs assigned on vertices:

The idea is to do as little scheduling in advance as possible and to feed jobs to the sites only as they consume them. Meanwhile, the jobs waiting to be routed are ordinary vanilla universe jobs, so they may run in the local Condor pool or in other pools via flocking.

Except for having your excess jobs queue up in the vanilla universe job queue, you can get a similar effect by submitting all of your jobs as grid universe jobs and using Condor-G matchmaking.
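The "feed jobs only as sites consume them" idea can be sketched as a single routing pass in Python. This is a simplified illustration, not the router's real logic: the function `route_once`, the per-site `idle` counter, and the `max_idle_per_site` threshold are hypothetical names introduced for the example.

```python
from collections import deque


def route_once(vanilla_queue, sites, max_idle_per_site=2):
    """Run one routing pass (hypothetical sketch).

    vanilla_queue: deque of jobs still waiting as vanilla universe jobs.
    sites: list of dicts with "name" and "idle" (routed jobs not yet started).
    Returns the list of (job, site name) pairs routed in this pass.
    """
    routed = []
    for site in sites:
        # Top up each site only to a small number of idle jobs, so that
        # almost no scheduling decision is made in advance.
        while site["idle"] < max_idle_per_site and vanilla_queue:
            job = vanilla_queue.popleft()
            site["idle"] += 1
            routed.append((job, site["name"]))
    return routed
```

Jobs left in `vanilla_queue` after a pass remain ordinary vanilla universe jobs, so in this model they are still free to run locally or through flocking. When a site starts a job, its `idle` count would drop and the next pass would send it another one.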

