Abstractions

--------------------
Client files
two main files:

prefs.xml
    user preferences.
    includes list of projects; for each:
        master URL
        authenticator
        project-specific prefs
        resource share
    prefs mod time

client_state.xml
    hostid
    per-project info
        list of sched servers for project
        project name
        hostid
        next_request_time
        rpc_seqno (specific to this host)
        work info
    files, WUs, results etc.

NOTES
- On startup, if there's no prefs.xml, the client prompts
  for a master URL and authenticator,
  and creates an initial prefs.xml with a zero mod time
  (so that any web-created prefs file will override)
- We need to safeguard against a buggy scheduling server
  sending back an incomplete or empty prefs file.
  Suggestions:
    1) verify that at least the responding project is present in the prefs
       (or that the prefs contain at least 1 project);
    2) back up the old prefs file (prefs.xml.date)
- prefs.xml has priority over client_state.xml
  If there's a project in prefs with no counterpart in client_state,
  a new entry in client_state is created.
  Entries in client_state absent from prefs are deleted.

- to "clone" an installation on a new computer,
  just need to copy the core client (or run the installer)
  then copy the account.xml file.

- a scheduler request can specify that no client_state.xml
  was found, so a new host record should be created.
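The prefs safeguard and the "prefs.xml has priority" rule above can be
sketched as follows. This is only an illustration: the function names,
the use of master URLs as keys, and the default fields of a new
client_state entry are assumptions, not part of these notes.

```python
def prefs_look_valid(new_prefs_projects, responding_project_url):
    """Suggestion 1: reject prefs that are empty or omit the responding project."""
    return bool(new_prefs_projects) and responding_project_url in new_prefs_projects


def reconcile(prefs_projects, client_state_projects):
    """Make client_state agree with prefs (prefs.xml has priority).

    prefs_projects: iterable of master URLs listed in prefs.xml
    client_state_projects: dict mapping master URL -> per-project state dict
    """
    wanted = set(prefs_projects)
    # Entries in client_state absent from prefs are deleted.
    reconciled = {url: state for url, state in client_state_projects.items()
                  if url in wanted}
    # Projects in prefs with no counterpart get a new (hypothetical) empty entry.
    for url in wanted:
        reconciled.setdefault(url, {"hostid": 0, "rpc_seqno": 0, "next_request_time": 0})
    return reconciled
```

A caller would presumably run prefs_look_valid() before accepting a prefs
file from a scheduling server, and reconcile() after loading both files.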

--------------------
When does client contact scheduling server?
Each result has a max notification delay,
so when a client completes it there's a deadline for notification.

Contact a scheduling server if:
- you're below the low-water mark in work for that project,
  or you have a result past its deadline
- AND there's no delay in effect for that project.
    A delay may be explicitly returned by the scheduling server,
    or may be because of exponential backoff after failed attempts.
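The contact rule above reduces to a small predicate. A minimal sketch,
assuming a hypothetical ProjectState record; the field names are
illustrative, and both kinds of delay (server-requested and backoff) are
assumed to be folded into one next_request_time value.

```python
from dataclasses import dataclass


@dataclass
class ProjectState:
    estimated_work: float     # seconds of work on hand for this project
    low_water_mark: float     # seconds; fetch more work when below this
    has_overdue_result: bool  # a result is past its deadline
    next_request_time: float  # earliest allowed contact time (epoch seconds)


def should_contact(p: ProjectState, now: float) -> bool:
    """True if the project needs a contact AND no delay is in effect."""
    needs_contact = p.estimated_work < p.low_water_mark or p.has_overdue_result
    delay_in_effect = now < p.next_request_time
    return needs_contact and not delay_in_effect
```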
--------------------
Given that we can estimate the time it will take to get back
a result from a given host, it might be possible to assign
deadlines to results, and only send them to hosts that are fast enough
--------------------
Client logging
write events to log file:
    start/stop client
    start/finish file xfer
    start/finish application execution
    start/finish scheduling server call
    error messages

logging flag is part of preferences
--------------------
file xfer commands
    implemented as WU/result pairs whose app is "file_xfer".
    Can have just one input file, one output.
    Application servers can leave these in a "message" directory,
    where the scheduling server can find them and give them to
    the client the next time it contacts the server.
--------------------
result states in client
    don't have files yet
    have files, not started
    have files, started
    completed, sending output files
    output files sent
    output files sent, some sticky files deleted
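The client-side result states listed above could be represented as an
enumeration. This is a sketch only: the member names are made up here,
and the assumption that a result moves through the states strictly in
the listed order is not stated in these notes.

```python
from enum import IntEnum


class ResultState(IntEnum):
    NO_FILES = 0                    # don't have files yet
    FILES_NOT_STARTED = 1           # have files, not started
    FILES_STARTED = 2               # have files, started
    SENDING_OUTPUT = 3              # completed, sending output files
    OUTPUT_SENT = 4                 # output files sent
    OUTPUT_SENT_STICKY_DELETED = 5  # output files sent, some sticky files deleted


def advance(state):
    """Move to the next state in list order; the last state is terminal."""
    return ResultState(min(state + 1, ResultState.OUTPUT_SENT_STICKY_DELETED))
```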

--------------------
result attributes in DB, sched server
    state:
        unsent
        sent, in progress
        timed out
    file state
        all output files are openly available
        (i.e. have been uploaded)

WU attributes in DB, sched server
    input file state (set by app server)
        all input files are available
        not all input files available

--------------------
Client logic
    ["network xfer" object encapsulates a set of file xfers in progress]
    ["processor" object: one for each CPU]

    read config file
    loop
        check user activity - turn off computations if needed
        start a computation if possible
            all necessary files present,
            and workunit not done or in progress.
        check processes (fail, done)
        start new network xfers if possible
        xfer 16KB if possible (use select)
            if xfer complete, update state
        if estimated work below low-water mark
            while estimated work below high-water mark
                pick project with work due, OK dont_contact_until
                contact a control server; request (high-water mark - current) work
                if can't get connection, update dont_contact_until
            end
        end
    end
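The work-fetch part of the loop above (the low-water / high-water test)
might look like the sketch below. Everything named here is an
assumption: projects are plain dicts, request_work is a caller-supplied
stand-in for the RPC, "pick project" simply takes the first eligible
one, and a fixed retry_delay replaces any real backoff policy.

```python
def fetch_work(projects, estimated_work, low_water, high_water, now,
               request_work, retry_delay=60.0):
    """Top up work once the estimate drops below the low-water mark.

    projects: list of dicts with 'name' and 'dont_contact_until' (epoch seconds)
    request_work(project, seconds): returns seconds of work obtained,
        or None if no connection could be made
    Returns the new work estimate in seconds.
    """
    if estimated_work >= low_water:
        return estimated_work
    while estimated_work < high_water:
        eligible = [p for p in projects if p["dont_contact_until"] <= now]
        if not eligible:
            break  # every project is in a delay period
        project = eligible[0]
        got = request_work(project, high_water - estimated_work)
        if not got:
            # No connection (None) or no work (0): defer this project.
            project["dont_contact_until"] = now + retry_delay
        else:
            estimated_work += got
    return estimated_work
```

Deferring a project that returns no work is an addition made here so the
loop always terminates; the notes only mention the failed-connection case.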
--------------------
Application logic
--------------------
Control RPC protocol
--------------------
Web site functions
--------------------
Startup scenarios

- How a user initially signs up:
Visit the project's URL.
Create an account:
    enter email address
    wait for password to arrive in email.
    download installer
    installer installs agent, initial config file
    run agent; type in password.

- How a user adds a project
Same as above, but don't download agent.
Go to "home" web site and add project.

- How a user removes a project
Go to "home" web site and remove project

------------------------------
Versions

Core client:

When and how does a scheduler tell a core agent
that a newer version can/should be downloaded?

How is compatibility between application agents
and core agents represented?
--------------------------------------
Distributed storage

Projects can use clients for storage using "sticky" files
(which are either sent to clients, or generated by the client).

The core client is free to delete sticky files any time.

Scheduler requests include a list of the sticky files held by the host.
This list is stored in a blob in the host record.

Scheduler replies can include <file_info> tags
instructing the client to download files.
These files need not be associated with applications or workunits.

Scheduler replies can include <file_info> tags
instructing the client to upload files.

--------------------------------
Preferences

CPU usage
    don't run or communicate if on batteries
    don't run or communicate if user is active
    confirm before making network connection
    minimum, maximum work buffer

Disk usage
    use at most X GB
    leave at least X GB free
    leave at least X% free

Projects
    For each project:
        user name
        project's master URL
        email address
        authenticator
        resource %
        show email address on web site?
        accept emails from project?
        project-specific prefs
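The three disk usage preferences above can be combined into one limit.
A minimal sketch, assuming the most restrictive preference wins and that
1 GB = 10^9 bytes; the function and parameter names are hypothetical.

```python
GB = 10**9


def allowed_disk_bytes(total, free, used_by_client,
                       max_used_gb, min_free_gb, min_free_pct):
    """Bytes the client may occupy in total under all three preferences.

    total, free, used_by_client: current disk figures in bytes
    """
    # "use at most X GB"
    limit_abs = max_used_gb * GB
    # "leave at least X GB free": the client's own usage counts as reclaimable
    limit_free_gb = used_by_client + free - min_free_gb * GB
    # "leave at least X% free"
    limit_free_pct = used_by_client + free - total * min_free_pct / 100.0
    return max(0, min(limit_abs, limit_free_gb, limit_free_pct))
```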
------------------------------
retry policies:
general issues:
    when and where to retry?
    when to declare overall failure?
    what to do if overall failure?
    what needs to be saved in state file?

file xfer
    download
        round-robin through URLs with random exponential backoff
        after connection failure or HTTP error.
        2X from 1 minute up to 256 minutes
        Overall failure after 1 week since last successful xfer
            flag result as "file download failed",
            abort other file xfers,
            delete other files.
            write log entry
        State file:
            record time of last successful xfer
    upload
        same as for download?
        Use HTTP features to find file size on server

scheduler RPC
    order projects according to host's "debt" to them.
    Attempt to contact them in this order.
    For each project, try all URLs in sequence with no delay.
    If still need more work after a given RPC,
    keep going to next project.
    If still not enough work after a given "round",
        do exponential backoff
        2X from 1 minute up to 256 minutes
        If reach 256 minutes, show error message to user and write to log
    nothing saved in state file
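One "round" of scheduler RPCs as described above might be sketched like
this. The notes do not say which direction the debt ordering runs;
contacting the project owed the most first is an assumption, and rpc is
a hypothetical caller-supplied function that tries all of a project's
URLs and returns the seconds of work obtained.

```python
def contact_order(debts):
    """debts: dict project name -> host's debt to it; largest debt first."""
    return sorted(debts, key=lambda name: debts[name], reverse=True)


def work_fetch_round(debts, needed, rpc):
    """Contact projects in debt order until enough work has been obtained.

    Returns the seconds of work obtained; if this is still less than
    `needed`, the caller would apply the exponential backoff.
    """
    obtained = 0.0
    for name in contact_order(debts):
        if obtained >= needed:
            break
        obtained += rpc(name, needed - obtained)
    return obtained
```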
------------------
Core/App connection
    two unidirectional message streams.
    files "core_to_app.xml" and "app_to_core.xml".

    core->app:
        initially:
            requested frequency of app->core messages
            app preferences
            name of graphics shared-mem segment
            recommended graphics parameters
                frame rate
                size
            recommended checkpoint period
            whether to do graphics
        thereafter:
            recommended graphics params
    app->core
        percent done
        I just checkpointed
        CPU time so far
-------------------
File upload security
Each project has a "file upload key pair"
Scheduling server has private key;
data servers have public key.
The key pair may be changed periodically;
data servers have to store both old and new keys during transitions.

- in DB, result XML has format
    <result>
        <file_info>
            <max_size>123123</max_size>
        </file_info>
        ...
    </result>

- RPC reply: result XML info has format
    <result>
        <file_info>
            ...
        </file_info>
        ...
        <expiration>...</expiration>        (added by server)
    </result>
    <result_signature>
        <name>foo</name>
        ...         (digital signature of <result> element; added by server)
    </result_signature>

- Client stores:
    for each result (in state file, in memory)
        exact text of <result> tag
        signature

- On file upload, client sends
    <result> element (exact text)
    <result_signature>
        ...
    </result_signature>
    <filename>blah</filename>
    <offset>123</offset>
    <total_size>1234</total_size>
    <data_start/>
    ... data

- file upload handler does:
    parse header (up to <data_start/>)
    validate signature of <result>
    verify that filename is in list of file_infos
    verify that total size is within limit
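The handler steps above can be sketched as below. Several things are
assumed for illustration: each <file_info> carries a <name> element next
to <max_size>, the signature check is a caller-supplied function
(verify_signature, hypothetical) that would use the project's public
key, and the data is taken to be every byte after <data_start/>.

```python
import re

HEADER_END = b"<data_start/>"


def tag(text, name):
    """Return the stripped text of the first <name>...</name>, or None."""
    m = re.search(r"<%s>(.*?)</%s>" % (name, name), text, re.S)
    return m.group(1).strip() if m else None


def handle_upload(request, verify_signature):
    """Validate an upload request; return (filename, offset, data)."""
    # Parse header (up to <data_start/>).
    header_bytes, sep, data = request.partition(HEADER_END)
    if not sep:
        raise ValueError("no <data_start/> marker")
    header = header_bytes.decode("utf-8")

    # Validate the signature over the exact text of <result>.
    m = re.search(r"<result>.*?</result>", header, re.S)
    signature = tag(header, "result_signature")
    if not m or signature is None or not verify_signature(m.group(0), signature):
        raise PermissionError("bad or missing result signature")

    # Collect the size limit of each file named in the signed <result>.
    limits = {}
    for info in re.findall(r"<file_info>(.*?)</file_info>", m.group(0), re.S):
        limits[tag(info, "name")] = int(tag(info, "max_size"))

    filename = tag(header, "filename")
    if filename not in limits:
        raise PermissionError("file not listed in result")
    if int(tag(header, "total_size")) > limits[filename]:
        raise ValueError("file exceeds max_size")
    return filename, int(tag(header, "offset")), data
```

Because the limits are read from the signed <result> text, a client
cannot raise its own max_size without invalidating the signature.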

----------------------------
Project main URL scheme

Each project advertises (and is identified by) a "root URL".
This URL returns a browser-visible "root page" describing the project,
linking to the registration, etc.
It also contains (in elements inside HTML comments)
one or more <scheduler_server_url> elements,
each containing the URL of a scheduling server.

When the core client initially runs, it fetches and parses
the root page, and records the scheduling server URLs.
Whenever it can't contact any scheduling server, it reloads
the root page; the scheduling server URLs may have changed.
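Extracting the scheduling server URLs from a root page could be done
roughly as follows. This is a sketch using regular expressions rather
than a real HTML parser; the function name is hypothetical, and only
elements inside HTML comments are considered, as described above.

```python
import re


def scheduler_urls(root_page_html):
    """Return the <scheduler_server_url> values found inside HTML comments."""
    urls = []
    for comment in re.findall(r"<!--(.*?)-->", root_page_html, re.S):
        urls += re.findall(
            r"<scheduler_server_url>\s*(.*?)\s*</scheduler_server_url>",
            comment, re.S)
    return urls
```

An empty return value would be one more reason for the client to retry
the root page later instead of overwriting its recorded URLs.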