Diffstat (limited to 'doc/manuals/chapters/proxy_cache.adoc')
-rw-r--r-- | doc/manuals/chapters/proxy_cache.adoc | 333 |
1 files changed, 333 insertions, 0 deletions
diff --git a/doc/manuals/chapters/proxy_cache.adoc b/doc/manuals/chapters/proxy_cache.adoc
new file mode 100644
index 0000000..48ea4ce
--- /dev/null
+++ b/doc/manuals/chapters/proxy_cache.adoc
@@ -0,0 +1,333 @@
+== Distributed GSM / GSUP Proxy Cache: Remedy Temporary Link Failure to Home HLR
+
+The aim of the Proxy Cache is to still provide service to roaming subscribers even if the GSUP link to the home HLR is
+temporarily down or unresponsive.
+
+If a subscriber from a remote site is currently roaming at this local site, and the link to the subscriber's home HLR
+has succeeded before, the GSUP proxy cache can try to bridge the time of temporary link failure to that home HLR.
+
+Tasks to take over from an unreachable home HLR:
+
+- Cache and send auth tuples on Send Auth Info Request.
+- Acknowledge periodic Location Updating.
+- ...?
+
+=== Design Considerations
+
+==== Authentication
+
+The most critical role of the home HLR is providing the Authentication and Key Agreement (AKA) tuples. If the home HLR
+is not reachable, the lack of fresh authentication challenges would normally cause the subscriber to be rejected. To
+avoid that, a proxying HLR needs to be able to provide AKA tuples on behalf of the home HLR.
+
+In short, the strategy of the D-GSM proxy cache is:
+
+- Try to keep a certain number of unused full UMTS AKA tuples in the proxy cache at all times.
+- When the MSC requests more tuples, dispense some from the cache, and fill it back up later on, as soon as a good link
+  is available.
+- When the tuple cache in the proxy HLR runs dry, 3G RAN becomes unusable. But 2G RAN may fall back to GSM AKA, if the
+  proxy HLR configuration permits it: resend previously used GSM AKA auth tuples to the MSC, omitting UMTS AKA items
+  from the Send Auth Info Result, to force the MSC to send a GSM AKA challenge on 2G.
+
+The remaining part of this section provides detailed reasoning for this strategy.
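The three-point strategy above can be pictured with a minimal Python sketch. It is an illustration only, not OsmoHLR code: the names `AuthTuple`, `TupleCache`, `dispense` and the `allow_gsm_fallback` switch are hypothetical, and the tuple fields are reduced to the bare minimum.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AuthTuple:
    # GSM AKA items, always present.
    rand: bytes
    sres: bytes
    kc: bytes
    # UMTS AKA items; None when deliberately omitted for a GSM AKA fallback.
    autn: Optional[bytes] = None
    xres: Optional[bytes] = None


@dataclass
class TupleCache:
    unused: List[AuthTuple] = field(default_factory=list)  # fresh, full UMTS AKA tuples
    used: List[AuthTuple] = field(default_factory=list)    # already handed to the MSC
    allow_gsm_fallback: bool = False                        # hypothetical config switch

    def dispense(self, count: int) -> List[AuthTuple]:
        """Return tuples for a Send Auth Info Result; an empty list means reject."""
        if self.unused:
            out, self.unused = self.unused[:count], self.unused[count:]
            self.used.extend(out)
            return out
        if self.allow_gsm_fallback and self.used:
            # Cache is dry: resend spent tuples without the UMTS AKA items,
            # so that the MSC can only issue a GSM AKA challenge (2G only).
            return [AuthTuple(t.rand, t.sres, t.kc) for t in self.used[:count]]
        return []
```

With fallback disabled, an empty cache leads to a rejection. With fallback enabled, the same spent tuples are returned again, which is exactly the replay weakness discussed in the reasoning that follows.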
+
+The aim is to attach a subscriber without immediate access to the authentication key data.
+
+Completely switching off authentication would be an option on GERAN (2G), but this would mean complete lack of
+encryption on the air interface, and is not recommended. On 3G and later, authentication is always mandatory.
+
+The key data is known only to the USIM and the home HLR. The HLR generates distinct authentication tuples, each
+containing a cryptographic challenge (RAND, AUTN) and its expected response (SRES, XRES). The MSC consumes one tuple
+per authentication: it sends the challenge to the subscriber, and compares the response received.
+
+The proxy could generate fresh tuples if the cryptographic key data (Ki, K, OP/OPC) from the home HLR were shared with
+the proxy HLR. Distributed GSM does not do this, because:
+
+- The key data is cryptographically very valuable. If it were leaked, any and all authentication challenges would be
+  fully compromised.
+
+- In D-GSM, each home site shall retain exclusive authority over the user data. It should not be necessary to share the
+  secret keys with any remote site.
+
+So, how about resending already used auth tuples to the MSC when no fresh ones are available? Resending identical
+authentication challenges makes the system vulnerable to relatively trivial replay attacks, but this may be an
+acceptable fallback in situations of failing links, if it means being able to provide reliable roaming.
+
+But, even if a proxying HLR is willing to compromise cryptographic security to improve service, this can only work with
+GSM AKA:
+
+- In GSM AKA (so-called 2G auth), tuples may be re-used any number of times without a strict need to generate more
+  authentication challenges. The SIM will merely calculate the (same) SRES response again, and authentication will
+  succeed. It is bad security to do so, but it is a choice the core network is free to make.
+
+- UMTS AKA (Milenage or so-called 3G auth, but also used on 2G GERAN) adds mutual authentication, i.e. the core network
+  must prove that it is authentic. Specifically to thwart replay attacks that would spoof a core network, UMTS AKA
+  contains an ongoing sequence number (SQN) that is woven into the authentication challenge. An SQN may skip forward by
+  a certain number of counts, but it can never move backwards. If a USIM detects a stale SQN, it will request an
+  authentication re-synchronisation (by passing AUTS in an Authentication Failure message), after which a freshly
+  generated UMTS AKA challenge is strictly required -- not possible with an unresponsive home HLR.
+
+Post-R99 (1999) 2G GERAN networks are capable of UMTS AKA, so, not only 3G, but also the vast majority of 2G networks
+today use UMTS AKA -- and so does Osmocom, typically. Hence it is desirable to fully support UMTS AKA in D-GSM.
+
+[options="header"]
+|===
+| RAN | authentication is... 2+| available AKA types
+| GERAN (2G) | optional | GSM AKA | UMTS AKA
+| UTRAN (3G) | mandatory | - | UMTS AKA
+|===
+
+UMTS AKA will not allow re-sending previously used authentication tuples. But a UMTS capable SIM will fall back to GSM
+AKA if the network sent only a GSM AKA challenge. If the proxy HLR sends only GSM AKA tuples, then the MSC will request
+GSM authentication, and re-sending old tuples is again possible. However, a USIM will only fall back to GSM AKA if the
+phone is attaching on a 2G network. For 3G RAN and later, UMTS AKA is mandatory. So, as soon as a site uses 3G or newer
+RAN technology, there is simply no way to resend previously used authentication tuples.
+
+The only way to have unused UMTS AKA tuples in the proxy HLR is to already have them stored from an earlier time. The
+idea is to request more auth tuples in advance whenever the link is available, and cache them in the proxy. When the MSC
+uses up some tuples from the proxy HLR, the proxy cache can fill up again in its own time, by requesting more tuples
+from the home HLR at a time of good link. Then, the next time the subscriber needs immediate action, it does not matter
+whether the home HLR is directly reachable or not.
+
+As an aside, since OsmoMSC already caches a number of authentication tuples, one option would be to implement this in
+OsmoMSC, and not in the proxy HLR: the MSC could request new tuples long before its tuple cache runs dry. However, the
+OsmoMSC VLR state is volatile, and a power cycle of the system would lose the tuple cache; if the home HLR is
+unreachable at the time of the power cycle, roaming service would be interrupted. The proxy cache in the HLR is
+persistent, so roaming can continue immediately after a power cycle, even if the home HLR link is down.
+
+==== Location Updating
+
+Any attached subscriber periodically repeats a Location Updating procedure, typically every 15 minutes. If a home HLR is
+unreachable at the time of the periodic Location Updating, a roaming subscriber would assume that it is detached from
+the network, even though the local site it is roaming at is still fully operational.
+
+The aim of D-GSM is to keep subscribers attached even if the remote home HLR is temporarily unreachable. The simplest
+way to achieve that is by directly responding with an Update Location Result to the MSC.
+
+In addition to accepting an Update Location, a proxy HLR should also start an Insert Subscriber Data procedure, as a
+home HLR would do. For a periodic Location Updating, the MSC should already know all of the information that an Insert
+Subscriber Data would convey (i.e. the MSISDN), and there would be no strict need to resend this data. But if a
+subscriber quickly detaches and re-attaches (e.g. the device rebooted), the MSC has discarded the subscriber info from
+the VLR, and hence the proxy HLR should also always perform an Insert Subscriber Data. (On the GSUP wire, a periodic LU
+is indistinguishable from an IMSI-Attach LU.)
+
+Furthermore, the longer the proxy HLR's cache keeps a roaming subscriber's data after an IMSI Detach, the longer it is
+possible for the subscriber to immediately re-attach despite the home HLR being temporarily unreachable.
+
+If a subscriber has carried out a GSUP Update Location with the proxy HLR while the home HLR was unreachable, it is not
+strictly necessary to repeat that Update Location message to the home HLR later. The home HLR does keep a timestamped
+record of an Update Location from a proxy HLR if seen, but that has no visible effect on serving the subscriber:
+
+- If the home HLR still thinks that the subscriber is currently attached at the home site, it will respond to mslookup
+  requests. But the actual site the subscriber is roaming at will have a younger age, and its mslookup responses will
+  win.
+
+- If the home HLR has no record of the subscriber being attached recently, or has a record of being attached at another
+  remote site, it does not respond to mslookup requests for that subscriber. If it records the new proxy LU, it still
+  does not respond to mslookup requests since the subscriber is attached remotely, i.e. there is no difference.
+
+It is conceivable to always handle an Update Location in the proxy HLR, and never even attempt to involve the home HLR
+in case the proxy cache already has data for a given subscriber, but then the proxy HLR would never notice a changed
+MSISDN or authorization status for this subscriber. It is best practice to involve the home HLR whenever possible.
+
+==== IMSI Detach
+
+If a GSUP client reports a detaching IMSI when the home HLR is not reachable, simply respond with an ack.
+
+It is not required to signal the home HLR with a detach once the link is back up. A home HLR flags a remotely roaming
+subscriber as attached-at-a-proxy anyway, and there is literally no difference between telling a home HLR about a
+detach or not.
+
+(TODO: is there even a GSUP message that a VLR should send on IMSI Detach? see OS#4374)
+
+[[proxy_cache_umts_aka_resync]]
+==== UMTS AKA Resync
+
+When the SQNs of USIM and AUC (subscriber and home HLR) have diverged, the Send Authentication Info Request from the
+MSC contains an AUTS IE. This means that a resynchronization between USIM and AUC (the home HLR) is necessary. All of
+the UMTS AKA tuples in the proxy cache are now unusable, and the home HLR must respond with fresh tuples after doing a
+resync. This also means that either the home HLR must be reachable immediately, or GSM AKA fallback must be allowed for
+the subscriber to remain in roaming service.
+
+In short:
+
+- A UMTS AKA resync is handled similarly to the attaching of a so far unknown subscriber.
+- With the exception that previous GSM AKA tuples may be available to try a fallback to re-using older tuples.
+
+Needless to say, avoiding the need for UMTS AKA resynchronization is an important aspect of D-GSM's resilience
+against unreliable links.
+
+In UMTS AKA, there is not one single SQN, but there are a number of SQN slots, called IND slots or IND buckets. The IND
+bitlen configured on the USIM determines the number of slots available. The IND bitlen is usually 5, i.e. 2^5^ = 32
+slots. Monotonically rising SQNs are only strictly enforced within each slot, so that each site should maintain a
+different IND slot. OsmoHLR determines distinct IND slots based on the IPA unit name. As soon as more than 16 sites
+(with an MSC and SGSN each) are maintained, IND slots may be shared between distinct sites, and administrative care
+should be taken to choose wisely which sites share the same slots: those that least share a common user group.
+
+On 2G RAN, it may be possible to fall back to GSM AKA after a UMTS AKA resync request.
+TODO: test this
+
+Either way, the AUTS that was received from the MSC definitely needs to find its way to the home HLR, and, ideally, the
+immediately returned auth tuples from the home HLR should be used to attach the subscriber.
+
+=== CS and PS
+
+Each subscriber may have multiple HLR subscriptions from distinct CN Domain VLRs at any time: Circuit Switched (MSC) and
+Packet Switched (SGSN) attach separately and perform Update Location Requests that are completely orthogonal, as far as
+the HLR is concerned.
+
+Particularly the UMTS AKA tuples, which use distinct IND slots per VLR, need to be cached separately per CN Domain.
+
+Hence it is not enough to maintain one cache per subscriber. A separate auth tuple cache and Mobility Management state
+has to be kept for each VLR that is requesting roaming service for a given subscriber.
+
+=== Intercepting GSUP Conversations
+
+Taking over GSUP conversations in the proxy HLR is not as trivial as it may sound. Here are potential problems and how
+to fix them.
+
+[[proxy_cache_gsup_mm_messages]]
+==== Which GSUP Conversations to Intercept
+
+For the purpose of providing highly available roaming despite unreliable links to the home HLR, it suffices to intercept
+only the Mobility Management (MM) related GSUP messages:
+
+- Send Auth Info Request / Result
+- Update Location Request / Result
+- Insert Subscriber Data Request / Result
+- PurgeMS Request / Result (?)
+
+An interesting feature would be to also intercept specific USSD requests, like returning the own MSISDN or IMSI more
+reliably, or handling services that only make sense when served by the local site. At the time of writing, this is seen
+as a future extension of D-GSM and not considered for implementation.
+
+==== Determining Whether a Home HLR is Responsive
+
+Normally, all GSUP messages are merely routed via the proxy HLR and are handled by the home HLR. The idea is that the
+proxy HLR jumps in and saves a GSUP conversation when the home HLR is not answering properly.
+
+The simplest method to decide whether a home HLR is currently connected would be to look at the GSUP client state.
+However, a local flag that indicates an established GSUP connection does not necessarily mean a reliable link.
+There are keep-alive messages on the GSUP/IPA link, and a lost connection should reflect in the client state, so that a
+lost GSUP link definitely indicates an unresponsive home HLR. But for various reasons (e.g. packet loss), the link might
+look intact while a given GSUP message still fails to get a response from the home HLR.
+
+A more resilient method to decide whether a home HLR is responsive is to keep track of every MM related GSUP
+conversation for each subscriber, and to jump in and take over the GSUP conversation as soon as the response is taking
+too long to arrive. However, choosing an inadequate timeout period would either mean responding after the MSC has
+already timed out (too slow), or completely cutting off all responses from a high-latency home HLR (too fast).
+
+Also, if the proxy HLR has already responded to the MSC, but a slow home HLR's response arrives shortly after,
+forwarding this late message to the MSC on top of the earlier response to the same request would confuse the GSUP
+conversation.
+
+So, the proxy HLR just jumping into the GSUP conversation when a specific delay has passed is fairly complex and error
+prone. A better idea is to always intercept MM related GSUP conversations:
+
+[[proxy_cache_gsup_conversations]]
+==== Solution: Distinct GSUP Conversations
+
+A solution that avoids all of the above problems is to *always* take over *all* MM related conversations (see
+<<proxy_cache_gsup_mm_messages>>), as soon as the proxy has sufficient data to service them by itself; at the same time,
+the proxy HLR should also relay the same requests to the home HLR, and acknowledge its responses, after the fact.
+
+If the proxy cache already has a complete record of a subscriber, the proxy HLR can always directly accept an Update
+Location Request, including an Insert Subscriber Data. A prompt response ensures that the MSC does not time out its
+GSUP request, and reduces waiting time for the subscriber.
+
+To ensure that the proxy HLR's data on the subscriber doesn't become stale and diverge from the home HLR, the proxy
+asynchronously also forwards an Update Location Request to the home HLR. In most normal cases, there will be no
+surprises, and the home HLR will continue with an Insert Subscriber Data Request containing already known data, and an
+Update Location Result accepting the LU.
+
+If the home HLR does not respond, the proxy HLR ignores that fact -- the home HLR is not reachable, and the aim is to
+continue to service the subscriber for the time being.
+
+But, should the home HLR's Insert Subscriber Data Request send different data than the proxy cache sees on record, the
+proxy HLR can trigger another Insert Subscriber Data Request to the MSC, to correct the stale data sent before.
+
+Similarly, if the home HLR rejects the Update Location Request completely, the proxy HLR can tell the MSC to detach the
+subscriber with a Cancel Location Request message, as soon as it notices the rejection.
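The after-the-fact reconciliation described above boils down to a small decision table. The following Python sketch illustrates it under simplifying assumptions: the subscriber data is reduced to the MSISDN, and the names `HomeReply`, `Action` and `reconcile` are hypothetical, not taken from OsmoHLR.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Action(Enum):
    NONE = auto()                    # nothing to correct towards the MSC
    INSERT_SUBSCRIBER_DATA = auto()  # push corrected subscriber data to the MSC
    CANCEL_LOCATION = auto()         # tell the MSC to detach the subscriber


@dataclass
class HomeReply:
    accepted: bool                   # Update Location Result vs. Error
    msisdn: Optional[str] = None     # data from the home HLR's Insert Subscriber Data


def reconcile(cached_msisdn: str, reply: Optional[HomeReply]) -> Action:
    """Decide the follow-up after the proxy has already answered the MSC itself."""
    if reply is None:
        # Home HLR did not respond: keep serving from the cache.
        return Action.NONE
    if not reply.accepted:
        # Home HLR rejected the Update Location: revoke the earlier acceptance.
        return Action.CANCEL_LOCATION
    if reply.msisdn is not None and reply.msisdn != cached_msisdn:
        # Cached data was stale: correct what was sent to the MSC before.
        return Action.INSERT_SUBSCRIBER_DATA
    return Action.NONE
```

Because the MSC has already received a complete answer from the proxy, a late or missing reply from the home HLR never blocks the subscriber; it can only trigger one of these corrective messages.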
+
+Note that a UMTS AKA resynchronization request invalidates the entire auth tuple cache and needs to either be sent to
+the home HLR immediately, if available, or the AUTS from the USIM must later reach the home HLR to obtain fresh UMTS AKA
+tuples for the cache. See <<proxy_cache_umts_aka_resync>>.
+
+=== Message Sequences
+
+==== Normal Roaming Attach
+
+On first attaching via a proxy HLR, when there is no proxy state for the subscriber yet, the home HLR must be reachable.
+
+The normal procedure takes place without modification, except that the proxy HLR keeps a copy of the first auth tuples
+it forwards from the home HLR back to the MSC (marked as used) (1). This is to have auth tuples available for resending
+already used tuples in a fallback to GSM AKA, in case this is enabled in the proxy HLR config.
+
+After the Location Updating has completed successfully, the proxy HLR fills up its auth tuple cache by additional Send
+Auth Info Requests (2). As soon as unused auth tuples become available, the proxy HLR can discard already used tuples
+from (1).
+
+.Normal attaching of a subscriber that is roaming here
+["mscgen"]
+----
+include::proxy_cache_attach.msc[]
+----
+
+==== MSC Requests More Auth Tuples
+
+As soon as the MSC has run out of fresh auth tuples, it will ask the HLR proxy for more. Without proxy caching, this
+request would be directly forwarded to the home HLR. Instead, the proxy HLR finds unused auth tuples in the cache and
+directly sends those (3). Even if there is a reliable link, the home HLR is not contacted at this point.
+
+Directly after completing the Send Auth Info Result, the proxy HLR finds that fewer tuples than requested by the D-GSM
+configuration are cached, and asks the home HLR for more tuples, to fill up the cache (4). If there currently is no
+reliable link, this will fail, and the proxy HLR will retry periodically (5) / upon GSUP reconnect.
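The refill in steps (4) and (5) could look roughly like the Python sketch below. It assumes a hypothetical `request_tuples` callback that returns `None` when the home HLR does not answer; the function name `refill` and the `target` parameter (the number of tuples the configuration asks to keep cached) are likewise illustrative, and tuples are represented as opaque byte strings.

```python
from typing import Callable, List, Optional

# Hypothetical stand-in for a Send Auth Info round trip to the home HLR:
# takes the number of tuples wanted, returns them, or None if the link failed.
RequestFn = Callable[[int], Optional[List[bytes]]]


def refill(unused: List[bytes], target: int, request_tuples: RequestFn) -> bool:
    """Top up the cache to `target` unused tuples.

    Returns True when the cache is full, False when the caller should retry
    later (periodically, or upon GSUP reconnect).
    """
    missing = target - len(unused)
    if missing <= 0:
        return True
    got = request_tuples(missing)
    if got is None:
        # No reliable link right now; leave the cache as it is.
        return False
    unused.extend(got[:missing])
    return len(unused) >= target
```

The MSC's request has already been answered from the cache at this point, so a failed refill costs the subscriber nothing; it only shortens how long the next link outage can be bridged.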
+
+.When the MSC has used up all of its auth tuples, but the proxy HLR still has unused auth tuples in the cache
+["mscgen"]
+----
+include::proxy_cache_more_tuples.msc[]
+----
+
+==== Running Out of Auth Tuples
+
+When all fresh tuples from the proxy HLR have been used up, and the home HLR remains unreachable, the proxy HLR normally
+fails and rejects the subscriber (default configuration).
+
+If explicitly enabled in the configuration, the proxy HLR will attempt to fall back to GSM AKA and resend already spent
+tuples, deliberately omitting UMTS AKA parts (6).
+
+Note that an attempt to refill the tuple cache in the proxy HLR always happens asynchronously. If there are no tuples,
+that means the link to the home HLR is currently broken, and there is no point in trying to contact it now. Tuples will
+be obtained as soon as the link is established again.
+
+.When the MSC has used up all of its auth tuples and the proxy HLR tuple cache is dry
+["mscgen"]
+----
+include::proxy_cache_tuple_cache_dry.msc[]
+----
+
+==== Periodic Location Updating
+
+Each subscriber performs periodic Location Updating to ensure that it is not implicitly detached from the network. When
+the proxy HLR already has a proxy cache for this subscriber, all information to complete the periodic Location Updating
+is already known in the proxy HLR. If the link to the home HLR is unresponsive, the proxy HLR mimics the Insert
+Subscriber Data Request that the home HLR would normally send, using the cached MSISDN, and then sends the Update
+Location Result. The subscriber remains attached without a responsive link to the home HLR being required.
+
+.Periodic Location Updating when the MSC still has unused auth tuples
+["mscgen"]
+----
+include::proxy_cache_periodic_lu.msc[]
+----
+
+==== UMTS AKA Resync
+
+The AUTS from a UMTS AKA resync needs to reach the home HLR sooner or later, and a resync renders all UMTS AKA tuples in
+the cache stale.
+
+.Cached tuples become unusable from a UMTS AKA resynchronisation request from the USIM.
+["mscgen"]
+----
+include::proxy_cache_umts_aka_resync.msc[]
+----
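The cache handling on a resync can be summarised in a short Python sketch. The class `ResyncState` and its methods are hypothetical names chosen for illustration; how OsmoHLR actually stores a pending AUTS is not specified here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ResyncState:
    umts_tuples: List[bytes] = field(default_factory=list)  # cached unused UMTS AKA tuples
    pending_auts: Optional[bytes] = None                    # AUTS not yet seen by the home HLR

    def on_auts(self, auts: bytes, home_reachable: bool) -> str:
        """Handle a Send Auth Info Request that carries an AUTS IE."""
        # A resync makes every cached UMTS AKA tuple stale.
        self.umts_tuples.clear()
        if home_reachable:
            self.pending_auts = None
            return "forward"         # relay to the home HLR right away
        self.pending_auts = auts     # must reach the home HLR later
        return "defer"               # reject, or try GSM AKA fallback if enabled

    def on_link_up(self) -> Optional[bytes]:
        """Return the AUTS to send to the home HLR once the link is back, if any."""
        auts, self.pending_auts = self.pending_auts, None
        return auts
```

Keeping the deferred AUTS around means that the first successful contact with the home HLR can both resynchronise the SQN and refill the cache with fresh tuples.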