Friday, 11 August 2017

I Moved to a New ISP and Mail Stopped Flowing - What Now?

Hi There,

A customer of mine has a hybrid Exchange environment. They recently moved from one ISP to another, and mail stopped flowing.

DNS, firewall and NAT rules had been updated, and the customer waited long enough for DNS to propagate. Still no joy.

What was missing? Searching the Internet returned no meaningful results.

For those of you going through the same experience: once you have moved to the new ISP, updated your DNS records and changed your firewall and NAT rules, simply re-run the Hybrid Configuration Wizard. It re-populates the hybrid configuration with the new details and restores mail flow.

The same applies when the public IP address changes, regardless of the reason.
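Before re-running the wizard, it is worth confirming what the current configuration still points at. A minimal sketch from the on-premises Exchange Management Shell (the connector name below is illustrative; yours will differ):

# Show the stored hybrid configuration
Get-HybridConfiguration | Format-List

# Check what the O365 send connector still references
Get-SendConnector "Outbound to Office 365" | Format-List Name, SmartHosts, TlsCertificateName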

Have a nice day :-)

Password Sync - No Recent Synchronization

Hi There,

This is another one of many posts about AAD Connect failing to synchronise passwords, this time with some additional clarifications.

The error:


The context:

  • The admin had configured his own account on the AD DS connector in the management agent.
  • Over time the admin changed his password, and AD sync broke.
  • To correct this, a new service account dedicated to AD access was created, and the connector was reconfigured to use it. AD sync started working again.
What didn't happen: the new account was never granted permission to synchronise passwords. User properties were synchronised, but password hashes were not.

There were informational 611 events from Directory Synchronization in the Application event log:


The relevant bit: RPC Error 8453 : Replication access was denied. There was an error calling _IDL_DRSGetNCChanges

This is because the connector account lacked the following permissions - see https://msdn.microsoft.com/en-us/library/azure/dn757602.aspx:
  • Replicating Directory Changes
  • Replicating Directory Changes All
These permissions are granted on the domain root.
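If you prefer the command line, the grant can be scripted with dsacls; a minimal sketch (the domain DN and the connector account name are placeholders):

# Grant both replication rights on the domain root to the connector account
dsacls "DC=domain,DC=tld" /G "DOMAIN\svc-adsync:CA;Replicating Directory Changes"
dsacls "DC=domain,DC=tld" /G "DOMAIN\svc-adsync:CA;Replicating Directory Changes All"

Otherwise, the GUI route: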

Open Active Directory Users and Computers and in the View menu enable Advanced Features.


Right-click on the domain name, Properties, Security. Add the account and grant the permissions:


Wait for the next synchronization cycle or kick one off manually. Passwords should now sync successfully.
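To kick a cycle off manually, use the ADSync module on the AAD Connect server:

# Run on the AAD Connect server
Import-Module ADSync
Start-ADSyncSyncCycle -PolicyType Delta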

One last thing: the account you have to give permissions to is NOT what's configured in the Microsoft Azure AD Sync service:



Instead, the permissions have to be granted to the account configured on the AD connector:


Happy syncing!

Friday, 3 March 2017

Where Has My Licence Gone?!

"Who removed my licence?"
"Probably Microsoft. Your 30-day grace period has likely expired..."

Hi All,

Recently one of my clients reported that some users had lost their O365 licence. They were working yesterday and could no longer log on today - the licence was wiped. Completely. No prior notice.

What was going on?

I searched the audit log. I could see the admin re-assigning the licence so that the user could work, but no trace of its removal.

To cut it short, it ended up in Microsoft's lap. After a couple of weeks of log analysis and investigation, it turned out that following a hybrid mailbox on-boarding, the O365 licence had failed to be applied to a small group of users. Even though unlicensed, they were able to connect and use their e-mail because they were within the 30-day grace period. Once the grace period ended, the mailboxes were disconnected.

Why didn't I see an entry in the audit log? Simple: a licence removal event never occurred because there was no licence to remove in the first place - remember, the licence assignment had failed. Going back as far as the mailbox migration itself, we could see an audit log entry for the failed licence assignment attempt. This can happen when the assignment is scripted, and it is easily missed, especially when you have lots of accounts.

While this resolved the mystery, it also revealed a couple of shortcomings of O365:

  • We couldn't tell from the O365 audit log entries which licences were assigned to or removed from a user. Microsoft pointed out during the case that O365 and Azure maintain separate audit logs, and the Azure log is more detailed; the O365 log is not. You can, however, see the activity and who actioned it. NOTE: It may take up to 12 hours for an action to appear in the log.
  • O365 does NOT alert about the imminent end of the 30-day grace period.
  • Microsoft has very little documentation about the 30-day grace period.
On the topic of documentation, the engineer who worked on the case passed me this link, which states (clutter removed by me):
Assume that you have a hybrid deployment of Microsoft Exchange Online in Microsoft Office 365 and on-premises Microsoft Exchange Server...
If a license is not assigned to the user, the mailbox may be disconnected...
This issue occurs if the mailbox was migrated to Exchange Online as a regular user mailbox ... If the user isn't licensed, and if the 30-day grace period has ended, the mailbox is disconnected...

Now that I know what to look for, I've come across this link which states:
After you create a new mailbox using the Exchange Management Shell, you have to assign it an Exchange Online license or it will be disabled when the 30-day grace period ends.
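For reference, assigning a licence from PowerShell looks roughly like this; a sketch using the MSOnline module (the UPN is a placeholder, and the SKU name varies per tenant - list yours with Get-MsolAccountSku):

Connect-MsolService
# UsageLocation must be set before a licence can be assigned
Set-MsolUser -UserPrincipalName user@domain.tld -UsageLocation AU
Set-MsolUserLicense -UserPrincipalName user@domain.tld -AddLicenses "tenant:ENTERPRISEPACK"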

Takeaway #1: Always check licences after a mailbox on-boarding in a hybrid migration.

Takeaway #2: Monitor your users' licensed status regularly. Automatic alerting may be flaky, so if you are a developer you may want to put together an application that uses the audit APIs to extract data and send alerts.
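As a quick manual check in the meantime, the MSOnline module can list unlicensed users; a sketch:

# Review the output against the set of users you expect to be unlicensed
Get-MsolUser -All -UnlicensedUsersOnly | Select-Object UserPrincipalName, DisplayName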

Takeaway #3: You need to search the correct audit log. There are a couple of Security and Compliance centers in different places on the portal. The one you're after is under Admin centers | Security & Compliance; in the new window, navigate to Search & Investigation | Audit Log Search.


Once there, select the Update user and Changed user license activities in the User administration activities section:
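The same search can be scripted via Exchange Online PowerShell with Search-UnifiedAuditLog; a sketch (the operation names below are as they appeared in my tenant and may change over time):

# Look for licence-related activity in the last 30 days
Search-UnifiedAuditLog -StartDate (Get-Date).AddDays(-30) -EndDate (Get-Date) `
    -Operations "Change user license.", "Update user." -ResultSize 1000 |
    Select-Object CreationDate, UserIds, Operations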


Happy auditing!

Friday, 10 February 2017

Email and UPN do not belong to the same namespace

Hi There,

Just recently I helped out in a case where hybrid Exchange users with the mailbox in the cloud were failing to retrieve autodiscover configuration data.

It was an environment with multiple federated domains and mailboxes split between on-premises and O365. On-premises mailboxes worked well; only O365 mailboxes were failing and, more bizarrely, one of the domains worked while the others didn't.

As we know, mail clients for onboarded hybrid mailboxes go through a double autodiscover process:

  1. The first iteration discovers the on-premises mail user, which then redirects the client to the cloud mailbox.
  2. The client then goes through a second autodiscover process, this time against the cloud mailbox.
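For reference, the first hop can be reproduced by hand with a raw Autodiscover (POX) request; a sketch with placeholder names - for an onboarded mailbox, the on-premises response should contain a redirect address pointing the client at the cloud:

$body = @"
<Autodiscover xmlns="http://schemas.microsoft.com/exchange/autodiscover/outlook/requestschema/2006">
  <Request>
    <EMailAddress>user@domain.tld</EMailAddress>
    <AcceptableResponseSchema>http://schemas.microsoft.com/exchange/autodiscover/outlook/responseschema/2006a</AcceptableResponseSchema>
  </Request>
</Autodiscover>
"@
Invoke-WebRequest -Uri "https://autodiscover.domain.tld/autodiscover/autodiscover.xml" `
    -Method Post -ContentType "text/xml" -Body $body -UseDefaultCredentials |
    Select-Object -ExpandProperty Content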

Looking at the Microsoft Remote Connectivity Analyzer output, I noticed that the first iteration succeeded. The issue was with the second pass. The error:

X-AutoDiscovery-Error: LiveIdBasicAuth:LiveServerUnreachable:<X-forwarded-for:40.85.91.8><ADFS-Business-105ms><RST2-Business-654ms-871ms-0ms-ppserver=PPV: 30 H: CO1IDOALGN268 V: 0-puid=>LiveIdSTS logon failure '<S:Fault xmlns:S="http://www.w3.org/2003/05/soap-envelope"><S:Code><S:Value>S:Sender</S:Value><S:Subcode><S:Value>wst:InvalidRequest</S:Value></S:Subcode></S:Code><S:Reason><S:Text xml:lang="en-US">Invalid Request</S:Text></S:Reason><S:Detail><psf:error xmlns:psf="http://schemas.microsoft.com/Passport/SoapServices/SOAPFault"><psf:value>0x800488fc</psf:value><psf:internalerror><psf:code>0x8004786c</psf:code><psf:text>Email and UPN do not belong to the same namespace.%0d%0a</psf:text></psf:internalerror></psf:error></S:Detail></S:Fault>'<FEDERATED><UserType:Federated>Logon failed "User@domain.tld".;

Needless to say, I checked on-premises and cloud user and mailbox properties, UPNs, addresses, connector address spaces, proxy addresses, the lot, to no avail. It was all configured correctly.

Then I checked ADFS. I found a couple of errors there, so I turned on debug tracing. Nothing obvious there either, so I turned it off.

I tested ADFS logon via https://sts.domain.tld/adfs/ls/IdpInitiatedSignon.aspx: Successful.

ADFS was working well per se. Somehow, it was getting incorrect details from O365.

I also suspected that somehow the ADFS proxy was breaking the SSL stream - I had dealt with a similar situation before. However, that idea was dropped when the original (third-party) proxy was replaced with Microsoft's native Web Application Proxy and the issue remained.

To recap:

  • Users had matching e-mail and UPN
  • ADFS itself was working
  • Multiple federated domains
  • ADFS Proxy (WAP) ruled out


Then I decided to re-federate the domains. Since I had very little (a.k.a. none whatsoever) information on how it had been set up initially, and no details of any deployment history, tabula rasa seemed in order.

So I converted the domains to Managed, then re-federated them. Since there were multiple federated domains, I used the Convert-MsolDomainToFederated cmdlet with the SupportMultipleDomain switch.
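The rough sequence, as a sketch (the domain name and password file path are placeholders; run from the MSOnline module with access to the ADFS farm):

# Convert the domain back to Managed without converting the users
Convert-MsolDomainToStandard -DomainName domain.tld -SkipUserConversion $true -PasswordFile C:\Temp\passwords.txt
# Re-federate, with multi-domain support
Convert-MsolDomainToFederated -DomainName domain.tld -SupportMultipleDomain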

Coffee time, to give the changes time to do their thing (it's a distributed environment and things don't happen instantly).

<suspense>Then came the test</suspense> ...  Lo and behold: everything started to work! Every federated domain, every service.

In summary: The hybrid environment and user accounts were configured correctly, yet the wrong details were passed by O365 to ADFS. Either the trusts were incorrectly configured, or the federation metadata got corrupted. Whatever it was, re-federating the domains fixed it.

Sunday, 29 January 2017

The Curious Case of GroupPolicy Error 1053

Recently I was involved in a case where a user was losing his Taskbar icons and email signatures. He had all possible folders redirected to a remote network share over a WAN link. The icons are stored in the AppData (Roaming) structure, and I wanted to bring that folder back to the local computer (a.k.a. cancel redirection for the AppData folder only) without affecting the other folders.

I knew I had to fiddle with some GPOs, and this is where the fun started.

I created a new GPO, applied security filtering to scope it to this one user, and ran GPUPDATE. It bombed out:


The Details tab of the same event suggests that the user account is disabled:

ErrorCode: 1331
ErrorDescription: This user can't sign in because this account is currently disabled.


I knew network connectivity wasn't an issue. I checked DNS anyway, even though computer settings were applying successfully. This uncovered a whole heap of inconsistencies, which I corrected; however, it didn't affect my GPOs. I even turned on USERENV logging - nothing useful in there, and nothing in the GroupPolicy log either.

I knew the account wasn't disabled because I was logged on with that very account.

I was banging my head against the wall. Then I came across a forum post where someone suggested clearing the credentials vault.

I checked the vault and, surprise, surprise, found the credentials of another user - and yes, that user's account was DISABLED! I enabled the account, and GPUPDATE then reported that the credentials were invalid. Indeed, I noticed Security-Kerberos Warning 14 events in the user's System log, to which I hadn't given much attention before:


I was getting somewhere.

So apparently, once upon a time, my user had to access another user's resources for whatever reason. In time, though, the other user moved on, his password was changed and the account was disabled. The credentials in my user's Credential Manager vault were never refreshed or removed.

There was no trace anywhere in any log that I could find of the account in the vault: not in the USERENV log, not in the domain controller's Security log - nowhere. Not one hint.

Clearing the obsolete account's credentials from my user's Credential Manager fixed it: the GPOs are now applying happily.



Takeaway point: GroupPolicy Error 1053 events in the System log with ErrorCode 1331 are likely caused by another user's stale credentials sitting in Credential Manager. Go check your vault and clear/update anything suspicious.
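The vault can also be inspected and cleaned from the command line with cmdkey (the target name below is a placeholder - copy the real one from the /list output):

cmdkey /list
cmdkey /delete:Domain:target=SERVER01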

Thursday, 1 December 2016

Potential Security Hole In RMS with File Classification Infrastructure

Brace yourself for a long post on what to expect when embarking on protecting files with RMS and Windows Server File Classification Infrastructure (FCI). It isn't pretty, no matter how I look at it.

The Short Story

A client had a need to protect files and information against data leak. Being an Office 365 tenant also, the obvious choice was Azure RMS.

I implemented Azure RMS in conjunction with File Classification Infrastructure. It came together nicely, however during testing it became evident that Microsoft still has a long way to go to deliver a dependable solution.

I found that RMS in combination with FCI is unreliable and exposes unprotected information to the risk of being leaked.

The Flaw

A deadly combination of PowerShell scripts and uncorrelated tasks leads to failures due to file sharing violations, and ultimately to unprotected files and potential data leak.



While my implementation used Azure RMS, I dare say a full-blown on-premises AD RMS solution would be plagued by the same flaw.

The Nitty-Gritty

The official Microsoft implementation document is available at https://docs.microsoft.com/en-us/information-protection/rms-client/configure-fci. The article says (highlights are mine):


I did just that. Then I started testing. I set the schedule for the task (remember, "continuous operation" was also enabled).

The first thing that I noticed was that some of my test files remained unprotected after the task ran. In the Application event log I noticed the following error:



Tried the easy way out, checking the HTML report:

Error: The command returned a non-zero exit code.

Useless as a windshield wiper on a submarine.


Every time the scheduled RMS script ran, random files were listed as failing. Most weird...

NOTE: Before anyone objects, I'll point out that I dropped a copy of the RMS-Protect-FCI.ps1 file into the test folder myself. This is NOT the copy that is run to protect files, just a copy for testing generic protection.

I checked the log mentioned in the event. The result: same as above (useless):


More digging revealed that additional logs are created in the same folder as the one above. The log in the screenshot (see further below) correlates with the one above: it was created roughly 5 minutes later - a consistent delay of approximately 5 minutes throughout my tests (your delay may differ depending on server performance, but it should be consistent between runs).

In the log I was getting the following:

Item Message="The properties in the file were modified during this classification session so they will not be updated. Another attempt to update the properties will be made during the next classification session."

and

Rule="Check File State Before Saving Properties"


These are different files, but they were in the same test folder and they correlate with the previous events. We are getting somewhere, but not quite there yet; the information is insufficient to draw any conclusions.

For the record, log locations are configured as follows:


A gut feeling prompted me to watch Task Manager more closely. I turned on the Command Line column to see what was being launched and how, and started watching what was chewing CPU power. I was in for a rude shock: the PowerShell task configured on the Action tab of the File Management Task properties was being re-launched as soon as it exited, in an endless loop, and it was obviously my RMS script. The following screenshot cannot convey the dynamics, but if you watch Task Manager for a while you'll see PowerShell (re)protecting all files in the folder in a never-ending loop:


Then I tried to manually protect a file. It produced a file corruption error:

Protect-RMSFile : Error protecting Shorter.doc with error: The file or directory is corrupted and unreadable. HRESULT: 0x80070570

I instantly kicked off CHKDSK, but it came back clean. I tried to open the file in question: successful. Apparently there was no corruption and I could open the file, yet RMS had failed with a file or directory corruption error. Go figure.


Then I wanted to see what the file protection status was in general. I ran the following command as documented at https://docs.microsoft.com/en-us/information-protection/rms-client/configure-fci:

foreach ($file in (Get-ChildItem -Path C:\RMSTest -Force | where {!$_.PSIsContainer})) {Get-RMSFileStatus -f $file.PSPath}

To my surprise, I got a file access conflict error:

Get-RMSFileStatus : The process cannot access the file because it is being used by another process. HRESULT: 0x80070020

I had run this command numerous times during my tests and had never seen this error before. A subsequent run of the same command came back clean. Was it the backup? Was it the antivirus? It was the middle of the night, so no-one was accessing the folder except my script. I was puzzled...


A bit later, while watching Task Manager with the scheduled task in progress and "continuous operation" enabled, I noticed two RMS PowerShell tasks running concurrently:


One of them disappeared as soon as the scheduled task completed, while the one spawned by what I think is "continuous operation" remained. This may explain the random sharing violations and errors in the logs above.

I also noticed that while new files created in Word or Excel were protected almost instantly thanks to continuous operation, files copied in from elsewhere were protected by neither the "continuous operation" task nor the scheduled task.

I also want to point out that the PowerShell script can be scheduled to run no more often than once a day. If it fails to protect a file, there is therefore a window of at least 24 hours until it runs again, during which unprotected files can be leaked.
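If that window is a concern, one possible stop-gap - purely a sketch under my assumptions, not part of the official guidance - is a separate sweep that protects anything still left unprotected (RMS Protection Tool module assumed; the template GUID and folder path are placeholders):

# Sweep the folder and protect any file that is still unprotected
$templateId = "00000000-0000-0000-0000-000000000000"   # your RMS template GUID
Get-ChildItem -Path C:\RMSTest -Force | Where-Object { -not $_.PSIsContainer } | ForEach-Object {
    $status = Get-RMSFileStatus -File $_.FullName
    if ($status.Status -like "Unprotected*") {
        Protect-RMSFile -File $_.FullName -TemplateId $templateId
    }
}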

As I mentioned before, I had been watching Task Manager, and the RMS PowerShell script was taking up significant processing power. The official Microsoft documentation at RMS protection with Windows Server File Classification Infrastructure (FCI) (also linked above) states that the script is run against every file, every time:


Knowing the effect an iterative file scan has on disk performance, I kicked off a Perfmon session showing CPU utilisation and % Disk Time, just to check. Then, while watching it, I turned off "continuous operation" in File Server Resource Manager (screenshot after the sketch below).
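If you prefer to sample the same counters from PowerShell, a quick sketch (counter paths as they appear on an EN-US system):

Get-Counter -Counter "\Processor(_Total)\% Processor Time", "\PhysicalDisk(_Total)\% Disk Time" `
    -SampleInterval 5 -Continuous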



The effect was staggering: the CPU almost instantly took a holiday and the disk too was much happier:



Putting the puzzle together:
  • The File Server Resource Manager runs a scheduled task and a continuous task concurrently.
  • I am getting random files failing to be protected (see SRMREPORTS error event 960 and the associated logs above).
  • I am getting random sharing violation errors.
  • I had one corruption report, although I cannot be sure it was the result of my setup.
  • I was expecting the "continuous operation" mode to be event-driven, i.e. triggered only when a new file is dropped in the folder. What Task Manager indicates, though, is that it is an iterative task in an endless loop that constantly scans the disk. This seems to be corroborated by the Perfmon data.
  • The "continuous operation" PowerShell task is a resource hog.

Choices, Choices...

It looks like having "continuous operation" enabled works well for instant protection of new files, so you would want it turned on to minimise the chances of leaking sensitive data. However, as I observed, it conflicts with the scheduled file protection task, leaving random files unprotected.

On the other hand, I also observed that with continuous operation turned off, the system breathes a lot easier and the scheduled task protects files more reliably. On the downside, this leaves a significant time window during which files are unprotected and sensitive information can be compromised.


In Conclusion

My observations lead me to conclude that when PowerShell scripts are used to RMS-protect files, File Server Resource Manager fails to coordinate the scheduled and continuous operation tasks, resulting in random sharing violations and, ultimately, unprotected files. The primary purpose of RMS - protecting files - thus falls flat on its face, giving the unsuspecting or malicious user plenty of time to leak potentially sensitive data.

Additionally, the way PowerShell scans the file system and (re)protects files results in significant performance degradation, even on lightly utilised systems.

While RMS is a great piece of technology for mitigating the leakage of sensitive data, it seems it still has a long way to go before becoming a dependable tool when integrated with File Classification Infrastructure.

PS: All this is based on observation and trial and error, and hence it may be incorrect in some respects. Information is scarce out there. I welcome any comments, suggestions and pointers to additional information.