List Archives

This forum is an archive of all posts to our mailing list over the past few years. The forum is read-only, so to contribute you will need to join our list community.

Subject: [ActiveDir] Ntds.dit file corruption
AD000001290

12/08/2005 9:10 AM  
Maybe I should flip the question around a little...
What are the changes made in exch2k3 sp1 (involving ECC corrections), and
why were they deemed necessary, given what I have read from
joe/Brett/Eric et al.?

The changes appear to be superfluous. We do not appear to need such a
(further) check re: AD/ESE(?)

What am I missing guys?

neil
PS Great thread so far :)

-----Original Message-----
From: ActiveDir-owner@xxxxxxxxxxxxxxxxxx
[mailto:ActiveDir-owner@xxxxxxxxxxxxxxxxxx] On Behalf Of joe
Sent: 07 December 2005 01:55
To: ActiveDir@xxxxxxxxxxxxxxxxxx
Subject: RE: [ActiveDir] Ntds.dit file corruption

Good post ~Eric, thanks for chiming in.

I see where you are coming from with the corruption at the distributed
level. In terms of corruption at that level I see it as corruption but
just can't get myself to see it as AD corruption. I am not sure if I can
put it down in words why. I just don't. :)

joe

-----Original Message-----
From: ActiveDir-owner@xxxxxxxxxxxxxxxxxx
[mailto:ActiveDir-owner@xxxxxxxxxxxxxxxxxx] On Behalf Of Eric Fleischman
Sent: Tuesday, December 06, 2005 5:42 PM
To: ActiveDir@xxxxxxxxxxxxxxxxxx
Subject: RE: [ActiveDir] Ntds.dit file corruption

> I would generally not call USN rollback a corruption either, but I think
> Dean makes a fair and quasi-valid point that if you consider the
> distributed system, yes such a thing is a corruption. Feel free to shim
> in an "AD Distributed System Logical Layer" in the above stack, between
> AD Logical Layer and App Logical Layer. I'm waffling on this point
> though, as something smells different than other types of corruption.
> I'm going to think about that for a long time ... in fact Eric (yes, the
> ~Eric) is at my door and says he would consider it corruption, so there
> is a long debate in my future as well ...

Over lunch, Brett and I discussed this some more. My contention is that
USN rollback would be a form of corruption under a somewhat broad
definition.
The reality is that there is a layer that Brett mentioned which actually
has two parts when looked at from a high level. Namely, this layer:
> AD Logical Layer

The first piece could be thought of as the local logical layer. That is,
data hierarchy, conforming to the code assumptions of how it should be,
data conforming to the schema as defined, etc. This is a layer of data
that clearly need be proper (leaving the definition of proper to another
day), else we are in some sort of corrupt state. Brett and I both agree
on this I'm pretty sure.

However, there is then distributed systems corruption. In AD, one of the
services we aim to provide is convergence. If we do not converge, we
define this divergence as at a minimum "bad", perhaps "corrupt."
USN rollback breaks our convergence guarantees, it breaks replication
such that you will not attain convergence in the system. I would as such
consider it a form of corruption.
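
[Illustrative sketch, not part of the original post. It shows, under heavy simplification, why a USN rollback can prevent convergence: a partner remembers the highest USN it has seen from a DC, so changes written after the DC is rolled back reuse USNs the partner believes it already has. The class and method names (DomainController, pull_from, snapshot, restore) are hypothetical, and real AD replication additionally uses invocation IDs and up-to-dateness vectors, which this toy model ignores.]

```python
class DomainController:
    """Toy model of one DC: a local USN counter, data, and a change log."""

    def __init__(self, name):
        self.name = name
        self.usn = 0            # local update sequence number
        self.data = {}          # current attribute values
        self.log = []           # (usn, key, value) for each local write
        self.watermark = {}     # highest USN already pulled from each partner

    def write(self, key, value):
        self.usn += 1
        self.data[key] = value
        self.log.append((self.usn, key, value))

    def pull_from(self, source):
        """Apply only the changes whose USN is above the remembered watermark."""
        seen = self.watermark.get(source.name, 0)
        for usn, key, value in source.log:
            if usn > seen:
                self.data[key] = value
                seen = usn
        self.watermark[source.name] = seen

    def snapshot(self):
        return (self.usn, dict(self.data), list(self.log))

    def restore(self, image):
        """Roll back to an old image WITHOUT telling partners (the unsafe case)."""
        self.usn, data, log = image
        self.data, self.log = dict(data), list(log)


dc1, dc2 = DomainController("DC1"), DomainController("DC2")
dc1.write("a", 1)
image = dc1.snapshot()          # disk image taken at USN 1
dc1.write("b", 2)               # USN 2
dc2.pull_from(dc1)              # DC2 now remembers "seen DC1 up to USN 2"

dc1.restore(image)              # rollback: DC1 is back at USN 1
dc1.write("c", 3)               # reuses USN 2
dc2.pull_from(dc1)              # nothing above the watermark, so "c" is skipped

print(dc1.data)                 # DC1 has a and c
print(dc2.data)                 # DC2 has a and b: the two never converge
```

Each DC looks internally consistent here, which matches the point made in the thread: the problem only shows up when the distributed system is viewed as a whole.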

Over Teriyaki a few minutes ago, Brett posited the question "well if USN
rollback is corruption, what else?" Valid question. I would concede that
if USN rollback is considered distributed systems corruption, so too
would be other conditions which yield divergence. Perhaps this is a
slippery slope that goes too far. I need to think about this some more.

I would also toss out there that corruption should not be confused with
"forever broken." There are many states in which the directory can exist
where it is functional, but in some way broken. Such divergences can
typically be repaired with administrative action, so long as it is a
savvy administrator. :) If we are willing to assume that divergence is
corruption, I'd tend to believe that most people on this list have
recovered from some form of corruption before. The worse the corruption,
the more help you likely want to recover from it. :)

Anyway, we'll likely debate this for a few months, as we usually do on
such points. More thoughts to come as we debate further.

~Eric

-----Original Message-----
From: ActiveDir-owner@xxxxxxxxxxxxxxxxxx
[mailto:ActiveDir-owner@xxxxxxxxxxxxxxxxxx] On Behalf Of Brett Shirley
Sent: Tuesday, December 06, 2005 12:04 PM
To: ActiveDir@xxxxxxxxxxxxxxxxxx
Subject: RE: [ActiveDir] Ntds.dit file corruption

I wouldn't say that, joe ...

Let's take another hypothetical real quick: let's say you have a column
for the RDN of an AD object (well, we do) and that value is NULL. From
AD's perspective this object is, well, not really an object; it would be
corrupt, and might even crash lsass.exe (I don't know, it might).

However, from ESE's perspective the table/row/column is valid;
it has a particular column that doesn't have a value. A column which, I
might add, is declared "optional" (real term is tagged) in the ESE layer
"schema" (real term is catalog). ESE is simply a store of data; it passes
no judgement on the data as long as it fits the schema guidelines for the
column.
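
[Illustrative sketch, not part of the original post. It models the hypothetical above as two validation layers: a storage-layer check that only enforces the column catalog, and a directory-layer check that adds the rule "every object needs an RDN". The Column type, the two-column CATALOG and both function names are made up for the example and are not the real ESE or AD structures.]

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    py_type: type
    tagged: bool    # tagged = optional as far as the storage layer is concerned


# Hypothetical two-column catalog; the real table has many more columns.
CATALOG = [
    Column("DNT", int, tagged=False),
    Column("RDN", str, tagged=True),
]


def storage_layer_valid(row):
    """Storage layer: only checks the row against the catalog."""
    for col in CATALOG:
        value = row.get(col.name)
        if value is None:
            if not col.tagged:
                return False        # a required column is missing
        elif not isinstance(value, col.py_type):
            return False            # wrong type for this column
    return True


def directory_layer_valid(row):
    """Directory layer: the storage rules plus 'an object must have an RDN'."""
    return storage_layer_valid(row) and bool(row.get("RDN"))


row = {"DNT": 4711, "RDN": None}
print(storage_layer_valid(row))     # True: the optional column is simply empty
print(directory_layer_valid(row))   # False: no RDN, so not a usable object
```

The same row is "valid" or "corrupt" depending on which layer is asked, which is the distinction the question below is probing.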

Joe, is the DB corrupt? An AD object without an RDN?

----

I have a tendency to think in layers and sources of corruption:
App Logical Layer
AD Logical Layer
ESE Logical Layer
[ESE] Physical Layer

Corruptions coming top down through that stack are protected by the
schema configuration/constraints of that layer (as joe astutely pointed
out).

Corruptions coming bottom up, from disk sub-system hardware, are
protected by whatever mechanisms those layers have.

----

Dropping back to the above hypothetical: as an ESE dev I can say to the
AD devs that until they can prove that ESE actually lost their column,
it's most likely some sort of AD transactional problem, and the
source is an AD bug. If I am feeling unbusy I will debug at the AD
logical layer, because I know what it's supposed to look like.

----

Coming back to the original issue of replicating _this kind_ of
corruption: a normal corruption coming bottom up (where the bits we
(ESE) sent down to the disk subsystem were not the exact bits we got back
later from the subsystem) is almost always detected, because
ESE checksums _every byte_ of its database pages ... and at this point
everyone should be very thankful Win2k3 AD isn't on SQL 2000, because it
has few such protections, though SQL 2005 finally caught up, 10 years
after the fact, it's such a legacy DB, really ... anyway.

When the corruption comes up from the bottom, what happens is ESE
detects the data is not checksumming, logs an event, returns a -1018
error (in this case), and starts rejecting DB operations (such as
JetSeek() / JetRetrieveColumn()) that involve that corrupt database page.
AD then responds to these failed DB ops in kind: AD can't authenticate a
user, AD can't return the results of a search, or AD can't read or apply
data during replication (those 3 at least probably being the most
common). In short, the system starts limping, without affecting the rest
of the distributed system.
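
[Illustrative sketch, not part of the original post. It shows the general pattern described above: store a checksum with each page, recompute it on read, and refuse the read on mismatch. zlib.crc32 is only a stand-in, since ESE's actual checksum format is not described in this thread; write_page, read_page, flip_bit and the page size are likewise assumptions made for the example. The -1018 value is the error code named in the post.]

```python
import zlib

PAGE_SIZE = 8192                    # assumed page size, for illustration only
JET_errReadVerifyFailure = -1018    # error code mentioned in the post


class PageChecksumError(Exception):
    """Raised instead of returning data that no longer matches its checksum."""

    def __init__(self, page_no):
        super().__init__(f"page {page_no}: checksum mismatch")
        self.code = JET_errReadVerifyFailure


def write_page(disk, page_no, payload):
    """Store the payload together with a checksum over every byte of it."""
    disk[page_no] = (zlib.crc32(payload), payload)


def read_page(disk, page_no):
    """Return the payload, or reject the read if the bytes have changed."""
    stored_sum, payload = disk[page_no]
    if zlib.crc32(payload) != stored_sum:
        raise PageChecksumError(page_no)
    return payload


def flip_bit(data, bit_index):
    """Simulate a hardware fault by flipping a single bit."""
    damaged = bytearray(data)
    damaged[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(damaged)


disk = {}
write_page(disk, 7, b"cn=jdoe" + bytes(PAGE_SIZE - 7))

# The disk subsystem hands back different bits than were written.
stored_sum, payload = disk[7]
disk[7] = (stored_sum, flip_bit(payload, 3))

try:
    read_page(disk, 7)
except PageChecksumError as err:
    print(err.code)                 # the caller sees an error, not bad data
```

Because the caller gets an error rather than the damaged bytes, nothing built on that page (a search result, a replication packet) can carry the damage onward.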

----

Coming back to Jose's worry of old hardware injecting bad data into the
distributed system: fortunately, when the disk subsystem goes bad, ESE
does a pretty good job of protecting you. But there are other sources of
corruption besides the disk; an especially insidious one is the bit
flip in memory (and yes, I see these too), which injects itself in the
middle of the above stack. This kind of corruption can both end up
making its way down to the disk subsystem (with a valid ESE checksum),
and up and out to the distributed system.
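
[Illustrative sketch, not part of the original post. It contrasts the two fault locations: a bit flipped after the checksum was computed is caught, while a bit flipped in memory before the checksum was computed is written out with a perfectly valid checksum. zlib.crc32 stands in for the real ESE checksum, and commit, verify and flip_bit are hypothetical helper names.]

```python
import zlib


def flip_bit(data, bit_index):
    """Simulate a single-bit hardware fault."""
    damaged = bytearray(data)
    damaged[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(damaged)


def commit(payload):
    """Checksum whatever bytes are in memory right now, then 'write' them."""
    return (zlib.crc32(payload), payload)


def verify(stored):
    """Recompute the checksum on read and compare with the stored one."""
    checksum, payload = stored
    return zlib.crc32(payload) == checksum


good = b"telephoneNumber=555-0100"

# Fault in the disk subsystem: the bytes change AFTER the checksum was taken.
checksum, payload = commit(good)
disk_fault = (checksum, flip_bit(payload, 5))

# Fault in memory: the bytes change BEFORE the checksum is taken.
memory_fault = commit(flip_bit(good, 5))

print(verify(disk_fault))       # False: detected, the read is rejected
print(verify(memory_fault))     # True: bad data with a valid checksum
```

In the second case the storage layer has no way to tell the data is wrong, so it can be read back and replicated like any other value.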

From the perspective of older hardware though, I would _hypothesize_
that if you're going to have something go bad over time, the disk or the
memory, keep in mind the disk is the only part of the computer that has a
moving part. I would expect disks to go bad first.

----

I would generally not call USN rollback a corruption either, but I think
Dean makes a fair and quasi-valid point that if you consider the
distributed system, yes such a thing is a corruption. Feel free to shim
in an "AD Distributed System Logical Layer" in the above stack, between
AD Logical Layer and App Logical Layer. I'm waffling on this point
though, as something smells different than other types of corruption.
I'm going to think about that for a long time ... in fact Eric (yes, the
~Eric) is at my door and says he would consider it corruption, so there
is a long debate in my future as well ...

From a storage developer's perspective, what someone usually calls
corruption is when the data layer they own or lower returns the wrong
result.

From a non-storage developer's perspective, what someone usually calls
corruption is when the data layer below them returns the wrong result.

----

I'll wax more philosophically on it later ....

Cheers,
BrettSh

On Tue, 6 Dec 2005, Dean Wells wrote:

> Great topic and, IMO, great answer ... I've only a few comments in
> addition to Joe's reply (inline).
> --
> Dean Wells
> MSEtechnology
> * Email: dwells @msetechnology.com
> http://msetechnology.com
>
>
>
> _____
>
> From: ActiveDir-owner@xxxxxxxxxxxxxxxxxx
> [mailto:ActiveDir-owner@xxxxxxxxxxxxxxxxxx] On Behalf Of joe
> Sent: Tuesday, December 06, 2005 8:56 AM
> To: ActiveDir@xxxxxxxxxxxxxxxxxx
> Subject: RE: [ActiveDir] Ntds.dit file corruption
>
>
> I may get into trouble with this post as Brett/Eric/Dean/Steve correct
> me... But that will be good.
>
> [DAW]
> I'm fairly certain Brett will have something to say on this one (in his
> shoes, I know I would).
> [/DAW]
>
> I will start with trying to differentiate between types of corruption...
> My idea of AD corruption is underlying table corruption. However, some
> people may consider bad (really unexpected) values in AD to be corruption.
> The latter isn't corruption; AD is simply a store of data, it passes no
> judgement on the data as long as it fits the schema guidelines for the
> attribute. If you have the DN of a user in the siteObject attribute, that
> isn't corruption; it isn't good, but it is valid for the schema. Or if you
> have binary data in a unicode string, again, not corruption (a unicode
> string IS binary data).
> That being said, if apps (including parts of AD itself) hit unexpected
> data, you will have some issues; even if it isn't truly "corruption" it
> may as well be in some cases. In fact, table corruption is probably better
> than unexpected data in many cases.
>
> You might be able to argue that a USN rollback is corruption but I still
> don't consider it so. Valid data, just out of step.
>
> [DAW]
> That's an interesting one. If you treat the distributed database as a
> whole, then USN rollback is indeed a form of corruption even though each
> instance may deem itself consistent and intact.
> [/DAW]
>
> Again, corruption to me is in the underlying tables. Since AD doesn't
> replicate the table structures, you can't pass that table corruption
> around. AD would realize that some portion of the database is corrupt,
> which would probably be recognized by ESE saying, "that isn't right" and
> not passing info back up to higher levels, but instead passing an error.
>
> joe
>
>
>
>
> _____
>
> From: ActiveDir-owner@xxxxxxxxxxxxxxxxxx
> [mailto:ActiveDir-owner@xxxxxxxxxxxxxxxxxx] On Behalf Of
> neil.ruston@xxxxxxxxxxxxx
> Sent: Tuesday, December 06, 2005 3:49 AM
> To: ActiveDir@xxxxxxxxxxxxxxxxxx
> Subject: RE: [ActiveDir] Ntds.dit file corruption
>
>
> Is this guaranteed? How can we/you be sure that the system will recognise
> the corruptions and therefore not replicate them? Surely this is akin to
> the new feature added to e2k3 sp1, but which is (sadly) missing from AD(?)
>
> I must be missing a subtle point - please show me the light :)
>
>
> neil
>
> _____
>
> From: ActiveDir-owner@xxxxxxxxxxxxxxxxxx
> [mailto:ActiveDir-owner@xxxxxxxxxxxxxxxxxx] On Behalf Of Steve Linehan
> Sent: 05 December 2005 19:26
> To: ActiveDir@xxxxxxxxxxxxxxxxxx
> Subject: RE: [ActiveDir] Ntds.dit file corruption
>
>
> We do not replicate corruption, so if you have local corruption as noted
> below there is no worry that it would replicate around to other servers in
> the environment.
>
> Thanks,
>
> -Steve
>
> _____
>
> From: ActiveDir-owner@xxxxxxxxxxxxxxxxxx
> [mailto:ActiveDir-owner@xxxxxxxxxxxxxxxxxx] On Behalf Of Phil Renouf
> Sent: Monday, December 05, 2005 1:04 PM
> To: ActiveDir@xxxxxxxxxxxxxxxxxx
> Subject: Re: [ActiveDir] Ntds.dit file corruption
>
>
> Will Read Only DCs take care of this? I don't know much about them yet,
> but it makes sense that if the copy of the dit that a DC has is RO, it
> won't try to replicate that anywhere and would only be the recipient of
> replication. Anyone with more knowledge about how RO DCs will work care
> to comment on that?
>
> Phil
>
>
> On 12/5/05, Medeiros, Jose wrote:
>
> Well at least the corruption occurred on just a single DC. One thing that
> has bugged me about Active Directory is not being able to choose that a DC
> in a remote office should not have the ability to replicate back in a
> large enterprise environment. Since most remote offices only have a few
> people at the location and a DC is usually placed for improved logon and
> authentication time, many companies will either use a very low end server
> or a very old decommissioned one from their production data center (which
> is probably close to the end of its usable life). I am always concerned
> that once the NTDS.DIT file becomes corrupt it will replicate the
> corruption to the other DCs in the forest.
>
> Maybe I am just being a worrywart and this really is not an issue.
>
>
>
> Sincerely,
> Jose Medeiros
> ADP | National Account Services
> ProBusiness Division | Information Services
> 925.737.7967 | 408-449-6621 CELL
>
>
>
>
> -----Original Message-----
> From: ActiveDir-owner@xxxxxxxxxxxxxxxxxx
> [mailto:ActiveDir-owner@xxxxxxxxxxxxxxxxxx] On Behalf Of Susan Bradley,
> CPA aka Ebitz - SBS Rocks [MVP]
> Sent: Monday, December 05, 2005 8:53 AM
> To: ActiveDir@xxxxxxxxxxxxxxxxxx
> Subject: Re: [ActiveDir] Ntds.dit file corruption
>
>
> I did? :-) I think I still said all I know is what the poster said :-)
>
> I think I need a course in event log reading because even with the logs,
> and the default size of the logs, I still don't see a smoking gun. The
> directory services one is filled with events 'post' blow up.
>
> What is interesting is that it seems to me big server land goes .. oh
> yeah... ntds.dit corruption... and sbsland freaks out. Either we do
> indeed need to ensure we have a secondary DC or we need to park a second
> copy of a system state offsite [say at the vap/var]
>
> Brett Shirley wrote:
> > She replied offline, very likely a single bit flip. Tragedy: they aren't
> > one release later (Longhorn), where this would've probably been
> > non-disruptively handled, logged, and possibly self-healed:
> > http://blogs.technet.com/efleis/archive/2005/01.aspx
> >
> > Anyway, this kind of thing is usually hardware ...
> >
> > While there are much better disk sub-system testers, one that is freely
> > available to any box with Exchange is jetstress. You might give that a
> > try. If you can reproduce the event / error with jetstress I would not
> > use that box in production.
> >
> > If you do reproduce the issue several times (several times is key, as you
> > want a trend before you start playing the variable game), some things
> > you might vary (one at a time):
> >
> > - Try making sure you have the latest driver and motherboard / controller
> > firmware. Then see if you can reproduce.
> >
> > - Try a different RAID configuration, such as RAID1/RAID1+0 if you're on
> > RAID5.
> >
> > - Try swapping out the hard drives, one at a time.
> >
> > - Try adding the jetstress files to the exclude list in the Anti-Virus
> > software. (A low probability; I've never heard of Anti-Virus causing this
> > particular type of error, and I can't imagine the mistake an anti-virus
> > product would have to make to cause this side effect.)
> >
> > - If you can reproduce it several times, you could follow up with Dell.
> > Good luck.
> >
> > I'm not sure if I answered your question ...
> >
> > Cheers,
> > BrettSh
> >
> >
> > On Sun, 4 Dec 2005, Eric Fleischman wrote:
> >
> >
> >> Going back to the original post, I'm not sure I fully understand the
> >> problem yet. Susan, can you define "ntds.dit file corruption" for us?
> >> What sort of corruption? What errors/events lead you to believe this?
> >> Specifically, I'm interested in errors from NTDS ISAM or ESE if you
> >> have any.
> >>
> >>
> >>
> >> ________________________________
> >>
> >> From: ActiveDir-owner@xxxxxxxxxxxxxxxxxx on behalf of Susan Bradley, CPA
> >> aka Ebitz - SBS Rocks [MVP]
> >> Sent: Sat 12/3/2005 10:58 PM
> >> To: ActiveDir@xxxxxxxxxxxxxxxxxx
> >> Subject: [ActiveDir] Ntds.dit file corruption
> >>
> >>
> >>
> >> SBS box [with Windows 2003 sp1 since September]
> >>
> >> RE: [ActiveDir] Database Corruption:
> >> http://www.mail-archive.com/activedir@xxxxxxxxxxxxxxxxxx/msg32676.html
> >>
> >> We have a SBS 2003 sp1 box with a corrupt ntds.dit that the Consultant
> >> and PSS have been banging on. Could not get the services back running,
> >> changed the RPC service to local system and some service came back up [I
> >> don't have all the details but the consultant opened a support case of
> >> SRX051202605433].
> >>
> >> Bottom line, they are about to give up and start a restore, but
> >> before they do that I'd like to get the view of the AD gods and
> >> goddesses around here. From all that I've seen, read, seen in the SBS
> >> newsgroup, the corruption of ntds.dit is rare to nil and an underlying
> >> cause is hardware issues [raid, disk subsystem]. This doesn't just
> >> happen.
> >>
> >> The VAP asked if not properly excluding the ad databases from the a/v
> >> would cause this/trigger this and my expectation is 'no', given that I
> >> doubt the majority of us in SBSland properly set up exclusions.
> >> Virus scanning recommendations on a Windows 2000 or on a Windows Server
> >> 2003 domain controller:
> >> http://support.microsoft.com/default.aspx?scid=kb;en-us;822158
> >>
> >> If this were my hardware and box, I'd be putting this sucker on the
> >> operating table and getting an autopsy before putting it back online.
> >>
> >> Are we right in being paranoid now about this hardware? For you guys in
> >> big server land you'd just slide over another box into that server role.
> >>
> >> ---------------------------------------
> >> Stupid question alert....
> >>
> >> Okay so we know that having a secondary/additional domain controller is
> >> a good thing even in SBSland...but question.... many times the second
> >> server in SBSland is a terminal server box because we do not support TS
> >> in app mode on our PDCs. So we've established that having a domain
> >> controller and a terminal server is a security issue [see Windows
> >> Security resource kit, NIST Terminal services hardening guide, etc
> >> etc....] If our second server is a member server handing out TS
> >> externally, should that be a candidate for the additional DC? Are the
> >> issues of TS on a DC ... true for 'any' DC? Would it be better then to
> >> Vserver/VPC a Win2k3 inside a workstation in the network if a third
> >> server box was not feasible?
> >>
> >> List info : http://www.activedir.org/List.aspx
> >> List FAQ : http://www.activedir.org/ListFAQ.aspx
> >> List archive: http://www.mail-archive.com/activedir%40mail.activedir.org/
>
> >>
> >>
> >>
> >>
> >
> >
> >
>
> --
> Letting your vendors set your risk analysis these days?
> http://www.threatcode.com
>
>
>
>
> PLEASE READ: The information contained in this email is confidential and
> intended for the named recipient(s) only. If you are not an intended
> recipient of this email please notify the sender immediately and delete your
> copy from your system. You must not copy, distribute or take any further
> action in reliance on it. Email is not a secure method of communication and
> Nomura International plc ('NIplc') will not, to the extent permitted by law,
> accept responsibility or liability for (a) the accuracy or completeness of,
> or (b) the presence of any virus, worm or similar malicious or disabling
> code in, this message or any attachment(s) to it. If verification of this
> email is sought then please request a hard copy. Unless otherwise stated
> this email: (1) is not, and should not be treated or relied upon as,
> investment research; (2) contains views or opinions that are solely those of
> the author and do not necessarily represent those of NIplc; (3) is intended
> for informational purposes only and is not a recommendation, solicitation or
> offer to buy or sell securities or related financial instruments. NIplc
> does not provide investment services to private customers. Authorised and
> regulated by the Financial Services Authority. Registered in England
> no. 1550505 VAT No. 447 2492 35. Registered Office: 1 St Martin's-le-Grand,
> London, EC1A 4NP. A member of the Nomura group of companies.
>

sbradcpa

12/08/2005 9:29 AM  
I thought it was Windows 2003 sp1 that had additional database
correction stuff?
neil.ruston@xxxxxxxxxxxxx wrote:
Maybe I should flip the question around a little...
What are the changes made in exch2k3 sp1 (involving ECC corrections), and
why were they deemed necessary, given what I have read from
joe/Brett/Eric et al.?
The changes appear to be superfluous. We do not appear to need such a
(further) check re: AD/ESE(?)

What am I missing guys?

neil
PS Great thread so far :)

-----Original Message-----
From: ActiveDir-owner@xxxxxxxxxxxxxxxxxx
[mailto:ActiveDir-owner@xxxxxxxxxxxxxxxxxx] On Behalf Of joe
Sent: 07 December 2005 01:55
To: ActiveDir@xxxxxxxxxxxxxxxxxx
Subject: RE: [ActiveDir] Ntds.dit file corruption

Good post ~Eric, thanks for chiming in.
I see where you are coming from with the corruption at the distributed
level. In terms of corruption at that level I see it as corruption but
just can't get myself to see it as AD corruption. I am not sure if I can
put it down in words why. I just don't. :)

joe

-----Original Message-----
From: ActiveDir-owner@xxxxxxxxxxxxxxxxxx
[mailto:ActiveDir-owner@xxxxxxxxxxxxxxxxxx] On Behalf Of Eric Fleischman
Sent: Tuesday, December 06, 2005 5:42 PM
To: ActiveDir@xxxxxxxxxxxxxxxxxx
Subject: RE: [ActiveDir] Ntds.dit file corruption

> I would generally not call USN rollback a corruption either, but I think
> Dean makes a fair and quasi-valid point that if you consider the
> distributed system, yes such a thing is a corruption. Feel free to shim
> in an "AD Distributed System Logical Layer" in the above stack, between
> AD Logical Layer and App Logical Layer. I'm waffling on this point
> though, as something smells different than other types of corruption.
> I'm going to think about that for a long time ... in fact Eric (yes, the
> ~Eric) is at my door and says he would consider it corruption, so there
> is a long debate in my future as well ...

Over lunch, Brett and I discussed this some more. My contention is that
USN rollback would be a form of corruption under a somewhat broad
definition.
The reality is that there is a layer that Brett mentioned which actually
has two parts when looked at from a high level. Namely, this layer:

AD Logical Layer


The first piece could be thought of as local logical layer. That is,
data hierarchy, conforming to the code assumptions of how it should be,
data conforming to the schema as defined, etc. This is a layer of data
that clearly need be proper (leaving the definition of proper to another
day), else we are in some sort of corrupt state. Brett and I both agree
on this I'm pretty sure.

However, there is then distributed systems corruption. In AD, one of the
services we aim to provide is convergence. If we do not converge, we
define this divergence as at a minimum "bad", perhaps "corrupt."
USN rollback breaks our convergence guarantees, it breaks replication
such that you will not attain convergence in the system. I would as such
consider it a form of corruption.

Over Teriyaki a few minutes ago, Brett posited the question "well if USN
rollback is corruption, what else?" Valid question. I would concede that
if USN rollback is considered distributed systems corruption, so too
would be other conditions which yield divergence. Perhaps this is a
slippery slope that goes too far. I need to think about this some more.

I would also toss out there that corruption should not be confused with
"forever broken." There are many states in which the directory can exist
where it is functional, but in some way broken. Such divergences can
typically be repaired with administrative action, so long as it is a
savvy administrator. :) If we are willing to assume that divergence is
corruption, I'd tend to believe that most people on this list have
recovered from some form of corruption before. The worse the corruption,
the more help you likely want to recover from it. :)

Anyway, we'll likely debate this for a few months, as we usually do on
such points. More thoughts to come as we debate further.

~Eric

-----Original Message-----
From: ActiveDir-owner@xxxxxxxxxxxxxxxxxx
[mailto:ActiveDir-owner@xxxxxxxxxxxxxxxxxx] On Behalf Of Brett Shirley
Sent: Tuesday, December 06, 2005 12:04 PM
To: ActiveDir@xxxxxxxxxxxxxxxxxx
Subject: RE: [ActiveDir] Ntds.dit file corruption

I wouldn't say that, joe ...

Let's take another hypothetical real quick: let's say you have a column
for the RDN of an AD object (well, we do) and that value is NULL. From
AD's perspective this object is, well, not really an object; it would be
corrupt, and might even crash lsass.exe (I don't know, it might).

However, from ESE's perspective the table/row/column is valid;
it has a particular column that doesn't have a value. A column which, I
might add, is declared "optional" (real term is tagged) in the ESE layer
"schema" (real term is catalog). ESE is simply a store of data; it passes
no judgement on the data as long as it fits the schema guidelines for the
column.

Joe, is the DB corrupt? An AD object without an RDN?

----

I have a tendency to think in layers and sources of corruption:
App Logical Layer
AD Logical Layer
ESE Logical Layer
[ESE] Physical Layer

Corruptions coming top down through that stack are protected by the
schema configuration/constraints of that layer (as joe astutely pointed
out).

Corruptions coming bottom up, from disk sub-system hardware, are
protected by whatever mechanisms those layers have.

----

Dropping back to the above hypothetical: as an ESE dev I can say to the
AD devs that until they can prove that ESE actually lost their column,
it's most likely some sort of AD transactional problem, and the
source is an AD bug. If I am feeling unbusy I will debug at the AD
logical layer, because I know what it's supposed to look like.

----

Coming back to the original issue of replicating _this kind_ of
corruption: a normal corruption coming bottom up (where the bits we
(ESE) sent down to the disk subsystem were not the exact bits we got back
later from the subsystem) is almost always detected, because
ESE checksums _every byte_ of its database pages ... and at this point
everyone should be very thankful Win2k3 AD isn't on SQL 2000, because it
has few such protections, though SQL 2005 finally caught up, 10 years
after the fact, it's such a legacy DB, really ... anyway.

When the corruption comes up from the bottom, what happens is ESE
detects the data is not checksumming, logs an event, returns a -1018
error (in this case), and starts rejecting DB operations (such as
JetSeek() / JetRetrieveColumn()) that involve that corrupt database page.
AD then responds to these failed DB ops in kind: AD can't authenticate a
user, AD can't return the results of a search, or AD can't read or apply
data during replication (those 3 at least probably being the most
common). In short, the system starts limping, without affecting the rest
of the distributed system.

----

Coming back to Jose's worry of old hardware injecting bad data into the
distributed system: fortunately, when the disk subsystem goes bad, ESE
does a pretty good job of protecting you. But there are other sources of
corruption besides the disk; an especially insidious one is the bit
flip in memory (and yes, I see these too), which injects itself in the
middle of the above stack. This kind of corruption can both end up
making its way down to the disk subsystem (with a valid ESE checksum),
and up and out to the distributed system.

From the perspective of older hardware though, I would _hypothesize_
that if you're going to have something go bad over time, the disk or the
memory, keep in mind the disk is the only part of the computer that has a
moving part. I would expect disks to go bad first.

----

I would generally not call USN rollback a corruption either, but I think
Dean makes a fair and quasi-valid point that if you consider the
distributed system, yes such a thing is a corruption. Feel free to shim
in an "AD Distributed System Logical Layer" in the above stack, between
AD Logical Layer and App Logical Layer. I'm waffling on this point
though, as something smells different than other types of corruption.
I'm going to think about that for a long time ... in fact Eric (yes, the
~Eric) is at my door and says he would consider it corruption, so there
is a long debate in my future as well ...

From a storage developer's perspective, what someone usually calls
corruption is when the data layer they own or lower returns the wrong
result.

From a non-storage developer's perspective, what someone usually calls
corruption is when the data layer below them returns the wrong result.

----

I'll wax more philosophical on it later ....

Cheers,
BrettSh

On Tue, 6 Dec 2005, Dean Wells wrote:

Great topic and, IMO, great answer ... I've only a few comments in
addition to Joe's reply (inline).
--
Dean Wells
MSEtechnology
* Email: dwells @msetechnology.com
http://msetechnology.com


_____

From: ActiveDir-owner@xxxxxxxxxxxxxxxxxx
[mailto:ActiveDir-owner@xxxxxxxxxxxxxxxxxx] On Behalf Of joe
Sent: Tuesday, December 06, 2005 8:56 AM
To: ActiveDir@xxxxxxxxxxxxxxxxxx
Subject: RE: [ActiveDir] Ntds.dit file corruption
I may get into trouble with this post as Brett/Eric/Dean/Steve correct
me... But that will be good.
[DAW]
I'm fairly certain Brett will have something to say on this one (in his
shoes, I know I would).
[/DAW]

I will start with trying to differentiate between types of corruption...
My idea of AD corruption is underlying table corruption. However some
people may consider bad (really unexpected) values in AD to be
corruption. The latter isn't corruption; AD is simply a store of data, it
passes no judgement on the data as long as it fits the schema guidelines
for the attribute. If you have the DN of a user in the siteObject
attribute, that isn't corruption; it isn't good, but it is valid for the
schema. Or if you have binary data in a unicode string, again, not
corruption (a unicode string IS binary data).

That being said, if apps (including parts of AD itself) hit unexpected
data, you will have some issues; even if it isn't truly "corruption" it
may as well be in some cases. In fact, table corruption is probably
better than unexpected data in many cases.
You might be able to argue that a USN rollback is corruption but I still
don't consider it so. Valid data, just out of step.
[DAW]
That's an interesting one. If you treat the distributed database as a
whole, then USN rollback is indeed a form of corruption even though each
instance may deem itself consistent and intact.
[/DAW]
Again corruption to me is in the underlying tables. Since AD doesn't
replicate the table structures, you can't pass that table corruption
around. AD would realize that some portion of the database is corrupt,
probably because ESE says "that isn't right" and, instead of passing
info back up to higher levels, passes an error.
joe


_____

From: ActiveDir-owner@xxxxxxxxxxxxxxxxxx
[mailto:ActiveDir-owner@xxxxxxxxxxxxxxxxxx] On Behalf Of
neil.ruston@xxxxxxxxxxxxx

Sent: Tuesday, December 06, 2005 3:49 AM
To: ActiveDir@xxxxxxxxxxxxxxxxxx
Subject: RE: [ActiveDir] Ntds.dit file corruption
Is this guaranteed? How can we/you be sure that the system will recognise
the corruptions and therefore not replicate them? Surely this is akin to
the new feature added to e2k3 sp1, but which is (sadly) missing from AD(?)

I must be missing a subtle point - please show me the light :)
neil

_____

From: ActiveDir-owner@xxxxxxxxxxxxxxxxxx
[mailto:ActiveDir-owner@xxxxxxxxxxxxxxxxxx] On Behalf Of Steve Linehan
Sent: 05 December 2005 19:26
To: ActiveDir@xxxxxxxxxxxxxxxxxx
Subject: RE: [ActiveDir] Ntds.dit file corruption
We do not replicate corruption so if you have local corruption as noted
below there is no worry that it would replicate around to other servers
in the environment.

Thanks,

-Steve

_____

From: ActiveDir-owner@xxxxxxxxxxxxxxxxxx
[mailto:ActiveDir-owner@xxxxxxxxxxxxxxxxxx] On Behalf Of Phil Renouf
Sent: Monday, December 05, 2005 1:04 PM
To: ActiveDir@xxxxxxxxxxxxxxxxxx
Subject: Re: [ActiveDir] Ntds.dit file corruption
Will Read Only DC's take care of this? I don't know much about them yet,
but it makes sense that if the copy of the dit that a DC has is RO that
it won't try to replicate that anywhere and would only be the recipient
of replication. Anyone with more knowledge about how RO DC's will work
care to comment on that?

Phil
On 12/5/05, Medeiros, Jose wrote:
Well at least the corruption occurred on just a single DC. One thing that
has bugged me about Active Directory is not being able to select if you
want a DC in a remote office to not have the ability to replicate back
in a large enterprise environment. Since most remote offices only have a
few people at the location and a DC is usually placed for improved logon
and authentication time, many companies will either use a very low end
server or a very old decommissioned one from their production data
center (which is probably close to the end of its useable life). I am
always concerned that once the NTDS.DIT file becomes corrupt it will
replicate the corruption to the other DC's in the forest.

Maybe I am just being a worry wart and this really is not an issue.

Sincerely,
Jose Medeiros
ADP | National Account Services
ProBusiness Division | Information Services
925.737.7967 | 408-449-6621 CELL


-----Original Message-----
From: ActiveDir-owner@xxxxxxxxxxxxxxxxxx
[mailto:ActiveDir-owner@xxxxxxxxxxxxxxxxxx] On Behalf Of Susan Bradley,
CPA aka Ebitz - SBS Rocks [MVP]
Sent: Monday, December 05, 2005 8:53 AM
To: ActiveDir@xxxxxxxxxxxxxxxxxx
Subject: Re: [ActiveDir] Ntds.dit file corruption
I did? :-) I think I still said all I know is what the poster said :-)

I think I need a course in event log reading because even with the logs,
and the default size of the logs, I still don't see a smoking gun. The
directory services one is filled with events 'post' blow up.

What is interesting is that it seems to me big server land goes .. oh
yeah... ntds.dit corruption... and sbsland freaks out. Either we do
indeed need to ensure we have a secondary DC or we need to park a second
copy of a system state offsite [say at the vap/var]

Brett Shirley wrote:

She replied offline, very likely a single bit flip; tragedy, they aren't
one release later (Longhorn), where this would've probably been
non-disruptively handled, logged, and possibly self-healed:

http://blogs.technet.com/efleis/archive/2005/01.aspx

Anyway, this kind of thing is usually hardware ...

While there are much better disk sub-system testers, one that is freely
available to any box with Exchange is jetstress. You might give that a
try. If you can reproduce the event / error with jetstress I would not
use that box in production.

If you do reproduce the issue several times (several times is key, as you
want a trend before you start playing the variable game), some things
you might vary (one at a time):

- Try making sure you have the latest driver and motherboard /
  controller firmware. Then see if you can reproduce.
- Try a different RAID configuration, such as RAID1/RAID1+0 if you're on
  RAID5.
- Try swapping out the hard drives, one at a time.
- Try adding the jetstress files to the exclude list in the Anti-Virus
  software. (A low probability: I've never heard of Anti-Virus causing
  this particular type of error, and I can't imagine the mistake an
  anti-virus product would have to make to cause this side effect.)
- If you can reproduce it several times, you could follow up with Dell.

Good luck.

I'm not sure if I answered your question ...

Cheers,
BrettSh
On Sun, 4 Dec 2005, Eric Fleischman wrote:

Going back to the original post, I'm not sure I fully understand the
problem yet. Susan, can you define "ntds.dit file corruption" for us?
What sort of corruption? What errors/events lead you to believe this?
Specifically, I'm interested in errors from NTDS ISAM or ESE if you
have any.

________________________________

From: ActiveDir-owner@xxxxxxxxxxxxxxxxxx on behalf of Susan Bradley, CPA
aka Ebitz - SBS Rocks [MVP]

Sent: Sat 12/3/2005 10:58 PM
To: ActiveDir@xxxxxxxxxxxxxxxxxx
Subject: [ActiveDir] Ntds.dit file corruption

SBS box [with Windows 2003 sp1 since September]

RE: [ActiveDir] Database Corruption:
http://www.mail-archive.com/activedir@xxxxxxxxxxxxxxxxxx/msg32676.html

We have a SBS 2003 sp1 box with a corrupt ntds.dit that the Consultant
and PSS have been banging on. Could not get the services back running,
changed the RPC service to local system and some service came back up [I
don't have all the details but the consultant opened a support case of
SRX051202605433].

Bottom line they are about to give up and start a restore but
before they do that I'd like to get the view of the AD gods and
goddesses around here. From all that I've seen, read, seen in the SBS
newsgroup, the corruption of ntds.dit is rare to nil and an underlying
cause is hardware issues [raid, disk subsystem]. This doesn't just
happen.

The VAP asked if not properly excluding the ad databases from the a/v
would cause this/trigger this and my expectation is 'no', given that I
doubt the majority of us in SBSland properly set up exclusions.

Virus scanning recommendations on a Windows 2000 or on a Windows Server
2003 domain controller:
http://support.microsoft.com/default.aspx?scid=kb;en-us;822158

If this were my hardware and box, I'd be putting this sucker on the
operating table and getting an autopsy before putting it back online.

Are we right in being paranoid now about this hardware? For you guys in
big server land you'd just slide over another box into that server role.

---------------------------------------
Stupid question alert....

Okay so we know that having a secondary/additional domain controller is
a good thing even in SBSland...but question.... many times the second
server in SBSland is a terminal server box because we do not support TS
in app mode on our PDCs. So we've established that having a domain
controller and a terminal server is a security issue [see Windows
Security resource kit, NIST Terminal services hardening guide, etc
etc....] If our second server is a member server handing out TS
externally, should that be a candidate for the additional DC? Are the
issues of TS on a DC ... true for 'any' DC? Would it be better then to
Vserver/VPC a Win2k3 inside a workstation in the network if a third
server box was not feasible?

List info : http://www.activedir.org/List.aspx
List FAQ : http://www.activedir.org/ListFAQ.aspx
List archive: http://www.mail-archive.com/activedir%40mail.activedir.org/









--
Letting your vendors set your risk analysis these days?
http://www.threatcode.com


PLEASE READ: The information contained in this email is confidential and
intended for the named recipient(s) only. If you are not an intended
recipient of this email please notify the sender immediately and delete your
copy from your system. You must not copy, distribute or take any further
action in reliance on it. Email is not a secure method of communication and
Nomura International plc ('NIplc') will not, to the extent permitted by law,
accept responsibility or liability for (a) the accuracy or completeness of,
or (b) the presence of any virus, worm or similar malicious or disabling
code in, this message or any attachment(s) to it. If verification of this
email is sought then please request a hard copy. Unless otherwise stated
this email: (1) is not, and should not be treated or relied upon as,
investment research; (2) contains views or opinions that are solely those of
the author and do not necessarily represent those of NIplc; (3) is intended
for informational purposes only and is not a recommendation, solicitation or
offer to buy or sell securities or related financial instruments. NIplc
does not provide investment services to private customers. Authorised and
regulated by the Financial Services Authority. Registered in England
no. 1550505 VAT No. 447 2492 35. Registered Office: 1 St Martin's-le-Grand,
London, EC1A 4NP. A member of the Nomura group of companies.

AD000001290User is Offline

Posts:0

12/08/2005 9:42 AM  
I was referring to this KB:
http://support.microsoft.com/?kbid=867626
neil
-----Original Message-----
From: ActiveDir-owner@xxxxxxxxxxxxxxxxxx
[mailto:ActiveDir-owner@xxxxxxxxxxxxxxxxxx] On Behalf Of Susan Bradley,
CPA aka Ebitz - SBS Rocks [MVP]
Sent: 08 December 2005 09:28
To: ActiveDir@xxxxxxxxxxxxxxxxxx
Subject: Re: [ActiveDir] Ntds.dit file corruption

I thought it was Windows 2003 sp1 that had additional database
correction stuff?

neil.ruston@xxxxxxxxxxxxx wrote:

>Maybe I should flip the question around a little...
>
>
>What are the changes made in exch2k3 sp1 (involving ECC corrections)
>and why were they deemed necessary, given what I have read from
>joe/Brett/Eric et al)??
>
>The changes appear to be superfluous. We do not appear to need such a
>(further) check re: AD/ESE(?)
>
>What am I missing guys?
>
>neil
>PS Great thread so far :)
>
>-----Original Message-----
>From: ActiveDir-owner@xxxxxxxxxxxxxxxxxx
>[mailto:ActiveDir-owner@xxxxxxxxxxxxxxxxxx] On Behalf Of joe
>Sent: 07 December 2005 01:55
>To: ActiveDir@xxxxxxxxxxxxxxxxxx
>Subject: RE: [ActiveDir] Ntds.dit file corruption
>
>Good post ~Eric, thanks for chiming in.
>
>I see where you are coming from with the corruption at the distributed
>level. In terms of corruption at that level I see it as corruption but
>just can't get myself to see it as AD corruption. I am not sure if I
>can put it down in words why. I just don't. :)
>
> joe
>
>-----Original Message-----
>From: ActiveDir-owner@xxxxxxxxxxxxxxxxxx
>[mailto:ActiveDir-owner@xxxxxxxxxxxxxxxxxx] On Behalf Of Eric
>Fleischman
>Sent: Tuesday, December 06, 2005 5:42 PM
>To: ActiveDir@xxxxxxxxxxxxxxxxxx
>Subject: RE: [ActiveDir] Ntds.dit file corruption
>
>
>I would generally not call USN rollback a corruption either, but I
>think Dean makes a fair and quasi-valid point that if you consider the
>distributed system, yes such a thing is a corruption. Feel free to
>shim in an "AD Distributed System Logical Layer" in the above stack,
>between AD Logical Layer and App Logical Layer. I'm waffling on this
>point though, as something smells different than other types of
>corruption.
>I'm going to think about that for a long time ... in fact Eric (yes, the
>~Eric) is at my door and says he would consider it corruption, so there
>is a long debate in my future as well ...
>
>
>Over lunch, Brett and I discussed this some more. My contention is that
>USN rollback would be a form of corruption under a somewhat broad
>definition.
>The reality is that there is a layer that Brett mentioned which
>actually has two parts when looked at from a high level. Namely, this
>layer:
>
>>AD Logical Layer
>
>The first piece could be thought of as the local logical layer. That is,
>data hierarchy, conforming to the code assumptions of how it should be,
>data conforming to the schema as defined, etc. This is a layer of data
>that clearly needs to be proper (leaving the definition of proper to
>another day), else we are in some sort of corrupt state. Brett and I
>both agree on this I'm pretty sure.
>
>However, there is then distributed systems corruption. In AD, one of
>the services we aim to provide is convergence. If we do not converge,
>we define this divergence as at a minimum "bad", perhaps "corrupt."
>USN rollback breaks our convergence guarantees, it breaks replication
>such that you will not attain convergence in the system. I would as
>such consider it a form of corruption.
>
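A toy model of the divergence described here may help. It is a deliberately simplified sketch, not AD's replication algorithm: each DC numbers its own changes with a USN, a partner remembers the highest USN it has already pulled, and nothing else is modelled (real AD also tracks invocation IDs and up-to-dateness vectors, which are ignored here). The class and method names are hypothetical.

```python
class DC:
    """Hypothetical domain controller with a local USN counter."""

    def __init__(self, name):
        self.name = name
        self.usn = 0
        self.changes = []      # (usn, key, value) for changes originated here
        self.data = {}
        self.watermark = {}    # source name -> highest USN already pulled

    def write(self, key, value):
        self.usn += 1
        self.changes.append((self.usn, key, value))
        self.data[key] = value

    def pull_from(self, source):
        """Apply only the changes numbered above the remembered watermark."""
        seen = self.watermark.get(source.name, 0)
        for usn, key, value in source.changes:
            if usn > seen:
                self.data[key] = value
                seen = usn
        self.watermark[source.name] = seen

    def snapshot(self):
        return (self.usn, list(self.changes), dict(self.data))

    def restore(self, snap):
        # Restoring an old image also rolls the USN counter back.
        self.usn, changes, data = snap
        self.changes = list(changes)
        self.data = dict(data)


dc1, dc2 = DC("DC1"), DC("DC2")
dc1.write("a", 1)
dc1.write("b", 2)
image = dc1.snapshot()      # old disk image taken at USN 2
dc1.write("c", 3)           # USN 3
dc2.pull_from(dc1)          # DC2 now believes it has DC1's changes up to USN 3

dc1.restore(image)          # improper restore: DC1 is back at USN 2
dc1.write("d", 4)           # reuses USN 3 for a different change
dc2.pull_from(dc1)          # DC2 skips it: "already seen USN 3"

print(dc1.data)
print(dc2.data)
```

Both copies are internally consistent, every page would checksum fine, and yet the two DCs no longer agree and will not converge on their own. That is the sense in which the rollback is "corruption" only at the distributed level.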
>Over Teriyaki a few minutes ago, Brett posed the question "well if
>USN rollback is corruption, what else?" Valid question. I would concede
>that if USN rollback is considered distributed systems corruption, so
>too would be other conditions which yield divergence. Perhaps this is a
>slippery slope that goes too far. I need to think about this some more.
>
>I would also toss out there that corruption should not be confused with
>"forever broken." There are many states in which the directory can
>exist where it is functional, but in some way broken. Such divergences
>can typically be repaired with administrative action, so long as it is
>a savvy administrator. :) If we are willing to assume that divergence
>is corruption, I'd tend to believe that most people on this list have
>recovered from some form of corruption before. The worse the
>corruption, the more help you likely want to recover from it. :)
>
>Anyway, we'll likely debate this for a few months, as we usually do on
>such points. More thoughts to come as we debate further.
>
>~Eric
>
>
>
>-----Original Message-----
>From: ActiveDir-owner@xxxxxxxxxxxxxxxxxx
>[mailto:ActiveDir-owner@xxxxxxxxxxxxxxxxxx] On Behalf Of Brett Shirley
>Sent: Tuesday, December 06, 2005 12:04 PM
>To: ActiveDir@xxxxxxxxxxxxxxxxxx
>Subject: RE: [ActiveDir] Ntds.dit file corruption
>
>I wouldn't say that, joe ...
>
>Let's take another hypothetical real quick. Let's say you have a column
>for the RDN of an AD object (well, we do) and that value is NULL. From
>AD's perspective this object is, well, not really an object; it would be
>corrupt, and might even crash lsass.exe (I don't know, it might).
>
>However, from ESE's perspective, the table/row/column is valid;
>it has a particular column that doesn't have a value. A column which, I
>might add, is declared "optional" (real term is tagged) in the ESE layer
>"schema" (real term is catalog). ESE is simply a store of data, it passes
>no judgement on the data as long as it fits the schema guidelines for the
>column.
>
>Joe, is the DB corrupt? An AD object without an RDN?
>
>----
>
>I have a tendency to think in layers and sources of corruption.
> App Logical Layer
> AD Logical Layer
> ESE Logical Layer
> [ESE] Physical Layer
>
>Corruptions coming top down through that stack are protected by the
>schema configuration/constraints of that layer (as joe astutely pointed
>out).
>
>Corruptions coming bottom up, from disk sub-system hardware, are
>protected by whatever mechanisms those layers have.
>
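The NULL RDN hypothetical can be sketched as two independent checks, one per layer. The column names, the catalog layout, and both function names are made up for illustration; neither check reflects actual ESE or AD code. The only thing carried over from the discussion is that the storage layer treats the column as optional while the directory layer does not.

```python
# Hypothetical storage-layer "catalog": which columns exist and which
# of them are allowed to be empty.
ESE_CATALOG = {
    "object_id": {"type": int, "optional": False},
    "rdn": {"type": str, "optional": True},    # "tagged", i.e. may be absent
}


def ese_layer_ok(row: dict) -> bool:
    """Storage-layer view: every value fits its column definition."""
    for column, rule in ESE_CATALOG.items():
        value = row.get(column)
        if value is None:
            if not rule["optional"]:
                return False
        elif not isinstance(value, rule["type"]):
            return False
    return True


def ad_layer_ok(row: dict) -> bool:
    """Directory-layer view: an object without an RDN is not a real object."""
    return bool(row.get("rdn"))


row = {"object_id": 4711, "rdn": None}
print("valid for the storage layer:  ", ese_layer_ok(row))
print("valid for the directory layer:", ad_layer_ok(row))
```

The same row passes one layer's rules and fails the next layer's, so whether it counts as "corrupt" depends on which layer is asking.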
>----
>
>Dropping back to the above hypothetical as an ESE dev I can say to the
>AD devs that until they can prove that ESE actually lost their column,
>that it's most likely some sort of AD transactional problem, and the
>source is an AD bug. If I am feeling unbusy I will debug at the AD
>logical layer, because I know what it's supposed to look like.
>
>----
>
>Coming back to the original issue of replicating _this kind_ of
>corruption: a normal corruption coming bottom up, because the bits we
>(ESE) sent down to the disk subsystem were not the exact bits we got back
>later from the sub-system, is almost always detected by the fact that
>ESE checksums _every byte_ of its database pages ... and at this point
>everyone should be very thankful Win2k3 AD isn't on SQL 2000, because
>it has few such protections, though SQL 2005 finally caught up, 10
>years after the fact, it's such a legacy DB, really ... anyway.
>
>When the corruption comes up from the bottom, what happens is ESE
>detects the data is not checksumming, logs an event, returns a -1018
>error (in this case), and starts rejecting DB operations (such as
>JetSeek() / JetRetrieveColumn()) that involve that corrupt database
>page. AD then responds to these failed DB ops with symptoms such as: AD
>can't authenticate a user, AD can't return the results of a search, or
>AD can't read or apply data during replication (those 3 at least
>probably being the most common). In short the system starts limping,
>without affecting the rest of the distributed system.
>
>----
>
>Coming back to Jose's worry of old hardware injecting bad data into the
>distributed system. Fortunately, when the disk subsystem goes bad, ESE
>does a pretty good job of protecting you, but there are other sources
>of corruption besides the disk. An especially insidious one is the
>bit flip in memory (and yes I see these too), which injects itself in
>the middle of the above stack. This kind of corruption can end up both
>making its way down to the disk subsystem (with a valid ESE checksum)
>and up and out to the distributed system.
>
>From the perspective of older hardware though, I would _hypothesize_
>that if something is going to go bad over time, the disk or the memory,
>keep in mind the disk is the only part of the computer that has a
>moving part. I would expect disks to go bad first.
>
>----
>
>I would generally not call USN rollback a corruption either, but I
>think Dean makes a fair and quasi-valid point that if you consider the
>distributed system, yes such a thing is a corruption. Feel free to
>shim in an "AD Distributed System Logical Layer" in the above stack,
>between AD Logical Layer and App Logical Layer. I'm waffling on this
>point though, as something smells different than other types of
>corruption.
>I'm going to think about that for a long time ... in fact Eric (yes, the
>~Eric) is at my door and says he would consider it corruption, so there
>is a long debate in my future as well ...
>
>From a storage developer's perspective, what someone usually calls
>corruption is when the data layer they own or lower returns the wrong
>result.
>
>From a non-storage developer's perspective, what someone usually calls
>corruption is when the data layer below them returns the wrong result.
>
>----
>
>I'll wax more philosophical on it later ....
>
>Cheers,
>BrettSh
>
>
>
>On Tue, 6 Dec 2005, Dean Wells wrote:
>
>>Great topic and, IMO, great answer ... I've only a few comments in
>>addition to Joe's reply (inline).
>>--
>>Dean Wells
>>MSEtechnology
>>* Email: dwells @msetechnology.com
>> http://msetechnology.com
>>
>>
>>
>> _____
>>
>>From: ActiveDir-owner@xxxxxxxxxxxxxxxxxx
>>[mailto:ActiveDir-owner@xxxxxxxxxxxxxxxxxx] On Behalf Of joe
>>Sent: Tuesday, December 06, 2005 8:56 AM
>>To: ActiveDir@xxxxxxxxxxxxxxxxxx
>>Subject: RE: [ActiveDir] Ntds.dit file corruption
>>
>>
>>I may get into trouble with this post as Brett/Eric/Dean/Steve correct
>>me... But that will be good.
>>
>>[DAW]
>>I'm fairly certain Brett will have something to say on this one (in his
>>shoes, I know I would).
>>[/DAW]
>>
>>I will start with trying to differentiate between types of corruption...
>>My idea of AD corruption is underlying table corruption. However some
>>people may consider bad (really unexpected) values in AD to be
>>corruption. The latter isn't corruption; AD is simply a store of data, it
>>passes no judgement on the data as long as it fits the schema guidelines
>>for the attribute. If you have the DN of a user in the siteObject
>>attribute, that isn't corruption; it isn't good, but it is valid for the
>>schema. Or if you have binary data in a unicode string, again, not
>>corruption (a unicode string IS binary data).
>>
>>That being said, if apps (including parts of AD itself) hit unexpected
>>data, you will have some issues; even if it isn't truly "corruption" it
>>may as well be in some cases. In fact, table corruption is probably
>>better than unexpected data in many cases.
>>
>>You might be able to argue that a USN rollback is corruption but I still
>>don't consider it so. Valid data, just out of step.
>>
>>[DAW]
>>That's an interesting one. If you treat the distributed database as a
>>whole, then USN rollback is indeed a form of corruption even though each
>>instance may deem itself consistent and intact.
>>[/DAW]
>>
>>Again corruption to me is in the underlying tables. Since AD doesn't
>>replicate the table structures, you can't pass that table corruption
>>around. AD would realize that some portion of the database is corrupt,
>>probably because ESE says "that isn't right" and, instead of passing
>>info back up to higher levels, passes an error.
>>
>> joe
>>
>>
>>
>>
>> _____
>>
>>From: ActiveDir-owner@xxxxxxxxxxxxxxxxxx
>>[mailto:ActiveDir-owner@xxxxxxxxxxxxxxxxxx] On Behalf Of
>>neil.ruston@xxxxxxxxxxxxx
>>Sent: Tuesday, December 06, 2005 3:49 AM
>>To: ActiveDir@xxxxxxxxxxxxxxxxxx
>>Subject: RE: [ActiveDir] Ntds.dit file corruption
>>
>>
>>Is this guaranteed? How can we/you be sure that the system will recognise
>>the corruptions and therefore not replicate them? Surely this is akin to
>>the new feature added to e2k3 sp1, but which is (sadly) missing from AD(?)
>>
>>I must be missing a subtle point - please show me the light :)
>>
>>
>>neil
>>
>> _____
>>
>>From: ActiveDir-owner@xxxxxxxxxxxxxxxxxx
>>[mailto:ActiveDir-owner@xxxxxxxxxxxxxxxxxx] On Behalf Of Steve Linehan
>>Sent: 05 December 2005 19:26
>>To: ActiveDir@xxxxxxxxxxxxxxxxxx
>>Subject: RE: [ActiveDir] Ntds.dit file corruption
>>
>>
>>We do not replicate corruption so if you have local corruption as noted
>>below there is no worry that it would replicate around to other servers
>>in the environment.
>>
>>Thanks,
>>
>>-Steve
>>
>> _____
>>
>>From: ActiveDir-owner@xxxxxxxxxxxxxxxxxx
>>[mailto:ActiveDir-owner@xxxxxxxxxxxxxxxxxx] On Behalf Of Phil Renouf
>>Sent: Monday, December 05, 2005 1:04 PM
>>To: ActiveDir@xxxxxxxxxxxxxxxxxx
>>Subject: Re: [ActiveDir] Ntds.dit file corruption
>>
>>
>>Will Read Only DC's take care of this? I don't know much about them yet,
>>but it makes sense that if the copy of the dit that a DC has is RO that
>>it won't try to replicate that anywhere and would only be the recipient
>>of replication. Anyone with more knowledge about how RO DC's will work
>>care to comment on that?
>>
>>Phil
>>
>>
>>On 12/5/05, Medeiros, Jose wrote:
>>
>>Well at least the corruption occurred on just a single DC. One thing that
>>has bugged me about Active Directory is not being able to select if you
>>want a DC in a remote office to not have the ability to replicate back
>>in a large enterprise environment. Since most remote offices only have a
>>few people at the location and a DC is usually placed for improved logon
>>and authentication time, many companies will either use a very low end
>>server or a very old decommissioned one from their production data
>>center (which is probably close to the end of its useable life). I am
>>always concerned that once the NTDS.DIT file becomes corrupt it will
>>replicate the corruption to the other DC's in the forest.
>>
>>Maybe I am just being a worry wart and this really is not an issue.
>>
>>
>>
>>Sincerely,
>>Jose Medeiros
>>ADP | National Account Services
>>ProBusiness Division | Information Services
>>925.737.7967 | 408-449-6621 CELL
>>
>>
>>
>>
>>-----Original Message-----
>>From: ActiveDir-owner@xxxxxxxxxxxxxxxxxx
>>[mailto:ActiveDir-owner@xxxxxxxxxxxxxxxxxx] On Behalf Of Susan Bradley,
>>CPA aka Ebitz - SBS Rocks [MVP]
>>Sent: Monday, December 05, 2005 8:53 AM
>>To: ActiveDir@xxxxxxxxxxxxxxxxxx
>>Subject: Re: [ActiveDir] Ntds.dit file corruption
>>
>>
>>I did? :-) I think I still said all I know is what the poster said :-)
>>
>>I think I need a course in event log reading because even with the logs,
>>and the default size of the logs, I still don't see a smoking gun. The
>>directory services one is filled with events 'post' blow up.
>>
>>What is interesting is that it seems to me big server land goes .. oh
>>yeah... ntds.dit corruption... and sbsland freaks out. Either we do
>>indeed need to ensure we have a secondary DC or we need to park a second
>>copy of a system state offsite [say at the vap/var]
>>
>>Brett Shirley wrote:
>>
>>
>>>She replied offline, very likely a single bit flip; tragedy, they aren't
>>>one release later (Longhorn), where this would've probably been
>>>non-disruptively handled, logged, and possibly self-healed:
>>> http://blogs.technet.com/efleis/archive/2005/01.aspx
>>>
>>>Anyway, this kind of thing is usually hardware ...
>>>
>>>While there are much better disk sub-system testers, one that is freely
>>>available to any box with Exchange is jetstress. You might give that a
>>>try. If you can reproduce the event / error with jetstress I would not
>>>use that box in production.
>>>
>>>If you do reproduce the issue several times (several times is key, as you
>>>want a trend before you start playing the variable game), some things
>>>you might vary (one at a time):
>>>
>>> - Try making sure you have the latest driver and motherboard /
>>>   controller firmware. Then see if you can reproduce.
>>> - Try a different RAID configuration, such as RAID1/RAID1+0 if you're on
>>>   RAID5.
>>> - Try swapping out the hard drives, one at a time.
>>> - Try adding the jetstress files to the exclude list in the Anti-Virus
>>>   software. (A low probability: I've never heard of Anti-Virus causing
>>>   this particular type of error, and I can't imagine the mistake an
>>>   anti-virus product would have to make to cause this side effect.)
>>> - If you can reproduce it several times, you could follow up with Dell.
>>>
>>>Good luck.
>>>
>>>I'm not sure if I answered your question ...
>>>
>>>Cheers,
>>>BrettSh
>>>
>>>
>>>On Sun, 4 Dec 2005, Eric Fleischman wrote:
>>>
>>>>Going back to the original post, I'm not sure I fully understand the
>>>>problem yet. Susan, can you define "ntds.dit file corruption" for us?
>>>>What sort of corruption? What errors/events lead you to believe this?
>>>>Specifically, I'm interested in errors from NTDS ISAM or ESE if you
>>>>have any.
>>>>
>>>>
>>>>
>>>>________________________________
>>>>
>>>>From: ActiveDir-owner@xxxxxxxxxxxxxxxxxx on behalf of Susan Bradley, CPA
>>>>aka Ebitz - SBS Rocks [MVP]
>>>>Sent: Sat 12/3/2005 10:58 PM
>>>>To: ActiveDir@xxxxxxxxxxxxxxxxxx
>>>>Subject: [ActiveDir] Ntds.dit file corruption
>>>>
>>>>SBS box [with Windows 2003 sp1 since September]
>>>>
>>>>RE: [ActiveDir] Database Corruption:
>>>>http://www.mail-archive.com/activedir@xxxxxxxxxxxxxxxxxx/msg32676.html
>>>>
>>>>We have a SBS 2003 sp1 box with a corrupt ntds.dit that the Consultant
>>>>and PSS have been banging on. Could not get the services back running,
>>>>changed the RPC service to local system and some service came back up [I
>>>>don't have all the details but the consultant opened a support case of
>>>>SRX051202605433].
>>>>
>>>>Bottom line they are about to give up and start a restore but
>>>>before they do that I'd like to get the view of the AD gods and
>>>>goddesses around here. From all that I've seen, read, seen in the SBS
>>>>newsgroup, the corruption of ntds.dit is rare to nil and an underlying
>>>>cause is hardware issues [raid, disk subsystem]. This doesn't just
>>>>happen.
>>>>
>>>>The VAP asked if not properly excluding the ad databases from the a/v
>>>>would cause this/trigger this and my expectation is 'no', given that I
>>>>doubt the majority of us in SBSland properly set up exclusions. Virus
>>>>scanning recommendations on a Windows 2000 or on a Windows Server
>>>>2003 domain controller:
>>>>http://support.microsoft.com/default.aspx?scid=kb;en-us;822158
>>>>
>>>>If this were my hardware and box, I'd be putting this sucker on the
>>>>operating table and getting an autopsy before putting it back online.
>>>>Are we right in being paranoid now about this hardware? For you guys in
>>>>big server land you'd just slide over another box into that server role.
>>>>---------------------------------------
>>>>Stupid question alert....
>>>>
>>>>Okay so we know that having a secondary/additional domain controller is
>>>>a good thing even in SBSland...but question.... many times the second
>>>>server in SBSland is a terminal server box because we do not support TS
>>>>in app mode on our PDCs. So we've established that having a domain
>>>>controller and a terminal server is a security issue [see Windows
>>>>Security resource kit, NIST Terminal services hardening guide, etc
>>>>etc....] If our second server is a member server handing out TS
>>>>externally, should that be a candidate for the additional DC? Are the
>>>>issues of TS on a DC ... true for 'any' DC? Would it be better than to
>>>>Vserver/VPC a Win2k3 inside a workstation in the network if a third
>>>>server box was not feasible?
>>>>
>>--
>>Letting your vendors set your risk analysis these days?
>>http://www.threatcode.com
>>
PLEASE READ: The information contained in this email is confidential and
intended for the named recipient(s) only. If you are not an intended
recipient of this email please notify the sender immediately and delete your
copy from your system. You must not copy, distribute or take any further
action in reliance on it. Email is not a secure method of communication and
Nomura International plc ('NIplc') will not, to the extent permitted by law,
accept responsibility or liability for (a) the accuracy or completeness of,
or (b) the presence of any virus, worm or similar malicious or disabling
code in, this message or any attachment(s) to it. If verification of this
email is sought then please request a hard copy. Unless otherwise stated
this email: (1) is not, and should not be treated or relied upon as,
investment research; (2) contains views or opinions that are solely those of
the author and do not necessarily represent those of NIplc; (3) is intended
for informational purposes only and is not a recommendation, solicitation or
offer to buy or sell securities or related financial instruments. NIplc
does not provide investment services to private customers. Authorised and
regulated by the Financial Services Authority. Registered in England
no. 1550505 VAT No. 447 2492 35. Registered Office: 1 St Martin's-le-Grand,
London, EC1A 4NP. A member of the Nomura group of companies.

List info : http://www.activedir.org/List.aspx
List FAQ : http://www.activedir.org/ListFAQ.aspx
List archive: http://www.mail-archive.com/activedir%40mail.activedir.org/
michael@xxxx.yyy

12/08/2005 11:13 AM  
The existing mechanism in place in Exchange 2003 prior to sp1 was able to
detect problems, and ensure that they didn't cause problems in the
Exchange environment -- however that could mean that a store was shut
down when a -1018 was detected. And that's a real problem to the user
environment!

Correcting a single bit error (which can be caused by hardware failure,
firmware failure, cosmic rays, or mind control (I'm sure)) allows the
store to continue operating about 40% (a significant number) of the
time. This results in a noticeable reduction of support calls to PSS. :-)

I've got notes around here somewhere, but my memory vaguely says that
the change was to take the physical page number 32-bit value in the
database record header and turn it into an ECC value. The database is
updated, record by record, as each record gets updated.
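
To make the difference between detecting and correcting a one-bit error concrete, here is a minimal Python sketch. It is not the Exchange or ESE on-disk format: the "XOR of set-bit positions" code, the use of CRC-32 as the detection checksum, and every function name below are assumptions chosen purely for illustration.

```python
import zlib
from typing import Optional


def position_xor(page: bytes) -> int:
    """XOR together the 1-based positions of every set bit in the page."""
    acc = 0
    for byte_index, byte in enumerate(page):
        for bit in range(8):
            if byte & (1 << bit):
                acc ^= byte_index * 8 + bit + 1
    return acc


def seal(page: bytes) -> tuple:
    """Return the (checksum, ecc) pair a writer would store with the page."""
    return zlib.crc32(page), position_xor(page)


def flip_bit(page: bytes, bit_index: int) -> bytes:
    """Helper for experiments: return a copy of page with one bit inverted."""
    damaged = bytearray(page)
    damaged[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(damaged)


def read_page(page: bytes, checksum: int, ecc: int) -> Optional[bytes]:
    """Return a verified page, repairing one flipped bit if possible.

    Returns None when the damage is not a correctable single-bit flip,
    which is the case where the caller would have to report an error.
    """
    if zlib.crc32(page) == checksum:
        return page
    # For a single flipped bit, the XOR difference is that bit's position.
    suspect = (position_xor(page) ^ ecc) - 1
    if not 0 <= suspect < len(page) * 8:
        return None
    repaired = flip_bit(page, suspect)
    # Only trust the repair if the independent checksum now agrees.
    return repaired if zlib.crc32(repaired) == checksum else None
```

With something like this, a read that hits a single flipped bit hands back the repaired page instead of an error, while anything worse still fails verification and gets reported.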

Could such a change benefit A/D? I don't see why not. It's probably not
as dramatic an improvement though -- the reaction of Exchange to a
one-bit error was to shut down the entire store. A/D apparently just
fails the current request. Depending on the request, that could be a big
deal - or not.

-----Original Message-----
From: ActiveDir-owner@xxxxxxxxxxxxxxxxxx
[mailto:ActiveDir-owner@xxxxxxxxxxxxxxxxxx] On Behalf Of
neil.ruston@xxxxxxxxxxxxx
Sent: Thursday, December 08, 2005 4:38 AM
To: ActiveDir@xxxxxxxxxxxxxxxxxx
Subject: RE: [ActiveDir] Ntds.dit file corruption

I was referring to this KB:
http://support.microsoft.com/?kbid=867626
neil
-----Original Message-----
From: ActiveDir-owner@xxxxxxxxxxxxxxxxxx
[mailto:ActiveDir-owner@xxxxxxxxxxxxxxxxxx] On Behalf Of Susan Bradley,
CPA aka Ebitz - SBS Rocks [MVP]
Sent: 08 December 2005 09:28
To: ActiveDir@xxxxxxxxxxxxxxxxxx
Subject: Re: [ActiveDir] Ntds.dit file corruption

I thought it was Windows 2003 sp1 that had additional database
correction stuff?

neil.ruston@xxxxxxxxxxxxx wrote:

>Maybe I should flip the question around a little...
>
>
>What are the changes made in exch2k3 sp1 (involving ECC corrections)
>and why were they deemed necessary, given what I have read from
>joe/Brett/Eric et al)??
>
>The changes appear to be superfluous. We do not appear to need such a
>(further) check re: AD/ESE(?)
>
>What am I missing guys?
>
>neil
>PS Great thread so far :)
>
>-----Original Message-----
>From: ActiveDir-owner@xxxxxxxxxxxxxxxxxx
>[mailto:ActiveDir-owner@xxxxxxxxxxxxxxxxxx] On Behalf Of joe
>Sent: 07 December 2005 01:55
>To: ActiveDir@xxxxxxxxxxxxxxxxxx
>Subject: RE: [ActiveDir] Ntds.dit file corruption
>
>Good post ~Eric, thanks for chiming in.
>
>I see where you are coming from with the corruption at the distributed
>level. In terms of corruption at that level I see it as corruption but
>just can't get myself to see it as AD corruption. I am not sure if I
>can put it down in words why. I just don't. :)
>
> joe
>
>-----Original Message-----
>From: ActiveDir-owner@xxxxxxxxxxxxxxxxxx
>[mailto:ActiveDir-owner@xxxxxxxxxxxxxxxxxx] On Behalf Of Eric
>Fleischman
>Sent: Tuesday, December 06, 2005 5:42 PM
>To: ActiveDir@xxxxxxxxxxxxxxxxxx
>Subject: RE: [ActiveDir] Ntds.dit file corruption
>
>
>I would generally not call USN rollback a corruption either, but I
>think Dean makes a fair and quasi-valid point that if you consider the
>distributed system, yes such a thing is a corruption. Feel free to
>shim in an "AD Distributed System Logical Layer" in the above stack,
>between AD Logical Layer and App Logical Layer. I'm waffling on this
>point though, as something smells different than other types of
>corruption.
>I'm going to think about that for a long time ... in fact Eric (yes the
>~Eric) is at my door and says he would consider it corruption, so there
>is a long debate in my future as well ...
>
>
>Over lunch, Brett and I discussed this some more. My contention is that
>USN rollback would be a form of corruption under a somewhat broad
>definition.
>The reality is that there is a layer that Brett mentioned which
>actually has two parts when looked at from a high level. Namely, this
>layer:
>
>
>>AD Logical Layer
>>
>>
>
>The first piece could be thought of as the local logical layer. That is,
>data hierarchy, conforming to the code assumptions of how it should be,
>data conforming to the schema as defined, etc. This is a layer of data
>that clearly need be proper (leaving the definition of proper to
>another day), else we are in some sort of corrupt state. Brett and I
>both agree on this I'm pretty sure.
>
>However, there is then distributed systems corruption. In AD, one of
>the services we aim to provide is convergence. If we do not converge,
>we define this divergence as at a minimum "bad", perhaps "corrupt."
>USN rollback breaks our convergence guarantees, it breaks replication
>such that you will not attain convergence in the system. I would as
>such consider it a form of corruption.
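
A toy Python model may help show why a USN rollback breaks convergence. This is a deliberately simplified sketch: real AD replication also involves invocation IDs and up-to-dateness vectors, and the class and attribute names here are hypothetical.

```python
import copy


class DomainController:
    def __init__(self, name):
        self.name = name
        self.usn = 0              # local update sequence number
        self.changes = {}         # key -> (value, USN assigned locally)
        self.replica = {}         # key -> value learned from partners
        self.high_watermark = {}  # partner name -> highest USN already pulled

    def write(self, key, value):
        self.usn += 1
        self.changes[key] = (value, self.usn)

    def pull_from(self, partner):
        """Ask the partner only for changes above the USN we already saw."""
        seen = self.high_watermark.get(partner.name, 0)
        for key, (value, usn) in partner.changes.items():
            if usn > seen:
                self.replica[key] = value
        self.high_watermark[partner.name] = max(seen, partner.usn)


def simulate_usn_rollback():
    dc1, dc2 = DomainController("DC1"), DomainController("DC2")
    dc1.write("user1", "v1")
    image = copy.deepcopy(dc1)   # disk image taken while DC1 is at USN 1
    dc1.write("user2", "v1")
    dc1.write("user3", "v1")
    dc2.pull_from(dc1)           # DC2 has now seen DC1 up to USN 3
    dc1 = image                  # image restored: DC1 is back at USN 1
    dc1.write("user4", "v1")     # gets USN 2, which DC2 believes it has seen
    dc2.pull_from(dc1)           # nothing above USN 3, so user4 is skipped
    return dc1, dc2
```

After the simulation, DC1 holds user4 but not user2 or user3, DC2 holds the opposite, and neither side will ever ask for the missing changes. Each copy looks internally consistent, yet the two never converge.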
>
>Over Teriyaki a few minutes ago, Brett posited the question "well if
>USN rollback is corruption, what else?" Valid question. I would concede
>that if USN rollback is considered distributed systems corruption, so
>too would be other conditions which yield divergence. Perhaps this is a
>slippery slope that goes too far. I need to think about this some more.
>
>I would also toss out there that corruption should not be confused with
>"forever broken." There are many states in which the directory can
>exist where it is functional, but in some way broken. Such divergences
>can typically be repaired with administrative action, so long as it is
>a savvy administrator. :) If we are willing to assume that divergence
>is corruption, I'd tend to believe that most people on this list have
>recovered from some form of corruption before. The worse the
>corruption, the more help you likely want to recover from it. :)
>
>Anyway, we'll likely debate this for a few months, as we usually do on
>such points. More thoughts to come as we debate further.
>
>~Eric
>
>
>
>-----Original Message-----
>From: ActiveDir-owner@xxxxxxxxxxxxxxxxxx
>[mailto:ActiveDir-owner@xxxxxxxxxxxxxxxxxx] On Behalf Of Brett Shirley
>Sent: Tuesday, December 06, 2005 12:04 PM
>To: ActiveDir@xxxxxxxxxxxxxxxxxx
>Subject: RE: [ActiveDir] Ntds.dit file corruption
>
>I wouldn't say that, joe ...
>
>Let's take another hypothetical real quick, let's say you have a column
>for the RDN of an AD object (well we do) and that value is NULL. From
>AD's perspective this object is well not really an object, it would be
>corrupt, and might even crash lsass.exe (I don't know, it might).
>
>However, from ESE's perspective, the table/row/column is valid,
>it has a particular column that doesn't have a value. A column which I
>might add is declared "optional" (real term is tagged) in the ESE layer
>"schema" (real term is catalog). ESE is simply a store of data, it passes no
>judgement on the data as long as it fits the schema guidelines for the
>column.
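
The point that the same row can be valid at one layer and corrupt at the next can be sketched in a few lines of Python. The catalog layout, the column names and both function names are hypothetical and only illustrate the idea; they are not the real ESE catalog or AD schema.

```python
# Hypothetical catalog: column name -> (expected type, tagged i.e. optional?)
CATALOG = {
    "dnt": (int, False),
    "rdn": (str, True),
}


def storage_layer_valid(row: dict) -> bool:
    """The store only checks types and whether a missing value is allowed."""
    for column, (column_type, tagged) in CATALOG.items():
        value = row.get(column)
        if value is None:
            if not tagged:
                return False
        elif not isinstance(value, column_type):
            return False
    return True


def directory_layer_valid(row: dict) -> bool:
    """The directory adds its own rule: every object must have an RDN."""
    return storage_layer_valid(row) and bool(row.get("rdn"))
```

A row such as `{"dnt": 1234, "rdn": None}` passes `storage_layer_valid` but fails `directory_layer_valid`, which is the "is the DB corrupt?" situation being described.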
>
>Joe, is the DB corrupt? An AD object without an RDN?
>
>----
>
>I have a tendency to think in layers and sources of corruption.
> App Logical Layer
> AD Logical Layer
> ESE Logical Layer
> [ESE] Physical Layer
>
>Corruptions coming top down through that stack are protected by the
>schema configuration/constraints of that layer (as joe astutely pointed
>out).
>
>Corruptions coming bottom up, from disk sub-system hardware, are
>protected by whatever mechanisms those layers have.
>
>----
>
>Dropping back to the above hypothetical as an ESE dev I can say to the
>AD devs that until they can prove that ESE actually lost their column,
>that it's most likely some sort of AD transactional problem, and the
>source is an AD bug. If I am feeling unbusy I will debug at the AD
>logical layer, because I know what it's supposed to look like.
>
>----
>
>Coming back to the original issue of replicating _this kind_ of
>corruption: a normal corruption coming bottom up, because the bits we
>(ESE) sent down to the disk subsystem were not the exact bits we got back
>later from the sub-system, is almost always detected by the fact that
>ESE checksums _every byte_ of its database pages ... and at this point
>everyone should be very thankful Win2k3 AD isn't on SQL 2000, because
>it has few such protections, though SQL 2005 finally caught up, 10
>years after the fact, it's such a legacy DB, really ... anyway.
>
>When the corruption comes up from the bottom, what happens is ESE
>detects the data is not checksumming, logs an event, returns a
>-1018 error (in this case), and starts rejecting DB operations (such as
>JetSeek() / JetRetrieveColumn()) that involve that corrupt database page.
>AD then responds to these failed DB ops with: AD can't authenticate a
>user, AD can't return the results of a search, or AD can't read or apply
>data during replication (those 3 at least probably being the most common).
>In short the system starts limping, without affecting the rest of the
>distributed system.
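
The detect-and-reject behaviour described here can be sketched as a tiny page store in Python. This is an illustration under assumptions: CRC-32 stands in for whatever checksum ESE really uses, and `PageStore`, `PageVerifyError` and `lookup` are hypothetical names, not part of any real API. Only the -1018 code comes from the thread.

```python
import zlib

JET_ERR_READ_VERIFY_FAILURE = -1018  # the error code named in the thread


class PageVerifyError(Exception):
    def __init__(self, page_no):
        super().__init__(f"page {page_no} failed checksum verification")
        self.page_no = page_no
        self.code = JET_ERR_READ_VERIFY_FAILURE


class PageStore:
    def __init__(self):
        self._pages = {}  # page number -> (stored checksum, payload)

    def write(self, page_no, payload: bytes):
        self._pages[page_no] = (zlib.crc32(payload), bytearray(payload))

    def flip_bit_on_disk(self, page_no, bit_index):
        """Simulate the disk subsystem handing back a damaged page."""
        _, payload = self._pages[page_no]
        payload[bit_index // 8] ^= 1 << (bit_index % 8)

    def read(self, page_no) -> bytes:
        checksum, payload = self._pages[page_no]
        if zlib.crc32(payload) != checksum:
            raise PageVerifyError(page_no)
        return bytes(payload)


def lookup(store, page_no):
    """Caller side: fail this one request, keep serving the others."""
    try:
        return store.read(page_no)
    except PageVerifyError as err:
        return f"request failed with {err.code}"
```

Requests that touch the damaged page fail, requests that touch healthy pages keep working, and nothing wrong is ever handed upward as if it were good data.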
>
>----
>
>Coming back to Jose's worry of old hardware injecting bad data into the
>distributed system. Fortunately, when the disk subsystem goes bad, ESE
>does a pretty good job of protecting you, but there are other sources
>of corruption besides the disk. An especially insidious one is the
>bit flip in memory (and yes I see these too), which injects itself in
>the middle of the above stack. This kind of corruption can both end up
>making its way down to the disk subsystem (with a valid ESE checksum),
>and up and out to the distributed system.
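
Why a memory bit flip slips past the page checksum can be shown with a short Python sketch. CRC-32 and the helper names are assumptions for illustration only; the point is simply the order of events, not the real ESE checksum.

```python
import zlib


def flip_bit(buf: bytes, bit_index: int) -> bytes:
    """Return a copy of buf with one bit inverted."""
    damaged = bytearray(buf)
    damaged[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(damaged)


def flush(buffer: bytes):
    """The checksum is computed from whatever is in memory at flush time."""
    return zlib.crc32(buffer), buffer


def verify(checksum: int, payload: bytes) -> bool:
    return zlib.crc32(payload) == checksum


intended = b"cn=Jose"

# Case 1: bit flips in memory BEFORE the page is checksummed and written.
in_memory = flip_bit(intended, 3)
checksum, on_disk = flush(in_memory)
memory_flip_detected = not verify(checksum, on_disk)

# Case 2: bit flips on disk AFTER a good page was checksummed and written.
checksum, on_disk = flush(intended)
disk_flip_detected = not verify(checksum, flip_bit(on_disk, 3))
```

In the first case the bad data is checksummed as if it were correct, so verification passes and the wrong value is free to travel both down to disk and up to replication. In the second case the mismatch is caught on the next read.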
>
>From the perspective of older hardware though, I would _hypothesize_
>that if you're going to have something go bad, the disk or the memory,
>over time, keep in mind the disk is the only part of the computer that
>has a moving part. I would expect disks to go bad first.
>
>----
>
>I would generally not call USN rollback a corruption either, but I
>think Dean makes a fair and quasi-valid point that if you consider the
>distributed system, yes such a thing is a corruption. Feel free to
>shim in an "AD Distributed System Logical Layer" in the above stack,
>between AD Logical Layer and App Logical Layer. I'm waffling on this
>point though, as something smells different than other types of
>corruption.
>I'm going to think about that for a long time ... in fact Eric (yes the
>~Eric) is at my door and says he would consider it corruption, so there
>is a long debate in my future as well ...
>
>From a storage developer's perspective, what someone usually calls
>corruption is when the data layer they own or lower returns the wrong
>result.
>
>From a non-storage developer's perspective, what someone usually calls
>corruption is when the data layer below them returns the wrong result.
>
>----
>
>I'll wax more philosophically on it later ....
>
>Cheers,
>BrettSh
>
>
>
>On Tue, 6 Dec 2005, Dean Wells wrote:
>
>
>
>>Great topic and, IMO, great answer ... I've only a few comments in addition
>>to Joe's reply (inline).
>>--
>>Dean Wells
>>MSEtechnology
>>* Email: dwells @msetechnology.com
>> http://msetechnology.com
>>
>>
>>
>> _____
>>
>>From: ActiveDir-owner@xxxxxxxxxxxxxxxxxx
>>[mailto:ActiveDir-owner@xxxxxxxxxxxxxxxxxx] On Behalf Of joe
>>Sent: Tuesday, December 06, 2005 8:56 AM
>>To: ActiveDir@xxxxxxxxxxxxxxxxxx
>>Subject: RE: [ActiveDir] Ntds.dit file corruption
>>
>>
>>I may get into trouble with this post as Brett/Eric/Dean/Steve correct me...
>>But that will be good.
>>
>>[DAW]
>>I'm fairly certain Brett will have something to say on this one (in his
>>shoes, I know I would).
>>[/DAW]
>>
>>I will start with trying to differentiate between types of corruption... My
>>idea of AD corruption is underlying table corruption. However some people
>>may consider bad (really unexpected) values in AD to be corruption. The last
>>isn't corruption, AD is simply a store of data, it passes no judgement on
>>the data as long as it fits the schema guidelines for the attribute. If you
>>have the DN of a user in the siteObject attribute that isn't corruption, it
>>isn't good, but it is valid for the schema. Or if you have binary data in a
>>unicode string, again, not corruption (a unicode string IS binary data).
>>That being said, if apps (including parts of AD itself) hit unexpected data,
>>you will have some issues even if it isn't truly "corruption" it may as well
>>be in some cases. In fact, table corruption is probably better than
>>unexpected data in many cases.
>>
>>You might be able to argue that a USN rollback is corruption but I still
>>don't consider it so. Valid data, just out of step.
>>
>>[DAW]
>>That's an interesting one. If you treat the distributed database as a
>>whole, then USN rollback is indeed a form of corruption even though each
>>instance may deem itself consistent and intact.
>>[/DAW]
>>
>>Again corruption to me is in the underlying tables. Since AD doesn't
>>replicate the table structures, you can't pass that table corruption around.
>>Once AD realizes that some portion of the database is corrupt which would
>>probably be recognized by ESE saying, "that isn't right" and not passing
>>info back up to higher levels, but instead passing an error.
>>
>> joe
>>
>>
>>
>>
>> _____
>>
>>From: ActiveDir-owner@xxxxxxxxxxxxxxxxxx
>>[mailto:ActiveDir-owner@xxxxxxxxxxxxxxxxxx] On Behalf Of
>>neil.ruston@xxxxxxxxxxxxx
>>Sent: Tuesday, December 06, 2005 3:49 AM
>>To: ActiveDir@xxxxxxxxxxxxxxxxxx
>>Subject: RE: [ActiveDir] Ntds.dit file corruption
>>
>>
>>Is this guaranteed? How can we/you be sure that the system will recognise
>>the corruptions and therefore not replicate them? Surely this is akin to the
>>new feature added to e2k3 sp1, but which is (sadly) missing from AD(?)
>>
>>I must be missing a subtle point - please show me the light :)
>>
>>
>>neil
>>
>> _____
>>
>>From: ActiveDir-owner@xxxxxxxxxxxxxxxxxx
>>[mailto:ActiveDir-owner@xxxxxxxxxxxxxxxxxx] On Behalf Of Steve Linehan
>>Sent: 05 December 2005 19:26
>>To: ActiveDir@xxxxxxxxxxxxxxxxxx
>>Subject: RE: [ActiveDir] Ntds.dit file corruption
>>
>>
>>We do not replicate corruption so if you have local corruption as noted
>>below there is no worry that it would replicate around to other servers in
>>the environment.
>>
>>Thanks,
>>
>>-Steve
>>
>> _____
>>
>>From: ActiveDir-owner@xxxxxxxxxxxxxxxxxx
>>[mailto:ActiveDir-owner@xxxxxxxxxxxxxxxxxx] On Behalf Of Phil Renouf
>>Sent: Monday, December 05, 2005 1:04 PM
>>To: ActiveDir@xxxxxxxxxxxxxxxxxx
>>Subject: Re: [ActiveDir] Ntds.dit file corruption
>>
>>
>>Will Read Only DC's take care of this? I don't know much about them yet, but
>>it makes sense that if the copy of the dit that a DC has is RO that it won't
>>try to replicate that anywhere and would only be the recipient of
>>replication. Anyone with more knowledge about how RO DC's will work to
>>comment on that?
>>
>>Phil
>>
>>
>>On 12/5/05, Medeiros, Jose wrote:
>>
>>Well at least the corruption occurred on just a single DC. One thing that
>>has bugged me about Active Directory is not being able to select if you want
>>a DC in a remote office to not have the ability to replicate back in a large
>>enterprise environment. Since most remote offices only have a few people at
>>the location and a DC is usually placed for improved logon and
>>authentication time, many companies will either use a very low end server or
>>a very old decommissioned one from their production data center (which is
>>probably close to the end of its usable life). I am always concerned that
>>once the NTDS.DIT file becomes corrupt it will replicate the corruption to
>>the other DC's in the forest.
>>
>>Maybe I am just being a worrywart and this really is not an issue.
>>
>>
>>
>>Sincerely,
>>Jose Medeiros
>>ADP | National Account Services
>>ProBusiness Division | Information Services
>>925.737.7967 | 408-449-6621 CELL
>>
>>
>>
>>
>>-----Original Message-----
>>From: ActiveDir-owner@xxxxxxxxxxxxxxxxxx
>>[mailto:ActiveDir-owner@xxxxxxxxxxxxxxxxxx]On Behalf Of Susan Bradley,
>>CPA aka Ebitz - SBS Rocks [MVP]
>>Sent: Monday, December 05, 2005 8:53 AM
>>To: ActiveDir@xxxxxxxxxxxxxxxxxx
>>Subject: Re: [ActiveDir] Ntds.dit file corruption
>>
>>
>>I did? :-) I think I still said all I know is what the poster said :-)
>>
>>I think I need a course in event log reading because even with the logs,
>>and the default size of the logs, I still don't see a smoking gun. The
>>directory services one is filled with events 'post' blow up.
>>
>>What is interesting is that it seems to me big server land goes .. oh
>>yeah... ntds.dit corruption... and sbsland freaks out. Either we do
>>indeed need to ensure we have a secondary DC or we need to park a second
>>copy of a system state offsite [say at the vap/var]
>>
>>Brett Shirley wrote:
>>
>>
>>>She replied offline, very likely a single bit flip, tragedy, they aren't
>>>one release later (Longhorn), where this would've probably been
>>>non-disruptively handled, logged, and possibly self-healed:
>>> http://blogs.technet.com/efleis/archive/2005/01.aspx
>>>
>>>Anyway, this kind of thing is usually hardware ...
>>>
>>>While there are much better disk sub-system testers, one that is freely
>>>available to any box with Exchange is jetstress. You might give that a
>>>try. If you can reproduce the event / error with jetstress I would not
>>>use that box in production.
>>>
>>>If you do reproduce the issue several times (several times is key, as you
>>>want a trend before you start playing the variable game), some things
>>>you might vary (one at a time):
>>>
>>> - Try making sure you have the latest driver and motherboard / controller
>>>firmware. Then see if you can reproduce.
>>>
>>> - Try a different RAID configuration, such as RAID1/RAID1+0 if you're on
>>>RAID5.
>>>
>>> - Try swapping out the hard drives, one at a time.
>>>
>>> - Add the jetstress files to the exclude list in the Anti-Virus
>>>software. (A low probability, I've never heard of Anti-Virus causing this
>>>particular type of error, and I can't imagine the mistake an anti-virus
>>>product would have to have to cause this side effect)
>>>
>>> - If you can reproduce it several times, you could follow up with Dell.
>>>
>>>Good luck.
>>>
>>>I'm not sure if I answered your question ...
>>>
>>>Cheers,
>>>BrettSh
>>>
>>>
>>>On Sun, 4 Dec 2005, Eric Fleischman wrote:
>>>
>>>>Going back to the original post, I'm not sure I fully understand the
>>>>problem yet. Susan, can you define "ntds.dit file corruption" for us?
>>>>What sort of corruption? What errors/events lead you to believe this?
>>>>Specifically, I'm interested in errors from NTDS ISAM or ESE if you
>>>>have any.
>>>>
>>>>
>>>>
>>>>________________________________
>>>>
>>>>From: ActiveDir-owner@xxxxxxxxxxxxxxxxxx on behalf of Susan Bradley, CPA
>>>>aka Ebitz - SBS Rocks [MVP]
>>>>Sent: Sat 12/3/2005 10:58 PM
>>>>To: ActiveDir@xxxxxxxxxxxxxxxxxx
>>>>Subject: [ActiveDir] Ntds.dit file corruption
>>>>
>>>>SBS box [with Windows 2003 sp1 since September]
>>>>
>>>>RE: [ActiveDir] Database Corruption:
>>>>http://www.mail-archive.com/activedir@xxxxxxxxxxxxxxxxxx/msg32676.html
>>>>
>>>>We have a SBS 2003 sp1 box with a corrupt ntds.dit that the Consultant
>>>>and PSS have been banging on. Could not get the services back running,
>>>>changed the RPC service to local system and some service came back up [I
>>>>don't have all the details but the consultant opened a support case of
>>>>SRX051202605433].
>>>>
>>>>Bottom line they are about to give up and start a restore but
>>>>before they do that I'd like to get the view of the AD gods and
>>>>goddesses around here. From all that I've seen, read, seen in the SBS
>>>>newsgroup, the corruption of ntds.dit is rare to nil and an underlying
>>>>cause is hardware issues [raid, disk subsystem]. This doesn't just
>>>>happen.
>>>>
>>>>The VAP asked if not properly excluding the ad databases from the a/v
>>>>would cause this/trigger this and my expectation is 'no', given that I
>>>>doubt the majority of us in SBSland properly set up exclusions. Virus
>>>>scanning recommendations on a Windows 2000 or on a Windows Server
>>>>2003 domain controller:
>>>>http://support.microsoft.com/default.aspx?scid=kb;en-us;822158
>>>>
>>>>If this were my hardware and box, I'd be putting this sucker on the
>>>>operating table and getting an autopsy before putting it back online.
>>>>Are we right in being paranoid now about this hardware? For you guys in
>>>>big server land you'd just slide over another box into that server role.
>>>>---------------------------------------
>>>>Stupid question alert....
>>>>
>>>>Okay so we know that having a secondary/additional domain controller is
>>>>a good thing even in SBSland...but question.... many times the second
>>>>server in SBSland is a terminal server box because we do not support TS
>>>>in app mode on our PDCs. So we've established that having a domain
>>>>controller and a terminal server is a security issue [see Windows
>>>>Security resource kit, NIST Terminal services hardening guide, etc
>>>>etc....] If our second server is a member server handing out TS
>>>>externally, should that be a candidate for the additional DC? Are the
>>>>issues of TS on a DC ... true for 'any' DC? Would it be