Azure Data Lake Storage Gen1 to Gen2 Migration
Azure Data Lake Storage (ADLS) Gen2 is a highly scalable and cost-effective data
lake solution for big data analytics. It combines the power of a high-performance file
system with massive scale and economy to help organizations speed their time to
insight. ADLS Gen2 extends Azure Blob Storage capabilities, is optimized for analytic
workloads, and is the most comprehensive data lake available.
As more customers migrate from ADLS Gen1 to Gen2, they typically follow one of four migration approaches. This document describes
those approaches. The final section provides information on WANdisco LiveData Plane, a Microsoft-recommended solution for
bidirectional replication and for migrating data from ADLS Gen1 to Gen2. It minimizes the risks and costs associated with large-scale
data migration initiatives and supports migration with zero downtime, zero data loss, and 100% data consistency.
MIGRATION APPROACHES
Migration from ADLS Gen1 to Gen2 typically follows one of four migration patterns,
which are described in more detail below. The patterns are also discussed in the
Microsoft documentation at:
https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-migrate-gen1-to-gen2#migration-patterns
LIFT AND SHIFT
A lift and shift approach migrates an application and data from one environment to
another without redesigning the application for the target environment. It is
typically the simplest approach, requiring the following high-level steps:
• Stop all writes to Gen1
• Move the data from Gen1 to Gen2
• Point ingest operations and workloads to Gen2
• Decommission Gen1
Typically, this approach is best suited for small-scale migrations where all applications
can be upgraded to the new environment at one time and downtime is acceptable. Once
organizations need to migrate hundreds of terabytes or petabytes of data, the time
required just to physically move the data usually exceeds the acceptable downtime.
Additionally, while upgrading all applications at one time can be an advantage, many
organizations prefer to phase the migration in order to minimize risk. Phasing is not
possible with a big bang lift and shift approach.
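The downtime concern above can be made concrete with a rough estimate. The following Python sketch computes how long a bulk copy might take. The link speed, the 70% efficiency factor, and the function names are illustrative assumptions, not figures from Microsoft or from any measured migration.

```python
def transfer_hours(data_tb, throughput_gbps, efficiency=0.7):
    """Estimate the hours needed to copy data_tb terabytes over a link.

    throughput_gbps is the nominal link speed in gigabits per second.
    efficiency is an assumed fraction of that speed sustained in practice.
    """
    if data_tb < 0 or throughput_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("invalid input")
    bits = data_tb * 1e12 * 8  # decimal terabytes to bits
    seconds = bits / (throughput_gbps * 1e9 * efficiency)
    return seconds / 3600


def fits_downtime_window(data_tb, throughput_gbps, window_hours, efficiency=0.7):
    """Return True if the estimated copy time fits in the allowed downtime."""
    return transfer_hours(data_tb, throughput_gbps, efficiency) <= window_hours


# Hypothetical example: 500 TB over a 10 Gbps link with an 8-hour window.
print(round(transfer_hours(500, 10), 1))
print(fits_downtime_window(500, 10, window_hours=8))
```

Under these assumptions, 500 TB takes roughly 159 hours, or more than six days, which is far longer than an overnight maintenance window.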
PROS
• Simplest approach
• All applications upgraded at one time
CONS
• Requires downtime during migration and cutover periods
• All applications upgraded at one time
INCREMENTAL COPY
An incremental copy approach periodically copies new and modified data from the
source to the target. The destination must hold all data from the source system
before the incremental copy process can begin. Steps for this approach are as follows:
• Start moving data from Gen1 to Gen2
• Incremental copy of new and modified data from Gen1 to Gen2
• Once incremental copy is complete, stop all writes to Gen1 and point workloads to Gen2
• Decommission Gen1
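The incremental step above can be sketched as a filter over a source listing. This is a minimal Python illustration, assuming a listing of paths with last-modified timestamps is available. The `FileEntry` type, the sample paths, and the watermark value are hypothetical and do not come from any Azure SDK.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class FileEntry:
    """One row of a hypothetical Gen1 listing."""
    path: str
    modified: datetime
    size_bytes: int


def select_incremental(listing, watermark):
    """Return entries modified at or after the watermark.

    The watermark is the time the previous copy pass started, so files
    written while that pass was running are picked up by the next pass.
    """
    return [entry for entry in listing if entry.modified >= watermark]


# Hypothetical listing and watermark for illustration only.
listing = [
    FileEntry("/raw/2020/01/a.csv", datetime(2020, 1, 5, tzinfo=timezone.utc), 1_000),
    FileEntry("/raw/2020/02/b.csv", datetime(2020, 2, 9, tzinfo=timezone.utc), 2_500),
    FileEntry("/raw/2020/02/c.csv", datetime(2020, 2, 20, tzinfo=timezone.utc), 4_000),
]
previous_pass_started = datetime(2020, 2, 1, tzinfo=timezone.utc)
to_copy = select_incremental(listing, previous_pass_started)
print([entry.path for entry in to_copy])
```

A timestamp filter like this does not detect files deleted in Gen1, which is one reason a separate reconciliation step is still needed.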
An incremental copy approach is typically used to migrate larger data sets for which
the copy requires more time. Because it allows writes to continue in the Gen1
environment, it requires less application downtime. However, as with lift and shift,
once organizations need to migrate hundreds of terabytes or petabytes of data, the
incremental copy approach is likely also not acceptable. New and modified data in Gen1
must continuously be reconciled and incrementally copied to the Gen2 environment.
Manual reconciliation becomes impractical for large-scale data sets, and the
incremental copy process may take too long to complete. In addition, as with lift and
shift, all applications must be upgraded at one time, which may not be acceptable for
many organizations.
PROS
• Requires less downtime than lift and shift approach
• All applications upgraded at one time
CONS
• Requires downtime during cutover period
• All applications upgraded at one time
• Requires reconciliation to identify new and changed data
• Lengthy process for large-scale migrations
DUAL PIPELINE / INGEST
In a dual pipeline or dual ingest approach, new data is ingested simultaneously into
both the Gen1 and Gen2 environments. Steps for this approach are as follows:
• Start moving data from Gen1 to Gen2
• Ingest new data into both Gen1 and Gen2
• Point workloads to Gen2
• Stop all writes to Gen1 and then decommission Gen1
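The dual ingest step above can be sketched as a writer that sends each payload to both environments. This is a minimal Python illustration that uses an in-memory stand-in for storage. The `InMemoryStore` class, the `dual_ingest` function, and the sample path are hypothetical, and a real pipeline would substitute its own storage clients.

```python
class InMemoryStore:
    """Stand-in for a storage account; a real client would go here."""

    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail  # simulate an unavailable target
        self.files = {}

    def write(self, path, data):
        if self.fail:
            raise OSError(f"{self.name} unavailable")
        self.files[path] = data


def dual_ingest(path, data, stores):
    """Write the payload to every store and return the names that failed."""
    failed = []
    for store in stores:
        try:
            store.write(path, data)
        except OSError:
            failed.append(store.name)
    return failed


# Hypothetical run in which the Gen2 write fails.
gen1 = InMemoryStore("gen1")
gen2 = InMemoryStore("gen2", fail=True)
failed = dual_ingest("/raw/events/0001.json", b'{"id": 1}', [gen1, gen2])
print(failed)
```

A failed write to one store leaves the two environments out of step, which is why this pattern depends on continuous reconciliation.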
While a dual ingest approach can support a zero-downtime migration and allow a
phased cutover of applications, it introduces much higher complexity and requires many
more resources to manage the setup, maintenance, testing, and validation activities.
Once dual ingest is started in both environments, reconciliation must be performed
continuously to identify data changes that occur in Gen1 and to make sure those same
changes are applied to Gen2. As discussed previously, manual reconciliation may not be
feasible or acceptable for large-scale data sets. The longer changes continue in Gen1,
the greater the chance of introducing data inconsistency. Because this approach is
typically used to migrate large data sets for which the downtime introduced by the
previous patterns would not be acceptable, it can take a very long time to complete.
These migration projects often exceed expected timelines and budgets.
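The reconciliation described above amounts to comparing what exists in each environment. The following Python sketch compares two manifests that map each path to a content fingerprint. The manifest format, the fingerprint values, and the `reconcile` function are assumptions made for illustration, and producing such manifests at scale is itself a significant task.

```python
def reconcile(gen1_manifest, gen2_manifest):
    """Compare two {path: fingerprint} manifests and report differences."""
    gen1_paths = set(gen1_manifest)
    gen2_paths = set(gen2_manifest)
    return {
        # Present in Gen1 but never copied to Gen2.
        "missing_in_gen2": sorted(gen1_paths - gen2_paths),
        # Present in Gen2 only, for example deleted later in Gen1.
        "only_in_gen2": sorted(gen2_paths - gen1_paths),
        # Present in both, but the content fingerprints disagree.
        "content_differs": sorted(
            path for path in gen1_paths & gen2_paths
            if gen1_manifest[path] != gen2_manifest[path]
        ),
    }


# Hypothetical manifests for illustration only.
gen1_manifest = {"/raw/a.csv": "9f2c", "/raw/b.csv": "17aa", "/raw/c.csv": "e410"}
gen2_manifest = {"/raw/a.csv": "9f2c", "/raw/b.csv": "0000", "/raw/d.csv": "77b1"}
report = reconcile(gen1_manifest, gen2_manifest)
print(report)
```

Each non-empty list in the report is a set of changes that still has to be applied to Gen2 before the two environments can be considered consistent.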
PROS
• Supports zero downtime
• Allows phased migration of applications
CONS
• High complexity solution
• Requires more resources to manage the setup, maintenance and testing activities
• Requires reconciliation to identify data changes in Gen1 from initial copy and while dual ingest is active
• Higher potential for data inconsistency
• Lengthy process for large-scale migrations