anystyle 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (82)
  1. checksums.yaml +7 -0
  2. data/HISTORY.md +78 -0
  3. data/LICENSE +27 -0
  4. data/README.md +103 -0
  5. data/lib/anystyle.rb +71 -0
  6. data/lib/anystyle/dictionary.rb +132 -0
  7. data/lib/anystyle/dictionary/gdbm.rb +52 -0
  8. data/lib/anystyle/dictionary/lmdb.rb +67 -0
  9. data/lib/anystyle/dictionary/marshal.rb +27 -0
  10. data/lib/anystyle/dictionary/redis.rb +55 -0
  11. data/lib/anystyle/document.rb +264 -0
  12. data/lib/anystyle/errors.rb +14 -0
  13. data/lib/anystyle/feature.rb +27 -0
  14. data/lib/anystyle/feature/affix.rb +43 -0
  15. data/lib/anystyle/feature/brackets.rb +32 -0
  16. data/lib/anystyle/feature/canonical.rb +13 -0
  17. data/lib/anystyle/feature/caps.rb +20 -0
  18. data/lib/anystyle/feature/category.rb +70 -0
  19. data/lib/anystyle/feature/dictionary.rb +16 -0
  20. data/lib/anystyle/feature/indent.rb +16 -0
  21. data/lib/anystyle/feature/keyword.rb +52 -0
  22. data/lib/anystyle/feature/line.rb +39 -0
  23. data/lib/anystyle/feature/locator.rb +18 -0
  24. data/lib/anystyle/feature/number.rb +39 -0
  25. data/lib/anystyle/feature/position.rb +28 -0
  26. data/lib/anystyle/feature/punctuation.rb +22 -0
  27. data/lib/anystyle/feature/quotes.rb +20 -0
  28. data/lib/anystyle/feature/ref.rb +21 -0
  29. data/lib/anystyle/feature/terminal.rb +19 -0
  30. data/lib/anystyle/feature/words.rb +74 -0
  31. data/lib/anystyle/finder.rb +94 -0
  32. data/lib/anystyle/format/bibtex.rb +63 -0
  33. data/lib/anystyle/format/csl.rb +28 -0
  34. data/lib/anystyle/normalizer.rb +65 -0
  35. data/lib/anystyle/normalizer/brackets.rb +13 -0
  36. data/lib/anystyle/normalizer/container.rb +13 -0
  37. data/lib/anystyle/normalizer/date.rb +109 -0
  38. data/lib/anystyle/normalizer/edition.rb +16 -0
  39. data/lib/anystyle/normalizer/journal.rb +14 -0
  40. data/lib/anystyle/normalizer/locale.rb +30 -0
  41. data/lib/anystyle/normalizer/location.rb +24 -0
  42. data/lib/anystyle/normalizer/locator.rb +22 -0
  43. data/lib/anystyle/normalizer/names.rb +88 -0
  44. data/lib/anystyle/normalizer/page.rb +29 -0
  45. data/lib/anystyle/normalizer/publisher.rb +18 -0
  46. data/lib/anystyle/normalizer/pubmed.rb +18 -0
  47. data/lib/anystyle/normalizer/punctuation.rb +23 -0
  48. data/lib/anystyle/normalizer/quotes.rb +14 -0
  49. data/lib/anystyle/normalizer/type.rb +54 -0
  50. data/lib/anystyle/normalizer/volume.rb +26 -0
  51. data/lib/anystyle/parser.rb +199 -0
  52. data/lib/anystyle/support.rb +4 -0
  53. data/lib/anystyle/support/finder.mod +3234 -0
  54. data/lib/anystyle/support/finder.txt +75 -0
  55. data/lib/anystyle/support/parser.mod +15025 -0
  56. data/lib/anystyle/support/parser.txt +75 -0
  57. data/lib/anystyle/utils.rb +70 -0
  58. data/lib/anystyle/version.rb +3 -0
  59. data/res/finder/bb132pr2055.ttx +6803 -0
  60. data/res/finder/bb550sh8053.ttx +18660 -0
  61. data/res/finder/bb599nz4341.ttx +2957 -0
  62. data/res/finder/bb725rt6501.ttx +15276 -0
  63. data/res/finder/bc605xz1554.ttx +18815 -0
  64. data/res/finder/bd040gx5718.ttx +4271 -0
  65. data/res/finder/bd413nt2715.ttx +4956 -0
  66. data/res/finder/bd466fq0394.ttx +6100 -0
  67. data/res/finder/bf668vw2021.ttx +3578 -0
  68. data/res/finder/bg495cx0468.ttx +7267 -0
  69. data/res/finder/bg599vt3743.ttx +6752 -0
  70. data/res/finder/bg608dx2253.ttx +4094 -0
  71. data/res/finder/bh410qk3771.ttx +8785 -0
  72. data/res/finder/bh989ww6442.ttx +17204 -0
  73. data/res/finder/bj581pc8202.ttx +2719 -0
  74. data/res/parser/bad.xml +5199 -0
  75. data/res/parser/core.xml +7924 -0
  76. data/res/parser/gold.xml +2707 -0
  77. data/res/parser/good.xml +34281 -0
  78. data/res/parser/stanford-books.xml +2280 -0
  79. data/res/parser/stanford-diss.xml +726 -0
  80. data/res/parser/stanford-theses.xml +4684 -0
  81. data/res/parser/ugly.xml +33246 -0
  82. metadata +195 -0
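The hunk below (+3578 lines) is file 67 in the list above, data/res/finder/bf668vw2021.ttx, one of the finder's training documents. Each line carries a label column (title, text, meta, blank) separated by a pipe from the corresponding line of the source PDF; a line with an empty label column continues the label of the line above. As a rough, purely illustrative sketch (not the gem's own AnyStyle::Document reader), a file in this layout could be loaded like this:

```ruby
# Minimal sketch: read a "label | text" training file into [label, text]
# pairs. An empty label column inherits the previous label, as in the diff
# below. Illustrative only; not the gem's actual implementation.
def read_tagged(path)
  label = nil
  File.foreach(path).map do |line|
    tag, text = line.chomp.split('|', 2)
    label = tag.strip unless tag.nil? || tag.strip.empty?
    [label, text.to_s.strip]
  end
end

# e.g. count how many lines carry each label in this document:
# read_tagged('data/res/finder/bf668vw2021.ttx')
#   .group_by(&:first).each { |k, v| puts "#{k}: #{v.size}" }
```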
@@ -0,0 +1,3578 @@
1
+ title | A LIGHT-WEIGHT 3-D INDOOR ACQUISITION SYSTEM
2
+ | USING AN RGB-D CAMERA
3
+ blank |
4
+ |
5
+ |
6
+ |
7
+ title | A DISSERTATION
8
+ | SUBMITTED TO THE DEPARTMENT OF ELECTRICAL
9
+ | ENGINEERING
10
+ | AND THE COMMITTEE ON GRADUATE STUDIES
11
+ | OF STANFORD UNIVERSITY
12
+ | IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
13
+ | FOR THE DEGREE OF
14
+ | DOCTOR OF PHILOSOPHY
15
+ blank |
16
+ |
17
+ |
18
+ |
19
+ text | Young Min Kim
20
+ | August 2013
21
+ | © 2013 by Young Min Kim. All Rights Reserved.
22
+ | Re-distributed by Stanford University under license with the author.
23
+ blank |
24
+ |
25
+ |
26
+ text | This work is licensed under a Creative Commons Attribution-
27
+ | Noncommercial 3.0 United States License.
28
+ | http://creativecommons.org/licenses/by-nc/3.0/us/
29
+ blank |
30
+ |
31
+ |
32
+ |
33
+ text | This dissertation is online at: http://purl.stanford.edu/bf668vw2021
34
+ blank |
35
+ text | Includes supplemental files:
36
+ | 1. Video for Chapter 4 (video_final_medium3.wmv)
37
+ | 2. Video for Chapter 2 (Reconstruct.mpg)
38
+ blank |
39
+ |
40
+ |
41
+ |
42
+ meta | ii
43
+ text | I certify that I have read this dissertation and that, in my opinion, it is fully adequate
44
+ | in scope and quality as a dissertation for the degree of Doctor of Philosophy.
45
+ blank |
46
+ text | Leonidas Guibas, Primary Adviser
47
+ blank |
48
+ |
49
+ |
50
+ text | I certify that I have read this dissertation and that, in my opinion, it is fully adequate
51
+ | in scope and quality as a dissertation for the degree of Doctor of Philosophy.
52
+ blank |
53
+ text | Bernd Girod
54
+ blank |
55
+ |
56
+ |
57
+ text | I certify that I have read this dissertation and that, in my opinion, it is fully adequate
58
+ | in scope and quality as a dissertation for the degree of Doctor of Philosophy.
59
+ blank |
60
+ text | Sebastian Thrun
61
+ blank |
62
+ |
63
+ |
64
+ |
65
+ text | Approved for the Stanford University Committee on Graduate Studies.
66
+ | Patricia J. Gumport, Vice Provost for Graduate Education
67
+ blank |
68
+ |
69
+ |
70
+ |
71
+ text | This signature page was generated electronically upon submission of this dissertation in
72
+ | electronic format. An original signed hard copy of the signature page is on file in
73
+ | University Archives.
74
+ blank |
75
+ |
76
+ |
77
+ |
78
+ meta | iii
79
+ title | Abstract
80
+ blank |
81
+ text | Large-scale acquisition of exterior urban environments is by now a well-established
82
+ | technology, supporting many applications in map searching, navigation, and com-
83
+ | merce. The same is, however, not the case for indoor environments, where access is
84
+ | often restricted and the spaces can be cluttered. Recent advances in real-time 3D
85
+ | acquisition devices (e.g., Microsoft Kinect) enable everyday users to scan complex
86
+ | indoor environments at a video rate. Raw scans, however, are often noisy, incom-
87
+ | plete, and significantly corrupted, making semantic scene understanding difficult, if
88
+ | not impossible. In this dissertation, we present ways of utilizing prior information
89
+ | to semantically understand the environments from the noisy scans of real-time 3-D
90
+ | sensors. The presented pipelines are lightweight and have the potential to allow
91
+ | users to provide feedback at interactive rates.
92
+ | We first present a hand-held system for real-time, interactive acquisition of res-
93
+ | idential floor plans. The system integrates a commodity range camera, a micro-
94
+ | projector, and a button interface for user input and allows the user to freely move
95
+ | through a building to capture its important architectural elements. The system uses
96
+ | the Manhattan world assumption, which posits that wall layouts are rectilinear. This
97
+ | assumption allows generation of floor plans in real time, enabling the operator to
98
+ | interactively guide the reconstruction process and to resolve structural ambiguities
99
+ | and errors during the acquisition. The interactive component aids users with no ar-
100
+ | chitectural training in acquiring wall layouts for their residences. We show a number
101
+ | of residential floor plans reconstructed with the system.
102
+ | We then discuss how we exploit the fact that public environments typically contain
103
+ | a high density of repeated objects (e.g., tables, chairs, monitors, etc.) in regular or
104
+ blank |
105
+ |
106
+ meta | iv
107
+ text | non-regular arrangements with significant pose variations and articulations. We use
108
+ | the special structure of indoor environments to accelerate their 3D acquisition and
109
+ | recognition. Our approach consists of two phases: (i) a learning phase wherein we
110
+ | acquire 3D models of frequently occurring objects and capture their variability modes
111
+ | from only a few scans, and (ii) a recognition phase wherein from a single scan of a
112
+ | new area, we identify previously seen objects but in different poses and locations at
113
+ | an average recognition time of 200ms/model. We evaluate the robustness and limits
114
+ | of the proposed recognition system using a range of synthetic and real-world scans
115
+ | under challenging settings.
116
+ | Last, we present a guided real-time scanning setup, wherein the incoming 3D
117
+ | data stream is continuously analyzed, and the data quality is automatically assessed.
118
+ | While the user is scanning an object, the proposed system discovers and highlights
119
+ | the missing parts, thus guiding the operator (or the autonomous robot) to “where
120
+ | to scan next”. We assess the data quality and completeness of the 3D scan data
121
+ | by comparing to a large collection of commonly occurring indoor man-made objects
122
+ | using an efficient, robust, and effective scan descriptor. We have tested the system
123
+ | on a large number of simulated and real setups, and found the guided interface to be
124
+ | effective even in cluttered and complex indoor environments. Overall, the research
125
+ | presented in the dissertation discusses how low-quality 3-D scans can be effectively
126
+ | used to understand indoor environments and allow necessary user-interaction in real-
127
+ | time. The presented pipelines are designed to be quick and effective by utilizing
128
+ | different geometric priors depending on the target applications.
129
+ blank |
130
+ |
131
+ |
132
+ |
133
+ meta | v
134
+ title | Acknowledgements
135
+ blank |
136
+ text | All the work presented in this thesis would not have been possible without help from
137
+ | many people.
138
+ | First of all, I would like to express my sincerest gratitude to my advisor, Leonidas
139
+ | Guibas. He is not only an intelligent and inspiring scholar in amazingly diverse
140
+ | topics, but also a very caring advisor with deep insights into various aspects of life.
141
+ | He guided me through one of the toughest times of my life, and I am lucky to be one
142
+ | of his students.
143
+ | During my life at Stanford, I had the privilege of working with the smartest people
144
+ | in the world learning not only about research, but also about the different mind-sets
145
+ | that lead to successful careers. I would like to thank Bernd Girod, Christian Theobalt,
146
+ | Sebastian Thrun, Vladlen Koltun, Niloy Mitra, Saumitra Das, Stephen Gould, and
147
+ | Adrian Butscher for being mentors during different stages of my graduate career. I
148
+ | also appreciate help of wonderful collaborators on exciting projects: Jana Kosecka,
149
+ | Branislav Miscusik, James Diebel, Mike Sokolsky, Jen Dolson, Dongming Yan, and
150
+ | Qixing Huang.
151
+ | The work presented here was generously supported by the following funding
152
+ | sources: Samsung Scholarship, MPC-VCC, Qualcomm corporation.
153
+ | I adore my officemates for being cheerful and encouraging, and most of all, being
154
+ | there: Derek Chan, Rahul Biswas, Stephanie Lefevre, Qixing Huang, Jonathan Jiang,
155
+ | Art Tevs, Michael Kerber, Justin Solomon, Jonathan Huang, Fan Wang, Daniel Chen,
156
+ | Kyle Heath, Vangelis Kalogerakis, and Sharath Kumar Raghvendra. I often spent
157
+ | more time with them than with any other people.
158
+ | I have to thank all the friends I met at Stanford. In particular, I would like to
159
+ blank |
160
+ |
161
+ meta | vi
162
+ text | thank Stephanie Kwan, Karen Zhu, Landry Huet, and Yiting Yeh for fun hangouts
163
+ | and random conversations in my early years. I was also fortunate enough to meet a
164
+ | wonderful chamber music group led by Dr. Herbert Myers in which I could play early
165
+ | music with Michael Peterson and Lisa Silverman. I also appreciated being able to
166
+ | participate in a wonderful WISE (Women in Science and Engineering) group. WISE
167
+ | girls have always been smart, tender and supportive. Many Korean friends at Stanford
168
+ | were like family for me here. I will not attempt to name them all, but I would like to
169
+ | especially thank Jeongha Park, Soogine Chong, Sun-Hae Hong, Jenny Lee, Ga-Young
170
+ | Suh, Joyce Lee, Hyeji Kim, Sun Goo Lee, Wookyung Kim, Han Ho Song and Su-In
171
+ | Lee. While I was enjoying my life at Stanford, I was always connected to my friends
172
+ | in Korea. I would like to express my thanks for their trust and everlasting friendship.
173
+ | Last, I cannot thank my family enough. I would like to dedicate my thesis to my
174
+ | parents, Kwang Woo Kim and Mi Ja Lee. Their constant love and trust have helped
175
+ | me overcome hardships ever since I was born. I also enjoyed having my brother, Joo
176
+ | Hwan Kim, in the Bay Area. His passion and thoughtful advice always helped me
177
+ | and cheered me up. I thank my husband, Sung-Boem Park, for being by my side no
178
+ | matter what happened. He is my best friend, and he made me face and overcome
179
+ | challenges. I also need to thank my soon-to-be born son (due in August), for allowing
180
+ | me to accelerate the last stages of my Ph. D.
181
+ | Thank you all for making me who I am today.
182
+ blank |
183
+ |
184
+ |
185
+ |
186
+ meta | vii
187
+ title | Contents
188
+ blank |
189
+ text | Abstract iv
190
+ blank |
191
+ text | Acknowledgements vi
192
+ blank |
193
+ text | 1 Introduction 1
194
+ | 1.1 Background on RGB-D Cameras . . . . . . . . . . . . . . . . . . . . 3
195
+ | 1.1.1 Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
196
+ | 1.1.2 Noise Characteristics . . . . . . . . . . . . . . . . . . . . . . . 5
197
+ | 1.2 3-D Indoor Acquisition System . . . . . . . . . . . . . . . . . . . . . 6
198
+ | 1.3 Outline of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . . 7
199
+ | 1.3.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
200
+ blank |
201
+ text | 2 Interactive Acquisition of Residential Floor Plans1 11
202
+ | 2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
203
+ | 2.2 System Overview and Usage . . . . . . . . . . . . . . . . . . . . . . . 14
204
+ | 2.3 Data Acquisition Process . . . . . . . . . . . . . . . . . . . . . . . . . 16
205
+ | 2.3.1 Pair-Wise Registration . . . . . . . . . . . . . . . . . . . . . . 19
206
+ | 2.3.2 Plane Extraction . . . . . . . . . . . . . . . . . . . . . . . . . 22
207
+ | 2.3.3 Global Adjustment . . . . . . . . . . . . . . . . . . . . . . . . 23
208
+ | 2.3.4 Map Update . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
209
+ | 2.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
210
+ | 2.5 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . 29
211
+ blank |
212
+ |
213
+ |
214
+ |
215
+ meta | viii
216
+ text | 3 Environments with Variability and Repetition 33
217
+ | 3.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
218
+ | 3.1.1 Scanning Technology . . . . . . . . . . . . . . . . . . . . . . . 35
219
+ | 3.1.2 Geometric Priors for Objects . . . . . . . . . . . . . . . . . . . 35
220
+ | 3.1.3 Scene Understanding . . . . . . . . . . . . . . . . . . . . . . . 36
221
+ | 3.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
222
+ | 3.2.1 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
223
+ | 3.2.2 Hierarchical Structure . . . . . . . . . . . . . . . . . . . . . . 40
224
+ | 3.3 Learning Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
225
+ | 3.3.1 Initializing the Skeleton of the Model . . . . . . . . . . . . . . 43
226
+ | 3.3.2 Incrementally Completing a Coherent Model . . . . . . . . . . 45
227
+ | 3.4 Recognition Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
228
+ | 3.4.1 Initial Assignment for Parts . . . . . . . . . . . . . . . . . . . 47
229
+ | 3.4.2 Refined Assignment with Geometry . . . . . . . . . . . . . . . 49
230
+ | 3.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
231
+ | 3.5.1 Synthetic Scenes . . . . . . . . . . . . . . . . . . . . . . . . . 51
232
+ | 3.5.2 Real-World Scenes . . . . . . . . . . . . . . . . . . . . . . . . 54
233
+ | 3.5.3 Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
234
+ | 3.5.4 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
235
+ | 3.5.5 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
236
+ | 3.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
237
+ blank |
238
+ text | 4 Guided Real-Time Scanning 64
239
+ | 4.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
240
+ | 4.1.1 Interactive Acquisition . . . . . . . . . . . . . . . . . . . . . . 67
241
+ | 4.1.2 Scan Completion . . . . . . . . . . . . . . . . . . . . . . . . . 67
242
+ | 4.1.3 Part-Based Modeling . . . . . . . . . . . . . . . . . . . . . . . 67
243
+ | 4.1.4 Template-Based Completion . . . . . . . . . . . . . . . . . . . 68
244
+ | 4.1.5 Shape Descriptors . . . . . . . . . . . . . . . . . . . . . . . . . 68
245
+ | 4.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
246
+ | 4.2.1 Scan Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . 70
247
+ blank |
248
+ |
249
+ meta | ix
250
+ text | 4.2.2 Shape Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . 70
251
+ | 4.2.3 Scan Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 71
252
+ | 4.3 Partial Shape Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . 71
253
+ | 4.3.1 View-Dependent Simulated Scans . . . . . . . . . . . . . . . . 72
254
+ | 4.3.2 A2h Scan Descriptor . . . . . . . . . . . . . . . . . . . . . . . 73
255
+ | 4.3.3 Descriptor-Based Shape Matching . . . . . . . . . . . . . . . . 74
256
+ | 4.3.4 Scan Registration . . . . . . . . . . . . . . . . . . . . . . . . . 75
257
+ | 4.4 Interface Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
258
+ | 4.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
259
+ | 4.5.1 Model Database . . . . . . . . . . . . . . . . . . . . . . . . . . 76
260
+ | 4.5.2 Retrieval Results with Simulated Data . . . . . . . . . . . . . 77
261
+ | 4.5.3 Retrieval Results with Real Data . . . . . . . . . . . . . . . . 78
262
+ | 4.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
263
+ blank |
264
+ text | 5 Conclusions 89
265
+ blank |
266
+ text | Bibliography 91
267
+ blank |
268
+ |
269
+ |
270
+ |
271
+ meta | x
272
+ title | List of Tables
273
+ blank |
274
+ text | 2.1 Accuracy comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . 29
275
+ blank |
276
+ text | 3.1 Parameters used in our algorithm . . . . . . . . . . . . . . . . . . . . 41
277
+ | 3.2 Models obtained from the learning phase . . . . . . . . . . . . . . . . 55
278
+ | 3.3 Statistics for the recognition phase . . . . . . . . . . . . . . . . . . . 56
279
+ | 3.4 Statistics between objects learned for each scene category . . . . . . . 59
280
+ blank |
281
+ text | 4.1 Database and scan statistics . . . . . . . . . . . . . . . . . . . . . . . 76
282
+ blank |
283
+ |
284
+ |
285
+ |
286
+ meta | xi
287
+ title | List of Figures
288
+ blank |
289
+ text | 1.1 Triangulation principle . . . . . . . . . . . . . . . . . . . . . . . . . . 4
290
+ | 1.2 Kinect sensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
291
+ blank |
292
+ text | 2.1 System overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
293
+ | 2.2 System pipeline and usage . . . . . . . . . . . . . . . . . . . . . . . . 15
294
+ | 2.3 Notation and representation . . . . . . . . . . . . . . . . . . . . . . . 17
295
+ | 2.4 Illustration for pair-wise registration . . . . . . . . . . . . . . . . . . 19
296
+ | 2.5 Optical flow and image plane correspondence . . . . . . . . . . . . . . 20
297
+ | 2.6 Silhouette points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
298
+ | 2.7 Optimizing the map . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
299
+ | 2.8 Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
300
+ | 2.9 Analysis on computational time . . . . . . . . . . . . . . . . . . . . . 27
301
+ | 2.10 Visual comparisons of the generated floor plans . . . . . . . . . . . . 31
302
+ | 2.11 A possible example of extensions . . . . . . . . . . . . . . . . . . . 32
303
+ blank |
304
+ text | 3.1 System overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
305
+ | 3.2 Acquisition pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
306
+ | 3.3 Hierarchical data structure. . . . . . . . . . . . . . . . . . . . . . . . 39
307
+ | 3.4 Overview of the learning phase . . . . . . . . . . . . . . . . . . . . . 42
308
+ | 3.5 Attachment of the model . . . . . . . . . . . . . . . . . . . . . . . . . 46
309
+ | 3.6 Overview of the recognition phase . . . . . . . . . . . . . . . . . . . . 47
310
+ | 3.7 Refining the segmentation . . . . . . . . . . . . . . . . . . . . . . . . 50
311
+ | 3.8 Recognition results on synthetic scans of virtual scenes . . . . . . . . 52
312
+ | 3.9 Chair models used in synthetic scenes . . . . . . . . . . . . . . . . . . 53
313
+ blank |
314
+ |
315
+ meta | xii
316
+ text | 3.10 Precision-recall curve . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
317
+ | 3.11 Various models learned/used in our test . . . . . . . . . . . . . . . . 55
318
+ | 3.12 Recognition results for various office and auditorium scenes . . . . . . 61
319
+ | 3.13 A close-up office scene . . . . . . . . . . . . . . . . . . . . . . . . . . 62
320
+ | 3.14 Comparison with an indoor labeling system . . . . . . . . . . . . . . 63
321
+ blank |
322
+ text | 4.1 A real-time guided scanning system . . . . . . . . . . . . . . . . . . . 65
323
+ | 4.2 Pipeline of the real-time guided scanning framework . . . . . . . . . . 69
324
+ | 4.3 Representative shape retrieval results . . . . . . . . . . . . . . . . . . 80
325
+ | 4.4 The proposed guided real-time scanning setup . . . . . . . . . . . . . 81
326
+ | 4.5 Retrieval results with simulated data using a chair data set . . . . . . 82
327
+ | 4.6 Retrieval results with simulated data using a couch data set . . . . . 83
328
+ | 4.7 Retrieval results with simulated data using a lamp data set . . . . . . 84
329
+ | 4.8 Retrieval results with simulated data using a table data set . . . . . . 85
330
+ | 4.9 Comparison between retrieval with view-dependent and merged scans 86
331
+ | 4.10 Effect of density-aware sampling . . . . . . . . . . . . . . . . . . . . . 87
332
+ | 4.11 Effect of noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
333
+ | 4.12 Real-time retrieval results on various datasets . . . . . . . . . . . . . 88
334
+ blank |
335
+ |
336
+ |
337
+ |
338
+ meta | xiii
339
+ | Chapter 1
340
+ blank |
341
+ title | Introduction
342
+ blank |
343
+ text | Acquiring a 3-D model of a real-world object, also known as 3-D reconstruction
344
+ | technology, has long been a challenge for various applications, including robotics
345
+ | navigation, 3-D modeling of virtual worlds, augmented reality, computer graphics,
346
+ | and manufacturing. In the graphics community, a 3-D model is typically acquired in a
347
+ | carefully calibrated set-up with highly accurate laser scans, followed by a complicated
348
+ | off-line process from scan registration to surface reconstruction. Because this is a very
349
+ | long process that requires special equipment, only a limited number of objects can be
350
+ | modeled, and the method cannot be scaled to larger environments.
351
+ | One of the most common applications of a large-scale 3-D reconstruction comes
352
+ | from modeling of urban environments. To build a model, a vehicle equipped with
353
+ | different sensors drives along roads and collects a large amount of data from lasers,
354
+ | GPS signals, wheel counters, cameras, etc. The data is then processed and stored in a
355
+ | compact form which includes important roads, buildings, parking lots. The mapped
356
+ | environments are used frequently in cell-phone applications, mapping technology or
357
+ | navigation tools.
358
+ | However, we cannot simply extend the same technology used in the 3-D reconstruc-
359
+ | tion of urban environments to indoor environments. First, unlike urban environments,
360
+ | where permanent roads exist, there are no clearly defined pathways that people must
361
+ | follow in an indoor environment. Occupants walk in various patterns around an in-
362
+ | door area, and often the space is cluttered, which could result in safety issues if, say,
363
+ blank |
364
+ |
365
+ meta | 1
366
+ | CHAPTER 1. INTRODUCTION 2
367
+ blank |
368
+ |
369
+ |
370
+ text | a robot with sensors drives within the area. Second, an indoor environment is not
371
+ | static. As residents and workers of the building engage in daily activities in interior
372
+ | environments, many objects are moved around or disappear, and new objects can be
373
+ | introduced. Third, interior shapes are much more complex compared to the outdoor
374
+ | surfaces of buildings, and it cannot simply be assumed that the objects present in a
375
+ | space are composed of flat surfaces as is generally the case in outdoor urban settings.
376
+ | Last, the modality of sensors used for outdoor mapping is not suitable for interior
377
+ | mapping and needs to be changed. A GPS signal does not work in indoor environ-
378
+ | ments, and the lighting conditions can vary significantly from one space to another
379
+ | compared to relatively constant sunlight outdoors.
380
+ | Yet, 3-D reconstruction of indoor environments also has a variety of potential
381
+ | applications. After a 3-D model of an indoor environment is acquired, the model
382
+ | could be used for interior design, indoor navigation, surveillance, or understanding
383
+ | the interior layouts and existence of objects in a space. Depending on the applications
384
+ | for which the reconstructed model would be used, the distance range and level of detail
385
+ | needed can vary as well.
386
+ | Recently, real-time 3-D sensors, such as the RGB-D sensors, a light-weight com-
387
+ | modity device, have been specifically designed to function in indoor environments and
388
+ | used to provide real-time 3-D data. Although the data captured from these sensors
389
+ | suffer from a limited field of view and complex noise characteristics, and therefore
390
+ | might not be suitable for accurate 3-D reconstruction, they can be used by everyday
391
+ | users to easily capture and utilize 3-D information of indoor environments. The work
392
+ | presented in this dissertation uses the data captured from RGB-D cameras with the
393
+ | goal of providing a useful 3-D acquisition while overcoming the limitations of the
394
+ | captured data. To do this, we have assumed different geometric priors depending on
395
+ | the targeted applications.
396
+ | In the remainder of this chapter, we first describe the characteristics of RGB-
397
+ | D camera sensors (Section 1.1). The subsequent section (Section 1.2) presents our
398
+ | approach to acquire 3-D indoor environments. The chapter concludes with an outline
399
+ | of the remainder of the dissertation (Section 1.3).
400
+ meta | CHAPTER 1. INTRODUCTION 3
401
+ blank |
402
+ |
403
+ |
404
+ title | 1.1 Background on RGB-D Cameras
405
+ text | Building a 3-D model of actual objects enables the real world to be connected to a
406
+ | virtual world. After obtaining a digital model from a real-world object, the model can
407
+ | be used in various applications. A benefit of 3D modeling is that the digital object
408
+ | can be saved and altered freely without an actual space being damaged or destroyed.
409
+ | Until recently, it was not possible for non-expert users to capture real-world envi-
410
+ | ronments in 3D because of the complexity and cost of the required equipment. RGB-D
411
+ | cameras, which provide real-time depth and color information, only became available
412
+ | a few years ago. The pioneering commodity product is the Xbox Kinect [Mic10],
413
+ | launched in November 2010. Originally developed as a gaming device, the sensor pro-
414
+ | vides real-time depth streams enabling interaction between a user and a system.
415
+ | The Kinect is affordable and easy to operate for non-expert users, and the pro-
416
+ | duced data can be accessed through open-source drivers. Although the main purpose
417
+ | of the Kinect thus far has been motion-sensing, providing a real-time interface for gam-
418
+ | ing or control, the device has served many purposes and has been used as a tool to
419
+ | develop personalized applications with the help of the drivers. Some developers also
420
+ | use the device to extend computer vision-related tasks (such as object recognition
421
+ | or structure from motion) but with depth measurements augmented as an additional
422
+ | modality of input. In addition, the device can also be viewed as a 3-D sensor that
423
+ | produces 3-D pointcloud data. In our work, this is how we view the device, and the
424
+ | goal of the research presented here, as noted above, was to acquire 3-D indoor objects
425
+ | or environments using the RGB-D cameras of the Kinect sensor.
426
+ blank |
427
+ |
428
+ title | 1.1.1 Technology
429
+ text | The underlying core technology of the depth-capturing capacity of Kinect comes
430
+ | from its structured-light 3D scanner. This scanner measures the three-dimensional
431
+ | shape of an object using projected light patterns and a camera system. A typical
432
+ | scanner measuring assembly consists of one stripe projector and at least one camera.
433
+ | Projecting a narrow band of light onto a three-dimensionally shaped surface produces
434
+ | a line of illumination that appears distorted from perspectives other than that of the
435
+ meta | CHAPTER 1. INTRODUCTION 4
436
+ blank |
437
+ |
438
+ |
439
+ |
440
+ text | Figure 1.1: Triangulation principle shown by one of multiple stripes (image from
441
+ | http://en.wikipedia.org/wiki/File:1-stripesx7.svg)
442
+ blank |
443
+ text | projector, and this line can be used for an exact geometric reconstruction of the
444
+ | surface shape. A sample setup with the projected line pattern is shown in Figure 1.1.
445
+ | The displacement of the stripes can be converted into 3D coordinates, which allow
446
+ | any details on an object’s surface to be retrieved.
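As a worked illustration of the stripe-displacement idea just described, the geometry reduces to simple triangulation: a pattern feature projected from a source offset by a baseline appears displaced in the camera image, and the displacement determines depth. The sketch below is only an illustration; the focal length and baseline values are hypothetical placeholders, not calibration data from the text.

```ruby
# Triangulation sketch: a projected feature displaced by `disparity_px`
# pixels, seen with focal length `focal_px` (pixels) and projector-camera
# baseline `baseline_m` (meters), gives depth z = f * b / d.
# The default values are hypothetical placeholders.
def depth_from_disparity(disparity_px, focal_px: 580.0, baseline_m: 0.075)
  return nil if disparity_px <= 0  # no measurable displacement, no depth
  focal_px * baseline_m / disparity_px
end

# With these placeholder values, a 20-pixel displacement corresponds to
# roughly 2.2 m:
# puts depth_from_disparity(20.0)
```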
447
+ | An invisible structured-light scanner scans a 3-D shape of an object by projecting
448
+ | patterns with light in an invisible spectrum. The Kinect uses projecting patterns
449
+ | composed of points in infrared (IR) light to generate video data in 3D. As shown in
450
+ | Figure 1.2, the Kinect is a horizontal bar with an IR light emitter and IR sensor. The
451
+ | IR emitter emits infrared light beams, and the IR sensor reads the IR beams reflected
452
+ | back to the sensor. The reflected beams are converted into depth information that
453
+ | measures the distance between an object and the sensor. This makes capturing a
454
+ | depth image possible. The color sensor captures normal video (visible light) that is
455
+ | synchronized with the depth data. The horizontal bar of the Kinect also contains
456
+ | microphone arrays and is connected to a small base by a tilt motor. While the color
457
+ | video and microphone provide additional means for a natural user interface, in this
458
+ meta | CHAPTER 1. INTRODUCTION 5
459
+ blank |
460
+ |
461
+ |
462
+ |
463
+ text | Figure 1.2: Kinect sensor (left) and illustration of the integrated hardware (right).
464
+ | (images from http://i.msdn.microsoft.com/dynimg/IC568992.png and http://
465
+ | i.msdn.microsoft.com/dynimg/IC584396.png)
466
+ blank |
467
+ text | dissertation, we are focused on the depth-sensing capability of the device.
468
+ | The Kinect has a limited working range, mainly designed for the volume that a
469
+ | person will require while playing a game. Kinect’s official documentation1 suggests
470
+ | a working range from 0.8 m to 4 m from the sensor. The sensor has an angular field
471
+ | of view of 57° horizontally and 43° vertically. When an object is out of range for
472
+ | a particular pixel, the system will return no values. The RGB video streams are
473
+ | produced in a 1280×960 resolution. However, the default RGB video stream uses 8-
474
+ | bit VGA resolution (640×480 pixels). The monochrome depth sensing video stream
475
+ | is also in VGA resolution with 11-bit depth, which provides 2,048 levels of sensitivity.
476
+ | The depth and color streams are produced at a frame rate of 30 Hz.
477
+ | The depth data is originally produced as a 2-D grid of raw depth values. The
478
+ | values in each pixel can then be converted into (x, y, z) coordinates with calibration
479
+ | data. Depending on the application, the developer can regard the 2-D grid of values
480
+ | as a depth image, or the scattered points in 3-D ((x, y, z) coordinates) as unstructured
481
+ | pointcloud data.
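As a concrete, purely illustrative sketch of that conversion, a raw depth pixel can be back-projected through a pinhole model. The intrinsics below are hypothetical stand-ins for the per-device calibration data mentioned above, and a non-positive reading is treated as the "no value returned" case described earlier.

```ruby
# Illustrative back-projection of a depth pixel (u, v) to an (x, y, z) point.
# FX, FY, CX, CY are hypothetical intrinsics standing in for real
# calibration data; a non-positive reading is treated as missing.
FX = FY = 580.0
CX, CY = 319.5, 239.5

def pixel_to_point(u, v, depth_m)
  return nil if depth_m.nil? || depth_m <= 0.0
  [(u - CX) * depth_m / FX, (v - CY) * depth_m / FY, depth_m]
end

# Converting every valid pixel of a 640x480 depth frame gives the
# unstructured pointcloud view of the same data:
# cloud = (0...480).flat_map { |v|
#   (0...640).map { |u| pixel_to_point(u, v, depth[v][u]) }
# }.compact
```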
482
+ blank |
483
+ |
484
+ title | 1.1.2 Noise Characteristics
485
+ text | While RGB-D cameras can provide real-time depth information, the obtained mea-
486
+ | surements exhibit convoluted noise characteristics. The measurements are extracted
487
+ meta | 1
488
+ text | http://msdn.microsoft.com/en-us/library/jj131033.aspx
489
+ meta | CHAPTER 1. INTRODUCTION 6
490
+ blank |
491
+ |
492
+ |
493
+ text | from identification of corresponding points of infrared projections in image pixels,
494
+ | and there are multiple possible sources of errors: (i) calibration error, since both the
495
+ | extrinsic calibration parameters, which are given as the displacement between the
496
+ | projector and cameras, and the intrinsic calibration parameters, which depend on
497
+ | the focal points and size of pixels on the sensor grid, vary for each product; (ii)
498
+ | distance-dependent quantization error – because the accuracy of measurements de-
499
+ | pends on the resolution of a pixel compared to the details of projected pattern on
500
+ | the measured object, measurements are more noisy for farther points with more se-
501
+ | vere quantization artifacts; (iii) error from ambiguous or poor projection, in which
502
+ | the cameras cannot clearly observe the projected patterns – as the measurements are
503
+ | made by identifying the projected location of the infrared pattern, the distortion of
504
+ | the projected patterns on depth boundaries or on reflective material can result in
505
+ | wrong measurements. Sometimes the system cannot locate the corresponding points
506
+ | due to occlusion by parallax or distance range, and the data is reported as missing.
507
+ | In short, the depth data exhibits highly non-linear noise characteristics, and it is very
508
+ | hard to model all of the noise analytically.
509
+ blank |
510
+ |
511
+ title | 1.2 3-D Indoor Acquisition System
512
+ text | Given the complex noise characteristics of RGB-D cameras, we assumed that the de-
513
+ | vice produces noisy pointcloud data. Instead of reverse-engineering and correcting the
514
+ | noise from each source, we overcame the limitation on data by imposing assumptions
515
+ | on the 3-D shape of the objects being scanned.
516
+ | There are three possible ways to reconstruct 3-D models from noisy data. The first
517
+ | is to overcome the limitation of the data by accumulating multiple frames from slightly dif-
518
+ | ferent viewpoints [IKH+ 11]. By averaging the noise measurements and merging them
519
+ | into a single volumetric structure, a very high-quality mesh model can be recovered.
520
+ | The second is using a machine learning-based method. In this approach, multiple
521
+ | instances of measurements and actual object labels are first collected. Classifiers are
522
+ | then trained to produce the object labels given the measurements and later used to
523
+ | understand the given measurements. The third way is to assume geometric priors on
524
+ meta | CHAPTER 1. INTRODUCTION 7
525
+ blank |
526
+ |
527
+ |
528
+ text | the data being captured. Assuming that the underlying scene is not completely ran-
529
+ | dom, the shape to be reconstructed has a limited degree of freedom, and can thus be
530
+ | reconstructed by inferring the most probable shape within the scope of the assumed
531
+ | structure.
532
+ | This third way is the method used in our work. By focusing on acquiring the pre-
533
+ | defined modes or degree of freedom given the geometric priors, the acquired model
534
+ | naturally captures high-level information about the structure. In addition, the acquisition
535
+ | pipeline becomes lightweight and the entire process can stay real-time. Because the in-
536
+ | put data stream is also real-time, there is the possibility of incorporating user-interaction
537
+ | during the capturing process.
538
+ blank |
539
+ |
540
+ title | 1.3 Outline of the Dissertation
541
+ text | The chapters to follow, outlined below, discuss in detail the specific approaches we
542
+ | took to mitigate the problems inherent in indoor reconstruction from noisy sensor
543
+ | data.
544
+ | Chapter 2 discusses a pipeline used to acquire floor plans in residential areas. The
545
+ | proposed system is quick and convenient compared to the common pipeline used to
546
+ | acquire floor plans from manual sketching and measurements, which are frequently
547
+ | required for remodeling or selling a property. We posit that the world is composed of
548
+ | relatively large, flat surfaces that meet at right angles. We focus on continuous collec-
549
+ | tion of points that occupy large, flat areas and align with the axes, ignoring other
550
+ | points. Even with very noisy data, the process can be performed at an interactive
551
+ | rate since the space of possible plane arrangements is sparse given the measurements.
552
+ | We take advantage of real-time data and allow users to provide intuitive feedback
553
+ | to assist the acquisition pipeline. The research described in the chapter was first
554
+ | published as Y.M. Kim, J. Dolson, M. Sokolsky, V. Koltun, S.Thrun, Interactive
555
+ | Acquisition of Residential Floor Plans, IEEE International Conference on Robotics
556
+ | and Automation (ICRA), 2012 © 2012 IEEE, and the contents were also replicated
557
+ | with small modifications.
558
+ meta | CHAPTER 1. INTRODUCTION 8
559
+ blank |
560
+ |
561
+ |
562
+ text | Chapter 3 discusses how we targeted public spaces with many repeating ob-
563
+ | jects in different poses or variation modes. Even though indoor environments can
564
+ | frequently change, we can identify patterns and possible movements by reasoning
565
+ | at the object level. Especially in public buildings (offices, cafeterias, auditoriums, and
566
+ | seminar rooms), chairs, tables, monitors, etc, are repeatedly used in similar pat-
567
+ | terns. We first build abstract models of the objects of interest with simple geometric
568
+ | primitives and deformation modes. We then use the built models to quickly de-
569
+ | tect the objects of interest within an indoor scene in which the objects repeatedly ap-
570
+ | pear. While the models are simple approximations of the actual complex geometry, we
571
+ | demonstrate that the models are sufficient to detect the object within noisy, par-
572
+ | tial indoor scene data. The learned variability modes not only factor out nuisance
573
+ | modes of variability (e.g., motions of chairs, etc.) from meaningful changes (e.g.,
574
+ | security, where the new scene objects should be flagged), but also provide the func-
575
+ | tional modes of the object (the status of open drawers, closed laptop, etc.), which
576
+ | potentially provide high-level understanding of the scene. The study discussed here
577
+ | first appeared as a publication, Young Min Kim, Niloy J. Mitra, Dong-Ming Yan,
578
+ | and Leonidas Guibas. 2012. Acquiring 3D indoor environments with variability and
579
+ | repetition. ACM Trans. Graph. 31, 6, Article 138 (November 2012), 11 pages.
580
+ | DOI=10.1145/2366145.2366157 http://doi.acm.org/10.1145/2366145.2366157, from
581
+ | which the major written parts of the chapter were adapted.
582
+ | Chapter 4 discusses a reconstruction approach that utilizes 3-D models down-
583
+ | loaded from the web to assist in understanding the objects being scanned. The data
584
+ | stream from an RGB-D camera is noisy and exhibits a lot of missing data, making it
585
+ | very hard to accurately build a full model of an object being scanned. We take the
586
+ | approach of using a large database of 3-D models to match against partial, noisy scans
587
+ | of the input data stream. To this end, we propose a simple, efficient, yet discrimina-
588
+ | tive descriptor that can be evaluated in real-time and used to process complex indoor
589
+ | scenes. The matching models are quickly found from the database with the help of our
590
+ | proposed shape descriptor. This also allows real-time assessment of the quality of the
591
+ | data captured, and the system provides the user with real-time feedback on where to
592
+ | scan. Eventually the user can retrieve the closest model as quickly as possible during
593
+ meta | CHAPTER 1. INTRODUCTION 9
594
+ blank |
595
+ |
596
+ |
597
+ text | the scanning session. The research and contents of the chapter will be published as
598
+ | Y.M. Kim, N. Mitra, Q. Huang, L. Guibas, Guided Real-Time Scanning of Indoor
599
+ | Environments, Pacific Graphics 2013.
600
+ | Chapter 5 concludes the dissertation with a summary of our work and a discussion
601
+ | of future directions this research could take.
602
+ blank |
603
+ |
604
+ title | 1.3.1 Contributions
605
+ text | The major contribution of the dissertation is to present methods to quickly acquire
606
+ | 3-D information from noisy, occluded pointcloud data by assuming geometric pri-
607
+ | ors. The pre-defined modes not only provide high-level understanding of the current
608
+ | mode, but also allow the data size to stay compact, which, in turn, saves memory
609
+ | and processing time. The proposed geometric priors have been previously used for
610
+ | different settings, but our approach incorporates the priors tuned for the practical
611
+ | tasks at hand with real scans from RGB-D data acquired from actual environments.
612
+ | The example geometric priors that are covered are as follows:
613
+ blank |
614
+ text | • Based on the Manhattan-world assumption, important architectural elements (walls,
615
+ | floor and ceiling) can be retrieved in real-time.
616
+ blank |
617
+ text | • By building an abstract model composed of simple geometric primitives and joint
618
+ | information between primitives, objects under severe occlusion and different
619
+ | configurations can be located. The bottom-up approach can quickly populate
620
+ | large indoor environments with variability and repetition (around 200 ms per
621
+ | object).
622
+ blank |
623
+ text | • An online public database of 3-D models recovers the structure of objects from
624
+ | partial, noisy scans in a matter of seconds. We developed a relation-based
625
+ | lightweight descriptor for fast and accurate model retrieval and positioning.
626
+ blank |
627
+ text | We also take advantage of the representation and demonstrate a quick and effi-
628
+ | cient pipeline, including user-interaction when possible. More specifically, we demon-
629
+ | strate the following novel prototypes of systems:
630
+ meta | CHAPTER 1. INTRODUCTION 10
631
+ blank |
632
+ |
633
+ |
634
+ text | • A new hand-held system with which a user can capture a space and automatically
635
+ | generate a floor plan. The user does not have to measure distances or manually
636
+ | sketch the layout.
637
+ blank |
638
+ text | • A projector attached to the RGB-D camera to communicate the current status of
639
+ | the acquisition on the physical surface with the user, thus allowing the user to provide
640
+ | intuitive feedback.
641
+ blank |
642
+ text | • A real-time guided scanning setup for online quality assessment of streaming
643
+ | RGB-D data, obtained with the help of a 3-D database of models.
644
+ blank |
645
+ text | While the specific geometric priors and prototypes listed above come from an under-
646
+ | standing of the characteristics of the task at hand, the underlying assumptions and
647
+ | approach provide a direction that allows everyday users to acquire useful 3-D information
648
+ | in the years to come as real-time 3-D scans become available.
649
+ meta | Chapter 2
650
+ blank |
651
+ title | Interactive Acquisition of
652
+ | Residential Floor Plans1
653
+ blank |
654
+ text | Acquiring an accurate floor plan of a residence is a challenging task, yet one that
655
+ | is required for many situations, such as remodeling or sale of a property. Original
656
+ | blueprints can be difficult to find, especially for older residences. In practice, contrac-
657
+ | tors and interior designers use point-to-point laser measurement devices to acquire
658
+ | a set of distance measurements. Based on these measurements, an expert creates a
659
+ | floor plan that respects the measurements and represents the layout of the residence.
660
+ | Both taking measurements and representing the layout are cumbersome manual tasks
661
+ | that require experience and time.
662
+ | In this chapter, we present a hand-held system for indoor architectural reconstruc-
663
+ | tion. This system eliminates the manual post-processing necessary for reconstructing
664
+ | the layout of walls in a residence. Instead, an operator with no architectural exper-
665
+ | tise can interactively guide the reconstruction process by moving freely through an
666
+ meta | 1
667
+ text | The contents of the chapter were originally published as Y.M. Kim, J. Dolson, M. Sokolsky, V.
668
+ | Koltun, S.Thrun, Interactive Acquisition of Residential Floor Plans, IEEE International Conference
669
+ | on Robotics and Automation (ICRA), 2012 © 2012 IEEE.
670
+ | In reference to IEEE copyrighted material which is used with permission in this thesis, the
671
+ | IEEE does not endorse any of Stanford University’s products or services. Internal or personal
672
+ | use of this material is permitted. If interested in reprinting/republishing IEEE copyrighted material
673
+ | for advertising or promotional purposes or for creating new collective works for resale or redis-
674
+ | tribution, please go to http://www.ieee.org/publications_standards/publications/rights/
675
+ | rights_link.html to learn how to obtain a License from RightsLink.
676
+ blank |
677
+ |
678
+ meta | 11
679
+ | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 12
680
+ blank |
681
+ |
682
+ |
683
+ |
684
+ |
685
+ |
686
+ |
687
+ |
688
+ |
689
+ |
690
+ |
691
+ |
692
+ |
693
+ |
694
+ |
695
+ |
696
+ |
697
+ |
698
+ |
699
+ |
700
+ |
701
+ |
702
+ |
703
+ |
704
+ |
705
+ |
706
+ |
707
+ |
708
+ |
709
+ |
710
+ |
711
+ |
712
+ |
713
+ |
714
+ |
715
+ |
716
+ |
717
+ |
718
+ |
719
+ |
720
+ text | Figure 2.1: Our hand-held system is composed of a projector, a Microsoft Kinect
721
+ | sensor, and an input button (left). The system uses augmented reality feedback
722
+ | (middle left) to project the status of the current model onto the environment and to
723
+ | enable real-time acquisition of residential wall layouts (middle right). The floor plan
724
+ | (middle right) and visualization (right) were generated using data captured by our
725
+ | system.
726
+ blank |
727
+ text | interior with the hand-held system until all walls have been observed by the sensor
728
+ | in the system.
729
+ | Our system is composed of a laptop connected to an RGB-D camera, a lightweight
730
+ | optical projector, and an input button interface (Figure 2.1, left). The RGB-D cam-
731
+ | era is a real-time depth sensor that acts as the main input modality. As noted in
732
+ | Chapter 1, we use the Microsoft Kinect, a lightweight commodity device that out-
733
+ | puts VGA-resolution range and color images at video rates. The data is processed
734
+ | in real time to create the floor plan by focusing on large flat surfaces and ignoring
735
+ | clutter. The generated floor plan can be used directly for remodeling or real-estate
736
+ | applications or to produce a 3D model of the interior for applications in virtual envi-
737
+ | ronments. In Section 2.4, we present and discuss a number of residential wall layouts
738
+ | reconstructed with our system, captured from actual apartments. Even though the
739
+ | results presented here were obtained in residential spaces, the system can also
740
+ | be used in other types of interior environments.
741
+ | The attached projector is initially calibrated to have an overlapping field of view
742
+ | with the same image center as the depth sensor. It projects the reconstruction status
743
+ | onto the surface being scanned. Under normal lighting, the projector does not provide
744
+ | a sophisticated rendering. Rather, the projection allows the user to visualize the
745
+ | reconstruction process. The user can then detect reconstruction errors that arise due
746
+ | to deficiencies in the data capture path and can complete missing data in response.
747
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 13
748
+ blank |
749
+ |
750
+ |
751
+ text | The user can also note which walls have been included in the model and easily resolve
752
+ | ambiguities with a simple input device. The proposed system has advantages over
753
+ | other previous applications by allowing a new type of user interaction in real time that
754
+ | focuses only on architectural elements relevant to the task at hand. This difference
755
+ | is discussed in detail in the following section.
756
+ blank |
757
+ |
758
+ title | 2.1 Related Work
759
+ text | A number of approaches have been proposed for indoor reconstruction in computer
760
+ | graphics, computer vision, and robotics. Real-time indoor reconstruction using either
761
+ | a depth sensor [HKH+ 12] or an optical camera [ND10] has been recently explored.
762
+ | The results of these studies suggest that the key to real-time performance is the
763
+ | fast registration of successive frames. Similar to [HKH+ 12], we fuse both color and
764
+ | depth information to register frames. Furthermore, our approach extends real-time
765
+ | acquisition and reconstruction by allowing the operator to visualize the current re-
766
+ | construction status without consulting a computer screen. Because the feedback loop
767
+ | in our system is immediate, the operator can resolve failures and ambiguities while
768
+ | the acquisition session is in progress.
769
+ | Previous approaches have also been limited to a dense 3-D reconstruction (reg-
770
+ | istration of point cloud data) with no higher-level information, which is memory
771
+ | intensive. A few exceptions exist, such as [GCCMC08], in which high-level fea-
772
+ | tures (lines and planes) are detected to reduce complexity and noise. The high-level
773
+ | structures, however, do not necessarily correspond to actual architectural elements,
774
+ | such as walls, floors, or ceilings. In contrast, our system identifies and focuses on
775
+ | significant architectural elements using the Manhattan-world assumption, which is
776
+ | based on the observation that many indoor scenes are largely rectilinear [CY99]. This
777
+ | assumption is widely made for indoor scene reconstruction from images to overcome
778
+ | the inherent limitations of image data [FCSS09][VAB10]. While the traditional stereo
779
+ | method only reconstructs 3-D locations of image feature points, the Manhattan-world
780
+ | assumption successfully fills an area between the sparse feature points during post-
781
+ | processing. Our system, based on the Manhattan-world assumption, differentiates
782
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 14
783
+ blank |
784
+ |
785
+ |
786
+ text | between architectural features and miscellaneous objects in the space, producing a
787
+ | clean architectural floor plan and simplifying the representation of the environment.
788
+ | Even with the Manhattan-world assumption, however, the system still cannot fully
789
+ | resolve ambiguities introduced by large furniture items and irregular features in the
790
+ | space without user input. The interactive capability offered by our system allows the
791
+ | user to easily disambiguate the situation and integrate new input into a global map
792
+ | of the space in real time.
793
+ | Not only does our system simplify the representation of the feature of a space, but
794
+ | by doing so it reduces the computational burden of processing a map. Employing the
795
+ | Manhattan-world assumption simplifies the map construction to a one-dimensional,
796
+ | closed-form problem. Registration of successive point clouds results in an accumula-
797
+ | tion of errors, especially for a large environment, and requires a global optimization
798
+ | step in order to build a consistent map. This is similar to reconstruction tasks en-
799
+ | countered in robotic mapping. In other approaches, the problem is usually solved by
800
+ | bundle adjustment, a costly off-line process [TMHF00][Thr02].
801
+ | The augmented reality component of our system is inspired by the SixthSense
802
+ | project [MM09]. Instead of simply augmenting a user’s view of the world, however,
803
+ | our projected output serves to guide an interactive reconstruction process. Directing
804
+ | the user in this way is similar to re-photography [BAD10], where a user is guided
805
+ | to capture a photograph from the same viewpoint as in a previous photograph. By
806
+ | using a micro-projector as the output modality, our system allows the operator to
807
+ | focus on interacting with the environment.
808
+ blank |
809
+ |
810
+ title | 2.2 System Overview and Usage
811
+ text | The data acquisition process is initiated by the user pointing the sensor to a corner,
812
+ | where three mutually orthogonal planes meet. This corner defines the Manhattan-
813
+ | world coordinate system. The attached projector indicates successful initialization by
814
+ | overlaying blue-colored planes with white edges onto the scene (Figure 2.2 (a)). After
815
+ | the initialization, the user scans each room individually as he or she loops around in
816
+ | it holding the device. If the movement is too fast or if there are not enough features,
817
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 15
818
+ blank |
819
+ |
820
+ |
821
+ |
822
+ text | [Figure 2.2 diagram labels: Fetch a new frame; Initialization; Pair-wise registration; Plane extraction; Global adjustment (exists/new, success/failure); Map update; User interaction (visual feedback, adjust data path; left click: select planes; right click: start a new room)]
840
+ blank |
841
+ |
842
+ |
843
+ |
844
+ text | (a) (b) (c)
845
+ blank |
846
+ |
847
+ text | Figure 2.2: System overview and usage. When an acquisition session is initiated by
848
+ | observing a corner, the user is notified by a blue projection (a). After the initial-
849
+ | ization, the system updates the camera pose by registering consecutive frames. If a
850
+ | registration failure occurs, the user is notified by a red projection and is required to
851
+ | adjust the data capture path (b). Otherwise, the updated camera configuration is
852
+ | used to detect planes that satisfy the Manhattan-world assumption in the environ-
853
+ | ment and to integrate them into the global map. The user interacts with the system
854
+ | by selecting planes in the space (c). When the acquisition session is completed, the
855
+ | acquired map is used to construct a floor plan consisting of user-selected planes.
856
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 16
857
+ blank |
858
+ |
859
+ |
860
+ text | a red projection on the surface guides the user to recover the position of the device
861
+ | (Figure 2.2 (b)) and re-acquire that area.
862
+ | The system extracts flat surfaces that align with the Manhattan coordinate system
863
+ | and creates complete rectilinear polygons, even when connectivity between planes is
864
+ | occluded. At times, the user might not want some of the extracted planes (parts
865
+ | of furniture or open doors) to be included in the model even if these planes satisfy
866
+ | the Manhattan-world assumption. In these cases, when the user clicks the input
867
+ | button (left click), the extracted wall toggles between inclusion (indicated in blue)
868
+ | and exclusion (indicated in grey) to the model (Figure 2.2 (c)). As the user finishes
869
+ | scanning a room, he or she can move to another room and scan it. A new rectilinear
870
+ | polygon is initiated by a right click. Another rectilinear polygon is similarly created
871
+ | by including the selected planes, and the room is correctly positioned into the global
872
+ | coordinate system. The model is updated in real time and stored in either a CAD
873
+ | format or a 3-D mesh format that can be loaded into most 3-D modeling software.
874
+ blank |
875
+ |
876
+ title | 2.3 Data Acquisition Process
877
+ text | Some notations used throughout the section are introduced in Figure 2.3. At each
878
+ | time step t, the sensor produces a new frame of data, Ft = {Xt , It }, composed
879
+ | of a range image Xt (a 2-D array of depth measurements) and a color image It ,
880
+ | Figure 2.3(a). T t represents the transformation from the frame Ft , measured from
881
+ | the current sensor position, to the global coordinate system, which is where the map
882
+ | Mt = {Ltr , Rtr } is defined, Figure 2.3(b). Throughout the data capture session, the
883
+ | system maintains the global map Mt , and the two most recent frames, Ft−1 and Ft
884
+ | to update the transformation information. Instead of storing information from all
885
+ | frames, the system keeps the total computational and memory requirements minimal
886
+ | by incrementally updating the global map only with components that need to be
887
+ | added to the final model. Additionally, the frame with the last observed corner Fc is
888
+ | stored to recover the sensor position when lost.
889
+ | After the transformation is found, the relationship between the planes in global
890
+ | map Mt and the measurement in the current frame Xt is represented as Pt , a 2-D
891
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 17
892
+ blank |
893
+ |
894
+ |
895
+ |
896
+ text | [Figure 2.3 diagram: (a) frame F^t with range image X^t, color image I^t, and
+ | transformation T^t(F^t); (b) observed planes L_r^t; (c) plane labels P^t;
+ | (d) rectilinear polygon R_r^t; individual planes carry labels such as P0, P2-P8.]
917
+ blank |
918
+ text | Figure 2.3: Notation and representation. Each frame of the sensor Ft is composed of
919
+ | a 2-D array of depth measurements Xt and color image It (a). The global map Mt
920
+ | is composed of a sequence of observed planes Ltr (b) and loops of rectilinear polygons
921
+ | built from the planes Rtr (d). After the registration of the current frame T t is found
922
+ | with respect to the global coordinate system, planes Pt are extracted (c), and the system
923
+ | automatically updates the room structure based on the observation Rtr (d).
924
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 18
925
+ blank |
926
+ |
927
+ |
928
+ text | array of plane labels for each pixel, Figure 2.3(c). The map Mt is composed of lists of
929
+ | observed axis-parallel planes Ltr and loops of current room structure Rtr , defined with
930
+ | subsets of the planes from Ltr . Each plane has its axis label (x, y, or z) and the offset
931
+ | value (e.g., x = x0 ), as well as its left or right plane if the connectivity is observed. A
932
+ | plane can be selected (shown as solid line in Figure 2.3(b)) or ignored (dotted line in
933
+ | Figure 2.3(b)) based on user input. The selected planes are extracted from Ltr as the
934
+ | loop of the room Rtr , which can be converted into the floor plan as a 2-D rectilinear
935
+ | polygon. To have a fully connected rectilinear polygon per room, Rtr is constrained
936
+ | to have alternating axis labels (x and y). For the z direction (vertical direction), the
937
+ | system retains only the ceiling and the floor. The system also keeps the sequence of
938
+ | observation (S x , S y , and S z ) of offset values for each axis direction, and stores the
939
+ | measured distance and the uncertainty of the measurement between planes.
940
+ | The overall reconstruction process is summarized in Figure 2.2. As mentioned in
941
+ | Sec. 2.2, this process is initiated by extracting three mutually orthogonal planes when
942
+ | a user points the system to one of the corners of a room. To detect planes in the range
943
+ | data, our system fits plane equations to groups of range points and their corresponding
944
+ | normals using the RANSAC algorithm [FB81]: the system first randomly samples a
945
+ | few points, then fits a plane equation to them. The system then tests the detected
946
+ | plane by counting the number of points that can be explained by the plane equation.
947
+ | After convergence, the detected plane is classified as valid only if the detected points
948
+ | constitute a large, connected portion of the depth information within the frame. If
949
+ | there are three planes detected, and they are orthogonal to each other, our system
950
+ | assigns the x, y and z axes to be the normal directions of these three planes, which
951
+ | form the right-handed coordinate system for our Manhattan world. Now the map Mt
952
+ | has two planes (the floor or ceiling is excluded), and the transformation T t between
953
+ | Mt and Ft is also found.
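+ text | As a rough illustration of this initialization step, the sketch below fits a dominant
+ | plane with RANSAC and checks that three detected normals are mutually orthogonal. It is
+ | a minimal sketch only: the function names, iteration counts, and tolerances are
+ | illustrative and are not taken from the thesis implementation.
+ blank |
+ | import numpy as np
+ |
+ | def fit_plane(points):
+ |     """Least-squares plane through an (N, 3) array: returns (unit normal n, offset d) with n.x = d."""
+ |     centroid = points.mean(axis=0)
+ |     # The singular vector of the smallest singular value of the centered points is the normal.
+ |     _, _, vt = np.linalg.svd(points - centroid)
+ |     normal = vt[-1]
+ |     return normal, float(normal @ centroid)
+ |
+ | def ransac_plane(points, iters=200, tol=0.02):
+ |     """Detect the dominant plane by repeatedly fitting 3-point samples (tol in meters)."""
+ |     best = None
+ |     for _ in range(iters):
+ |         sample = points[np.random.choice(len(points), 3, replace=False)]
+ |         normal, d = fit_plane(sample)
+ |         inliers = np.abs(points @ normal - d) < tol
+ |         if best is None or inliers.sum() > best.sum():
+ |             best = inliers
+ |     return fit_plane(points[best]), best
+ |
+ | def mutually_orthogonal(normals, tol_deg=5.0):
+ |     """True if three plane normals are pairwise orthogonal within tol_deg degrees."""
+ |     tol = np.sin(np.radians(tol_deg))
+ |     return all(abs(normals[i] @ normals[j]) < tol
+ |                for i, j in [(0, 1), (0, 2), (1, 2)])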
954
+ | A new measurement Ft is registered with the previous frame Ft−1 by aligning
955
+ | depth and color features (Sec. 2.3.1). This registration is used to update T t−1 to a
956
+ | new transformation T t . The system extracts planes that satisfy the Manhattan-world
957
+ | assumption from T t (Ft ) (Sec. 2.3.2). If the extracted planes already exist in Ltr , the
958
+ | current measurement is compared with the global map and the registration is refined
959
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 19
960
+ blank |
961
+ |
962
+ |
963
+ |
964
+ text | (a) (b) (c) (d)
965
+ blank |
966
+ text | Figure 2.4: (a) Flat wall features (depicted by the triangle and circle) are observed
967
+ | from two different locations. Diagram (b) shows both observations with respect to
968
+ | the camera coordinate system. Without features, using projection-based ICP can
969
+ | lead to registration errors in the image-plane direction (c), while the use of features
970
+ | will provide better registration (d).
971
+ blank |
972
+ text | (Sec. 2.3.3). If there is a new plane extracted, or if there is user input to specify the
973
+ | map structure, the map is updated accordingly (Sec. 2.3.4).
974
+ blank |
975
+ |
976
+ title | 2.3.1 Pair-Wise Registration
977
+ text | To propagate information from previous frames and to detect new planes in the scene,
978
+ | each incoming frame must be registered with respect to the global coordinate system.
979
+ | To start this process, the system finds the relative registration between the two most
980
+ | recent frames, Ft−1 and Ft . By using both the depth point clouds (Xt−1 , Xt ) and
981
+ | optical images (It−1 , It ), the system can efficiently register frames in real time (about
982
+ | 15 fps).
983
+ | Given two sets of point clouds, X^{t-1} = \{x_i^{t-1}\}_{i=1}^N and X^t = \{x_i^t\}_{i=1}^N, and the
985
+ | transformation for the previous point cloud T t−1 , the correct rigid transformation T t
986
+ | will minimize the error between correspondences in the two sets:
987
+ blank |
988
+ text | \min_{y_i^t,\, T^t} \; \sum_i \big\| w_i \big( T^{t-1}(x_i^{t-1}) - T^t(y_i^t) \big) \big\|^2    (2.1)
992
+ blank |
993
+ text | y_i^t ∈ X^t is the corresponding point for x_i^{t-1} ∈ X^{t-1}. Once the correspondence is
995
+ | known, minimizing Eq. (2.1) becomes a closed-form solution [BM92]. In conventional
996
+ | approaches, correspondence is found by searching for the closest point, which is com-
997
+ | putationally expensive. Real-time registration methods reduce the cost by projecting
998
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 20
999
+ blank |
1000
+ |
1001
+ |
1002
+ |
1003
+ text | (a) i^{t-1} ∈ I^{t-1}   (b) j^t ∈ I^t   (c) H^t(I^{t-1})   (d) |I^t − H^t(I^{t-1})|
1004
+ blank |
1005
+ text | Figure 2.5: From optical flow between two consecutive frames, sparse image features
1006
+ | are matched between (a) it−1 ∈ It−1 and (b) j t ∈ It . The matched features are then
1007
+ | used to calculate homography Ht such that the previous image It−1 can be warped to
1008
+ | the space of the current image It and create dense projective correspondences (c). The
1009
+ | difference image (d) shows that most of the dense correspondences are within a few-pixel
1010
+ | error in the image plane, with a slight offset around silhouette areas.
1011
+ blank |
1012
+ text | the 3-D points onto a 2-D image plane and assigning correspondences to points that
1013
+ | project onto the same pixel locations [RL01]. However, projection will only reduce the
1014
+ | distance in the ray direction; the offset parallel to the image plane cannot be adjusted.
1015
+ | This phenomenon can result in the algorithm not compensating for the translation
1016
+ | parallel to the plane and therefore shrinking the size of the room (Figure 2.4).
1017
+ | Our pair-wise registration is similar to [RL01], but it compensates for the dis-
1018
+ | placement parallel to the image plane using image features and silhouette points.
1019
+ | Intuitively, the system uses homography to compensate for errors parallel to the
1020
+ | plane if the structure can be approximated into a plane, and silhouette points are
1021
+ | used to compensate for remaining errors when the features are not planar.
1022
+ | Our system first computes the optical flow between color images It and It−1 and
1023
+ | finds a sparse set of features matched between them, Figure 2.5(a)(b). The sparse set
1024
+ | of features then can be used to create dense projective correspondence between the
1025
+ | two frames, Figure 2.5(c)(d). More specifically, homography is a transform between
1026
+ | 2-D homogeneous coordinates defined by a matrix H ∈ R3×3 :
1027
+ blank |
1028
+ text | \min_H \sum_{i^{t-1}, j^t} \| H\, i^{t-1} - j^t \|^2, \quad \text{where } i^{t-1} = (u_i, v_i, 1)^T \in I^{t-1}, \; j^t = (w u_j, w v_j, w)^T \in I^t    (2.2)
1038
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 21
1039
+ blank |
1040
+ |
1041
+ |
1042
+ |
1043
+ text | Figure 2.6: Silhouette points. There are two different types of depth discontinuity:
1044
+ | the boundaries of a shadow made on the background by a foreground object (empty
1045
+ | circles), and the boundaries of a foreground object (filled circles). The meaningful
1046
+ | depth features are the foreground points, which are the silhouette points used for our
1047
+ | registration pipeline.
1048
+ blank |
1049
+ text | Compared to naive projective correspondence used in [RL01], a homography de-
1050
+ | fines a map between two planar surfaces in 3-D space. The homography represents
1051
+ | the displacement parallel to the image plane, and is used to compute dense corre-
1052
+ | spondences between the two frames. While a homography does not represent a full
1053
+ | transformation in 3-D, the planar approximation works well in practice for our sce-
1054
+ | nario, where the scene is mostly composed of flat planes and the relative movement is
1055
+ | small. From the second iteration, the correspondence is found by projecting individual
1056
+ | points onto the image plane, as shown in [RL01].
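+ text | One possible realization of this warping step is sketched below. It assumes OpenCV is
+ | available (the thesis does not name a library) and that the optical-flow step has
+ | already produced matched pixel locations pts_prev and pts_cur; the function and
+ | variable names are illustrative.
+ blank |
+ | import numpy as np
+ | import cv2
+ |
+ | def dense_correspondence_from_homography(pts_prev, pts_cur, image_prev):
+ |     """Estimate H from sparse matches (Eq. 2.2), warp the previous image, and map
+ |     every previous-frame pixel into the current frame (cf. Figure 2.5)."""
+ |     # pts_prev, pts_cur: (N, 2) float32 arrays of matched pixel coordinates.
+ |     H, _ = cv2.findHomography(pts_prev, pts_cur, cv2.RANSAC, 3.0)
+ |     h, w = image_prev.shape[:2]
+ |     warped_prev = cv2.warpPerspective(image_prev, H, (w, h))
+ |     # Dense projective correspondence: push the full pixel grid through H.
+ |     u, v = np.meshgrid(np.arange(w, dtype=np.float32), np.arange(h, dtype=np.float32))
+ |     grid = np.stack([u.ravel(), v.ravel()], axis=1).reshape(-1, 1, 2)
+ |     mapped = cv2.perspectiveTransform(grid, H).reshape(h, w, 2)
+ |     return H, warped_prev, mapped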
1057
+ | Given the correspondence, the registration between the frames for the current iter-
1058
+ | ation can be given as a closed-form solution (Equation 2.1). Additionally, the system
1059
+ | modifies the correspondence for silhouette points (points of depth discontinuity in
1060
+ | the foreground, shown in Figure 2.6). For silhouette points in Xt−1 , the system finds
1061
+ | the closest silhouette points in Xt within a small search window from the original
1062
+ | corresponding location. If the matching silhouette point exists, the correspondence is
1063
+ | weighted more. (We used wi = 100 for silhouette points and wi = 1 for non-silhouette
1064
+ | points.) The process iterates until it converges.
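+ text | The weighted closed-form update behind Eq. (2.1) can be sketched as a Kabsch-style
+ | solve, shown below with the silhouette weighting mentioned in the text (100 versus 1).
+ | The correspondence arrays are assumed to come from the projective and silhouette
+ | matching described above; this is an illustrative sketch, not the thesis code.
+ blank |
+ | import numpy as np
+ |
+ | def weighted_rigid_transform(src, dst, weights):
+ |     """Closed-form (R, t) minimizing sum_i w_i ||R src_i + t - dst_i||^2 (Kabsch-style)."""
+ |     w = weights / weights.sum()
+ |     mu_src = (w[:, None] * src).sum(axis=0)
+ |     mu_dst = (w[:, None] * dst).sum(axis=0)
+ |     cov = (w[:, None] * (src - mu_src)).T @ (dst - mu_dst)
+ |     u, _, vt = np.linalg.svd(cov)
+ |     d = np.sign(np.linalg.det(vt.T @ u.T))        # guard against reflections
+ |     R = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
+ |     t = mu_dst - R @ mu_src
+ |     return R, t
+ |
+ | def registration_step(points_prev, points_cur, corr, is_silhouette):
+ |     """One alignment iteration; silhouette correspondences get 100x the weight."""
+ |     weights = np.where(is_silhouette, 100.0, 1.0)
+ |     return weighted_rigid_transform(points_prev, points_cur[corr], weights)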
1065
+ blank |
1066
+ title | Registration Failure
1067
+ blank |
1068
+ text | The real-time registration is a crucial part of our algorithm for accurate reconstruc-
1069
+ | tion. Even with the hybrid approach in which both color and depth features are used,
1070
+ | the registration can fail, and it is important to detect the failure immediately and
1071
+ | to recover the position of the sensor. The registration failure is detected either (1)
1072
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 22
1073
+ blank |
1074
+ |
1075
+ |
1076
+ text | if the pair-wise registration does not converge or (2) if there are not enough color
1077
+ | and depth features. The first case can be easily detected as the algorithm runs. The
1078
+ | second case is detected if the optical flow does not find a homography (i.e., there is a
1079
+ | lack of color features) and there are not enough matched silhouette points (i.e., there
1080
+ | is a lack of depth features).
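+ text | The failure test itself is simple enough to state directly; the sketch below mirrors
+ | the two conditions. The minimum number of silhouette matches is an illustrative
+ | threshold, not a value taken from the thesis.
+ blank |
+ | def registration_failed(converged, homography_found, n_silhouette_matches,
+ |                         min_silhouette_matches=50):
+ |     """(1) non-convergence, or (2) no homography (lack of color features) together
+ |     with too few silhouette matches (lack of depth features)."""
+ |     lacks_color = not homography_found
+ |     lacks_depth = n_silhouette_matches < min_silhouette_matches
+ |     return (not converged) or (lacks_color and lacks_depth)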
1081
+ | In cases of registration failure, the projected image turns red, indicating that the
1082
+ | user should return the system’s viewpoint to the most recently observed corner. This
1083
+ | movement usually takes only a small amount of back-tracking because the failure
1084
+ | is detected within milliseconds of leaving the previous successfully registered area.
1085
+ | Similar to the initialization step, the system extracts planes from Xt using RANSAC
1086
+ | and matches the planes with the desired corner. Figure 2.2 (b) depicts the process of
1087
+ | overcoming a registration failure. The user then deliberately moves the sensor along
1088
+ | the path with richer features or steps farther from a wall to cover a wider view.
1089
+ blank |
1090
+ |
1091
+ title | 2.3.2 Plane Extraction
1092
+ text | Based on the transformation T t , the system extracts axis-aligned planes and asso-
1093
+ | ciated edges. The planes and detected features will provide higher-level information
1094
+ | that relates the raw point cloud Xt to the global map Mt . Because the system only
1095
+ | considers planes aligned with the Manhattan-world coordinate system, we were
1096
+ | able to simplify the plane detection procedure.
1097
+ | The planes from the previous frame that remain visible can be easily found by
1098
+ | using the correspondence. From the pair-wise registration (Sec. 2.3.1), our system
1099
+ | has the point-wise correspondence between the previous frame and the current frame.
1100
+ | The plane label Pt−1 from the previous frame is updated simply by being copied over
1101
+ | to the corresponding location. Then, the system refines Pt by alternating between
1102
+ | fitting points and fitting parameters.
1103
+ | A new plane can be found by projecting remaining points for the x, y, and z axes.
1104
+ | For each axis direction, a histogram is built with a bin size of 20 cm. The system then
1105
+ | tests the plane equation for populated bins. Compared to the RANSAC procedure
1106
+ | for initialization, the Manhattan-world assumption reduces the number of degrees of
1107
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 23
1108
+ blank |
1109
+ |
1110
+ |
1111
+ text | freedom from three to one, making plane extraction more efficient.
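+ text | A minimal sketch of this one-dimensional search is shown below: for each Manhattan
+ | axis the still-unlabeled points are histogrammed with 20 cm bins and a plane equation
+ | is tested at every populated bin. The minimum point count and the inlier tolerance
+ | are illustrative values, not the thesis settings.
+ blank |
+ | import numpy as np
+ |
+ | def detect_axis_aligned_planes(points, labels, bin_size=0.2, min_points=500, tol=0.03):
+ |     """points: (N, 3) array in the Manhattan frame; labels < 0 marks unassigned points.
+ |     Returns candidate planes as (axis, offset, support)."""
+ |     remaining = points[labels < 0]
+ |     candidates = []
+ |     if len(remaining) == 0:
+ |         return candidates
+ |     for axis in range(3):                        # x, y, z directions
+ |         coords = remaining[:, axis]
+ |         lo, hi = coords.min(), coords.max()
+ |         nbins = max(1, int(np.ceil((hi - lo) / bin_size)))
+ |         hist, edges = np.histogram(coords, bins=nbins, range=(lo, lo + nbins * bin_size))
+ |         for b in np.flatnonzero(hist > min_points):
+ |             offset = 0.5 * (edges[b] + edges[b + 1])    # tentative plane: axis = offset
+ |             support = int((np.abs(coords - offset) < tol).sum())
+ |             if support > min_points:
+ |                 candidates.append((axis, float(offset), support))
+ |     return candidates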
1112
+ | For extracted planes, the boundary edges are also extracted; the system detects
1113
+ | groups of boundary points that can be explained by an axis-parallel line segment.
1114
+ | The system also retains the information about relative positions for extracted planes
1115
+ | (left/right). As long as the sensor is not flipped upside-down, this information pro-
1116
+ | vides an important cue to build a room with the correct topology, even when the
1117
+ | connectivity between neighboring planes has not been observed.
1118
+ blank |
1119
+ title | Data Association
1120
+ blank |
1121
+ text | After the planes are extracted, the data association process finds the link between the
1122
+ | global map Mt and the extracted planes, represented as Pt , a 2-D array of plane labels for each
1123
+ | pixel. The system automatically finds plane labels that existed from the previous
1124
+ | frame and extracts the planes by copying over the plane labels using correspondences.
1125
+ | The plane labels for the newly detected plane can be found by comparing T t (Ft )
1126
+ | and Mt . In addition to the plane equation, the relative position of the newly observed
1127
+ | plane with respect to other observed planes is used to label the plane. If the plane
1128
+ | has not been previously observed, a new plane will be added into Ltr based on the
1129
+ | left-right information.
1130
+ | After the data association step, the system updates the sequence of observation
1131
+ | S. The planes that have been assigned as previously observed are used for global
1132
+ | adjustment (Sec. 2.3.3). If a new plane is observed, the room Rtr will be updated
1133
+ | accordingly (Sec. 2.3.4).
1134
+ blank |
1135
+ |
1136
+ title | 2.3.3 Global Adjustment
1137
+ text | Due to noise in the point cloud, frame-to-frame registration is not perfect, and er-
1138
+ | ror accumulates over time. This is a common problem in pose estimation. Large-
1139
+ | scale localization approaches use bundle adjustment to compensate for error accumula-
1140
+ | tion [TMHF00, Thr02]. Enforcing this global constraint involves detecting landmark
1141
+ | objects, or stationary objects observed at different times during a sequence of mea-
1142
+ | surements. Usually this global adjustment becomes an optimization problem in many
1143
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 24
1144
+ blank |
1145
+ |
1146
+ |
1147
+ |
1148
+ text | Figure 2.7: As errors accumulate in T t and in measurements, the map Mt becomes
1149
+ | inconsistent. By comparing previous and recent measurements, the system can correct
1150
+ | for inconsistency and update the value of c such that c = a.
1151
+ blank |
1152
+ text | dimensions. The problem is formulated by constraining the landmarks to predefined
1153
+ | global locations, and by solving an energy function that encodes noise in a pose es-
1154
+ | timation of both sensor and landmark locations. The Manhattan-world assumption
1155
+ | allows us to reduce the error accumulation efficiently in real time by refining our
1156
+ | registration estimate and by optimizing the global map.
1157
+ blank |
1158
+ title | Refining the Registration
1159
+ blank |
1160
+ text | After data association, the system performs a second round of registration with re-
1161
+ | spect to the global map Mt to reduce the error accumulation in T t by incremental,
1162
+ | pair-wise registration. The extracted planes Pt , if already observed by the system,
1163
+ | have been assigned to the planes in Mt that have associated plane equations. For
1164
+ | example, suppose a point T t (xu,v ) = (x, y, z) has a plane label Pt (u, v) = pk (assigned
1165
+ | to plane k). If plane k has normal parallel to the x axis, the plane equation in the
1166
+ | global map Mt can be written as x = x0 (x0 ∈ R). Consequently, the registration
1167
+ | should be refined to minimize kx − x0 k2 . In other words, the refined registration can
1168
+ | be found by defining the corresponding point for xu,v as (x0 , y, z). The corresponding
1169
+ | points are likewise assigned for every point with a plane assignment in Pt . Given the
1170
+ | correspondence, the system can refine the registration between the current frame Ft
1171
+ | and the global map Mt . This second round of registration reduces the error in the
1172
+ | axis direction. In our example, the refinement is active while the plane x = x0 is
1173
+ | visible and reduces the uncertainty in the x direction with respect to the global map.
1174
+ | The error in the x direction is not accumulated during the interval.
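+ text | The plane-constrained correspondences of this refinement can be sketched as follows:
+ | each point assigned to a known plane is paired with a copy of itself whose coordinate
+ | along the plane's axis is snapped to the plane offset. The array layout (per-plane axis
+ | and offset arrays) is an assumption made for the sketch, and the resulting pairs would
+ | then feed a closed-form alignment like the one sketched earlier.
+ blank |
+ | import numpy as np
+ |
+ | def plane_constrained_targets(points_world, plane_labels, plane_axis, plane_offset):
+ |     """points_world: (N, 3) points already mapped by T^t; plane_labels: (N,) index into
+ |     the global planes or -1; plane_axis[k] in {0, 1, 2} and plane_offset[k] give the
+ |     equation of plane k (e.g. x = x0).  Returns snapped targets and a validity mask."""
+ |     targets = points_world.copy()
+ |     valid = plane_labels >= 0
+ |     rows = np.flatnonzero(valid)
+ |     axes = plane_axis[plane_labels[rows]]
+ |     targets[rows, axes] = plane_offset[plane_labels[rows]]
+ |     return targets, valid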
1175
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 25
1176
+ blank |
1177
+ |
1178
+ |
1179
+ title | Optimizing the Map
1180
+ blank |
1181
+ text | As error accumulates, the reconstructed map Mt may also require global adjust-
1182
+ | ment in each axis direction. The Manhattan-world assumption simplifies this global
1183
+ | optimization into two separate, one-dimensional problems (we are excluding the z
1184
+ | direction for now, but the idea can be extended to a 3-D case).
1185
+ | Figure 2.7 shows a simple example in the x-axis direction. Let us assume that
1186
+ | the figure represents an overhead view of a rectangular room. There should be two
1187
+ | walls whose normals are parallel to the x-axis. The sensor detects the first wall
1188
+ | (x = a), sweeps around the room, observes another wall (x = b), and returns to
1189
+ | the previously observed wall. Because of error accumulation, parts of the same wall
1190
+ | have two different offset values (x = a and x = c), but by observing the left-right
1191
+ | relationship between walls, the system infers that the two walls are indeed the same
1192
+ | wall.
1193
+ | To optimize the offset values, the system tracks the sequence of observations
1194
+ | S x = {a, b, c} and the variances at the point of observation for each wall, as well as the
1195
+ | constraints represented by the pair of the same offset values C x = {(c11 , c12 ) = (a, c)}.
1196
+ | We introduce two random variables, ∆1 and ∆2 , to constrain the global map op-
1197
+ | timization. ∆1 is a random variable with mean m1 = b − a and variance σ12 that
1198
+ | represents the error between the moment when the sensor observes the x = a wall
1199
+ | and the moment it observes the x = b wall. Likewise, a random variable ∆2 represents
1200
+ | the error with mean m2 = c − b and variance σ22 .
1201
+ | Whenever a new constraint is added, or when the system observes a plane that
1202
+ | was previously observed, the global adjustment routine is triggered. This is usually
1203
+ | when the user finishes scanning a room by looping around it and returning to the
1204
+ | first wall measured. By confining the axis direction, the global adjustment becomes
1205
+ | a one-dimensional quadratic equation:
1206
+ blank |  
1207
+ text | \min_{S^x} \sum_i \frac{\| \Delta_i - m_i \|^2}{\sigma_i^2} \quad \text{s.t.} \quad c_{j1} = c_{j2}, \; \forall (c_{j1}, c_{j2}) \in C^x    (2.3)
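+ text | Under this one-dimensional formulation, the adjustment can be sketched as an
+ | equality-constrained least-squares solve over the observed offsets. The KKT
+ | construction below is only one way to do this; it pins the first offset to remove the
+ | translation ambiguity, and the data layout is an assumption of the sketch.
+ blank |
+ | import numpy as np
+ |
+ | def adjust_offsets(offsets, gap_variances, constraints):
+ |     """offsets: observed S^x, e.g. [a, b, c]; gap_variances: sigma_i^2 for each successive
+ |     gap Delta_i; constraints: index pairs that must coincide, e.g. [(0, 2)] when the
+ |     first and last observation are the same wall."""
+ |     s = np.asarray(offsets, dtype=float)
+ |     n = len(s)
+ |     m = np.diff(s)                                   # measured gap means m_i
+ |     w = 1.0 / np.asarray(gap_variances, dtype=float)
+ |     A = np.zeros((n - 1, n))                         # (A s)_i = s_{i+1} - s_i
+ |     A[np.arange(n - 1), np.arange(n - 1)] = -1.0
+ |     A[np.arange(n - 1), np.arange(1, n)] = 1.0
+ |     C = np.zeros((len(constraints) + 1, n))          # equality constraints + gauge
+ |     for row, (j1, j2) in enumerate(constraints):
+ |         C[row, j1], C[row, j2] = 1.0, -1.0
+ |     C[-1, 0] = 1.0                                   # pin s_0 to its observed value
+ |     rhs_c = np.zeros(len(constraints) + 1)
+ |     rhs_c[-1] = s[0]
+ |     H = A.T @ (w[:, None] * A)                       # weighted normal-equations block
+ |     K = np.block([[H, C.T], [C, np.zeros((C.shape[0], C.shape[0]))]])
+ |     rhs = np.concatenate([A.T @ (w * m), rhs_c])
+ |     return np.linalg.solve(K, rhs)[:n]
+ |
+ | # Figure 2.7 example (illustrative numbers): walls observed at a=0.0, b=3.0, c=0.2,
+ | # with a and c the same wall:
+ | # adjust_offsets([0.0, 3.0, 0.2], gap_variances=[0.01, 0.04], constraints=[(0, 2)])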
1212
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 26
1213
+ blank |
1214
+ |
1215
+ |
1216
+ |
1217
+ text | Figure 2.8: Selection. In sequence (a), the user is observing two new planes in the
1218
+ | scene (colored white) and one currently included plane (colored blue). The user selects
1219
+ | one of the new planes by pointing at it and clicking. Then, the second new plane is
1220
+ | added. All planes are blue in the final frame, confirming that all planes have been
1221
+ | successfully selected. Sequence (b) shows a configuration where the user has decided
1222
+ | not to include the large cabinet. Sequence (c) shows successful selection of the ceiling
1223
+ | and the wall despite clutter.
1224
+ blank |
1225
+ title | 2.3.4 Map Update
1226
+ text | Our algorithm ignores most irrelevant features by using the Manhattan-world as-
1227
+ | sumption. However, the system cannot distinguish architectural components from
1228
+ | other axis-aligned objects using the Manhattan-world assumption. For example, fur-
1229
+ | niture, open doors, parts of other rooms that might be visible, or reflections from
1230
+ | mirrors may be detected as axis-aligned planes. The system solves the challenging
1231
+ | cases by allowing the user to manually specify the planes that he or she would like to
1232
+ | include in the final model. This manual specification consists of simply clicking the
1233
+ | input button during scanning when pointing at a plane, as shown in Figure 2.8. If
1234
+ | the user enters a new room, a right click of the button indicates that the user wishes
1235
+ | to include this new room and to optimize it individually. The system creates a new
1236
+ | loop of planes, and any newly observed planes are added to the loop.
1237
+ | Whenever a new plane is added to Ltr or there is user input to specify the room
1238
+ | structure, the map update routine extracts a 2-D rectilinear polygon Rtr from Ltr with
1239
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 27
1240
+ blank |
1241
+ |
1242
+ |
1243
+ text | [Figure 2.9 pie chart: per-step timings in ms (legend: data i/o, prepare image,
+ | pre-processing, optical flow, pair-wise registration, plane extraction, data
+ | association, refine registration, optimize map); labeled slices include 0.104, 3.318,
+ | 5.797, 6.728, 11.845, 13.203, 14.517, and 58.672 ms, with the largest slice (51%)
+ | corresponding to pair-wise registration.]
1252
+ blank |
1253
+ text | Figure 2.9: The average computational time for each step of the system.
1254
+ blank |
1255
+ text | the help of user input. A valid rectilinear polygon structure should have alternating
1256
+ | axis directions for any pair of adjacent walls (an x = xi wall should be connected to
1257
+ | a y = yj wall). The system starts by adding all selected planes into Rtr as well as
1258
+ | whichever unselected planes in Ltr are necessary to have alternating axis directions.
1259
+ | When planes are added, the planes with observed boundary edges are preferred. If
1260
+ | the two observed walls have the same axis direction, the unobserved wall is added
1261
+ | between them on the boundary of the planes to form a complete loop.
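+ text | A much-simplified sketch of the alternation rule is given below; it only inserts
+ | placeholder connector walls when two consecutive selected walls share an axis and
+ | ignores the offset geometry and the preference for observed boundary edges, so it should
+ | be read as an illustration of the constraint rather than as the actual routine.
+ blank |
+ | def enforce_alternation(selected_walls):
+ |     """selected_walls: ordered list of (axis, offset) with axis in {'x', 'y'}.
+ |     Returns a loop where adjacent walls always have different axes."""
+ |     loop = []
+ |     for wall in selected_walls:
+ |         if loop and loop[-1][0] == wall[0]:
+ |             other = 'y' if wall[0] == 'x' else 'x'
+ |             loop.append((other, None))            # unobserved connecting wall
+ |         loop.append(wall)
+ |     # Close the loop: the first and last walls must also alternate.
+ |     if len(loop) >= 2 and loop[0][0] == loop[-1][0]:
+ |         loop.append(('y' if loop[0][0] == 'x' else 'x', None))
+ |     return loop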
1262
+ blank |
1263
+ |
1264
+ title | 2.4 Evaluation
1265
+ text | The goal of the system is to build a floor plan of any interior environment.
1266
+ | In our testing of the system, we mapped the apartments of six different volunteers,
1267
+ | ranging from approximately 500 to 2000 ft2 and located in Palo Alto. The residents were living
1268
+ | in the scanned places and thus the apartments exhibited different amounts and types
1269
+ | of objects.
1270
+ | For each data set, we compare the floor plan generated by our system with one
1271
+ | manually-generated using measurements from a commercially available measuring
1272
+ | device.1 The current practice in architecture and real estate is to use a point-to-
1273
+ | point laser device to measure distances between pairs of parallel planes. Making
1274
+ | such measurements requires a clear, level line of sight between two planes, which
1275
+ meta | 1
1276
+ text | measuring range 0.05 to 40m; average measurement accuracy +/- 1.5mm; measurement duration
1277
+ | < 0.5s to 4s per measurement.
1278
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 28
1279
+ blank |
1280
+ |
1281
+ |
1282
+ text | may be time-consuming to find due to the presence of furniture, windows, and other
1283
+ | obstructions. Moreover, after making all the distance measurements, a user is required
1284
+ | to manually draw a floor plan that respects the measurements. In our tests, roughly
1285
+ | 10-20 minutes were needed to build a floor plan of each apartment in the conventional
1286
+ | way as described.
1287
+ | Using our system, the data acquisition process took approximately 2-5 minutes per
1288
+ | apartment to initiate, run, and generate the full floor plan. Table 2.1 summarizes the
1289
+ | timing data for each data set. The average frame rate is 7.5 frames per second running
1290
+ | on an Intel 2.50GHz Dual Core laptop. Figure 2.9 depicts the average computational
1291
+ | time for each step of the algorithm. The pair-wise registration routine (Sec. 2.3.1)
1292
+ | contributes more than half of the computational time, followed by the pre-processing
1293
+ | step of fetching a new frame and calculating optical flow (25%).
1294
+ | In Figure 2.10, we visually compare the floor plans reconstructed in a conventional
1295
+ | way with those built by our system. The floor plans in blue were reconstructed using
1296
+ | point-to-point laser measurements, and the floor plans in red were reconstructed by
1297
+ | our system. For each apartment, the topology of the reconstructed walls agrees with
1298
+ | the manually-constructed floor plan. In all cases the detection and labeling of planar
1299
+ | surfaces by our algorithm enabled the user to add or remove these surfaces from
1300
+ | the model in real time, allowing the final model to be constructed using only the
1301
+ | important architectural elements from the scene.
1302
+ | The overlaid floor plans in Figure 2.10(c) show that the relative placement of
1303
+ | the rooms may be misaligned. This is because our global adjustment routine optimizes
1304
+ | rooms individually, thus errors can accumulate in transitions between rooms. The
1305
+ | algorithm could be extended to enforce global constraints on the relative placement
1306
+ | of rooms, such as maintaining a certain wall thickness and/or aligning the outer-most
1307
+ | walls, but such global constraints may induce other errors.
1308
+ | Table 2.1 contains a quantitative comparison of the errors. The reported depth
1309
+ | resolution of the sensor is 0.01m at 2m, and for each model we have an average of
1310
+ | 0.075m error per wall. The relative error stays in the range of 2-5%, which shows
1311
+ | that small registration errors continue to accumulate as more
1312
+ | frames are processed.
1313
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 29
1314
+ blank |
1315
+ |
1316
+ text | data set   no. of frames   run time   fps    avg. error (m)   avg. error (%)
+ | 1          1465            2m 56s     8.32   0.115            4.14
+ | 2          1009            1m 57s     8.66   0.064            1.90
+ | 3          2830            5m 19s     8.88   0.053            2.40
+ | 4          1129            2m 39s     7.08   0.088            2.34
+ | 5          1533            3m 52s     6.59   0.178            3.52
+ | 6          2811            7m 4s      6.65   0.096            3.10
+ | ave.       1795            3m 57s     7.54   0.075            2.86
1326
+ blank |
1327
+ text | Table 2.1: Accuracy comparison between floor plans reconstructed by our system, and
1328
+ | manually constructed floor plans generated from point-to-point laser measurements.
1329
+ blank |
1330
+ text | Fundamentally, the limitations of our method reflect the limitations of the Kinect
1331
+ | sensor, the processing power of the laptop, and the assumptions made in our
1332
+ | approach. Because the accuracy of real-time depth data is worse than that from
1333
+ | visual features, our approach exhibits larger errors compared to visual SLAM (e.g.,
1334
+ | [ND10]). Some of the uncertainty can be reduced by adapting approaches from the
1335
+ | well-explored visual SLAM literature. Still, the system is limited when meaningful
1336
+ | features cannot be detected. The Kinect sensor’s reported measurement range is
1337
+ | between 1.2 and 3.5m from an object; outside that range, data is noisy or unavailable.
1338
+ | As a consequence, data in narrow hallways or large atriums was difficult to collect.
1339
+ | Another source of potential error is a user outpacing the operating rate of approx-
1340
+ | imately 7.5 fps. This frame rate already allows for a reasonable data capture pace,
1341
+ | but with more processing power, the pace of the system could always be guaranteed
1342
+ | to exceed normal human motion.
1343
+ blank |
1344
+ |
1345
+ title | 2.5 Conclusions and Future Work
1346
+ text | We have presented an interactive system that allows a user to capture accurate ar-
1347
+ | chitectural information and to automatically generate a floor plan. Leveraging the
1348
+ | Manhattan-world assumption, we have created a representation that is tractable in
1349
+ | real time while ignoring clutter. In the presented system, the current status of the
1350
+ | reconstruction is projected on the scanned environment to enable the user to provide
1351
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 30
1352
+ blank |
1353
+ |
1354
+ |
1355
+ text | high-level feedback to the system. This feedback helps overcome ambiguous situa-
1356
+ | tions and allows the user to interactively specify the important planes that should be
1357
+ | included in the model.
1358
+ | If there are not enough features scanned for the system to determine that the
1359
+ | operator has moved, the system will assume that motion has not occurred, leading to
1360
+ | a general underestimation of wall lengths when no depth or image features are available.
1361
+ | These challenges could be overcome by including an IMU or other devices to assist in the
1362
+ | pose tracking of the system.
1363
+ | We have limited our Manhattan-world features to axis-aligned planes in vertical
1364
+ | directions. However, in future work, we could generalize the system to handle rec-
1365
+ | tilinear polyhedra which are not convex in the vertical direction. Furthermore, the
1366
+ | world could be expanded to include walls that are not aligned with the axes of the
1367
+ | global coordinate system.
1368
+ | More broadly, our interactive system can be extended to other applications in
1369
+ | indoor environments. For example, a user could visualize modifications to the space
1370
+ | shown in Figure 2.11, where we show a user clicking and dragging a cursor across a
1371
+ | plane to “add” a window. This example illustrates the range of possible uses of our
1372
+ | system.
1373
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 31
1374
+ blank |
1375
+ |
1376
+ |
1377
+ |
1378
+ text | [Figure 2.10 layout: one row per data set (house 1 through house 6), with columns
+ | (a), (b), (c).]
1405
+ blank |
1406
+ text | Figure 2.10: (a) Manually constructed floor plans generated from point-to-point laser
1407
+ | measurements, (b) floor plans acquired with our system, and (c) overlay. For house
1408
+ | 4, some parts (pillars in a large open space, stairs, and an elevator) are ignored by the
1409
+ | user. The system still uses the measurements from those parts and other objects to
1410
+ | correctly understand the relative positions of the rooms.
1411
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 32
1412
+ blank |
1413
+ |
1414
+ |
1415
+ |
1416
+ text | Figure 2.11: The system, having detected the planes in the scene, also allows the user
1417
+ | to interact directly with the physical world. Here the user adds a window to the room
1418
+ | by dragging a cursor across the wall (left). This motion updates the internal model
1419
+ | of the world (right).
1420
+ meta | Chapter 3
1421
+ blank |
1422
+ title | Acquiring 3D Indoor Environments
1423
+ | with Variability and Repetition2
1424
+ blank |
1425
+ text | Unlike mapping of urban environments, interior mapping focuses on interior
1426
+ | objects, which can be geometrically complex, located in cluttered settings, and undergo
1427
+ | significant variations. In addition, the indoor 3-D data captured from RGB-D cameras
1428
+ | suffer from limited resolution and data quality. The process is further complicated
1429
+ | when the model deforms between successive acquisitions. The work described in this
1430
+ | chapter focused on acquiring and understanding objects in interiors of public buildings
1431
+ | (e.g., schools, hospitals, hotels, restaurants, airports, train stations) or office buildings
1432
+ | from RGB-D camera scans of such interiors.
1433
+ | We exploited three observations to make the problem of indoor 3D acquisition
1434
+ | tractable: (i) most such building interiors are composed of basic elements such as
1435
+ | walls, doors, windows, furniture (e.g., chairs, tables, lamps, computers, cabinets),
1436
+ | which come from a small number of prototypes and repeat many times. (ii) such
1437
+ | building components usually consist of rigid parts of simple geometry, i.e., they have
1438
+ | surfaces that are well approximated by planar, cylindrical, conical, spherical proxies.
1439
+ | Further, although variability and articulation are dominant (e.g., a chair is moved
1440
+ meta | 2
1441
+ text | The contents of the chapter was originally published as Young Min Kim, Niloy J. Mitra,
1442
+ | Dong-Ming Yan, and Leonidas Guibas. 2012. Acquiring 3D indoor environments with vari-
1443
+ | ability and repetition. ACM Trans. Graph. 31, 6, Article 138 (November 2012), 11 pages.
1444
+ | DOI=10.1145/2366145.2366157 http://doi.acm.org/10.1145/2366145.2366157.
1445
+ blank |
1446
+ |
1447
+ meta | 33
1448
+ | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 34
1449
+ blank |
1450
+ |
1451
+ |
1452
+ |
1453
+ text | [Figure 3.1 panels (office scene): input single-view scan, recognized objects,
+ | retrieved and posed models.]
1459
+ blank |
1460
+ |
1461
+ text | Figure 3.1: (Left) Given a single view scan of a 3D environment obtained using a
1462
+ | fast range scanner, the system performs scene understanding by recognizing repeated
1463
+ | objects, while factoring out their modes of variability (middle). The repeating ob-
1464
+ | jects have been learned beforehand as low-complexity models, along with their joint
1465
+ | deformations. The system extracts the objects despite a poor-quality input scan with
1466
+ | large missing parts and many outliers. The extracted parameters can then be used
1467
+ | to pose 3D models to create a plausible scene reconstruction (right).
1468
+ blank |
1469
+ text | or rotated, a lamp arm is bent and adjusted), such variability is limited and low-
1470
+ | dimensional (e.g., translational motion, hinge joint, telescopic joint). (iii) mutual
1471
+ | relationships among the basic objects satisfy strong priors (e.g., a chair stands on the
1472
+ | floor, a monitor rests on the table).
1473
+ | We present a simple yet practical system to acquire models of indoor objects such
1474
+ | as furniture, together with their variability modes, and discover object repetitions
1475
+ | and exploit them to speed up large-scale indoor acquisition towards high-level scene
1476
+ | understanding. Our algorithm works in two phases. First, in the learning phase, the
1477
+ | system starts from a few scans of individual objects to construct primitive-based 3D
1478
+ | models while explicitly recovering respective joint attributes and modes of variation.
1479
+ | Second, in the fast recognition phase (about 200ms/model), the system starts from a
1480
+ | single-view scan to segment and classify it into plausible objects, recognize them, and
1481
+ | extract the pose parameters for the low-complexity models generated in the learning
1482
+ | phase. Intuitively, our system uses priors for primitive types and their connections,
1483
+ | thus greatly reducing the number of unknowns to enable model fitting even from
1484
+ | very sparse and low-resolution datasets, while hierarchically associating subsets of
1485
+ | scans to parts of objects. We also demonstrate that simple inter- and intra-object
1486
+ | relations simplify segmentation and classification tasks necessary for high-level scene
1487
+ | understanding (see [MPWC12] and references therein).
1488
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 35
1489
+ blank |
1490
+ |
1491
+ |
1492
+ text | We tested our method on a range of challenging synthetic and real-world scenes.
1493
+ | We present, for the first time, basic scene reconstruction for massive indoor scenes
1494
+ | (e.g., office spaces, building auditoriums on a university campus) from unreliable
1495
+ | sparse data by exploiting the low-complexity variability of common scene objects. We
1496
+ | show how we can now detect meaningful changes in an environment. For example,
1497
+ | our system was able to discover a new object placed in an office space by rescanning the
1498
+ | scene, despite articulations and motions of the previously extant objects (e.g., desk,
1499
+ | chairs, monitors, lamps). Thus, the system factors out nuisance modes of variability
1500
+ | (e.g., motions of the chairs, etc.) from variability modes that have importance in an
1501
+ | application (e.g., security, where the new scene objects should be flagged).
1502
+ blank |
1503
+ |
1504
+ title | 3.1 Related Work
1505
+ blank |
1506
+ title | 3.1.1 Scanning Technology
1507
+ text | Rusinkiewicz et al. [RHHL02] demonstrated the possibility of real-time lightweight 3D
1508
+ | scanning. More generally, surface reconstruction from unorganized pointcloud data
1509
+ | has been extensively studied in computer graphics, computational geometry, and
1510
+ | computer vision (see [Dey07]). Further, powered by recent developments in real-time
1511
+ | range scanning, everyday users can now easily acquire 3D data at high frame-rates.
1512
+ | Researchers have proposed algorithms to accumulate multiple poor-quality individual
1513
+ | frames to obtain better quality pointclouds [MFO+ 07, HKH+ 12, IKH+ 11]. Our main
1514
+ | goal differed, however, because our system focused on recognizing important elements
1515
+ | and semantically understanding large 3D indoor environments.
1516
+ blank |
1517
+ |
1518
+ title | 3.1.2 Geometric Priors for Objects
1519
+ text | Our system utilizes geometry on the level of individual objects, which are possible
1520
+ | abstractions used by humans to understand the environment [MZL+ 09]. Similar to Xu
1521
+ | et al. [XLZ+ 10], we understand an object as a collection of primitive parts and segment
1522
+ | the object based on the prior. Such a prior can successfully fill regions of missing
1523
+ | parts [PMG+ 05], infer plausible part motions of mechanical assemblies [MYY+ 10],
1524
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 36
1525
+ blank |
1526
+ |
1527
+ |
1528
+ text | extract shape by deforming a template model to match silhouette images [XZZ+ 11],
1529
+ | locate an object from photographs [XS12], or semantically edit images based on simple
1530
+ | scene proxies [ZCC+ 12].
1531
+ | The system focuses on locating 3D deformable objects in unsegmented, noisy,
1532
+ | single-view data in a cluttered environment. Researchers have used non-rigid align-
1533
+ | ment to better align (warped) multiple scans [LAGP09]. Alternately, temporal infor-
1534
+ | mation across multiple frames can be used to track and recover a deformable model
1535
+ | with joints between rigid parts [CZ11]. Instead, our system learns an instance-specific
1536
+ | geometric prior as a collection of simple primitives along with deformation modes from
1537
+ | a very small number of scans. Note that the priors are extracted in the learning stage,
1538
+ | rather than being hard coded in the framework. We demonstrate that such models
1539
+ | are sufficiently representative to extract the essence of real-world indoor scenes (see
1540
+ | also concurrent efforts by Nan et al. [NXS12] and Shao et al. [SXZ+ 12]).
1541
+ blank |
1542
+ |
1543
+ title | 3.1.3 Scene Understanding
1544
+ text | In the context of image understanding, Lee et al. [LGHK10] constructed a box-
1545
+ | based reconstruction of indoor scenes using volumetric considerations, while Gupta
1546
+ | et al. [GEH10] applied geometric constraints and physical considerations to obtain a
1547
+ | block-based 3D scene model. In the context of range scans, there have been only a few
1548
+ | efforts: Triebel et al. [TSS10] presented an unsupervised algorithm to detect repeating
1549
+ | parts by clustering on pre-segmented input data, while Koppula et al. [KAJS11] used
1550
+ | a graphical model to learn features and contextual relations across objects. Earlier,
1551
+ | Schnabel et al. [SWWK08] detected features in large point clouds using constrained
1552
+ | graphs that describe configurations of basic shapes (e.g., planes, cylinders, etc.) and
1553
+ | then performed a graph matching, which cannot be directly used in large, cluttered
1554
+ | environments captured at low resolutions.
1555
+ | Various learning-based approaches have recently been proposed to analyze and
1556
+ | segment 3D geometry, especially towards consistent segmentation and part-label asso-
1557
+ | ciation [HKG11, SvKK+ 11]. While similar MRF or CRF optimization can be applied
1558
+ | in our settings, we found that a fully geometric algorithm can produce comparable
1559
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 37
1560
+ blank |
1561
+ |
1562
+ |
1563
+ text | high-quality recognition results without extensive training. In our setting, learning
1564
+ | amounts to recovering the appropriate deformation model for the scanned model
1565
+ | in terms of arrangement of primitives and their connection types. While most of
1566
+ | machine-learning approaches are restricted to local features and limited viewpoints,
1567
+ | our geometric approach successfully handles the variability of objects and utilizes
1568
+ | extracted high-level information.
1569
+ blank |
1570
+ text | [Figure 3.2 diagram: Learning — scans I11, I12, I13, ... yield model M1 and scans
+ | I21, I22, I23, ... yield model M2; Recognition — scene scan S yields objects
+ | o1, o2, ...]
1582
+ blank |
1583
+ text | Figure 3.2: Our algorithm consists of two main phases: (i) a relatively slow learn-
1584
+ | ing phase to acquire object models as collections of interconnected primitives and their
1585
+ | joint properties and (ii) a fast object recognition phase that takes an average of
1586
+ | 200 ms/model.
1587
+ blank |
1588
+ |
1589
+ |
1590
+ |
1591
+ title | 3.2 Overview
1592
+ text | Our framework works in two main phases: a learning phase and a recognition phase
1593
+ | (see Figure 3.2).
1594
+ | In the learning phase, our system scans each object of interest a few times (typi-
1595
+ | cally 5-10 scans across different poses). The goal is to consistently segment the scans
1596
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 38
1597
+ blank |
1598
+ |
1599
+ |
1600
+ text | into parts as well as identify the junction between part-pairs to recover the respective
1601
+ | junction attributes. Such a goal, however, is challenging given the input quality. We
1602
+ | address the problem using two scene characteristics: (i) many man-made objects are
1603
+ | well approximated by a collection of simple primitives (e.g., planes, boxes, cylinders)
1604
+ | and (ii) the types of junctions between such primitives are limited (e.g., hinge, trans-
1605
+ | lational) and of low-complexity. First, our system recovers a set of stable primitives
1606
+ | for each individual scan. Then, for each object, the system collectively processes
1607
+ | the scans to extract a primitive-based proxy representation along with the necessary
1608
+ | inter-part junction attributes to build a collection of models {M1 , M2 , . . . }.
1609
+ | In the recognition phase, the system starts with a single scan S of the scene.
1610
+ | First, the system extracts the dominant planes in the scene – typically they capture
1611
+ | the ground, walls, desks, etc. The system identifies the ground plane by using the
1612
+ | (approximate) up-vector from the acquisition device and noting that the points lie
1613
+ | above the ground. Planes parallel to the ground are tagged as tabletops if they are at
1614
+ | heights as observed in the training phase (typically 1′ -3′ ) by exploiting the fact that
1615
+ | working surfaces have similar heights across rooms. The system removes the points
1616
+ | associated with the ground plane and the candidate tabletop, and performs connected
1617
+ | component analysis on the remaining points (on a kn -nearest neighbor graph) to
1618
+ | extract pointsets {o1 , o2 , . . . }.
1619
+ | The system tests if each pointset oi can be satisfactorily explained by any of the
1620
+ | object models Mj . Note, however, that this step is difficult since the data is unreliable
1621
+ | and the objects can have large geometric variations due to changes in the position
1622
+ | and pose of objects. The system performs hierarchical matching which uses the
1623
+ | learned geometry, while trying to match individual parts first, and exploits simple
1624
+ | scene priors like (i) placement relations (e.g., monitors are placed on desks, chairs
1625
+ | rest on the ground) and (ii) allowable repetition modes (e.g., monitors usually repeat
1626
+ | horizontally, chairs are repeated on the ground). We assume such priors are available
1627
+ | as domain knowledge (e.g., Fisher et al. [FSH11]).
1628
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 39
1629
+ blank |
1630
+ |
1631
+ |
1632
+ |
1633
+ text | [Figure 3.3 panels: points I; super-points X = {x1, x2, ...}; parts P = {p1, p2, ...};
+ | objects O = {o1, o2, ...}.]
1635
+ blank |
1636
+ |
1637
+ text | Figure 3.3: The unstructured input point cloud is processed into a hierarchical data struc-
1638
+ | ture composed of super-points, parts, and objects.
1639
+ blank |
1640
+ title | 3.2.1 Models
1641
+ text | Our system represents the objects of interest as models that approximate the object
1642
+ | shapes while encoding deformation and relationship information (see also [OLGM11]).
1643
+ | Each model can be thought of as a graph structure, the nodes of which denote the
1644
+ | primitives and the edges of which encode the nodes’ connectivity and relationship
1645
+ | to the environment. Currently, the primitive types are limited to box, cylinder, and
1646
+ | radial structure. A box is used to represent a large flat structure; a cylinder is used to
1647
+ | represent a long and narrow structure; and a radial structure is used to capture parts
1648
+ | with discrete rotational symmetry (e.g., the base of a swivel chair). As an additional
1649
+ | regularization, the system groups parallel cylinders of similar lengths (e.g., legs of
1650
+ | a desk or arms of a chair), which in turn provides valuable cues for possible mirror
1651
+ | symmetries.
1652
+ | The connectivity between a pair of primitives is represented as their transfor-
1653
+ | mation relative to each other and their possible deformations. Our current imple-
1654
+ | mentation restricts deformations to be 1-DOF translation, 1-DOF rotation, and an
1655
+ | attachment. The system tests for translational joints for the cylinders and rotational
1656
+ | joints for cylinders or boxes (e.g., a hinge joint). An attachment represents the ex-
1657
+ | istence of a whole primitive node and is especially useful when, depending on the
1658
+ | configuration, the segmentation of the primitive is ambiguous. For example, the ge-
1659
+ | ometry of doors or drawers of cabinets is not easily segmented when they are closed,
1660
+ | and thus they are handled as an attachment when opened.
1661
+ | Additionally, the system detects contact information for the model, i.e., whether
1662
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 40
1663
+ blank |
1664
+ |
1665
+ |
1666
+ text | the object rests on the ground or on a desk. Note that the system assumes that the
1667
+ | vertical direction is known for the scene. Both the direction of the model and the
1668
+ | direction of the ground define a canonical object transformation.
1669
+ blank |
1670
+ |
1671
+ title | 3.2.2 Hierarchical Structure
1672
+ text | For both the learning and recognition phases, the raw input is unstructured point
1673
+ | clouds. The input is hierarchically organized by considering neighboring points and
1674
+ | assigning contextual information to each hierarchy level. The scene hierarchy has three
1675
+ | levels of segmentation (see Figure 3.3):
1676
+ blank |
1677
+ text | • super-points X = {x1 , x2 , ...};
1678
+ | • parts P = {p1 , p2 , ...} (association Xp = {x : P (x) = p}); and
1679
+ | • objects O = {o1 , o2 , ...} (association Po = {p : O(p) = o}).
1680
+ blank |
1681
+ text | Instead of working directly on individual points, our system uses super-points
1682
+ | x ∈ X as the atomic entities (analogous to super-pixels in images). The system
1683
+ | creates super-points by uniformly sampling points from the raw measurements and
1684
+ | associating local neighborhoods with the samples based on the normal consistency
1685
+ | of points. Such super-points, or a group of points within a small neighborhood, are
1686
+ | less noisy, while at the same time they are sufficiently small to capture the input
1687
+ | distribution of points.
1688
+ | Next, our system aggregates neighboring super-points into primitive parts p ∈ P .
1689
+ | Such parts are expected to relate to individual primitives of models. Each part p
1690
+ | comprises a set of super-points Xp . The system initially finds such parts by merging
1691
+ | neighboring super-points until the region can no longer be approximated by a plane
1692
+ | (in a least squares sense) with average error less than a threshold θdist . Note that the
1693
+ | initial association of super-points with parts can change later.
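+ text | A minimal sketch of this merging rule is given below; the adjacency structure, the use
+ | of an RMS plane-fit residual as the "average error", and the greedy growth order are
+ | assumptions of the sketch rather than details taken from the thesis.
+ blank |
+ | import numpy as np
+ |
+ | def plane_rms_error(points):
+ |     """RMS distance of an (N, 3) point set to its least-squares plane."""
+ |     s = np.linalg.svd(points - points.mean(axis=0), compute_uv=False)
+ |     return s[-1] / np.sqrt(len(points))
+ |
+ | def grow_parts(super_points, neighbors, theta_dist=0.1):
+ |     """super_points: list of (k_i, 3) point arrays; neighbors: adjacency list over
+ |     super-points.  Merge neighbors while the merged region stays planar within theta_dist."""
+ |     part_of = [-1] * len(super_points)
+ |     parts = []
+ |     for seed in range(len(super_points)):
+ |         if part_of[seed] >= 0:
+ |             continue
+ |         part_id, members, queue = len(parts), [seed], [seed]
+ |         part_of[seed] = part_id
+ |         while queue:
+ |             cur = queue.pop()
+ |             for nb in neighbors[cur]:
+ |                 if part_of[nb] >= 0:
+ |                     continue
+ |                 merged = np.vstack([super_points[i] for i in members + [nb]])
+ |                 if plane_rms_error(merged) < theta_dist:
+ |                     part_of[nb] = part_id
+ |                     members.append(nb)
+ |                     queue.append(nb)
+ |         parts.append(members)
+ |     return parts, part_of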
1694
+ | Objects form the final hierarchy level during the recognition phase for scenes con-
1695
+ | taining multiple objects. Objects, having been segmented, are mapped to individ-
1696
+ | ual instances of models, while the association between objects and parts (O(p) ∈
1697
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 41
1698
+ blank |
1699
+ |
1700
+ |
1701
+ text | {1, 2, · · · , No } and Po ) are discovered during the recognition process. Note that dur-
1702
+ | ing the learning phase the system deals with only one object at a time and hence
1703
+ | such segmentation is trivial.
1704
+ | The system creates such a hierarchy in the pre-processing stage using the following
1705
+ | parameters in all our tests: the number of nearest neighbors kn used for normal estimation,
1706
+ | sampling rate fs for super-points, and distance threshold θdist , which reflects the
1707
+ | approximate noise level. Table 3.1 shows the actual values.
1708
+ blank |
1709
+ text | param. values usage
1710
+ | kn 50 number of nearest neighbor
1711
+ | fs 1/100 sampling rate
1712
+ | θdist 0.1m distance threshold for segmentation
1713
+ | Ñp 10-20 Equation 3.1
1714
+ | θheight 0.5 Equation 3.5
1715
+ | θnormal 20◦ Equation 3.6
1716
+ | θsize 2θdist Equation 3.7
1717
+ | λ 0.8 coverage ratio to declare a match
1718
+ blank |
1719
+ text | Table 3.1: Parameters used in our algorithm.
1720
+ blank |
1721
+ |
1722
+ |
1723
+ |
1724
+ title | 3.3 Learning Phase
1725
+ text | The input to the learning phase is a set of point clouds {I 1 , . . . , I n } obtained from
1726
+ | the same object in different configurations. Our goal is to build a model M consisting
1727
+ | of primitives that are linked by joints. Essentially, the system has to simultaneously
1728
+ | segment the scans into an unknown number of parts, establish correspondence across
1729
+ | different measurements, and extract relative deformations. We simplify the problem
1730
+ | by assuming that each part can be represented by primitives and that each joint
1731
+ | can be encoded with a simple degree of freedom (see also [CZ11]). This assumption
1732
+ | allows us to approximate many man-made objects, while at the same time it leads to
1733
+ | a lightweight model. Note that, unlike Schnabel et al. [SWWK08], who use patches
1734
+ | of partial primitives, our system uses full primitives to represent parts in the learning
1735
+ | phase.
1736
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 42
1737
+ blank |
1738
+ |
1739
+ |
1740
+ text | [Figure 3.4 diagram: "Initialize the skeleton (Sec. 3.3.1)": mark stable parts/part-groups, match marked parts, jointly fit primitives to matched parts, update parts; "Incrementally complete the coherent model (Sec. 3.3.2)": match parts by relative position, jointly fit primitives to matched parts, update parts]
1752
+ blank |
1753
+ |
1754
+ |
1755
+ |
1756
+ text | Figure 3.4: The learning phase starts by initializing the skeleton model, which is
1757
+ | defined from coherent matches of stable parts. After initialization, new primitives are
1758
+ | added by finding groups of parts at similar relative locations, and then the primitives
1759
+ | are jointly fitted.
1760
+ blank |
1761
+ text | The learning phase starts by detecting large and stable parts to establish a global
1762
+ | reference frame across different measurements I i (Section 3.3.1). The initial corre-
1763
+ | spondences serve as a skeleton of the model, while other parts are incrementally added
1764
+ | to the model until all of the points are covered within threshold θdist (Section 3.3.2).
1765
+ | Because primitive fitting is unstable on isolated noisy scans, our system jointly refines
1766
+ | the primitives to construct a coherent model M (see Figure 3.4).
1767
+ | The final model also contains attributes necessary for robust matching. For ex-
1768
+ | ample, the distribution of height from the ground plane provides a prior for tables;
1769
+ | objects can have a preferred repetition direction, e.g., monitors or auditorium chairs
1770
+ | are typically repeated sidewise; or objects can have preferred orientations. These
1771
+ | learned attributes and relationships act as reliable regularizers in the recognition
1772
+ | phase, when data is typically sparse, incomplete, and noisy.
1773
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 43
1774
+ blank |
1775
+ |
1776
+ |
1777
+ title | 3.3.1 Initializing the Skeleton of the Model
1778
+ text | The initial structure is derived from large, stable parts across different measurements,
1779
+ | whose consistent correspondences define the reference frame that aligns the measure-
1780
+ | ments. In the pre-processing stage, individual scans I i are divided into super-points
1781
+ | X i and parts P i , as described in Section 3.2.2. The system then marks the stable
1782
+ | parts as candidate boxes or candidate cylinders.
1783
+ | A candidate face of a box is marked by finding parts with a sufficient number of
1784
+ | super-points:
1785
+ | |Xp | > |P|/Ñp , (3.1)
1786
+ blank |
1787
+ text | where Ñp is a user-defined parameter of the approximate number of primitives in the
1788
+ | model. In our tests, a threshold of 10-20 is used. Parallel planes with comparable
1789
+ | heights are grouped together based on their orientation to constitute the opposite
1790
+ | faces of a box primitive.
1791
+ | The system classifies a part as a candidate cylinder if the ratio of the top two
1792
+ | principal components is greater than 2. Subsequently, parallel cylinders with similar
1793
+ | heights (e.g., legs of chairs) are grouped.
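+ text | The two marking rules above can be sketched as follows (illustrative only: the function name,
+ | the input layout, and the use of the total super-point count as a stand-in for the size test of
+ | Equation 3.1 are assumptions).
+ |
+ | import numpy as np
+ |
+ | def mark_part(part_points, n_superpoints_in_part, n_superpoints_total,
+ |               n_primitives_est=15, elongation_ratio=2.0):
+ |     """Mark a part as a candidate box face, a candidate cylinder, or leave it unmarked.
+ |
+ |     part_points:           (k, 3) array of the part's 3D points
+ |     n_superpoints_in_part: |Xp|, the number of super-points in the part
+ |     n_superpoints_total:   total number of super-points in the measurement
+ |     """
+ |     # Size test in the spirit of Equation 3.1 (Ñp = n_primitives_est).
+ |     is_large = n_superpoints_in_part > n_superpoints_total / n_primitives_est
+ |
+ |     # Principal components of the part from the covariance eigenvalues.
+ |     centered = part_points - part_points.mean(axis=0)
+ |     evals = np.sort(np.linalg.eigvalsh(np.cov(centered.T)))[::-1]
+ |
+ |     if evals[0] / max(evals[1], 1e-12) > elongation_ratio:
+ |         return "candidate_cylinder"   # strongly elongated part
+ |     if is_large:
+ |         return "candidate_box_face"   # large, sufficiently populated part
+ |     return "unmarked"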
1794
+ | After candidate boxes and cylinders are marked, the system matches the marked
1795
+ | (sometimes grouped) parts for pairs of measurements P i . The system only uses the
1796
+ | consistent matches to define a reference frame between measurements and jointly fit
1797
+ | primitives to the matched parts (see Section 3.3.2).
1798
+ blank |
1799
+ title | Matching
1800
+ blank |
1801
+ text | After extracting the stable parts P i for each measurement, our goal is to match the
1802
+ | parts across different measurements to build a connectivity structure. The system
1803
+ | picks a seed measurement j ∈ {1, 2, ..., n} at random and compares every other mea-
1804
+ | surement against the seed measurement.
1805
+ | Our system then uses spectral correspondences [LH05] to match parts in seed
1806
+ | {p, q} ∈ P j and other {p′ , q ′ } ∈ P i . The system builds an affinity matrix A, where
1807
+ | each entry represents the matching score between part pairs. Recall that candidate
1808
+ | parts p have associated types (box or cylinder), say t(p). Intuitively, the system
1809
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 44
1810
+ blank |
1811
+ |
1812
+ |
1813
+ text | assigns a higher matching score for the parts with the same type t(p) at similar
1814
+ | relative positions. If a candidate assignment a = (p, p′ ) assigns p ∈ P j to p′ ∈ P i , the
1815
+ | corresponding entries are defined as the following:
1816
+ | A(a, a) = 0 if t(p) ≠ t(p′ ), and A(a, a) = exp(−(hp − hp′ )² / (2 θdist²)) otherwise.   (3.2)
1822
+ blank |
1823
+ text | where our system uses the height from the ground hp as a feature. The affinity value
1824
+ | for a pair-wise assignment between a = (p, p′ ) and b = (q, q ′ ) (p, q ∈ P j and p′ , q ′ ∈ P i )
1825
+ | is defined as:
1826
+ | A(a, b) = 0 if t(p) ≠ t(p′ ) or t(q) ≠ t(q ′ ), and
+ | A(a, b) = exp(−(d(p, q) − d(p′ , q ′ ))² / (2 θdist²)) otherwise,   (3.3)
1834
+ blank |
1835
+ |
1836
+ |
1837
+ text | where d(p, q) represents the distance between two parts p, q ∈ P . The system ex-
1838
+ | tracts the most dominant eigenvector of A to establish a correspondence among the
1839
+ | candidate parts.
1840
+ | After comparing the seed measurement P j against all the other measurements P i ,
1841
+ | the system retains only those matches that are consistent across different measure-
1842
+ | ments. The relative positions of the matched parts define the reference frame of the
1843
+ | object as well as the relative transformation between measurements.
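+ text | The spectral step can be sketched roughly as follows (a simplified reading of the matching
+ | in [LH05], not the thesis code): fill the affinity matrix with Equations 3.2 and 3.3, take its
+ | dominant eigenvector, and greedily keep mutually exclusive assignments. The containers for
+ | heights, types, and pairwise part distances are assumed to be precomputed for both orderings.
+ |
+ | import numpy as np
+ |
+ | def spectral_match(cands, heights_j, heights_i, types_j, types_i,
+ |                    dist_j, dist_i, theta_dist=0.1):
+ |     """Greedy spectral matching over candidate assignments.
+ |
+ |     cands:     list of assignments a = (p, p') with p in P^j and p' in P^i
+ |     heights_*: dict part -> height above the ground plane
+ |     types_*:   dict part -> 'box' or 'cylinder'
+ |     dist_*:    dict (p, q) -> distance between two parts (both orderings stored)
+ |     """
+ |     n = len(cands)
+ |     A = np.zeros((n, n))
+ |     for a, (p, pp) in enumerate(cands):
+ |         if types_j[p] == types_i[pp]:                         # Eq. (3.2)
+ |             A[a, a] = np.exp(-(heights_j[p] - heights_i[pp]) ** 2
+ |                              / (2 * theta_dist ** 2))
+ |         for b, (q, qq) in enumerate(cands):
+ |             if b == a or p == q or pp == qq:
+ |                 continue
+ |             if types_j[p] == types_i[pp] and types_j[q] == types_i[qq]:
+ |                 d = dist_j[(p, q)] - dist_i[(pp, qq)]         # Eq. (3.3)
+ |                 A[a, b] = np.exp(-d ** 2 / (2 * theta_dist ** 2))
+ |
+ |     # Dominant eigenvector of the symmetric affinity matrix.
+ |     _, evecs = np.linalg.eigh(A)
+ |     x = np.abs(evecs[:, -1])
+ |
+ |     # Greedily accept assignments by descending score, at most one per part.
+ |     matches, used_j, used_i = [], set(), set()
+ |     for a in np.argsort(-x):
+ |         p, pp = cands[a]
+ |         if x[a] > 0 and p not in used_j and pp not in used_i:
+ |             matches.append((p, pp))
+ |             used_j.add(p)
+ |             used_i.add(pp)
+ |     return matches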
1844
+ blank |
1845
+ title | Joint Primitive Fitting
1846
+ blank |
1847
+ text | Our system jointly fits primitives to the grouped parts, while adding necessary defor-
1848
+ | mation. First, the primitive type is fixed by testing for the three types of primitives
1849
+ | (box, cylinder, and rotational structure) and picking the primitive with the smallest
1850
+ | fitting error. Once the primitive type is fixed, the corresponding primitives from other
1851
+ | measurements are averaged and added to the model as a jointly fitted primitive.
1852
+ | Our system uses the coordinate frame to position the fitted primitives. More
1853
+ | specifically, the three orthogonal directions of a box are taken from the frame of
+ | reference defined by the ground direction and the relative positions of the matched
1855
+ | parts. If the normal of the largest observed face does not align with the default frame
1856
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 45
1857
+ blank |
1858
+ |
1859
+ |
1860
+ text | of reference, the box is rotated around an axis to align the large plane. The cylinder
1861
+ | is aligned using its axis, while the rotational primitive is tested when the part is at
1862
+ | the bottom of an object.
1863
+ | Note that unlike a cylinder or a rotational structure, a box can introduce new
1864
+ | faces that are invisible because of the placement rules of objects. For example, the
1865
+ | bottom of a chair seat or the back of a monitor are often missing in the input scans.
1866
+ | Hence, the system retains the information about which of the six faces are visible to
1867
+ | simplify the subsequent recognition phase.
1868
+ | Our system now encodes the inter-primitive connectivity as an edge of the graph
1869
+ | structure. The joints between primitives are added by comparing the relationship
1870
+ | between the parent and child primitives. The first matched primitive acts as a root
1871
+ | to the model graph. Subsequent primitives are the children of the closest primitive
1872
+ | among those already existing in the model. A translational joint is added if the size
1873
+ | of the primitive node varies over different measurements by more than θdist ; or, a
1874
+ | rotational joint is added when the relative angle between the parent and child node
1875
+ | differs by more than 20◦ .
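+ text | A minimal sketch of this joint rule, assuming the per-measurement sizes and parent-child
+ | angles come from the jointly fitted primitives; the function and argument names are illustrative.
+ |
+ | import numpy as np
+ |
+ | def infer_joint(sizes, rel_angles_deg, theta_dist=0.1, theta_angle_deg=20.0):
+ |     """Decide which joint connects a child primitive to its parent.
+ |
+ |     sizes:          per-measurement sizes of the child primitive (metres)
+ |     rel_angles_deg: per-measurement angle between child and parent (degrees)
+ |     Returns 'translational', 'rotational', or 'rigid'.
+ |     """
+ |     sizes = np.asarray(sizes, dtype=float)
+ |     angles = np.asarray(rel_angles_deg, dtype=float)
+ |
+ |     if sizes.max() - sizes.min() > theta_dist:         # size varies across scans
+ |         return "translational"
+ |     if angles.max() - angles.min() > theta_angle_deg:  # orientation varies
+ |         return "rotational"
+ |     return "rigid"
+ |
+ | # Example: a drawer whose extent changes by 25 cm across scans.
+ | print(infer_joint([0.45, 0.62, 0.70], [0.0, 1.5, 0.8]))  # -> 'translational'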
1876
+ blank |
1877
+ |
1878
+ title | 3.3.2 Incrementally Completing a Coherent Model
1879
+ text | Having built an initial model structure, the system incrementally adds primitives by
1880
+ | processing super-points that could not be explained by the primitives. The remaining
1881
+ | super-points are processed to create parts, and the parts are matched based on their
1882
+ | relative positions. Starting from the bottom-most matches, the system jointly fits
1883
+ | primitives to the matched parts, as described above. The system iterates the process
1884
+ | until all super-points in measurements are explained by the model.
1885
+ | If some parts exist only in a subset of the measurements, then the
+ | system adds the corresponding primitive as an attachment. For example, in Figure 3.5, after each
1887
+ | side of the rectangular shape of a drawer has been matched, the open drawer is added
1888
+ | as an attachment to the base shape.
1889
+ | The system also maintains the contact point of a model to the ground (or the
1890
+ | bottom-most primitive), the height distribution of each part as a histogram, visible face
1891
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 46
1892
+ blank |
1893
+ |
1894
+ |
1895
+ |
1896
+ text | [Figure 3.5 image: labels "open drawers" and "unmatched parts"]
1902
+ blank |
1903
+ text | Figure 3.5: The open drawers remain as unmatched (grey) after incremental matching
1904
+ | and joint primitive fitting. These parts will be added as an attachment of the model.
1905
+ blank |
1906
+ text | information, and the canonical frame of reference defined during the matching process.
1907
+ | This information, along with the extracted models, is used during the recognition
1908
+ | phase.
1909
+ blank |
1910
+ |
1911
+ title | 3.4 Recognition Phase
1912
+ text | Having learned a set of models (along with their deformation modes) M := {M1 , . . . , Mk }
1913
+ | for a particular environment, the system can quickly collect and understand the envi-
1914
+ | ronment in the recognition phase. This phase is much faster than the learning phase
1915
+ | since there are only a small number of simple primitives and certain deformation
1916
+ | modes from which to search. As an input, the scene S containing the learned models
1917
+ | is collected using the framework from Engelhard et al. [EEH+ 11] which takes a few
1918
+ | seconds. In a pre-processing stage, the system marks the most dominant plane as the
1919
+ | ground plane g. Then, the second most dominant plane that is parallel to the ground
1920
+ | plane is marked as the desk plane d. The system processes the remaining points to
1921
+ | form a hierarchical structure with super-points, parts, and objects (see Section 3.2.2).
1922
+ | The recognition phase starts from a part-based assignment, which quickly com-
1923
+ | pares parts in the measurement and primitive nodes in each model. The algorithm
1924
+ | infers the deformation and transformation of the model from the matched parts, while
+ | validating the match by comparing the actual measurement against the underlying
1926
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 47
1927
+ blank |
1928
+ |
1929
+ |
1930
+ text | [Figure 3.6 diagram: top, "Initial assignments for parts (Sec. 3.4.1)": parts {p1, p2, ...} of an object in the scene are matched to model nodes {m1, m2, m3, l1, α3} of M (e.g., p1 = m3); bottom, "Refined assignment with geometry (Sec. 3.4.2)": iterate between solving for deformation given matches and finding correspondence and segmentation]
1952
+ blank |
1953
+ text | Figure 3.6: Overview of the recognition phase. The algorithm first finds matched parts
1954
+ | before proceeding to recover the entire model and its corresponding segmentation.
1955
+ blank |
1956
+ text | geometry. If a sufficient portion of measurements can be explained by the model,
1957
+ | the system accepts the match as valid, and the segmentation at both the object and
+ | part level is refined to match the model.
1959
+ blank |
1960
+ |
1961
+ title | 3.4.1 Initial Assignment for Parts
1962
+ text | Our system first makes coarse assignments between segmented parts and model nodes
1963
+ | to quickly reduce the search space (see Figure 3.6, top). If a part and a primitive node
1964
+ | form a potential match, the system also induces the relative transformation between
1965
+ | them. The output of the algorithm is a list of triplets {(p, m, T )}, each composed of a
+ | part, a node from the model, and a transformation.
1967
+ | Our system uses geometric features to decide whether individual parts can be
1968
+ | matched with model nodes. Note that the system does not use color information in
1969
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 48
1970
+ blank |
1971
+ |
1972
+ |
1973
+ text | our setting. As features for individual parts Ap , our system considers the following:
1974
+ | (i) height distribution from ground plane as a histogram vector hp ; (ii) three principal
1975
+ | components of the region x1p , x2p , x3p (with x3p = np ); and (iii) sizes along the directions
1976
+ | lp1 > lp2 > lp3 .
1977
+ | Similarly, the system infers the counterpart of features for individual visible faces
1978
+ | of model parts Am . Thus, even if only one face of a part is visible in the measurement,
1979
+ | our system is still able to detect the matched part of the model. The height histogram
1980
+ | hm is calculated from the relative area per height interval and the dimensions and
1981
+ | principal components are inferred from the shape of the faces.
1982
+ | All the parts are compared against all the faces of primitive nodes in the model:
1983
+ blank |
1984
+ text | E(Ap , Am ) = ψ height (hp , hm ) · ψ normal (np , nm ; g) · ψ size ({lp1 , lp2 }, {lm1 , lm2 }).   (3.4)
1988
+ blank |
1989
+ text | Each individual potential function ψ returns either 1 (matched) or 0 (not matched), de-
+ | pending on whether a feature satisfies its criterion within an allowable threshold. Parts are
+ | matched only if all the feature criteria are satisfied. The height potential calculates
1992
+ | the histogram intersection
1993
+ | ψ height (hp , hm ) = Σi min(hp (i), hm (i)) > θheight .   (3.5)
1996
+ blank |
1997
+ |
1998
+ text | The normal potential calculates the relative angle with the ground plane normal (ng )
1999
+ | as
2000
+ | ψ normal (np , nm ; g) = |acos(np · ng ) − acos(nm · ng )| < θnormal . (3.6)
2001
+ blank |
2002
+ text | The size potential compares the size of the part
2003
+ blank |
2004
+ text | ψ size ({lp1 , lp2 }, {lm1 , lm2 }) = |lp1 − lm1 | < θsize and |lp2 − lm2 | < θsize .   (3.7)
2011
+ blank |
2012
+ text | Our system sets the thresholds generously to allow false positives and retain multiple
+ | (or no) matched parts per object (see Table 3.1). In effect, the system first guesses
2014
+ | potential object-model associations and later prunes out the incorrect associations
2015
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 49
2016
+ blank |
2017
+ |
2018
+ |
2019
+ text | in the refinement step using the full geometry (see Section 3.4.2). If Equation 3.4
2020
+ | returns 1, then the system can obtain a good estimate of the relative transformation
2021
+ | T between the model and the part by using the position, normal, and the ground
2022
+ | plane direction to create a triplet (p, m, T ).
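+ text | To make Equations 3.4-3.7 concrete, here is a minimal sketch of the binary potentials (not
+ | the thesis code). The dictionary layout for part and face features is an assumption, height
+ | histograms are assumed to be normalized and to share the same bins, and the thresholds follow
+ | Table 3.1 with θsize = 2θdist = 0.2 m.
+ |
+ | import numpy as np
+ |
+ | def psi_height(h_p, h_m, theta_height=0.5):
+ |     """Eq. (3.5): histogram intersection of the height distributions."""
+ |     return np.minimum(h_p, h_m).sum() > theta_height
+ |
+ | def psi_normal(n_p, n_m, n_g, theta_normal_deg=20.0):
+ |     """Eq. (3.6): compare the angles to the ground-plane normal n_g."""
+ |     ang_p = np.degrees(np.arccos(np.clip(np.dot(n_p, n_g), -1.0, 1.0)))
+ |     ang_m = np.degrees(np.arccos(np.clip(np.dot(n_m, n_g), -1.0, 1.0)))
+ |     return abs(ang_p - ang_m) < theta_normal_deg
+ |
+ | def psi_size(l_p, l_m, theta_size=0.2):
+ |     """Eq. (3.7): compare the two largest extents of part and face."""
+ |     return abs(l_p[0] - l_m[0]) < theta_size and abs(l_p[1] - l_m[1]) < theta_size
+ |
+ | def part_matches_node(part, face, n_g):
+ |     """Eq. (3.4): a part matches a model face only if all potentials pass.
+ |
+ |     part / face are dicts with keys 'h' (height histogram), 'n' (unit normal),
+ |     and 'l' (two largest extents); this layout is an assumption for the sketch.
+ |     """
+ |     return (psi_height(part["h"], face["h"])
+ |             and psi_normal(part["n"], face["n"], n_g)
+ |             and psi_size(part["l"], face["l"]))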
2023
+ blank |
2024
+ |
2025
+ title | 3.4.2 Refined Assignment with Geometry
2026
+ text | Starting from the list of part, node, and transformation triplets {(p, m, T )}, the sys-
2027
+ | tem verifies the assignments with a full model by comparing a segmented object
2028
+ | o = O(p) against models Mi . The goal is to produce accurate part assignments for
2029
+ | observable parts, transformation, and the deformation parameters. Intuitively, the
2030
+ | system finds a local minimum from the suggested starting point (p, m, T ) with the
2031
+ | help of the models extracted in the learning phase. The system then optimizes by
2032
+ | alternately refining the model pose and updating the segmentation (see Figure 3.6,
2033
+ | bottom).
2034
+ | Given the assignment between p and m, the system first refines the registration and
2035
+ | deformation parameters and places the model M to best explain the measurements.
2036
+ | If the placed model covers most of the points that belong to the object (ratio λ = 0.8
2037
+ | in our tests) within the distance threshold θdist , then the system confirms that the
2038
+ | model is matched to the object. Note that, compared to the generous threshold in
2039
+ | part-matching in Section 3.4.1, the system now sets a conservative threshold to prune
2040
+ | false-positives.
2041
+ | In the case of a match, the geometry is fixed and the system refines the segmen-
2042
+ | tation, i.e., the part and object boundaries are modified to match the underlying
2043
+ | geometry. The process is iterated until convergence.
2044
+ blank |
2045
+ title | Refining Deformation and Registration
2046
+ blank |
2047
+ text | Our system finds the deformation parameters using the relative location and orien-
2048
+ | tation of parts and the contact plane (e.g., desk top, the ground plane). Given any
2049
+ | pair of parts, or a part and the ground plane, their mutual distance and orientation
2050
+ | are formulated as functions of deformation parameters existing between the path of
2051
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 50
2052
+ blank |
2053
+ |
2054
+ |
2055
+ text | [Figure 3.7 image: panels "Input points", "Models matched", "Parts assigned", "Initial objects", "Refined objects"]
2061
+ blank |
2062
+ text | Figure 3.7: The initial object-level segmentation can be imperfect especially between
2063
+ | distant parts. For example, the top and base of a chair initially appeared to be sep-
2064
+ | arate objects, but were eventually understood as the same object after the segments
2065
+ | were refined based on the geometry of the matched model.
2066
+ blank |
2067
+ text | the two parts. For example, if our system starts from matched part-primitive pair p1
2068
+ | and m3 in Figure 3.6, then the height and the normal of the part can be expressed as
2069
+ | functions of the deformation parameters l1 and α3 of the model. The system solves a
2070
+ | set of linear equations defined by the observed parts and the contact location to recover
+ | the deformation parameters. Then, the registration between the scan and the
2072
+ | deformed model is refined by Iterative Closest Point (ICP) [BM92].
2073
+ | Ideally, part p in the scene measurement should be explained by the assigned
2074
+ | part geometry within the distance threshold θdist . The model is matched to the
2075
+ | measurement if the proportion of points within θdist is more than λ. (Note that not
2076
+ | all faces of the part need to be explained by the measurement, as only a subset
2077
+ | of the model is measured by the sensor.) Otherwise, the triplet (p, m, T ) is an invalid
2078
+ | assignment and the algorithm returns false. After initial matching (Section 3.4.1),
2079
+ | multiple parts of an object can match to different primitives of many models. If there
2080
+ | are multiple successful matches for an object, the system retains the assignment with
2081
+ | the largest number of points.
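+ text | The acceptance test itself reduces to a coverage ratio. The sketch below assumes SciPy's
+ | cKDTree is available and that the placed, deformed model has been sampled into points; λ = 0.8
+ | and θdist follow Table 3.1.
+ |
+ | import numpy as np
+ | from scipy.spatial import cKDTree
+ |
+ | def model_explains_object(object_points, model_points,
+ |                           theta_dist=0.1, coverage_lambda=0.8):
+ |     """Accept the match if at least lambda of the object's points lie within
+ |     theta_dist of points sampled from the positioned model.
+ |
+ |     object_points: (n, 3) points of the segmented object
+ |     model_points:  (m, 3) points sampled from the placed, deformed model
+ |     """
+ |     tree = cKDTree(model_points)
+ |     dists, _ = tree.query(object_points, k=1)
+ |     coverage = np.mean(dists < theta_dist)
+ |     return coverage >= coverage_lambda, coverage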
2082
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 51
2083
+ blank |
2084
+ |
2085
+ |
2086
+ title | Refine Segmentation
2087
+ blank |
2088
+ text | After a model is picked and positioned in the configuration, its location is fixed
2089
+ | while the system refines the segmentation based on the underlying model. Recall
2090
+ | that the initial segmentation into parts P merges super-points with similar normals and
+ | objects O group neighboring parts using the distance threshold. Although the initial
2092
+ | segmentations provide a sufficient approximation to roughly locate the models, they
2093
+ | do not necessarily coincide with the actual part and object boundaries without being
2094
+ | compared against the geometry.
2095
+ | First, the system updates the association between super-points and the parts by
2096
+ | finding the closest primitive node of the model for each super-point. The super-points
2097
+ | that belong to the same model node are grouped to the same part (see Figure 3.7).
2098
+ | In contrast, super-points that are farther away than the distance threshold θdist from
2099
+ | any of the primitives are separated to form a new segment with a null assignment.
2100
+ | After the part assignment, the system searches for the missing primitives by merg-
2101
+ | ing neighboring objects (see Figure 3.7). In the initial segmentation, objects which
2102
+ | are close to each other in the scene can lead to multiple objects grouped into a sin-
2103
+ | gle segment. Further, particular viewpoints of an object can cause parts within the
2104
+ | model to appear farther apart, leading to spurious multiple segments. Hence, the
2105
+ | super-points are assigned to an object only after the existence of the object is verified
2106
+ | with the underlying geometry.
2107
+ blank |
2108
+ |
2109
+ title | 3.5 Results
2110
+ text | In this section, we present the performance results obtained from testing our system
2111
+ | on various synthetic and real-world scenes.
2112
+ blank |
2113
+ |
2114
+ title | 3.5.1 Synthetic Scenes
2115
+ text | We tested our framework on synthetic scans of 3D scenes obtained from the Google
2116
+ | 3D Warehouse (see Figure 3.8). We implemented a virtual scanner to generate the
2117
+ | synthetic data: once the user specifies a viewpoint, we read the depth buffer to recover
2118
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 52
2119
+ blank |
2120
+ |
2121
+ |
2122
+ text | 3D range data of the virtual scene from the specified viewpoint. We control the scan
2123
+ | quality using three parameters: (i) scanning density d to control the fraction of points
2124
+ | that are retained, (ii) noise level g to control the zero mean Gaussian noise added to
2125
+ | each point along the current viewing direction, and (iii) the angle noise a to perturb
2126
+ | the position in the local tangent plane using zero-mean Gaussian noise. Unless stated otherwise,
2127
+ | we used default values of d = 0.4, g = 0.01, and a = 5◦ .
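+ text | One possible (assumed) implementation of the three corruption parameters is sketched below:
+ | keep a random fraction d of the points, add Gaussian depth noise of standard deviation g along
+ | the viewing direction, and convert angular jitter of standard deviation a into a tangent-plane
+ | offset scaled by the range. This is only one reading of the description above.
+ |
+ | import numpy as np
+ |
+ | def corrupt_scan(points, view_dir, d=0.4, g=0.01, a_deg=5.0, rng=None):
+ |     """Degrade an ideal synthetic scan, an (n, 3) array of points."""
+ |     rng = np.random.default_rng() if rng is None else rng
+ |     points = np.asarray(points, dtype=float)
+ |     view_dir = np.asarray(view_dir, dtype=float)
+ |     view_dir = view_dir / np.linalg.norm(view_dir)
+ |
+ |     # (i) density: keep a random fraction d of the points
+ |     pts = points[rng.random(len(points)) < d]
+ |
+ |     # (ii) depth noise along the viewing direction
+ |     pts = pts + rng.normal(0.0, g, size=len(pts))[:, None] * view_dir
+ |
+ |     # (iii) angle noise: offset in the plane orthogonal to the view direction,
+ |     # proportional to the distance along the ray
+ |     t1 = np.cross(view_dir, [0.0, 0.0, 1.0])
+ |     if np.linalg.norm(t1) < 1e-8:
+ |         t1 = np.cross(view_dir, [0.0, 1.0, 0.0])
+ |     t1 = t1 / np.linalg.norm(t1)
+ |     t2 = np.cross(view_dir, t1)
+ |     depth = pts @ view_dir
+ |     jitter = np.tan(np.radians(rng.normal(0.0, a_deg, size=(len(pts), 2))))
+ |     pts = pts + depth[:, None] * (jitter[:, :1] * t1 + jitter[:, 1:] * t2)
+ |     return pts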
2128
+ | In Figure 3.8, we present typical recognition results using our framework. The
2129
+ | system learned different models of chairs and placed them with varying deformations
2130
+ | (see Table 3.2). We exaggerated some of the deformation modes, including very
2131
+ | high chairs and severely tilted monitors, but could still reliably detect them all (see
2132
+ | Table 3.3). Beyond recognition, our system reliably recovered both positions and
2133
+ | pose parameters within a 5% error margin of the object size. Incomplete data can,
2134
+ | however, result in ambiguities: for example, in synthetic #2 our system correctly
2135
+ | detected a chair, but displayed it in a flipped position, since the scan contained data
2136
+ blank |
2137
+ |
2138
+ |
2139
+ |
2140
+ text | [Figure 3.8 image: rows labeled "synthetic 1", "synthetic 2", "synthetic 3"]
2151
+ blank |
2152
+ text | Figure 3.8: Recognition results on synthetic scans of virtual scenes: (left to right) syn-
2153
+ | thetic scenes, virtual scans, and detected scene objects with variations. Unmatched
2154
+ | points are shown in gray.
2155
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 53
2156
+ blank |
2157
+ |
2158
+ |
2159
+ text | only from the chair’s back. While specific volume-based reasoning can be used to
2160
+ | give preference to chairs in an upright position, our system avoided such case-specific
2161
+ | rules in the current implementation.
2162
+ blank |
2163
+ |
2164
+ |
2165
+ |
2166
+ text | [Figure 3.9 image: a "similar" pair and a "different" pair of chair models]
2167
+ blank |
2168
+ |
2169
+ text | Figure 3.9: Chair models used in synthetic scenes.
2170
+ blank |
2171
+ text | In practice, acquired data sets suffer from varying sampling resolution, noise, and
2172
+ | occlusion. While it is difficult to exactly mimic real-world scenarios, we ran synthetic
2173
+ | tests to assess the stability of our algorithm. We placed two classes of chairs (see
2174
+ | Figure 3.9) on a ground plane, 70-80 chairs of each type, and created scans from
2175
+ | 5 different viewpoints with varying density and noise parameters. For both classes,
2176
+ | we used our recognition framework to measure precision and recall while varying
2177
+ | parameter λ. Note that precision represents how many of the detected objects are
2178
+ | correctly classified out of the total number of detections, while recall represents how many
2179
+ | objects were correctly detected out of the total number of placed objects. In other
2180
+ | words, a precision measure of 1 indicates no false positives, while a recall measure of
2181
+ | 1 indicates there are no false negatives.
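+ text | For reference, the two measures plotted in Figure 3.10 reduce to the usual definitions; the
+ | counts in the example below are made up.
+ |
+ | def precision_recall(true_positives, false_positives, false_negatives):
+ |     """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
+ |     precision = true_positives / (true_positives + false_positives)
+ |     recall = true_positives / (true_positives + false_negatives)
+ |     return precision, recall
+ |
+ | # e.g. 70 correct detections, 2 false alarms, 5 missed chairs:
+ | print(precision_recall(70, 2, 5))   # -> (0.972..., 0.933...)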
2182
+ | Figure 3.10 shows the corresponding precision-recall curves. The first two plots
2183
+ | show precision-recall curves using a similar pair of models, where the chairs have sim-
2184
+ | ilar dimensions, which is expected to result in high false-positive rates (see Figure 3.9,
2185
+ | left). Not surprisingly, recognition improves with a lower noise margin and/or higher
2186
+ | sampling density. Performance, however, saturates for Gaussian noise lower than
+ | 0.3 and density higher than 0.6, since both our model- and part-based components
+ | are approximations of the true data, resulting in an inherent discrepancy between
+ | the measurement and the model, even in the absence of noise. Note that as long as the parts
2190
+ | and dimensions are captured, our system still detects objects even under high noise
2191
+ | and sparse sampling.
2192
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 54
2193
+ blank |
2194
+ |
2195
+ |
2196
+ text | [Figure 3.10 plots: three precision-recall curves titled "Density (a similar pair)", "Noise (a similar pair)", and "Data type"; x-axis Precision, y-axis Recall; legends vary density (0.4-0.8), Gaussian noise (0.004-2.0), and similar vs. different model pairs]
2226
+ blank |
2227
+ text | Figure 3.10: Precision-recall curve with varying parameter λ.
2228
+ blank |
2229
+ text | Our algorithm is more robust when the pair of models is sufficiently
2230
+ | different (see Figure 3.10, right). We tested with two pairs of chairs (see Figure 3.9):
2231
+ | the first pair had chairs of similar dimensions as before (in solid lines), while the
2232
+ | second pair had a chair and a sofa with large geometric differences (in dotted lines).
2233
+ | When tested with the different pair, our system achieved precision higher than 0.98
2234
+ | for recall larger than 0.9. Thus, as long as the geometric space of the objects is sparsely
2235
+ | populated, our algorithm has a high accuracy in quickly acquiring the geometry of
2236
+ | the environment without assistance from data-driven or machine-learning techniques.
2237
+ blank |
2238
+ |
2239
+ title | 3.5.2 Real-World Scenes
2240
+ text | The more practical test of our system is its performance on real scanned data since
2241
+ | it is difficult to synthetically recreate all the artifacts encountered during scanning
2242
+ | of an actual physical space. We tested our framework on a range of real-world ex-
2243
+ | amples, each consisting of multiple objects arranged over large spaces (e.g., office
2244
+ | areas, seminar rooms, auditoriums) at a university. For both the learning and the
2245
+ | recognition phases, we acquired the scenes using a Microsoft Kinect scanner with an
2246
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 55
2247
+ blank |
2248
+ |
2249
+ text | scene        model       points per scan   no. of scans   no. of prim.   no. of joints
+ | synthetic1   chair       28445             7              10             4
+ | synthetic1   stool       19944             7              3              2
+ | synthetic1   monitor     60933             7              3              2
+ | synthetic2   chaira      720364            7              9              5
+ | synthetic2   chairb      852072            1              6              0
+ | synthetic3   chair       253548            4              10             2
+ | office       chair       41724             7              8              4
+ | office       monitor     20011             5              3              2
+ | office       trash bin   28348             2              4              0
+ | office       whitebrd.   356231            1              3              0
+ | auditorium   chair       31534             5              4              2
+ | seminar rm.  chair       141301            1              4              0
2266
+ blank |
2267
+ text | Table 3.2: Models obtained from the learning phase (see Figure 3.11).
2268
+ blank |
2269
+ text | open source scanning library [EEH+ 11]. The scenes were challenging, especially due
2270
+ | to the amount of variability in the individual model poses (see our project page for
2271
+ | the input scans and recovered models). Table 3.2 summarizes all the models built
2272
+ | during the learning stage for these scenes, ranging from 3-10 primitives with 0-5 joints
2273
+ | extracted from only a few scans (see Figure 3.11). While we evaluated our framework
2274
+ | based on the raw Kinect output rather than on processed data (e.g., [IKH+ 11]), the
2275
+ | performance limits should be similar when calibrated to the data quality and physical
2276
+ | size of the objects.
2277
+ blank |
2278
+ |
2279
+ |
2280
+ |
2281
+ text | Figure 3.11: Various models learned/used in our test (see Table 3.2).
2282
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 56
2283
+ blank |
2284
+ |
2285
+ |
2286
+ text | Our recognition phase was lightweight and fast, taking on average 200ms to com-
2287
+ | pare a point cluster to a model on a 2.4GHz CPU with 6GB RAM. For example, in
2288
+ | Figure 3.1, our system detected all 5 chairs present and 4 of the 5 monitors, along with
2289
+ | their poses. Note that objects that were not among the learned models remained un-
2290
+ | detected, including a sofa in the middle of the space and other miscellaneous clutter.
2291
+ | We overlaid the unresolved points on the recognized parts for comparison. Note that
2292
+ | our algorithm had access to only the geometry of objects, not any color or texture
2293
+ | attributes. The complexity of our problem setting can be appreciated by looking at
2294
+ | the input scan, which is difficult even for a human to parse visually. We observed
2295
+ | Kinect data to exhibit highly non-linear noise effects that were not simulated in our
2296
+ | synthetic scans; data also went missing when an object was narrow or specular (e.g.,
2297
+ | monitor), with flying pixels along depth discontinuities, and severe quantization noise
2298
+ | for distant objects.
2299
+ | scene      input points (ave. / min. / max.)   objects present   objects detected*
+ | syn. 1     3227 / 1168 / 9967                  5c 3s 5m          5c 3s 5m
+ | syn. 2     2422 / 1393 / 3427                  4ca 4cb           4ca 4cb
+ | syn. 3     1593 / 948 / 2704                   14 chairs         14 chairs
+ | teaser     6187 / 2575 / 12083                 5c 5m 0t          5c 4m 0t
+ | office 1   3452 / 1129 / 7825                  5c 2m 1t 2w       5c 2m 1t 2w
+ | office 2   3437 / 1355 / 10278                 8c 5m 0t 2w       6c 3m 0t 2w
+ | aud. 1     19033 / 11377 / 29260               26 chairs         26 chairs
+ | aud. 2     9381 / 2832 / 13317                 21 chairs         19 chairs
+ | sem. 1     4326 / 840 / 11829                  13 chairs         11 chairs
+ | sem. 2     6257 / 2056 / 12467                 18 chairs         16 chairs
+ | *c: chair, m: monitor, t: trash bin, w: whiteboard, s: stool
+ | Table 3.3: Statistics for the recognition phase. For each scene, we also indicate the
+ | corresponding scene in Figure 3.8 and Figure 3.12, when applicable.
2315
+ blank |
2316
+ text | Figure 3.12 compiles the results for cluttered office setups, auditoriums, and sem-
2317
+ | inar rooms. Although we tested with different scenes, we present only representative
2318
+ | examples as the performance on all types of scenes was comparable. Our system
2319
+ | detected the chairs, computer monitors, whiteboards, and trash bins across different
2320
+ | rooms, and the rows of auditorium chairs in different configurations. Our system
2321
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 57
2322
+ blank |
2323
+ |
2324
+ |
2325
+ text | missed some of the monitors because the material properties of the screens were proba-
2326
+ | bly not favorable to Kinect capture. The missed monitors (as in Figure 3.1 and office
2327
+ | #2 in Figure 3.12) have big rectangular holes within the screen in the scans. In office
2328
+ | #2, the system also missed two of the chairs that were mostly occluded and beyond
2329
+ | what our framework can handle.
2330
+ | Even under such demanding data quality, our system can recognize the models
2331
+ | and recover poses from data sets an order of magnitude sparser than those required
2332
+ | in the learning phase. Surprisingly, the system could also detect the small tables in
2333
+ | the two auditorium scenes (1 in auditorium #1, and 3 in auditorium #2) and also
2334
+ | identify pose changes in the auditorium seats. Figure 3.13 shows a close-up office
2335
+ | scene to better illustrate the deformation modes that our system captured. All of the
2336
+ | recognized object models have one or more deformation modes, and we can visually
2337
+ | compare the quality of data to the recovered pose and deformation.
2338
+ | The segmentation of real-world scenes is challenging with naturally cluttered
2339
+ | set-ups. The challenge is well demonstrated in the seminar rooms because of closely
2340
+ | spaced chairs or chairs leaning against the wall. In contrast to the auditorium scenes,
2341
+ | where the rows of chairs are detected together making the segmentation trivial, in
2342
+ | the seminar room setting chairs often occlude each other. The quality of data also
2343
+ | deteriorates because of thin metal legs with specular highlights. Nevertheless, our
2344
+ | system correctly recognized most of the chairs along with correct configurations by
2345
+ | first detecting the larger parts. Although only 4-6 chairs were detected in the initial
2346
+ | iteration, our system eventually detected most of the chairs in the seminar rooms by
2347
+ | refining the segmentation based on the learned geometry (in 3-4 iterations).
2348
+ blank |
2349
+ |
2350
+ title | 3.5.3 Comparisons
2351
+ text | In the learning phase, our system requires multiple scans of an object to build a proxy
2352
+ | model along with its deformation modes. Unfortunately, the existing public data sets
2353
+ | do not provide such multiple scans. Instead, we compared our recognition routine
2354
+ | to the algorithm proposed by Koppula et al. [KAJS11] using author-provided code
2355
+ | to recognize objects from a real-time stream of Kinect data after the user manually
2356
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 58
2357
+ blank |
2358
+ |
2359
+ |
2360
+ text | marks the ground plane. We fixed the device location and qualitatively compared
2361
+ | the recognition results of the two algorithms (see Figure 3.14). We observed that
2362
+ | Koppula et al. reliably detect floors, table tops and front-facing chairs, but often fail
2363
+ | to detect chairs facing backwards, or distant ones. They also miss all the monitors,
2364
+ | which usually are very noisy. In contrast, our algorithm, being pose- and variation-
+ | aware, is more stable across multiple frames, even with access to less information (we
2366
+ | do not use color). Note that while our system detected some monitors, their poses are
2367
+ | typically biased toward parts where measurements exist. In summary, for partial and
2368
+ | noisy point-clouds, the probabilistic formulation coupled with geometric reasoning
2369
+ | results in robust semantic labeling of the objects.
2370
+ blank |
2371
+ |
2372
+ title | 3.5.4 Limitations
2373
+ text | While in our tests the recognition results were mostly satisfactory (see Table 3.3),
2374
+ | we observed two main failure modes. First, our system failed to detect objects when
2375
+ | large amounts of data were missing. In real-world scenarios, our object scans could
2376
+ | easily exhibit large holes because of occlusions, specular materials, or thin structures.
2377
+ | Further, scans can be sparse and distorted for distant objects. Second, our system
2378
+ | cannot overcome the limitations of our initial segmentation. For example, if objects
2379
+ | are closer than θdist , our system groups them as a single object; while a single object
2380
+ | can be confused for multiple objects if its measurements are separated by more than
2381
+ | θdist from a particular viewpoint. Although in certain cases the algorithm can recover
2382
+ | segmentations with the help of other visible parts, this recovery becomes difficult
2383
+ | because our system allows objects to deform and hence have variable extent.
2384
+ | However, even with these limitations, our system overall reliably recognized scans
2385
+ | with 1000-3000 points per scan since in the learning phase the system extracted
2386
+ | the important degrees of variation, thus providing a compact, yet powerful, model
2387
+ | (and deformation) abstraction. In a real office setting, the simplicity and speed
2388
+ | of our framework would allow a human operator to immediately notice missed or
2389
+ | misclassified objects and quickly re-scan those areas under more favorable conditions.
2390
+ | We believe that such a progressive scanning workflow will become more commonplace
2391
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 59
2392
+ blank |
2393
+ |
2394
+ |
2395
+ text | in future acquisition setups.
2396
+ blank |
2397
+ |
2398
+ title | 3.5.5 Applications
2399
+ text | Our results suggest that our system is also useful for obtaining a high-level under-
2400
+ | standing of recognized objects, e.g., relative position, orientation, frequency of learned
2401
+ | objects. Specifically, as our system progressively scans multiple rooms populated with
2402
+ | the same objects, the system gathers valuable co-occurrence statistics (see Table 3.4).
2403
+ | For example, from the collected data, the system learns that the orientations of audi-
+ | torium chairs are consistent (i.e., they face a single direction), or observes a pattern in
+ | the relative orientation between a chair and its neighboring monitor. Not surprisingly,
2406
+ | our system found chairs to be more frequent in seminar rooms than in offices.
2407
+ | In the future, we plan to incorporate such information to handle cluttered datasets
2408
+ | while scanning similar environments but with differently shaped objects.
2409
+ blank |
2410
+ text | scene    relationship     distance mean (m)   distance std (m)   angle mean (°)   angle std (°)
+ | office   chair-chair      1.207               0.555              78.7             74.4
+ | office   chair-monitor    0.943               0.164              152              39.4
+ | aud.     chair-chair      0.548               0                  0                0
+ | sem.     chair-chair      0.859               0.292              34.1             47.4
2418
+ blank |
2419
+ text | Table 3.4: Statistics between objects learned for each scene category.
2420
+ blank |
2421
+ text | As an exciting possibility, the system can efficiently detect change. By change, we
2422
+ | mean the introduction of a new object not previously seen in the learning phase, while
2423
+ | factoring out variations due to different spatial arrangements or changes in individual
2424
+ | model poses. For example, in auditorium #2, a previously unobserved chair
2425
+ | is successfully detected (highlighted in yellow). Such a mode is particularly useful
2426
+ | for surveillance and automated investigation of indoor environments, or for disaster
2427
+ | planning in environments that are unsafe for humans to enter.
2428
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 60
2429
+ blank |
2430
+ |
2431
+ |
2432
+ title | 3.6 Conclusions
2433
+ text | We have presented a simple system for recognizing man-made objects in cluttered 3D
2434
+ | indoor environments, while factoring out low-dimensional deformations and pose vari-
2435
+ | ations, on a scale previously not demonstrated. Our pipeline can be easily extended
2436
+ | to more complex environments primarily requiring reliable acquisition of additional
2437
+ | object models and their variability modes.
2438
+ | Several future challenges and opportunities remain: (i) With an increasing number
2439
+ | of object prototypes, the system will need more sophisticated search data structures
2440
+ | in the recognition phase. We hope to benefit from recent advances in shape search.
2441
+ | (ii) We have focused on a severely restricted form of sensor input, namely, poor and
2442
+ | sparse geometry alone. We intentionally left out color and texture, which can be quite
2443
+ | beneficial, especially if appearance variations can be accounted for. (iii) A natural
2444
+ | extension would be to take the recognized models along with their pose and joint
2445
+ | attributes to create data-driven, high-quality interior CAD models for visualization,
2446
+ | or more schematic representations, that may be sufficient for indoor navigation, or
2447
+ | simply for scene understanding (see Figure 3.1, rightmost image, and recent efforts
2448
+ | in scene modeling [NXS12, SXZ+ 12]).
2449
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 61
2450
+ blank |
2451
+ |
2452
+ |
2453
+ text | [Figure 3.12 image: scenes "office 1" (chair, monitor, desk), "office 2" (trash bin, whiteboard), "auditorium 1", "auditorium 2" (open tables, change detection, open seat), "seminar room 1", "seminar room 2" (missed chairs)]
2486
+ blank |
2487
+ |
2488
+ |
2489
+ |
2490
+ text | Figure 3.12: Recognition results on various office and auditorium scenes. Since the
2491
+ | input scans have limited viewpoints and thus are too poor to provide a clear represen-
2492
+ | tation of the scene complexity, we include scene images for visualization (these were
2493
+ | unavailable to the algorithm). Note that for the auditorium examples, our system
2494
+ | even detected the small tables attached to the chairs — this was possible since the
2495
+ | system extracted this variation mode in the learning phase.
2496
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 62
2497
+ blank |
2498
+ |
2499
+ |
2500
+ |
2501
+ text | [Figure 3.13 image: labels "missed monitor", "laptop", "monitor", "chair", "drawer deformations"]
2508
+ blank |
2509
+ text | Figure 3.13: A close-up office scene. All of the recognized objects have one or more
2510
+ | deformation modes. The algorithm inferred the angles of the laptop screen and the
2511
+ | chair back, and the heights of the chair seat, the arm rests, and the monitor. Note that our
2512
+ | system also captured the deformation modes of open drawers.
2513
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 63
2514
+ blank |
2515
+ |
2516
+ |
2517
+ |
2518
+ text | [Figure 3.14 image: two input scenes, each shown for "[Koppula et al.]" and "ours"; annotations "shifted", "wrong labels", "missed"; legend: table top, wall, floor, chair base, table leg, monitor, chair back]
2535
+ blank |
2536
+ text | Figure 3.14: We compared our algorithm and Koppula et al. [KAJS11] using multiple
2537
+ | frames of scans from the same viewpoint. Our recognition results are more stable
2538
+ | across different frames.
2539
+ meta | Chapter 4
2540
+ blank |
2541
+ title | Guided Real-Time Scanning of
2542
+ | Indoor Objects3
2543
+ blank |
2544
+ text | Acquiring 3-D models of indoor environments is a critical component for under-
+ | standing and mapping these environments. For successful 3-D acquisition in indoor
2546
+ | scenes, it is necessary to simultaneously scan the environment, interpret the incom-
2547
+ | ing data stream, and plan subsequent data acquisition, all in a real-time fashion. The
2548
+ | challenge is, however, that individual frames from portable commercial 3-D scanners
2549
+ | (RGB-D cameras) can be of poor quality. Typically, complex scenes can only be
2550
+ | acquired by accumulating multiple scans. Information integration is done in a post-
2551
+ | scanning phase, when such scans are registered and merged, leading eventually to
2552
+ | useful models of the environment. Such a workflow, however, is limited by the fact
2553
+ | that poorly scanned or missing regions are only identified after the scanning process
2554
+ | is finished, when it may be costly to revisit the environment being acquired to per-
2555
+ | form additional scans. In the study presented in this chapter, we focused on real-time
2556
+ | 3D model quality assessment and data understanding that could provide immediate
2557
+ | feedback for guidance in subsequent acquisition.
2558
+ | Evaluating acquisition quality without having any prior knowledge about an un-
2559
+ | known environment, however, is an ill-posed problem. We observe that although the
2560
+ meta | 3
2561
+ text | The contents of the chapter will be published as Y.M. Kim, N. Mitra, Q. Huang, L. Guibas,
2562
+ | Guided Real-Time Scanning of Indoor Environments, Pacific Graphics 2013.
2563
+ blank |
2564
+ |
2565
+ |
2566
+ meta | 64
2567
+ | CHAPTER 4. GUIDED REAL-TIME SCANNING 65
2568
+ blank |
2569
+ |
2570
+ |
2571
+ text | target scene itself may be unknown, in many cases the scene consists of objects from
2572
+ | a well-prescribed pre-defined set of object categories. Moreover, these categories are
2573
+ | well represented in publicly available 3-D shape repositories (e.g., Trimble 3D Ware-
2574
+ | house). For example, an office setting typically consists of various tables, chairs,
2575
+ | monitors, etc., all of which have thousands of instances in the Trimble 3D Ware-
2576
+ | house. In our approach, instead of attempting to reconstruct detailed 3D geometry
2577
+ | from low-quality inconsistent 3D measurements, we focus on parsing the input scans
2578
+ | into simpler geometric entities, and use existing 3D model repositories like Trimble
2579
+ | 3D warehouse as proxies to assist the process of assessing data quality. Thus, we
2580
+ | defined two key tasks that an effective acquisition method would need to complete:
2581
+ | (i) given a partially scanned object, reliably and efficiently retrieve appropriate proxy
2582
+ blank |
2583
+ |
2584
+ |
2585
+ |
2586
+ text | Figure 4.1: We introduce a real-time guided scanning system. As streaming 3D
2587
+ | data is progressively accumulated (top), the system retrieves the top matching mod-
2588
+ | els (bottom) along with their pose to act as geometric proxies to assess the current
2589
+ | scan quality, and provide guidance for subsequent acquisition frames. Only a few
2590
+ | intermediate frames with corresponding retrieved models are shown in this figure.
2591
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 66
2592
+ blank |
2593
+ |
2594
+ |
2595
+ text | models of it from the database; and (ii) position the retrieved models in the scene
2596
+ | and provide real-time feedback (e.g., missing geometry that still needs to be scanned)
2597
+ | to guide subsequent data gathering.
2598
+ | We introduce a novel partial shape retrieval approach for finding similar shapes
2599
+ | of a query partial scan. In our setting, we used the Microsoft Kinect to acquire
2600
+ | the scans of real objects. The proposed approach, which combines both descriptor-
2601
+ | based retrieval and registration-based verification, is able to search in a database of
2602
+ | thousands of models in real-time. To account for partial similarity between the input
2603
+ | scan and the models in a database, we created simulated scans of each database model
2604
+ | and compared a scan of the real setting to a scan of the simulated setting. This allowed us to
2605
+ | efficiently compare shapes using global descriptors even in the presence of only partial
2606
+ | similarity; and the approach remains robust in the case of occlusions or missing data
2607
+ | about the object being scanned.
2608
+ | Once our system finds a match, to mark out missing parts in the current merged
2609
+ | scan, the system aligns it with the retrieved model and highlights the missing part
2610
+ | or places where the scan density is low. This visual feedback allows the operator
2611
+ | to quickly adjust the scanning device for subsequent scans. In effect, our 3D model
2612
+ | database and matching algorithms make it possible for the operator to assess the
2613
+ | quality of the data being acquired and discover badly scanned or missing areas while
2614
+ | the scan is being performed, thus allowing corrective actions to be taken immediately.
2615
+ | We extensively evaluated the robustness and accuracy of our system using syn-
2616
+ | thetic data sets with available ground truth. Further, we tested our system on physical
2617
+ | environments to achieve real-time scene understanding (see the supplementary video
2618
+ | that includes the actual scanning session recorded). In summary, in this chapter, we
2619
+ | present a novel guided scanning interface and introduce a relation-based light-weight
2620
+ | descriptor for fast and accurate model retrieval and positioning to provide real-time
2621
+ | guidance for scanning.
2622
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 67
2623
+ blank |
2624
+ |
2625
+ |
2626
+ title | 4.1 Related Work
2627
+ blank |
2628
+ title | 4.1.1 Interactive Acquisition
2629
+ text | Fast, accurate, and autonomous model acquisition has long been a primary goal in
2630
+ | robotics, computer graphics, and computer vision. With the introduction of afford-
2631
+ | able, portable, commercial RGBD cameras, there has been a pressing need to simplify
2632
+ | scene acquisition workflows to allow less experienced individuals to acquire scene ge-
2633
+ | ometries. Recent efforts fall into two broad categories: (i) combining individual
2634
+ | frames of low-quality point-cloud data with SLAM algorithms [EEH+ 11, HKH+ 12] to
2635
+ | improve scan quality [IKH+ 11]; (ii) using supervised learning to train classifiers for
2636
+ | scene labeling [RBF12] with applications to robotics [KAJS11]. Previously, [RHHL02]
2637
+ | aggregated scans at interactive rates to provide visual feedback to the user. This work
2638
+ | was recently expanded by [DHR+ 11]. [KDS+ 12] extracted simple planes and recon-
2639
+ | structed floor plans with guidance from a projector pattern. While our goal is also to
2640
+ | provide real-time feedback, our system differs from previous efforts in that it uses
2641
+ | retrieved proxy models to automatically assess the current scan quality, enabling
2642
+ | guided scanning.
2643
+ blank |
2644
+ |
2645
+ title | 4.1.2 Scan Completion
2646
+ text | Various strategies have been proposed to improve noisy scans or plausibly fill in miss-
2647
+ | ing data due to occlusion: researchers have exploited repetition [PMW+ 08], symme-
2648
+ | try [TW05, MPWC12], or used primitives to complete missing parts [SWK07]. Other
2649
+ | approaches have focused on using geometric proxies and abstractions including curves,
2650
+ | skeletons, planar abstractions, etc. In the context of image understanding, indoor
2651
+ | scenes have been abstracted and modeled as a collection of simple cuboids [LGHK10,
2652
+ | ZCC+ 12] to capture a variety of man-made objects.
2653
+ blank |
2654
+ |
2655
+ title | 4.1.3 Part-Based Modeling
2656
+ text | Simple geometric primitives, however, are not always sufficiently expressive for com-
2657
+ | plex shapes. Meanwhile, such objects can still be split into simpler parts that aid
2658
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 68
2659
+ blank |
2660
+ |
2661
+ |
2662
+ text | shape understanding. For example, parts can act as entities for discovering rep-
2663
+ | etitions [TSS10], training classifiers [SFC+ 11, XS12], or facilitating shape synthe-
2664
+ | sis [JTRS12]. Alternately, a database of part-based 3D model templates can be used
2665
+ | to detect shapes from incomplete data [SXZ+ 12, NXS12, KMYG12]. Such methods
2666
+ | often rely on expensive matching, and thus do not lend themselves to low-memory
2667
+ | footprint real-time realizations.
2668
+ blank |
2669
+ |
2670
+ title | 4.1.4 Template-Based Completion
2671
+ text | Our system also uses a database of 3D models (e.g., chairs, lamps, tables) to retrieve
+ | shapes from 3D scans. However, by defining a novel, simple descriptor, our sys-
2673
+ | tem, compared to previous efforts, can reliably handle much larger model databases.
2674
+ | Specifically, instead of geometrically matching templates [HCI+ 11], or using templates
2675
+ | to complete missing parts [PMG+ 05], our system initially searches for consistency in
2676
+ | the distribution of relations among primitive faces.
2677
+ blank |
2678
+ |
2679
+ title | 4.1.5 Shape Descriptors
2680
+ text | In the context of shape retrieval, various descriptors have been investigated for group-
2681
+ | ing, classification, or retrieval of 3D geometry. For example, the method proposed by
2682
+ | [CTSO03] uses light-field descriptors based on silhouettes, the method by [OFCD02]
2683
+ | uses shape distributions to categorize different object classes, etc. The silhouette
2684
+ | method requires an expensive rotational alignment search, limiting its usefulness in
2685
+ | our setting to a small number of models (100-200). Both methods assume access
2686
+ | to nearly complete models to match against. In contrast, for guided scanning, our
2687
+ | approach can support much larger model sets (about 2000 models) and, more impor-
2688
+ | tantly, focus on handling poor and incomplete point sets as inputs to the matcher.
2689
+ blank |
2690
+ |
2691
+ title | 4.2 Overview
2692
+ text | Figure 4.2 illustrates the pipeline of our guided real-time scanning system, which con-
2693
+ | sists of a scanning device (Kinect in our case) and a database of 3D shapes containing
2694
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 69
2695
+ blank |
2696
+ |
2697
+ |
2698
+ |
2699
+ text | [Figure 4.2 diagram: an off-line process computes simulated scans, A2h descriptors, and a similarity measure from the database of 3D models; at run time, frames of measurements are registered into a segmented pointcloud, its A2h descriptor and density voxels are computed, a shape is retrieved and aligned, and the retrieved model with its pose provides guidance]
2736
+ blank |
2737
+ |
2738
+ text | Figure 4.2: Pipeline of the real-time guided scanning framework.
2739
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 70
2740
+ blank |
2741
+ |
2742
+ |
2743
+ text | the categories of the shapes present in the environment. In each iteration, the sys-
2744
+ | tem performs three tasks: (i) scan acquisition from a set of viewpoints specified by a
2745
+ | user (or a planning algorithm); (ii) shape retrieval using a distribution of relations; and
2746
+ | (iii) comparison of the scanned pointset with the best retrieved model. The system
2747
+ | iterates these steps until a sufficiently good match is found (see supplementary video).
2748
+ | The challenge is how to maintain real-time response.
2749
+ blank |
2750
+ |
2751
+ title | 4.2.1 Scan Acquisition
2752
+ text | The input stream of a real-time depth sensor (in our case, the Kinect was used) is col-
2753
+ | lected and processed using an open-source implementation [EEH+ 11] that calibrates
2754
+ | the color and depth measurements and outputs the pointcloud data. The color fea-
2755
+ | tures of individual frames are then extracted and matched from consecutive frames.
2756
+ | The corresponding depth values are used to incrementally register the depth mea-
2757
+ | surements [HKH+ 12]. The pointcloud that belongs to the object is segmented as the
+ | system detects the ground plane and excludes the points that belong to the plane. We
+ | will refer to the segmented, registered set of depth measurements as a merged scan.
+ | Whenever a new frame is processed, the system calculates the descriptor and the
+ | density voxels from the pointcloud data for the merged scan.
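The ground-plane step above can be illustrated with a small RANSAC-style plane fit followed by removal of the plane inliers. This is only a minimal sketch assuming numpy point arrays; the function name, distance threshold, and iteration count are hypothetical and not taken from the system described here.

```python
import numpy as np

def segment_ground_plane(points, dist_thresh=0.02, iters=200, seed=0):
    """Detect a dominant plane by a simple RANSAC loop and split the cloud.

    points: (N, 3) array of 3D points from the registered scan.
    Returns (object_points, plane_inlier_mask).
    """
    rng = np.random.default_rng(seed)
    best_mask = np.zeros(len(points), dtype=bool)
    for _ in range(iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(n)
        if norm < 1e-9:                      # degenerate triplet, skip
            continue
        n = n / norm
        dist = np.abs((points - p0) @ n)     # point-to-plane distances
        mask = dist < dist_thresh
        if mask.sum() > best_mask.sum():
            best_mask = mask
    # Points off the dominant plane are treated as the scanned object.
    return points[~best_mask], best_mask
```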
2762
+ blank |
2763
+ |
2764
+ title | 4.2.2 Shape Retrieval
2765
+ text | Our goal is to find shapes in the database that are similar to the merged scan. Since
2766
+ | the merged scan may contain only partial information about the object being scanned,
2767
+ | our system internally generates simulated views of both the merged scan as well as
2768
+ | shapes in the database, and then compares the point clouds associated with these
2769
+ | views. The key observation is that although the merged scan may still have missing
2770
+ | geometry, it is likely that it contains all the visible geometry of the object being
2771
+ | scanned when the object is viewed from a particular point of view (i.e., the self-
2772
+ | occlusions are predictable); it thus becomes comparable to database model views
2773
+ | from the same or nearby viewpoints. Hence, the system measures shape similarity
2774
+ | between such point-cloud views. For shape retrieval, our system first performs a
2775
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 71
2776
+ blank |
2777
+ |
2778
+ |
2779
+ text | descriptor-based similarity search against the entire database to obtain a candidate
2780
+ | set of similar models. Finally, the system performs registration of each model with
2781
+ | the merged scan and returns the model with the best alignment score.
2782
+ | We note here that past research on global shape descriptors has mostly focused on
2783
+ | broad differentiation of shape classes, e.g., separating shapes of vehicles from those
2784
+ | of furniture or of people, etc. In our case, since the system is looking for potentially
2785
+ | modest amounts of missing geometry in the scans, we aim more for fine variability
2786
+ | differentiation among a particular object class, such as chairs. We have therefore
2787
+ | developed and exploited a novel histogram descriptor based on the angles between
2788
+ | the shape normals for this task (see Section 4.3.2).
2789
+ blank |
2790
+ |
2791
+ title | 4.2.3 Scan Evaluation
2792
+ text | Once the retrieval is complete, the retrieved proxy is displayed to the user.
+ | The system also highlights voxels with missing data when compared with the best
+ | matching model, and finishes when the retrieved best-match model is close enough to
+ | the current measurement (i.e., when the missing voxels are less than 1% of the total number
2796
+ | of voxels). In Section 4.3.4, we elaborate on this guided scanning interface.
2797
+ blank |
2798
+ |
2799
+ title | 4.3 Partial Shape Retrieval
2800
+ text | Our goal is to quickly assess the quality of the current scan and guide the user in
2801
+ | subsequent scans. This is challenging on the following counts: (i) the system has
2802
+ | to assess model quality without necessarily knowing which model is being scanned;
2803
+ | (ii) the scans are potentially incomplete, with large parts of data missing; and (iii) the
2804
+ | system should respond in real-time.
2805
+ | We observe that existing database models such as Trimble 3D Warehouse models
2806
+ | can be used as proxies for evaluating scan quality of similar objects being scanned,
2807
+ | thus addressing the first challenge. Hence, for any merged query scan (i.e., point-
2808
+ | cloud) S, the system looks for a match among similar models in the database M =
2809
+ | {M1, · · · , MN}. For simplicity, we assume that the up-right orientation of each model
2810
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 72
2811
+ blank |
2812
+ |
2813
+ |
2814
+ text | in the model database is available.
2815
+ | To handle the second challenge, we note that missing data, even in large chunks,
2816
+ | are mostly the result of self occlusion, and hence are predictable. To address this
2817
+ | problem, our system synthetically scans the models Mi from different viewpoints to
2818
+ | simulate such self occlusions. This greatly simplifies the problem by allowing us to
2819
+ | directly compare S to the simulated scans of Mi , thus automatically accounting for
2820
+ | missing data in S.
2821
+ | Finally, to achieve real-time performance, we propose a simple, robust, yet effective
2822
+ | descriptor to match S to view-dependent scans of Mi . Subsequently, the system
2823
+ | performs registration to verify the match between each matched simulated scan and
2824
+ | the query scan, and returns the most similar simulated scan and the corresponding
2825
+ | model Mi. The following subsections provide further details of each step for
2826
+ | partial shape retrieval.
2827
+ blank |
2828
+ |
2829
+ title | 4.3.1 View-Dependent Simulated Scans
2830
+ text | For each model Mi, the system generates simulated scans S^k(Mi) from multiple camera
+ | positions. Let dup denote the up-right orientation for model Mi. Our system takes
+ | dup as the z-axis and arbitrarily fixes any orthogonal direction di (i.e., di^T dup = 0) as
+ | the x-axis. The system also translates the centroid of Mi to the origin.
2834
+ | The system then virtually positions the cameras at the surface of a view-sphere
2835
+ | around the origin. Specifically, the camera is placed at
2836
+ blank |
2837
+ text | ci := (2d cos θ sin φ, 2d sin θ sin φ, 2d cos φ)
2838
+ blank |
2839
+ text | where d denotes the length of the diagonal of the bounding box of Mi , and φ denotes
2840
+ | the camera altitude. The camera up-vector is defined as
2841
+ blank |
2842
+ text | ui := (dup − ⟨dup, ĉi⟩ ĉi) / ‖dup − ⟨dup, ĉi⟩ ĉi‖,   where ĉi := ci / ‖ci‖,
2845
+ blank |
2846
+ text | and the gaze point is defined as the origin. The fields of view are set to π/2 in both
2847
+ | the up and horizontal directions.
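The camera placement just described can be written down directly from the formulas above. The following is a small illustrative sketch (not the authors' code), assuming numpy and that the model has already been centered with its up-right direction along z; the function name is hypothetical.

```python
import numpy as np

def view_sphere_cameras(d, K=6, phis=(np.pi / 6, np.pi / 3)):
    """Camera centers and up-vectors on the view sphere (sketch of Sec. 4.3.1).

    d: diagonal length of the model's bounding box.
    Returns a list of (center, up, gaze) tuples; the gaze point is the origin.
    """
    d_up = np.array([0.0, 0.0, 1.0])          # up-right direction taken as the z-axis
    cams = []
    for phi in phis:
        for k in range(K):
            theta = 2.0 * np.pi * k / K
            c = 2.0 * d * np.array([np.cos(theta) * np.sin(phi),
                                    np.sin(theta) * np.sin(phi),
                                    np.cos(phi)])
            c_hat = c / np.linalg.norm(c)
            u = d_up - np.dot(d_up, c_hat) * c_hat   # project out the viewing direction
            u = u / np.linalg.norm(u)
            cams.append((c, u, np.zeros(3)))
    return cams
```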
2848
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 73
2849
+ blank |
2850
+ |
2851
+ |
2852
+ text | For each such camera location, our system obtains a synthetic scan using the z-
2853
+ | buffer with a grid setting of 200 × 200. Such a grid results in vertices where the grid
2854
+ | rays intersect the model. The system generates the simulated scan by computing one
2855
+ | surfel (pf , nf , df ) (i.e., a point, normal, and density, respectively) from each quad
+ | face f = (qf1 , qf2 , qf3 , qf4 ), as follows:
2857
+ blank |
2858
+ text | pf := (1/4) Σ_{i=1..4} qfi ,        nf := (1/4) Σ_{ijk ∈ {123,234,341,412}} nijk ,        (4.1)
+ |
+ | df := 1 / Σ_{ijk ∈ {123,234,341,412}} area(qfi , qfj , qfk )        (4.2)
2865
+ blank |
2866
+ |
2867
+ text | where nijk denotes the normal of the triangular face (qfi , qfj , qfk ) and nf ← nf /‖nf ‖.
+ | Thus the simulated scan simply collects the surfels generated from all the quad faces of
+ | the sampling grid.
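Equations (4.1) and (4.2) map each grid quad to a single surfel. A minimal per-quad sketch, assuming numpy and a hypothetical helper name, could look as follows.

```python
import numpy as np

def quad_surfel(q):
    """Surfel (point, normal, density) for one grid quad, after Eqs. (4.1)-(4.2).

    q: (4, 3) array with the quad vertices q_f1..q_f4 in order.
    """
    tris = [(0, 1, 2), (1, 2, 3), (2, 3, 0), (3, 0, 1)]   # index sets {123, 234, 341, 412}
    p_f = q.mean(axis=0)                                   # average of the four vertices
    normals, areas = [], []
    for i, j, k in tris:
        cr = np.cross(q[j] - q[i], q[k] - q[i])
        areas.append(0.5 * np.linalg.norm(cr))             # triangle area
        normals.append(cr / (np.linalg.norm(cr) + 1e-12))  # triangle normal
    n_f = np.mean(normals, axis=0)
    n_f = n_f / (np.linalg.norm(n_f) + 1e-12)              # renormalize: n_f <- n_f / ||n_f||
    d_f = 1.0 / (sum(areas) + 1e-12)                       # inverse of the summed areas
    return p_f, n_f, d_f
```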
2870
+ | Our system places K samples of θ, i.e., θ = 2kπ/K where k ∈ [0, K), and φ ∈
+ | {π/6, π/3} to obtain view-dependent simulated scans for each model Mi. Empirically,
2872
+ | we set K = 6 to balance between efficiency and quality when comparing simulated
2873
+ | scans and the merged scan S.
2874
+ blank |
2875
+ |
2876
+ title | 4.3.2 A2h Scan Descriptor
2877
+ text | Our goal is to design a descriptor that (i) is efficient to compute, (ii) is robust to
2878
+ | noise and outliers, and (iii) has a low-memory footprint. We draw inspiration from
2879
+ | shape distributions [OFCD02], which compute statistics about geometric quantities
2880
+ | that are invariant to global transforms, e.g., distances between pairs of points on
2881
+ | the models. Shape distribution descriptors, however, were designed to be resilient to
2882
+ | local geometric changes. Hence, they are ineffective in our setting, where shapes are
2883
+ | distinguished by subtle local features. Instead, our system computes the distributions
2884
+ | of angles between point normals, which better capture the local geometric features.
2885
+ | Further, since the system knows the upright direction of each shape, this information
2886
+ | is incorporated into the design of the descriptor.
2887
+ | Specifically, for each scan S (real or simulated), our system first allocates the
2888
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 74
2889
+ blank |
2890
+ |
2891
+ |
2892
+ text | points into three bins based on their height along the z-axis, i.e., the up-right direction.
2893
+ | Then, among the points within each bin, the system computes the distribution of
2894
+ | angles between normals of all pairs of points. The angle space is discretized using 50
2895
+ | bins over [0, π], i.e., each bin counts the frequency of normal angles falling within its
+ | range. We call this the A2h scan descriptor, which for each point cloud is a 50 × 3 = 150
2897
+ | dimensional vector; this collects the angle distribution within each height bin.
2898
+ | In practice, for pointclouds belonging to any merged scan, our system randomly
2899
+ | samples 10,000 pairs of points within each height bin to speed up the computation. In
2900
+ | our extensive tests, we found this simple descriptor to perform better than distance-
2901
+ | only histograms in distinguishing fine variability within a broad shape class (see
2902
+ | Figure 4.3).
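The A2h construction described above (three height bins along the up-right axis, 50 angle bins over [0, π], 10,000 sampled pairs per bin) can be sketched as below. This is an illustrative reading of the text, not the system's implementation: it assumes numpy, uses uniform pair sampling for simplicity (the density-aware variant is discussed in Section 4.5.2), and the function name and equal-width height-bin edges are assumptions.

```python
import numpy as np

def a2h_descriptor(points, normals, n_angle_bins=50, n_height_bins=3,
                   n_pairs=10_000, seed=0):
    """A2h histogram sketch (Sec. 4.3.2): angle-between-normals distributions
    collected separately in three height bins along the up-right (z) axis.

    points, normals: (N, 3) arrays; normals assumed to have unit length.
    Returns a vector of length n_angle_bins * n_height_bins (150 here).
    """
    rng = np.random.default_rng(seed)
    z = points[:, 2]
    edges = np.linspace(z.min(), z.max() + 1e-9, n_height_bins + 1)
    desc = []
    for b in range(n_height_bins):
        idx = np.where((z >= edges[b]) & (z < edges[b + 1]))[0]
        hist = np.zeros(n_angle_bins)
        if len(idx) >= 2:
            i = rng.choice(idx, n_pairs)
            j = rng.choice(idx, n_pairs)
            cos = np.clip(np.einsum('ij,ij->i', normals[i], normals[j]), -1.0, 1.0)
            ang = np.arccos(cos)                      # pairwise normal angles in [0, pi]
            hist, _ = np.histogram(ang, bins=n_angle_bins, range=(0.0, np.pi))
            hist = hist / hist.sum()                  # normalize to a distribution
        desc.append(hist)
    return np.concatenate(desc)
```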
2903
+ blank |
2904
+ |
2905
+ title | 4.3.3 Descriptor-Based Shape Matching
2906
+ text | A straightforward way to compare two descriptor vectors f1 and f2 is to take the Lp
+ | norm of their difference vector f1 − f2. However, the Lp norm can be sensitive to
+ | noise and does not account for the similarity between nearby bins of the two distributions.
2909
+ | Instead, our system uses the Earth Mover’s distance (EMD) to compare a pair of
2910
+ | distributions [RTG98]. Intuitively, given two distributions, one distribution can be
2911
+ | seen as a mass of earth properly spread in space, the other distribution as a collection
2912
+ | of holes that need to be filled with that earth. Then, the EMD measures the least
2913
+ | amount of work needed to fill the holes with earth. Here, a unit of work corresponds to
2914
+ | transporting a unit of earth by a unit of ground distance. The costs of “moving earth”
2915
+ | reflect the notion of nearness between bins; therefore the distortion due to noise is
2916
+ | minimized. In a 1D setting, EMD with L1 norms is equivalent to calculating an L1
2917
+ | norm for cumulative distribution functions (CDF) of the distribution [Vil03]. Hence,
2918
+ | our system achieves robustness to noise at the same time complexity as calculating
2919
+ | an L1 norm between the A2h distributions. For all of the results presented below, our
2920
+ | system used EMD with L1 norms of the CDFs computed from the A2h distributions.
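Because the A2h histograms are 1D distributions per height bin, the EMD with an L1 ground distance reduces to the L1 norm between CDFs, as noted above. The sketch below assumes numpy; combining the three height bins by summing their per-bin EMDs is an assumption made here for illustration, since the text does not spell out how the bins are aggregated.

```python
import numpy as np

def emd_1d(hist_a, hist_b):
    """1D Earth Mover's Distance between two histograms via their CDFs.

    For 1D distributions with an L1 ground distance, EMD equals the L1 norm
    of the difference of the cumulative distribution functions [Vil03].
    """
    a = np.asarray(hist_a, dtype=float)
    b = np.asarray(hist_b, dtype=float)
    a = a / max(a.sum(), 1e-12)
    b = b / max(b.sum(), 1e-12)
    return np.abs(np.cumsum(a) - np.cumsum(b)).sum()

def a2h_distance(desc_a, desc_b, n_height_bins=3):
    """Compare two A2h descriptors by summing the per-height-bin 1D EMDs."""
    chunks_a = np.split(np.asarray(desc_a, dtype=float), n_height_bins)
    chunks_b = np.split(np.asarray(desc_b, dtype=float), n_height_bins)
    return sum(emd_1d(a, b) for a, b in zip(chunks_a, chunks_b))
```

A retrieval loop would then score the query descriptor against the descriptors of the 2K simulated scans of each model and keep the models with the best view scores.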
2921
+ | Because there are 2K view-dependent pointclouds associated with each model Mi ,
2922
+ | the system matches the query S with each such pointcloud S^k(Mi) (k = 1, 2, ..., 2K)
2923
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 75
2924
+ blank |
2925
+ |
2926
+ |
2927
+ text | and records the best matching score. In the end, the system returns the top 25
2928
+ | matches across the models in M.
2929
+ blank |
2930
+ |
2931
+ title | 4.3.4 Scan Registration
2932
+ text | Our system overlays the retrieved model Mi over the merged scan S as follows: the system
+ | first aligns the centroid of the simulated scan S^k(Mi) to match the centroid of S (note
2934
+ | that we do not force the model Mi to touch the ground), while scaling model Mi to
2935
+ | match the data. To fix the remaining 1DOF rotational ambiguity, the angle space is
2936
+ | discretized into 10◦ intervals, and the system picks the angle for which the rotated
2937
+ | model best matches the scan S. In practice, we found this refinement step necessary
2938
+ | since our view-dependent scans have coarse angular resolution (K = 6).
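The 1-DOF rotation search can be sketched as a brute-force sweep over 10-degree increments about the up axis. The alignment score used here (mean nearest-neighbor distance, computed with a dense distance matrix purely for clarity) is an assumption for illustration; the thesis does not specify the exact matching score, and the function name is hypothetical.

```python
import numpy as np

def best_z_rotation(model_pts, scan_pts, step_deg=10):
    """Resolve the remaining 1-DOF rotational ambiguity (sketch of Sec. 4.3.4).

    Tries rotations about the up (z) axis in fixed increments and keeps the one
    whose rotated model points lie closest to the scan.
    Returns (best_angle_in_radians, best_cost).
    """
    best_angle, best_cost = 0.0, np.inf
    for deg in range(0, 360, step_deg):
        t = np.deg2rad(deg)
        R = np.array([[np.cos(t), -np.sin(t), 0.0],
                      [np.sin(t),  np.cos(t), 0.0],
                      [0.0,        0.0,       1.0]])
        rotated = model_pts @ R.T
        # mean distance from each rotated model point to its nearest scan point
        d = np.linalg.norm(rotated[:, None, :] - scan_pts[None, :, :], axis=2)
        cost = d.min(axis=1).mean()
        if cost < best_cost:
            best_cost, best_angle = cost, t
    return best_angle, best_cost
```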
2939
+ | Finally, the system uses the positioned proxy model Mi to assess the quality of the
+ | current scan. Specifically, the bounding box of Mi is discretized into 9 × 9 × 9 voxels
+ | and the density of points falling within each voxel is calculated. Voxels where the
+ | matched model has a high density of points (above the average) but the scan S contributes
+ | insufficient points are highlighted, thus providing guidance for subsequent acquisitions.
+ | The process terminates when there are fewer than 10 such highlighted voxels, and the
+ | best matching model is simply displayed.
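The voxel-based guidance can be sketched as two 9 × 9 × 9 occupancy counts over the model's bounding box. Flagging a voxel when the model's count is above its own average while the scan contributes no points there is an assumption made for this sketch; the text only says "insufficient points". The function name and numeric guards are hypothetical.

```python
import numpy as np

def missing_voxels(model_pts, scan_pts, grid=9):
    """Highlight voxels that the proxy model fills but the scan does not (sketch).

    Both point sets are assumed to be aligned already. Counts are taken inside
    the model's bounding box; a voxel is flagged when the model density is above
    its own average but the scan has no points there.
    """
    lo = model_pts.min(axis=0)
    hi = model_pts.max(axis=0)
    span = np.maximum(hi - lo, 1e-9)

    def counts(pts):
        idx = np.clip(((pts - lo) / span * grid).astype(int), 0, grid - 1)
        c = np.zeros((grid, grid, grid), dtype=int)
        np.add.at(c, (idx[:, 0], idx[:, 1], idx[:, 2]), 1)
        return c

    model_c = counts(model_pts)
    scan_c = counts(scan_pts)
    flagged = (model_c > model_c.mean()) & (scan_c == 0)
    return flagged            # boolean (grid, grid, grid) mask of voxels to highlight
```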
2947
+ blank |
2948
+ |
2949
+ title | 4.4 Interface Design
2950
+ text | The real-time system guides the user to scan an object and retrieve the closest match.
2951
+ | In our study, we used the Kinect scanner for the acquisition and the retrieval process
2952
+ | took 5-10 seconds/iteration on our unoptimized implementation. The user scans an
2953
+ | object from an operating distance of about 1-3m. The sensor's real-time stream of
+ | depth pointclouds and color images is visible to the user at all times (see
2955
+ | Figure 4.4).
2956
+ | The user starts scanning by pointing the sensor to the ground plane. The ground
2957
+ | plane is detected if the sensor captures a dominant plane that covers more than 50% of
2958
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 76
2959
+ blank |
2960
+ |
2961
+ |
2962
+ text | the scene. Our system uses this plane to extract the upright direction of the captured
2963
+ | scene. When the ground plane is successfully detected, the user receives an indication
2964
+ | on the screen (Figure 4.4, top-right).
2965
+ | In a separate window, the pointcloud data corresponding to the object being cap-
2966
+ | tured is continuously displayed. The system registers the points using image features
2967
+ | and segments the object by extracting the groundplane. The displayed pointcloud
2968
+ | data is also used to calculate the descriptor and the voxel density. At the end of
2969
+ | the retrieval stage (see Section 4.3), the system retains the correspondence between the
+ | closest matching model and the current pointcloud data. The pointcloud is over-
2971
+ | laid with two additional cues: (i) missing data in voxels as compared with the closest
2972
+ | matched model, and (ii) the 3D model of the closest match of the object. Based on
2973
+ | this guidance, the user can then acquire the next scan. The system automatically
2974
+ | stops when the matched model is similar to the captured pointcloud.
2975
+ blank |
2976
+ |
2977
+ title | 4.5 Evaluation
2978
+ text | We tested the robustness of the proposed A2h descriptor on synthetically generated
2979
+ | data against available groundtruth. Further, we let novice users use our system
2980
+ | to scan different indoor environments. The real-time guidance allowed the users to
2981
+ | effectively capture the indoor scenes (see supplementary video).
2982
+ blank |
2983
+ text | dataset # models average # points/scan
2984
+ | chair 2138 45068
2985
+ | couch 1765 129310
2986
+ | lamp 1805 11600
2987
+ | table 5239 61649
2988
+ blank |
2989
+ text | Table 4.1: Database and scan statistics.
2990
+ blank |
2991
+ |
2992
+ |
2993
+ title | 4.5.1 Model Database
2994
+ text | We considered four categories of objects (i.e., chairs, couches, lamps, tables) in our
2995
+ | implementation. For each category, we downloaded a large number of models from
2996
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 77
2997
+ blank |
2998
+ |
2999
+ |
3000
+ text | the Trimble 3D Warehouse (see Table 4.1) to act as proxy geometry in the online
3001
+ | scanning phase. The models were pre-scaled and moved to the origin. We syntheti-
3002
+ | cally scanned each such model from 12 different viewpoints and computed the A2h
3003
+ | descriptor for each such scan. Note that we placed the camera only above the objects
3004
+ | (altitudes of π/6 and π/3) as the input scans rarely capture the underside of the ob-
3005
+ | jects. We used the Kinect scanner to gather streaming data and used an open source
3006
+ | library [EEH+ 11] to accumulate the input data to produce merged scans.
3007
+ blank |
3008
+ |
3009
+ title | 4.5.2 Retrieval Results with Simulated Data
3010
+ text | The proposed A2h descriptor is effective in retrieving similar shapes in fractions of
3011
+ | seconds. Figures 4.5, 4.6, 4.7, and 4.8 show typical retrieval results. In our tests, we
3012
+ | found the retrieval results to be useful for chairs and couches, which have a wider
3013
+ | variation of angles compared to lamps or tables, whose shapes are almost always
+ | highly symmetric.
3015
+ blank |
3016
+ title | Effect of Viewpoints
3017
+ blank |
3018
+ text | The scanned data often have significant parts missing, mainly due to self-occlusion.
3019
+ | We simulated this effect on the A2h descriptor-based retrieval and compared the
3020
+ | performance against retrieval with merged (simulated) scans (Figure 4.9). We found
3021
+ | the retrieval results to be robust and the models sufficiently representative to be used
3022
+ | as proxies for subsequent model assessment.
3023
+ blank |
3024
+ title | Comparison with Other Descriptors
3025
+ blank |
3026
+ text | We also tested existing shape descriptors: the silhouette-based light field descriptor [CTSO03],
+ | the local spin image [Joh97], and the D2 descriptor [OFCD02]. In all the cases, we found
3028
+ | our A2h descriptor to be more effective in quickly resolving local geometric changes,
3029
+ | particularly for low quality partial pointclouds. In contrast, we found the light field
3030
+ | descriptor to be more susceptible to noise, the local spin image more expensive to com-
3031
+ | pute, and the D2 descriptor less able to distinguish between local variations than our
3032
+ | A2h descriptor (see Figure 4.3).
3033
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 78
3034
+ blank |
3035
+ |
3036
+ |
3037
+ text | We next evaluated the degradation in the retrieval results under perturbations in
3038
+ | sampling density and noise.
3039
+ blank |
3040
+ title | Effect of Density
3041
+ blank |
3042
+ text | During scanning, points are sampled uniformly on the sensor grid, instead of uniformly
3043
+ | on the model surface. This uniform sampling on the sensor grid results in varying
3044
+ | densities of scanned points depending on the viewpoint. Our system compensates for
3045
+ | this effect by assigning probabilities that are inversely proportional to the density of
3046
+ | sample points.
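The inverse-density compensation can be sketched as weighted pair sampling, where each point's selection probability is proportional to the inverse of its density estimate (e.g., the surfel density d_f of Section 4.3.1). This is an illustrative sketch assuming numpy; the helper name is hypothetical.

```python
import numpy as np

def density_aware_pairs(points, densities, n_pairs=10_000, seed=0):
    """Sample point-pair indices with probability inversely proportional to density.

    densities: per-point density estimates; sparsely sampled surface regions are
    drawn more often, so the A2h histogram approximates uniform sampling over the
    surface rather than over the sensor grid.
    """
    rng = np.random.default_rng(seed)
    w = 1.0 / np.maximum(np.asarray(densities, dtype=float), 1e-12)
    w = w / w.sum()
    i = rng.choice(len(points), size=n_pairs, p=w)
    j = rng.choice(len(points), size=n_pairs, p=w)
    return i, j
```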
3047
+ | Figure 4.10 shows the effect of density compensation on the histogram distribu-
3048
+ | tions. We tested two different combination of viewpoints and compared the distribu-
3049
+ | tions, using sampling based on uniform distribution or inversely proportional to the
3050
+ | density. Density-aware sampling are indicated by dotted lines. The overall shapes
3051
+ | of the graphs are similar for uniform and density-aware samplings. However, the ab-
3052
+ | solute values on the peaks are observed at similar heights while using density-aware
3053
+ | sampling. Hence, our system uses density-aware sampling to achieve robustness to
3054
+ | sampling variations.
3055
+ blank |
3056
+ title | Effect of Noise
3057
+ blank |
3058
+ text | In Figure 4.11, we show the robustness of A2h histograms under noise. Generally, the
3059
+ | histograms become smoother under increasing noise as subtle orientation variations
3060
+ | get masked. For reference, the Kinect measurements from a distance range of 1-2m
3061
+ | have noise perturbations comparable to 0.005 noise in the simulated data. We added
3062
+ | synthetic Gaussian noise to the simulated data when calculating the A2h descriptors,
+ | so that the shape of their histograms better matches that of real measurements.
3064
+ blank |
3065
+ |
3066
+ title | 4.5.3 Retrieval Results with Real Data
3067
+ text | Figure 4.12 shows retrieval results on a range of objects (i.e., chairs, couches, lamps,
3068
+ | and tables). Overall we found the guided interface to work well in practice. The
3069
+ | performance was better for chairs and couches, while for lamps and tables, the thin
3070
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 79
3071
+ blank |
3072
+ |
3073
+ |
3074
+ text | structures led to some failure cases. In all cases, the system successfully handled
+ | missing data as high as 40-60% of the object surface (i.e., up to half of the object
+ | surface invisible), and the system responded at interactive rates. Note that for
3077
+ | testing purposes we manually pruned the input database models to leave out models
3078
+ | (if any) that looked very similar to the target objects to be scanned. Please refer to
3079
+ | the supplementary video for the system in action.
3080
+ blank |
3081
+ |
3082
+ title | 4.6 Conclusions
3083
+ text | We have presented a real-time guided scanning setup for online quality assessment of
3084
+ | streaming RGBD data obtained while acquiring indoor environments. The proposed
3085
+ | approach is motivated by three key observations: (i) indoor scenes largely consist of
3086
+ | a few different types of objects, each of which can be reasonably approximated by
3087
+ | commonly available 3D model sets; (ii) data is often missed due to self-occlusions,
3088
+ | and hence such missing regions can be predicted by comparisons against synthetically
3089
+ | scanned database models from multiple viewpoints; and (iii) streaming scan data can
3090
+ | be robustly and effectively compared against simulated scans by a direct comparison
3091
+ | of the distribution of relative local orientations in the two types of scans. The best
3092
+ | retrieved model is then used as a proxy to evaluate the quality of the current scan and
3093
+ | guide subsequent acquisition frames. We have demonstrated the real-time system on
3094
+ | a large number of synthetic and real-world examples with a database of 3D models,
3095
+ | often numbering in the few thousands.
3096
+ | In the future, we would like to extend our guided system to create online recon-
3097
+ | structions while specifically focusing on generating semantically valid scene models.
3098
+ | Using context information in the form of co-occurrence cues (e.g., a keyboard and
3099
+ | mouse are usually near each other) can prove to be effective. Finally, we plan to use
3100
+ | GPU-based optimized codes to handle additional categories of 3D models.
3101
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 80
3102
+ blank |
3103
+ |
3104
+ |
3105
+ |
3106
+ text | [Figure 4.3 panel labels (two examples): rows D2, A2h, and query / aligned model.]
3132
+ blank |
3133
+ text | Figure 4.3: Representative shape retrieval results using the D2 descriptor ([OFCD02],
3134
+ | first row), the A2h descriptor introduced in this chapter (Section 4.3.2, second row),
3135
+ | and the aligned models after scan registration (Section 4.3.4, third row) on the top 25
3136
+ | matches from A2h. For each method, we only show the top 4 matches. The D2 and
3137
+ | A2h descriptor (first two rows) are compared by histogram distributions, which is
+ | quick and efficient. Empirically, we observed the A2h descriptor to better capture
3139
+ | local geometric features compared to the D2 descriptor, with local registration further
3140
+ | improving the retrieval quality. The comparison based on 3D alignment (third row)
3141
+ | is more accurate, but requires more computation time, and cannot be performed in
3142
+ | real-time given the size of our database of models.
3143
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 81
3144
+ blank |
3145
+ |
3146
+ |
3147
+ text | [Figure 4.4 panel labels: scanning setup; detected groundplane; scanning guidance; current scan; current scan and retrieved model.]
3167
+ blank |
3168
+ text | Figure 4.4: The proposed guided real-time scanning setup is simple to use. The
3169
+ | user starts by scanning using a Microsoft Kinect (top-left). The system first detects
3170
+ | the ground plane and the user is notified (top-right). The current pointcloud corre-
3171
+ | sponding to the target object is displayed in the 3D view window, the best matching
3172
+ | database model is retrieved (overlaid in transparent white), and the predicted missing
3173
+ | voxels are highlighted as yellow voxels (middle-right). Based on the provided guid-
3174
+ | ance, the user acquires the next frame of data, and the process continues. Our method
3175
+ | stops when the retrieved shape explains the captured pointcloud well. Finally, the
3176
+ | overlaid 3D shape is highlighted in white (bottom-right). Note that the accumulated
3177
+ | scans have significant parts missing in most scanning steps.
3178
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 82
3179
+ blank |
3180
+ |
3181
+ |
3182
+ |
3183
+ text | Figure 4.5: Retrieval results with simulated data using a chair data set. Given the
3184
+ | model in the first column, the database of 2138 models is matched using the A2h
3185
+ | descriptor, and the top 5 matches are shown.
3186
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 83
3187
+ blank |
3188
+ |
3189
+ |
3190
+ |
3191
+ text | Figure 4.6: Retrieval results with simulated data using a couch data set. Given the
3192
+ | model in the first column, the database of 1765 models is matched using the A2h
3193
+ | descriptor, and the top 5 matches are shown.
3194
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 84
3195
+ blank |
3196
+ |
3197
+ |
3198
+ |
3199
+ text | Figure 4.7: Retrieval results with simulated data using a lamp data set. Given the
3200
+ | model in the first column, the database of 1805 models is matched using the A2h
3201
+ | descriptor, and the top 5 matches are shown.
3202
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 85
3203
+ blank |
3204
+ |
3205
+ |
3206
+ |
3207
+ text | Figure 4.8: Retrieval results with simulated data using a table data set. Given the
3208
+ | model in the first column, the database of 5239 models is matched using the A2h
3209
+ | descriptor, and the top 5 matches are shown.
3210
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 86
3211
+ blank |
3212
+ |
3213
+ |
3214
+ |
3215
+ text | [Figure 4.9 panel labels (three examples): query object, merged scan, view-dependent scans.]
3239
+ blank |
3240
+ text | Figure 4.9: Comparison between retrieval with view-dependent and merged scans.
3241
+ | The models are sorted by matching scores, with lower scores denoting better matches.
3242
+ | The leftmost images show the query scans. Note that the view-dependent scan-based
3243
+ | retrieval is robust even with significant missing regions (∼30-50%). The numbers
3244
+ | in parenthesis denote the view index.
3245
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 87
3246
+ blank |
3247
+ |
3248
+ |
3249
+ |
3250
+ text | Figure 4.10: Effect of density-aware sampling on two different combinations of views
+ | (comb1 and comb2). The samplings that consider the density of points are denoted
+ | comb1_d and comb2_d, respectively.
3253
+ blank |
3254
+ |
3255
+ |
3256
+ |
3257
+ text | Figure 4.11: Effect of noise. The shape of the histogram becomes smoother as the level
3258
+ | of noise increases.
3259
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 88
3260
+ blank |
3261
+ |
3262
+ |
3263
+ text | [Figure 4.12 panel labels: image, accumulated scan, retrieved proxy model; rows: chairs, couches, lamps, tables.]
3286
+ blank |
3287
+ |
3288
+ text | Figure 4.12: Real-time retrieval results on various datasets. For each set, we show
3289
+ | the image of the object being scanned, the accumulated pointcloud, and the closest
+ | retrieved model, along with the top 25 candidates that are picked from the
3291
+ | database of thousands of models using the proposed A2h descriptor.
3292
+ meta | Chapter 5
3293
+ blank |
3294
+ title | Conclusions
3295
+ blank |
3296
+ text | 3-D reconstruction in indoor environments is a challenging problem because of the
+ | complexity and variety of the objects present, and the frequent changes in the positions
+ | of objects made by the people who inhabit the space. Building on recent technology, the
+ | work presented in this dissertation frames the reconstruction of indoor environments
+ | as a set of light-weight systems.
3301
+ | RGB-D cameras (e.g., Microsoft Kinect) are a new type of sensor and the standard
3302
+ | for utilizing the data is not yet fully established. Still, the sensor is revolutionary
3303
+ | because it is an affordable technology that can capture the 3-D data of everyday
3304
+ | environments at video frame rate. This dissertation covers quick pipelines that allow
3305
+ | real-time interaction between the user and the system. However, such data
3306
+ | comes at the price of complex noise characteristics.
3307
+ | To reconstruct the challenging indoor structures with limited data, we imposed
3308
+ | different geometric priors depending on the target applications and aimed for high-
3309
+ | level understanding. In Chapter 2, we presented a pipeline to acquire floor plans using
3310
+ | large planes as a geometric prior. We followed the well-known Manhattan-world
3311
+ | assumption and utilized user feedback to overcome ambiguous situations and specify
3312
+ | the important planes to be included in the model. Chapter 3 described our use
3313
+ | of simple models of repeating objects with deformation modes. Public places with
3314
+ | many repeating objects can be reconstructed by recovering the low-dimensional
3315
+ | deformation and placement information. Chapter 4 showed how we retrieve complex
3316
+ blank |
3317
+ |
3318
+ meta | 89
3319
+ | CHAPTER 5. CONCLUSIONS 90
3320
+ blank |
3321
+ |
3322
+ |
3323
+ text | shapes of objects with the help of a large database of 3-D models, as we developed a
+ | descriptor that can be computed and searched efficiently and allows online quality
+ | assessment to be presented to the user.
3326
+ | Each of the pipelines presented in these chapters targets a specific application
3327
+ | and has been evaluated accordingly. The work of the dissertation can be extended
3328
+ | into other possible real-life applications that can connect actual environments with
3329
+ | the virtual world. The depth data from RGB-D cameras is easy to acquire, but we
3330
+ | still do not know how to make full use of the massive amount of information produced.
3331
+ | The potential applications can benefit from better understanding and handling of the
3332
+ | data. As one extension, we are interested in scaling the database of models and data
3333
+ | with special attention paid to the data structures. The research community and others
+ | would also benefit from advances in the use of reliable depth and color features for
+ | the new type of data obtained from RGB-D sensors, in addition to the presented
+ | descriptor.
3337
+ meta | Bibliography
3338
+ blank |
3339
+ ref | [BAD10] Soonmin Bae, Aseem Agarwala, and Fredo Durand. Computational
3340
+ | rephotography. ACM Trans. Graph., 29(5), 2010.
3341
+ blank |
3342
+ ref | [BM92] Paul J. Besl and Neil D. McKay. A method for registration of 3-D
3343
+ | shapes. IEEE PAMI, 14(2):239–256, 1992.
3344
+ blank |
3345
+ ref | [CTSO03] Ding-Yun Chen, Xiao-Pei Tian, Yu-Te Shen, and Ming Ouhyoung. On
3346
+ | visual similarity based 3D model retrieval. CGF, 22(3):223–232, 2003.
3347
+ blank |
3348
+ ref | [CY99] James M. Coughlan and A. L. Yuille. Manhattan world: Compass
3349
+ | direction from a single image by bayesian inference. In ICCV, pages
3350
+ | 941–947, 1999.
3351
+ blank |
3352
+ ref | [CZ11] Will Chang and Matthias Zwicker. Global registration of dynamic range
3353
+ | scans for articulated model reconstruction. ACM TOG, 30(3):26:1–
3354
+ | 26:15, 2011.
3355
+ blank |
3356
+ ref | [Dey07] T. K. Dey. Curve and Surface Reconstruction : Algorithms with Math-
3357
+ | ematical Analysis. Cambridge University Press, 2007.
3358
+ blank |
3359
+ ref | [DHR+ 11] Hao Du, Peter Henry, Xiaofeng Ren, Marvin Cheng, Dan B. Goldman,
3360
+ | Steven M. Seitz, and Dieter Fox. Interactive 3d modeling of indoor
3361
+ | environments with a consumer depth camera. In Proc. Ubiquitous com-
3362
+ | puting, pages 75–84, 2011.
3363
+ blank |
3364
+ ref | [EEH+ 11] Nikolas Engelhard, Felix Endres, Jürgen Hess, Jürgen Sturm, and Wol-
3365
+ | fram Burgard. Real-time 3D visual SLAM with a hand-held RGB-D
3366
+ blank |
3367
+ meta | 91
3368
+ | BIBLIOGRAPHY 92
3369
+ blank |
3370
+ |
3371
+ |
3372
+ ref | camera. In Proc. of the RGB-D Workshop on 3D Perception in Robotics
3373
+ | at the European Robotics Forum, 2011.
3374
+ blank |
3375
+ ref | [FB81] Martin A. Fischler and Robert C. Bolles. Random sample consensus:
3376
+ | a paradigm for model fitting with applications to image analysis and
3377
+ | automated cartography. Commun. ACM, 24(6):381–395, June 1981.
3378
+ blank |
3379
+ ref | [FCSS09] Y. Furukawa, B. Curless, S.M. Seitz, and R. Szeliski. Reconstructing
3380
+ | building interiors from images. In ICCV, pages 80–87, 2009.
3381
+ blank |
3382
+ ref | [FSH11] Matthew Fisher, Manolis Savva, and Pat Hanrahan. Characterizing
3383
+ | structural relationships in scenes using graph kernels. ACM TOG,
3384
+ | 30(4):34:1–34:11, 2011.
3385
+ blank |
3386
+ ref | [GCCMC08] Andrew P. Gee, Denis Chekhlov, Andrew Calway, and Walterio Mayol-
3387
+ | Cuevas. Discovering higher level structure in visual slam. IEEE Trans-
3388
+ | actions on Robotics, 24(5):980–990, October 2008.
3389
+ blank |
3390
+ ref | [GEH10] Abhinav Gupta, Alexei A. Efros, and Martial Hebert. Blocks world re-
3391
+ | visited: Image understanding using qualitative geometry and mechan-
3392
+ | ics. In ECCV, pages 482–496, 2010.
3393
+ blank |
3394
+ ref | [HCI+ 11] S. Hinterstoisser, S. Holzer, C. Cagniart, S. Ilic, K. Konolige, N. Navab,
3395
+ | and V. Lepetit. Multimodal templates for real-time detection of texture-
3396
+ | less objects in heavily cluttered scenes. ICCV, 2011.
3397
+ blank |
3398
+ ref | [HKG11] Qixing Huang, Vladlen Koltun, and Leonidas Guibas. Joint-shape seg-
3399
+ | mentation with linear programming. ACM TOG (SIGGRAPH Asia),
3400
+ | 30(6):125:1–125:11, 2011.
3401
+ blank |
3402
+ ref | [HKH+ 12] Peter Henry, Michael Krainin, Evan Herbst, Xiaofeng Ren, and Dieter
3403
+ | Fox. RGBD mapping: Using kinect-style depth cameras for dense 3D
3404
+ | modeling of indoor environments. I. J. Robotic Res., 31(5):647–663,
3405
+ | 2012.
3406
+ meta | BIBLIOGRAPHY 93
3407
+ blank |
3408
+ |
3409
+ |
3410
+ ref | [IKH+ 11] Shahram Izadi, David Kim, Otmar Hilliges, David Molyneaux, Richard
3411
+ | Newcombe, Pushmeet Kohli, Jamie Shotton, Steve Hodges, Dustin Free-
3412
+ | man, Andrew Davison, and Andrew Fitzgibbon. Kinectfusion: real-time
3413
+ | 3D reconstruction and interaction using a moving depth camera. In
3414
+ | Proc. UIST, pages 559–568, 2011.
3415
+ blank |
3416
+ ref | [Joh97] Andrew Johnson. Spin-Images: A Representation for 3-D Surface
3417
+ | Matching. PhD thesis, Robotics Institute, CMU, 1997.
3418
+ blank |
3419
+ ref | [JTRS12] Arjun Jain, Thorsten Thormahlen, Tobias Ritschel, and Hans-Peter Sei-
3420
+ | del. Exploring shape variations by 3d-model decomposition and part-
3421
+ | based recombination. CGF (EUROGRAPHICS), 31(2):631–640, 2012.
3422
+ blank |
3423
+ ref | [KAJS11] H.S. Koppula, A. Anand, T. Joachims, and A. Saxena. Semantic la-
3424
+ | beling of 3D point clouds for indoor scenes. In NIPS, pages 244–252,
3425
+ | 2011.
3426
+ blank |
3427
+ ref | [KDS+ 12] Young Min Kim, Jennifer Dolson, Michael Sokolsky, Vladlen Koltun,
3428
+ | and Sebastian Thrun. Interactive acquisition of residential floor plans.
3429
+ | In ICRA, pages 3055–3062, 2012.
3430
+ blank |
3431
+ ref | [KMYG12] Young Min Kim, Niloy J. Mitra, Dong-Ming Yan, and Leonidas Guibas.
3432
+ | Acquiring 3d indoor environments with variability and repetition. ACM
3433
+ | TOG, 31(6), 2012.
3434
+ blank |
3435
+ ref | [LAGP09] Hao Li, Bart Adams, Leonidas J. Guibas, and Mark Pauly. Robust
3436
+ | single-view geometry and motion reconstruction. ACM TOG (SIG-
3437
+ | GRAPH), 28(5):175:1–175:10, 2009.
3438
+ blank |
3439
+ ref | [LGHK10] David Changsoo Lee, Abhinav Gupta, Martial Hebert, and Takeo
3440
+ | Kanade. Estimating spatial layout of rooms using volumetric reasoning
3441
+ | about objects and surfaces. In NIPS, pages 1288–1296, 2010.
3442
+ blank |
3443
+ ref | [LH05] Marius Leordeanu and Martial Hebert. A spectral technique for cor-
3444
+ | respondence problems using pairwise constraints. In ICCV, volume 2,
3445
+ | pages 1482–1489, 2005.
3446
+ meta | BIBLIOGRAPHY 94
3447
+ blank |
3448
+ |
3449
+ |
3450
+ ref | [MFO+ 07] Niloy J. Mitra, Simon Flory, Maks Ovsjanikov, Natasha Gelfand,
3451
+ | Leonidas Guibas, and Helmut Pottmann. Dynamic geometry registra-
3452
+ | tion. In Symp. on Geometry Proc., pages 173–182, 2007.
3453
+ blank |
3454
+ ref | [Mic10] Microsoft. Kinect for Xbox 360. http://www.xbox.com/en-US/kinect,
3455
+ | November 2010.
3456
+ blank |
3457
+ ref | [MM09] Pranav Mistry and Pattie Maes. Sixthsense: a wearable gestural in-
3458
+ | terface. In SIGGRAPH ASIA Art Gallery & Emerging Technologies,
3459
+ | page 85, 2009.
3460
+ blank |
3461
+ ref | [MPWC12] Niloy J. Mitra, Mark Pauly, Michael Wand, and Duygu Ceylan. Symme-
3462
+ | try in 3d geometry: Extraction and applications. In EUROGRAPHICS
3463
+ | State-of-the-art Report, 2012.
3464
+ blank |
3465
+ ref | [MYY+ 10] N. Mitra, Y.-L. Yang, D.-M. Yan, W. Li, and M. Agrawala. Illus-
3466
+ | trating how mechanical assemblies work. ACM TOG (SIGGRAPH),
3467
+ | 29(4):58:1–58:12, 2010.
3468
+ blank |
3469
+ ref | [MZL+ 09] Ravish Mehra, Qingnan Zhou, Jeremy Long, Alla Sheffer, Amy Gooch,
3470
+ | and Niloy J. Mitra. Abstraction of man-made shapes. ACM TOG
3471
+ | (SIGGRAPH Asia), 28(5):#137, 1–10, 2009.
3472
+ blank |
3473
+ ref | [ND10] Richard A. Newcombe and Andrew J. Davison. Live dense reconstruc-
3474
+ | tion with a single moving camera. In CVPR, 2010.
3475
+ blank |
3476
+ ref | [NXS12] Liangliang Nan, Ke Xie, and Andrei Sharf. A search-classify approach
3477
+ | for cluttered indoor scene understanding. ACM TOG (SIGGRAPH
3478
+ | Asia), 31(6), 2012.
3479
+ blank |
3480
+ ref | [OFCD02] Robert Osada, Thomas Funkhouser, Bernard Chazelle, and David
3481
+ | Dobkin. Shape distributions. ACM Transactions on Graphics,
3482
+ | 21(4):807–832, October 2002.
3483
+ meta | BIBLIOGRAPHY 95
3484
+ blank |
3485
+ |
3486
+ |
3487
+ ref | [OLGM11] Maks Ovsjanikov, Wilmot Li, Leonidas Guibas, and Niloy J. Mitra.
3488
+ | Exploration of continuous variability in collections of 3D shapes. ACM
3489
+ | TOG (SIGGRAPH), 30(4):33:1–33:10, 2011.
3490
+ blank |
3491
+ ref | [PMG+ 05] Mark Pauly, Niloy J. Mitra, Joachim Giesen, Markus Gross, and
3492
+ | Leonidas J. Guibas. Example-based 3D scan completion. In Symp.
3493
+ | on Geometry Proc., pages 23–32, 2005.
3494
+ blank |
3495
+ ref | [PMW+ 08] M. Pauly, N. J. Mitra, J. Wallner, H. Pottmann, and L. Guibas. Discov-
3496
+ | ering structural regularity in 3D geometry. ACM TOG (SIGGRAPH),
3497
+ | 27(3):43:1–43:11, 2008.
3498
+ blank |
3499
+ ref | [RBF12] Xiaofeng Ren, Liefeng Bo, and D. Fox. RGB-D scene labeling: Features
3500
+ | and algorithms. In CVPR, pages 2759 – 2766, 2012.
3501
+ blank |
3502
+ ref | [RHHL02] Szymon Rusinkiewicz, Olaf Hall-Holt, and Marc Levoy. Real-time 3D
3503
+ | model acquisition. ACM TOG (SIGGRAPH), 21(3):438–446, 2002.
3504
+ blank |
3505
+ ref | [RL01] Szymon Rusinkiewicz and Marc Levoy. Efficient variants of the ICP
3506
+ | algorithm. In Proc. 3DIM, 2001.
3507
+ blank |
3508
+ ref | [RTG98] Yossi Rubner, Carlo Tomasi, and Leonidas J. Guibas. A metric for
3509
+ | distributions with applications to image databases. In ICCV, pages
3510
+ | 59–, 1998.
3511
+ blank |
3512
+ ref | [SFC+ 11] Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark
3513
+ | Finocchio, Richard Moore, Alex Kipman, and Andrew Blake. Real-
3514
+ | time human pose recognition in parts from a single depth image. In
3515
+ | CVPR, pages 1297–1304, 2011.
3516
+ blank |
3517
+ ref | [SvKK+ 11] Oana Sidi, Oliver van Kaick, Yanir Kleiman, Hao Zhang, and Daniel
3518
+ | Cohen-Or. Unsupervised co-segmentation of a set of shapes via
3519
+ | descriptor-space spectral clustering. ACM TOG (SIGGRAPH Asia),
3520
+ | 30(6):126:1–126:10, 2011.
3521
+ meta | BIBLIOGRAPHY 96
3522
+ blank |
3523
+ |
3524
+ |
3525
+ ref | [SWK07] Ruwen Schnabel, Roland Wahl, and Reinhard Klein. Efficient RANSAC
3526
+ | for point-cloud shape detection. CGF (EUROGRAPHICS), 26(2):214–
3527
+ | 226, 2007.
3528
+ blank |
3529
+ ref | [SWWK08] Ruwen Schnabel, Raoul Wessel, Roland Wahl, and Reinhard Klein.
3530
+ | Shape recognition in 3D point-clouds. In Proc. WSCG, pages 65–72,
3531
+ | 2008.
3532
+ blank |
3533
+ ref | [SXZ+ 12] Tianjia Shao, Weiwei Xu, Kun Zhou, Jingdong Wang, Dongping Li, and
3534
+ | Baining Guo. An interactive approach to semantic modeling of indoor
3535
+ | scenes with an RGBD camera. ACM TOG (SIGGRAPH Asia), 31(6),
3536
+ | 2012.
3537
+ blank |
3538
+ ref | [Thr02] S. Thrun. Robotic mapping: A survey. In G. Lakemeyer and B. Nebel,
3539
+ | editors, Exploring Artificial Intelligence in the New Millenium. Morgan
3540
+ | Kaufmann, 2002.
3541
+ blank |
3542
+ ref | [TMHF00] Bill Triggs, Philip F. McLauchlan, Richard I. Hartley, and Andrew W.
3543
+ | Fitzgibbon. Bundle adjustment - a modern synthesis. In Proceedings of
3544
+ | the International Workshop on Vision Algorithms: Theory and Practice,
3545
+ | ICCV ’99. Springer-Verlag, 2000.
3546
+ blank |
3547
+ ref | [TSS10] R. Triebel, J. Shin, and R. Siegwart. Segmentation and unsupervised
3548
+ | part-based discovery of repetitive objects. In Proceedings of Robotics:
3549
+ | Science and Systems, 2010.
3550
+ blank |
3551
+ ref | [TW05] Sebastian Thrun and Ben Wegbreit. Shape from symmetry. In ICCV,
3552
+ | pages 1824–1831, 2005.
3553
+ blank |
3554
+ ref | [VAB10] Carlos A. Vanegas, Daniel G. Aliaga, and Bedrich Benes. Building
3555
+ | reconstruction using manhattan-world grammars. In CVPR, pages 358–
3556
+ | 365, 2010.
3557
+ blank |
3558
+ ref | [Vil03] C. Villani. Topics in Optimal Transportation. Graduate Studies in
3559
+ | Mathematics. American Mathematical Society, 2003.
3560
+ meta | BIBLIOGRAPHY 97
3561
+ blank |
3562
+ |
3563
+ |
3564
+ ref | [XLZ+ 10] Kai Xu, Honghua Li, Hao Zhang, Daniel Cohen-Or, Yueshan Xiong,
3565
+ | and Zhiquan Cheng. Style-content separation by anisotropic part scales.
3566
+ | ACM TOG (SIGGRAPH Asia), 29(5):184:1–184:10, 2010.
3567
+ blank |
3568
+ ref | [XS12] Yu Xiang and Silvio Savarese. Estimating the aspect layout of object
3569
+ | categories. In CVPR, pages 3410–3417, 2012.
3570
+ blank |
3571
+ ref | [XZZ+ 11] Kai Xu, Hanlin Zheng, Hao Zhang, Daniel Cohen-Or, Ligang Liu, and
3572
+ | Yueshan Xiong. Photo-inspired model-driven 3D object modeling. ACM
3573
+ | TOG (SIGGRAPH), 30(4):80:1–80:10, 2011.
3574
+ blank |
3575
+ ref | [ZCC+ 12] Youyi Zheng, Xiang Chen, Ming-Ming Cheng, Kun Zhou, Shi-Min Hu,
3576
+ | and Niloy J. Mitra. Interactive images: Cuboid proxies for smart image
3577
+ | manipulation. ACM TOG (SIGGRAPH), 31(4):99:1–99:11, 2012.
3578
+ blank |